Data Science in Linguistics: Extracting Phonotactics from Word Lists


Consider the words ‘brick’, ‘blick’, and ‘bnick’. Of the three, only ‘brick’ is a word of English. Yet most English speakers have the sense that ‘blick’ could be a word of English, whereas ‘bnick’ could not. One theory for why people have this sense is that ‘blick’ obeys a set of rules that characterize the English lexicon and ‘bnick’ does not. (An example of such a rule might be: ‘a word cannot begin with a stop followed by a nasal’.) The study of such rules, and of the sound patterns of the lexicon in general, is known as phonotactics, a subfield of phonology.
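As a rough illustration of how a categorical rule like the one above could be checked, here is a minimal Python sketch. It assumes, for simplicity, that each letter stands for one sound; the sets STOPS and NASALS and the function name are hypothetical and chosen only for this example.

```python
# Toy check of the example rule: a word cannot begin with a stop followed by a nasal.
# Assumption: each letter stands for one sound, which is only roughly true of English spelling.
STOPS = set("pbtdkg")
NASALS = set("mn")


def violates_stop_nasal_onset(word: str) -> bool:
    """Return True if the word begins with a stop immediately followed by a nasal."""
    w = word.lower()
    return len(w) >= 2 and w[0] in STOPS and w[1] in NASALS


for word in ["brick", "blick", "bnick"]:
    print(word, violates_stop_nasal_onset(word))
```

Because the check works on spelling, it would wrongly flag words such as ‘knee’ or ‘gnome’, whose first letter is silent. A serious version would operate on a phonemic transcription rather than on letters.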

For many years the dominant idea in linguistics was that each language has criteria that determine whether a word is phonotactically valid, and that a word can be in the lexicon only if it is valid. More recently, this view has given way to the idea that phonotactics assigns each word a well-formedness score: some words are extremely good, some are extremely bad, and most fall somewhere in between.
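One simple way to picture a gradient score is a smoothed bigram model estimated from a word list. The Python sketch below is an illustration only, not necessarily a method covered in the talk. It assumes a tiny hypothetical lexicon, treats letters as stand-ins for sounds, and uses add-one smoothing so that unseen sequences get a low but nonzero probability.

```python
import math
from collections import Counter

# Hypothetical toy lexicon; a real study would use a large, phonemically transcribed word list.
LEXICON = ["brick", "black", "blue", "click", "lick",
           "slick", "snack", "nick", "brim", "blink"]
BOUNDARY = "#"  # marks the start and end of a word


def bigrams(word):
    """Return the adjacent symbol pairs of a word, including word boundaries."""
    padded = BOUNDARY + word + BOUNDARY
    return list(zip(padded, padded[1:]))


# Count how often each pair occurs, and how often each symbol occurs as the first of a pair.
pair_counts = Counter(bg for w in LEXICON for bg in bigrams(w))
context_counts = Counter(first for w in LEXICON for first, _ in bigrams(w))
symbols = {ch for w in LEXICON for ch in w} | {BOUNDARY}


def score(word):
    """Average log-probability per transition under an add-one smoothed bigram model."""
    pairs = bigrams(word)
    logp = 0.0
    for first, second in pairs:
        p = (pair_counts[(first, second)] + 1) / (context_counts[first] + len(symbols))
        logp += math.log(p)
    return logp / len(pairs)


for word in ["brick", "blick", "bnick"]:
    print(f"{word}: {score(word):.2f}")
```

Higher (less negative) scores mean a word looks more like the words in the list. On this toy lexicon ‘blick’ scores higher than ‘bnick’, because ‘bl’ occurs in the list and ‘bn’ does not, but neither word is ruled out entirely.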

I will give a survey of how data science techniques are allowing researchers to extract information about the gradient phonotactics of the world’s languages.

UBC Robson Square

Speaker Bio

Dr. Paul Tupper is Professor of Mathematics at SFU. His major interest is applying mathematics of various sorts to problems in linguistics and psychology.