I've been thinking about a recent article by Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words." The paper uses Ngrams data to establish the dynamics for the entry of new words into natural languages. Mark Liberman argues that the bulk of change in the Ngrams corpus involves things like proper names and alphanumeric strings, rather than actual vocabulary change, which keeps the paper from being more than 'thought-provoking.' Liberman's fundamental objection is that although the authors say they are talking about 'words,' it would be better for them to describe their findings in terms of 'tokens.' Words seem good and basic, but dissolve on close inspection into a forest of inflected forms, capitals, and OCR mis-readings. So it's hard to know whether the conclusions really apply to 'words' even if they do to 'tokens.'
There's a serious problem, though, with religiously talking only about "tokens." Or better, an unserious problem: talking about 'tokens' isn't as fun as talking about words. Research like this (a similar paper, though probably less interesting to most, is here) is predicated on human language responding to quantitative laws. And so it needs a fundamental element. For demography (a major touchstone in this paper), that's people. For physics (the source of the 'cool as they expand' metaphor), it's atoms. For texts, it's better if the unit is something in our actual experience like words rather than an abstruse concept like tokens. Liberman's point is interesting in part because it deflates the strikingness of the claim. The real finding, he's saying, would be about words, not about tokens: findings about alphanumeric strings are not intrinsically interesting.
It occurs to me there's a sensible argument for going yet further. Looking for language to obey physical laws might undermine even the basic concept of an atomic unit of language. Whether words or tokens, though, the fundamental assumption is that whitespace does delineate some fundamental element of language. In the most trivial sense, that's obviously not true. Tokenizing is hard and ultimately arbitrary. For example: in the Google Ngrams set (and Bookworm, which uses the same rules), the text 'C#' is one 'word' and 'N#' is two; 'database' is one, 'data base' is two, and 'data-base' is three; and "couldn't" is either two words or three depending on whether you use the 2009 or 2012 corpus.
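To make the arbitrariness concrete, here is a minimal Python sketch comparing two made-up tokenizers. Neither one is the rule set that Google Ngrams or Bookworm actually uses; the function names and the regular expression are illustrative assumptions, chosen only to show how the same string yields different 'word' counts under different rules.

```python
import re


def whitespace_tokens(text):
    """Hypothetical rule 1: split on whitespace only; punctuation stays attached."""
    return text.split()


def punctuation_tokens(text):
    """Hypothetical rule 2: each run of word characters and each
    punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


for sample in ["database", "data base", "data-base", "couldn't", "C#"]:
    print(sample,
          len(whitespace_tokens(sample)),
          len(punctuation_tokens(sample)))
```

Under the first rule "data-base" is one token; under the second it is three. Any real tokenizer sits somewhere among choices like these, which is the sense in which the unit is arbitrary.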
In the systems-dynamics view, "evolution" is straightforwardly a word that enters the language at some point, and "natural selection" is not. The thing that shows a culture's core vocabulary is its use of those individual words. Things like "natural selection" are interesting, but not necessary for modeling. And since it's much simpler--more 'elegant'--to work with words alone, it's a reasonable simplification. (And a hugely convenient one: the 1grams are trivial to do most sorts of research with in a software package of your choosing, but the 2- and 3-grams start to require a lot more attention to memory management and data indexing and all sorts of other inelegancies.) And while "natural selection" might be a concept, other ngrams--"core vocabulary is its use"--are obviously not. So let's abstract them away.
But maybe we can disagree about this: words aren't as fundamental as atoms. As Liberman notes, an enormous number of the 1grams are proper nouns. That suggests we need greater length to delineate concepts. So while "Adams" might represent a single person in a small, constrained community, over time we need longer and longer strings to express a simple idea--"John Adams" (2) in 1776, "John Quincy Adams" (3) in 1820, "John Adams the composer" (4) in the 1970s. I have more than once exceeded the 5-gram limit in trying to distinguish "the composer John Adams who wrote Nixon in China" (9) from "John Adams the composer who lives in Alaska" (8). So where should we draw the line? This might seem like a rather pointless and irresolvable debate, best solved by cutting the Gordian knot and taking 'words' as the most obvious choice for basic unit. Linguists might care about the difference between 'tokens' and 'words' and 'ngrams' and 'concepts,' but system dynamicists don't need to care.
But there's an interesting contradiction buried inside the dynamics case. Both the papers I link to above frame much of their work around Zipf's law. Petersen et al say that while Zipf's and Heaps' laws have been "exhaustively tested on relatively small snapshots of empirical data, here we test the validity of these laws using extremely large corpora." Much of their analysis is predicated on kinks in the data: it turns out that Zipf's law does not hold in certain cases (above words of a certain length), and so they offer up psychological and cultural explanations of why that might be. (Zipf's law states that a word's frequency is inversely proportional to its rank in the list of most common words.)
Others have noted this discovery before. Petersen et al eventually cite Cancho and Taylor, which I'm not reading right now because it's behind a firewall and I'm on a train; even the Wikipedia page on Zipf's law shows the trail-off after about 100K words in its Wikipedia language-use chart, although without mentioning it. It takes only 1990s-sized corpora to detect this, not something the size of Ngrams.
What they don't cite is my favorite finding about Zipf's law. Le Quan Ha et al found 10 years ago that the trail-off effect is only true for 'words'; if you use indefinite-length ngrams, the pattern holds much farther out. In a succinct and persuasive paper (pdf link), they argue that while unigrams, bigrams, and trigrams each almost follow Zipf's law on their own, it's only by combining them all together that the Zipf distribution is fully revealed. So while Cancho and Taylor argue for 'two different regimes' of scaling on unigram counts, Ha et al argue for an expansive definition of tokens but just one regime. The key images from their paper show the fall-off for unigrams in English and Mandarin, but a continuous decline when using all n-grams combined. (I don't know anything about Mandarin, but it seems like a pretty clear case where "orthographical unit" and "token" and "word" will fall apart.)
[Figure caption: They showed that Zipf's law falls away after 10^5 words for English and 10^3 for Mandarin.]
...but the following n-grams:
I know: 2
I know what: 1
I know what I: 1
I know what I know: 1
know what: 1
know what I: 1
...and so on.
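The counting behind a list like this can be sketched in a few lines of Python. This is only an illustration, assuming plain whitespace tokenization and a hypothetical helper named all_ngrams; it is not the code Ha et al used.

```python
from collections import Counter


def all_ngrams(tokens, max_n=None):
    """Count every contiguous n-gram of length 1..max_n.

    all_ngrams is a hypothetical helper name. With max_n=None,
    n-grams of every possible length are counted.
    """
    if max_n is None:
        max_n = len(tokens)
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


counts = all_ngrams("I know what I know".split())
print(counts["I know"])              # 2
print(counts["I know what I know"])  # 1
print(len(counts))                   # 12 distinct n-grams from 5 tokens
```

Five tokens already produce twelve distinct n-grams, and a text of N tokens contains N(N+1)/2 n-gram occurrences when n is unbounded, which is why the practical versions of this stop at a small n.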
Just to be sure they weren't completely wrong, I checked their finding on the text corpus I had closest to hand, the first three seasons of dialogue on the TV show Downton Abbey. Zipf's law works far better on 1-, 2-, and 3-grams combined than it does individually: I think this is the relevant test. You can see that the points don't really form a line as you'd expect in the first chart, but come closer in the second.
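One rough way to put a number on "comes closer to a line" is to fit a straight line to log(frequency) against log(rank). The sketch below is an assumption about how such a check could be done, not the method behind the charts described here; zipf_fit is a hypothetical name, and an ordinary least-squares fit on log-log data is a crude diagnostic rather than a rigorous test of a power law.

```python
import math


def zipf_fit(frequencies):
    """Least-squares fit of log(frequency) on log(rank).

    Returns (slope, r_squared). A slope near -1 with r_squared near 1
    is what an idealized Zipf distribution would give.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if syy == 0:
        raise ValueError("all frequencies are identical; nothing to fit")
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Synthetic frequencies that follow Zipf's law exactly: f = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 201)]
slope, r_squared = zipf_fit(ideal)
print(round(slope, 6), round(r_squared, 6))
```

To compare the two cases, one would pass in the unigram counts alone and then the pooled counts of 1-, 2-, and 3-grams, and see which fit has the r_squared closer to 1.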
I love this finding, because it upends the idea that words are atoms that follow laws like Zipf's. The thing that's statistically regular isn't the naturalized concept of 'words' or even 'tokens': it's n-grams, which aren't something we have a strong intuitive sense of. (Nor are they discrete: any given word is a member of a theoretically huge number of different ngrams in a text when n is allowed to be large.) So the numbers are suggesting that a model of words as the fundamental units of language doesn't jibe with the idea of language following quantitative rules.
But it's Cancho and Taylor--who write about "two regime scaling" on the unigram counts and the reasons for that--and not Quan Ha et al who seem to be getting the cites in the subset of textual studies that wants to treat words as dynamic systems. (Here's another example from even more recently.) There are probably lots of reasons for this. But I do think an important one is that quantification is just less appealing when the units aren't discrete and understandable. I wrote last time about how humanists flee into topic models because those models let them escape the post-lapsarian confusions of words. There's a funny dynamic: one of the reasons scientists are drawn to describing words is that it imposes a sense of order on the frustrating chaos of language.
If you don't care whether language responds to laws, you won't care about this finding. But if you are beholden to a particle-dynamics, statistically-oriented view of language where things like Zipf's law ought to hold, then it suggests the right data to use is the entirety of the Ngrams corpus, not the 1-grams. Because that's where the real system dynamics seem to be happening. Working with a framework where Zipf's law holds for some time, but then stops holding, and the reasons are complicated--that's like epicycles on a planetary orbit. It's ugly.
But using all the n-grams comes at a cost. It suggests two completely different ways of doing research on a system including tens of thousands of tokens--using words as a test set, and using Ngrams of all lengths. (I haven't checked the results of the more substantive findings of Petersen et al using two- and three-grams, but given how much they rely on Zipf's law, I wouldn't be surprised if it showed a completely different result vis-a-vis languages cooling.) Probably most research will work on individual tokens for both practical reasons (you can load them into memory) and prejudicial ones (it just doesn't sound as good to say "in the beginning was a series of overlapping variable-length ngrams; and the overlapping variable-length ngrams were with God, and the overlapping variable-length ngrams were God").
The point is, neither of those reasons strikes me as particularly good. If I believed in looking at the internal mechanism of language/culture as a system following mechanical laws, I'd probably have to think about using the larger sets, for all their computational complexity. And even if I don't, it's necessary to keep some very heavy skepticism around the use of words as units in a lot of contexts where I like to use them.