Now to the final term in my sentence from earlier: “How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about “how much more often.” I thought this was more complicated than it is for a while, so this post will be short and not very important.
Basically, I’m just using the percentage by which a word appears more often than expected as the measuring stick. I fiddled around with standard deviations for a while, but I don’t have a good way to impute expected variations, and percentages seem to work well enough. I do want to talk for a minute about an aspect that I’ve glossed over so far: how do we measure the occurrences of a word relative to itself?
I can think of four options:
1. Assume that each word appears with itself exactly as many times as we’d expect.
2. Assume that each word has the average ‘lumpiness’ of the words in our sample: a word that appears with itself a lot will get a positive score for appearances with itself, and one that is more evenly distributed will get a negative score. This is odd because it will push words that are relatively spread out further away from words they frequently appear with, and pull words that are highly concentrated closer to the ones they appear with. I can’t think of a good reason to do this.
3. Assume each word is distributed at random throughout the text, and treat every occurrence above that baseline as unexpected. A word that appears 10,000 times in 25,000 books will appear with itself maybe 1,000 times at random; if it appears with itself 5,000 times, we could treat that as 4,000 extra appearances.
4. Assume each word is thinly spread through all the books, and treat every occurrence above that as unexpected. I can’t think of any reason to do this, either.
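To make the difference between options 1 and 3 concrete, here is a rough Python sketch using the figures from option 3's example. It assumes that "percentage more often" means (observed - expected) / expected, which is my reading of the measure rather than a confirmed formula, and the function names `percent_above_expected` and `self_score` are made up for illustration.

```python
def percent_above_expected(observed, expected):
    """Percentage by which an observed count exceeds the expected count.

    Assumed formula: 100 * (observed - expected) / expected.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 100.0 * (observed - expected) / expected


def self_score(observed_self, random_baseline, option):
    """Score a word's appearances with itself (hypothetical helper)."""
    if option == 1:
        # Option 1: assume the word appears with itself exactly as
        # often as expected, so there is no excess to report.
        return 0.0
    if option == 3:
        # Option 3: everything above the random baseline is unexpected.
        return percent_above_expected(observed_self, random_baseline)
    raise ValueError("only options 1 and 3 are sketched here")


# Figures from the example in option 3.
observed = 5000   # times the word actually appears with itself
baseline = 1000   # rough count we'd expect from random placement
extra = observed - baseline

option1_score = self_score(observed, baseline, option=1)
option3_score = self_score(observed, baseline, option=3)
print(extra, option1_score, option3_score)
```

Under these assumptions, option 3 gives the word a very large self-score, while option 1 pins it at zero no matter how lumpy the word is.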
There’s a lot to be said for the third approach, but I’m using the first one right now. The problem with the third is that words appear with themselves both a) far more frequently than with any other word, and b) at quite varying rates that are related to overall frequency. Since I’m assuming we already know about a word’s lumpiness for the first part of these calculations, I’m keeping that assumption for the second part.