My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it's not--the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns--although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics in Miller-McCune. (By the way, the Miller-McCune article on science PhDs is my favorite magazine article of the last couple of years.) A corollary benefit, though, is that such studies help us to start understanding better just what is included in our libraries, both digital and brick.
Background: right now, I need a list of the most common English words. (Basically to build a much larger version of the database I've been working with; making it is teaching me quite a bit of computer science but little history so far.) I mean 'most common' expansively: earlier I found that about 200,000 words get pretty much every word worth analyzing. There were some problems with the list I ended up producing. The obvious one, the one I'm trying to fix, is that words from the early 19th century, when many fewer books were published, will have artificially depressed counts compared to newer ones.
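To make the problem concrete, here is a minimal sketch of one possible correction, not necessarily the one I'll end up using: rank each word by the sum of its yearly shares (its count divided by all words printed that year) rather than by its raw total. The function names, the `(word, year, count)` row layout, and the toy numbers are all made up for illustration.

```python
from collections import defaultdict


def rank_by_raw_count(rows, top_n):
    """Rank words by total occurrences; years with more books dominate."""
    totals = defaultdict(int)
    for word, _year, count in rows:
        totals[word] += count
    # Sort by descending count, breaking ties alphabetically.
    return sorted(totals, key=lambda w: (-totals[w], w))[:top_n]


def rank_by_yearly_share(rows, year_totals, top_n):
    """Rank words by the sum of their per-year shares of all words printed.

    Each year contributes equally, however many books it produced.
    """
    shares = defaultdict(float)
    for word, year, count in rows:
        total = year_totals.get(year, 0)
        if total > 0:  # skip years with no recorded total
            shares[word] += count / total
    return sorted(shares, key=lambda w: (-shares[w], w))[:top_n]


# Toy data (invented): a small 1820 and a much larger 1990.
year_totals = {1820: 1_000, 1990: 100_000}
rows = [
    ("the", 1820, 60), ("the", 1990, 6_000),
    ("thee", 1820, 50), ("thee", 1990, 10),
    ("computer", 1990, 3_000),
]

raw_ranking = rank_by_raw_count(rows, 3)
share_ranking = rank_by_yearly_share(rows, year_totals, 3)
print(raw_ranking)
print(share_ranking)
```

In the toy data, "thee" is common in 1820 but sinks below "computer" on raw counts simply because 1990 has a hundred times as many words; weighting each year equally moves it back up.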
But it turns out that a secular increase in words published per year isn't the only effect worth fretting about. The number of words in the Google Books corpus doesn't just increase steadily over time. Looking at the data series on overall growth, one period immediately jumped out at me: