Sunday, August 28, 2011

Wars, Recessions, and the size of the ngrams corpus

Hank wants me to post more, so here's a little problem I'm working on. I think it's a good example of how quantitative analysis can remind us of old problems with library collections, and possibly reveal new ones.

My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it's not--the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns--although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics in Miller-McCune. (By the way, the Miller-McCune article on science PhDs is my favorite magazine article of the last couple of years). A corollary benefit, though, is that they help us to start understanding better just what is included in our libraries, both digital and brick.

Background: right now, I need a list of the most common English words. (Basically to build a much larger version of the database I've been working with; making it is teaching me quite a bit of computer science but little history right now.) I mean 'most common' expansively: earlier I found that about 200,000 words gets pretty much every word worth analyzing. There were some problems with the list I ended up producing. The obvious one, the one I'm trying to fix, is that words from the early 19th century, when many fewer books were published, are artificially depressed compared to newer ones.
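Concretely, the fix I have in mind looks something like the sketch below: instead of summing raw counts (which lets the flood of late-twentieth-century books swamp the 1820s), each word gets credited with its share of all the words printed in its year, and the top of that list becomes the word list. This is just a sketch -- the file names and the column layout are stand-ins for the 1-gram downloads, not the actual script.

```python
import csv
from collections import defaultdict

def load_yearly_totals(path):
    """Total words printed in each year, used as the denominator."""
    totals = {}
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            # assumed layout: year, total match count, ...
            totals[int(row[0])] = int(row[1])
    return totals

def relative_frequencies(onegram_path, totals, start=1800, end=2000):
    """Credit each word with its share of its year, not its raw count."""
    freqs = defaultdict(float)
    with open(onegram_path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            # assumed layout: word, year, match count, ...
            word, year, count = row[0], int(row[1]), int(row[2])
            if start <= year <= end and totals.get(year):
                freqs[word] += count / totals[year]
    return freqs

totals = load_yearly_totals('total_counts.tsv')          # hypothetical filename
freqs = relative_frequencies('eng-1grams.tsv', totals)   # hypothetical filename
top_words = sorted(freqs, key=freqs.get, reverse=True)[:200000]
```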

But it turns out that a secular increase in words published per year isn't the only effect worth fretting about. The number of words in the Google Books corpus doesn't just increase steadily over time. Looking at the data series on overall growth, one period immediately jumped out at me:
[Chart: overall growth of the Google Books corpus, words per year]

Thursday, August 4, 2011

Graphing and smoothing

I mentioned earlier that I've been rebuilding my database; I've also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about the different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which include a lot of pattern searching and public outreach, we have some interesting choices about presentation.

This post is mostly playing with graph formats, as a way to think through a couple of issues on my mind and put them to rest. I suspect this will be an uninteresting post for many people, but it's probably going to live on the front page for a little while given my schedule over the next few weeks. Sorry, visitors!
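To make the comparison concrete, here's a minimal sketch of the baseline everything else gets judged against: one word's raw yearly series plotted over a centered moving average. The series itself is a made-up placeholder, not real ngrams data -- swap in actual yearly relative frequencies to try it on a real word.

```python
import math
import random
import matplotlib.pyplot as plt

def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Placeholder standing in for one word's yearly relative frequency.
years = list(range(1800, 2001))
random.seed(0)
freqs = [0.001 + 0.0005 * math.sin(y / 15) + random.gauss(0, 0.0002) for y in years]

plt.plot(years, freqs, color='lightgray', label='yearly values')
plt.plot(years, moving_average(freqs), color='black', label='5-year moving average')
plt.xlabel('year')
plt.ylabel('relative frequency')
plt.legend()
plt.show()
```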