Sunday, December 19, 2010

Not included in ngrams: Tom Sawyer

I wrote yesterday about how well the filters that remove some books from ngrams work: compared to Google Books, they improve the quality of both the year information and the OCR.

But we have no idea what books are in there. There's no connection to the texts from the data.

I'm particularly interested in how they deal with subsequent editions of books. Their methodology (pdf) talks about multiple editions of Tom Sawyer. I think it says that they eliminate multiple copies of the same edition but keep editions from different years.

I thought I'd check this. There are about 5 occasions in Tom Sawyer where the phrase "Huck said" appears with separating quotes, and 11 for "said Huck." Both phrases appear in the 19th century basically only in Tom Sawyer (the latter also has a tiny life in legal contracts involving huckaback, and a few other places), so we can use them as a fair proxy for different editions. The first edition of Tom Sawyer was 1876; there are loads of later ones, obviously. Here's what you get from ngrams:

Three big spikes around 1900, and nothing before. Until about 1940, the ratio is somewhat consistent with the internal usage in the book, 11 to 5, although "said Huck" is a little overrepresented, as we might expect. Note:
  • No edition of Tom Sawyer shows up until 20 years after its first publication;
  • There's probably one edition apiece in 1899 and 1901, and two or three in 1903. Those are all around the authorized edition of Twain's works. Either they're catching multiple copies of that edition, or Tom Sawyer was just coming into the public domain (which, for those of you who don't know, is something that used to happen, like LPs or smallpox), which led them to rush out the collected edition. Mark Twain and copyright is such a popular issue that I can't find the answer right away.

I talked at the end of this post about how hard it is to tell that "Collected Works of Mark Twain, vol. 1" and "Innocents Abroad" are the same book. I find it a little reassuring that even Google seems to have the same problem. I've had some success using clustering based on patterns of word use.

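The edition estimates in the list above rest on simple arithmetic: if one copy of the book contributes about 11 instances of "said Huck", then a year's total match count divided by 11 gives a rough number of copies scanned for that year. Here is a minimal sketch of that reasoning in Python. It assumes you have a plain-text copy of the novel and per-year match counts pulled from the downloadable ngram files; the function names and the yearly numbers are hypothetical, made up for illustration, and are not real ngram data.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive occurrences of a phrase whose words are separated by whitespace."""
    words = phrase.lower().split()
    # \b keeps "said Huck" from matching inside "said huckaback".
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return len(re.findall(pattern, text.lower()))

def estimate_copies(yearly_matches, per_copy):
    """Rough copies per year: that year's matches divided by the matches found in one copy."""
    return {year: matches / per_copy for year, matches in yearly_matches.items()}

# Stand-in for the full text of the novel (hypothetical snippet).
sample = '"No," said Huck. Huck said nothing. "Well," said Huck, "all right."'
print(count_phrase(sample, "said Huck"), count_phrase(sample, "Huck said"))

# Hypothetical yearly match counts for "said Huck"; 11 is the per-copy count from the post.
print(estimate_copies({1899: 11, 1901: 11, 1903: 33}, per_copy=11))
```

The estimate is only as good as the assumption that the phrase appears almost nowhere else, which is why the huckaback contracts matter.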
So what's the point? I know I said we shouldn't criticize based on metadata; but I'm equally irate at the idea that ngrams truly takes the temperature of American culture. Maybe not including Tom Sawyer as part of "English", or "English One Million", or "English Novels" is a good example of the shortcomings of this approach. 

And then: Tom Sawyer _does_ show up in their "American English" sample before 1899. 

"American English" is supposedly a subset of the "English" sample, but clearly that's not the case. Something's wrong here with the data they're presenting. It doesn't match their own description of it. That's always a bad thing.
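One way to make the inconsistency concrete: if "American English" really were a subset of "English", then in any given year the American count for a phrase could never exceed the English count. A minimal sketch of that check follows, assuming the per-year match counts for one phrase have already been extracted from the two corpora into dictionaries. The function name and the numbers are invented for illustration and are not real ngram data.

```python
def subset_violations(subset_counts, superset_counts):
    """Return the years where the supposed subset has more matches than the superset."""
    return sorted(
        year
        for year, n in subset_counts.items()
        if n > superset_counts.get(year, 0)  # a year missing from the superset counts as zero
    )

# Hypothetical counts for "said Huck" in each corpus.
american = {1885: 11, 1899: 11, 1903: 22}
english = {1899: 11, 1903: 33}

print(subset_violations(american, english))
```

Any year in the result is a year where the published description of the corpora cannot be literally true for that phrase.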

Any ideas what it is?

For the record, this works for other books with distinctive character names. Pilgrim's Progress, for example, is a little noisier:


  1. This is unrelated, but I wanted to post somewhere the ngram for 02138, which, before the invention of zip codes, shows what percentage of books have Harvard library bookplates in them. It falls off completely right in 1922.
