So as interesting as Google Ngrams is for all sorts of purposes, it seems it might always end right in 2008. (I could have sworn the 2012 update included through 2011 in some collections; but all seem to end in 2008 now.)
Lo: the Ngram chart of three major publishers, showing the percentage of times each is mentioned compared to all other words in the corpus:
This, if anything, massively understates the importance of BiblioBazaar to Google Books in 2008: using the 2009 data, here's the percentage of all books (rather than words) mentioning each of those three presses:
This doesn't mean that 35% of all the books in 2000 are Macmillan, because other books will cite Macmillan in footnotes. I bet almost every university press book in the humanities and social sciences cites Harvard University Press in 1999. But given that BiblioBazaar barely existed before 2008, hardly any non-BiblioBazaar books would mention the company in 2008. So apparently, BiblioBazaar is almost 45% of the 2008 sample in Google Ngrams. That's incredible.
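The arithmetic behind that estimate is just a per-year ratio: volumes mentioning a term divided by all volumes in the sample for that year. Here is a minimal sketch of it in Python. The function name `share_of_volumes` and the counts are made up for illustration (chosen only to reproduce a 45% share), and it assumes you have already extracted per-year volume counts from the Ngrams downloads rather than showing how to parse them.

```python
def share_of_volumes(term_volumes, total_volumes):
    """Return {year: fraction of that year's volumes that mention the term}.

    Both arguments are dicts mapping year -> number of volumes.
    Years with a missing or zero total are skipped.
    """
    shares = {}
    for year, mentioning in term_volumes.items():
        total = total_volumes.get(year)
        if not total:
            continue
        shares[year] = mentioning / total
    return shares


# Hypothetical counts, NOT real Ngrams figures.
term_volumes = {2007: 1_000, 2008: 45_000}
total_volumes = {2007: 100_000, 2008: 100_000}

shares = share_of_volumes(term_volumes, total_volumes)
for year in sorted(shares):
    print(f"{year}: {shares[year]:.1%}")
```

The share only works as a proxy for "books published by the press" when almost nobody else mentions the name, which is the case for BiblioBazaar in 2008 but not for Macmillan or Harvard.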
How did "BiblioBazaar" supplant the largest presses in just one year?
This has messed up other sources: Publishers Weekly reported in 2010 on how BiblioBazaar leapt to the top of Bowker's charts. In that piece, the company's president is asked whether they really published 270,000 books in 2009:
"If by 'produce' you mean create a cover file that will print at multiple POD vendors, a book block that will print at multiple POD vendors, and metadata to sell that book in global sales channels, then yes, we did produce that many titles," said Mitchell Davis, president of BiblioLife, parent company of BiblioBazaar.
And what sort of books are they?
All of the company's content is in the public domain, and are basically "historical reprints," Davis told PW, with foreign language books, and their "added layers of complexity" the fastest growing category of books. "Dealing with out-of-copyright materials lets us leverage our knowledge and relationships in the global bookselling industry more easily as we build out what is shaping up to be a pretty killer platform," he noted.
In other words, the entire 19th century sprang back into print. You can see this in the Ngrams charts pretty clearly: "thou," "thee," and "thy," for example, shoot up fivefold in 2007-2008 after centuries of decline. I've noticed this before, but now I'm inclined to think most of it can be attributed to this one company.
Or "railway." This isn't qualitatively different from other shifts in the Ngrams database: the gaps at 1922 as the library composition shifts, for example. But it is quantitatively of a different order. The complete immateriality of books post-2008 means that minor decisions about whether an e-book actually exists or not can cause shifts in 40% of the corpus. I spent some time looking for words that shift at the 1922 break, and though they do exist (it seems that the loss of Harvard drops most medical terms, for example), the shifts are a few percentage points: nothing that anyone should be taking seriously in that noisy a dataset anyway. But half the corpus: that's something else entirely.
Among other strange things, that means Ngrams is almost certainly at its most useful right now; with each year, it gets further out of date, and it will be very difficult to update without making a lot of extremely hard choices.
Quick postlude: I should hasten to say that everyone involved with Ngrams is aware of this whole class of problems, and that the quick credibility check for anyone citing Ngrams is that they don't use the post-2000 books as any sort of evidence. There's a reason the default settings stop in 2000, and that the Michel et al. paper urges you not to use the newer data. But I hadn't realized before how different 2008 was, in particular.