Thursday, April 3, 2014

Biblio bizarre: who publishes in Google Books

Here's a little irony I've been meaning to post. Large-scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a "book" is in ways that make the selection criteria for Ngrams—if it made it into print, it must have some significance—completely meaningless.

So as interesting as Google Ngrams is for all sorts of purposes, it seems it might always end in 2008. (I could have sworn the 2012 update ran through 2011 in some collections; but all seem to end in 2008 now.)

Lo: the Ngram chart of three major publishers, showing how often each is mentioned as a percentage of all words in the corpus:


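For anyone who wants the numbers behind a chart like this: the Ngram Viewer serves its data from an undocumented JSON endpoint. Here's a minimal sketch of pulling the same three series; the endpoint, its parameters, and the corpus id are unofficial assumptions and could change at any time:

```python
# A sketch, not an official API: the Ngram Viewer's chart data comes from
# an undocumented JSON endpoint; corpus id 15 ("English 2012") is an
# assumption that may differ between releases.
import requests

resp = requests.get(
    "https://books.google.com/ngrams/json",
    params={
        "content": "BiblioBazaar,Macmillan,Harvard University Press",
        "year_start": 1990,
        "year_end": 2008,
        "corpus": 15,
        "smoothing": 0,  # raw yearly values, no moving average
    },
)
for series in resp.json():
    # each entry carries an "ngram" label and a "timeseries" of yearly shares
    print(series["ngram"], series["timeseries"][-1])  # the 2008 value
```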
This, if anything, massively understates the importance of BiblioBazaar to Google Books in 2008: using the 2009 data, here's the percentage of all books (rather than words) mentioning each of those three presses:


This doesn't mean that 35% of all the books in 2000 are Macmillan, because other books will cite Macmillan in footnotes. I bet almost every university press book in the humanities and social sciences cites Harvard University Press in 1999. But given that BiblioBazaar barely existed before 2008, hardly any non-BiblioBazaar books would mention the company in 2008: nearly all of those mentions must come from BiblioBazaar's own imprint pages. So apparently, BiblioBazaar is almost 45% of the 2008 sample in Google Ngrams. That's incredible.
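The raw Ngram exports make this book-level calculation checkable at home: each line of the 1-gram files carries a volume_count alongside its match_count, and the totalcounts file gives per-year corpus totals. A minimal sketch, assuming the v2 (20120701) file names, which differ from the 2009 release used for the chart above:

```python
# Sketch of the book-level calculation, assuming the v2 export layout:
# 1-gram lines are "ngram\tyear\tmatch_count\tvolume_count", and the
# totalcounts file is tab-separated "year,match_count,page_count,volume_count"
# records. The file names below are assumptions about which release you have.
from collections import defaultdict

totals = {}  # year -> total number of volumes in the corpus
with open("googlebooks-eng-all-totalcounts-20120701.txt") as f:
    for record in f.read().split("\t"):
        if record.strip():
            year, _matches, _pages, volumes = record.split(",")
            totals[int(year)] = int(volumes)

mentions = defaultdict(int)  # year -> volumes containing "BiblioBazaar"
with open("googlebooks-eng-all-1gram-20120701-b") as f:  # the "b"-initial shard
    for line in f:
        ngram, year, _match_count, volume_count = line.rstrip("\n").split("\t")
        if ngram == "BiblioBazaar":
            mentions[int(year)] += int(volume_count)

for year in sorted(mentions):
    print(year, round(100 * mentions[year] / totals[year], 1))
```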

How did "BiblioBazaar" supplant the largest presses in just one year?

This has messed up other sources: Publishers Weekly reported in 2010 on how BiblioBazaar leapt to the top of Bowker's charts. In that article, the company's president was asked whether they really published 270,000 books in 2009:

"If by ‘produce' you mean create a cover file that will print at multiple POD vendors, a book block that will print at multiple POD vendors, and metadata to sell that book in global sales channels, then yes, we did produce that many titles," said Mitchell Davis, president of BiblioLife, parent company of BiblioBazaar.
And what sort of books are they?
All of the company's content is in the public domain, and is basically "historical reprints," Davis told PW, with foreign-language books and their "added layers of complexity" the fastest-growing category of books. "Dealing with out-of-copyright materials lets us leverage our knowledge and relationships in the global bookselling industry more easily as we build out what is shaping up to be a pretty killer platform," he noted.
In other words, the entire 19th century sprang back into print. You can see this in the Ngrams charts pretty clearly: "thou," "thee," and "thy," for example, shoot up fivefold in 2007-2008 after centuries of decline. I've noticed this before, but now I'm inclined to think most of it can be attributed to this one company.
 
Or "railway." This isn't qualitatively different from other shifts in the Ngrams database: the gaps at 1922 as the library composition shifts, for example. But it is quantitatively of a different order. The complete immateriality of books post-2008 means that minor decisions about whether an e-book actually exists or not can cause shifts in 40% of the corpus. I spent some time looking for words that shift at the 1922 break, and though they do exist (it seems that the loss of Harvard drops most medical terms, for example), the shifts are a few percentage points: nothing that anyone should be taking seriously in that noisy a dataset anyway.  But half the corpus: that's something else entirely.

Among other strange things, that means Ngrams is almost certainly at its most useful right now; with each year, it gets further and further out of date, and it will be very difficult to update without making a lot of hard choices.


Quick postlude: I should hasten to say that everyone involved with Ngrams is aware of this whole class of problems, and that the quick credibility check for anyone citing Ngrams is whether they avoid using the post-2000 books as evidence. There's a reason the default settings stop in 2000, and that the Michel et al. paper urges you not to use the newer data. But I hadn't realized before how different 2008, in particular, was.

7 comments:

  1. Wow. I knew they were an enormous problem, but I hadn't come close to understanding the scale of the challenge they pose.

    As bad as their polluting the data may be, there's an even more worrisome element of their business model. The parent company partnered with libraries, offering to scan their pre-1923 titles for free in exchange for digital copies. (It's been alleged that another publisher, Kessinger Press, was simply downloading the Google scans and repackaging them. BiblioBazaar says that's not what it's doing. I have no way to evaluate that claim.)

    What's fascinating to me is that these BiblioBazaar editions of public domain works are not available to be read on Google Books. Not as full books. Not as limited numbers of pages. Not even as text snippets. They're simply blank. If you want them, you'll have to buy them as print-on-demand editions off Amazon or another seller. Now, why should that be?

    Let me offer another general and vague observation. Quite often, when there's a BiblioBazaar edition of a work, I can't seem to find a fully accessible version on Google Books. And in a few cases, I've been using a scanned pre-1923 edition of a work in research, and returned to find it no longer accessible, and noticed that there is now a BiblioBazaar edition.

    That's not dispositive. Google Books is screwy in a lot of ways. But there's a track record of other publishers - most notably Kessinger - allegedly filing fraudulent reports of copyright violations with Google, using its automated processes to remove full-text editions of public domain works from Google Books. That leaves only the expensive print-on-demand editions for many titles.

    So when I see a shady publisher slapping modern copyright dates on public domain works, and preventing anyone from seeing even snippets of the texts, I get very worried and very suspicious.

    Replies
    1. That's totally fascinating, and--as you say--far more important than whatever it does to the data sets. I've only hit the minor but consistent inconvenience that the reprints show up at the top of searches, but what you lay out seems completely plausible. Unlike YouTube infringements, which work on the same basic principle, there's no agent out there with any interest in keeping a book online besides Google itself.

      In theory, most of those removed public-domain works should stay in Hathi and the Internet Archive, which have a lot of the Google scans; I wonder if there's some way to see how big the problem is by looking for books in the Hathi catalogue that have been removed from Google. I suspect not, because bulk access to the Google catalogue is really difficult.
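      A per-volume spot check is at least sketchable, though: HathiTrust has a public bibliographic API, and Google Books exposes viewability through its volumes API. A rough sketch, assuming both respond the way their documentation suggests; the OCLC number is just a hypothetical placeholder:

      ```python
      # Sketch: flag volumes HathiTrust holds but Google Books shows as blank.
      # Both endpoints are real public APIs; the response fields checked here
      # are my assumptions about their JSON shape.
      import requests

      def hathi_has(oclc: str) -> bool:
          url = f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc}.json"
          return bool(requests.get(url).json().get("items"))

      def google_viewability(oclc: str) -> list:
          r = requests.get("https://www.googleapis.com/books/v1/volumes",
                           params={"q": f"oclc:{oclc}"})
          return [item["accessInfo"]["viewability"]
                  for item in r.json().get("items", [])]

      for oclc in ["424023"]:  # hypothetical sample OCLC number
          views = google_viewability(oclc)
          if hathi_has(oclc) and views and all(v == "NO_PAGES" for v in views):
              print(oclc, "in Hathi, but no viewable pages on Google Books")
      ```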

  2. If ever there was a case for making sure we understand our sources and data, you just made it. Very cool.

  3. I don't suppose this makes n-grams obsolete. Maybe it's just an opportunity to acknowledge that what we really want in the datasets is first editions only. Unless we're using the corpus for another reason, in which case we might not. It comes down to control over what's in or what's out, though, and at the moment we really don't have that.

    Replies
    1. Yes, precise control of contents is incredibly useful; we (i.e., Erez Aiden's lab at Rice, which prompted Google to build the Ngram browser to begin with, and myself) have a joint proposal in at the NEH with the HathiTrust Research Center for a Bookworm browser that would enable all sorts of custom corpus creation based on metadata.

      But one of the things that's most interesting about Google Ngrams is that it showed you can get a remarkably usable dataset by simply outsourcing the decision about whether to include a book to the fact of publication itself. (And a few other things—language and OCR quality, in particular.)

      And that paradigm, at least, is breaking down. You could say that actually, Ngrams only works through 2000 because human librarians (alive and dead) chose which books would be counted; and that at the moment it switched to publishers supplying the books, something changed. That seems possible.

    2. "what we really want in the datasets is first editions only. Unless we're using the corpus for another reason"

      I'm not so sure about that. It betrays a certain assumption about our data use: that ngrams-by-publication-date are a good proxy for measuring *something* culturally relevant. More specifically, that the words written in any given year represent, or are correlated with, some cultural signal.

      But (as Ben and others have brought up before) the dataset isn't a universal sample; it's representative of a human (in this case mostly librarian-driven) selection process. Thus, the sample isn't necessarily one of what is written, but of what is read or saved or kept for whatever reason. That selection process changes with a publisher-driven model, but it doesn't necessarily take it any closer to or farther from the original assumption.

      Which is all to say, the ideal sample existed no more before 2000 than after, and that's something I haven't really seen confronted in studies that rely on these analyses. The creation process for the corpora fluctuates. I'm sure more recent books, even pre-2000, are less reliant on their popularity to have made it into ngrams than books from 300 years ago. The earlier we go, though, the less Adam's basic claim (the first published edition is meaningful because it's about first word use) holds up to scrutiny.

      What's needed, and hopefully can come out of this new project Ben mentioned, is a serious study of what signal, exactly, is being processed. Is it a study of what's being written, of what's being read, of what's being saved? Only then can we start saying things like "all we want is the first published edition."

  4. On a different level, this has made it remarkably difficult to find authentic old books. A search on Bookfinder et al. will typically reveal only an avalanche of POD junk.
