Sunday, February 10, 2019

How badly is Google Books search broken, and why?

I periodically write about Google Books here, so I thought I'd point out something that I've noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken.

Here's an example. While writing something, I became interested in the etymology of the phrase 'set in stone.' Online essays seem to generally give the phrase an absurd antiquity--they talk about Hammurabi and Moses, as if it had been translated from language to language for decades. I thought that it must be more recent--possibly dating from printers working with lithography in the 19th century.

So I put it into Google Ngrams. As is often the case, the results were quite surprising: about 8,700 total uses in about 8,000 different books before 2002, the majority of which are after 1985. Hammurabi is out, but lithography doesn't look like a likely origin for widespread popularity either.

That's much more modern than I would have thought--this was not a pat phrase until the 1990s. That's interesting, so I turned to Google Books to find the results. Of those 8,000 books published before 2002, how many show up in the Google Books search result with a date filter before 2002?

Just five. Two books that have "set in stone" in their titles (and thus wouldn't need a working full-text index), one book from 2001, and two volumes of the Congressional Record. 99.95% of the books that should be returned in this search--many of which, in my experience, were generally returned four years ago or so--have vanished.

Many of these books *do* still exist in the HathiTrust index.

Changing the date does not produce the results you'd expect, either. "Set in stone" with a date filter set before 1990 returns *no books at all*, only a single non-book result from a 1982 Washington Post article that has wandered into the Google index. This is especially interesting, because it means that the date displayed for the two Congressional Record volumes, 1900, is *not* being used for retrieval. This is probably wise: books listed as being published in 1900 in the library catalogs feeding into Google can be from any time. Choosing a date before 2020 (which should return all books) adds only a few books to the 2002 listing.

When you search for the term with no date restrictions, Google claims to be returning 100,000-ish results. I have no way of assessing if this is true; but scrolling through results, they do include a few pre-1990 books that didn't show up in the earlier searches.

What's going on? I don't know. I guess I blame the lawyers: I suspect that the reasons have to do with the way the Google Books project has become a sort of Herculaneum-on-the-Web, frozen in time at the moment that anti-Books lawsuits erupted in earnest 11 years ago. The site is still littered with pre-2012 branding and icons, and the still-live "project history" page ends with the words "stay tuned..." after describing their annual activity for 2007.

So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year by which it is legally certain a book had been published. So maybe those volumes of the Congressional Record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn't feel confident in library metadata for most of its other books, and doesn't want searchers using date filters to find improperly released books.

Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on "rarely used word"; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.
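For anyone who wants to check the arithmetic behind those two percentages, here is a minimal sketch. It uses only the counts quoted in this post, and it assumes that the Ngrams figures (about 8,000 books, 192 uses) are a fair stand-in for the number of results a working search should return, which is itself only an approximation.

```python
def disappearance_rate(expected: int, returned: int) -> float:
    """Fraction of expected results that the search failed to return."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 1 - returned / expected


# Counts quoted in this post; "expected" comes from Google Ngrams.
set_in_stone = disappearance_rate(expected=8000, returned=5)
rarely_used_word = disappearance_rate(expected=192, returned=22)

print(f"set in stone:     {set_in_stone:.1%}")      # roughly 99.9%
print(f"rarely used word: {rarely_used_word:.1%}")  # roughly 88.5%
```

The figures in the text are rounded versions of these, and the gap between the two searches is the point: the loss is severe in both cases, but not uniform.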

So we can't even know how bad the uncertainty is. One intriguing possibility is that the searches I'm using are themselves caught up in the algorithms used to classify books. If I worked at Google, I would have implemented a text-based date-prediction algorithm to flag erroneously classified books. (I have actually done this and sent a list to the HathiTrust of books they may have erroneously released into the public domain. It works.) If they use trigrams, it's possible that a term like "set in stone," because of its recency, might *itself* be pushing a bunch of 20th century books into the realm of uncertainty.
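To make that speculation concrete, here is a toy sketch of how a trigram-based date check could flag a book. It is not Google's method (which I don't know) and not the classifier I ran against HathiTrust. The lookup table `TRIGRAM_ONSET_YEAR`, its single entry, and the function names are all hypothetical, and the 1985 value is an illustrative placeholder rather than a measured figure. A real system would presumably weigh thousands of phrases statistically instead of applying a hard cutoff.

```python
import re
from typing import Dict, Iterator, Optional, Tuple

Trigram = Tuple[str, str, str]

# Hypothetical table: trigram -> year before which the phrase is assumed
# to be vanishingly rare in print. Placeholder value, not measured data.
TRIGRAM_ONSET_YEAR: Dict[Trigram, int] = {
    ("set", "in", "stone"): 1985,
}


def trigrams(text: str) -> Iterator[Trigram]:
    """Yield consecutive lowercase word triples, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return zip(words, words[1:], words[2:])


def earliest_plausible_year(
    text: str, onset: Dict[Trigram, int] = TRIGRAM_ONSET_YEAR
) -> Optional[int]:
    """Latest onset year among known trigrams in the text, or None."""
    years = [onset[t] for t in trigrams(text) if t in onset]
    return max(years) if years else None


def is_date_suspect(
    catalog_year: int, text: str, onset: Dict[Trigram, int] = TRIGRAM_ONSET_YEAR
) -> bool:
    """True when the catalog date is earlier than the text's language suggests."""
    bound = earliest_plausible_year(text, onset)
    return bound is not None and catalog_year < bound


sample = "Nothing in this bill is set in stone, the senator said."
print(is_date_suspect(1900, sample))  # catalog year 1900 vs. a 1985 onset
print(is_date_suspect(1997, sample))
```

Under a scheme like this, a volume cataloged as 1900 that contains "set in stone" would be flagged, which is exactly how a recent phrase could end up pushing the books that contain it into the uncertain pile.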

Partly this is the story that we all know: Google Books has failed to live up to its promise as the company has moved away from its original mission of organizing information for people. But the particular ways that it has actually eroded, including this one, are worth documenting, because it's easy to think that search tools that worked perfectly well a few years ago won't have been consciously degraded.