A few years ago, I felt I had some general sense of what was in the Google Books search engine and how it worked. That sense is diminishing as things change more and more. I used to think I knew how search engines work: you put in some words or phrases, and a computer traverses a sorted index to find instances of the word or phrase you entered; it then returns the documents with the highest share of those words, possibly weighted by something like TF-IDF.
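For concreteness, here is a minimal sketch of that older mental model: an inverted index from words to documents, with results ranked by a TF-IDF-style score. Everything in it is an assumption made for illustration (the toy documents, the function names, the whitespace tokenizer, and the particular smoothed IDF formula); it matches loose words rather than exact phrases, and it is not a description of how Google Books was ever implemented.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Inverted index: term -> {doc_id: number of times the term occurs}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(query, index, n_docs):
    # Score each document by summing tf * idf over the query terms.
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed IDF (one of several common variants), always positive.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken alphabetically by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy corpus, invented for this example.
docs = {
    "curtis_bio": "benjamin curtis supreme court justice dissent",
    "court_history": "history of the supreme court",
    "cookbook": "bread and soup recipes",
}
index = build_index(docs)

curtis_results = search("benjamin curtis dissent", index, len(docs))
court_results = search("supreme court", index, len(docs))
missing_results = search("zebra", index, len(docs))
```

In this model a query whose words appear nowhere in the corpus (like `missing_results` above) simply comes back empty, which is the behavior the rest of this post fails to find.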
Nowadays it's far more complicated than that. This post is just some notes on my trying to figure out one strange Google result, and what it says about how things get returned.
Idly wondering whether a strange phrase in the Wikipedia article on the 19th-century Supreme Court justice Benjamin Curtis was plagiarized, I put it into Google Books search. Consider the results here.
First, we get a photograph from the Watertown Public Library in Massachusetts. Google Books is full of library catalog items which are not in fact books; in this case, though, it seems to be the top search result simply because the precise phrase I searched for is not present in any books. (The normal Google Books search formatting is missing.)
The next results, though, are more interesting in what they say about the internal workings of the Google Books engine. They do not contain the phrase in question, but they are all about Benjamin Curtis. There are only a dozen or so, not a full set of Curtis-related documents; in fact, it seems that Google Books is presenting me with the "Further Reading" section (not the bibliography) of the Wikipedia page from which the quote came. This seems to be general. Search for an arbitrary phrase from a Wikipedia article, and you'll get a list back that includes some of the texts the article cites or recommends.
Why is it doing that? What mechanics of a search engine would cause this to happen? Is Books translating otherwise unmatched queries into Wikipedia pages, and returning their contents? Not in general: most unmatched phrases turn up nothing.
I can start to understand this individual rule by typing in random strings. I came up with the phrase "such a place would never be", which appears in no books but does have 6 search results on the web.
That phrase returns a series of books, starting with a book titled "John Brown to James Brown" which one Amazon reviewer described as "where such a place would never be imagined to have existed." The relationship of the rest of the books, which touch mostly on early history of the Mormons and moneymaking schemes, is less clear: I'd bet they all appeared on the Amazon page as "frequently bought," but I can't be sure. (I don't get "frequently bought" when I visit the book's page, and the Google cache doesn't have it either.)
So there's some form of linked data about books driving Google Books search, fed by the open web, possibly as a fallback or possibly as part of the core search. It seems to work best from structured data like Amazon or Wikipedia, but it also engages in some pretty wild guesses based on semantic parsing.
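To make the "fallback" version of that guess concrete, here is a toy sketch of what such a pipeline could look like: try an exact phrase match inside the books first, and only if that finds nothing, look for web pages containing the phrase and return the books those pages link to. This is purely my speculation rendered as code. The function names, the data layout, and the sample records are all invented for illustration, and nothing here is known to reflect Google's actual implementation.

```python
def exact_phrase_hits(phrase, books):
    # Books whose full text literally contains the phrase.
    needle = phrase.lower()
    return [book_id for book_id, text in books.items() if needle in text.lower()]

def web_fallback_hits(phrase, web_pages):
    # Hypothesized fallback: find web pages containing the phrase and
    # return the books those pages link to, in page order, without repeats.
    needle = phrase.lower()
    results = []
    for page in web_pages:
        if needle in page["text"].lower():
            for book_id in page["linked_books"]:
                if book_id not in results:
                    results.append(book_id)
    return results

def books_search(phrase, books, web_pages):
    # Prefer real full-text matches; otherwise fall back to the web graph.
    hits = exact_phrase_hits(phrase, books)
    return hits if hits else web_fallback_hits(phrase, web_pages)

# Hypothetical sample data, invented for this example.
books = {
    "curtis_memoir": "a memoir of benjamin robbins curtis",
    "brown_to_brown": "a history running from john brown to james brown",
    "mormon_history": "the early history of the mormons",
}
web_pages = [
    {
        "name": "encyclopedia_article",
        "text": "an oddly worded sentence about the justice",
        "linked_books": ["curtis_memoir"],
    },
    {
        "name": "bookstore_review_page",
        "text": "where such a place would never be imagined to have existed",
        "linked_books": ["brown_to_brown", "mormon_history"],
    },
]

direct = books_search("memoir of benjamin", books, web_pages)
via_web = books_search("such a place would never be", books, web_pages)
nothing = books_search("no such words anywhere", books, web_pages)
```

The interesting case is `via_web`: the phrase is in none of the books, yet the search still returns books, because a page that happens to contain the phrase links to them. The other possibility I raised, that web data is part of the core ranking rather than a fallback, would look different and is not sketched here.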
Basically, the web index is hooked into Books search in ways that aren't obvious or transparent. And it leads to an extremely strange world of algorithms stacked on algorithms. That a phrase from an Amazon review should bring up a set of seemingly random books that Amazon grouped together at some point in the past is completely inscrutable.
EDIT, one day later:
It occurs to me that part of what's going on here may be that the same algorithm is used for books as in Google's image search. If you do an image search for these phrases, it pops up images from the Wikipedia article and the Amazon page (and, now, this very blog post). Google appears to treat books both as collections of text to be searched, *and* as entities that exist on the web described through the text of web pages.