Thursday, February 3, 2022

What's in the Hathi Trust?

(This is a post I've had unpublished since writing it in 2016. I'm hitting publish now without reviewing it, because I periodically find myself coming back to look at the charts.)

As we get ready to launch the full Hathi Trust+Bookworm to allow tracking words across 13 million books, I've been working on fixing up the metadata from the original MARC records.

This is useful information to have for anyone using Hathi to find books; it's hard to know the general outlines of a collection like this. So what follows are some general outlines about what books are included in the Hathi Trust. This is closely related, by the way, to what books are included in Google Books; more on that below.

One hugely important question is where the books come from. Different libraries had different scanning agreements with Google, and collect differently. Medical terms show a sharp drop-off in Google Ngrams after 1922 not because they were used less in writing, but because Harvard's excellent medical library wasn't scanned by Google for post-1922 books. Hathi has some of the same issues in its scope; Harvard (dark blue) disappears after 1922 along with the New York Public, and the collection becomes largely Michigan (light orange) and California (light blue). A huge spike of misdated items from the year 1900 can be blamed mostly on California. The sharp contraction in the size of the corpus after 1922 can partially be blamed on the libraries that disappear, but California too contributes fewer books. It will take some work to create a sample that is relatively consistent across the copyright line.

Books by originating library (top 10) by year.
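For anyone who wants to check how lopsided the contributors are on either side of the copyright line, here is a minimal sketch of the tally. It is not the code behind the chart: the record layout (a year and a library name), the sample rows, and the `library_shares` helper are all made up for illustration, and treating 1922 itself as the "before" side is an assumption.

```python
from collections import Counter

# Hypothetical records: (publication year, contributing library).
# Real Hathi metadata would come from the MARC records instead.
records = [
    (1900, "California"),
    (1900, "California"),
    (1910, "Harvard"),
    (1915, "Michigan"),
    (1930, "Michigan"),
    (1950, "California"),
]

COPYRIGHT_LINE = 1922  # assumption: 1922 itself counts as "before"


def library_shares(records, cutoff=COPYRIGHT_LINE):
    """Return each library's share of books before and after the cutoff."""
    counts = {"before": Counter(), "after": Counter()}
    for year, library in records:
        side = "before" if year <= cutoff else "after"
        counts[side][library] += 1

    shares = {}
    for side, counter in counts.items():
        total = sum(counter.values())
        # Leave a side empty rather than dividing by zero.
        shares[side] = (
            {lib: n / total for lib, n in counter.items()} if total else {}
        )
    return shares


shares = library_shares(records)
```

Comparing `shares["before"]` with `shares["after"]` shows which libraries drop out entirely, which is the first thing to know before trying to build a sample that is consistent across the line.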

Subject domains

Many of these libraries use the Library of Congress classification for their books: we've adopted it as the default subject classification for Bookworm because nothing else (subject headings, for example) is nearly so highly populated.
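As a rough illustration of what using the classification involves, an LC call number can be reduced to its class letters (the one to three letters before the first digits). This is only a sketch under that assumption; the `lc_class` helper and the sample call numbers are hypothetical, and this is not necessarily how Bookworm parses the field.

```python
import re

# One to three leading letters followed by a digit, e.g. "DS735 .A2" -> "DS".
_LC_CLASS = re.compile(r"^\s*([A-Z]{1,3})\s*\d")


def lc_class(call_number):
    """Return the LC class letters of a call number, or None if it doesn't look like LC."""
    match = _LC_CLASS.match(call_number.upper())
    return match.group(1) if match else None


examples = ["DS735 .A2", "PL2658.E8", "E185.61", "ps3545", "813.54", ""]
classes = [lc_class(c) for c in examples]
```

Call numbers that don't start with letters and a digit (a Dewey number, an empty field) come back as `None`, so they can be counted separately instead of being silently misfiled.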

I was surprised to see how heavily represented two classifications are in the post-1920 period. DS, dark blue, is the history of Asia; PL, dark purple, is Asian Literature. After 1960 or so, both of these classes are larger than their American counterparts (E/F and PS). "Asia" is a much larger and more populated area; still, these indicate that the post-1970 Hathi collection has a less myopic view of the world than I might have expected from purely American universities.

Book Scanners

Less surprising is the universe of scanners. One of the reasons to understand Hathi is that it's as close as we can get, in many ways, to knowing what's in Google Books. The big difference is that Google includes many libraries not in Hathi (such as Oxford) and Hathi includes some books not in Google Books. As the visualization below makes clear, though, the preponderance of books were scanned by Google, and other organizations make a numerically significant contribution only before 1922: the Internet Archive at a variety of libraries (green) and Microsoft at Cornell (red). (Note that I'm limiting the time scale here to just post-1815, not 1750.)


One major question about the default search under the new format is whether we'll restrict it to just English or make it cross-language. As the LC classes indicated, there are two different worlds before and after copyright; pre-1922 is basically English (green), French (light blue), and German (dark blue), while post-1922 begins to bring in significant numbers of texts in Japanese and Chinese (dark and light orange) and Russian. Chinese in particular is a pretty substantial corpus; I'm curious to see if the tokenizers worked well enough to make this a useful tool.

Relative language usage tells the same story from a different angle: here the y axis is *percent* of the corpus, not absolute number of texts. (It narrows slightly after 1960 because language diversity increases.) This makes clear that the corpus becomes more English-dominated over time, and better acknowledges French and Latin as significant languages early in the corpus.
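The switch from absolute counts to percent of the corpus is just a per-year normalization. Here is a minimal sketch, assuming counts keyed by (year, language); the `percent_by_year` helper, the language codes, and the numbers are invented for illustration and are not the real Hathi figures.

```python
from collections import defaultdict


def percent_by_year(counts):
    """Convert {(year, language): n} into each language's percent of that year's texts."""
    totals = defaultdict(int)
    for (year, _language), n in counts.items():
        totals[year] += n

    return {
        (year, language): 100.0 * n / totals[year]
        for (year, language), n in counts.items()
        if totals[year]  # skip years with no texts at all
    }


# Made-up counts, only to show the shape of the data.
counts = {
    (1800, "eng"): 50,
    (1800, "fre"): 30,
    (1800, "lat"): 20,
    (1990, "eng"): 600,
    (1990, "chi"): 200,
}
percents = percent_by_year(counts)
```

In this toy data the 1990 English count is twelve times the 1800 one, but the share only moves from 50 to 75 percent, which is why the two charts can tell somewhat different stories.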
