All the cool kids are talking about shortcomings in digitized text databases. I don't have anything as detailed to say as Goose Commerce or Shane Landrum, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests: the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.
Thanks to a question from Hank in the comments on a previous post, I went looking in my database for books by Herman Melville. I noticed that most of the ones I have were published in the 20th century. That's not surprising, given Melville's obscurity in his lifetime. Still, I'd like to have the books for any study of 19th century culture. The lack of first editions has bothered me before with Mark Twain; I actually changed my publisher list last time I remade the database so it would catch some of his works with obscure presses. But even though Melville usually published with Harper, Open Library only has a few Melville texts published in his lifetime that meet my metadata criteria: two 1847 Typees, one 1849 Mardi, and one 1856 Piazza Tales. That's it. No Moby-Dick, no Billy Budd, and only a microform copy of The Confidence Man without a library call number. HathiTrust's earliest copy of Moby-Dick, just like OL's, is from 1892. Google Books' interface is less well FRBRized, but a search for the first edition of Moby-Dick shows mostly the bad initial reviews, along with a number of empty bibliographic records that Google doesn't seem to realize refer to the same edition. The list that search returns for me starts with one that declares Moby-Dick a joint project among Melville, Mark Twain, and Mortimer Adler. (I assume Adler wrote all the pedantic bits about whale biology, and Twain cashed his check as soon as he wrote the bit about knocking people's hats off in the street.)
Why no Moby-Dick in the libraries? Here's my guess. The Google book digitization project is the only source for Google Books, and the biggest for Open Library and Hathi. We've got different sources, but they're all using the same library books from the same scanning sessions. Think about how the Google scanning project worked. They set up cameras in various library collections and started scanning books shelf-by-shelf, I believe, which is a sensible way to get a bunch of digital texts in a hurry. But there's a catch: no university library could possibly still have a first edition of Moby-Dick on its open shelves in the 2000s. Any good library would have moved it to rare books long ago. If they hadn't, some enterprising undergraduate would have snatched it up to pay a year or two's tuition. It's a cultural artifact, a prince among books: it's too important to leave among the plebes in the stacks.
So precisely because the first edition of Moby-Dick is such a cultural touchstone, precisely because we want to preserve it so much, it wasn't among the first 10 million or so volumes we put in our most important digital libraries. Perhaps some collection did its own scan of Moby-Dick in the early days of digitization. I'd be surprised if none did. But if so, it isn't easy to find: Yale seems to have given up after two pages on the copy in Beinecke, and that's the only thing I'm turning up on Google. Any academic-led scanning project might well have started with this book; but the quantity-over-quality approach that Google Books has used means we still don't have it easily accessible. It's not the only one. The Adventures of Huckleberry Finn exists in the Bodleian Library copy of the 1884 first British edition, but the first American edition (1885) seems to be missing. The Bodleian's indifference about American classics seems also to be responsible for Google Books' copy of The Confidence Man, not present in Open Library, but a lot of books are missing entirely: there's no Tom Sawyer, either, no Origin of Species until 1861… I'm sure the list goes on.
That's ironic, but also a neat little parable about how the touchstones of the mid-century academy are approaching the Internet. We're so focused on preserving the book that most contemporary academic research in the humanities is as inaccessible as it's ever been: journal articles available only from within university campuses, and books not available online at all. Since scholars haven't been heavily involved in putting things online, not only the first edition of Moby-Dick but most of the current scholarship about Moby-Dick is still nearly invisible on the Internet. Protecting the culture of the past so that it can be used the way it always has been means excluding it from new currents of consumption.
That's a neat story, but is it really a big deal? For some text analysis, to be sure, this is a bit of a pain. I'd really like a complete run of Melville's or Twain's works to compare to the language of their contemporaries, but lacking the first editions, I have to rely on later publications. In extreme cases of late rehabilitation like The Confidence Man, Open Library has only that one oddly cataloged microfilm copy before the copyright cutoff; and where there are public domain copies, it requires really good FRBRization to be able to get the original publication year. And of course, as I said about Tom Sawyer and Google ngrams a while ago, it may be hard to sell humanists on the idea that text analysis measures the entire culture when it's missing its most central documents.
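To make the FRBRization problem concrete, here is a minimal sketch of the crudest version of it: collapse edition records into "works" by a normalized author and title, then take the earliest year held for each work. The record layout (`author`, `title`, `year`) and the function names are my own assumptions for illustration, not any catalog's actual schema, and the 1922 record is invented; only the 1892 and 1847 dates come from the holdings described above.

```python
import re


def work_key(author, title):
    """Crude work identifier: lowercase, drop punctuation, collapse spaces.

    This is an assumed normalization, far weaker than real FRBR clustering.
    """
    def norm(text):
        return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

    return (norm(author), norm(title))


def earliest_year_by_work(records):
    """Map each work key to the earliest publication year among its editions."""
    earliest = {}
    for rec in records:
        key = work_key(rec["author"], rec["title"])
        if key not in earliest or rec["year"] < earliest[key]:
            earliest[key] = rec["year"]
    return earliest


# Hypothetical edition records; the 1922 entry is made up for the example.
records = [
    {"author": "Melville, Herman", "title": "Moby-Dick", "year": 1892},
    {"author": "Melville, Herman", "title": "Moby Dick", "year": 1922},
    {"author": "Melville, Herman", "title": "Typee", "year": 1847},
]

for key, year in sorted(earliest_year_by_work(records).items()):
    print(key, year)
```

Note what this can and cannot do: the earliest year in the collection is only an upper bound on the original publication year. With no first edition scanned, even perfect grouping dates Moby-Dick to 1892, not 1851.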
But for the real distant reading stuff, I don't think it matters much. For any project that actually takes advantage of what digital reading allows, quantity matters far more than quality. If we wait for nice TEI editions of all books to show up, it will be decades before anyone could leverage the most interesting techniques one can use on large bodies of texts. Either you're doing Melville studies, in which case you can just add a copy of Moby-Dick to your database, or you're not, in which case a few nautical terms here and a few whale skeletons there aren't going to change the language very much. Any study whose results would be changed by a couple of books is probably trying too hard to wring evidence out of a small sample size. As long as I'm right that the really famous books are missing just because they're so famous, and not because the database is completely riddled with holes, the general picture of the language should be fine. At some point, I'd hope Google, Hathi, or Open Library would take the lead in scanning books from rare-books libraries. I can't imagine, honestly, that the last library will actually be missing these texts for long. But I would be surprised if there were more than a few dozen books in the 19th century that meet Moby-Dick's criteria of incredible modern value and original obscurity.
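One way to check whether a result hangs on a couple of books is a leave-one-out test: recompute the measure with each book removed and see how far it moves. The sketch below does that for a single word's relative frequency. The function names and the toy corpus are assumptions for illustration only, and a real project would run this against its own measure rather than raw word counts.

```python
def relative_frequency(docs, word):
    """Share of all tokens in docs (a list of token lists) equal to word."""
    total = sum(len(doc) for doc in docs)
    hits = sum(doc.count(word) for doc in docs)
    return hits / total if total else 0.0


def leave_one_out_shifts(docs, word):
    """For each book, the change in relative frequency when it is removed."""
    full = relative_frequency(docs, word)
    return [
        relative_frequency(docs[:i] + docs[i + 1:], word) - full
        for i in range(len(docs))
    ]


# Toy corpus: 100 "books" of 100 tokens each; only the first mentions whales.
docs = [["whale"] * 5 + ["sea"] * 95] + [["sea"] * 100 for _ in range(99)]

shifts = leave_one_out_shifts(docs, "whale")
largest = max(range(len(docs)), key=lambda i: abs(shifts[i]))
print("book with the largest influence:", largest)
```

In the toy corpus the count of "whale" depends entirely on one book, which is exactly the warning sign: a measure that a single volume can erase is resting on too small a sample, while a measure of common vocabulary would barely move.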
Nonetheless, it's a helpful reminder that we've got a long way to go before we can talk about comprehensive book digitization. Our current collection of texts is skewed in all sorts of strange ways: not only by library collection patterns, but by where libraries keep their physical books, by what books are easier to enter consistent metadata for, by how certain authors' reputations waxed and waned… This is all more evidence that we're just beginning to get a sense of how our big digital libraries differ from our old stone ones, and that without checking each other's work for mistakes of the type that only specialists, or librarians, or archivists can catch, we might find ourselves in some uncomfortable situations.