So there are four things I'm immediately interested in from yesterday's Google/Harvard paper:
- A team of linguists, computer scientists, and other non-humanists published a paper in Science using Google word-count data to outline the new science of 'culturomics';
- They described the methodology they used to get word counts out of the raw metadata and scans, which presumably represents the best Google could do in 2008-09;
- Google released a web site letting you chart the shifts in words and phrases over time;
- Google released the core data powering that site containing data on word, book, and page occurrences for various combinations of words.
Twitter seems largely focused on #3 as a fascinating tool/diversion, the researchers seem to hope that #1 will spur a burst of serious research using #4, and anyone doing research in the field should be eagerly scanning #2 for clues about the state of the art—how far you can get with full cooperation from Google, money to hire programmers, and unlimited computing infrastructure.
Each of these is worth thinking about in turn. Cut through all of it, though, and I think the core takeaway should be this:
Humanists need to be more involved in how these massive stores of data are used.
None were involved in this project, and it shows in all sorts of ways. Humanists understand the limitations, the opportunities, and the nuances of the books that have been sitting on library shelves better than anyone else, and they have been thinking for decades about differences between language and culture that seem to be mostly off the radar screen of the people doing this work now.
The Google dataset is fascinating. If I can find the hard drive space to start processing it, I look forward to using it on some of the same questions I've been tackling. It's great for linguistic applications like the regular/irregular verbs it was created for, and will be useful for a lot of constructions I'm interested in for other reasons.
As a public tool, as a first source, it's invaluable. The first academic I showed it to today was blown away by the implications. Lots of people are playing around with the results today. Once we start to figure out how to process the gigs of data that were dumped with it, a lot of other interesting stuff becomes possible, too.
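For anyone who does want to start crunching the dump, here is a minimal sketch of what that processing might look like. It assumes (as the release materials suggest) that the files are tab-separated rows of ngram, year, match count, page count, and volume count; the sample rows below are hypothetical, not real data.

```python
from collections import defaultdict

def parse_ngram_lines(lines):
    """Parse tab-separated ngram rows into per-word yearly match counts.

    Assumes the released format: ngram, year, match_count,
    page_count, volume_count (one row per ngram per year).
    """
    counts = defaultdict(dict)
    for line in lines:
        ngram, year, match_count, page_count, volume_count = \
            line.rstrip("\n").split("\t")
        counts[ngram][int(year)] = int(match_count)
    return counts

# Hypothetical sample rows in the assumed format:
sample = [
    "culturomics\t2010\t5\t5\t3",
    "humanist\t1900\t120\t118\t47",
    "humanist\t1901\t134\t130\t52",
]
series = parse_ngram_lines(sample)
print(series["humanist"][1901])  # 134
```

In practice you'd stream the real files line by line rather than load them into memory, since the full dataset runs to many gigabytes.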
But for now: it's disconnected from the texts. This severely compromises its usefulness in most humanities applications. I can't track language evolution in any subset of books or in any sentence/paragraph context; a literary scholar can't separate out pulp fiction from literary presses, much less Henry James from Mark Twain. It was created by linguists, and it treats texts fundamentally syntactically--as bags of words linked only by very short-range connections of two or three words. The wider network of connections that happens in texts is missing.
Don't doubt that it's coming, though. My fear right now is that all of the work is proceeding without the expertise that humanists have developed in understanding how to carefully assess our cultural heritage. The current study casually tosses out pronouncements about the changing nature of 'fame' in 'culture' without, at least on a first skim, acknowledging any gap at all between print culture and the Zeitgeist. I know I've done the same thing sometimes, but I'm trying to be aware of it, at least. An article in Science promising the "Quantitative Analysis of Culture" is several bridges too far.
So is it possible to a) convince humanists they have something to gain by joining these projects, and b) convince the projects that they're better off starting within existing conversations rather than treating this as an opportunity to reboot the entire study of culture? I think so. It's already happening, and the CHNM–Google collaboration is a good start. I think most scholars see the opportunities in this sort of work as clearly as they see the problems, and this can be a good spur to talk about just what we want to get out of all the new forms of reading coming down the pike. So let's get started.