Friday, December 17, 2010

Missing humanists

(First in a series on yesterday's Google/Harvard paper in Science and its reception.)

So there are four things I'm immediately interested in from yesterday's Google/Harvard paper.

  1. A team of linguists, computer scientists and other non-humanists published that paper in Science about using Google data for word counts to outline the new science of 'culturomics';
  2. They described the methodology they used to get word counts out of the raw metadata and scans, which presumably represents the best Google could do in 2008-09;
  3. Google released a web site letting you chart the shifts in words and phrases over time;
  4. Google released the core data powering that site containing data on word, book, and page occurrences for various combinations of words.

Twitter seems largely focused on #3 as a fascinating tool/diversion, the researchers seem to hope that #1 will create a burst of serious research using #4, and anyone doing research in the field should be eagerly scanning #2 for clues about what the state of the art is: how far you can get with full cooperation from Google, with money to hire programmers, etc., and with unlimited computing infrastructure.

Each of these is worth thinking about in turn. Cut through all of it, though, and I think the core takeaway should be this:

Humanists need to be more involved in how these massive stores of data are used.

None were involved in this project, and it shows in all sorts of ways. Humanists understand the limitations, the opportunities, and the nuances of the books that have been sitting on library shelves better than anyone else, and they have been thinking for decades about differences between language and culture that seem to be mostly off the radar screen of the people doing this work now.

The Google dataset is fascinating. If I can find the hard drive space to start processing it, I look forward to using it on some of the same questions I've been tackling. It's great for linguistic applications like the regular/irregular verbs it was created for, and will be useful for a lot of constructions I'm interested in for other reasons.

As a public tool, as a first source, it's invaluable. The first academic I showed it to today was blown away by the implications. Lots of people are playing around with the results today. Once we start to figure out how to process the gigs of data that were dumped with it, a lot of other interesting stuff becomes possible, too.

But for now: it's disconnected from the texts. This severely compromises its usefulness in most humanities applications. I can't track evolutionary language in any subset of books or any sentence/paragraph context; a literary scholar can't separate out pulp fiction from literary presses, much less Henry James from Mark Twain. It was created by linguists, and treats texts fundamentally syntactically--as bags of words linked only by very short-term connections--two or three words. The wider network of connections that happen in texts is missing.
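To make the "bags of words" point concrete, here is a minimal Python sketch of what counting short word sequences does to a text. It is an illustration only, not the pipeline the paper describes: the `ngrams` helper, the two book labels, and the sentences are all invented for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive lowercased words in text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Two made-up "books" from the same year (hypothetical labels and sentences).
books_1885 = {
    "pulp novel": "the evolution of the species was a dark plot",
    "science text": "the evolution of the species is slow",
}

# Pool three-word counts by year, the way a released count table would.
pooled = Counter()
for text in books_1885.values():
    pooled.update(ngrams(text, 3))

# The pooled table records how often a phrase occurs, not which book used it.
print(pooled["the evolution of"])
```

Once the counts are pooled, nothing in the table says whether a phrase came from the novel or the science text, and no connection longer than three words survives. That is the sense in which the data is disconnected from the texts.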

Don't doubt that it's coming, though. My fear right now is that all of the work is proceeding without the expertise that humanists have developed in understanding how to carefully assess our cultural heritage. The current study casually tosses out pronouncements about the changing nature of 'fame' in 'culture' without, at a first skim, at least, acknowledging any gap at all between print culture and the Zeitgeist. I know I've done the same thing sometimes, but I'm trying to be aware of it, at least. An article in Science promising the "Quantitative Analysis of Culture" is several bridges too far.

So is it possible to a) convince humanists they have something to gain by joining these projects, and b) convince the projects that they're better off starting within conversations, not treating this as an opportunity to reboot the entire study of culture? I think so. It's already happening, and the CHNM–Google collaboration is a good chance. I think most scholars see the opportunities in this sort of work as clearly as they see the problems, and this can be a good spur to talk about just what we want to get out of all the new forms of reading coming down the pike. So let's get started.


  1. I'm a social scientist, and I can't imagine doing a "quantitative analysis of culture"--that is an overly ambitious idea that misses the whole point of the humanities.

    I'm not sure the datasets are everything we would want them to be, either. I mean, how good is OCR? Not good enough for this, in my experience. How would something like this account for differences in a word like "date," which can mean a specific day, a person you love, dinner out, and a sweet fruit? Sure it's fun to play with. But--at least for now--I will view any formal research out of ngrams with skepticism.

  2. Thanks for this objective assessment of this new Google tool.

    At first sight I thought this tool would be good for tracking large-scale tendencies in the printed aspect of culture. At the moment, however, I can only think of a search result as an illustration of, and not an argument for, a claim that is based on other facts and arguments. Without specifications of genres, types of books, editions, and publishers, and without any reference to meaning, a finding made with this tool is either accidental or possibly misleading.

    Thanks again for this thought-provoking post!

  3. I've been doing some playing around with the Ngram search tool. I've got a long-standing interest in Coleridge's "Kubla Khan," which introduced "Xanadu" into the modern English lexicon. So I searched the corpus for "Xanadu" from 1800-2008. No big deal. But . . .

    Here's a post where I integrate that search with earlier work, which is based on a web search on "Xanadu" and on some historical data from the OED and the NYTimes archive (which goes back to 1851). The Books Ngram search picked up interesting stuff not in that other material, which was hardly representative -- nor, I suppose, is the Books Ngram search. But it's something we didn't have.

    It seems that the "peak" year for "Xanadu" is 1934, seven years before Citizen Kane and seven years after Livingston Lowes published The Road to Xanadu. The significance of this is not clear. For one thing, the tool graphs percentages, not absolute numbers. For another, as you've noted, the corpus is weak on periodicals.

    Still, an interesting exercise.

  4. Great post -- very thoughtful. I think your call to action for humanists to investigate these techniques is spot-on. As you suggest, the Science piece points out a ton of really interesting potential. But, as you rightly point out, there is much more work to be done to make these new research techniques truly valuable. This type of work is still in its early stages but I do think this article will help move things along in a positive way.

    I do want to point out to your readers that there are indeed some very smart humanists who are now (and have been) investigating this kind of research. I noticed that one of your commenters, @Natalie Binder, pointed out (rightly so) the problem of "sense" (does "date" mean the fruit, a day on the calendar, or a dinner out?). There are scholars tackling this very issue. For example, David Bamman and Greg Crane from the Perseus project have a great paper about this topic in which they address how to do automated word sense induction and disambiguation across very large text corpora. (See: Of course, there are many other areas where more work must be done to make this kind of research widely useful.

    Perseus is a very strong Classics-based project that has been doing this kind of research for many years. But, of course, there are many others. I actually organized an international grant competition called the Digging into Data Challenge to fund projects that are investigating new research methods at the large scale. That is: now that the "stuff" that humanists study (books, newspapers, music, artwork, photographs, etc) are increasingly digital and increasingly in very, very large digital collections, what new computationally-based research methods might be developed to take advantage of this huge scale? What are the humanities equivalents of scientific "big data" approaches? We've funded 8 major international projects thus far (see: And these projects certainly do involve quite a few humanities scholars who are working to enable new kinds of research.

    All that said, I do want to make one last point about the recent Harvard/Google paper. I certainly hope humanists don't dismiss the paper because of a lack of humanities involvement. In the digital humanities, we're always talking about interdisciplinarity and how the humanities pervades all knowledge areas, etc. Just because the lead researchers don't have PhDs in the humanities doesn't mean they can't do work important to the humanities (for example, I know a lot of librarians who make immense contributions to the humanities). And, for the record, I note that the co-lead from Harvard, Erez Lieberman Aiden, has a master's degree in history, studied philosophy at Princeton, and is a practicing artist as well as having degrees in math, physics, and bioengineering. In fact, both of the Harvard team members have remarkably wide interests across many disciplines. Personally, I think that's terrific -- many of the digital humanities scholars doing the most creative work are also people who move across disciplinary lines with ease. Sure, I can see how the "culturomics" tag might grate on some folks, fair enough. And, of course, Google's participation helped get a lot of attention focused on the project (more attention than similar projects happening elsewhere). But in the long run, that's a good thing. This paper will advance the field and having this data available to others will spark some new research.

    Yes, we're still new at this. Yes, we need more humanities scholars helping to build this new field. As you said, "let's get started."

    Office of Digital Humanities

  5. @Natalie: These are very important points, and it's great that you raise them. As Brett says, there's a lot of interesting research on these things--in my next post I look at the OCR stuff a little bit.

    @Brett: Great points. I've been really impressed in general with how the NEH seems to be funding a lot of interesting and innovative projects wherever I look. More than anything I've seen within the disciplines, you all have been helping move the ball down the field in all sorts of interesting ways. I certainly don't want to imply that humanists haven't already been making progress. My concern is that a lot of the defensive line keeping team Digital Humanities from gaining much yardage inside the academy is made of old-school humanists who have a knee-jerk suspicion of quantification.

    On interdisciplinarity: absolutely good. I would say, though, that it's not the lack (or for Aiden, I guess, the presence) of humanities degrees that rankles me, although I probably phrased it that way: it's that they didn't feel the need to engage with a lot of basic ideas in the contemporary humanities. That's what more humanist co-authors could have brought. The paper seems to imply that humanists should be thrilled that we can now move uncritically from a mass of agglomerated words (not even texts) to definitive pronouncements about "culture," in toto. Even—especially—if they're not on the opposing team, I think a lot of humanists think someone needs to throw a flag on that play.