Thursday, December 30, 2010

Assisted Reading vs. Data Mining

I've started thinking that there's a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I'd call them:
  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.
Humanists are far more comfortable with the first than the second. (That's partly why they keep calling the second type of work 'text mining', even though I think the field has moved on from that label; it sounds sinister.) Basic search, which everyone uses on JSTOR or Google Books, is far more algorithmically sophisticated than a text-mining star like Ngrams. But since it promises merely to enable reading, it has casually slipped into research practices without much thought.
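
To make the two modes concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three-sentence corpus is invented, and the function names `find_relevant` and `relative_frequency_by_year` are hypothetical, not taken from any real tool. The first function stays on the assisted-reading side (it hands back texts for a person to read); the second is a toy version of what an ngram-style chart does (it throws the texts away and keeps only a number per year).

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration only.
CORPUS = [
    (1850, "The railway reached the town, and the railway changed it."),
    (1850, "A quiet year on the farm."),
    (1900, "The telephone and the railway both served the town."),
]


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def find_relevant(corpus, term):
    """Assisted reading: return the texts that mention `term`, with a
    mention count, so that a person can go and read them."""
    hits = []
    for year, text in corpus:
        mentions = tokenize(text).count(term)
        if mentions:
            hits.append((year, mentions, text))
    return hits


def relative_frequency_by_year(corpus, term):
    """Text mining: discard the texts themselves and keep only the term's
    share of all tokens in each year, the raw material for a line chart."""
    term_counts = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        term_counts[year] += tokens.count(term)
    return {year: term_counts[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    for year, mentions, text in find_relevant(CORPUS, "railway"):
        print(year, mentions, text)
    print(relative_frequency_by_year(CORPUS, "railway"))
```

The two functions share almost all of their machinery. What differs is the output: one returns sentences, the other returns a table that has to be read in its own right.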

The distinction is important because the way we use texts is tied to humanists' reactions to new work in digital humanities. Ted Underwood started an interesting blog to look at ngrams results from an English lit perspective, and he makes a good point in his first post:

What puzzles me about humanistic disdain for the ngram viewer is that it often seems to presume that a piece of evidence must be legible in itself — naked and free of all context — in order to have any significance at all. If a graph doesn’t have a single determinate meaning, read from its face as easily as the value of a coin, then what is it good for? This critique seems to take hyper-positivism as a premise in order to refute a rather mild and contextual empiricism.  …[Humanists] fear that the superficial certainty of quantitative evidence will seduce people away from more difficult kinds of interpretation.
I think there's a lot to this view; I've been trying to say some similar things from time to time. Word graphs like those in ngrams are just another kind of historical evidence. Yes, they require nuanced, contextual interpretation. But that's no different than other sorts of evidence. Graphs give us new texts to read, give a new platform for meditations and reflections, give a new sort of source to interpret.

And yet. I think the real fear is not about difficult vs. easy interpretation. It's about the privileged place of a particular form of reading in the humanities. Historians, in the Rankean tradition, pretty much read documents. Sometimes they read stained-glass windows or maps of archeological sites or advertisement illustrations, but those departures don't challenge the primacy of texts in the field. Humanistic disciplines preserve and elaborate traditions of reading different types of artifacts—whether they're poems, paintings, music, or diplomatic cables. That expertise is central not only to the disciplines, but to the self-identity of lots of humanists themselves.

Text mining produces completely different artifacts to read. We get summary tables, charts, line graphs. In the case of ngrams, they're almost severed from traditional books. Progressive professors like Underwood can try to read them, but the practice is quite different from looking at text. I think he gives more evidence to my claim that poststructuralist theory (although he says structuralist, being a little more interested in referents than I am) has to some degree prepared us to read these sorts of artifacts better. But at the extremes, the temptation with the new data is to model rather than to read--to chart out half lives for fame as in Science, or to plot the prominence of presidents' centennials like I did. This is fun, but it's not clear how useful. Or more precisely, who it's useful for. (Maybe there's a market in parts of the culture industry for macroculturnomic forecasting--studios wanting to know if zombies are on their way out, etc.)

As a result, text mining is something of a challenge to the humanities, because it seems to promise to obviate their ways of reading—suddenly understanding Dirichlet distributions becomes more important than having a sophisticated ear for meter or an understanding of rhetorical conventions. Humanists love to complain about the decline of the humanities and their increasing exclusion from culture: they're well primed for heavily negative responses against pure text-mining approaches. We can scold those reactions away as grouchy or Luddite, but that misses the point—old guard humanists are right that their ways of reading are often designed to facilitate interactions between two people—the creator and the reader—and any programming solution that gets between the two, however ingenious, misses the point of what the humanities offer over the social sciences. Unless we want to reproduce the split within anthropology in all the humanities fields, there's no reason to clamor for the fight. 

Assisted reading, on the other hand, is a much easier sell. As I said, search has been adopted without much thought, because it reinforces existing patterns of reading. It deprecates some of our expertise, to be sure, but those are mostly older research practices (card catalogs, letters to experts, treks to periodicals reading rooms after every footnote) that humanists are much less invested in. The problem with assisted reading is that most humanists regard it as not far removed from magic, and certainly don't engage in designing tools to do it themselves. This, I think, is one of the most important problems for the digital humanities: humanists use digital resources all the time, but are quite naïve about how they work and thus unaware of the potential to get more out of them. As a result, our resources are arranged in ways that make it far harder for us to use them. Aside from a few longstanding gems like the Perseus Project, we aren't involved in the ways that our resources go digital, and they end up in places like JSTOR with only one, suboptimal, way of getting at them. I've been thinking for a while about what humanists need to know about database design that they might not; hopefully I'll finally post that sometime soon.

I don't think we're stuck here. Some work slightly more sophisticated than artfully constructed search terms could really help to continue to demonstrate to humanists how digitization benefits them. (I'm sure there's a lot of this out there: but let me spin my lack of immediate examples as typical rather than merely embarrassing.) That sort of work makes the path to more sophisticated methods, even with non-textual outputs like charts and graphs, more palatable—something from inside the field, not an imposition from outside. Text mining and assisted reading are extremes on a spectrum, not discrete categories. (I'm sure it's clear by now that I think that about everything from genre to authorship, but it still bears repeating.) Assisted reading does rely on computers to dispose of many texts completely, and text mining always retains some lexical information, however heavily translated, at the end. The more work we can get in the middle of the spectrum, not just at the extremes, the better off we'll be.


  1. This isn't terribly related but I think there's actually an opportunity lurking in the prospect of humanists knowing more about Johann Peter Gustav Lejeune Dirichlet. I think it's fair to say that most folks in the humanities don't know that much about the history of mathematics or of science. Who knows what might come from grappling with new methods (and their origins, perhaps)?

    There's this article that I keep thinking has great relevance for these kinds of discussions. It appeared in Critical Inquiry: "Postdisciplinary Liaisons: Science Studies and the Humanities" by Mario Biagioli.

  2. Interesting. There's some stuff in there (seriously, double PhDs? Aren't Harvard grad students spending enough time in Cambridge, already?) that I'm not sure I agree with, but there's definitely a need to apply the nuanced understandings of science from science studies on the humanities. I've been thinking a lot about the epistemology of error and how difficult it is to get humanists to accept imperfection.

    Are historians of science better digital humanists? Dan Cohen does history of mathematics, of course--but in general, it seems like it's English depts., with comparatively less critical engagement with science, that have really gotten on board the text-analysis train, more so than historians who have had some exposure. I don't really know, though. Hank? Dan?

  3. Here's my reach for a concrete example of a possible gain. I think understanding a bit about Cantor's "discoveries" in mathematics really could deepen an understanding of late 19th century intellectual history (if only to get a sense of what all the excitement was about). And anyone who learns a bit of probability theory may well have to wrestle with infinite sets--and maybe even the Cantor set.

  4. I think this post has all the characteristics of the best post. Thank you a lot, man, for that.