I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.
Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.
1. Find texts that use words, phrases, or names we're interested in.
2. Compare individual texts or groups of texts against each other.
3. Classify and cluster texts or words. (Where 'classifying' is assigning texts to predefined groups like 'US History', and 'clustering' is letting the affinities be only between the works themselves).
These aren't, to be sure, completely different. I've argued before that in some cases,
full-text search is best thought of as a way to create a new classification scheme and populating it with books. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them--the ranking inside them isn't very important). Clustering algorithms are built around models of cross group comparisons; full text searches often have faceted group comparisons. And so on.
But as ideal types, these are different, and in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all). More recently, there's a lot of attention to 3.
Scott Weingart has a good summary/literature review on topic modeling and network analysis this week--I think his synopsis that "they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.
In any case: I'm worried that in the excitement about clustering, we're not sufficiently understanding the element in between: comparisons. It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there's not the same density of research in computer science that humanists can piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production quality applications that include some forms of comparisons (WordHoard uses Dunning Log-Likelihood; I can only find relative ratios on the
Tapor page). But there isn't widespread adoption, generally used methodologies for search, or anything else like that.
This
is a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we
do talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, 1960s and 1980s defenses of American capitalism. These are just what we teach in history---I in fact felt like I was coming up with exam or essay questions writing that last sentence.
So why isn't this a more vibrant area? (Admitting one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing).