I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.
Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.
1. Find texts that use words, phrases, or names we're interested in.
2. Compare individual texts or groups of texts against each other.
3. Classify and cluster texts or words. (Where 'classifying' is assigning texts to predefined groups like 'US History', and 'clustering' is letting the affinities be only between the works themselves).
These aren't, to be sure, completely different. I've argued before that in some cases, full-text search is best thought of as a way to create a new classification scheme and populating it with books. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them--the ranking inside them isn't very important). Clustering algorithms are built around models of cross group comparisons; full text searches often have faceted group comparisons. And so on.
But as ideal types, these are different, and in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all). More recently, there's a lot of attention to 3. Scott Weingart has a good summary/literature review on topic modeling and network analysis this week--I think his synopsis that "they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.
In any case: I'm worried that in the excitement about clustering, we're not sufficiently understanding the element in between: comparisons. It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there's not the same density of research in computer science that humanists can piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production quality applications that include some forms of comparisons (WordHoard uses Dunning Log-Likelihood; I can only find relative ratios on the Tapor page). But there isn't widespread adoption, generally used methodologies for search, or anything else like that.
This is a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we do talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, 1960s and 1980s defenses of American capitalism. These are just what we teach in history---I in fact felt like I was coming up with exam or essay questions writing that last sentence.
So why isn't this a more vibrant area? (Admitting one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing).
I think the biggest reason for this is probably legal-technical, and getting solved. A site like J-Stor (or Bookworm, for that matter) can set up full-text search much easier than it can cross-corpus comparisons; one takes tenths of a second, and the other can take minutes. Minutes isn't very much, of course, and if it worked plenty of humanists would be happy to let their laptops plug away at the problem; but restrictions on downloading texts makes that impossible. Add into the mix all the completely un-digitized texts we'd want to include in many comparisons, and there are only a few cases where it's possible. Topic modelling and search both work much better with a research model where one centralized research server provides on-demand service to lots of people who don't necessarily understand what's going on behind the scenes.
Another reason is algorithmic. To put it bluntly, Dunning Log-Likelihood doesn't work very well; not only does it over-represent common words, it also finds spurious differences based on one or two texts. Ted Underwood has been exploring some aspects of Mann-Whitney; but it too has its share of flaws, and in some cases, it can be much more difficult or inappropriate to implement. TF-IDF suffers some difficult translation problems when comparing two parts to comparing a part to a whole. I started a few posts on these things, and hopefully, they'll see the light of day. But in general, I get the impression that there isn't a very good all-around corpus comparison tool any scholar could apply to their questions.
I also suspect that there are some cultural-psychological reasons. One of the things that's so appealing about the topic models and the networks is that they alleviate the feeling of being overwhelmed by unstructured information. (Or the print age, or whatever.) Topic models and network graphs put the world in order, which is very reassuring; they also create things that are very cool looking, which is very (too?) important in the web ecosystem where DH lives. (This site tends to a lot of Google image search results to some cluster charts I made in the past, which are certainly among the least clear charts I've ever posted; there's something about the untangling that sort of puzzle that people find very rewarding.)
Comparisons just create word lists, and they aren't as rewarding as topic models--you just get a list of differences to sift through. And they don't deal in the same way with the whole corpus--you are much more restricted in what you work with. I think there's a bit of a tendency to think that as long as we're using computers to read texts, we might as well do all of the ones that are in good enough shape to work with--moving down for comparisons doesn't tell us much.
I don't see any of these reasons as basically good ones. The inability to get digital texts to work with or text curators who allow multifaceted access is the biggest problem facing textual digital analysis. The lack of good algorithms is just more evidence that this is our problem; that humanists need to be developing expertise and a feel for the data here themselves. And there's certainly no one who defends eye-candy for itself; they (and I) would only point to its usefulness for asking more interesting questions. But comparisons should let us do that too.