Friday, October 7, 2011

Dunning Statistics on authors

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses--two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well curated, TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote--interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).

As an example, let's compare all the books in my library by Charles Dickens and William Dean Howells, respectively. (I have a peculiar fascination with WDH, regular readers may notice: it's born out of a month-long fascination with Silas Lapham several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I choose it), which means lots of duplicate copies published in different years, perhaps some miscategorizations, certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?

I'll present the results in faux-wordle form as discussed last time. That means I use graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. What does that look like?

Thursday, October 6, 2011

Comparing Corpuses by Word Use

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.

What are some interesting, large corpuses to compare? A lot of what we'll be interested in historically are subtle differences between closely related sets, so a good start might be the two Library of Congress subject classifications called "History of the Americas," letters E and F. The Bookworm database has over 20,000 books from each group. What's the difference between the two? The full descriptions could tell us: but as a test case, it should be informative to use only the texts themselves to see the difference.

That leads a tricky question. Just what does it mean to compare usage frequencies across two corpuses? This is important, so let me take this quite slowly. (Feel free to skip down to Dunning if you just want the best answer I've got.) I'm comparing E and F: suppose I say my goal to answer this question:

What words appear the most times more in E than in F, and vice versa?

There's already an ambiguity here: what does "times more" mean? In plain English, this can mean two completely different things. Say E and F are exactly the same overall length (eg, each have 10,000 books of 100,000 words). Suppose further "presbygational" (to take a nice, rare, American history word) appears 6 times in E and 12 times in F. Do we want to say that it appears two times more (ie, use multiplication), or six more times (use addition)?