Showing posts with label Howells. Show all posts
Showing posts with label Howells. Show all posts

Friday, October 7, 2011

Dunning Statistics on authors

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses--two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well curated, TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote--interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).

As an example, let's compare all the books in my library by Charles Dickens and William Dean Howells, respectively. (I have a peculiar fascination with WDH, regular readers may notice: it's born out of a month-long fascination with Silas Lapham several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I choose it), which means lots of duplicate copies published in different years, perhaps some miscategorizations, certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?

I'll present the results in faux-wordle form as discussed last time. That means I use wordle.com graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. What does that look like?