As an example, let's compare all the books in my library by Charles Dickens and William Dean Howells, respectively. (I have a peculiar fascination with WDH, as regular readers may notice: it's born out of a month-long infatuation with Silas Lapham several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I chose them), which means lots of duplicate copies published in different years, perhaps some miscategorizations, and certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?
I'll present the results in faux-wordle form as discussed last time. That means I use wordle.com graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. What does that look like?
Words overrepresented in Dickens vs. Howells:
We get a bit of the British orthography, but less than I'd feared; and we do get a number of insights into Dickens's style ('appearance', 'merry', 'little', 'bright', 'eyes') as well as some interesting social distinctions probably more reflective of the US vs. Britain and the mid vs. the late 19th century ('gentlemen', 'gentleman', 'wot', 'coach').
Words overrepresented in Howells vs. Dickens:
How is reading these word clouds the same as or different than comparing Dickens and Howells themselves? The Howells side is not quite what I'd have expected: he comes off with an Austenian directness (Jane, not J.L.). In some ways, I'd say that the comparison tells us far more about Dickens than about Howells.
Just how much becomes clear when we compare Howells to a more comparable figure, Henry James. Howells still overuses common words, but now appears overly masculine in his character choices, and fond of the conjunction 'and' (while Dickens used 'and' significantly more than Howells...)
Words overrepresented in Howells vs. James:
Words overrepresented in James vs. Howells:
Again, I feel like this more closely captures the distinctive qualities of James (moment-companion-charming-extraordinary-view) than of Howells. Even Howells's distinctive points (lots of boys?) can be seen as reflections more of James's attributes (endless portraits of ladies). I'd use it as a sort of evidence for the phenomenal blankness of Howells; it's one of the reasons he's an interesting source. (Dan Rodgers once told me he read a lot of Howells one summer to get a better sense of the late 19th century, since James was just too good to portray it blankly.)
But: we're firmly in the fun-with-wordle camp right here. This is not even senior thesis material; I wouldn't count myself qualified to make good English dept. pronouncements about different authors. But I would say that the algorithm seems to be doing a reasonable job using hundreds of minimally processed books to make meaningful distinctions here, and that's for the good. Perfect OCR isn't necessary to get started on this if we know what we're looking for.
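For anyone who wants to try this at home, here is a rough sketch of the kind of score that sizes the words in the clouds above. It assumes the common two-term form of Dunning's log-likelihood (observed vs. expected counts in each corpus), which may not match my own scripts exactly; the function name `dunning_scores` and the toy counts are made up for illustration. I also add a sign so that positive means "overrepresented in the first corpus."

```python
import math
from collections import Counter


def dunning_scores(counts_a, counts_b):
    """Signed Dunning log-likelihood (G2) for every word in either corpus.

    Assumes both corpora are non-empty. Positive scores mean the word is
    relatively more frequent in corpus A; negative scores mean corpus B.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        # Expected counts if the word were equally common in both corpora.
        expected_a = total_a * (a + b) / (total_a + total_b)
        expected_b = total_b * (a + b) / (total_a + total_b)
        g2 = 0.0
        if a > 0:  # a zero count contributes nothing (0 * log 0 is taken as 0)
            g2 += a * math.log(a / expected_a)
        if b > 0:
            g2 += b * math.log(b / expected_b)
        g2 *= 2
        # Attach a sign based on which corpus uses the word at a higher rate.
        if a / total_a < b / total_b:
            g2 = -g2
        scores[word] = g2
    return scores


# Toy word counts, invented purely for illustration.
dickens = Counter({"coach": 30, "the": 970})
howells = Counter({"coach": 5, "the": 995})
scores = dunning_scores(dickens, howells)

# Words sorted from most Dickens-like to most Howells-like.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In a faux-wordle, the font size for each word would then come from the magnitude of its score rather than from its raw frequency, with one cloud for the positive scores and one for the negative.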
Of course: what are we looking for? Some more database-building for me now, and we'll try to get there soon. If anybody can think of some great corpus-comparisons they'd like to see, let me know.