Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously--I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
Using worse sources is something of a necessity for digital history. The text recognition and the metadata for many of the sources we use most often (Google Books, JSTOR, ProQuest) are full of errors under the surface, and it's OK for us to work with such data in the open. The historical profession doesn't have any small-ish corpuses we would be interested in analyzing again and again. This isn't true of English departments, which seem to be well ahead of historians in computer-assisted text analysis, and have the luxury of emerging curated text sources like the one Martin Mueller describes here.
But the side effect of that is that we need to be careful about understanding what we're working with. So I'm running periodic checks on the data in my corpus of books by major American publishers (described in more detail earlier) to see what's in there. I thought I'd post the list of the top twenty authors, because I found it surprising, though not in a bad way. We'll go from no. 20 on up, because that's how they do it on sports blogs. (What I really should do is a slideshow to increase pageviews.) I'll identify the less famous names.
William Dean Howells (54)
Albert Harkness (54) -- classicist, wrote textbooks
Walter Scott (55)
John Burroughs (56) -- naturalist, essayist
James Russell Lowell (57) -- poet, critic, Lowell.
Francis Parkman (57) -- historian, though I guess everyone knows that
Bayard Taylor (64) -- poet, critic
Mark Twain (65)
Bret Harte (66) -- Western novelist
Nathaniel Hawthorne (68)
Charles Dickens (69)
Oliver Wendell Holmes (73)
John Fiske (73) -- philosopher/polymath
James Fenimore Cooper (74)
Robert Louis Stevenson (74)
William Makepeace Thackeray (86)
Henry Wadsworth Longfellow (91)
William Shakespeare (92)
Herbert Spencer (116) -- British philosopher/polymath
Washington Irving (170)
What I like is seeing Spencer and Fiske in the top 10. There are quite a few Brits being published in the US, but that's fine. And for the most part, the rest of these are old American studies chestnuts that no one reads much anymore. Instead of Melville, Thoreau, Emerson, and the James brothers, we have Fenimore Cooper, Burroughs, Fiske, Spencer, and Howells. Twain is really the only American on the list to have prospered in the last hundred years. That sounds right if we want to look at what people were reading, although it's a little far from what we've decided was important.
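For anyone who wants to run the same kind of check on their own messy metadata, here is a rough sketch in Python. It is not necessarily how the counts in this post were produced. The record layout (a list of dicts with an "author" field), the function names, and the sample records are all assumptions made up for illustration. The one thing it tries to handle is library-catalog headings such as "Thackeray, William Makepeace, 1811-1863", which would otherwise be counted separately from the plain form of the same name.

```python
import re
from collections import Counter


def normalize_author(raw):
    """Turn a catalog-style heading ('Last, First, 1811-1863') into 'First Last'.

    Names that are already in 'First Last' order pass through unchanged.
    This is a hypothetical helper, not part of any library.
    """
    name = raw.strip()
    # Drop a trailing life-dates field such as ', 1811-1863' or ', 1835-'.
    name = re.sub(r",\s*\d{4}\s*-\s*(\d{4})?\s*$", "", name)
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # Collapse any doubled whitespace left over from OCR or cataloging.
    return " ".join(name.split())


def top_authors(records, n=20):
    """Count records per normalized author and return the n most common."""
    counts = Counter(
        normalize_author(r["author"]) for r in records if r.get("author")
    )
    return counts.most_common(n)


# Made-up sample records; a real run would load the corpus metadata instead.
sample = [
    {"author": "Thackeray, William Makepeace, 1811-1863", "title": "Book A"},
    {"author": "William Makepeace Thackeray", "title": "Book B"},
    {"author": "Washington Irving", "title": "Book C"},
]

# Print from the lowest rank on up, sports-blog style.
for author, count in reversed(top_authors(sample)):
    print(f"{author} ({count})")
```

This only catches one kind of mess. Variant spellings, pseudonyms (Twain versus Clemens), and duplicate copies of the same work would still need separate handling, and duplicates in particular will inflate the counts.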