Comments on Sapping Attention: How Bad is Internet Archive OCR?

There are some men who, years after being in a rel...

2014-09-24T14:47:16.252-04:00

There are some men who, years after being in a relationship, continue to treat their women as if it were just yesterday that they got together.

I like this post very much!

2014-05-02T13:29:52.441-04:00

I like this post very much!

I now remember how/why I got concerned about this....

2010-12-16T08:53:29.041-05:00

I now remember how/why I got concerned about this. I was working with a smaller dataset (100 19th-century novels) and I decided to use a *very* rough metric of OCR quality: the percentage of words flagged as misspelled by a modern spell checker (aspell).

I found that the difference was considerable.

Here's a google-IA-rescan with 7% misspelled words:

http://www.archive.org/details/contariniflemin04disrgoog

The IA scan of the same novel has 3% misspelled words. ID: contarinifleming04disr

And a google epub version, once I stripped the raw text, has 2% misspelled words. Google ID: mhK7xT4Bc3wC

It's important to note that Google epubs are end-of-line dehyphenated, so I'm sure that helps the metric.
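The rough metric described in this comment can be sketched in a few lines of Python. This is only an illustration, not the commenter's actual procedure: the small `dictionary` set stands in for aspell's word list, and `dehyphenate` and `misspelled_rate` are hypothetical helper names. It also shows why end-of-line dehyphenation flatters the score.

```python
import re

def dehyphenate(text):
    # Join words split across a line break: "frequen-\ncy" -> "frequency".
    # Caveat: this also fuses genuine hyphenated compounds that happen
    # to break at the end of a line.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

def misspelled_rate(text, dictionary):
    # Fraction of alphabetic tokens not found in the dictionary.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    misses = sum(1 for w in words if w.lower() not in dictionary)
    return misses / len(words)

# Toy stand-in for a real spell checker's word list (an assumption).
dictionary = {"the", "universal", "frequency", "of", "words"}
raw = "the univcrsal frequen-\ncy of words"

print(misspelled_rate(raw, dictionary))               # broken word counts as two misses
print(misspelled_rate(dehyphenate(raw), dictionary))  # only the true OCR error remains
```

In the toy text, the line-broken word is counted as two misspellings until it is rejoined, so comparing a dehyphenated source against a raw one overstates the gap in OCR quality.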

Great work Ben. To be clear, I think there's a...

2010-12-16T08:05:20.748-05:00

Great work Ben. To be clear, I think there's an issue with some of the IA's OCRing of Google books. It's not a general comment about IA's OCR.

The issue I'm really thinking about is what might happen if you were working with a small(er) sample (say, 100-1000 books) and you mixed texts with significantly different OCR quality. I can imagine some sort of difference in vocabulary being an artifact of OCR. And the effect would be magnified if one were doing principal component analysis.

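In other words, let's say you were interested in the frequency of the word "universal" and a certain group of texts, having poor OCR, had one in every 20 words misspelled. The count of "universal" in that group would then run low (by roughly 5%, if errors fall evenly across words) relative to cleaner texts, for reasons that have nothing to do with the texts themselves.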

Another thought: what if spelling errors are more likely to crop up in longer words? What if the error rate were 20% for words longer than 8 characters? I feel like these issues are important to address if work in text analysis in the "digital humanities" wants to be taken seriously by folks in other disciplines. Is this being overly cautious?
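One way to check the word-length question is to split the misspelling rate into short and long words. The sketch below is a hedged illustration: `misspelled_rate_by_length` is a hypothetical helper, the `dictionary` set is a toy stand-in for a real spell checker, and the 8-character cutoff is simply the figure floated in the comment.

```python
import re
from collections import defaultdict

def misspelled_rate_by_length(text, dictionary, cutoff=8):
    # Misspelling rate for words longer than `cutoff` letters ("long")
    # versus all other words ("short").
    totals = defaultdict(int)
    misses = defaultdict(int)
    for w in re.findall(r"[A-Za-z]+", text):
        bucket = "long" if len(w) > cutoff else "short"
        totals[bucket] += 1
        if w.lower() not in dictionary:
            misses[bucket] += 1
    return {b: misses[b] / totals[b] for b in totals}

# Toy word list and text with two simulated OCR errors (assumptions).
dictionary = {"the", "universal", "principle", "holds", "for", "all", "men"}
text = "the univcrsal principle holds for all rnen"

rates = misspelled_rate_by_length(text, dictionary)
print(rates)
```

If the "long" rate comes out consistently higher across a real corpus, frequencies of long words like "universal" would be deflated more than the overall misspelling rate suggests.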

agreed with Anthony (above), it's good to see ...

2010-12-15T12:23:19.900-05:00

agreed with Anthony (above), it's good to see you working out these problems. again, though, i think you're too centered on *your* database and monkeying around with it - this is good information to have for those of us also interested in building a database, but these methodological points seem to be the focus, whereas one could imagine some of your blog content moving over into actual sustained attention to a single historical problem/issue. you can do it! i think now's the time, and these methodological posts can run right alongside that sort of work. what do you have in mind?

Instead of perl hacking on your own, why don't...

2010-12-15T11:23:10.693-05:00

Instead of perl hacking on your own, why don't you try to apply an already existing algorithm for OCR cleanup and see what happens? Something like this: http://www.springerlink.com/content/l2724747mt78039l/