Monday, March 28, 2011

Cronon's politics

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and the likelihood that Cronon's upcoming AHA presidency will have the same pro-digital-history agenda as Grafton's, I don't think this has much to do with DH.) The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle, one that gets missed if we think of him either as a naïve professor stumbling into the public sphere or as a liberal ideologue trying to score some points.

Thursday, March 24, 2011

Author Ages

Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me spend a little while looking at a new pool of book data that I think is really interesting.

Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, though, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going on before that's possible.

Open Library has pretty comprehensive metadata on authors. In the bigpubs database I made, about 40,000 books have author birth years and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1,500 books have no author listed whatsoever.)

First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. Here's a chart of author age for all the books in my bigpubs set:
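For readers who want to see the arithmetic behind a chart like this, here is a minimal sketch in plain Python rather than the ggplot/R setup mentioned above. Everything in it is an assumption for illustration: the sample records, the field order, and the helper names `author_age` and `age_histogram` are hypothetical and are not drawn from the bigpubs database or from Open Library's actual schema. It subtracts birth year from publication year, skips books with no birth year, and counts books per ten-year age bin.

```python
from collections import Counter

BIN_WIDTH = 10  # years per histogram bin (an arbitrary choice for this sketch)

# Hypothetical records: (title, author_birth_year, publication_year).
# None marks a missing birth year, as with corporate or anonymous authors.
books = [
    ("Book A", 1820, 1855),
    ("Book B", 1835, 1859),
    ("Book C", None, 1870),
    ("Book D", 1801, 1872),
    ("Book E", 1850, 1849),  # implausible on purpose: published before birth
]


def author_age(birth_year, pub_year):
    """Return the author's age at publication, or None if birth year is missing."""
    if birth_year is None:
        return None
    return pub_year - birth_year


def age_histogram(records, bin_width=BIN_WIDTH):
    """Count books per age bin, keyed by the lower edge of each bin."""
    counts = Counter()
    for _title, birth, pub in records:
        age = author_age(birth, pub)
        if age is None:
            continue  # no birth year, so no age can be computed
        counts[(age // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


histogram = age_histogram(books)
for lower, n in histogram.items():
    # One '#' per book in the bin; a stand-in for a real plot.
    print(f"{lower:>4} to {lower + BIN_WIDTH - 1:>3}: {'#' * n}")
```

The deliberately bad record lands in a negative bin. Keeping such rows visible instead of silently filtering them is one cheap way to spot metadata problems such as misdated editions or wrong birth years, and the same applies to ages well over 100, which usually mean reprints or posthumous publication.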

Wednesday, March 2, 2011

What historians don't know about database design…

I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

One of the most illuminating things I've learned in trying to build up a fairly large corpus of texts is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using JSTOR or Google Books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down.

Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply them to data, and return results. We ask for all the books about the Soviet Union written before 1917, and Google spits them back. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.