Monday, January 31, 2011

Where were 19C US books published?

Open Library has pretty good metadata. I'm using it to assemble a couple of new corpora that I hope will allow better analysis than I can do now, but even the raw data is interesting (although, since the best way to interact with it is a single 25 GB text file, it's not always convenient). While I wait for some indexes to build, this is a good chance to figure out just what's in these digital sources.

Most interestingly, it has state-level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which were published in the US. How much geographical diversity is there within that? Not much. About 70% of the books were published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks.) Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.
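For anyone who wants to run this kind of tally themselves, here is a rough Python sketch. It is not the code behind the map, and it assumes you have already pulled one state code per book out of the dump; how to do that depends on the dump format, which isn't covered here. The names `tally_states`, `top_share`, and `aliases` are made up for illustration, and the sample codes are toy data, not real counts. It streams through the codes rather than loading everything at once, which matters with a 25 GB file.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def tally_states(codes: Iterable[str],
                 aliases: Optional[Mapping[str, str]] = None) -> Counter:
    """Count books per state code, one code per book.

    `aliases` maps nonstandard codes to the ones you want (hypothetical
    helper for cleaning up inconsistent state codes in the metadata).
    """
    aliases = aliases or {}
    counts: Counter = Counter()
    for raw in codes:
        code = raw.strip().upper()
        if not code:
            continue  # skip records with no state information
        counts[aliases.get(code, code)] += 1
    return counts


def top_share(counts: Counter, k: int = 3) -> float:
    """Fraction of all counted books that come from the k largest states."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


if __name__ == "__main__":
    # Toy data only; real input would be an iterator over the dump.
    sample = ["NY", "NY", "MA", "PA", "IL", "NB", "ne", "NY", "MA", ""]
    counts = tally_states(sample, aliases={"NB": "NE"})
    print(counts.most_common(3))
    print(round(top_share(counts, k=3), 2))
```

Passing a generator that reads the file line by line as `codes` keeps memory use flat, since only the per-state counts are held in memory.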

I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," as opposed to the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.

This raises some interesting questions about how well book data works for generalizations about American culture as a whole. For a lot of purposes, including the one that Culturomics says it's interested in, a well-curated collection of scanned newspapers, with text files and metadata released into the public domain, would be much better. Newspapers are generally published in the places some of their editorial content comes from (although a lot of it was republished/stolen from other newspapers in the 19C, right?), so with enough data they would let you see some really interesting things. You could see, over several days, the spread of news about Mexican War battles. Or you could trace newspaper coverage of campaigns to whistlestop tours, or compare the relative newspaper coverage of a dull campaign like 1888 to an exciting one like 1896. I'd want to see, maybe, how discussion of the League of Nations tracked Wilson's tour. It's easy to think of a lot of things like this. We currently have scanned newspaper databases, but as far as I know, they don't release their text and metadata, but rather shoehorn you into a web interface. Maybe we just need individual dissertators to strike content agreements to do research, but of course I'd rather see everyone have free access like IA gives to books. Is anything like that out there or coming down the pike?

Since I'm doing more traditional intellectual history, I'm not worried about using books instead. I'm mostly interested not in regional variation but in genre variation; I want to know what psychology books say, and I don't really care if few were published in the South. But it does affect other types of questions we might ask. Just thought I'd throw that out there.

1 comment:

  1. Note: one mistake with this chart is the Nebraska data, which I thought I had fixed but realize I didn't. As far as I can tell, whoever set up the codes used "NB" for Nebraska instead of "NE," so it shows up as two here instead of the few dozen or few hundred it should.