Showing posts with label Building a Corpus. Show all posts

Sunday, April 3, 2011

Stopwords to the wise

Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.

A particularly clear connection is from database structures to "categories of analysis" in our methodology. Since humanists share methods in a lot of ways, digital resources designed for one humanities discipline will carry well for others. But it's quite possible to design a resource that makes extensive use of certain categories of analysis nearly impossible.

One clear-cut example: ancestry.com. The bulk of interest in digitized census records lies in two groups: historians and genealogists. That web site is clearly built for the latter: it has lots of genealogy-specific features built into the database for matching sound-alike names and misspellings, for example, but almost nothing for social history. (I'm pretty sure you can't use it to find German cabinet-makers in Camden in 1850, for example.) Ancestry.com views names (last names in particular) as the most important field and structures everything else around serving those up. Lots of historians are more interested in the place or the profession or the ancestry fields in the census: what we take as a unit of analysis affects what we want to see database indexes and search terms built around. (And that's not even getting into the question of aggregating the records into statistics.)
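To make the point about indexes concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and rows are all invented for illustration; this is not how ancestry.com or any real census database is structured. It only shows that the same records can be indexed around names (the genealogist's unit of analysis) or around place, trade, and birthplace (closer to what a social historian wants).

```python
import sqlite3

# Hypothetical census table; the schema and the rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE census_1850 (
        last_name TEXT, first_name TEXT,
        birthplace TEXT, occupation TEXT, town TEXT
    )
""")
conn.executemany(
    "INSERT INTO census_1850 VALUES (?, ?, ?, ?, ?)",
    [
        ("Schmidt", "Karl", "Germany", "cabinet-maker", "Camden"),
        ("Weber", "Anna", "Germany", "seamstress", "Camden"),
        ("Miller", "John", "New Jersey", "cabinet-maker", "Camden"),
        ("Braun", "Otto", "Germany", "cabinet-maker", "Newark"),
    ],
)

# A genealogy-oriented index: quick lookups by name.
conn.execute("CREATE INDEX by_name ON census_1850 (last_name, first_name)")
# A social-history-oriented index: quick lookups by place, trade, and origin.
conn.execute(
    "CREATE INDEX by_place_trade ON census_1850 (town, occupation, birthplace)"
)

def german_cabinet_makers(conn, town):
    """Return (last_name, first_name) for German-born cabinet-makers in a town."""
    rows = conn.execute(
        "SELECT last_name, first_name FROM census_1850 "
        "WHERE town = ? AND occupation = ? AND birthplace = ? "
        "ORDER BY last_name",
        (town, "cabinet-maker", "Germany"),
    )
    return rows.fetchall()

print(german_cabinet_makers(conn, "Camden"))
```

With only the name index, a query like this would have to scan every record; the second index is what makes place and profession a practical unit of analysis.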

Wednesday, March 2, 2011

What historians don't know about database design…

I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

One of the most illuminating things I've learned in trying to build up a fairly large corpus of texts is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using jstor or google books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down.

Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply them to the data, and return results. We ask for all the books about the Soviet Union written before 1917, and Google spits them back. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.

Tuesday, February 1, 2011

Technical notes

I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That's mostly because the number and size of files I'm dealing with are so large that I don't trust an existing program to handle them, and because the existing packages don't necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn't necessarily think to implement on my own. Maybe I'll find some way to get on board with all that later. First, a quick note on the programs:

Monday, January 31, 2011

Where were 19C US books published?

Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow better analysis than I can do now, but even the raw data is interesting. (Although, since the best way to interact with it is a single 25 GB text file, it's not always convenient.) While I'm waiting for some indexes to build, this is a good chance to figure out just what's in these digital sources.

Most interestingly, it has state-level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which were published in the US. How much geographical diversity is there within that? Not much. About 70% of the books were published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks.) Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.






I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," not the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.

Friday, January 28, 2011

Picking texts, again

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

Thursday, December 9, 2010

Metadata for OCR books

A commenter asked why I don't improve the metadata instead of doing this clustering stuff, which seems just to reproduce, poorly, the work of generations of librarians in classifying books. I'd like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I'm going to think through what I know, but I'd love any advice on this because it's really outside my expertise.

Saturday, December 4, 2010

Now with actual text!

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher's control. I've noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.


This is all by way of showing off the latest thing it lets me do--get examples of actual usage so we can do semantic processing ourselves, rather than trying to have a computer do it poorly. It might be good to put some tests like this into the code by default, as a check on interpretive hubris. I need to put the years and titles in here too, but if we just take a random set of samples of the language of natural selection, I think it's already clear that we get an interesting new form of text to interpret; it's sort of like reading the usage examples in the OED, except that we can create much more interesting search constraints on where our passages come from.


> get.usage.example("natural selection",sample(books,1))
[1] "we might extend the parallel and get some good illustrations of natural selection from the history of architecture and the origin of the different styles under different climates and conditions"
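The R function above isn't shown in this post, so the following is only a rough Python sketch of the same idea, under my own assumptions: each book is available as a plain string, sentences can be split on end punctuation, and a match is a case-insensitive substring hit. The function name, the `books` dictionary, and the sample texts are all hypothetical.

```python
import random
import re

def get_usage_example(phrase, text, rng=random):
    """Return one randomly chosen sentence of `text` that contains `phrase`.

    Returns None when the phrase never appears. Sentence splitting here is a
    crude regex on end punctuation, which OCR'd text will often defeat.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = [s.strip() for s in sentences if phrase.lower() in s.lower()]
    return rng.choice(hits) if hits else None

# Invented stand-ins for book texts; a real corpus would be read from disk.
books = {
    "Book A": "Styles differ by climate. Natural selection offers a parallel. The end.",
    "Book B": "Nothing relevant appears in this volume. It is about sheep.",
}

title = random.choice(list(books))  # analogous to sample(books, 1)
print(title, "->", get_usage_example("natural selection", books[title]))
```

Returning the whole sentence, not just a count, is the point: it keeps the passage in front of the researcher so the reading stays ours.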

Tuesday, November 30, 2010

Catalog data and genre

Mostly a note to myself:

I think genre data would be helpful in all sorts of ways--tracking evolutionary language through different sciences, say, or finding what discourses are the earliest to use certain constructions like "focus attention." The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers--that divides up books in all sorts of ways at various levels that I could parse. It's tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author and year from the metadata I do have, but that would miss a lot and maybe have false positives, plus the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don't trust that that would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best--just a quick sketch below.
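As one illustration of the lookup half of such a hybrid, here is a small Python sketch that matches a book's metadata against catalog entries by year and fuzzy title similarity, using the standard library's difflib. Everything here is an assumption for the sake of the example: the record layout, the 0.85 threshold, and the catalog entries (including the call number) are invented, and nothing in it queries the real LOC catalog. Books that come back unmatched would be the ones left over for statistical clustering.

```python
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(s.lower().split())

def best_call_number(record, catalog, threshold=0.85):
    """Return the call number of the best same-year title match, or None.

    A high threshold trades missed matches for fewer false positives.
    """
    best_score, best_entry = 0.0, None
    for entry in catalog:
        if entry["year"] != record["year"]:
            continue  # require the publication year to agree exactly
        score = SequenceMatcher(
            None, normalize(record["title"]), normalize(entry["title"])
        ).ratio()
        if score > best_score:
            best_score, best_entry = score, entry
    if best_entry is not None and best_score >= threshold:
        return best_entry["call_number"]
    return None

# Invented catalog entries, for illustration only.
catalog = [
    {"title": "On the Origin of Species", "year": 1859, "call_number": "QH365"},
    {"title": "A History of Architecture", "year": 1874, "call_number": "NA200"},
]

record = {"title": "On the origin of  species.", "year": 1859}
print(best_call_number(record, catalog))
```

The first letters of a call number could then serve as a coarse genre label, with finer divisions parsed from the rest.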

Sunday, November 28, 2010

Top ten authors

Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously--I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.

Using worse sources is something of a necessity for digital history. The text recognition and the metadata for a lot of the sources we use often (google books, jstor, proquest) are full of errors under the surface, and it's OK for us to work with such data in the open. The historical profession doesn't have any small-ish corpuses we would be interested in analyzing again and again. This isn't true of English departments, who seem to be well ahead of historians in computer-assisted text analysis, and have the luxury of emerging curated text sources like the one Martin Mueller describes here.

But the side effect of that is that we need to be careful about understanding what we're working with. So I'm running periodic checks on the data in my corpus of books by major American publishers (described more earlier) to see what's in there. I thought I'd post the list of the top twenty authors, because I found it surprising, though not in a bad way. We'll do from no. 20 ranking on up, because that's how they do it on sports blogs. (What I really should do is a slideshow to increase pageviews). I'll identify the less famous names.

Saturday, November 13, 2010

Infrastructure

It's time for another bookkeeping post. Read below if you want to know about changes I'm making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton's supercomputer time, and why I didn't just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Wednesday, November 10, 2010

digitizecr by ljooq ic

Obviously, I like charts. But I've periodically been presenting data as a number of random samples, as well.  It's a technique that can be important for digital humanities analysis. And it's one that can draw more on the skills in humanistic training, so might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own--it's just a set of coordinates. Even in the big education datasets I used to work with, the core facts that I was aggregating up from were generally very dull--one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there's real meaning embodied in every point, that we're far better equipped to understand than the computer. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can't read everything ourselves, but it's good to check up periodically--that's why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.

So any good text processing application will let us delve into the individual data as well as giving the overall picture. I'm circling around something commenter "Jamie" said, though not addressing it directly: (quote after break)


Monday, November 8, 2010

Back to Basics

I've rushed straight into applications here without taking much time to look at the data I'm working with. So let me take a minute to describe the set and how I'm trimming it.