Monday, March 28, 2011

Cronon's politics

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and that Cronon's upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton's, I don't think this has much to do with DH). The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.

Thursday, March 24, 2011

Author Ages

Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off looking at a new pool for book data for a little while that I think is really interesting.

Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, though, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going on before that's possible.

Open Library has pretty comprehensive metadata on authors. In the bigpubs database I made, about 40,000 books have author birth years, and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1500 books have no author listed whatsoever).

First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. Here's a chart of author age for all the books in my bigpubs set:
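The chart is drawn in R, but the computation underneath it is small enough to sketch in a few lines. What follows is a minimal Python sketch, not the code behind the chart: the `books` records and the ten-year bin width are assumptions made up for illustration, and `author_ages` and `histogram` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical records: (publication_year, author_birth_year or None).
# These rows are invented for illustration, not taken from Open Library.
books = [
    (1850, 1812),
    (1851, 1812),
    (1872, 1809),
    (1885, None),   # no birth year in the metadata
    (1890, 1950),   # born after publication: a metadata error
    (1901, 1835),
]

def author_ages(records):
    """Age at publication for every record that has a birth year."""
    return [published - born for published, born in records if born is not None]

def histogram(ages, bin_width=10):
    """Count ages per bin, keyed by the lower edge of the bin."""
    return Counter((age // bin_width) * bin_width for age in ages)

ages = author_ages(books)
bins = histogram(ages)
for lower in sorted(bins):
    print(f"{lower:>4} to {lower + 9:>4}: {'#' * bins[lower]}")
```

Leaving impossible values in the histogram (a negative age lands in its own bin) rather than filtering them out doubles as a quick reliability check on the metadata.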


Wednesday, March 2, 2011

What historians don't know about database design…

I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

One of the most illuminating things I've learned in trying to build up a fairly large corpus of texts is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using jstor or google books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down.

Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply them to data, and return results. We ask for all the books about the Soviet Union written before 1917, and Google spits them back. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.

Tuesday, February 22, 2011

Genres in Motion

Here's an animation of the PCA numbers I've been exploring this last week.

There's quite a bit of data built in here, and just what it means is up for grabs. But it shows some interesting possibilities. As a reminder: at the end of my first post on categorizing genres, I arranged all the genres in the Library of Congress Classification in two-dimensional space using the first two principal components. PCA basically finds the combinations of variables that most define the differences within a group. (Read more by me here or generally here.) The first dimension roughly corresponded to science vs. non-science; the second separated social science from the humanities. It did, I think, a pretty good job at showing which fields were close to each other. But since I do history, I wanted to know: do those relations change? Here's that same data, but arranged to show how those positions shift over time. I made this along the same lines as the great Rosling/Gapminder bubble charts, created with this via this. To get it started, I'm highlighting psychology.


[If this doesn't load, you can click through to the file here]. What in the world does this mean?
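For readers who want the PCA step made concrete before interpreting the animation, here is a minimal sketch in plain Python. It is not the code behind these charts. The `genres` table (five genres, four words, counts per 10,000 words) is invented, and the power-iteration approach is just one simple way to get the first two components without a statistics package.

```python
import math

# Hypothetical word rates per 10,000 words; columns are
# "experiment", "energy", "soul", "truth". Invented for illustration.
genres = {
    "physics":    [9, 8, 0, 1],
    "chemistry":  [8, 7, 0, 1],
    "psychology": [5, 2, 4, 3],
    "philosophy": [1, 1, 6, 8],
    "history":    [0, 1, 3, 4],
}

def center_columns(rows):
    """Subtract each column's mean so every word is measured from its average."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Word-by-word covariance matrix of the centered data."""
    cols = list(zip(*centered))
    n = len(centered)
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration: multiply by the matrix and renormalize, repeatedly."""
    vec = normalize([float(i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        vec = normalize([dot(row, vec) for row in matrix])
    eigenvalue = dot(vec, [dot(row, vec) for row in matrix])
    return eigenvalue, vec

def first_two_components(rows):
    centered = center_columns(rows)
    cov = covariance(centered)
    val1, pc1 = leading_eigenvector(cov)
    # Deflation: remove the first component, then find the next strongest one.
    size = len(cov)
    deflated = [[cov[i][j] - val1 * pc1[i] * pc1[j] for j in range(size)]
                for i in range(size)]
    _, pc2 = leading_eigenvector(deflated)
    scores = [(dot(row, pc1), dot(row, pc2)) for row in centered]
    return pc1, pc2, scores

pc1, pc2, scores = first_two_components(list(genres.values()))
by_name = dict(zip(genres, scores))
for name, (x, y) in by_name.items():
    print(f"{name:<12} {x:7.2f} {y:7.2f}")
```

Each genre ends up as a point in two dimensions, which is all the bubble chart needs; rerunning the same projection on each year's data would give the positions that move over time. The sign of each axis is arbitrary, so only relative positions mean anything.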

Sunday, February 20, 2011

Vector Space, overlapping genres, and the world beyond keyword search

I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclass "BF" (psychology) blue, and use red for "QE" (geology), overlaying them on a chart of the first two principal components like I've been using for the last two posts:



That's a little worse than I was hoping. Generally the books stay close to their term, but there is a lot of variation, and even a little bit of overlap. Can we do better? And what would that mean?

Thursday, February 17, 2011

PCA on years

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:

I have a professional interest in shifts in genres. But this isn't temporal--it's just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?

Monday, February 14, 2011

Fresh set of eyes

One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress classifications are probably the best hierarchical classification of books we'll ever get. Certainly they're the best human-done hierarchical classification. It's literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they're also a little foreign, at times, and it's not clear how well they'll correspond to machine-centric ways of categorizing books. I've been playing around with some of the data on LCC classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.

***
Everybody loves dendrograms, even if they don't like statistics. Here's a famous one, from the French Encyclopedia.
That famous tree of knowledge raises two questions for me:

Friday, February 11, 2011

Going it alone

I've spent a lot of the last week trying to convince Princeton undergrads it's OK to occasionally disagree with each other, even if they're not sure they're right. So let me make a note on one of the places where I've felt a little bit of skepticism as I try to figure out what's going on with the digital humanities.

Since I'm late to the party, I've been trying to catch up a bit on where the field is now. One thing that jumped out is how wide-ranging the hopes are for what the digital humanities might do if they take over the existing disciplines or create their own. Being a bit of a job market determinist myself, I wonder if the wreckage many see in the current structure of the humanities doesn't promote a bit of a millenarian strand about how great the reconstruction might be. I feel occasionally that I've stumbled into Moscow 1919 or Paris 1968; there are manifestos, there are spontaneous leaderless youth, and in the wreckage of the old system, anything seems possible for the new technological man. Digital humanities, to exaggerate the claims, will create the mass audience academic historians have lost, will reaffirm the importance of public history in the field, will create new fields with new jobs, will break down the boundaries between disciplines, will allow collaborative history to finally emerge. And it might be in danger if it's co-opted by the powers-that-be, as John Unsworth finds many worrying (pdf).

Paris 1968 is an exciting place to be. I've been watching Al-Jazeera all week. But all these transformations promised by DH won't happen all at once, and some of them won't happen at all. As I try to write some of this up for a Princeton audience (which is why, along with the start of our term last week, I'm not blogging much right now) I'm thinking about what it takes to get skeptical historians on board, and what parts of the promised land might put them off.

The thing I'm mulling over: collaboration. A colleague said to me yesterday he thought the digital humanities would come and go before most historians ever stopped working alone, and I tend to agree. I'm pretty much agnostic on the need for collaborative history, myself. Certainly, digital humanities open up fascinating new prospects for collaborative projects. But so far as we're trying to get anyone established on board, an insistence on collaboration might be as much a liability as a benefit. I'm signing up for a THATcamp, but I have to admit a bit of trepidation about putting volunteer work into anything that isn't mine. Not just out of selfishness, but because we often have funny standards about academic work that are difficult to impose on others. I went to a talk this week where one participant said he refuses to use the words "idea" or "concept." No one can live up to all the constraints we might want to put on work, but it's often fascinating to see what people come up with when we let them do things wholly their own way. Labs aren't always amenable to humanist practices because it's critically important for the health of our disciplines that we don't agree on methodology.

Luckily, then, what I've been most struck by in the last couple months is how far one can go it alone right now--unlike the early years of humanities computing (or so I gather), you don't need teams to get computing time, all the truly technical work of digitization, OCR, and cataloging has been done by groups like the Internet Archive, and free software makes it possible to get started on some forms of analysis quite quickly. It's quite possible for someone at a university without any digital humanities infrastructure to do work in text mining or GIS without having a full lab or collaborative team behind them. Sure, it's harder than firing up an iPad app; but I'm not sure it's that much worse than all the commands plenty of senior academics learned in the dark ages to check their e-mail on pine or elm.

What about all the collaborative labs and programs we already have? Clearly they do more than anything to advance the field, and it's hard to imagine all the great work coming out of GMU or Stanford (say) happening with lone scholars. But it's equally hard for me to imagine that the digital humanities will have actually succeeded until there's a lot of good work coming out that doesn't need the collaborative model, and that answers to some of the expectations of solitary scholars about how humanistic work is produced. At least, that's what I'm thinking for now.

Wednesday, February 2, 2011

Graphing word trends inside genres

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I'm interested in for my dissertation by using the Library of Congress classifications for the books. I'm going to start with the difference between psychology and philosophy. I've already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.

Most people know that psychology emerged out of philosophy, becoming a more scientific or experimental study of the mind sometime in the second half of the 19C. The process of discipline formation is interesting, well studied, and clearly connected to the vocabulary used. Given that, there should be something for lexical statistics in it. Also, there's something neatly meta about using the split of a 'scientific' discipline off of a humanities one, since some rhetoric in or around the digital humanities promises a bit more rigor in our analysis by using numbers. So what are the actual differences we can find?

Let me start by just introducing these charts with a simple one. How much do the two fields talk about "truth?"

Tuesday, February 1, 2011

Technical notes

I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That's mostly because the number and size of files I'm dealing with are so large that I don't trust an existing program to handle them, and because the existing packages don't necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn't necessarily think to implement on my own. Maybe I'll find some way to get on board with all that later. First, a quick note on the programs:

Monday, January 31, 2011

Where were 19C US books published?

Open Library has pretty good metadata. I'm using it to assemble a couple new corpuses that I hope should allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file the best way to interact with it, not always convenient). While I'm waiting for some indexes to build, that will give a good chance to figure out just what's in these digital sources.

Most interestingly, it has state level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which are published in the US. How much geographical diversity is there within that? Not much. About 70% of the books are published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks). Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.
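The concentration figure is just a share-of-total calculation over the state field. Here is a minimal sketch in Python, assuming one state name per book; the `publication_states` list is made up (it is scaled to reproduce a 70% share, not drawn from the real counts) and `top_share` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical state-of-publication values, one per book.
# Invented for illustration; the real set has about 225,000 US books.
publication_states = (
    ["New York"] * 9
    + ["Massachusetts"] * 3
    + ["Pennsylvania"] * 2
    + ["Illinois", "Ohio", "California", "Maryland", "Connecticut", "Virginia"]
)

def top_share(places, k=3):
    """Return the k most common places and their combined share of all books."""
    counts = Counter(places)
    top = counts.most_common(k)
    share = sum(n for _, n in top) / sum(counts.values())
    return top, share

top, share = top_share(publication_states)
for state, n in top:
    print(f"{state}: {n}")
print(f"Share of the top three: {share:.0%}")
```

The same per-state counts are what a map like this one needs as input, so the tally and the share come out of one pass over the metadata.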






I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," not the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.

Friday, January 28, 2011

Picking texts, again

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

Friday, January 21, 2011

Digital history and the copyright black hole

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 the year digital history ends before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is minuscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope they adequately characterize what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the last: let me give a bit of my reasoning why.

My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?



Thursday, January 20, 2011

Openness and Culturomics

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Tuesday, January 18, 2011

Cluster Charts

I'll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.

To quickly recap: I take a word or phrase (evolution, for example) and then find words that appear disproportionately often, according to TF-IDF scores, in the books that use evolution the most. (I just use an arbitrary cap to choose those books--it's 60 books for these charts here. I don't think that's the best possible implementation, but given my processing power it's not terrible). Then I take each of those words, and find words that appear disproportionately in the books that use both evolution and the target word most frequently. This process can be iterated any number of times as we learn about more words that appear frequently: "evolution"–"sociology" comes out of the first batch, but it might suggest "evolution"–"Hegel" for the second, and that in turn might suggest "evolution"–"Kant" for the third. (I'm using colors to indicate at what point in the search process a word turned up: red for words associated with the original word on its own, down to light blue for ones that turned up only in the later stages of searching).
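One round of that search can be sketched as follows. This is a rough Python illustration, not the actual implementation: the toy `corpus` is invented, the cap is 2 books rather than 60, ranking books by the smallest relative frequency among the search terms is my assumption about what "use both most frequently" means, and the TF-IDF formula (term share in the selected books times the log of inverse document frequency) is one common variant rather than necessarily the one used for the charts.

```python
import math
from collections import Counter

# Hypothetical toy corpus: title -> list of words. Invented for illustration.
corpus = {
    "origin":  "evolution species selection evolution species darwin the of".split(),
    "ethics":  "evolution social ethical social evolution the of".split(),
    "geology": "strata rock the of rock strata".split(),
    "sermons": "soul truth the of soul".split(),
}

def top_books(corpus, terms, cap):
    """Titles of the books that use every term, ranked by their weakest term."""
    scored = []
    for title, words in corpus.items():
        counts = Counter(words)
        if all(counts[t] > 0 for t in terms):
            scored.append((min(counts[t] for t in terms) / len(words), title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:cap]]

def distinguishing_words(corpus, terms, cap=2, top_n=3):
    """Words with the highest TF-IDF score inside the top books for the terms."""
    doc_freq = Counter()
    for words in corpus.values():
        doc_freq.update(set(words))
    term_counts = Counter()
    for title in top_books(corpus, terms, cap):
        term_counts.update(corpus[title])
    total = sum(term_counts.values())
    scores = {}
    for word, count in term_counts.items():
        if word in terms:
            continue
        score = (count / total) * math.log(len(corpus) / doc_freq[word])
        if score > 0:  # words found in every book score zero and drop out
            scores[word] = score
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

first_round = distinguishing_words(corpus, ["evolution"])
second_round = distinguishing_words(corpus, ["evolution", "social"])
print(first_round)
print(second_round)
```

Feeding each word from the first round back in as a second search term is the iteration step; recording which round a word first appeared in would give the red-to-blue coloring.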

Often, I'll get the same results for several different search terms; that's what I'm relying on. I use a force-directed placement algorithm to put the words into a chart based on their connections to other words. Essentially, I create a social network where a term like "social" is friends with "ethical" because "social" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"social", and "ethical" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"ethical". (The algorithm is actually a little more complicated than that, though maybe not for the better). So for evolution, the chart looks like this. (click-enlarge)