Sunday, December 12, 2010

Capitalist lackeys

I'm interested in the ways different words are tied together. That's sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for "scientific method," but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I'm going to think through this staying on "capitalist" as the word of the day. Fair warning: this post is a rambler.

Earlier I looked at some sentences to conclude that language about capitalism has always had critics in the American press (more, Dan said in the comments, than some of the historiography might suggest). Can we find this by looking at numbers, rather than just random samples of text? Let's start with a log-scale chart about what words get used in the same sentence as "capitalist" or "capitalists" between 1918 and 1922. (I'm going to just say capitalist, but my numbers include the plural too.)
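For anyone who wants to try this kind of count, here is a minimal sketch in R. It is not the code behind the chart: the four sentences are invented, and `cooccurrence_counts` is a hypothetical name. It only assumes the corpus is already split into sentences.

```r
# Toy sentences standing in for the corpus (invented for illustration only).
sentences <- c(
  "The capitalist class controls the labor of the country",
  "Capitalists and labor leaders met in the city",
  "The farmer sold his wheat in the city",
  "Every capitalist fears the organization of labor"
)

# Hypothetical helper: count the words that share a sentence with `target`.
# `target` is a regular expression, so "capitalists?" covers the plural too.
cooccurrence_counts <- function(sentences, target = "capitalists?") {
  lowered <- tolower(sentences)
  # Keep only sentences containing the target as a whole word.
  hits <- lowered[grepl(paste0("\\b", target, "\\b"), lowered, perl = TRUE)]
  # Split on anything that is not a letter, then drop empty strings.
  words <- unlist(strsplit(hits, "[^a-z]+"))
  words <- words[nzchar(words)]
  # Drop the target word itself so it does not top the table.
  words <- words[!grepl(paste0("^", target, "$"), words)]
  sort(table(words), decreasing = TRUE)
}

counts <- cooccurrence_counts(sentences)
print(head(counts))

# A log scale keeps a few very common words from flattening everything else.
log_counts <- log10(counts)
```

On a real corpus the top of the table will be function words like "the" and "of", so some comparison against overall word frequencies is needed before the counts say anything about capitalism in particular.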


Thursday, December 9, 2010

Metadata for OCR books

A commenter asked why I don't improve the metadata instead of doing this clustering stuff, which seems just to reproduce, poorly, the work of generations of librarians in classifying books. I'd like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I'm going to think through what I know, but I'd love any advice on this because it's really outside my expertise.

Wednesday, December 8, 2010

First Principals

Let me get ahead of myself a little.

For reasons related to my metadata, I had my computer assemble some data on the frequencies of the most common words (I explain why at the end of the post.) But it raises some exciting possibilities using forms of clustering and principal components analysis (PCA); I can't resist speculating a little bit about what else it can do to help explore ways different languages intersect. With some charts at the bottom.
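To make the idea concrete, here is a rough sketch of PCA followed by clustering on a table of common-word frequencies, using base R's `prcomp` and `hclust`. The six "books" and their frequencies are invented, and the choice of two components and two clusters is an assumption made for the toy example, not something taken from the real data.

```r
# Invented relative frequencies for six "books" and four common words.
books <- c("A1", "A2", "A3", "B1", "B2", "B3")
words <- c("the", "of", "which", "upon")
freqs <- matrix(
  c(0.060, 0.035, 0.004, 0.001,
    0.061, 0.036, 0.005, 0.001,
    0.059, 0.034, 0.004, 0.002,
    0.055, 0.030, 0.001, 0.006,
    0.054, 0.031, 0.002, 0.005,
    0.056, 0.029, 0.001, 0.006),
  nrow = 6, byrow = TRUE, dimnames = list(books, words)
)

# Center and scale each word so the most common words do not swamp the rest.
pca <- prcomp(freqs, center = TRUE, scale. = TRUE)

# Each book's position on the first two principal components.
scores <- pca$x[, 1:2]

# Hierarchical clustering on those positions, cut into two groups.
clusters <- cutree(hclust(dist(scores)), k = 2)

print(round(scores, 2))
print(clusters)
```

With real books the groups will not separate this cleanly, and deciding how many components and clusters to keep is a judgment call that depends on knowing the sources.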

Monday, December 6, 2010

Back to the Future

Maybe this is just Patricia Cohen's take, but it's interesting to note that she casts both of the text mining projects she's put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I've seen the phrase 'public intellectuals' more times in the four days I've been on Twitter than in the month before). One of the things that the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.

But some part of my ABD self is a little uncomfortable with reaching so far back. As important as it is to get the general public on board with digital humanities, we also need to persuade less tech-interested, but theory-savvy, scholars that this can create cutting edge research, not just technology. The lede for P. Cohen's first article—that the Theory Wars can be replaced by technology—isn't going to convince many inside the academy. Everybody's got a theory. It's better if you can say what it is.

The Age of Capital–

Dan asks for some numbers on "capitalism" and "capitalist" similar to the ones on "Darwinism" and "Darwinist" I ran for Hank earlier. That seems like a nice big question I can use to get some basic methods to warm up the new database I set up this week and to get some basic functionality written into it.

I'm going to go step-by-step here at some length to show just how cyclical a process this is--the computer is bad at semantic analysis, and it requires some actual knowledge of the history involved to get anything very useful out of the raw data on counts. A lot of comments on semantic analysis make it sound like it's asking computers to think for us, so I think it's worth showing that most of the R functions I'm using generally operate at a pretty low level--doing some counting, some index work, but nothing too mysterious.

Saturday, December 4, 2010

Full-text American versions of the Times charts

This verges on unreflective datadumping, but because it's easy and I think people might find it interesting, I'm going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen's charts of title word counts. I've tossed in a couple extra words where it seems interesting--including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren't many American books from before, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends--thank you, Sonny Bono). In some cases (that 1874 peak for science), the American and British trends are surprisingly close. Sometimes, they aren't.

This is pretty close to Cohen's chart, and I don't have much to add. In looking at various words that end in -ism, I got some sense earlier of how individual religious discussions--probably largely in history—peak at substantially different times. But I don't quite have the expertise in American religious history to fully interpret that data, so I won't try to plug any of it in.

Today's Times Article

Patricia Cohen's new article about the digital humanities doesn't come with the rafts of crotchety comments the first one did, so unlike last time I'm not in a defensive crouch. To the contrary: I'm thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I'll post my versions of the charts the Times published.

Now with actual text!

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher's control. I've noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.


This is all by way of showing off the latest thing it lets me do--get examples of actual usage so we can do semantic processing ourselves, rather than trying to have a computer do it poorly. It might be good to put some tests like this into the code by default, as a check on interpretive hubris. I need to put the years and titles in here too, but if we just take a random set of samples of the language of natural selection, I think it's already clear that we get an interesting new form of text to interpret; it's sort of like reading the usage examples in the OED, except that we can create much more interesting search constraints on where our passages come from.


> get.usage.example("natural selection",sample(books,1))
[1] "we might extend the parallel and get some good illustrations of natural selection from the history of architecture and the origin of the different styles under different climates and conditions"
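
For readers curious what a lookup like this involves, here is a minimal sketch. It is not the `get.usage.example` function called above, whose internals aren't shown here: `find_usage_examples` and `pick_usage_example` are hypothetical names, the sample text is invented, and a book is assumed to be a single character string.

```r
# Hypothetical helper: every sentence in `text` that contains `phrase`.
find_usage_examples <- function(phrase, text) {
  # Crude sentence split: break on runs of ., ! or ? plus trailing spaces.
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))
  # Case-insensitive literal match on the phrase.
  keep <- grepl(tolower(phrase), tolower(sentences), fixed = TRUE)
  sentences[keep]
}

# Hypothetical helper: one matching sentence at random, or NA if none match.
pick_usage_example <- function(phrase, text) {
  matches <- find_usage_examples(phrase, text)
  if (length(matches) == 0) return(NA_character_)
  matches[sample.int(length(matches), 1)]
}

# Invented text standing in for the full text of one book.
book_text <- paste(
  "Natural selection acts slowly.",
  "The architect chose granite!",
  "Some styles survive by a kind of natural selection among builders."
)

examples <- find_usage_examples("natural selection", book_text)
print(examples)
print(pick_usage_example("natural selection", book_text))
```

Nothing here is doing any interpretation. It only pulls the passages out so that a person can read them.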

Friday, December 3, 2010

Quick, extremely relevant outlinks

Dan Cohen, the hub of all things digital history, in the news and on his blog.

What's worth knowing?

I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I'll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying "attention", but there are all sorts of things we can do now. I'm open to suggestions, but here are some random examples:

1. How does the vocabulary used around slavery change between the 1850s and the 1890s, or the 1890s and the 1920s? Probably the discursive space widens--but in what kind of ways, and what sorts of authors use rhetoric of slavery most freely?

2. How do various social and political words cluster by book in the progressive era? Maybe these are words that appear disproportionately often in a sentence with "reform." Can we identify the closeness of ties between various social movements (suffragism, temperance, segregation, municipal government) based on some sort of clustering of co-mentions in books, as I did for the isms?

Questions don't have to be historical, either: they can plug in to other American Studies areas:

3. What different sorts of words are used to modify 'city' or 'crowd' in the novels of (say) Howells, James, and Dreiser? How does it change over time within some of them?

4. What sorts of books discuss the plays of Shakespeare between 1850 and 1922--can we identify a shift in a) the sorts of books writing about him that could confirm some Highbrow/Lowbrow stuff, or b) the particular plays that get mention or praise?

Centennials, part II

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting--how does the publishing industry focus in on certain figures to create news or resurgences of interest in them?  I love the way we get excited about the civil war sesquicentennial now, or the Darwin/Lincoln year last year.

I was asking if this spike in mentions of Thoreau in 1917 is extraordinary or merely high.
Emerson (1903) doesn't seem to have much of a spike--he's up in 1904 with everyone, although Hawthorne, whose centenary is 1904, isn't up very much.

Can we look at the centennial spikes for a lot of authors? Yes. The best way would be to use a biographical dictionary or Wikipedia or something, but I can also just use the years built into some of my author metadata to get a rough list of authors born between 1730 and 1822, so they can have a centenary during my sample. A little grepping gets us down to a thousand or so authors. Here are the ten with the most books, to check for reliability:
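
The filtering step described above might look something like this. It is a sketch that assumes the author field follows the library-catalog pattern "Surname, Forename, birth-death"; the real metadata may well be messier, and the handful of entries here are just for illustration.

```r
# Author strings in an assumed catalog format: "Surname, Forename, birth-death".
authors <- c(
  "Emerson, Ralph Waldo, 1803-1882",
  "Hawthorne, Nathaniel, 1804-1864",
  "Thoreau, Henry David, 1817-1862",
  "Howells, William Dean, 1837-1920",
  "Franklin, Benjamin, 1706-1790",
  "Anonymous"
)

# Take the first four-digit number followed by a hyphen as the birth year.
has_year <- grepl("[0-9]{4}-", authors)
born <- rep(NA_integer_, length(authors))
born[has_year] <- as.integer(
  sub("^.*?([0-9]{4})-.*$", "\\1", authors[has_year], perl = TRUE)
)

# Born 1730-1822 means the centenary falls inside an 1830-1922 sample.
in_sample <- !is.na(born) & born >= 1730 & born <= 1822

centenaries <- data.frame(
  author = authors[in_sample],
  born = born[in_sample],
  centenary = born[in_sample] + 100L,
  stringsAsFactors = FALSE
)
print(centenaries)
```

Entries with no dates, like "Anonymous" here, simply drop out, which is one reason a list built this way is only rough.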

Centennials, part I.

I was starting to write about the implicit model of historical change behind loess curves, which I'll probably post soon, when I started to think some more about a great counterexample to the gradual change I'm looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

I've always been interested in tracking changes in historical memory, and this is a good place to do it. I talked about the Gettysburg sesquicentennial earlier, and I think all the stuff about the civil war sesquicentennial (a word that doesn't show up in my top 200,000, by the way) prompted me to wonder whether the commemorations a hundred years ago helped push forward practices in the publishing industry of more actively reflecting on anniversaries. Are there patterns in the celebration of anniversaries? For once my graphs will be looking at the spikes, not the general trends. With two exceptions to start, the words themselves:
So that's a start: the word centennial was hardly an American word at all before 1876, and it didn't peak until 1879. The Loess trend puts the peak around 1887. So it seems like not only did the American centennial put the word into circulation, it either remained a topic of discussion or spurred a continuing interest in centennials of Founding era events for over a decade.

Thursday, December 2, 2010

Do it yourself

Jamie's been asking for some thoughts on what it takes to do this--statistics backgrounds, etc. I should say that I'm doing this, for the most part, the hard way, because 1) My database is too large to start out using most tools I know of, including I think the R text-mining package, and 2) I want to understand how it works better. I don't think I'm going to do the software review thing here, but there are what look like a lot of promising leads at an American Studies blog.

As for whether the courses exist, I think they do from place to place: Stephen Ramsay says he's taught one at Nebraska for years.

It's easy to follow a few of these links and quickly end up drinking from a firehose of information. I get two initial impressions: 1) English is ahead of history on this; 2) there are a lot of highly developed applications for doing similar things with text analysis. The advantage is that it's leading me to think more carefully about how my applications are different than other people's.

Wednesday, December 1, 2010

Digital Humanities and Humanities Computing

I've had "digital humanities" in the blog's subtitle for a while, but it's a terribly offputting term. I guess it's supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn't appeal to most humanists of the tweed– and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.

It's too easy to think Digital Humanities is about teaching people to think like computers, when it really should be about making computers think like humanists.* What we want isn't digital humanities; it's humanities computing. To some degree, we all know this is possible--we all think word processors are better than pen and paper, or JSTOR better than buried stacks of journals (musty musings about serendipity aside). But we can go further than that. Manfred Kuehn's blog is an interesting project in exploring how notetaking software can reflect and organize our thinking in ways that create serendipity within one person's own notes. I'm trying to figure out ways of doing that on a larger body of texts, but we could think of those as notes, themselves.

Programming and other Languages

Jamie asked about assignments for students using digital sources. It's a difficult question.

A couple weeks ago someone referred an undergraduate to me who was interested in using some sort of digital maps, like the ones I did of Los Angeles German emigres a few years ago, for a project on a Cuban emigre writer. Like most history undergraduates, she didn't have any programming background, and she didn't have a really substantial pile of data to work with from the start. For her to do digital history, she'd have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she's really looking at the contents of letters.