Sapping Attention: 2011

Friday, December 16, 2011

Genre similarities

When data exploration produces Christmas-themed charts, that's a sign it's time to post again. So here's a chart and a problem.

First, the problem. One of the things I like about the posts I did on author age and vocabulary change in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn't just up and change--things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.

There's nothing fundamental about author age for this--in fact, I think it probably captures what, at least at first, I would have thought were the least interesting types of vocabulary change. But author age has two nice characteristics.

1) It's straightforwardly linear, and so can be set against publication year cleanly.

2) Librarians have been keeping track of it, pretty much accidentally, by noting the birth year of every book's author.

Neither of these attributes are that remarkable; but the combination is.

Treating texts as individuals vs. lumping them together

Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning's Log-likelihood, which is currently more widely used. I'm having trouble getting M-W running on large numbers of texts as quickly as I'd like, but I'd say that his basic contention--that Dunning log-likelihood is frequently not the best method--is definitely true, and there's a lot to like about rank-ordering tests.

Before I say anything about the specifics, though, I want to make a more general point first, about how we think about comparing groups of texts.The most important difference between these two tests rests on a much bigger question about how to treat the two corpuses we want to compare.

Are they a single long text? Or are they a collection of shorter texts, which have common elements we wish to uncover? This is a central concern for anyone who wants to algorithmically look at texts: how far can we can ignore the traditional limits between texts and create what are, essentially, new documents to be analyzed? There are extremely strong reasons to think of texts in each of these ways.

Compare and Contrast

I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.

Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.

1. Find texts that use words, phrases, or names we're interested in.
2. Compare individual texts or groups of texts against each other.
3. Classify and cluster texts or words. (Where 'classifying' is assigning texts to predefined groups like 'US History', and 'clustering' is letting the affinities be only between the works themselves).

These aren't, to be sure, completely different. I've argued before that in some cases, full-text search is best thought of as a way to create a new classification scheme and populating it with books. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them--the ranking inside them isn't very important). Clustering algorithms are built around models of cross group comparisons; full text searches often have faceted group comparisons. And so on.

But as ideal types, these are different, and in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all). More recently, there's a lot of attention to 3. Scott Weingart has a good summary/literature review on topic modeling and network analysis this week--I think his synopsis that "they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.

In any case: I'm worried that in the excitement about clustering, we're not sufficiently understanding the element in between: comparisons. It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there's not the same density of research in computer science that humanists can piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production quality applications that include some forms of comparisons (WordHoard uses Dunning Log-Likelihood; I can only find relative ratios on the Tapor page). But there isn't widespread adoption, generally used methodologies for search, or anything else like that.

This is a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we do talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, 1960s and 1980s defenses of American capitalism. These are just what we teach in history---I in fact felt like I was coming up with exam or essay questions writing that last sentence.

So why isn't this a more vibrant area? (Admitting one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing).

Dunning Amok

A few points following up my two posts on corpus comparison using Dunning Log-Likelihood last month. Nur ein stueck Technik.

Ted said in the comments that he's interested in literary diction.

I've actually been thinking about Dunnings lately too. I was put in mind of it by a great article a couple of months ago by Ben Zimmer~~Zimmerman~~ addressing the character of "literary diction" in a given period (i.e., Dunnings on a fiction corpus versus the broader corpus of works in the same period).

I'd like to incorporate a diachronic dimension to that analysis. In other words, first take a corpus of 18/19c fiction and compare it to other books published in the same period. Then, among the words that are generally overrepresented in 18/19c fiction, look for those whose degree of overrepresentation *peaks in a given period* of 10 or 20 years. Perhaps this would involve doing a kind of meta-Dunnings on the Dunnings results themselves.

I'm still thinking about this, as I come back to doing some other stuff with the Dunnings. This actually seems to me like a case where the Dunning's wouldn't be much good; so much of a Dunning score is about the sizes of the corpuses, so after an initial comparison to establish 'literary diction' (say), I think we'd just want to compare the percentages.

Theory First

Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She's rightly concerned that the THATcamp motto--"more hack, less yack"--promotes precisely the wrong understanding of what digital methods offer:

the whole reason DH is theoretically consequential is that the use of technical methods and tools should be making us rethink the humanities.

Cecire wants a THATcamp theory, so that the teeming DHers can better describe the implications of all the work that's going on. Ted Underwood worries that claims for the primacy of theory can be nothing more than a power play, serving to reify existing class distinctions inside the academy; but he's willing to go along with a reciprocal relation between theory and practice going forward.

Dunning Statistics on authors

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses--two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well curated, TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote--interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).

As an example, let's compare all the books in my library by Charles Dickens and William Dean Howells, respectively. (I have a peculiar fascination with WDH, regular readers may notice: it's born out of a month-long fascination with Silas Lapham several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I choose it), which means lots of duplicate copies published in different years, perhaps some miscategorizations, certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?

I'll present the results in faux-wordle form as discussed last time. That means I use wordle.com graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. What does that look like?

Comparing Corpuses by Word Use

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.

What are some interesting, large corpuses to compare? A lot of what we'll be interested in historically are subtle differences between closely related sets, so a good start might be the two Library of Congress subject classifications called "History of the Americas," letters E and F. The Bookworm database has over 20,000 books from each group. What's the difference between the two? The full descriptions could tell us: but as a test case, it should be informative to use only the texts themselves to see the difference.

That leads a tricky question. Just what does it mean to compare usage frequencies across two corpuses? This is important, so let me take this quite slowly. (Feel free to skip down to Dunning if you just want the best answer I've got.) I'm comparing E and F: suppose I say my goal to answer this question:

What words appear the most times more in E than in F, and vice versa?

There's already an ambiguity here: what does "times more" mean? In plain English, this can mean two completely different things. Say E and F are exactly the same overall length (eg, each have 10,000 books of 100,000 words). Suppose further "presbygational" (to take a nice, rare, American history word) appears 6 times in E and 12 times in F. Do we want to say that it appears two times more (ie, use multiplication), or six more times (use addition)?

Bookworm and library search

We just launched a new website, Bookworm, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let's start with the way it helps you find library books.

Google Ngrams, which Bookworm in many ways resembles, was fundamentally about words and their histories; Bookworm tries to place texts much closer to the center instead. At their hearts, Ngrams uses a large collection of texts to reveal trends in the history of words; Bookworm lets you use words to discover the history of different groups of books--and by extension, their authors and readers.

Is catalog information really metadata?

We've been working on making a different type of browser using the Open Library books I've been working with to date, and it's raised a interesting question I want to think through here.

I think many people looking at word countson a large scale right now (myself included) have tended to make a distinction between wordcount data on the one hand, and catalog metadata on the other. (I know I have the phrase "catalog metadata" burned into my reflex vocabulary at this point--I've had to edit it out of this very post several times.) The idea is that we're looking at the history of words or phrases, and the information from library catalogs can help to split or supplement that. So for example, my big concern about the ngrams viewer when it came out was that it included only one form of metadata (publication year) to supplement the word-count data, when it should really have titles, subjects, and so on. But that still assumes that word data--catalog metadata is a useful binary.

I'm starting to think that it could instead be a fairly pernicious misunderstanding.

Wars, Recessions, and the size of the ngrams corpus

Hank wants me to post more, so here's a little problem I'm working on. I think it's a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.

My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it's not--the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns--although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics in Miller-McCune. (By the way, the Miller-McCune article on science PhDs is my favorite magazine article of the last couple of years). A corollary benefit, though, is that they help us to start understanding better just what is included in our libraries, both digital and brick.

Background: right now, I need a list of of the most common English words. (Basically to build a much larger version of the database I've been working with; making it is teaching me quite a bit of computer science but little history right now). I mean 'most common' expansively: earlier I found that about 200,000 words gets pretty much every word worth analyzing. There were some problems with the list I ended up producing. The obvious one, the one I'm trying to fix, is that words from the early 19th century, when many fewer books were published, will be artificially depressed compared to newer ones.

But it turns out that a secular increase in words published per year isn't the only effect worth fretting about. Words in the Google Books corpus doesn't just increase steadily over time. Looking at the data series on overall growth, one period immediately jumped out at me:

Graphing and smoothing

I mentioned earlier I've been rebuilding my database; I've also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.

This post is mostly playing with graph formats, as a way to think through a couple issues on my mind and put them to rest. I suspect this will be an uninteresting post for many people, but it's probably going to live on the front page for a little while given my schedule the next few weeks. Sorry, visitors!

Moving

Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:

What's new?

Let me get back into the blogging swing with a (too long—this is why I can't handle Twitter, folks) reflection on an offhand comment. Don't worry, there's some data stuff in the pipe, maybe including some long-delayed playing with topic models.

Even at the NEH's Digging into Data conference last weekend, one commenter brought out one of the standard criticisms of digital work—that it doesn't tell us anything we didn't know before. The context was some of Gregory Crane's work in describing shifting word use patterns in Latin over very long time spans (2000 years) at the Perseus Project: Cynthia Damon, from Penn, worried that "being able to represent this as a graph instead by traditional reading is not necessarily a major gain." That is to say, we already know this; having a chart restate the things any classicist could tell you is less than useful. I might have written down the quote wrong; it doesn't really matter, because this is a pretty standard response from humanists to computational work, and Damon didn't press the point as forcefully as others do. Outside the friendly confines of the digital humanities community, we have to deal with it all the time.

Predicting publication year and generational language shift

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn't happen evenly across across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a world like "outside" more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

Will had some some good questions in the comments about how different words fit these patterns. Looking at different types of words should help find some more ways that this sort of investigation is interesting, and show how different sorts of language vary. But to look at other sorts of words, I should be a little clearer about the kind of words I chose the first time through. If I can describe the usage pattern for a "word like 'outside'," just what kind of words are like 'outside'? Can we generalize the trend that they demonstrate?

The 1940 election

A couple weeks ago, I wrote about how ancestry.com structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the Minnesota population center on just that topic:

We have an extraordinary opportunity to partner with a leading genealogical firm to produce a microdata collection that will encompass the entire 1940 census of population of over 130 million cases. It is not feasible to digitize every variable that was collected in the 1940 census. We are therefore seeking your help to prioritize variables for inclusion in the 1940 census database.

In search of the great white whale

All the cool kids are talking about shortcomings in digitized text databases. I don't have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.

Age cohort and Vocabulary use

Let's start with two self-evident facts about how print culture changes over time:

The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.

How do (1) and (2) relate to each other? To what extent do the shifting group of authors create the changes in language, and how much do changes happen in a culture that authors all draw from?

This might be a historical question, but it also might be a linguistics/sociology/culturomics one. Say there are two different models of language use: type A and type B.

Type A means a speaker drifts on the cultural winds: the language shifts and everyone changes their vocabulary every year.
Type B, on the other hand, assumes that vocabulary is largely fixed at a certain age: a speaker will be largely consistent in her word choice from age 30 to 70, say, and new terms will not impinge on her vocabulary.

Both of these models are extremes, and we can assume that hardly any words are pure A or pure B. To firm this up, let me concretize this with two nicely alphabetical examples of fictional characters to warm up the subject for all you humanists out there:

Type A: John Updike's Rabbit Angstrom. Rabbit doesn't know what he wants to say. Every decade, his vocabulary changes; he talks like a ennui-ed salaryman in the 50s, flirts with hippiedom and Nixonian silent-majorityism in the 60s, spends the late 70s hoarding gold and muttering about Consumer Reports and the Japanese. For Updike, part of Rabbit being an everyman is the shifts he undergoes from book to book: there's a sort of implicit type-A model underlying his transformations. He's a different person at every age because America is different in every year.

Type B: Richard Ford's Frank Bascombe. Frank Bascombe, on the other hand, has his own voice. It shifts from decade to decade, to be sure, but 80s Bascombe sounds more like 2000s Bascombe than he sounds like 80s Angstrom. What does change is internal to his own life: he's in the Existence period in the 90s and worries about careers, and the 00s he's in the Permanent Period and worried about death. Bascombe is a dreamy outsider everywhere he goes: the Mississippian who went to Ann Arbor, always perplexed by the present.*

Anyhow: I don't have good enough author metadata right now to check this on authors (which would be really interesting), but I can do it a bit on words. An Angstrom word would be one that pops up across all age cohorts in society simultaneously; a Bascombe word is one that creeps in more with each succeeding generation, but that doesn't change much over time within an age cohort.

This is getting into some pretty multi-dimensional data, so we need something a little more complicated than line graphs. The solution I like right now is heat maps.

An example: I know that "outside" is a word that shows a steady, upward trend from 1830 to 1922; in fact, I found that it was so steady that it was among the best words at helping to date books based on their vocabulary usage. So how did "outside" become more popular? Was it the Angstrom model, where everyone just started using it more? Or was it the Bascombe model, where each succeeding generation used it more and more? To answer that, we need to combine author birth year with year of publication:

Stopwords to the wise

Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.

A particularly clear connection is from database structures to "categories of analysis" in our methodology. Since humanists share methods in a lot of ways, digital resources designed for one humanities discipline will carry well for others. But it's quite possible to design a resource that makes extensive use of certain categories of analysis nearly impossible.

One clear-cut example: ancestry.com. The bulk of interest in digitized census records lies in two groups: historians and genealogists. That web site is clearly built for the latter: it has lots of genealogy-specific features built into the database for matching sound-alike names and misspellings, for example, but almost nothing for social history. (I'm pretty sure you can't use it to find German cabinet-makers in Camden in 1850, for example.) Ancestry.com views names (last names in particular) as the most important field and structures everything else around serving those up. Lots of historians are more interested in the place or the profession or the ancestry fields in the census: what we take as a unit of analysis affects what we want to see database indexes and search terms built around. (And that's not even getting into the question of aggregating the records into statistics.)

Generations vs. contexts

When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn't have works for years, somewhat to my surprise. (It's remarkable how much metadata holds this sort of work back, rather than data itself). They did, though, have one kind of year information: author birth dates. You can use those to create same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.

I've been using 'evolution' as my test phrase for a while now: but as you'll see, it turns out to be a really interesting word for this kind of analysis. Maybe that's just chance, but I think it might be a sort of indicative test case--generational shifts are particularly important for live intellectual issues, perhaps, compared to overall linguistic drift.

To start off, here's a chart of the usage of the word "evolution" by share of words per year. There's nothing new here yet, so this is merely a reminder:

Here's what's new: we can also plot by year of author birth, which shows some interesting (if small) differences:

Cronon's politics

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and that Cronon's upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton's, I don't think this has much to do with DH). The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.

Author Ages

Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off looking at a new pool for book data for a little while that I think is really interesting.

Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, thought, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going one before that's possible.

Open Library has pretty comprehensive metadata on authors. In the bigpubs database I made, about 40,000 books have author birth years, and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1500 books have no author listed whatsoever).

First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. Here's a chart of author age for all the books in my bigpubs set:

What historians don't know about database design…

I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

One of the most illuminating things I've learned in trying to build up a fairly large corpus of texts is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using jstor or google books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down.

Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply it to data, and return results. We ask for all the books about the Soviet Union written before 1917, Google spits it back. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.

Genres in Motion

Here's an animation of the PCA numbers I've been exploring this last week.

There's quite a bit of data built in here, and just what it means is up for grabs. But it shows some interesting possibilities. As a reminder: at the end of my first post on categorizing genres, I arranged all the genres in the Library of Congress Classification in two dimensional space using the first two principal components. PCA basically find the combinations of variables that most define the differences within a group. (Read more by me here or generally here.). The first dimension roughly corresponded to science vs. non-science: the second separated social science from the humanities. It did, I think, a pretty good job at showing which fields were close to each other. But since I do history, I wanted to know: do those relations change? Here's that same data, but arranged to show how those positions shift over time. I made this along the same lines as the great Rosling/Gapminder bubble charts, created with this via this. To get it started, I'm highlighting psychology.

[If this doesn't load, you can click through to the file here]. What in the world does this mean?

Vector Space, overlapping genres, and the world beyond keyword search

I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclasses "BF" (psychology) blue, and use red for "QE" (Geology), overlaying them on a chart of the first two principal components like I've been using for the last two posts:

That's a little worse than I was hoping. Generally the books stay close to their term, but there is a lot of variation, and even a little bit of overlap. Can we do better? And what would that mean?

PCA on years

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:

I have a professional interest in shifts in genres. But this isn't temporal--it's just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?

Fresh set of eyes

One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress ~~headings~~ classifications are probably the best hierarchical classification of books we'll ever get. Certainly they're the best human-done hierarchical classification. It's literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they're also a little foreign, at times, and it's not clear how well they'll correspond to machine-centric ways of categorizing books. I've been playing around with some of the data on LCC ~~headings~~ classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.

***
Everybody loves dendrograms, even if they don't like statistics. Here's a famous one, from the French Encylopedia.

(From http://quod.lib.umich.edu/d/did/: translation)

That famous tree of knowledge raises two questions for me:

Going it alone

I've spent a lot of the last week trying to convince Princeton undergrads it's OK to occasionally disagree with each other, even if they're not sure they're right. So let me make one of my notes on one of the places I've felt a little bit of skepticism as I try to figure what's going on with the digital humanities.

Since I'm late to the party, I've been trying to catch up a bit on where the field is now. One thing that jumped out is how wide-ranging the hopes are for what the digital humanities might do if they take over the existing disciplines or create their own. Being a bit of a job market determinist myself, I wonder if the wreckage many see in the current structure of the humanities doesn't promote a little bit of millenarian strand about how great the reconstruction might be. I feel occasionally I've stumbled into Moscow 1919 or Paris 1968; there are manifestos, there are spontaneous leaderless youth, and in the wreckage of the old system, anything seems possible for the new technological man. Digital humanities, to exaggerate the claims, will create the mass audience academic historians have lost, will reaffirm the importance of public history in the field, will create new fields with new jobs, will break down the boundaries between disciplines, will allow collaborative history to finally emerge. And it might be in danger if it's co-opted by the powers-that-be, as John Unsworth finds many worrying (pdf).

Paris 1968 is an exciting place to be. I've been watching Al-Jazeera all week. But all these transformations promised by DH won't happen all at once, and some of them won't happen at all. As I try to write some of this up for a Princeton audience (which is why, along with the start of our term last week, I'm not blogging much right now) I'm thinking about what it takes to get skeptical historians on board, and what parts of the promised land might put them off.

The thing I'm mulling over: collaboration. A colleague said to me yesterday he thought the digital humanities will come and go before most historians ever stopped working alone, and I think I tend to agree. I'm pretty much agnostic on the need for collaborative history, myself. Certainly, digital humanities open up fascinating new prospects for collaborative projects. But so far as we're trying to get anyone established on board, an insistence on collaboration might be as much a liability as a benefit. I'm signing up for a THATcamp, but I have to admit a bit of trepidation about putting in volunteer work onto anything that isn't mine. Not just for selfishness, but because we often have funny standards about academic work it's difficult to impose on others. I went to a talk this week where one participant says he refuses to use the words "idea" or "concept." No one can live up to all the constraints we might want to put on work, but it's often fascinating to see what people come up with when we let them do things wholly their own way. Labs aren't always amenable to humanist practices because it's critically important for the health of our disciplines that we don't agree on methodology.

Luckily, then, I've been most struck by in the last couple months is how far one can go it alone right now--unlike the early years of humanities computing (or so I gather), you don't need teams to get computing time, all the truly technical work of digitization, OCR, and cataloging has been done by groups like the Internet Archive, and free software makes it possible to get started on some forms of analysis quite quickly. It's quite possible for someone at a university without any digital humanities infrastructure to do work in text mining or GIS without having a full lab or collaborative team behind them. Sure, it's harder than firing up an iPad app; but I'm not sure it's that much worse than all the commands plenty of senior academics learned in the dark ages to check their e-mail on pine or elm.

What about all the collaborative the labs and programs we already have? Clearly they do more than anything to advance the field, and it's hard to imagine all the great work coming out of GMU or Stanford (say) happening with lone scholars. But it's equally hard for me to imagine that the digital humanities will have actually succeeded until there's a lot of good work coming out that doesn't need the collaborative model, and that answers to some of the expectations of solitary scholars about how humanistic work is produced. At least, that's what I'm thinking for now.

Wednesday, February 2, 2011

Graphing word trends inside genres

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I'm interested in for my dissertation by using the Library of Congress classifications for the books. I'm going to start with the difference between psychology and philosophy. I've already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.

Most people know that psychology emerged out of philosophy, becoming a more scientific or experimental study of the mind sometime in the second half of the 19C. The process of discipline formation is interesting, well studied, and clearly connected to the vocabulary used. Given that, there should be something for lexical statistics in it. Also, there's something neatly meta about using the split of a 'scientific' discipline off of a humanities one, since some rhetoric in or around the digital humanities promises a bit more rigor in our analysis by using numbers. So what are the actual differences we can find?

Let me start by just introducing these charts with a simple one. How much do the two fields talk about "truth?"

Technical notes

I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That's mostly because the number and size of files I'm dealing with are so large that I don't trust an existing program to handle them, and because the existing packages don't necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn't necessarily think to implement on my own. Maybe I'll find some way to get on board with all that later. First, a quick note on the programs:

Where were 19C US books published?

Open Library has pretty good metadata. I'm using it to assemble a couple new corpuses that I hope should allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file the best way to interact with it, not always convenient). While I'm waiting for some indexes to build, that will give a good chance to figure out just what's in these digital sources.

Most interestingly, it has state level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which are published in the US. How much geographical diversity is there within that? Not much. About 70% of the books are published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new GoogleViz R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load, there still seem to be some kinks). Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.

I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," not the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.

Picking texts, again

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

Digital history and the copyright black hole

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 the year digital history ends before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is miniscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope it adequately characterizes what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the latter: let me give a bit of my reasoning why.

My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?

Openness and Culturomics

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Cluster Charts

I'll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.

To quickly recap: I take a word or phrase—evolution, for example—and then find words that appear disproportionately often, according to TF-IDF scores, in the books that use evolution the most. (I just use an arbitrary cap to choose those books--it's 60 books for these charts here. I don't think that's the best possible implementation, but given my processing power it's not terrible). Then I take each of those words, and find words that appear disproportionately in the books that use both evolution and the target word most frequently. This process can be iterated any number of times as we learn about more words that appear frequently—"evolution"–"sociology" comes out of the first batch, but it might suggest "evolution"–"Hegel" for the second, and that in turn might suggest "evolution" –"Kant" for the third. (I'm using colors to indicate at what point in the search process a word turned up: Red for words that associated with the original word on its own, down to light blue for ones that turned up only in the later stages of searching).

Often, I'll get the same results for several different search terms—that's what I'm relying on. I use a force-directed placement algorithm to put the words into a chart based on their connections to other words. Essentially, I create a social network where a term like "social" is friends with "ethical" because "social" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"social", and "ethical" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"ethical". (The algorithm is actually a little more complicated than that, thought maybe not for the better). So for evolution, the chart looks like this. (click-enlarge)

Clustering from Search

Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:

1) Numeric scores on results
2) The ability to from a set of books to a set of high-scoring words, as well as (the normal direction) from a set of words to a set of high-scoring books.

We can start to do some really interesting stuff by feeding this information back in and out of the system. (Given unlimited memory, we could probably do it all even better with pure matrix manipulation, and I'm sure there are creative in-between solutions). Let me give an example that will lead to ever-elaborating graphics.

An example: we can find the most distinguishing words for the 100 books that use “evolution” the most frequently:

Searching for Correlations

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

I've always liked this one, since it's one of those historiographical questions that still rattles through politics. The literature, if I remember generals properly (the big work is David Blight, but in the broad outline it comes out of the self-situations of Foner and McPherson, and originally really out of Du Bois), says that the war was viewed as deeply tied to slavery at the time—certainly by emancipation in 1863, and even before. But as part of the process of sectional reconciliation after Reconstruction (ending in 1876) and even more into the beginning of Jim Crow (1890s-ish) was a gradual suppression of that truth in favor of a narrative about the war as a great national tragedy in which the North was an aggressor, and in which the South was defending states' rights but not necessarily slavery. The mainstream historiography has since swung back to slavery as the heart of the matter, but there are obviously plenty of people interested in defending the Lost Cause. Anyhow: let's try to get a demonstration of that. Here's a first chart:

How should we read this kind of chart? Well, it's not as definitive as I'd like, but there's a big peak the year after the war breaks out in 1861, and a massive plunge downwards right after the disputed Hayes–Tilden election of 1876. But the correlation is perhaps higher than the literature would suggest around 1900. And both the ends are suspicious. In the 1830s, what is a search for "civil war" picking up? And why is that dip in the 1910s so suspiciously aligned with the Great War? Luckily, we can do better than this.

Basic Search

To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.

I described TF-IDF weights a little bit earlier. They're a basic way to find the key content words in a text. Since a "text" can be any set of words from a sentence to the full works of a publishing house or a decade (as Michael Witmore recently said on his blog, they are "massively addressable"), these can be really powerful. And as I said talking about assisted reading, search is a technology humanists use all the time to, essentially, do a form of reading for them. (Even though they don't necessarily understand just what search does.) I'm sure there are better implementations than the basic TFIDF I'm using, but it's still interesting both as a way to understand the searches we do and don't reflect on.

More broadly, my point is that we should think about whether we can use that same technology past the one stage in our research we use it for now. Plus, if you're just here for the graphs, it lets us try a few new ones. But they're not until the next couple posts, since I'm trying to keep down the lengths a little bit right now.

Correlations

How are words linked in their usage? In a way, that's the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.

Any word has a given prominence for any book. Basically, that's the number of times it appears. (The numbers I give here are my TF-IDF scores, but for practical purposes, they're basically equivalent to the rate of incidence per book when we look at single words. Things only get tricky when looking at multiple word correlations, which I'm not going to use in this post.) To explain graphically: here's a chart. Each dot is a book, the x axis is the book's score for "evolution", and the y axis is the book's score for "society."

Friday, December 16, 2011

Friday, November 18, 2011

Monday, November 14, 2011

Thursday, November 10, 2011

Thursday, November 3, 2011

Friday, October 7, 2011

Thursday, October 6, 2011

Friday, September 30, 2011

Monday, September 5, 2011

Sunday, August 28, 2011

Thursday, August 4, 2011

Friday, July 15, 2011

Thursday, June 16, 2011

Tuesday, May 10, 2011

Monday, April 18, 2011

Wednesday, April 13, 2011

Monday, April 11, 2011

Sunday, April 3, 2011

Friday, April 1, 2011

Monday, March 28, 2011

Thursday, March 24, 2011

Wednesday, March 2, 2011

Tuesday, February 22, 2011

Sunday, February 20, 2011

Thursday, February 17, 2011

Monday, February 14, 2011

Friday, February 11, 2011

Wednesday, February 2, 2011

Tuesday, February 1, 2011

Monday, January 31, 2011

Friday, January 28, 2011

Friday, January 21, 2011

Thursday, January 20, 2011

Tuesday, January 18, 2011

Tuesday, January 11, 2011

Monday, January 10, 2011

Thursday, January 6, 2011

Wednesday, January 5, 2011