Tuesday, January 11, 2011

Clustering from Search

Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:

1)  Numeric scores on results
2) The ability to go from a set of books to a set of high-scoring words, as well as (the normal direction) from a set of words to a set of high-scoring books.
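To make the two directions concrete, here is a minimal sketch in Python. The book titles, the numbers, and the function names `books_for_words` and `words_for_books` are all made up for illustration; it only assumes that some per-book, per-word score already exists.

```python
from collections import defaultdict

# Hypothetical scores (scores[book][word]); titles and values are invented.
scores = {
    "Origin of Species": {"evolution": 3.0, "species": 5.0, "society": 0.5},
    "Social Statics": {"evolution": 2.0, "society": 4.0, "species": 0.2},
    "Tom Sawyer": {"huck": 6.0, "society": 0.3},
}

def books_for_words(words, scores, n=10):
    """Normal direction: rank books by their summed score on the query words."""
    totals = {
        book: sum(word_scores.get(w, 0.0) for w in words)
        for book, word_scores in scores.items()
    }
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    # Drop books that do not match the query at all.
    return [(book, s) for book, s in ranked if s > 0][:n]

def words_for_books(books, scores, n=10):
    """Reverse direction: rank words by their summed score across a set of books."""
    totals = defaultdict(float)
    for book in books:
        for word, s in scores[book].items():
            totals[word] += s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Feed one search back into the other: words -> books -> words.
top_books = [book for book, _ in books_for_words(["evolution"], scores, n=2)]
top_words = words_for_books(top_books, scores, n=3)
```

Chaining the two calls is the "feeding back" idea in miniature: the output of a word search becomes the input of a book search.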

We can start to do some really interesting stuff by feeding this information back in and out of the system. (Given unlimited memory, we could probably do it all even better with pure matrix manipulation, and I'm sure there are creative in-between solutions). Let me give an example that will lead to ever-elaborating graphics.

An example: we can find the most distinguishing words for the 100 books that use “evolution” the most frequently: 

Monday, January 10, 2011

Searching for Correlations

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

I've always liked this one, since it's one of those historiographical questions that still rattle through politics. The literature, if I remember generals properly (the big work is David Blight, but in the broad outline it comes out of the self-situations of Foner and McPherson, and originally really out of Du Bois), says that the war was viewed as deeply tied to slavery at the time—certainly by emancipation in 1863, and even before. But part of the process of sectional reconciliation after Reconstruction (ending in 1876) and even more into the beginning of Jim Crow (1890s-ish) was a gradual suppression of that truth in favor of a narrative about the war as a great national tragedy in which the North was an aggressor, and in which the South was defending states' rights but not necessarily slavery. The mainstream historiography has since swung back to slavery as the heart of the matter, but there are obviously plenty of people interested in defending the Lost Cause. Anyhow: let's try to get a demonstration of that. Here's a first chart:

How should we read this kind of chart? Well, it's not as definitive as I'd like, but there's a big peak the year after the war breaks out in 1861, and a massive plunge downwards right after the disputed Hayes–Tilden election of 1876. But the correlation is perhaps higher than the literature would suggest around 1900. And both the ends are suspicious. In the 1830s, what is a search for "civil war" picking up? And why is that dip in the 1910s so suspiciously aligned with the Great War? Luckily, we can do better than this.

Thursday, January 6, 2011

Basic Search

To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.


I described TF-IDF weights a little bit earlier. They're a basic way to find the key content words in a text. Since a "text" can be any set of words from a sentence to the full works of a publishing house or a decade (as Michael Witmore recently said on his blog, they are "massively addressable"), these can be really powerful. And as I said talking about assisted reading, search is a technology humanists use all the time to, essentially, do a form of reading for them. (Even though they don't necessarily understand just what search does.) I'm sure there are better implementations than the basic TF-IDF I'm using, but it's still interesting as a way to understand the searches we do but don't reflect on.
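For readers who haven't seen the weights written out, here is a minimal sketch of one common textbook variant: term frequency times the log of inverse document frequency. It is not necessarily the exact weighting behind these posts; the function name `tfidf` and the three toy "texts" are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs maps a text's name to its list of tokens.

    Returns name -> {word: weight}, where
    weight = (share of the text's tokens that are this word)
             * log(number of texts / number of texts containing the word).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per text

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty texts
        weights[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights

# Toy corpus; a "text" could equally be a sentence, a book, or a decade.
docs = {
    "a": "the origin of species by natural selection".split(),
    "b": "the study of society and the state".split(),
    "c": "the adventures of tom and huck".split(),
}
weights = tfidf(docs)
```

Words that turn up in every text ("the", "of") get a weight of zero, which is the whole point: the weight rewards words that are frequent here and rare elsewhere.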

More broadly, my point is that we should think about whether we can use that same technology past the one stage in our research we use it for now. Plus, if you're just here for the graphs, it lets us try a few new ones. But they're not until the next couple posts, since I'm trying to keep down the lengths a little bit right now.


Wednesday, January 5, 2011

Correlations

How are words linked in their usage? In a way, that's the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.


Any word has a given prominence for any book. Basically, that's the number of times it appears. (The numbers I give here are my TF-IDF scores, but for practical purposes, they're basically equivalent to the rate of incidence per book when we look at single words. Things only get tricky when looking at multiple word correlations, which I'm not going to use in this post.) To explain graphically: here's a chart. Each dot is a book, the x axis is the book's score for "evolution", and the y axis is the book's score for "society."
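The strength of the pattern in a scatter like that can be boiled down to one number, the Pearson correlation between the two columns of scores. This is a minimal sketch with invented scores; the function name `pearson` and the sample lists are assumptions, not data from my books.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-book scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one word never varies
    return cov / math.sqrt(var_x * var_y)

# One entry per book: invented scores for "evolution" and "society".
evolution = [0.0, 1.0, 2.0, 3.0]
society = [0.5, 1.5, 2.5, 3.5]
r = pearson(evolution, society)
```

A value near 1 means books that score high on one word tend to score high on the other; a value near 0 means the dots form a shapeless cloud.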

Thursday, December 30, 2010

Assisted Reading vs. Data Mining

I've started thinking that there's a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I'd call them:
  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.
Humanists are far more comfortable with the first than the second. (That's partly why they keep calling the second type of work 'text mining', even though I think the field has moved on from that label--it sounds sinister). Basic search, which everyone uses on JSTOR or Google Books, is far more algorithmically sophisticated than a text-mining star like Ngrams. But since it promises to merely enable reading, it has casually slipped into research practices without much thought.

The distinction is important because the way we use texts is tied to humanists' reactions to new work in digital humanities. Ted Underwood started an interesting blog to look at ngrams results from an English lit perspective: he makes a good point in his first post:

Monday, December 27, 2010

Call numbers

I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.

The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for their integrating various catalogs together (Michigan call numbers are filed under MARC 050 (Library of Congress catalog), while California ones are filed under MARC 090 (local catalog), for instance, although they both seem to be basically an LCC scheme). But the openness is fantastic--you just plug OCLC or LCCN identifiers into a url string to get an xml record. It's possible to get a lot of OCLCs, in particular, by scraping Internet Archive pages. I haven't yet found a good way to go the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
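The 050-then-090 fallback can be sketched as follows. This assumes the record arrives as MARCXML in the standard Library of Congress "slim" namespace; the sample record is invented, the function name `call_number` is hypothetical, and the step of actually fetching a record from the API is left out.

```python
import xml.etree.ElementTree as ET

# Assumption: records use the standard MARCXML "slim" namespace.
DATAFIELD = "{http://www.loc.gov/MARC21/slim}datafield"

# Invented sample record that carries its call number only in field 090.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">A made-up title</subfield>
  </datafield>
  <datafield tag="090" ind1=" " ind2=" ">
    <subfield code="a">QH366</subfield>
    <subfield code="b">.D2</subfield>
  </datafield>
</record>"""

def call_number(marc_xml, tags=("050", "090")):
    """Return the first call number found, trying each MARC tag in order."""
    root = ET.fromstring(marc_xml)
    for tag in tags:
        for field in root.iter(DATAFIELD):
            if field.get("tag") != tag:
                continue
            # Join the subfields (classification part, then item part).
            parts = [sub.text.strip() for sub in field if sub.text]
            if parts:
                return " ".join(parts)
    return None

number = call_number(SAMPLE)
headline_letter = number[0] if number else None  # e.g. "Q" for science
```

Trying 050 first and 090 second treats both fields as the same scheme, which matches what the two catalogs seem to be doing in practice.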

This lets me get a slightly better grasp on what I have. First, a list of how many books I have for each headline LC letter:

Sunday, December 26, 2010

Finding keywords

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple examples using months and some abstract nouns. Two of the problems I've had with getting useful data out of this approach are:

  1. What words to use? I have 200,000, and processing those would take at least 10 times more RAM than I have (2GB, for the record). 
  2. What books to use? I can—and will—apply them across the whole corpus, but I think it's more useful to use the data to draw distinctions between types of books we know to be interesting.
I've got tentative solutions to both those questions. For (2), I finally figured out how to get a substantial number of LCC call numbers into my database (for about 30% of the books). More on that later, which I'm obviously excited about. But I finally did some reading to get a better answer for (1), too. This is all still notes and groundwork-laying, so if you're reading for historical analysis or DH commentary, this is the second of several skippable posts. But I like this stuff because it gives us glimpses at the connections between semantics, genre, and word-use patterns.

Basically, I'm going to start off using tf-idf weight. A while ago, I talked about finding "lumpy" words. Any word appears in x books, and y times overall. We can plot that. (I'm using the data from the ngrams 1-set here instead of mine, because it has a more complete set of words. There are lots of uses for that data, for sure, although I keep finding funny little mistakes in it that aren't really worth blogging—they seem to have messed up their processing of contractions, for instance, and their handling of capital letters forces some guess-work into the analysis I'm doing here). Each blue dot in this graph is a word: the red ones are the 1000 or so ones that appear a lot but in fewer books than you'd think. Those words should be more interesting for analysis. 
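One crude way to pick out those words, sketched here with toy data, is to divide a word's total count (y) by the number of books it appears in (x) and rank on that ratio. This is an assumption about how one might do it, not the method behind the graph; the function name `lumpiness` and the `min_total` cutoff are made up.

```python
from collections import Counter

def lumpiness(docs, min_total=3):
    """Rank words by average occurrences per book that contains them.

    docs is a list of token lists, one per book. Words with fewer than
    min_total occurrences overall are ignored as too rare to judge.
    """
    total = Counter()  # y: occurrences across the whole corpus
    books = Counter()  # x: number of books containing the word
    for tokens in docs:
        total.update(tokens)
        books.update(set(tokens))
    ratios = (
        (word, total[word] / books[word])
        for word in total
        if total[word] >= min_total
    )
    return sorted(ratios, key=lambda kv: kv[1], reverse=True)

# Toy corpus: "whale" is frequent overall but confined to one book.
docs = [
    "the whale saw the whale and the whale".split(),
    "the law and the court".split(),
    "the sea and the shore".split(),
]
ranked = lumpiness(docs)
```

In the toy corpus "whale" comes out on top, ahead of "the", because all of its occurrences are packed into a single book.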

Thursday, December 23, 2010

What good are the 5-grams?

 Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:

  1. Ngrams are not complete; and
  2. Were they complete, they wouldn't offer significant computing benefits over reading the whole corpus.
Edit: let me intervene after the fact and change this from a rhetorical to a real question. Am I missing some really important research applications of the 5-grams in what follows? Another way of putting it: has the dump that Google did for the non historical ngrams in 2006 been useful in serious research? I don't know, but I suspect it might have been.

Second Principals

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I'm going to try again. This post is largely a test of whether I can explain principal components analysis to people who don't know about it, so: correct me if you already understand PCA, and let me know what's unclear if you don't. (Or, it goes without saying, skip it.)

Start with an example. Let's say I'm interested in social theory. I can take two words—"social" and "political"—and count how frequent each of them is --something like two or three out of every thousand words is one of those. I can even make a chart, where every point is a book, with one axis the percentage of words in that book that are "social" and the other the percentage that are "political." I put a few books on it just to show what it finds:
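With only two words, the first principal component can be worked out by hand: it is the direction through the cloud of points along which the books vary most. Here is a minimal sketch using the closed-form angle for a 2x2 covariance matrix; the function name `first_component` and the percentages are invented for illustration.

```python
import math

def first_component(points):
    """First principal component of 2-D points.

    points is a list of (x, y) pairs, e.g. (% "social", % "political").
    Returns the unit direction of greatest variance and each point's
    score (its position along that direction, measured from the mean).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

    # Angle of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov, var_x - var_y)
    direction = (math.cos(theta), math.sin(theta))
    scores = [
        (x - mean_x) * direction[0] + (y - mean_y) * direction[1]
        for x, y in points
    ]
    return direction, scores

# Invented books where "political" runs at half the rate of "social".
books = [(0.10, 0.05), (0.20, 0.10), (0.30, 0.15)]
direction, scores = first_component(books)
```

Each book's score is then a single number standing in for both word frequencies at once, which is the trick that scales up when there are thousands of words instead of two.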



Sunday, December 19, 2010

Not included in ngrams: Tom Sawyer

I wrote yesterday about how well the filters applied to remove some books from ngrams work for increasing the quality of year information and OCR compared to Google books.

But we have no idea what books are in there. There's no connection to the texts from the data.

I'm particularly interested in how they deal with subsequent editions of books. Their methodology (pdf) talks about multiple editions of Tom Sawyer. I think it says that they eliminate multiple copies of the same edition but keep different years.

I thought I'd check this. There are about 5 occasions in Tom Sawyer where the phrase "Huck said" appears with separating quotes, and 11 for "said Huck." Both are phrases that basically appear only in Tom Sawyer in the 19th century (the latter also has a tiny life in legal contracts involving huckaback, and a few other places), so we can use them as a fair proxy for different editions. The first edition of Tom Sawyer was 1876: there are loads of later ones, obviously. Here's what you get from ngrams:



Three big spikes around 1900, and nothing before. Until about 1940, the ratio is somewhat consistent with the internal usage in the book, 11 to 5, although "said huck" is a little overrepresented as we might think. Note:

Saturday, December 18, 2010

State of the Art/Science

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.

Some of the theories are deeply interesting. I really like the censorship stuff. That really does deal with books specifically, not 'culture,' so it makes a lot of sense to do with this dataset. The stuff about half-lives for celebrity fame and particularly for years is cool, although without strict genre controls and a little more context I'm not sure what it actually says--it might be something as elegiac as the article's "We are forgetting our past faster with each passing year," but there are certainly more prosaic explanations. (Say: 1) footnotes are getting more and more common, and 2) footnotes generally cite more recent years than does main text. I think that might cover all the bases, too.) Yes, the big ideas, at least the ones I feel qualified to assess, are a little fuzzier—it's hard to tell what to do with the concluding description of wordcounts as "a great cache of bones from which to reconstruct the skeleton of a new science," aside from marveling at the BrooksianFreedmanian tangle of metaphor. (Sciences once roamed the earth?) But although a lot of the language of a new world order (have you seen the "days since first light" counter on their web page?) will rankle humanists, that fuzziness about the goals is probably good. This isn't quite sociobiology redux, intent on forcing a particular understanding of humanity on the humanities. It's just a collection of data and tools that they find interesting uses for, and we can too.


But it's the methods that should be more exciting for people following this. Google remains ahead of the curve in terms of both metadata and OCR, which are the stuff of which digital humanities is made. What does the Science team get?

Friday, December 17, 2010

Missing humanists

(First in a series on yesterday's Google/Harvard paper in Science and its reception.)

So there are four things I'm immediately interested in from yesterday's Google/Harvard paper.

  1. A team of linguists, computer scientists and other non-humanists published that paper in Science about using Google data for word counts to outline the new science of 'culturomics';
  2. They described the methodology they used to get word counts out of the raw metadata and scans, which presumably represents the best Google could do in 2008-09;
  3. Google released a web site letting you chart the shifts in words and phrases over time;
  4. Google released the core data powering that site containing data on word, book, and page occurrences for various combinations of words.

Twitter seems largely focused on #3 as a fascinating tool/diversion, the researchers seem to hope that #1 will create a burst of serious research using #4, and anyone doing research in the field should be eagerly scanning #2 for clues about what the state of the art is—how far you can get with full cooperation from Google, with money to hire programmers, etc, and with unlimited computing infrastructure.


Each of these is worth thinking about in turn. Cut through all of it, though, and I think the core takeaway should be this:

Humanists need to be more involved in how these massive stores of data are used.

Thursday, December 16, 2010

Culturomics

Days from when I said "Google Trends for historical terms might be worse than nothing" to the release of "Google ngrams:" 12. So: we'll get to see!


Also, I take back everything I said about 'digital humanities' having unfortunate implications. "Culturomics"—like 'culturenomics', but fluffier?—takes the cake.


Anyway, I should have some more thoughts on this later. I have them now, I suppose, but let me digest. For now, just dwell on the total lack of any humanists in that article promising to revolutionize the humanities.

Tuesday, December 14, 2010

How Bad is Internet Archive OCR?

We all know that the OCR on our digital resources is pretty bad. I've often wondered if part of the reason Google doesn't share its OCR is simply that it would show so much ugliness. (A common misreading, 'tlie' for 'the', gets about 4.6m results in Google Books.) So how bad is the Internet Archive OCR, which I'm using? I've started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I'd dump it on to the internet, since there doesn't seem to be that much out there.

First: here's a chart of the percentage of "words" that lie outside my list of the top 200,000 or so words. (See an earlier post for the method). The recognized words hover at about 91-93 percent for the period. (That it's lowest in the middle is pretty good evidence the gap isn't a product of words entering or leaving the language).
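The check itself is simple enough to sketch in a few lines. The tokenizer, the tiny vocabulary, and the function name `unrecognized_share` below are assumptions for illustration; the real word list has about 200,000 entries.

```python
import re

def unrecognized_share(text, vocabulary):
    """Fraction of alphabetic tokens in text that are not in vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)

# Toy word list standing in for the top 200,000 words.
vocabulary = {"the", "cat", "sat", "on", "mat"}

# "Tlie" is the classic OCR misreading of "the".
share = unrecognized_share("Tlie cat sat on the mat", vocabulary)
```

Here one token in six falls outside the list, so the recognized share is five sixths.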

Now, that has flaws in both directions. Here are some considerations that would tend to push the OCR error rate on a word basis lower than 8%:

Avoidance tactics

Can historical events suppress use of words? Usage of the word 'panic' seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I'm pretty confident this is just an artifact of me plugging in a lot of words to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don't have the pattern, the rebound in 1894 is too fast, etc. It's only 1873 that really looks abnormal. What do you think:
But it would be really interesting if true--in my database of mostly non-newsy texts, do authors maybe shy away from using words that have too specific a meaning at the present moment? Lack of use might be interesting in all sorts of other ways, even if this one is probably just a random artifact.