Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow some better analysis than I can do now, but even the raw data is interesting. (Although since a single 25 GB text file is the best way to interact with it, it's not always convenient.) While I'm waiting for some indexes to build, I have a good chance to figure out just what's in these digital sources.
Most interestingly, it has state-level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which are published in the US. How much geographical diversity is there within that? Not much. About 70% of the books are published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks.) Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.
I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," not the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Monday, January 31, 2011
Friday, January 28, 2011
Picking texts, again
I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.
Friday, January 21, 2011
Digital history and the copyright black hole
In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 the year digital history ends before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is minuscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope they adequately characterize what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the last: let me give a bit of my reasoning why.
My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?
Thursday, January 20, 2011
Openness and Culturomics
The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.
Tuesday, January 18, 2011
Cluster Charts
I'll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.
To quickly recap: I take a word or phrase (evolution, for example) and then find words that appear disproportionately often, according to TF-IDF scores, in the books that use evolution the most. (I just use an arbitrary cap to choose those books; it's 60 books for these charts here. I don't think that's the best possible implementation, but given my processing power it's not terrible.) Then I take each of those words, and find words that appear disproportionately in the books that use both evolution and the target word most frequently. This process can be iterated any number of times as we learn about more words that appear frequently: "evolution"–"sociology" comes out of the first batch, but it might suggest "evolution"–"Hegel" for the second, and that in turn might suggest "evolution"–"Kant" for the third. (I'm using colors to indicate at what point in the search process a word turned up: red for words associated with the original word on its own, down to light blue for ones that turned up only in the later stages of searching.)
Often, I'll get the same results for several different search terms; that's what I'm relying on. I use a force-directed placement algorithm to put the words into a chart based on their connections to other words. Essentially, I create a social network where a term like "social" is friends with "ethical" because "social" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"social", and "ethical" is one of the most distinguishing terms in books that score highly on a search for "evolution"–"ethical". (The algorithm is actually a little more complicated than that, though maybe not for the better.) So for evolution, the chart looks like this. (Click to enlarge.)
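The iterated search described above can be sketched roughly in Python. This is only a simplified illustration under stated assumptions, not the actual algorithm: `build_term_network`, the `related_words` callable, and the `toy_results` lookup table are all hypothetical names with made-up contents standing in for real search results.

```python
def build_term_network(seed, related_words, rounds=2):
    """Expand outward from `seed`, recording edges and first-seen stage.

    related_words: callable taking a tuple of search terms and returning
    a list of distinguishing words (a hypothetical stand-in for the search).
    """
    stage = {}     # word -> round in which it first turned up (drives the color)
    edges = set()  # (search word, distinguishing word) pairs
    frontier = related_words((seed,))
    for word in frontier:
        stage[word] = 0
    for round_number in range(1, rounds + 1):
        next_frontier = []
        for word in frontier:
            # Search for the seed together with each word found so far.
            for found in related_words((seed, word)):
                if found == word:
                    continue
                edges.add((word, found))
                if found not in stage:
                    stage[found] = round_number
                    next_frontier.append(found)
        frontier = next_frontier
    return stage, edges

# Made-up lookup table standing in for real search results.
toy_results = {
    ("evolution",): ["sociology", "species"],
    ("evolution", "sociology"): ["Hegel", "social"],
    ("evolution", "species"): ["selection"],
    ("evolution", "Hegel"): ["Kant"],
}

stage, edges = build_term_network(
    "evolution", lambda terms: toy_results.get(terms, []), rounds=2
)
```

The `edges` set is the kind of input a force-directed layout would take, and `stage` records how late in the search each word appeared.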
Tuesday, January 11, 2011
Clustering from Search
Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:
1) Numeric scores on results
2) The ability to go from a set of books to a set of high-scoring words, as well as (the normal direction) from a set of words to a set of high-scoring books.
We can start to do some really interesting stuff by feeding this information back in and out of the system. (Given unlimited memory, we could probably do it all even better with pure matrix manipulation, and I'm sure there are creative in-between solutions). Let me give an example that will lead to ever-elaborating graphics.
An example: we can find the most distinguishing words for the 100 books that use “evolution” the most frequently:
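To make the two directions concrete, here is a minimal Python sketch of going from a word to its highest-scoring books and then from those books back to their highest-scoring words. The function names `top_books` and `top_words`, the choice to sum scores across books, and the numbers in `toy_scores` are all assumptions for illustration, not real data or the engine's actual scoring.

```python
from collections import defaultdict

def top_books(scores, word, n):
    """Words -> books: the n books with the highest score for `word`."""
    ranked = sorted(scores, key=lambda book: scores[book].get(word, 0.0), reverse=True)
    return ranked[:n]

def top_words(scores, books, n, exclude=()):
    """Books -> words: the n words with the highest summed score across `books`."""
    totals = defaultdict(float)
    for book in books:
        for word, score in scores[book].items():
            if word not in exclude:
                totals[word] += score
    return sorted(totals, key=totals.get, reverse=True)[:n]

# Hypothetical per-book scores (made-up numbers, not real data).
toy_scores = {
    "book1": {"evolution": 0.9, "species": 0.7, "selection": 0.5},
    "book2": {"evolution": 0.8, "society": 0.6, "species": 0.2},
    "book3": {"war": 0.9, "society": 0.3},
}

evolution_books = top_books(toy_scores, "evolution", n=2)
distinguishing = top_words(toy_scores, evolution_books, n=2, exclude={"evolution"})
```

With a real corpus, `n` would be something like the 100 books mentioned above rather than 2.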
Monday, January 10, 2011
Searching for Correlations
More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?
I've always liked this one, since it's one of those historiographical questions that still rattles through politics. The literature, if I remember generals properly (the big work is David Blight, but in the broad outline it comes out of the self-situations of Foner and McPherson, and originally really out of Du Bois), says that the war was viewed as deeply tied to slavery at the time: certainly by emancipation in 1863, and even before. But part of the process of sectional reconciliation after Reconstruction (ending in 1876), and even more so at the beginning of Jim Crow (1890s-ish), was a gradual suppression of that truth in favor of a narrative about the war as a great national tragedy in which the North was an aggressor, and in which the South was defending states' rights but not necessarily slavery. The mainstream historiography has since swung back to slavery as the heart of the matter, but there are obviously plenty of people interested in defending the Lost Cause. Anyhow: let's try to get a demonstration of that. Here's a first chart:
How should we read this kind of chart? Well, it's not as definitive as I'd like, but there's a big peak the year after the war breaks out in 1861, and a massive plunge downwards right after the disputed Hayes–Tilden election of 1876. But the correlation is perhaps higher than the literature would suggest around 1900. And both the ends are suspicious. In the 1830s, what is a search for "civil war" picking up? And why is that dip in the 1910s so suspiciously aligned with the Great War? Luckily, we can do better than this.
Thursday, January 6, 2011
Basic Search
To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.
I described TF-IDF weights a little bit earlier. They're a basic way to find the key content words in a text. Since a "text" can be any set of words from a sentence to the full works of a publishing house or a decade (as Michael Witmore recently said on his blog, they are "massively addressable"), these can be really powerful. And as I said talking about assisted reading, search is a technology humanists use all the time to, essentially, do a form of reading for them. (Even though they don't necessarily understand just what search does.) I'm sure there are better implementations than the basic TF-IDF I'm using, but it's still interesting as a way to understand the searches we do but don't reflect on.
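For readers who haven't seen TF-IDF before, a rough Python sketch of one common textbook variant may help. The weighting here (term frequency as a share of the document, times the log of the inverse document frequency), the function name `tfidf_scores`, and the toy corpus are assumptions for illustration, and not necessarily the exact weighting behind the numbers in these posts.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document.

    docs: dict mapping a document id to a list of tokens.
    Weighting (an assumption): tf = count / doc length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical three-"book" corpus.
toy_docs = {
    "a": "the evolution of society and the evolution of species".split(),
    "b": "the war and the society".split(),
    "c": "the species of the island".split(),
}
toy_scores = tfidf_scores(toy_docs)
```

A word like "the" that appears in every document gets a weight of zero, while a word concentrated in one document, like "evolution" here, scores highest there. That is the sense in which the weights pick out key content words.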
More broadly, my point is that we should think about whether we can use that same technology past the one stage in our research we use it for now. Plus, if you're just here for the graphs, it lets us try a few new ones. But they're not until the next couple posts, since I'm trying to keep down the lengths a little bit right now.
Wednesday, January 5, 2011
Correlations
How are words linked in their usage? In a way, that's the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.
Any word has a given prominence for any book. Basically, that's the number of times it appears. (The numbers I give here are my TF-IDF scores, but for practical purposes, they're basically equivalent to the rate of incidence per book when we look at single words. Things only get tricky when looking at multiple word correlations, which I'm not going to use in this post.) To explain graphically: here's a chart. Each dot is a book, the x axis is the book's score for "evolution", and the y axis is the book's score for "society."
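One way to boil a scatterplot like this down to a single number is a correlation coefficient. The Python sketch below assumes the Pearson coefficient, which is one reasonable choice rather than necessarily the measure used in these posts; the function name `pearson` and the scores for the five books are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the shared 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five books: each pair is one dot on the scatterplot.
evolution_scores = [0.1, 0.4, 0.5, 0.8, 0.9]
society_scores = [0.2, 0.3, 0.6, 0.7, 0.9]
r = pearson(evolution_scores, society_scores)
```

A value near 1 means books that score high on one word tend to score high on the other; a value near 0 means the two scores are unrelated across books.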