Jamie's been asking for some thoughts on what it takes to do this--statistics backgrounds, etc. I should say that I'm doing this, for the most part, the hard way, because 1) my database is too large to start out using most of the tools I know of, including, I think, the R text-mining package, and 2) I want to understand better how it all works. I don't think I'm going to do the software review thing here, but an American Studies blog has what look like a lot of promising leads.
As for whether the courses exist, I think they do from place to place: Stephen Ramsay says he's taught one at Nebraska for years.
It's easy to follow a few of these links and quickly end up drinking from a firehose of information. I get two initial impressions: 1) English is ahead of history on this; 2) there are a lot of highly developed applications for doing similar things with text analysis. The advantage is that it's leading me to think more carefully about how my applications are different from other people's.
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Thursday, December 2, 2010
Wednesday, December 1, 2010
Digital Humanities and Humanities Computing
I've had "digital humanities" in the blog's subtitle for a while, but it's a terribly offputting term. I guess it's supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn't appeal to most humanists of the tweed- and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining them.
It's too easy to think Digital Humanities is about teaching people to think like computers, when it really should be about making computers think like humanists.* What we want isn't digital humanities; it's humanities computing. To some degree, we all know this is possible—we all think word processors are better than pen and paper, or JSTOR better than buried stacks of journals (musty musings about serendipity aside). But we can go farther than that. Manfred Kuehn's blog is an interesting project in exploring how notetaking software can reflect and organize our thinking in ways that create serendipity within one person's own notes. I'm trying to figure out ways of doing that on a larger body of texts, though we could think of those texts as notes, themselves.
Programming and other Languages
Jamie asked about assignments for students using digital sources. It's a difficult question.
A couple weeks ago someone referred an undergraduate to me who was interested in using digital maps for a project on a Cuban emigre writer, like the maps I did of Los Angeles German emigres a few years ago. Like most history undergraduates, she didn't have any programming background, and she didn't have a really substantial pile of data to work with from the start. For her to do digital history, she'd have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she's really looking at the contents of letters.
Tuesday, November 30, 2010
Catalog data and genre
Mostly a note to myself:
I think genre data would be helpful in all sorts of ways--tracking evolutionary language through different sciences, say, or finding which discourses are the earliest to use certain constructions like "focus attention." The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers--those divide up books in all sorts of ways, at various levels that I could parse. It's tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author, and year from the metadata I do have, but that would miss a lot and maybe produce false positives, plus the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don't trust that it would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best--just a quick sketch below.
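One hedged guess at what the matching half of such a hybrid step could look like in R: join the Internet Archive metadata to a local table of catalog records on normalized title, author, and year, and send whatever fails to match on to the clustering step. The data frames, column names, and sample records here are all my inventions for illustration, not a real LOC interface.

```r
# Normalize strings so trivial differences in punctuation and case don't
# block a match.
normalize <- function(x) gsub("[^a-z ]", "", tolower(x))

# Invented stand-ins for the IA metadata and a local dump of catalog records.
ia  <- data.frame(title  = c("The Origin of Species", "Unknown Tract"),
                  author = c("Darwin, Charles", "Anonymous"),
                  year   = c(1859, 1888), stringsAsFactors = FALSE)
loc <- data.frame(title  = "the origin of species",
                  author = "darwin charles",
                  year   = 1859, call.number = "QH365",
                  stringsAsFactors = FALSE)

# Build a composite key on both sides and look up call numbers.
ia$key  <- paste(normalize(ia$title), normalize(ia$author), ia$year)
loc$key <- paste(loc$title, loc$author, loc$year)
ia$call.number <- loc$call.number[match(ia$key, loc$key)]
ia$call.number   # unmatched books come back NA and fall through to clustering
```

The exact-key match is deliberately conservative; a fuzzier join (edit distance on titles, a year window) would catch more records at the price of more false positives.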
Sunday, November 28, 2010
Top ten authors
Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously--I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
Using worse sources is something of a necessity for digital history. The text recognition and the metadata for a lot of the sources we use often—Google Books, JSTOR, ProQuest—are full of errors under the surface, and it's OK for us to work with such data in the open. The historical profession doesn't have any small-ish corpora we would be interested in analyzing again and again. That isn't true of English departments, which seem to be well ahead of historians in computer-assisted text analysis and have the luxury of emerging curated text sources like the one Martin Mueller describes here.
But the side effect of that is that we need to be careful about understanding what we're working with. So I'm running periodic checks on the data in my corpus of books by major American publishers (described in more detail earlier) to see what's in there. I thought I'd post the list of the top twenty authors, because I found it surprising, though not in a bad way. We'll go from the no. 20 ranking on up, because that's how they do it on sports blogs. (What I really should do is a slideshow, to increase pageviews.) I'll identify the less famous names.
Clustering isms together
In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary-studies text-analysis blogs, I see these done quite often to classify works by their vocabulary. I haven't seen it done much with individual words, though, but it works fairly well. I thought it might help answer Hank's question about the difference between evolutionism and darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
Here's the overall tree of the 400-ish isms, with the words removed, just to give a sense. We can cut the tree at any point to divide into however many groups we'd like. The top three branches essentially correspond to 1) Christian and philosophical terminology, 2) social, historical, and everything else, and 3) medical and some scientific terminology.
But you probably want to see the actual words.
Friday, November 26, 2010
Comparing usage patterns across the isms
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between "Darwinism" and "evolutionism," so:
> find.related.words("darwinism",matrix = "percent.diff", return=5)
phenomenism evolutionism revolutionism subjectivism hermaphroditism
2595.147 1967.021 1922.339 1706.679 1681.792
Phenomenism appears 2,595%—that is, about 26 times—more often in books containing "darwinism" than chance would imply. That revolutionism is so high is certainly interesting, and maybe there’s some story out there about why hermaphroditism is so high. The takeaway might be that Darwinism appears as much in philosophical literature as in scientific, which isn’t surprising.
But we don’t just have individual counts for words—we have a network of interrelated counts that lets us compare usage patterns across all the words at once. We can use that to create a somewhat different list of words related to Darwinism:
Measuring word collocation, part III
Now to the final term in my sentence from earlier—“How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about the “how much more often” part. For a while I thought this was more complicated than it is, so this post will be short and not very important.
Basically, I’m just using the percentage more often as the measuring stick—I fiddled around with standard deviations for a while, but I don’t have a good way to impute the expected variation, and percentages seem to work well enough. I do want to talk for a minute about an aspect I’ve glossed over so far—how do we measure the occurrences of a word relative to itself?
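To make the yardstick concrete, here is one way the percentage measure could be computed from simple book counts. The function name echoes the `percent.diff` matrix in the earlier post, but the implementation and argument names are my guess at the idea, not the actual code behind the blog:

```r
# Percentage by which the observed co-occurrence of words A and B exceeds
# what independence would predict, given counts of books.
percent.diff <- function(n.ab, n.a, n.b, n.total) {
  # If B were spread evenly, a share n.b/n.total of A's books would have it.
  expected <- n.a * (n.b / n.total)
  100 * (n.ab - expected) / expected
}

# Invented counts: A in 1,000 of 10,000 books, B in 100, both in 270.
percent.diff(n.ab = 270, n.a = 1000, n.b = 100, n.total = 10000)
# 10 books expected; 270 observed is 2600% more than expected
```

A value of 2,600% on this scale means "27 times the expected count," which is why 2,595% reads as roughly 26 times more often than chance in the darwinism example.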
As I think of things
Abraham Lincoln invented Thanksgiving. And I suppose this might be a good way to prove to more literal-minded students that the Victorian invention of tradition really happened. Other than that, I don't know what this means.
Measuring word collocation, part II
This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
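The simplest answer, and the one I'll lean on below, is independence: assume the two words land in books without regard to each other, so the chance a book contains both is the product of the individual chances. A toy calculation with invented counts:

```r
# Expected number of books containing both words if the words were
# distributed independently across the corpus.
n.total <- 10000   # books in the corpus
n.a     <- 1000    # books containing word A
n.b     <- 100     # books containing word B

expected.ab <- n.total * (n.a / n.total) * (n.b / n.total)
expected.ab  # about 10 books expected to contain both by chance
```

Anything far above or below that baseline is what the later measures will treat as a meaningful connection between the words.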
Thursday, November 25, 2010
Back from Moscow--where to now?
I’m back from Moscow, and with a lot of blog content from my 23-hour itinerary. I’m going to try to dole it out slowly, though, because a lot of it is dull and somewhat technical, and I think it’s best to intermix with other types of content. I think there are four things I can do here.
1. Documenting my process of building up a specific system and set of techniques for analyzing texts from the Internet Archive, and publishing an account of my tentative explorations into the structure of my system.
2. Trying to produce some chunks of writing that I could integrate into presentations (we’re talking about one in Princeton in February) and other non-blog writing.
3. Digging into the data on some major questions in American intellectual history to see whether we can get anything useful out of it.
4. Reflecting on the state of textual analysis within the digital humanities, talking about how it can be done outside of my Perl-SQL-R framework, and thinking about how to overcome some of the more gratuitous obstacles in its way.
I’m interested in all of these, but find myself most naturally writing the first two (aside from a few manifestos of type 4 written in a haze of Russia and midnight flights that will likely never see the light of day). I think my two commenters may like the latter two more.
So I think I’ll try to intersperse the large amount of type 1 that I have now with some other sorts of analysis over the next week or so. That includes a remake of the isms chart, a further look at loess curves, etc.
Tuesday, November 23, 2010
Links between words
Ties between words are one of the most important things computers can tell us about language. I already looked at one way of finding connections between words in talking about the phrase "scientific method"--the percentage of occurrences of a word that occur together with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated--I never posted anything from Russia because I couldn't get the whole system in order in my time there. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:
How often, compared to what we would expect, does a given word appear with any other given word?
In doing the math, we have to work from the back to the front, so this post is about the last part of the sentence: What does it mean to appear with another word?
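The reading I'll use is the book-level one: two words "appear with" each other when they occur somewhere in the same book. With a 0/1 book-by-word incidence matrix, the whole table of shared-book counts falls out of one matrix product. The matrix below is made up for illustration:

```r
# Four invented books, marked 1 if they contain each word.
books <- matrix(c(1, 1, 0,
                  1, 0, 1,
                  0, 1, 1,
                  1, 1, 1),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL,
                                c("darwinism", "evolutionism", "socialism")))

# crossprod(books) = t(books) %*% books: entry (i, j) counts the books
# containing both word i and word j; the diagonal is each word's bookcount.
cooc <- crossprod(books)
cooc["darwinism", "evolutionism"]  # books containing both words
```

Other readings (same page, same sentence, within n words) would just change how the incidence matrix is built; the algebra afterward stays the same.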
Thursday, November 18, 2010
More on Grafton
One more note on that Grafton quote, which I'll post below.
“The digital humanities do fantastic things,” said the eminent Princeton historian Anthony Grafton. “I’m a believer in quantification. But I don’t believe quantification can do everything. So much of humanistic scholarship is about interpretation.”
“It’s easy to forget the digital media are means and not ends,” he added.

Anne pointed out at dinner that the reason this is so frustrating is that it gives far too much credit to quantification. Grafton has tossed out all the history of science he's been so involved in and pretends that the quantitative sciences use numbers to reveal irrefutable facts about the world. I'm sure there are people who do believe that they unearth truth through elaborate cliometrics; but those oddballs are far less harmful and numerous than those who think the humanities are about 'interpretations' and the sciences about 'facts.' Again, I bet this is in some way a misstatement or misquotation. Still, it made it through because it's so representative of how a lot of the profession thinks.
(more below the break)
Wednesday, November 17, 2010
Moscow and NYTimes
I'm in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.
I don't want to comment too much on yesterday's (today's? I can't tell anymore) article about digital humanities in the New York Times, but a couple people e-mailed about it. So a couple random points:
1. Tony Grafton is, as always, magnanimous: but he makes an unfortunate distinction between "data" and "interpretation" that gives others cover to view digital humanities less charitably than he does. I shouldn't need to say this, but: the whole point of data is that it gives us new objects of interpretation. And the Grafton school of close reading, which seems to generally now involve writing a full dissertation on a single book, is also not a substitute for the full range of interpretive techniques that play on humanistic knowledge.
(more after the break)
Lumpy words
What are the most distinctive words in the nineteenth century? That's an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something useful--the words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meanings--at the very least, they might help us to analyze genre or other forms of contextual distribution.
Because of computing power limitations, I can't use all of my 200,000 words to analyze genre--I need to pick a subset that will do most of the heavy lifting. I'm doing this by finding outliers on the curve of words against books. First, I'll show you that curve again. This time, I've made both axes logarithmic, which makes it easier to fit a curve. And the box shows the subset I'm going to zoom in on for classification--dropping a) the hundred most common words, and b) any words that appear in less than 5% of all books. The first are too common to be of great use (and confuse my curve fitting), and the second are so rare I'd need too many of them to make for useful categorization.
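Here is roughly what that selection could look like in R, with simulated counts standing in for the real 200,000-word table; the thresholds match the ones above, but the variable names and simulated data are mine:

```r
# Simulate a word table: total occurrences and the number of books each
# word appears in, loosely following the kind of curve described above.
set.seed(2)
n.words <- 500
n.books <- 500
counts <- exp(runif(n.words, log(50), log(5e5)))               # occurrences
bookcounts <- pmin(n.books, counts^0.8 * exp(rnorm(n.words, 0, 0.3)))

# Drop the 100 most common words and any word in under 5% of books.
keep <- bookcounts >= 0.05 * n.books & rank(-counts) > 100

# Fit the curve on log-log axes and flag words that sit well below it,
# i.e. words concentrated in fewer books than their frequency predicts.
fit   <- lm(log(bookcounts[keep]) ~ log(counts[keep]))
lumpy <- which(keep)[residuals(fit) < -0.5]
length(lumpy)
```

The residual cutoff of -0.5 is arbitrary here; the point is just that "lumpiness" can be read off as distance below the fitted frequency-vs-bookcount curve.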
(more after the break)