Mostly a note to myself:
I think genre data would be helpful in all sorts of ways--tracking evolutionary language through different sciences, say, or finding what discourses are the earliest to use certain constructions like "focus attention." The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers--that system divides up books in all sorts of ways at various levels that I could parse. It's tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author and year from the metadata I do have, but that would miss a lot and maybe have false positives, plus the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don't trust that that would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best--just a quick sketch below.
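To make the "levels that I could parse" part concrete, here is a minimal Python sketch of splitting a call number into its nested levels. It is only an illustration under assumptions: `parse_call_number` and `TOP_LEVEL` are made-up names, the class map is deliberately partial, and everything after the class number (the Cutter and so on) is ignored.

```python
import re

# Leading class letters (1-3) followed by an optional class number.
CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d{1,4}(?:\.\d+)?)?")

# A small, partial map of top-level LC classes (illustrative, not complete).
TOP_LEVEL = {
    "B": "Philosophy, Psychology, Religion",
    "H": "Social Sciences",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
}

def parse_call_number(call_number):
    """Return (class letter, subclass letters, class number), or None if unparseable."""
    match = CALL_NUMBER.match(call_number.upper())
    if match is None:
        return None
    letters, number = match.groups()
    return letters[0], letters, float(number) if number else None

top, subclass, number = parse_call_number("QH366 .D25")
print(top, subclass, number, TOP_LEVEL.get(top, "unknown"))
```

Each level could then serve as a genre label at a different grain: the single letter for broad fields, the subclass letters for disciplines, and ranges of the number for finer topics.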
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Tuesday, November 30, 2010
Sunday, November 28, 2010
Top ten authors
Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously--I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
Using worse sources is something of a necessity for digital history. The text recognition and the metadata for a lot of the sources we use often (Google Books, JSTOR, ProQuest) are full of errors under the surface, and it's OK for us to work with such data in the open. The historical profession doesn't have any small-ish corpuses we would be interested in analyzing again and again. This isn't true of English departments, which seem to be well ahead of historians in computer-assisted text analysis, and have the luxury of emerging curated text sources like the one Martin Mueller describes here.
But the side effect of that is that we need to be careful about understanding what we're working with. So I'm running periodic checks on the data in my corpus of books by major American publishers (described more earlier) to see what's in there. I thought I'd post the list of the top twenty authors, because I found it surprising, though not in a bad way. We'll do from no. 20 ranking on up, because that's how they do it on sports blogs. (What I really should do is a slideshow to increase pageviews). I'll identify the less famous names.
Clustering isms together
In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven't seen many that cluster words rather than works, though, and it turns out to work fairly well. I thought it might help answer Hank's question about the difference between evolutionism and darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
Here's the overall tree of the 400-ish isms, with the words removed, just to give a sense. We can cut the tree at any point to divide into however many groups we'd like. The top three branches essentially correspond to 1) Christian and philosophical terminology, 2) social, historical, and everything else, and 3) medical and some scientific terminology.
But you probably want to see the actual words.
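For anyone who wants to see the mechanics of building a tree from distances and cutting it into groups, here is a minimal Python sketch. It is not the code behind the chart: the function name, the choice of average linkage, and the toy distances are all assumptions made for illustration.

```python
def cluster_into_groups(distances, k):
    """Average-linkage agglomerative clustering, stopped at k groups.

    `distances` maps frozenset({a, b}) -> distance for every pair of items.
    With average linkage, merging until k clusters remain gives the same
    groups as cutting the full tree into k branches.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    items = sorted({item for pair in distances for item in pair})
    clusters = [[item] for item in items]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        total = sum(distances[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented distances, purely for illustration.
toy_distances = {
    frozenset(("baptism", "calvinism")): 1.0,
    frozenset(("alcoholism", "rheumatism")): 1.5,
    frozenset(("baptism", "alcoholism")): 8.0,
    frozenset(("baptism", "rheumatism")): 9.0,
    frozenset(("calvinism", "alcoholism")): 8.5,
    frozenset(("calvinism", "rheumatism")): 9.5,
}
print(cluster_into_groups(toy_distances, 2))
# [['alcoholism', 'rheumatism'], ['baptism', 'calvinism']]
```

The loop is quadratic in the number of clusters at every step, which is fine for a few hundred isms but would need a proper library for anything much larger.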
Friday, November 26, 2010
Comparing usage patterns across the isms
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between "Darwinism" and evolutionism, so:
> find.related.words("darwinism", matrix = "percent.diff", return = 5)
    phenomenism    evolutionism   revolutionism    subjectivism hermaphroditism
       2595.147        1967.021        1922.339        1706.679        1681.792
Phenomenism appears 2,595%—26 times—more often in books about Darwin than chance would imply. That revolutionism is so high is certainly interesting, and maybe there’s some story out there about why hermaphroditism is so high. The takeaway might be that Darwinism appears as much in philosophical literature as scientific, which isn’t surprising.
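The lookup itself is nothing fancy. Here is a rough Python analogue of that kind of query, assuming the scores are already computed and stored as nested dictionaries; `find_related_words` and `toy_scores` are hypothetical stand-ins, not the R function used above.

```python
def find_related_words(word, scores, n=5):
    """Return the n words with the highest score relative to `word`.

    `scores` maps word -> {other word -> score}, standing in for one row
    of a percent-difference matrix.
    """
    row = scores[word]
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    return [(other, score) for other, score in ranked if other != word][:n]

# Three values copied from the output above, plus one invented low value.
toy_scores = {
    "darwinism": {
        "phenomenism": 2595.147,
        "evolutionism": 1967.021,
        "revolutionism": 1922.339,
        "criticism": 12.0,  # made up for illustration
    }
}
print(find_related_words("darwinism", toy_scores, n=2))
# [('phenomenism', 2595.147), ('evolutionism', 1967.021)]
```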
But we don’t just have individual counts for words—we have a network of interrelated meanings that lets us compare the relations across all the interrelations among words. We can use that to create a somewhat different list of words related to Darwinism:
Measuring word collocation, part III
Now to the final term in my sentence from earlier: “How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about how much more often. I thought this was more complicated than it is for a while, so this post will be short and not very important.
Basically, I’m just using the percentage of time more often as the measuring stick. I fiddled around with standard deviations for a while, but I don’t have a good way to impute expected variations, and percentages seem to work well enough. I do want to talk for a minute about an aspect that I’ve glossed over so far: how do we measure the occurrences of a word relative to itself?
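The percentage measure is simple enough to write down directly. This is a minimal Python sketch, with `percent_diff` as a made-up name; it assumes the observed and expected counts have already been worked out elsewhere.

```python
def percent_diff(observed, expected):
    """How much more often (in percent) a pair occurs than expected.

    0 means exactly as expected, 100 means twice as often,
    and -50 means half as often.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 100.0 * (observed - expected) / expected

print(percent_diff(10, 5))   # 100.0
print(percent_diff(5, 10))   # -50.0
```

One thing this makes obvious is that the scale is lopsided: a pair can be thousands of percent above expectation but never more than 100 percent below it.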
As I think of things
Abraham Lincoln invented Thanksgiving. And I suppose this might be a good way to prove to more literal-minded students that the Victorian invention of tradition really happened. Other than that, I don't know what this means.
Measuring word collocation, part II
This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
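One simple baseline, and it is only an assumption for the sake of illustration here, is to treat the two words as statistically independent: if one word is in a tenth of all books, it should be in about a tenth of the books that contain the other. A minimal Python sketch, with a made-up function name and made-up numbers:

```python
def expected_cooccurrences(books_with_a, books_with_b, total_books):
    """Expected number of books containing both words if they were independent."""
    if total_books <= 0:
        raise ValueError("total_books must be positive")
    return books_with_a * books_with_b / total_books

# A word in 50 of 1,000 books and another in 100 of them
# should share about 5 books by chance alone.
print(expected_cooccurrences(50, 100, 1000))  # 5.0
```

Independence is a crude assumption for real books, since long books contain more of everything, so treat this as a starting point rather than the right answer.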
Thursday, November 25, 2010
Back from Moscow--where to now?
I’m back from Moscow, and with a lot of blog content from my 23-hour itinerary. I’m going to try to dole it out slowly, though, because a lot of it is dull and somewhat technical, and I think it’s best to intermix with other types of content. I think there are four things I can do here.
1. Document my process of building up a specific system and set of techniques for analyzing texts from the Internet Archive, and publish an account of my tentative explorations into the structure of my system.
2. Try to produce some chunks of writing that I could integrate into presentations (we’re talking about one in Princeton in February) and other non-blog writing.
3. Dig in with the data into some major questions in American intellectual history to see whether we can get anything useful out of it.
4. Reflect on the state of textual analysis within the digital humanities, talk about how it can be done outside of my Perl-SQL-R framework, and think about how to overcome some of the more gratuitous obstacles in its way.
I’m interested in all of these, but find myself most naturally writing the first two (aside from a few manifestos of type 4 written in a haze of Russia and midnight flights that will likely never see the light of day). I think my two commenters may like the latter two more.
So I think I’ll try to intersperse the large amount of type 1 that I have now with some other sorts of analysis over the next week or so. That includes a remake of the isms chart, a further look at loess curves, etc.
Tuesday, November 23, 2010
Links between words
Ties between words are one of the most important things computers can tell us about language. I already looked at one way of looking at connections between words in talking about the phrase "scientific method"--the percentage of occurrences of a word that occur with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated--I never posted anything from Russia because I couldn't get the whole system in order in my time here. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:
How often, compared to what we would expect, does a given word appear with any other given word?
In doing the math, we have to work from the back to the front, so this post is about the last part of the sentence: What does it mean to appear with another word?
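The simplest possible answer, and the one sketched here purely as an assumption, is that two words appear together when they occur in the same book. A minimal Python illustration with toy data (the function name and the four "books" are made up):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(books):
    """Count, for every pair of words, the number of books containing both.

    `books` is an iterable of word collections, one per book.
    """
    book_counts = Counter()   # word -> number of books it appears in
    pair_counts = Counter()   # (word_a, word_b) -> number of books with both
    for words in books:
        vocab = sorted(set(words))   # each word counts once per book
        book_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))
    return book_counts, pair_counts

toy_books = [
    ["darwinism", "evolutionism", "criticism"],
    ["darwinism", "evolutionism"],
    ["criticism", "baptism"],
    ["darwinism", "criticism"],
]
book_counts, pair_counts = cooccurrence_counts(toy_books)
print(book_counts["darwinism"])                     # 3
print(pair_counts[("darwinism", "evolutionism")])   # 2
```

Pairs are stored in alphabetical order, so look them up that way. Other definitions (same page, same sentence, within some number of words) would only change how the input is chunked, not the counting.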
Thursday, November 18, 2010
More on Grafton
One more note on that Grafton quote, which I'll post below.
“The digital humanities do fantastic things,” said the eminent Princeton historian Anthony Grafton. “I’m a believer in quantification. But I don’t believe quantification can do everything. So much of humanistic scholarship is about interpretation.”
“It’s easy to forget the digital media are means and not ends,” he added.
Anne pointed out at dinner that the reason this is so frustrating is that it gives far too much credit to quantification. Grafton has tossed out all the history of science he's been so involved in and pretends he thinks that the quantitative sciences use numbers to reveal irrefutable facts about the world. I'm sure there are people who do believe that they unearth truth through elaborate cliometrics; but those oddballs are far less harmful and numerous than those who think the humanities are about 'interpretations', and the sciences about 'facts.' Again, I bet that in some way this is a misstatement or misquotation. Still, it made it through because it's so representative of how a lot of the profession thinks.
(more below the break)
Wednesday, November 17, 2010
Moscow and NyTimes
I'm in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.
I don't want to comment too much on yesterday's (today's? I can't tell anymore) article about digital humanities in the New York Times, but a couple of people e-mailed about it. So a couple random points:
1. Tony Grafton is, as always, magnanimous: but he makes an unfortunate distinction between "data" and "interpretation" that gives others cover to view digital humanities less charitably than he does. I shouldn't need to say this, but: the whole point of data is that it gives us new objects of interpretation. And the Grafton school of close reading, which seems to generally now involve writing a full dissertation on a single book, is also not a substitute for the full range of interpretive techniques that play on humanistic knowledge.
(more after the break)
Lumpy words
What are the most distinctive words in the nineteenth century? That's an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something useful--the words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meanings--at the very least, they might help us to analyze genre or other forms of contextual distribution.
Because of computing power limitations, I can't use all of my 200,000 words to analyze genre--I need to pick a subset that will do most of the heavy lifting. I'm doing this by finding outliers on the curve of words against books. First, I'll show you that curve again. This time, I've made both axes logarithmic, which makes it easier to fit a curve. And the box shows the subset I'm going to zoom in on for classification--dropping a) the hundred most common words, and b) any words that appear in less than 5% of all books. The first are too common to be of great use (and confuse my curve fitting), and the second are so rare I'd need too many of them to make for useful categorization.
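Here is a minimal Python sketch of that selection step. It makes assumptions the post doesn't settle: the curve is taken to be a straight least-squares line on the log-log axes, "lumpy" is taken to mean sitting in fewer books than the line predicts, and the names and toy numbers are invented.

```python
import math

def lumpiness(word_stats, total_books, skip_top=100, min_share=0.05):
    """Rank words by how far below a log-log trend line their book counts fall.

    `word_stats` maps word -> (total occurrences, number of books containing it).
    """
    # Drop the most common words, then anything found in too few books.
    by_frequency = sorted(word_stats, key=lambda w: word_stats[w][0], reverse=True)
    kept = [w for w in by_frequency[skip_top:]
            if word_stats[w][1] >= min_share * total_books]
    if len(kept) < 2:
        return []

    # Least-squares line through (log occurrences, log books).
    xs = [math.log(word_stats[w][0]) for w in kept]
    ys = [math.log(word_stats[w][1]) for w in kept]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = covariance / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x

    # Negative residual: the word sits in fewer books than its frequency predicts.
    residuals = {w: y - (intercept + slope * x) for w, x, y in zip(kept, xs, ys)}
    return sorted(residuals, key=residuals.get)

toy_stats = {
    "the": (100000, 100),   # dropped as one of the most common words
    "house": (1000, 80),
    "tariff": (1000, 20),   # as frequent as "house" but packed into few books
    "river": (500, 60),
    "zebra": (10, 3),       # dropped: in fewer than 5% of books
}
ranking = lumpiness(toy_stats, total_books=100, skip_top=1)
print(ranking[0])  # tariff
```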
(more after the break)
Monday, November 15, 2010
Isms and ists
Hank asked for a couple of charts in the comments, so I thought I'd oblige. Since I'm starting to feel they're better at tracking the permeation of concepts, we'll use appearances per 1000 books as the y axis:
And Darwin after the break.
Similar Trends in Words
I'm going to keep looking at the list of isms, because a) they're fun; and b) the methods we use on them can be used on any group of words--for example, ones that we find are highly tied to evolution. So, let's use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?
Well, let's take a few things for granted. First, we're interested in relative trends: we want to compare all words, say, that get twice as popular during World War I, regardless of how popular they were before. It's easy to normalize for that, but we're still left with a mess of different curves. Below the jump is a very ugly chart, with the adjusted loess curves for the 300 most popular isms, with "1" being a year in which a word is used its average amount.
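The normalization is the easy part. A minimal Python sketch, assuming each word's yearly frequencies sit in a dictionary keyed by year (the function name and numbers are made up, and the loess smoothing is left out entirely):

```python
def relative_trend(yearly_rates):
    """Rescale a word's yearly frequencies so that 1.0 means an average year."""
    mean = sum(yearly_rates.values()) / len(yearly_rates)
    if mean == 0:
        raise ValueError("word never appears; nothing to normalize")
    return {year: rate / mean for year, rate in sorted(yearly_rates.items())}

# A word that triples over three years, whatever its absolute level.
print(relative_trend({1913: 2.0, 1914: 4.0, 1915: 6.0}))
# {1913: 0.5, 1914: 1.0, 1915: 1.5}
```

After this step a rare word and a common word that both double during the war end up tracing the same curve, which is what makes them comparable.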
Sunday, November 14, 2010
Century of -isms, take one
Here's a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I'm going to think about how to make it flashier, but it's pretty interesting as it is. Sample below, and full chart after the break.