Dan asks for some numbers on "capitalism" and "capitalist" similar to the ones on "Darwinism" and "Darwinist" I ran for Hank earlier. That seems like a nice big question I can use to warm up the new database I set up this week and to get some basic functionality written into it.
I'm going to go step-by-step here at some length to show just how cyclical a process this is--the computer is bad at semantic analysis, and it requires some actual knowledge of the history involved to get anything very useful out of the raw data on counts. A lot of comments on semantic analysis make it sound like it's asking computers to think for us, so I think it's worth showing that most of the R functions I'm using generally operate at a pretty low level--doing some counting, some index work, but nothing too mysterious.
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Showing posts with label isms.
Monday, December 6, 2010
Sunday, November 28, 2010
Clustering isms together
In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven't seen them used much to classify words, though, and it works fairly well. I thought it might help answer Hank's question about the difference between evolutionism and darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
Here's the overall tree of the 400-ish isms, with the words removed, just to give a sense. We can cut the tree at any point to divide into however many groups we'd like. The top three branches essentially correspond to 1) Christian and philosophical terminology, 2) social, historical, and everything else, and 3) medical and some scientific terminology.
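A minimal sketch of the tree-building and tree-cutting step, using base R's dist(), hclust(), and cutree(). The six words and their usage profiles are invented for illustration, and the choice of Euclidean distance with average linkage is an assumption, not necessarily what produced the tree described here.

```r
# Toy usage profiles for six isms (rows) across five hypothetical
# contexts (columns). All numbers are invented for illustration.
profiles <- rbind(
  calvinism   = c(9, 8, 1, 0, 1),
  methodism   = c(8, 9, 0, 1, 0),
  socialism   = c(1, 0, 9, 8, 1),
  capitalism  = c(0, 1, 8, 9, 0),
  rheumatism  = c(0, 1, 1, 0, 9),
  astigmatism = c(1, 0, 0, 1, 8)
)

# Pairwise distances between words (Euclidean is an assumption here)
distances <- dist(profiles)

# Build the tree, then cut it into three groups
tree <- hclust(distances, method = "average")
groups <- cutree(tree, k = 3)
print(groups)

# plot(tree) would draw the dendrogram with the words as leaf labels
```

Changing k in cutree() divides the same tree into more or fewer groups without re-clustering.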
But you probably want to see the actual words.
Friday, November 26, 2010
Comparing usage patterns across the isms
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between "Darwinism" and evolutionism, so:
> find.related.words("darwinism", matrix = "percent.diff", return = 5)
    phenomenism    evolutionism   revolutionism    subjectivism hermaphroditism
       2595.147        1967.021        1922.339        1706.679        1681.792
Phenomenism appears 2,595%—26 times—more often in books about Darwin than chance would imply. That revolutionism is so high is certainly interesting, and maybe there’s some story out there about why hermaphroditism is so high. The takeaway might be that Darwinism appears as much in philosophical literature as scientific, which isn’t surprising.
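For readers wondering what a lookup like this involves, here is a rough sketch of the general idea: take one word's row from a matrix of percent differences and sort it. The function name top_related and every value in the matrix are hypothetical; this is not the actual find.related.words code.

```r
# Hypothetical stand-in for a lookup like find.related.words().
# The matrix values are invented, not taken from the real data.
words <- c("darwinism", "evolutionism", "phenomenism", "methodism")
percent_diff <- matrix(
  c(  NA, 1900, 2500, -40,
    1900,   NA,  800, -10,
    2500,  800,   NA, -60,
     -40,  -10,  -60,  NA),
  nrow = 4, byrow = TRUE, dimnames = list(words, words)
)

top_related <- function(word, m, n = 5) {
  scores <- m[word, ]
  scores <- scores[!is.na(scores)]          # drop the word's entry for itself
  head(sort(scores, decreasing = TRUE), n)  # most over-represented first
}

result <- top_related("darwinism", percent_diff, n = 2)
print(result)
```

Nothing here goes beyond indexing and sorting, which is the low-level kind of work most of these functions do.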
But we don’t just have individual counts for words—we have a network of interrelated meanings that lets us compare the relations across all the interrelations among words. We can use that to create a somewhat different list of words related to Darwinism:
Measuring word collocation, part III
Now to the final term in my sentence from earlier: “How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about “how much more often.” I thought this was more complicated than it is for a while, so this post will be short and not very important.
Basically, I’m just using the percentage by which a pair appears more often than expected as the measuring stick. I fiddled around with standard deviations for a while, but I don’t have a good way to impute expected variations, and percentages seem to work well enough. I do want to talk for a minute about an aspect that I’ve glossed over so far: how do we measure the occurrences of a word relative to itself?
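The percentage measure can be written in one line. This sketch assumes the usual reading of "percent more often than expected", namely 100 * (observed - expected) / expected; the function name percent_more and the counts are made up for illustration.

```r
# Percent by which an observed count exceeds (or falls short of) the
# expected count. The formula is an assumed, standard reading of
# "percentage more often"; the name percent_more is hypothetical.
percent_more <- function(observed, expected) {
  100 * (observed - expected) / expected
}

# Invented example: a pair expected together in 4 books, found in 30
over <- percent_more(observed = 30, expected = 4)

# Invented example: a pair expected in 4 books, found in only 2
under <- percent_more(observed = 2, expected = 4)

print(c(over = over, under = under))
```

On this scale, 0 means a pair appears exactly as often as expected, and a value of 2,595 corresponds to about 26 times more often than expected.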
Measuring word collocation, part II
This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
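One simple baseline for "how often do we expect words to appear together" is to assume the words are spread through the books independently of each other. That assumption is mine for the sake of the sketch, and the book counts below are invented; the posts may well use a more careful expectation.

```r
# Expected co-occurrence if words were scattered across books independently:
# expected books with both = (books with A) * (books with B) / (total books).
# Independence is an assumption here, and all counts are invented.
n_books <- 1000
books_with <- c(darwinism = 50, evolutionism = 80, methodism = 200)

expected_together <- outer(books_with, books_with) / n_books
dimnames(expected_together) <- list(names(books_with), names(books_with))
diag(expected_together) <- NA   # a word paired with itself is a separate question

print(expected_together)
```

With these numbers, darwinism and evolutionism would be expected to share 50 * 80 / 1000 = 4 books by chance alone.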
Tuesday, November 23, 2010
Links between words
Ties between words are one of the most important things computers can tell us about language. I already looked at one way of looking at connections between words in talking about the phrase "scientific method"--the percentage of occurrences of a word that occur with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated--I never posted anything from Russia because I couldn't get the whole system in order in my time here. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:
How often, compared to what we would expect, does a given word appear with any other given word?
In doing the math, we have to work from the back to the front, so this post is about the last part of the sentence: What does it mean to appear with another word?
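If "appearing with" is taken to mean "appearing in the same book", the counting itself is simple. This sketch assumes that definition and uses four invented books; the variable names are hypothetical.

```r
# Each element is the set of isms found in one (invented) book
books <- list(
  c("darwinism", "evolutionism", "materialism"),
  c("darwinism", "evolutionism"),
  c("methodism", "calvinism"),
  c("darwinism", "methodism")
)

vocab <- sort(unique(unlist(books)))

# incidence[i, j] is 1 if book i contains word j at least once, else 0
incidence <- t(sapply(books, function(b) as.integer(vocab %in% b)))
colnames(incidence) <- vocab

# together[a, b] = number of books containing both a and b;
# the diagonal holds the number of books containing each word
together <- crossprod(incidence)
print(together)
```

Other definitions (same page, same sentence, within some number of words) would change only how the incidence matrix is built, not the counting step.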
Monday, November 15, 2010
Isms and ists
Hank asked for a couple of charts in the comments, so I thought I'd oblige. Since I'm starting to feel they're better at tracking the permeation of concepts, we'll use appearances per 1000 books as the y axis:
And Darwin after the break.
Similar Trends in Words
I'm going to keep looking at the list of isms, because a) they're fun; and b) the methods we use on them can be used on any group of words--for example, ones that we find are highly tied to evolution. So, let's use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?
Well, let's take a few things for granted. First, we're interested in relative trends: we want to compare all words, say, that get twice as popular during World War I, regardless of how popular they were before. It's easy to normalize for that, but we're still left with a mess of different curves. Below the jump is a very ugly chart, with the adjusted loess curves for the 300 most popular isms, with "1" being a year in which a word is used its average amount.
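The normalization described above can be sketched for a single word as follows: divide each year's rate by the word's average so that 1 means an average year, then smooth with base R's loess(). The yearly rates are simulated and the span value is an arbitrary choice for the sketch.

```r
set.seed(1)
years <- 1830:1920

# Simulated yearly usage rate for one word: a slow rise plus noise
rate <- 5 + 0.05 * (years - 1830) + rnorm(length(years), sd = 0.5)

# Rescale so that 1 is a year in which the word is used its average amount
relative <- rate / mean(rate)

# Smooth the rescaled series; span = 0.5 is an arbitrary choice here
fit <- loess(relative ~ years, span = 0.5)
smoothed <- predict(fit)

# plot(years, relative); lines(years, smoothed) would show the curve
```

Because every word is rescaled to the same average, a rare word and a common word that both double in use produce curves of the same shape.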
Sunday, November 14, 2010
Century of -isms, take one
Here's a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I'm going to think about how to make it flashier, but it's pretty interesting as it is. Sample below, and full chart after the break.


