Sunday, November 28, 2010

Clustering isms together

In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven't seen it done much with words, though, and it works fairly well. I thought it might help answer Hank's question about the difference between evolutionism and darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
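For anyone who wants to try the mechanics, here is a minimal sketch of going from pairwise distances to a tree. It is not the code behind these charts: the post doesn't say which tool or linkage rule was used, so Python with SciPy and average linkage are assumptions, and the five-word distance matrix is invented purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Invented toy distances between five isms (NOT values from the real data).
words = ["arianism", "nestorianism", "jansenism", "jesuitism", "alcoholism"]
dist = np.array([
    [0.0, 0.4, 1.1, 1.2, 2.5],
    [0.4, 0.0, 1.0, 1.3, 2.4],
    [1.1, 1.0, 0.0, 0.5, 2.6],
    [1.2, 1.3, 0.5, 0.0, 2.7],
    [2.5, 2.4, 2.6, 2.7, 0.0],
])

# linkage() expects the condensed (upper-triangle) form of a square matrix.
condensed = squareform(dist)

# Average linkage is an assumption; "complete" or "ward" are common alternatives.
tree = linkage(condensed, method="average")

# no_plot=True returns the layout without drawing, so matplotlib is not needed.
layout = dendrogram(tree, labels=words, no_plot=True)
leaf_order = layout["ivl"]  # words in the order they would appear on the chart
print(leaf_order)
```

Each row of `tree` records one merge: the two branches joined, the distance at which they join, and the size of the new branch. Dropping `no_plot=True` (with matplotlib installed) draws the actual chart.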

Here's the overall tree of the 400-ish isms, with the words removed, just to give a sense. We can cut the tree at any point to divide into however many groups we'd like. The top three branches essentially correspond to 1) Christian and philosophical terminology, 2) social, historical, and everything else, and 3) medical and some scientific terminology.
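Cutting the tree into a chosen number of groups can be sketched like this. The data here are random stand-in points rather than the real ism distances, and the group count, the SciPy calls, and average linkage are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in data: 410 random points, NOT the real word-use vectors.
rng = np.random.default_rng(0)
points = rng.normal(size=(410, 5))

# Build the tree directly from the points (Euclidean distance, average linkage).
tree = linkage(points, method="average", metric="euclidean")

# Cut the tree so that at most n_groups flat clusters remain.
n_groups = 41
labels = fcluster(tree, t=n_groups, criterion="maxclust")

# Tally cluster sizes and pull out the members of the largest one.
sizes = Counter(labels.tolist())
largest_label, largest_size = sizes.most_common(1)[0]
members = np.flatnonzero(labels == largest_label)
print(len(sizes), "clusters; largest has", largest_size, "items")
```

Note that an average cluster size is only an average: a cut like this usually produces a few large groups and many small ones.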
But you probably want to see the actual words.
We can see a little more detail on any of these chunks. If we cut it up lower, it often makes quite a bit of sense. If I break it into 41 clusters, for example, so that they will average ten items per cluster, the largest one looks like this:

From the top, you can see three sub-clusters: one roughly of Catholic friends and enemies, one of Protestant ones, and a third that is less clear but may involve slightly more exotic heresies. The Catholic one nicely separates words connected to the first millennium (heathenism, nestorianism, arianism) from more modern movements (jesuitism, jansenism), although some words involved with the early church fathers (augustinism, manicheanism) show up in the latter cluster, presumably because they had more currency. "Sacerdotalism" and "Ritualism" show up among the Protestant words because they are important in defining Protestantism by contrast, not Catholicism: opposites are pulled together by use. There are a lot of reminders that this is classifying by use, not by meaning--I'd like to have 'montanism' and 'ultramontanism' closer together, and a number of other Protestant words, particularly those with a stronger role in American history (Puritanism, Wesleyanism), appear in quite different clusters. But there's certainly some real structure here. In some other cases, it turns up quite neatly recognizable discursive spheres:

With a few exceptions (favoritism in the first, opportunism in the second), it's pretty clear why these words are clustered. But in others, the connections are somewhat more mystifying:

None of these words are very closely related to each other (see how they branch off before distance=2, while our big Protestant cluster was all inside 1.5), but they are mostly as close as "trinitarianism" and "tritheism." The frustrating but important thing is that it's just these sorts of odd juxtapositions that can spur us in new directions. A frequent complaint about statistical humanities is that it tells us nothing we didn't know before. Well, I certainly didn't know before that there was any connection between transcendentalism and animalism, or between those two and obscurantism and humanism. There is one. Probably it's not a historically interesting one among these four--it could have to do with publishers, typos, or anything else.
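The height at which two words join can also be read off numerically instead of from the chart. A small sketch, again assuming SciPy and average linkage; the words come from this post, but the distances are made up and say nothing about the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up distances for four words (illustrative only).
words = ["trinitarianism", "tritheism", "transcendentalism", "animalism"]
dist = np.array([
    [0.0, 1.2, 2.6, 2.8],
    [1.2, 0.0, 2.7, 2.9],
    [2.6, 2.7, 0.0, 1.8],
    [2.8, 2.9, 1.8, 0.0],
])

tree = linkage(squareform(dist), method="average")

# Cophenetic distance: the height on the tree at which two leaves first join.
join_height = squareform(cophenet(tree))
print(words[2], "and", words[3], "join at", join_height[2, 3])

# Cutting at a fixed height keeps together only words that join below it.
labels = fcluster(tree, t=1.5, criterion="distance")
print(dict(zip(words, labels.tolist())))
```

With these invented numbers, the first pair joins below the 1.5 cut and stays in one cluster, while the second pair joins above it and is split apart.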

As for evolutionism vs. Darwinism, there's not much to separate them--they appear in different places, but clustered oddly among various philosophical terms. Maybe I could make sense of it if I read the chart more--it's down at the bottom if you want to try.

The isms may not be the best set to use this on. What would be? Some set of names might be interesting. But the real work might involve a related set of concepts whose connections are disputed. I've been thinking about running some kind of a retread of Dan Rodgers's "In Search of Progressivism" article, in which he basically does this same kind of cluster analysis to various strands in the language of progressive reform. Limiting the set to books in relevant categories, which would require some sort of LOC catalog information, and then taking years in the Progressive Era, we could see what sort of clustering a naive computer program does, or even start up a few clusters around words of social control, reform, etc., and see if we can modify or confirm parts of his ordering. That's a ways off, though. For now, I'll just dump the whole tree, in three segments, at the bottom of this post. I'd show the connection between them, but Blogger doesn't let me post that long an image.

1 comment:

  1. Holy Crap! My compliments to you for taking on this mammoth task! Now, all we need is definitions and a short history of each. LOL! Somebody needs to make an app.