And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress
Everybody loves dendrograms, even if they don't like statistics. Here's a famous one, from the French Encyclopedia.
That famous tree of knowledge raises two questions for me:
- Is there any good way to present a long dendrogram on a printed page or computer screen? Even professional typographers have trouble setting these graphics in ways that fit neatly on a page--all that white space under "imagination" is a white flag.
- More seriously: Can we use data about vocabulary to put this type of tree of knowledge back together? Are brute-force readings of words alone capable of getting at types of knowledge like the encyclopedists hoped for? Or will we get something more like the Library of Congress classification, or something else entirely?
As a reminder: dendrograms are like evolutionary trees, built from the ground up. The clustering algorithm finds similar pairs of genres based on word usage, then builds those pairs into larger groups. The height at which a group comes together shows how similar its members are. So for example, BX and BV, Christian denominations and Practical Theology, are about as close in their word usage as any two genres. Outside of the groups, the order is unimportant; so LA, for example, could be at the top of this chart followed by LB and LC without losing any information, and HV is no closer to LA than it is to LB even though they appear next to each other. I thought I'd try a triangular dendrogram because it sets off the groups more elegantly. The downside is that you occasionally get lines crossing, but that just accentuates how far some genres (genealogy, which I assume is all names) sit from everything else.
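For anyone who wants to see the mechanics, here is a minimal sketch of that bottom-up process in plain Python. It is not the code behind the chart: the four tiny "genres" and their texts are invented, and cosine distance with average linkage is just one reasonable assumption about how to measure similarity in word usage. In practice a library routine (SciPy's hierarchical clustering, for instance) would do this job and draw the tree as well.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical toy "genres": class code -> sample text (invented for illustration).
texts = {
    "BX": "church faith prayer bishop church doctrine",
    "BV": "prayer church worship faith sermon prayer",
    "QC": "energy force heat light energy motion",
    "QD": "acid salt heat solution acid compound",
}

def frequencies(text):
    """Relative word frequencies for one genre."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_distance(a, b):
    """1 - cosine similarity between two word-frequency dictionaries."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def cluster(texts):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (members_a, members_b, height),
    where height is the distance at which the two groups join.
    """
    freqs = {name: frequencies(t) for name, t in texts.items()}
    dist = {
        frozenset(pair): cosine_distance(freqs[pair[0]], freqs[pair[1]])
        for pair in combinations(freqs, 2)
    }

    def linkage(a, b):
        # Average distance over all cross-group pairs of genres.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in sorted(freqs)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of groups and join them.
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

merges = cluster(texts)
for a, b, height in merges:
    print(f"{'+'.join(a)} joins {'+'.join(b)} at height {height:.3f}")
```

The merge heights are what a dendrogram draws: the two toy theology classes join first and lowest, the two toy science classes join next, and the two groups only meet at the top.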
Here's what we get. You can cut straight to the analysis after the chart. But if I'm right about all this 'aid to thinking' stuff, you'll probably get more out of reading it yourself.
Not quite Diderot here, but it's somewhat interesting. Roughly, this divides into three main groups: one wouldn't be crazy to call them the sciences, the humanities, and genealogy. The social sciences are uneasily divided: the farthest-out section of the sciences group is a cluster of law, political science, and economics, while the most distinctive elements of the humanities are the various literature classes.
Most of the time, it roughly matches an intuitive ordering. The physical sciences and technology headings cluster together, as do American and European literature, the various education classifications, etc. So much we know. It's important to see what we know confirmed, in a way.
What's interesting, though, are the 'mistakes.' The splitting apart of the H's and J's matches my discomfort with that class. The jumble of Q, R, S, and T neatly obliterates distinctions between the sciences and technology, reuniting the theoretical and the applied. "LAW", a non-standard LC class that sneaks into my database, is united with KF, US law. LD, educational books about specific institutions, is dropped back into American History, where it arguably fits better.
But far and away my favorite is the cluster right in the middle, nestled among the sciences. Here it is again:
I like this not because it's a new way of categorizing, but because it's closer to our present-day ideas about high colonial historiography than to what historians of the time might have said. They wrote about the undeveloped world in the language of recreation and war, rather than in the language of history. What does it say about the unity of history, epistemologically, when the language used to describe some areas is much closer to other areas?
This is just a first stab; it doesn't prove anything. But it is a nice illustration of a broader point. I've been saying for a while that I think statistical demonstrations of standard postmodern tropes might help to make them more persuasive, and help them to be seen as more rigorous and less political. Strategically, I think things like this might be good for teaching; and conceptually, I think they're good for thinking even if they only lead us to analyze a little more why these similarities come up.
But what about the Diderot and D'Alembert clustering? I'm actually a little surprised that this clustering doesn't match up a little more closely to theirs, just because by focusing on word choice I incorporate a lot of information about verb tense and person. I would have thought that the big three categories of memory, philosophy, and poesy would stand out a little more for that reason. You can see how each of those would have a different set of verbs. I suspect if I changed the number of classes to represent those three more evenly, I could reproduce it better--all of these algorithms are extremely sensitive to initial inputs.
Still, there are some ways that the genres jump out. Here's a different representation: in PCA space (which I tried to explain a bit earlier) on the 10,000 most common words in the language. It's not perfect, but you can see pretty obvious and clear clustering by genre among the various subgenres in the LC classes. (I've used color to mark off a couple of obvious classes.) You can use the tree above or Wikipedia to get the names of the classes--I'm plotting just the call letters here. One thing I like about this is that the general letters--QRST for the sciences, e.g.--all drift towards the center of common vocabulary usage compared to the specific genres. I'll have a lot more to say about this later (although I'm always disappointed with applications of PCA space to language, which I don't think is uncommon) but for now, let me just finish by offering it as a different way of viewing the relations among genres.
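To make "PCA space" a little less mysterious, here is a rough sketch of how each genre ends up with two plotting coordinates. Everything in it is an assumption for illustration: the four call letters are real LC classes, but the four "words" and their frequencies are invented, and the plain-Python power iteration stands in for whatever library routine one would normally use (scikit-learn's PCA, say). The real version works on 10,000 word columns, not four.

```python
import math

# Hypothetical toy data: one row per genre, one column per word,
# values are invented relative frequencies.
genres = ["QC", "QD", "PR", "PS"]
words = ["energy", "acid", "love", "poem"]
matrix = [
    [0.30, 0.10, 0.02, 0.01],
    [0.12, 0.28, 0.01, 0.02],
    [0.02, 0.01, 0.25, 0.20],
    [0.01, 0.02, 0.22, 0.26],
]

def center(rows):
    """Subtract each column's mean so every word has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Word-by-word sample covariance matrix of the centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def leading_eigenvector(mat, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a vector."""
    vec = [1.0] + [0.0] * (len(mat) - 1)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in mat]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def deflate(mat, vec):
    """Remove one component's variance so the next one can be found."""
    mat_vec = [sum(m * v for m, v in zip(row, vec)) for row in mat]
    eigenvalue = sum(v * mv for v, mv in zip(vec, mat_vec))
    size = len(mat)
    return [[mat[i][j] - eigenvalue * vec[i] * vec[j] for j in range(size)]
            for i in range(size)]

def project(centered, component):
    """Score of each genre along one principal component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]

centered = center(matrix)
cov = covariance(centered)
pc1 = leading_eigenvector(cov)
pc2 = leading_eigenvector(deflate(cov, pc1))
coords = dict(zip(genres, zip(project(centered, pc1), project(centered, pc2))))

for genre, (x, y) in coords.items():
    print(f"{genre}: ({x:+.3f}, {y:+.3f})")
```

In this toy case the first coordinate separates the two science classes from the two literature classes, which is the kind of clustering the plot shows. The sign of each axis is arbitrary, so only relative positions mean anything.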