I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:
I have a professional interest in shifts in genres. But this isn't temporal--it's just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?
Well, the first thing to remember is that, just as a genre can be a single text for the purposes of pca (or any other analysis), so can a year. We can drop onto that same chart all of the years from 1822 to 1922, say, shading from yellow for the early years to red for the later ones:
But rather than make these two different sets live together, something I'll get into more in a later post, we can run a separate pca on just the years. Remember, pca finds the set of coordinates that discriminate most strongly between the inputs you give it. So here, the first principal component is a set of weightings for different words that distinguishes most strongly between years. On one side are years that use words like the following a lot:
> paste(names(sort(comps$rotation[,1])[1:20]),collapse=", ")
 "justly, circumstances, occasion, which, prudent, acknowledged, notwithstanding, disposed, rendered, commencement, circumstance, mode, pursued, excite, attended, whole, former, their, bestowed, effectually"
And on the other are years that use words like the following:
> paste(names(sort(comps$rotation[,1],decreasing=TRUE)[1:20]),collapse=", ")
 "outside, helped, background, ignored, needed, appreciation, started, back, anything, across, significance, anywhere, start, recognition, worked, out, confronted, largely, based, help"
If you click on the links to ngrams, it's clear what these are--words that change steadily in their use over time, and are therefore good at telling years apart. Here's a chart of every year's vocabulary scored by the first principal component:
There are a few aberrations—pca seems to think 1871 looks like it belongs in the 1890s, which is odd—but generally it's quite good for a system that had no idea what order the years should go in.
[Edit: Playing around with the animated version, which should go up soon, it becomes clear why 1871 is such an outlier: it has a one-year explosion of magazines in the sample, and magazines tend to use more modern language for a variety of reasons.]
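For anyone who wants to see the mechanics, here is a minimal sketch in R of the kind of analysis described above. It is not my actual pipeline: the matrix `freqs`, the word names, and the invented frequencies are all stand-ins, and I'm assuming an unscaled `prcomp` with years as rows and words as columns. The rows are shuffled first, so the pca really has no idea what order the years go in.

```r
# Toy stand-in for the real data: rows are years, columns are words.
# All frequencies here are invented for illustration.
set.seed(2)
years <- 1822:1922
n_words <- 200
t <- seq(0, 1, length.out = length(years))      # 0 in 1822, 1 in 1922

# Each hypothetical word drifts up or down at its own steady rate, plus noise.
slopes <- rnorm(n_words)
names(slopes) <- paste0("word", seq_len(n_words))
freqs <- outer(t, slopes) +
  matrix(rnorm(length(years) * n_words, sd = 0.3), nrow = length(years))
dimnames(freqs) <- list(as.character(years), names(slopes))

# Shuffle the rows so the year order is hidden from the analysis.
shuffled <- freqs[sample(nrow(freqs)), ]
comps <- prcomp(shuffled)

# Words at the two extremes of the first component (the sign is arbitrary).
low_end  <- names(sort(comps$rotation[, 1])[1:5])
high_end <- names(sort(comps$rotation[, 1], decreasing = TRUE)[1:5])
print(paste(low_end, collapse = ", "))
print(paste(high_end, collapse = ", "))

# Score every year on the first component and compare with the true order.
score1 <- comps$x[, 1]
yr <- as.integer(rownames(shuffled))
rho <- cor(score1, yr, method = "spearman")
print(rho)

# The year that sits furthest from a straight-line fit is the toy "1871".
fit <- lm(score1 ~ yr)
odd_year <- yr[which.max(abs(resid(fit)))]
print(odd_year)
```

On real data the rank correlation would presumably be lower than in this tidy simulation, and the outlier check is only one crude way to flag a year like 1871.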
So that's the first principal component. Read no more if you only care about the useful. The later ones are stranger: this is where stuff gets either useless, or really interesting, or both. The thing to remember about PCA is that it finds perpendicular dimensions in the data so that there is zero correlation between any two components. So if the first principal component finds terms that move up or down across the year horizon, subsequent ones will find patterns that have no overall temporal trend. In practice, that means the second component will be words that either peak towards the middle and are lower on the ends, or trough towards the middle and are higher on the ends. Here's a link to them on ngrams. (Remember, my data is different than ngrams, so the patterns would probably stand out better if I plotted them myself. But this is easy, and keeps me honest.)
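The zero-correlation point is easy to check on invented data. The sketch below assumes a toy vocabulary in which half the words drift steadily and the other half rise and fall around the middle of the century; the names `trend_words` and `hump_words` are hypothetical, and nothing here comes from my real sample.

```r
# Invented data: rows are years, columns are words.
set.seed(3)
years <- 1822:1922
t <- seq(0, 1, length.out = length(years))
hump <- 1 - 4 * (t - 0.5)^2                 # 0 at both ends, 1 in the middle

n <- 100
trend_words <- outer(t, rnorm(n))               # steady rise or fall
hump_words  <- outer(hump, rnorm(n, sd = 0.5))  # mid-century peak or trough
freqs <- cbind(trend_words, hump_words) +
  matrix(rnorm(length(years) * 2 * n, sd = 0.1), nrow = length(years))
rownames(freqs) <- as.character(years)

comps <- prcomp(freqs)

# Scores on different components are uncorrelated by construction.
r12 <- cor(comps$x[, 1], comps$x[, 2])

# The second component should have no overall trend, but follow the hump.
r2_trend <- cor(comps$x[, 2], years)
r2_hump  <- cor(comps$x[, 2], hump)
print(c(r12 = r12, r2_trend = r2_trend, r2_hump = r2_hump))
```

The sign of each component is arbitrary, so the second one may come out as a peak or as a trough; what matters is the arch shape and the lack of any correlation with the year.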
I think it's possible that somehow, somewhere, those numbers could be useful--you could tweak the endpoints to find temporary peaks in any given period, or troughs. Not as sophisticated as trend line fitting like they used for the suppression indices in the Science paper, but possibly much faster, which is one of the great strengths of eigenvectors, I think. Anyway, that's not really my department.
Later components lead nowhere good. On a sufficiently large dataset, I think (unless I'm missing something), the principal components from number 2 on would probably come to find words whose patterns mirror the harmonics of a string. Now, I've been saying for a while that digital tools are good for poststructuralists. But in the spirit of the big tent that seems to be the DH issue of the day, let me point out that here's a case where computers would be good for Pythagorean mystics or modern-day Spenglerians trying to eke out a living on the margins of the academy. Someone should run some tests on the ngrams sets to see if cyclical patterns occur more often than they should by chance: if so, Scienza Nuova, here we come!
Of course, I think that's all completely ridiculous. So let me extricate myself from this whole line of thinking by drawing the moral that we have to keep a pretty firm hand on the tiller as we go fishing for hypotheses, or else we can end up elevating tools and techniques over method. That said, I'm going to bank that first axis, which shows steady linguistic change over time, for later use, because I think it does show something useful.
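As a postscript on the harmonics hunch, here is a rough sketch of why I'd be wary of reading anything into the later components. It assumes, purely for illustration, that each word's frequency is an independent random walk with no cycles built in at all; that is my simplification, not a claim about real language. Even so, the first few components should come out looking like the overtones of a string, simply because the series drift smoothly.

```r
# Invented data: every "word" is a random walk across 101 years.
# There are no cycles and no shared trends in what generates these series.
set.seed(4)
n_years <- 101
n_words <- 2000
steps <- matrix(rnorm(n_years * n_words), nrow = n_years)
walks <- apply(steps, 2, cumsum)            # rows are years, columns are words

comps <- prcomp(walks)                      # centers each word, no scaling

# Compare the year scores on component k with a cosine of k half-waves.
pos <- (seq_len(n_years) - 0.5) / n_years
harmonic_match <- sapply(1:4, function(k) {
  abs(cor(comps$x[, k], cos(k * pi * pos)))
})
print(round(harmonic_match, 3))
```

If the match is close on data with no cycles in it, then finding string-like patterns in the later components of real data would not by itself say anything about history.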