In the middle of this, I will briefly veer into some odd reflections about the post-lapsarian state of language. Some people will want to skip that; maybe some others will want to skip to it.
Humanists seem to want to do different things with topic models than the computer scientists who invented them. David Blei's group at Princeton (David Mimno aside) most often seem to push LDA (I'm using topic modeling and LDA interchangeably again) as an advance in information retrieval: making large collections of text browsable by giving useful tags to the documents. When someone gives you 100,000 documents, you can 'read' the topic headings first, and then only read articles in the ones that interest you.
Probably there are people using LDA for this sort of thing. I haven't seen it much in practice, though: it just isn't very interesting* to talk about. And while this power of LDA is great for some institutions, it's not a huge selling point for the individual researcher: it's a lot of effort for something that produces almost exactly the same outcome as iterative keyword searching. Basically, you figure out what you're interested in and read the top documents in the field. If discovery is the goal, humanists would probably be better off trying to get more flexible search engines than more machinely learned ones.
*I spun around a post for a while trying to respond to Trevor Owens' post about the binary of "justification" and "discovery" by saying that really only justification matters, but I couldn't get it to cohere—obviously discovery matters in some way. That post of his is ironclad. So I'll just say here that I think conversations which are purely about discovery methods are rare, and usually uninteresting; when scholars make public avowals of their discovery methodology, they frequently do it in part as evidence for the quality of their conclusions. Even if they say they aren't. Anyhow.
So instead of building a browser for their topics, humanists like to take some or all of the topics and plot their relative occurrence over time. I could come up with at least a dozen examples: in DH, one of the most high-profile efforts like this is Rob Nelson's Mining the Dispatch. On the front page is this plot, which assigns labels to two topics and uses them to show rates of two different types of advertising.
There's an obvious affinity between plotting topic frequencies and plotting word frequencies, something dear to my heart. The most widely-used line charts of this sort are Google Ngrams. (The first time I myself read up on topic modeling was after seeing it referenced in the comments to Dan Cohen's first post about Google Ngrams.) Bookworm is obviously similar to Ngrams: it's designed to keep the Ngrams strategy of investigating trends through words, but also foreground the individual texts that underlie the patterns: it makes it natural to investigate the history of books using words, as well as the history of words using books.
Bookworm/Ngrams-type graphs and these topic-model graphs promote pretty much the same type of reflection, and share many of the same pitfalls. But one of the reasons I like the Ngrams-style approach better is that it wears its weaknesses on its sleeves. Weaknesses like: vocabulary changes, individual words don't necessarily capture the full breadth of something like "Western Marxism," any word can have multiple meanings, an individual word is much rarer.
Topic modeling seems like an appealing way to fix just these problems, by producing statistical aggregates that map the history of ideas better than any word could. Instead of dividing texts into 200,000 (or so) words, it divides them into 200-or-so topics that should be nearly as easy to cognize, but that will be much more heavily populated; the topics should map onto concepts better than words; and they avoid the ambiguity of a word like "bank" (riverbank? Bank of England?) by splitting it into different bins based on context.
So that's the upside. What's the downside? First, as I said last time, the model can go wrong in ways that the standard diagnostics I see humanists applying won't catch. (Dave Mimno points out that MALLET's diagnostic package can catch things like this, which I believe; but I'm not clear that even the humanists using topic modeling are spending much time using these.) Each individual model thus takes some serious work to get one's head around. Second, even if the model works, it's no longer possible to judge the results without investment in the statistical techniques. If I use Ngrams to argue that Ataturk's policies propelled the little city of Istanbul out of obscurity around 1930, anyone can explain why I'm an idiot. If I show a topic model I created, on the other hand, I'll have a whole trove of explanations at hand of how it doesn't match the problems you see.
Digression: Fundamentally, this starts to get into some really basic questions about modeling and magic. Permit me three paragraphs off the hook. My general stance is that quantification has always been a fundamental part of humanistic practice we shouldn't shy away from, but that there's a great deal to be said for the sort of quantification we do being simple in the sense of easily communicated or argued against. That has implications for public evidence, for private discovery, and for public discovery. I have a post somewhere in the hopper about Edward Gibbon and all of this.
I think, although I'm not sure that practice bears me out, that most of the arguments against any particular Ngram are widely accessible. That's because we know what words are: we care incredibly deeply about them, have a nearly religious veneration for words. One of the most important objections against using Ngrams for research is that words don't bear a straightforward relationship to actual concepts. And that's not just a methodological objection: the sundering of a direct relation between signifier and signified is sometimes attributed to the Fall of Man. If people are blaming Eve for something, you know it's a serious problem.
So the question is: does replacing words with topics sidestep all that mystical weight? Obviously not. Are—on the other hand—the results of topic modeling so appealing because they seem to lift the burden of living in a fallen world from our shoulders? Honestly, I kind of think so. One of the things that's so appealing about having computers re-arrange texts is that the arbitrariness of words, all their problems and complexities, is a fundamental fact of human life. Part of the freshness of reading machine-derived aggregates is the Edenic joy of being able to imagine what it would be like to work with pure ideas, free from the prison-house of language. It's no wonder we grab at any chance.
OK, digression over. The assumptions in LDA will create some problems, though, and an under-used way to find them is making sure to incorporate the individual words as well as the individual works into the way we analyze topic models. One reason I keep forging ahead with projects tracking anachronisms in fiction is that I'm fascinated by how regular linguistic change is. LDA, though, assumes stasis.*
*More advanced topic modeling like dynamic topic models and Topics over Time don't. But they require specific assumptions about the way discourses change that are much more challenging than those in vanilla topic modeling, which is probably one reason humanists never seem to use them. A quick run-down on why humanists shouldn't use Topics over Time is at the end of this post.
That means on the surface, topics are going to look constant: but even aside from the really obvious shifts (Petrograd->Leningrad->Petersburg) there's going to be a strong undertow of change. In any 150-year topic model, for example, the spelling of "any one" will change to "anyone," "sneaked" to "snuck," and so forth. The model is going to have to account for those changes somehow. In my experience, there tend not to be topics that straightforwardly map onto general linguistic drift, though (am I wrong?). Instead, the drift finds its place as a sort of undertow in other topics that are also thematically related in real ways. We can't see that undertow (the surface is still), but it's treacherous.
[Yes, I'm using a metaphor where I should be using a formula here. Hopefully someone with the math will shout me down if I'm seriously off base here.]
The result is, for example, that in long duration corpuses (over 40 years, say) there should start to be a pretty strong trend to split topics up over time even when they're conceptually clean; and for shorter ones (like newspapers) there may be strong forces driving cyclic patterns. For all I know, I should emphasize, the effect of this is trivial. But since humanities-topics tend to be a lot less obviously coherent than the science-topics Blei's original papers modeled, I wonder just how strong a role this drift plays.
Anyhow, I've been wondering about this for a while. After I started this post, Andrew Goldstone and Ted Underwood put together a tour-de-force post whose centerpiece is topic models of PMLA. One of their arguments is that the individual topic is too small—"interpreters really need to survey a topic model as a whole, instead of considering single topics in isolation." They argue we need to look at networks of interconnection through the whole thing in the course of looking at individual topics. This is true, and I want to underline that one implication is that it's not so easy to pull out an individual topic and chart it because—for example—there may be another set of topics that it displaces at once.
But I also want to add a reminder from the other direction, one based on changes in language over time: a single topic is also much too big a unit to understand. It presumes a lot to just take a topic as constant through time. In particular, we should be using the individual word assignments that topic models produce, not just taking the labels (as listed by the top-10 words) as intrinsically meaningful.
An experiment in lazy topic modeling.
The primary appeal of topic modeling, again, is that you get a single topic which _means_ something. So for example, one of the topics I produced would be called "grant state twain language foreign bs teachers." It's a little difficult to characterize, you can see, but probably it has something to do with education and nationalism in the late 19th century United States. Probably it falls over time. We could look at this sort of thing.
But it's actually stranger than that. One of the things I liked about shipping data was that we could see everything in the model on a chart. That's hard to do with words, but we can at least load in all the data (using the option to dump the full MALLET state at the end). But usually topic-modelers just use the document assignments and the topic descriptions, and leave all that other information on the table.
So I took that data and (again using Goldstone's scripts) matched it with the JSTOR metadata. For every topic, I split it into two pieces of equal size around the median year. If you look at this 'grant state twain' topic not as a single ordered list but as two ordered lists from different historical periods, you get a very different sense of what it means. Here I take one topic and split the words in half: one batch is before 1959, and one after. Taking the top 20 words overall, I rank them in each period, and draw lines to connect them across the middle.
The first topic would be called "grant state bs ba teachers language:" suddenly it looks basically like a topic about land-grant Universities. The second would be called "twain language mark clemens foreign:" that would appear to 'obviously' be about Mark Twain. (Why foreign? I don't know, but could come up with something. Maybe it's Innocents Abroad, or the king and duke in Huck Finn, or something. We're good at justifying things like this.) The third and fourth most common words in the first period—BS and BA, the degrees—are the 115th and 570th most common in the later period. That's a huge gap. Even though the model is predicated on finding topics that are the same through time, they end up being massively shaped by historical changes. There is presumably—buried in the lower realms of this topic—some deeper coherence. It's probably helped on by some coincidences: Twain tends to write about the same sort of states that have land grant colleges (Mississippi and Illinois do well), and he helped publish Ulysses Grant's autobiography. And probably some pre-1959 Twain writing and post-1959 land-grant college stuff manages to creep in.
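A minimal sketch of the split described above, in Python. It is not the code behind the charts in this post: it assumes the MALLET state has already been joined to the metadata and flattened into (year, word, topic) triples, and the function name and toy data are hypothetical. Note that cutting at the median token year only gives exactly equal halves when no year straddles the cut.

```python
from collections import Counter
from statistics import median_low


def split_topic_by_median_year(tokens, topic, top_n=20):
    """Rank a topic's top words separately before and after its median year.

    `tokens` is an iterable of (year, word, topic_id) triples, one per word
    token. That layout is an assumption for this sketch, not MALLET's own
    output format.
    Returns (split_year, rows), where each row is
    (word, rank_in_early_half, rank_in_late_half); a rank is None when the
    word never occurs in that half.
    """
    in_topic = [(year, word) for year, word, t in tokens if t == topic]
    if not in_topic:
        raise ValueError("no tokens assigned to topic %r" % (topic,))

    # Median year of the tokens (not of the documents) assigned to the topic.
    split_year = median_low([year for year, _ in in_topic])

    early = Counter(word for year, word in in_topic if year <= split_year)
    late = Counter(word for year, word in in_topic if year > split_year)
    overall = early + late

    def by_frequency(counter):
        # Most frequent first; ties broken alphabetically so output is stable.
        return [w for w, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]

    early_rank = {w: i + 1 for i, w in enumerate(by_frequency(early))}
    late_rank = {w: i + 1 for i, w in enumerate(by_frequency(late))}

    rows = [(w, early_rank.get(w), late_rank.get(w))
            for w in by_frequency(overall)[:top_n]]
    return split_year, rows


# Toy data, invented for illustration only.
toy_tokens = [
    (1900, "grant", 7), (1900, "grant", 7), (1910, "bs", 7), (1920, "state", 7),
    (1970, "twain", 7), (1980, "twain", 7), (1980, "clemens", 7), (1990, "state", 7),
    (1950, "dante", 3),
]
split_year, rows = split_topic_by_median_year(toy_tokens, topic=7, top_n=5)
for word, before, after in rows:
    print(word, before, after)
```

A word whose rank is high in one column and missing or very low in the other is the kind of thing the connecting lines in the chart make visible.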
Obviously I cherry-picked this example: but out of the 100-or-so English language topics I ended up with, there were plenty to choose from. Here are the worst 10:
In the middle of the top row, for example, there's a category that would have just been labeled "Jewish fiction," but that manages to switch from being largely a mishmash of Ulysses with 19th-century Russian novelists to, after 1994, something completely free from Leopold Bloom and much more interested in the Arab world. You might be tempted to draw some conclusions from that: maybe to start looking for the period when PMLA's political sympathies shifted away from the Israelis towards the Palestinians, when writing on Jewish topics, or whatever. But given that the whole point of the mathematical abstractions in LDA is stability of topics, that seems like a curious route to go: better to take some seed words and track out from them, or something similar.
Even a sample of 10 random topics, below, shows some strange things. In the middle top, for example, we get a topic anchored by a common emphasis on things "Italian": but before 1924, it seems to rely heavily on the British Museum, while after that the Museum falls away entirely in favor of Dante and Petrarch.
Words as signals
Another thing that using individual words should be good for is letting us see how individual words drift among topics. Take the word "represent," which doesn't show up as anchored in any particular topic, although clearly it has some vocabulary of its own. It drifts among several: appearing in a bin of three topics before 1960, appearing most in 'criticism meaning literary' in the 60s and 70s, and appearing overwhelmingly in 'language narrative text trans' from 1980 onwards.
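One way to compute that kind of drift, sketched in Python under the same assumption as before: that the token-level assignments are available as (year, word, topic) triples. The function name and the toy numbers are made up; the real figures come from the PMLA model.

```python
from collections import Counter, defaultdict


def word_topic_share_by_decade(tokens, word):
    """For one word, return {decade: {topic_id: share of that word's tokens}}.

    `tokens` is an iterable of (year, word, topic_id) triples; this layout is
    an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for year, w, topic in tokens:
        if w == word:
            decade = (year // 10) * 10
            counts[decade][topic] += 1

    shares = {}
    for decade in sorted(counts):
        total = sum(counts[decade].values())
        shares[decade] = {t: n / total for t, n in counts[decade].items()}
    return shares


# Toy data, invented for illustration only.
toy_tokens = [
    (1955, "represent", "topic_a"), (1958, "represent", "topic_b"),
    (1965, "represent", "criticism"), (1972, "represent", "criticism"),
    (1985, "represent", "narrative"), (1988, "represent", "narrative"),
    (1989, "represent", "criticism"), (1985, "text", "narrative"),
]
for decade, share in word_topic_share_by_decade(toy_tokens, "represent").items():
    print(decade, share)
```

Binning by decade is an arbitrary choice here; any bin width works as long as each bin holds enough tokens for the shares to mean something.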
It's harder to include words because you have to deal with some much bigger objects in memory. I don't think this presents insuperable obstacles most of the time, since almost all applications of LDA are done on what I'd call "collection-sized," not "library-sized," batches of documents. All the code for this post, with a little bit more technical explanation, is online here. Well, all the code that I wrote. But that's the vast minority, since it's mostly stuff by David Mimno and Andrew Goldstone I'm using.
Appendix: Topics over Time
I was going to write up the advantages of the two temporal topic models, but for now I only have some notes for Topics over Time. Which I will share, because I don't think humanists should be using it. Dynamic topic models seem more useful to me, but I haven't dug so deeply into the internals. I think I remember Ted Underwood saying the exact opposite somewhere.
The problem with Topics over Time is that it presumes that some statistical distributions are better than others over time. This is potentially useful, but it will wreak havoc on inference. It's a common fallacy to throw results into a model that tends to produce a certain sort of result, and then act surprised that your data can fit that particular type of model. If you run, for example, Principal Components Analysis, you'll almost always get a smooth distribution through space in the dimensions you analyze in. That doesn't mean that the real phenomenon is less batchy than we might have thought: It means you ran PCA.
Two mistakes are likely to result from these assumptions:
1) Prior distributions will lead to questionable assignments of documents from immediately before a topic's peak. For example, in Wang-McCallum's paper, they identify a topic for the "Cold War." Since that topic is quite strong from 1947 to 1989, the prior distribution assumes there must be a number of documents from 1900 to 1945. But while smooth priors cannot imagine it, there's good reason to want a Cold War topic that emerges de novo in 1945. They are pleased that it works better than the related LDA topic: but the vanilla one is also more general, not including, for instance, "soviet" in its top ten.
2) Topics that don't follow a beta distribution in their temporal pattern will be lost or split. This seems like the deal-breaker to me. There's absolutely no reason to presume that historical patterns should follow a beta distribution, not in the way there is reason to expect linguistic groupings to. Dirichlet distributions are convenient abstractions for topic-document distributions, but seem like an obviously incorrect prior for topic-year distributions. One can see this from the Ngrams data: the curves are not symmetric, but rather tend to show a brief peak followed by a long decay. I can show you a lot of camel-backed curves. Every anachronism that I found in the movie "Lincoln," for example, looked a lot like this one:
So you'll end up artificially reinforcing the already-problematic tendency of LDA to split a natural 'war' discourse into two historically conditioned ones. If you're analyzing newspapers, it will penalize election topics on some horizons because they peak in 2 to 4 year increments. And so forth.
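To make the camel-back problem concrete, here is a rough Python sketch: fit a beta distribution to the years of a two-humped topic and count the peaks. The assumptions are mine, not drawn from the Topics over Time code: the years are invented, they are rescaled to the unit interval by a helper I made up, and the beta parameters are estimated by the method of moments as a simple stand-in for whatever a real implementation does. The point is only that a single beta density has at most one interior peak, so it cannot follow both humps.

```python
import math
from collections import Counter


def rescale_years(years, start, end):
    """Map calendar years onto (0, 1), using the midpoint of each year's slot."""
    span = end - start + 1
    return [(y - start + 0.5) / span for y in years]


def fit_beta_by_moments(xs):
    """Method-of-moments estimate of beta parameters (a, b) for data in (0, 1)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("cannot fit a beta distribution to constant data")
    common = mean * (1 - mean) / var - 1
    if common <= 0:
        raise ValueError("variance too large for a beta distribution")
    return mean * common, (1 - mean) * common


def beta_pdf(x, a, b):
    """Beta density at x in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def count_interior_peaks(values):
    """Count strict local maxima, ignoring the two endpoints."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])


# Invented data: a 'war' topic with one hump in the 1910s and one in the 1940s.
years = [1915] * 5 + [1916] * 8 + [1917] * 5 + [1941] * 5 + [1942] * 8 + [1943] * 5
start, end = 1910, 1949

a, b = fit_beta_by_moments(rescale_years(years, start, end))

per_year = Counter(years)
observed = [per_year[y] for y in range(start, end + 1)]
fitted = [beta_pdf((i + 0.5) / 200, a, b) for i in range(200)]

print("fitted beta parameters:", a, b)
print("peaks in the observed counts:", count_interior_peaks(observed))
print("peaks in the fitted density:", count_interior_peaks(fitted))
```

Whatever the estimator, the fitted curve has to put its mass somewhere other than where the two humps actually are, which is the pressure that pushes the model to split the topic instead.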
So if you want to find events heavily concentrated in time, Topics over Time might be useful; but it doesn't seem like a good all-purpose solution to the problem of language drift.