Wednesday, January 9, 2013

Following up on my previous topic modeling post, I want to talk about the thing humanists most often actually do with topic models once they build them: chart the topics over time. Since I think that, although topic modeling can be very useful, there's too little skepticism about the technique, I'm venturing to provide some (even with, I'm sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat 'topics' in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.
In the middle of this, I will briefly veer into some odd reflections on the post-lapsarian state of language. Some people will want to skip that; maybe some others will want to skip to it.
Friday, November 2, 2012
When you have a MALLET, everything looks like a nail
Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.
One reason I'm interested in ship logs is that they give some distance to think about problems in reading digital texts. That's particularly true for machine learning techniques. In my last post, an appendix to the long whaling post, I talked about using K-means clustering and k-nearest neighbor methods to classify whaling voyages. But digital humanists working with texts hardly ever use k-means clustering; instead, they gravitate towards a more sophisticated form of clustering called topic modeling, particularly David Blei's LDA (so much so that I'm going to use 'LDA' and 'topic modeling' synonymously here). There's a whole genre of introductory posts out there encouraging humanists to try LDA: Scott Weingart's wraps a lot of them together, and Miriam Posner's is freshest off the presses.
So as an appendix to that appendix, I want to use ship data to think about how we use LDA. I've wondered for a while why there's such a rush to make topic modeling into the machine learning tool for historians and literature scholars. It's probably true that if you only apply one algorithm to your texts, it should be LDA. But most humanists are better off applying zero clusterings, and most of the remainder should be applying several. I haven't mastered the arcana of various flavors of topic modeling to my own satisfaction, and don't feel qualified to deliver a full-on jeremiad against its uses and abuses. Suffice it to say, my basic concerns are:
- The ease of running LDA with basic settings means humanists are too likely to take its results as 'magic', rather than interpreting them as the output of one clustering technique among many.
- The primary way of evaluating its results (confirming that the top words and texts in each topic 'make sense') ignores most of the model output and doesn't map perfectly onto the expectations we have for the topics. (A Gary King study, for example, that empirically ranked document clusterings against human judgments of 'informativeness' found Dirichlet-prior-based clustering the least effective of several methods.)
Ship data gives an interesting perspective on these problems. So, at the risk of descending into self-parody, I ran a couple of topic models on the points in the ships' logs as a way of thinking through how that clustering works. (For those who only know LDA as a text-classification system, this isn't as loony as it sounds; in computer science, the algorithm gets thrown at all sorts of unrelated data, from images to music.)
Instead of using a vocabulary of words, we can just use one of latitude-longitude points rounded to one decimal place. Each voyage is a text, and each day it spends in, say, Boston is one use of the word "42.4,-72.1". That gives us a vocabulary of 600,000 or so 'words' across 11,000 'texts', not far off from a typical topic model (although the 'texts' are short, averaging maybe 30-50 words). Unlike k-means clustering, a topic model will divide each route up among several topics, so instead of showing paths, we can only look at which points fall into which 'topic'; but a single point isn't restricted to a single topic, so New York could be part of both a hypothetical 'European trade' and a 'California trade' topic.
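To make the setup concrete, here's a minimal sketch (not the code actually used for the post, which is linked below) of what turning daily positions into point-'words' might look like in Python; the `(voyage_id, lat, lon)` record format is an assumption for illustration:

```python
from collections import defaultdict

def point_word(lat, lon):
    """Round a position to one decimal place and format it as a 'word',
    e.g. (42.41, -72.13) -> '42.4,-72.1'."""
    return f"{round(lat, 1)},{round(lon, 1)}"

def voyages_to_documents(log_entries):
    """Group daily log entries into one 'document' per voyage.

    Assumes `log_entries` is an iterable of (voyage_id, lat, lon) tuples,
    one per day at sea; each voyage becomes a bag of point-'words'."""
    docs = defaultdict(list)
    for voyage_id, lat, lon in log_entries:
        docs[voyage_id].append(point_word(lat, lon))
    return list(docs.values())

# A three-day toy voyage becomes a three-'word' document:
# voyages_to_documents([("v1", 42.41, -72.13), ("v1", 41.9, -70.2), ("v1", 41.2, -69.9)])
# -> [['42.4,-72.1', '41.9,-70.2', '41.2,-69.9']]
```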
With words, it's impossible to meaningfully convey all the data in a topic model's output. Geodata has the nice feature that we can inspect all the results in a topic by simply plotting them on a map. Essentially, 'meaning' for points can be firmly reduced to a two-dimensional space (although points have other dimensions of meaning as well), while linguistic meaning can't.
Here's the output of a model, plotted with high transparency so that a point on the map will appear black if it appears in that topic in 100 or more log entries. (The basic code to build the model and plot the results is here--dataset available on request.)
*[Map: the model's output plotted as points; click to enlarge.]*
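For readers who want to see the mechanics rather than follow the linked code, here's a rough sketch of the same idea in Python using gensim and matplotlib (not the code behind the figure above). It assumes `docs`, the voyages-as-bags-of-point-'words' from the earlier sketch; the number of topics is an arbitrary choice, and the low alpha value imitates the transparency trick, with roughly a hundred overlapping entries stacking up to black:

```python
# A minimal illustration (not the original code): fit an LDA over voyages
# treated as bags of point-'words' and plot one topic's points on a lat/lon grid.
from collections import Counter

import matplotlib.pyplot as plt
from gensim import corpora, models

def fit_point_lda(docs, num_topics=9, passes=5):
    """Fit a gensim LDA model over the point-'word' documents."""
    dictionary = corpora.Dictionary(docs)                # vocabulary of point-'words'
    corpus = [dictionary.doc2bow(doc) for doc in docs]   # one bag of points per voyage
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, passes=passes)
    return lda, dictionary

def plot_topic(lda, dictionary, docs, topic_id, top_n=20000):
    """Scatter the topic's most heavily weighted point-'words', one dot per
    log entry in the corpus, at low alpha so ~100 overlapping entries read
    as near-black (an approximation of the plot described above)."""
    counts = Counter(word for doc in docs for word in doc)   # log entries per point
    lons, lats = [], []
    for word_id, _weight in lda.get_topic_terms(topic_id, topn=top_n):
        word = dictionary[word_id]
        lat, lon = map(float, word.split(","))
        lats.extend([lat] * counts[word])
        lons.extend([lon] * counts[word])
    plt.scatter(lons, lats, s=1, color="black", alpha=0.01)
    plt.title(f"Topic {topic_id}")
    plt.show()

# e.g.: lda, dictionary = fit_point_lda(docs)
#       plot_topic(lda, dictionary, docs, topic_id=0)
```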