My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.
I've been thinking about a recent article by Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words." The paper uses Ngrams data to establish the dynamics of how new words enter natural languages. Mark Liberman argues that the bulk of change in the Ngrams corpus involves things like proper names and alphanumeric strings, rather than actual vocabulary change, which keeps the paper from being more than 'thought-provoking.' Liberman's fundamental objection is that although the authors say they are talking about 'words,' it would be better for them to describe their findings in terms of 'tokens.' Words seem good and basic, but dissolve on close inspection into a forest of inflected forms, capitals, and OCR mis-readings. So it's hard to know whether the conclusions really apply to 'words' even if they do to 'tokens.'
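To see how quickly 'words' dissolve into 'tokens,' here's a toy sketch in Python. The corpus is invented for illustration; nothing here comes from the Ngrams data or from Petersen et al.'s methods:

```python
import re

# A toy corpus illustrating the token/word gap: a naive tokenizer treats
# inflections, capitalization, and alphanumeric strings as distinct
# vocabulary items.
corpus = "The whale whales Whale 1851 whaling ship ships Ship-A"

tokens = re.findall(r"\S+", corpus)
raw_vocab = set(tokens)  # every distinct token counts as a 'word'

# Crude normalization: lowercase, drop non-alphabetic tokens, strip a
# trailing 's'. Even this rough folding collapses the vocabulary sharply.
folded_vocab = {t.lower().rstrip("s") for t in tokens if t.isalpha()}

print(len(raw_vocab), len(folded_vocab))
```

Nine distinct tokens collapse to four rough 'words'; real corpora are messier still, which is exactly why conclusions about token counts don't transfer cleanly to vocabulary.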
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Wednesday, February 6, 2013
Thursday, January 10, 2013
Crossroads
Just a quick post to point readers of this blog to my new Atlantic article on anachronisms in Kushner/Spielberg's Lincoln; and to direct Atlantic readers interested in more anachronisms over to my other blog, Prochronisms, which is currently churning on through the new season of Downton Abbey. (And to stick around here; my advanced market research shows you might like some of the posts about mapping historical shipping routes.)
Wednesday, January 9, 2013
Keeping the words in Topic Models
Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although topic modeling can be very useful, there's too little skepticism about the technique, I'm venturing to provide it (even with, I'm sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat 'topics' in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.
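For readers who haven't done this themselves: charting topics over time usually just means averaging each document's topic proportions by year. A minimal Python sketch, with invented documents and topic labels (any real model would supply these proportions):

```python
from collections import defaultdict

# Hypothetical topic-model output: (year, {topic: proportion}) per document.
docs = [
    (1850, {"whaling": 0.7, "politics": 0.3}),
    (1850, {"whaling": 0.2, "politics": 0.8}),
    (1860, {"whaling": 0.1, "politics": 0.9}),
]

# Sum proportions and document counts per year.
totals = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for year, props in docs:
    counts[year] += 1
    for topic, p in props.items():
        totals[year][topic] += p

# Average topic share per year: the usual input to a "topics over time" chart.
series = {year: {t: v / counts[year] for t, v in topics.items()}
          for year, topics in totals.items()}
```

The fragility I'm worried about lives upstream of this step: the averaging is trivial, but it silently assumes the 'topic' means the same thing in 1850 as in 1860.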
In the middle of this, I will briefly veer into some odd reflections on the post-lapsarian state of language. Some people will want to skip that; maybe some others will want to skip to it.
Thursday, November 15, 2012
Military History and data: the US Navy in World War II
A stray idea left over from my whaling series: just how much should digital humanists be flocking to military history? Obviously the field is there a bit already: the Digital Scholarship lab at Richmond in particular has a number of interesting Civil War projects, and the Valley of the Shadow is one of the archetypal digital history projects. But it's possible someone could get a lot of mileage out of doing a lot more.
There are two opportunistic reasons to think so.
1. Digital historians have always been very interested in public audiences; military history has always been one of the keenest areas of public interest.
2. The data is there for algorithmic exploration. In most countries, no organization is better at keeping structured records than the military.
And the stuff is interesting. It's easy, for example, to pull out the locations of nearly the entire US Navy, season by season, in the Pacific Theater:
Or even animate them and the less comprehensive Japanese records to show the tide of battle (America in blue, Japan in red):
Reading digital sources: a case study in ship's logs
[Temporary note, March 2015: those arriving from reddit may also be interested in this post, which has a bit more about the specific image and a few more like it.]
Digitization makes the most traditional forms of humanistic scholarship more necessary, not less. But the differences mean that we need to reinvent, not reaffirm, the way that historians do history.
This month, I've posted several different essays about ship's logs. These all grew out of a single post; so I want to wrap up the series with an introduction to the full set. The motivation for the series is that a medium-sized data set like Maury's 19th century logs (with 'merely' millions of points) lets us think through in microcosm the general problems of reading historical data. So I want in this post to walk through the various parts I've posted to date as a single essay in how we can use digital data for historical analysis.
The central conclusion is this: To do humanistic readings of digital data, we cannot rely on either traditional humanistic competency or technical expertise from the sciences. This presents a challenge for the execution of research projects on digital sources: research-center-driven models for digital humanities research, which are not uncommon, presume that traditional humanists can bring their interpretive skills to bear on sources presented by others.
We need to rejuvenate three traditional practices: first, a source criticism that explains what's in the data; second, a hermeneutics that lets us read data into a meaningful form; and third, situated argumentation that ties the data in to live questions in their field.
Wednesday, November 14, 2012
Where are the individuals in data-driven narratives?
Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.
In the central post in my whaling series, I argued data presentation offers historians an appealing avenue for historical argumentation, analogous in importance to the practice of shaping personal stories into narratives in more traditional histories. Both narratives and data presentations can appeal to a broader public than more technical parts of history like historiography; and both can be crucial in making arguments persuasive, although they rarely constitute an argument in themselves. But while narratives about people ensure that histories are fundamentally about individuals, working with data generally means we'll be dealing with aggregates of some sort. (In my case, 'voyages' by 'whaling ships'.*)
*I put those in quotation marks because, as described at greater length in the technical methodology post, what I give are only the best approximations I could get of the real categories of oceangoing voyages and of whaling ships.
This is, depending on how you look at it, either a problem or an opportunity. So I want to wrap into this longer series a slightly abstruse--technical from the social theory side rather than the algorithmic side--justification for why we might not want to linger over individual experiences.
One major reason to embrace digital history is precisely that it lets us tell stories that are fundamentally about collective actions--the 'swarm' of the whaling industry as a whole--rather than traditional subjective accounts. While it's discomforting to tell histories without individuals, that discomfort is productive for the field; we need a way to tell those histories, and we need reminders they exist. In fact, those are just the stories that historians are becoming worse and worse at telling, even as our position in society makes us need them more and more.
Friday, November 2, 2012
When you have a MALLET, everything looks like a nail
Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.
One reason I'm interested in ship logs is that they give some distance to think about problems in reading digital texts. That's particularly true for machine learning techniques. In my last post, an appendix to the long whaling post, I talked about using K-means clustering and k-nearest neighbor methods to classify whaling voyages. But digital humanists working with texts hardly ever use k-means clustering; instead, they gravitate towards a more sophisticated form of clustering called topic modeling, particularly David Blei's LDA (so much so that I'm going to use 'LDA' and 'topic modeling' synonymously here). There's a whole genre of introductory posts out there encouraging humanists to try LDA: Scott Weingart's wraps a lot of them together, and Miriam Posner's is freshest off the presses.
So as an appendix to that appendix, I want to use ship's data to think about how we use LDA. I've wondered for a while why there's such a rush to make topic modeling into the machine learning tool for historians and literature scholars. It's probably true that if you only apply one algorithm to your texts, it should be LDA. But most humanists are better off applying zero clusterings, and most of the remainder should be applying several. I haven't mastered the arcana of various flavors of topic modeling to my own satisfaction, and don't feel qualified to deliver a full-on jeremiad against its uses and abuses. Suffice it to say, my basic concerns are:
- The ease of use for LDA with basic settings means humanists are too likely to take its results as 'magic', rather than interpreting it as the output of one clustering technique among many.
- The primary way of evaluating its results (confirming that the top words and texts in each topic 'make sense') ignores most of the model output and doesn't map perfectly onto the expectations we have for the topics. (A Gary King study, for example, that empirically ranked document clusterings based on human interpretation of 'informativeness' found Dirichlet-prior-based clustering the least effective of several methods.)
Ship data gives an interesting perspective on these problems. So, at the risk of descending into self-parody, I ran a couple topic models on the points in the ship's logs as a way of thinking through how that clustering works. (For those who only know LDA as a text-classification system, this isn't as loony as it sounds; in computer science, the algorithm gets thrown at all sorts of unrelated data, from images to music).
Instead of using a vocabulary of words, we can use one of latitude-longitude points at decimal resolution. Each voyage is a text, and each day it spends in, say, Boston is one use of the word "42.4,-72.1". That gives us a vocabulary of 600,000 or so 'words' across 11,000 'texts', not far off a typical topic model (although the 'texts' are short, averaging maybe 30-50 words). Unlike k-means clustering, a topic model will divide each route up among several topics, so instead of showing paths, we can visually only look at which points fall into which 'topic'; but a single point isn't restricted to a single topic, so New York could be part of both a hypothetical 'European trade' and a 'California trade' topic.
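The encoding itself is trivially simple. A minimal Python sketch (the daily positions are invented, and this isn't the actual processing code behind the model):

```python
# Each day's position, rounded to one decimal degree, becomes a "word";
# the sequence of daily words for one voyage becomes a "text".
def position_token(lat, lon):
    return f"{round(lat, 1)},{round(lon, 1)}"

# Hypothetical daily fixes for one short voyage.
voyage = [(42.36, -71.06), (40.71, -74.01), (40.74, -74.02)]
text = [position_token(lat, lon) for lat, lon in voyage]

print(text)            # three 'word' tokens
print(len(set(text)))  # days in the same harbor collapse to one vocabulary item
```

Two days spent in the same tenth-of-a-degree cell produce the same 'word,' which is how ports end up as the high-frequency vocabulary of the corpus.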
With words, it's impossible to meaningfully convey all the data in a topic model's output. Geodata has the nice feature that we can inspect all the results in a topic by simply plotting them on a map. Essentially, 'meaning' for points can be firmly reduced to a two-dimensional space (although it has other ones as well), while linguistic meaning can't.
Here's the output of a model, plotted with high transparency so that a point on the map will appear black if it appears in that topic in 100 or more log entries. (The basic code to build the model and plot the code is here--dataset available on request).
Thursday, November 1, 2012
Machine Learning at sea
Note: this post is part 4 of my series on whaling logs and digital history. For the full overview, click here.
As part of my essay visualizing 19th-century American shipping records, I need to give a more technical appendix on machine learning: it discusses how I classified whaling vessels as an example of how supervised and unsupervised machine learning algorithms, including the ubiquitous topic modeling, can help work with historical datasets.
For context: here's my map that shows shifting whaling grounds by extracting whale voyages from the Maury datasets. Particularly near the end, you might see one or two trips that don't look like whaling voyages; they probably aren't. As with a lot of historical data, the metadata is patchy, and it's worth trying to build out from what we have to what's actually true. To supplement it, I made a few leaps of faith to pull whaling trips out of the database: here's how.
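To give a flavor of the supervised side of this, here's a hedged k-nearest-neighbor sketch in Python. The features (voyage length in days, some distance measure) and the labels are invented stand-ins for illustration, not fields from the Maury metadata:

```python
from collections import Counter
import math

# Toy labeled voyages: (features, label). Whaling voyages ran long and
# far offshore; merchant runs were short. Numbers are invented.
labeled = [
    ((900, 1200.0), "whaler"),
    ((1100, 1500.0), "whaler"),
    ((60, 300.0), "merchant"),
    ((45, 250.0), "merchant"),
]

def classify(point, k=3):
    """Label an unlabeled voyage by majority vote of its k nearest neighbors."""
    nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((1000, 1300.0)))  # falls among the long voyages
```

The leap of faith in any real version is the same one sketched here: you trust that the labeled voyages you do have are representative enough to vote on the ones you don't.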
Tuesday, October 30, 2012
Data narratives and structural histories: Melville, Maury, and American whaling
Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.
Data visualizations are like narratives: they suggest interpretations, but don't require them. A good data visualization, in fact, lets you see things the interpreter might have missed. This should make data visualization especially appealing to historians. Much of the historian's art is turning dull information into compelling narrative; visualization is useful for us because it suggests new ways of making interesting the stories we've been telling all along. In particular: data visualization lets us make historical structures immediately accessible in the same way that narratives have let us do so for stories about individual agents.
I've been looking at the ship's logs that climatologists digitize because it's a perfect case of forlorn data that might tell a more interesting story. My post on European shipping gives more of the details about how to make movies from ship's logs, but this time I want to talk about why, using a new set with about a half-century of American vessels sailing around the world. It looks like this:
I'll repost this below the break with a bit more of an explanation. First I want to ask some basic questions: If this is a narrative, what kind of story does it tell? And how compelling can a story from data alone be: is there anything left from a view so high that no individuals are present?
Thursday, October 18, 2012
Word counts rule of thumb
Here's a special post from the archives of my 'too boring for prime time' files. I wrote this a few months ago but didn't know if anyone needed it: but now I'll pull it out just for Scott Weingart, since I saw him estimating word counts using 'the,' which is exactly what this post is about. If that sounds boring to you: for heaven's sake, don't read any further.
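The rule of thumb itself is simple: 'the' typically makes up somewhere around 5-6% of running English text, so dividing a count of 'the' by that rate estimates the total word count. A sketch (the 5.5% rate is a ballpark assumption and varies by corpus and era):

```python
# Estimate a corpus's total word count from occurrences of 'the'.
# The default rate of 5.5% is an assumed ballpark, not a measured
# constant for any particular corpus.
def estimate_total_words(the_count, the_rate=0.055):
    return round(the_count / the_rate)

print(estimate_total_words(5500))  # about 100,000 words
```

The appeal is that counting one high-frequency, unambiguous token is cheap even when counting everything isn't; the danger is that the rate drifts with genre and century.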
Melville Plots
Note: this post is part III of my series on whaling logs and digital history. For the full overview, click here.
The main thrust of my big post on the Maury logs is against using them to try to tell individual stories. But in the interests of Internet Melvilleiana, there are two particular tracks I want to pull out.
The first is the Acushnet, the whaling ship Herman Melville served on for 18 months. It was there he got the bulk of his first-hand experience whaling. Melville's track winds mostly around the old American whaling grounds off the coast of South America: you can see that had he stayed aboard a bit longer, the chase for Moby Dick might have entered colder waters. (And we might have a 19th-century account of Aleutian islands as strange as the Encantadas are of the Galapagos).

Friday, October 12, 2012
Logbooks and the long history of digitization
Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.
To read the data in ship's logs we first must know where the data came from. The short answer--ICOADS--might be enough. But working with digitized books has convinced me that knowing the full provenance of your data, through all its twists and turns, is one of the most important parts of any digital humanities project.
Like most humanists, the real digitization projects I care about are books, periodicals, and archives. A major theme on this blog is the attempt to understand how particular choices in digitization history shape the books available to us.
But ship's logs are interesting because they present a wholly alternate digitization history that can help us understand the mechanics of digitization more clearly. Logs are a digitized data source that has been driving large-scale research projects for more than 150 years: because of that, they can be a useful abstraction for reflecting on what digitization means. Logbook digitization is an interesting process in its own right; the particular cast of characters--Confederate technocrats, Nazi data thieves--in the history of shipping logs is unique. But the general problems are the same as those found in other large-scale sources of data. Unless humanists intend only to work with data digitized by our own standards, we have to be better at understanding just what can go wrong.
So before I get to those Nazis, let me lay out the basic themes that the story reinforces.
Tuesday, September 25, 2012
Advertising and politics
I've now seen a paragraph about advertising in Jill Lepore's latest New Yorker piece in a few places, including Andrew Sullivan's blog. Digital history blogging should resume soon, but first some advertising history, since something weird is going on here:
Political consulting is often thought of as an offshoot of the advertising industry, but closer to the truth is that the advertising industry began as a form of political consulting. As the political scientist Stanley Kelley once explained, when modern advertising began, the big clients were just as interested in advancing a political agenda as a commercial one. Monopolies like Standard Oil and DuPont looked bad: they looked greedy and ruthless and, in the case of DuPont, which made munitions, sinister. They therefore hired advertising firms to sell the public on the idea of the large corporation, and, not incidentally, to advance pro-business legislation.
I can see why this paragraph seemed interesting enough to print. It offers a counter-intuitive spin on the role of advertising—and business in general—in the history of American politics. No one likes advertisers, no one likes political consultants, and they seem somehow connected. But although we're tempted to blame some modern debasement of politics on the over-reach of consumer culture, this suggests a much more direct approach: in fact, the subversion of politics was the goal of big industry all along, and the anti-consumerist clichés about consumerism only make us ignore that big fact.
Unfortunately, though, it has nothing to do with the actual history of advertising. Standard Oil and DuPont were not the 'big clients' of the advertising agencies, and the industry's roots have little to do with corporate image-making. For example: browse through the files, paying attention to size and year, in the portfolios of J Walter Thompson to see who was paying their bills in the 1920s and 1930s. Or just trust me: it's far and away consumer goods, companies like Quaker Oats, Lever Brothers soap, and Kraft foods.
Tuesday, July 31, 2012
The Wide World of Physics
I've been thinking more than usual lately about spatially representing the data in the various Bookworm browsers.
So in this post, I want to do two things:
First, give a quick overview of the geography of the ArXiv. This is interesting in itself--the ArXiv is the most comprehensive source of scientific papers for physics and mathematics, and plays a substantial role in some other fields. And it's good for me going forward, as a way to build up some code that can be used on other collections.
Second, to put some code online. I've been doing most of my work lately--writing as well as coding--in RStudio using Yihui Xie's fantastic Knitr package. The idea is to combine code with text to allow, simultaneously, literate programming and reproducible research. Blogger is a pain: but all the source and text for this post is up at the Rpubs site, which is a very interesting project for encouraging the sharing of research. You can go read this post there instead of here if you want code, but there are a few small changes. And the youtube clip is only available here.
The basic idea--to jump ahead a bit--is that it might be useful to create charts like the following, which show differing geographical patterns of usage. (Here, people talk about Harvard near Harvard, and Stanford near Stanford--but in Europe, Stanford seems to win out near the big particle physics projects in Italy and Switzerland.)
How we do that--and what we get from it--are both a little tricky.
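The comparison behind a chart like that reduces to something simple: for each region, the share of usage going to each term. The real analysis lives in the R/Knitr source linked above; here's a toy Python sketch with invented counts, just to show the shape of it:

```python
# For each region, which of two terms dominates? Counts are invented
# stand-ins for per-region document counts from a real corpus.
regions = {
    "Boston":   {"harvard": 120, "stanford": 30},
    "Bay Area": {"harvard": 40,  "stanford": 160},
    "Geneva":   {"harvard": 25,  "stanford": 60},
}

for name, counts in regions.items():
    total = counts["harvard"] + counts["stanford"]
    share = counts["stanford"] / total
    leader = "Stanford" if share > 0.5 else "Harvard"
    print(f"{name}: {leader} leads ({share:.0%} Stanford)")
```

Plotting those shares as colored points on a map, instead of printing them, gives the kind of chart shown below.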
Thursday, July 12, 2012
Making and publishing history in the Civil War
A follow up on my post from yesterday about whether there's more history published in times of revolution. I was saying that I thought the dataset Google uses must be counting documents of historical importance as history: because libraries tend to shelve in a way that conflates things that are about history and things that are history.
I realized after posting that the first of the two graphs in Michael Witmore and Robin Valenza's post actually shows a spike in publications of US history somewhere near 1860. (It actually looks closer to the late 1850s, but there aren't any grid lines on the chart.) Bookworm is pretty much useless in the 17th century, but it's on solid ground in the 1860s. And I've long known there was something funny going on in Bookworm around the Civil War, particularly in the History class.
So--is there more history published in the Civil War period in the Bookworm database? What kind?