I wanted to try to replicate and slightly expand Ted Underwood's recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn't, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Wednesday, February 29, 2012
Monday, February 20, 2012
Downton Abbey Anachronisms, Season Finale edition
[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
It's Monday, so let's run last night's episode of Downton Abbey through the anachronism machine. I looked for Downton Abbey anachronisms for the first time last week: using the Google Ngram dataset, I can check every two-word phrase in an episode to see if it's more common today than then. This 1) lets us find completely anachronistic phrases, which is fun; and 2) lets us see how the language has evolved, and which shows do the best job of tracking it. [Since some people care about this--don't worry, no plot spoilers below].
I'll start this with a chart of every two-word phrase that appears in the episode, just like last time. Left-to-right is overall frequency; top to bottom is over-representation. Higher up is representative of 1995 language; lower down, of 1917. Click to enlarge.
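For readers who want to see the arithmetic behind a chart like this, here is a minimal Python sketch of how the two axes could be computed for a single phrase. It is an illustration under stated assumptions, not the code behind the chart: the function name, the smoothing constant, and the choice of a base-2 log ratio and a simple average are all hypothetical, and the counts would have to come from the Ngram data for the two years being compared.

```python
import math


def chart_coordinates(count_then, total_then, count_now, total_now, smoothing=0.5):
    """Return (overall_frequency, over_representation) for one phrase.

    count_then / count_now: occurrences of the phrase in the older and the
    newer year (for example 1917 and 1995).
    total_then / total_now: total two-word phrases counted in each year.
    smoothing: small constant added to each count so that a phrase with
    zero occurrences in one year does not cause a division by zero
    (an assumption of this sketch).
    """
    freq_then = (count_then + smoothing) / total_then
    freq_now = (count_now + smoothing) / total_now

    # Horizontal axis: how common the phrase is overall (average of the two).
    overall = (freq_then + freq_now) / 2

    # Vertical axis: positive means more typical of the newer year,
    # negative means more typical of the older one.
    over_representation = math.log2(freq_now / freq_then)
    return overall, over_representation


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    x, y = chart_coordinates(10, 1_000_000, 40, 1_000_000)
    print(f"overall frequency: {x:.2e}, over-representation: {y:.2f}")
```

A log ratio is used here so that a phrase twice as common in the newer year sits as far above the midline as a phrase twice as common in the older year sits below it.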
So: how does it look?
Sunday, February 19, 2012
Second epistle to the intellectual historians
I. The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It's a rhetorically appealing position--to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there's some mystification involved--conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week--the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.
Monday, February 13, 2012
Making Downton more traditional
[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
Digital humanists like to talk about what insights about the past big data can bring. So in that spirit, let me talk about Downton Abbey for a minute. The show's popularity has led many nitpickers to draft up lists of mistakes. (Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don't belong for Language Log, NPR and the Boston Globe.) In the best British tradition, the Daily Mail even managed to cast the errors as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never ever start from the digital sources. That would be, as the dowager countess might say, untoward.
I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? Idioms are the most colorful examples, but the whole language is always changing. There must be dozens of mistakes no one else is noticing. Google has digitized so much of written language that I don't have to rely on my ear to find what sounds wrong; a computer can do that far faster and better. So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English Language, c. 1917, Downton Abbey really is.
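The basic procedure can be sketched in a few lines of Python. This is a rough illustration rather than the code I actually ran: the tokenizer is deliberately crude, and `yearly_counts` is a hypothetical in-memory dictionary standing in for whatever lookup against the Ngram data one has available. One simple test it supports is whether a phrase turns up at all in a window of years around the show's setting.

```python
import re


def bigrams(line):
    """Split a line of dialogue into lowercase two-word phrases."""
    words = re.findall(r"[a-z']+", line.lower())
    return list(zip(words, words[1:]))


def absent_in_period(phrases, yearly_counts, start, end):
    """Return the phrases with no recorded occurrences from start to end.

    yearly_counts is assumed to map a (word, word) tuple to a dict of
    {year: count}; a phrase missing from it is treated as never attested.
    """
    missing = []
    for phrase in phrases:
        counts = yearly_counts.get(phrase, {})
        if not any(counts.get(year, 0) > 0 for year in range(start, end + 1)):
            missing.append(phrase)
    return missing


if __name__ == "__main__":
    # Made-up line and counts, purely for illustration.
    line = "I'm just saying, that's all."
    toy_counts = {
        ("i'm", "just"): {1914: 3},
        ("that's", "all"): {1915: 12},
        ("saying", "that's"): {1990: 2},
    }
    print(absent_in_period(bigrams(line), toy_counts, 1912, 1921))
```

Treating punctuation as invisible means a phrase can straddle a comma or a sentence break, which a more careful version would probably want to avoid.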
The results surprised me. There are, certainly, quite a few pure anachronisms. Asking for phrases that appear in no English-language books between 1912 and 1921 gives a list of 34 anachronistic phrases this season. Sorted from most to least common in contemporary books, we get a rather boring list:
Thursday, February 2, 2012
Poor man's sentiment analysis
Though I usually work with the Bookworm database of Open Library texts, I've been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there's also a lot more that could be coming out of the Ngrams set than what I've seen in the last year.
Most humanists respond to the raw frequency measures in Google Ngrams with some bafflement. There's a lot to get excited about internally to those counts that can help answer questions we already have, but the base measure is a little foreign. If we want to know about the history of capitalism, the punctuated ascent of its Ngram only tells us so much:
It's certainly interesting that the steepest rises, in the 1930s and the 1970s, are associated with systematic worldwide crises--but that's about all I can glean from this, and it's one more thing than I get from most Ngrams. Usually, the game is just tracing individual peaks to individual events; a solitary quiz on historical events in front of the screen. Is this all the data can tell us?