[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
I've got an article up today on the Atlantic's web site about how Mad Men stacks up against historical language usage. So if you're reading this blog, go read that.
Maybe I'll add some breakouts of individual episodes later today if I get some time, but here are the overall word clouds like the ones I made for Downton Abbey. Mad Men has noticeably fewer outliers towards the top:
And the ones that are, are actually appropriate. (My dissertation actually has a bit on the origins of focus groups in the 1940s.)
Wednesday, March 21, 2012
Tuesday, March 6, 2012
Do women hide their gender by publishing under their initials?
A quick follow-up on this issue of author gender.
In my last post, I looked at first names as a rough gauge of author gender to see who is missing from libraries. This method has two obvious failings as a way of finding gender:
1) People use pseudonyms that can be of the opposite gender. (More often women writing as men, but sometimes men writing as women as well.)
2) People publish using initials. It's pretty widely known that women sometimes publish under their initials to avoid making their gender obvious.
The first problem is basically intractable without specific knowledge. (I can fix George Eliot by hand, but no other way.) The second we can actually get some data on, though. Authors are identified by their first initial alone in about 10% of the books I'm using (1905-1922, Open Library texts). It turns out we can actually figure out a little bit about what gender they are. If this is a really important phenomenon in the data, then it should show up in other ways.
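To make the two failure modes concrete, here is a minimal sketch of the kind of name-based tagging involved. It is not the code behind this post, and the NAME_GENDERS lookup is a toy stand-in for a real census-derived table of first names:

```python
import re

# Toy lookup; a real run would use a census-derived table of first names.
NAME_GENDERS = {"mary": "F", "edith": "F", "george": "M", "henry": "M"}

def classify_author(name):
    """Return 'M', 'F', 'initials', or 'unknown' for a 'Last, First ...' string."""
    parts = name.split(",", 1)
    if len(parts) < 2 or not parts[1].strip():
        return "unknown"
    first = parts[1].strip().split()[0]
    if re.fullmatch(r"[A-Z]\.?", first):   # 'G.' or 'G' alone: initials only
        return "initials"
    return NAME_GENDERS.get(first.lower(), "unknown")

for a in ["Eliot, George", "Wharton, Edith", "Chesterton, G. K.", "James, Henry"]:
    print(a, "->", classify_author(a))
# Note: 'Eliot, George' comes out male, which is exactly failure mode (1).
```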
Evidence of absence is not absence of evidence
I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement about most of the comments.
But now that I see some concerns about gender biases in big digital corpora, I do have a bit to say. Partly that I have seen nothing to make me think social prejudices played into the scanning decisions at all. Rather, Google Books, Hathi Trust, the Internet Archive, and all the other similar projects are pretty much representative of the state of academic libraries. (With strange exceptions, of course.) You can choose where to vacuum, but not what gets sucked up by the machine; likewise the companies.
Labels:
Digital Humanities,
Gender
Wednesday, February 29, 2012
Journal of Irreproduced results, vol. 1
I wanted to try to replicate and slightly expand Ted Underwood's recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn't, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.
Monday, February 20, 2012
Downton Abbey Anachronisms, Season Finale edition
[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
It's Monday, so let's run last night's episode of Downton Abbey through the anachronism machine. I looked for Downton Abbey anachronisms for the first time last week: using the Google Ngram dataset, I can check every two-word phrase in an episode to see if it's more common today than then. This 1) lets us find completely anachronistic phrases, which is fun; and 2) lets us see how the language has evolved, and what shows do the best job at it. [Since some people care about this--don't worry, no plot spoilers below].
I'll start this with a chart of every two-word phrase that appears in the episode, just like last time. Left-to-right is overall frequency; top to bottom is over-representation. Higher up is representative of 1995 language; lower down, of 1917. Click to enlarge.
So: how does it look?
Labels:
TV watch
Sunday, February 19, 2012
Second epistle to the intellectual historians
I. The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It's a rhetorically appealing position--to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there's some mystification involved--conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week--the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.
Monday, February 13, 2012
Making Downton more traditional
[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
Digital humanists like to talk about what insights big data can bring to the past. So in that spirit, let me talk about Downton Abbey for a minute. The show's popularity has led many nitpickers to draft up lists of mistakes. Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don't belong for Language Log, NPR, and the Boston Globe. In the best British tradition, the Daily Mail even managed to cast the errors as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never, ever start from the digital sources. That would be, as the dowager countess might say, untoward.
I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? Idioms are the most colorful examples, but the whole language is always changing. There must be dozens of mistakes no one else is noticing. Google has digitized so much of written language that I don't have to rely on my ear to find what sounds wrong; a computer can do that far faster and better. So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English Language, c. 1917, Downton Abbey really is.
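The basic check can be sketched in a few lines. This is a simplification rather than the original code, and `counts_1917` and `counts_1995` are assumptions: stand-ins for normalized frequencies pulled from the Google Ngram data for the two periods.

```python
import re
from collections import Counter

def bigrams(text):
    """Every two-word phrase in a script, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def anachronism_scores(script_text, counts_1917, counts_1995):
    """Rank bigrams by how over-represented they are in modern books."""
    scores = {}
    for bg, n in bigrams(script_text).items():
        then = counts_1917.get(bg, 0.0)   # frequency per billion words, c. 1917
        now = counts_1995.get(bg, 0.0)    # frequency per billion words, recent
        if then == 0 and now > 0:
            scores[bg] = float("inf")     # pure anachronism: unattested c. 1917
        elif then > 0:
            scores[bg] = now / then       # >1 means more common today
    return sorted(scores.items(), key=lambda kv: -kv[1])
```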
The results surprised me. There are, certainly, quite a few pure anachronisms. Asking for phrases that appear in no English-language books between 1912 and 1921 gives a list of 34 anachronistic phrases this season. Sorted from most to least common in contemporary books, we get a rather boring list:
Thursday, February 2, 2012
Poor man's sentiment analysis
Though I usually work with the Bookworm database of Open Library texts, I've been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there's also a lot more that could be coming out of the Ngrams set than what I've seen in the last year.
Most humanists respond to the raw frequency measures in Google Ngrams with some bafflement. There's a lot to get excited about internally to those counts that can help answer questions we already have, but the base measure is a little foreign. If we want to know about the history of capitalism, the punctuated ascent of its Ngram only tells us so much:
It's certainly interesting that the steepest rises, in the 1930s and the 1970s, are associated with systematic worldwide crises--but that's about all I can glean from this, and it's one more thing than I get from most Ngrams. Usually, the game is just tracing individual peaks to individual events; a solitary quiz on historical events in front of the screen. Is this all the data can tell us?
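For concreteness, that "base measure" is nothing more than yearly match counts divided by yearly totals; a rough sketch, with toy numbers standing in for counts pulled from the Google Ngram files:

```python
# Toy numbers only, to show what the Ngram viewer's line actually plots.
yearly_matches = {1900: 1200, 1930: 9800, 1935: 14000, 1975: 52000, 1980: 61000}
yearly_totals  = {1900: 2.1e9, 1930: 3.0e9, 1935: 3.1e9, 1975: 6.8e9, 1980: 7.2e9}

freq = {yr: yearly_matches[yr] / yearly_totals[yr] for yr in yearly_matches}
for yr, f in sorted(freq.items()):
    print(yr, f"{f * 1e6:.2f} per million words")
```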
Monday, January 30, 2012
Fixing the job market in two modest steps
Another January, another set of hand-wringing about the humanities job market. So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.
Tony Grafton and Jim Grossman launched the latest exchange with what they call a "modest proposal" for expanding professional opportunities for historians. Jesse Lemisch counters that we need to think bigger and mobilize political action. There's a big and productive disagreement there, but also a deep similarity: both agree there isn't funding inside the academy for history PhDs to find work, but think we ought to be able to get our hands on money controlled by someone else. Political pressure and encouraging words will unlock vast employment opportunities in the world of museums, archives, and other public history (Grafton) or government funded jobs programs (Lemisch). These are funny places to look for growth in a 21st-century OECD country (perhaps Bill Cronon could take the more obvious route, and make his signature initiative as AHA president creating new tenure-track jobs in the BRICs?) but the higher levels of the profession don't see much choice but to change the world.
Thursday, January 5, 2012
Practices, the periphery, and Pittsburg(h)
[This is not what I'll be saying at the AHA on Sunday morning, since I'm participating in a panel discussion with Stefan Sinclair, Tim Sherratt, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I'd start with to show how much data we have, and how little things can have different meanings at big scales...]
Spelling variations are not a bread-and-butter historical question, and with good reason. There is nothing at stake in whether someone writes "Pittsburgh" or "Pittsburg." But precisely because spelling is so arbitrary, we only change it for good reason. And so it can give insights into power, center and periphery, and transmission. One of the insights of cultural history is that the history of practices, however mundane, can be deeply rooted in the history of power and its use. So bear with me through some real arcana here; there's a bit of a payoff. Plus a map.
The set-up: until 1911, the proper spelling of Pittsburg/Pittsburgh was in flux. Wikipedia (always my go-to source for legalistic minutiae) has an exhaustive blow-by-blow, but basically, it has to do with decisions in Washington DC, not Pittsburgh itself (which has usually used the 'h'). The city was supposedly mostly "Pittsburgh" to 1891, when the new US Board on Geographic Names made it firmly "Pittsburg;" then they changed their minds and made it once again and forevermore "Pittsburgh" from 1911 on. This is kind of odd, when you think about it: the government changed the name of the eighth-largest city in the country twice in twenty years. (Harrison and Taft are not the presidents you usually think of as kings of over-reach). But it happened; people seem to have changed the addresses on their envelopes, the names on their baseball uniforms, and everything else right on cue.
Thanks to about 500,000 books from the Open Library, though, we don't have to accept this prescriptive account as the whole story; what did people actually do when they had to write about Pittsburgh?
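The underlying tabulation is simple enough to sketch. This is only an illustration, under the assumption that the corpus is available as (year, text) pairs, not the actual Bookworm pipeline:

```python
import re
from collections import defaultdict

def h_share_by_year(books):
    """books: iterable of (year, text). Returns {year: share spelled 'Pittsburgh'}."""
    with_h, total = defaultdict(int), defaultdict(int)
    for year, text in books:
        for m in re.finditer(r"\bPittsburgh?\b", text):   # matches either spelling
            total[year] += 1
            if m.group().endswith("h"):
                with_h[year] += 1
    return {y: with_h[y] / total[y] for y in total}
```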
Here's the usage in American books:
What does this tell us about how practices change?
Friday, December 16, 2011
Genre similarities
When data exploration produces Christmas-themed charts, that's a sign it's time to post again. So here's a chart and a problem.
First, the problem. One of the things I like about the posts I did on author age and vocabulary change in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn't just up and change--things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.
There's nothing fundamental about author age for this--in fact, I think it probably captures what, at least at first, I would have thought were the least interesting types of vocabulary change. But author age has two nice characteristics.
1) It's straightforwardly linear, and so can be set against publication year cleanly.
2) Librarians have been keeping track of it, pretty much accidentally, by noting the birth year of every book's author.
Neither of these attributes is that remarkable; but the combination is.
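As a rough sketch of the two-dimensional binning this makes possible (the input format here is an assumption, not the actual Bookworm schema): tally a word's frequency by publication year and by the author's birth decade.

```python
from collections import defaultdict

def usage_by_year_and_cohort(books, word):
    """books: iterable of dicts with 'year', 'author_birth', and 'tokens' (a word list)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for b in books:
        cohort = (b["author_birth"] // 10) * 10      # author's birth decade
        key = (b["year"], cohort)
        totals[key] += len(b["tokens"])
        hits[key] += sum(1 for t in b["tokens"] if t == word)
    return {k: hits[k] / totals[k] for k in totals if totals[k]}
```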
Friday, November 18, 2011
Treating texts as individuals vs. lumping them together
Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning's Log-likelihood, which is currently more widely used. I'm having trouble getting M-W running on large numbers of texts as quickly as I'd like, but I'd say that his basic contention--that Dunning log-likelihood is frequently not the best method--is definitely true, and there's a lot to like about rank-ordering tests.
Before I say anything about the specifics, though, I want to make a more general point first, about how we think about comparing groups of texts. The most important difference between these two tests rests on a much bigger question about how to treat the two corpuses we want to compare.
Are they a single long text? Or are they a collection of shorter texts, which have common elements we wish to uncover? This is a central concern for anyone who wants to algorithmically look at texts: how far can we ignore the traditional limits between texts and create what are, essentially, new documents to be analyzed? There are extremely strong reasons to think of texts in each of these ways.
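A toy sketch of the contrast (not Ted's implementation or mine): Dunning's log-likelihood treats each corpus as one long bag of words, while Mann-Whitney ranks per-document rates, so a single word-heavy book can't carry the whole comparison.

```python
import math
from scipy.stats import mannwhitneyu

def dunning_g2(count_a, total_a, count_b, total_b):
    """Log-likelihood ratio for one word across two corpora treated as single texts."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2

def mann_whitney_word(rates_a, rates_b):
    """Compare per-document rates of a word between two collections of texts."""
    return mannwhitneyu(rates_a, rates_b, alternative="two-sided")

# 'whale' in two toy corpora: same totals, very different distributions.
print(dunning_g2(count_a=120, total_a=1_000_000, count_b=30, total_b=1_000_000))
rates_a = [0.0, 0.0, 0.0, 0.0012]      # one whale-heavy book dominates corpus A
rates_b = [0.00003] * 4                # corpus B mentions it evenly
print(mann_whitney_word(rates_a, rates_b))
```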
Monday, November 14, 2011
Compare and Contrast
I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.
Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.
1. Find texts that use words, phrases, or names we're interested in.
2. Compare individual texts or groups of texts against each other.
3. Classify and cluster texts or words. (Where 'classifying' is assigning texts to predefined groups like 'US History', and 'clustering' is letting the affinities be only between the works themselves).
These aren't, to be sure, completely different. I've argued before that in some cases, full-text search is best thought of as a way to create a new classification scheme and populate it with books. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them--the ranking inside them isn't very important). Clustering algorithms are built around models of cross-group comparisons; full-text searches often have faceted group comparisons. And so on.
But as ideal types, these are different, and in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all). More recently, there's a lot of attention to 3. Scott Weingart has a good summary/literature review on topic modeling and network analysis this week--I think his synopsis that "they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.
In any case: I'm worried that in the excitement about clustering, we're not sufficiently understanding the element in between: comparisons. It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there's not the same density of research in computer science that humanists can piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production quality applications that include some forms of comparisons (WordHoard uses Dunning Log-Likelihood; I can only find relative ratios on the Tapor page). But there isn't widespread adoption, generally used methodologies for search, or anything else like that.
This is a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we do talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, 1960s and 1980s defenses of American capitalism. These are just what we teach in history---I in fact felt like I was coming up with exam or essay questions writing that last sentence.
So why isn't this a more vibrant area? (Admitting one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing).
Labels:
Comparisons,
Dunning,
Featured
Thursday, November 10, 2011
Dunning Amok
A few points following up my two posts on corpus comparison using Dunning Log-Likelihood last month. Just a bit of technique.
Ted said in the comments that he's interested in literary diction:
I've actually been thinking about Dunnings lately too. I was put in mind of it by a great article a couple of months ago by Ben Zimmer addressing the character of "literary diction" in a given period (i.e., Dunnings on a fiction corpus versus the broader corpus of works in the same period).
I'd like to incorporate a diachronic dimension to that analysis. In other words, first take a corpus of 18/19c fiction and compare it to other books published in the same period. Then, among the words that are generally overrepresented in 18/19c fiction, look for those whose degree of overrepresentation *peaks in a given period* of 10 or 20 years. Perhaps this would involve doing a kind of meta-Dunnings on the Dunnings results themselves.
I'm still thinking about this, as I come back to doing some other stuff with the Dunnings. This actually seems to me like a case where the Dunnings wouldn't be much good; so much of a Dunning score is about the sizes of the corpuses, so after an initial comparison to establish 'literary diction' (say), I think we'd just want to compare the percentages.
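The corpus-size point can be seen with toy numbers: doubling both corpora doubles the G2 statistic while the relative frequencies, and their ratio, stay put. A quick sketch, using the same G2 formula as in the earlier comparison post:

```python
import math

def g2(a, na, b, nb):
    """Dunning log-likelihood for counts a, b in corpora of sizes na, nb."""
    ea, eb = na * (a + b) / (na + nb), nb * (a + b) / (na + nb)
    return 2 * sum(c * math.log(c / e) for c, e in ((a, ea), (b, eb)) if c)

for scale in (1, 2, 10):
    a, na, b, nb = 200 * scale, 1_000_000 * scale, 50 * scale, 1_000_000 * scale
    # G2 grows with the corpora; the frequency ratio (a/na)/(b/nb) does not.
    print(scale, round(g2(a, na, b, nb), 1), (a / na) / (b / nb))
```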
Thursday, November 3, 2011
Theory First
Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She's rightly concerned that the THATcamp motto--"more hack, less yack"--promotes precisely the wrong understanding of what digital methods offer:
the whole reason DH is theoretically consequential is that the use of technical methods and tools should be making us rethink the humanities.
Cecire wants a THATcamp theory, so that the teeming DHers can better describe the implications of all the work that's going on. Ted Underwood worries that claims for the primacy of theory can be nothing more than a power play, serving to reify existing class distinctions inside the academy; but he's willing to go along with a reciprocal relation between theory and practice going forward.