<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8929346053949579231</id><updated>2012-03-09T11:56:39.669-05:00</updated><category term='LCC classes'/><category term='Historical memory'/><category term='collocation'/><category term='Online Databases'/><category term='HathiTrust'/><category term='Bookworm'/><category term='Metadata'/><category term='The Profession'/><category term='Open Library'/><category term='Howells'/><category term='Comparisons'/><category term='Dunning'/><category term='isms'/><category term='TV watch'/><category term='This Blog'/><category term='authors'/><category term='Featured'/><category term='Ngrams'/><category term='Building a Corpus'/><category term='Resources'/><category term='Evolution'/><category term='Genres'/><category term='search'/><category term='Gender'/><category term='Digital Humanities'/><category term='Literature'/><category term='Changes in language over time'/><category term='pca'/><category term='Data exploration and visualization'/><category term='capitalism'/><title type='text'>Sapping Attention</title><subtitle type='html'>Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' 
type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default?start-index=101&amp;max-results=100'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>101</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-5323102390411257274</id><published>2012-03-06T13:23:00.002-05:00</published><updated>2012-03-07T17:44:45.456-05:00</updated><title type='text'>Do women hide their gender by publishing under their initials?</title><content type='html'>A quick follow-up on &lt;a href="http://sappingattention.blogspot.com/2012/03/evidence-of-absence-is-not-absence-of.html"&gt;this issue of author gender&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In my last post, I looked at first names as a rough gauge of author gender to see who is missing from libraries. This method has two obvious failings as a way of finding gender:&lt;br /&gt;&lt;br /&gt;1) People use pseudonyms that can be of the opposite gender. (More often women writing as men, but sometimes men writing as women as well.)&lt;br /&gt;&lt;br /&gt;2) People publish using initials. It's pretty widely known that women sometimes publish under their initials to avoid making their gender obvious.&lt;br /&gt;&lt;br /&gt;The first problem is basically intractable without specific knowledge. (I can fix George Eliot by hand, but no other way.) The second we can actually get some data on, though. 
Authors are identified by their first initial alone in about 10% of the books I'm using (1905-1922, Open Library texts). It turns out we can actually figure out a little bit about what gender they are. If this is a really important phenomenon in the data, then it should show up in other ways.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here's one way to look at that. For every letter, we can find the percentage of books using only the initial for authors with that letter. So for example, 11% of all books by people whose names start with "J" (Jameses, Jessicas, etc.) are just by "J." Only 6% of those by people whose names start with "D" are.&lt;br /&gt;&lt;br /&gt;Moreover, for every letter we know from the census the real gender distribution in the population of names starting with that letter. 90% of all M's are female; 85% of all T's are male.&lt;br /&gt;&lt;br /&gt;We can combine those two, and see whether women's letters are used as initials more than men's letters are. I was hoping this might provide evidence for a whole raft of female authors in the library hiding behind their initials. But that turns out, as far as I can tell, not to be the case. 
In fact, majority-female letters are probably &lt;i&gt;less&lt;/i&gt; likely to be used instead of full names than are majority-male letters.&lt;br /&gt;[Edit--Note: size is the frequency of that letter beginning first names in the census.]&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-7gVtgrKSJrM/T1ZDuSd26jI/AAAAAAAADBE/KKSTf6yEJic/s1600/Are+women+less+likely+to+use+initials.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-7gVtgrKSJrM/T1ZDuSd26jI/AAAAAAAADBE/KKSTf6yEJic/s1600/Are+women+less+likely+to+use+initials.png" /&gt;&amp;nbsp;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;So the intuition here is that if someone's bookplate says "W. Brown," their name is most likely William or Willard; if "M. Black", it's probably Mary or Marian. If women use initials, the ratio of just "M." to Mary/Marian/Michael should be higher than that of just "W." to Willard/William/Willa. And that looks untrue.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;This might partly be a genre thing--the sciences use lots of initials, maybe, and women don't write for them. Restricting to just fiction (LC Classification PZ) reduces the effect from a strong one to a non-existent one: but there's still no evidence that women are more likely to use initials than are men. (Maybe "Elizabeths" do it a lot more than "Marys", indicating something about Catholics vs. 
WASPs?)&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Qw3s8PV5Kfg/T1ZLIgxoXEI/AAAAAAAADBM/_q4cLyStc2Y/s1600/Restricting+to+fiction,+the+pattern+is+less+strong.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-Qw3s8PV5Kfg/T1ZLIgxoXEI/AAAAAAAADBM/_q4cLyStc2Y/s1600/Restricting+to+fiction,+the+pattern+is+less+strong.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Personally, I find this a disappointing result. I went into this looking for some nice evidence that women were publishing under their initials at significant rates: enough to make me think twice about gender when pulling an initialed book off the shelves. That seems not to be the case. (Although it's worth remembering that in many cases the title on the bookplate is shorter than that in the library catalog, so it's still possible.) It's possible to come up with scenarios where it's still important--maybe Marys don't even bother using initials since their gender is obvious, and Juliets almost always do, since they know they'll be mistaken for James?--but I can't think of any really plausible ones. 
Can you?&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;But for looking at author gender of big corpuses using just names, this is a somewhat positive result, in a way; we do have to worry about the pseudonym effect, but initials seem not to particularly cloud the gender breakdown of the library in the aggregate.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-5323102390411257274?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/5323102390411257274/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/03/do-women-hide-their-gender-by.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5323102390411257274'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5323102390411257274'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/03/do-women-hide-their-gender-by.html' title='Do women hide their gender by publishing under their initials?'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-7gVtgrKSJrM/T1ZDuSd26jI/AAAAAAAADBE/KKSTf6yEJic/s72-c/Are+women+less+likely+to+use+initials.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-7800421111516226139</id><published>2012-03-06T01:20:00.000-05:00</published><updated>2012-03-07T17:45:16.519-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Gender'/><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><title type='text'>Evidence of absence is not absence of evidence</title><content type='html'>I just saw that &lt;a href="http://storify.com/ncecire/from-archival-silence-to-glorious-data?awesm=sfy.co_eNs&amp;amp;utm_campaign=&amp;amp;utm_medium=sfy.co-twitter&amp;amp;utm_source=direct-sfy.co&amp;amp;utm_content=storify-pingback"&gt;various Digital Humanists on Twitter&lt;/a&gt; were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement with most of the comments.&lt;br /&gt;&lt;br /&gt;But now that I see some concerns about gender biases in big digital corpora, I do have a bit to say. Partly that I have seen nothing to make me think social prejudices played into the scanning decisions at all. Rather, Google Books, Hathi Trust, the Internet Archive, and all the other similar projects are pretty much representative of the state of academic libraries. (With &lt;a href="http://sappingattention.blogspot.com/2011/04/in-search-of-great-white-whale.html"&gt;strange&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/01/digital-history-and-copyright-black.html"&gt;exceptions&lt;/a&gt;, of course). You can choose where to vacuum, but not what gets sucked up into the machine; likewise the companies.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I have, though, seen everything to make me think that library collections are spectacularly biased in who they collect. This is true of gender, it's true of professions, it's true of race, it's true of class. 
It's &lt;a href="http://sappingattention.blogspot.com/2011/08/word-counts.html"&gt;even true of the time in which an author writes&lt;/a&gt;. I never get tired of thinking about this. Still, this doesn't bother me much. There are lots of interesting questions that can be successfully posed to library books despite their biases; there are quite a few that can be successfully posed &lt;i&gt;because of &lt;/i&gt;their biases. The fact that a library chose to keep some books is not an inconvenient sample of some occluded truth; it's the central fact of what they, back then, chose to let us see. I deliberately say "their biases," and not "their selection biases": I try never to talk of library collections as 'samples.' Sample implies a whole; I have no idea what that would be here. Would it be every book ever written? That would show extremely similar biases against the dispossessed. Not every word ever spoken; only some are permitted to speak. Even the thoughts of historical actors are constrained by what they are permitted to know.&lt;br /&gt;&lt;br /&gt;We have to rest somewhere. In seeking the ideas bandied about in the past, the library is not only a good place to start, it is as good a place to end as any. As for its biases: there is no getting around them, but there is understanding them. That's always been a core obligation of historians. &lt;br /&gt;&lt;br /&gt;To come back to gender: those biases are a big reason why I like digital libraries. It's possible to get one's arms around the biases in the whole just a bit better; and since they resemble physical libraries so well, they tell us how we might have been misreading them. This applies to texts, but sometimes to their authors as well. Remember: we can only understand past actors insofar as they have attributes--gender, race, nationality--enumerated by the state, but it is those state categories in which we're so caught up today. 
(Perhaps to our detriment; maybe we should be looking for bias against aesthetes, or agnostics, or the victims of violence). It's only rarely possible to escape those categories even a bit to see around the big issues. Tim Sherratt's now-canonical example of &lt;a href="http://invisibleaustralians.org/"&gt;faces &lt;/a&gt;does it for real individuals. But in the aggregates, I think there's something about names. Names can tell us about gender; but they can also take us outside it.&lt;br /&gt;&lt;br /&gt;A couple weeks ago, I downloaded the 1% &lt;a href="http://www.ipums.org/"&gt;IPUMS&lt;/a&gt; sample of 1910 and 1920 US census records. Those years, unlike many others, have names for each record. I already have author names as well, from the Open Library. It's Downton Abbey all over again: I can just divide between the two sets to see what names are used too little, and what names too much.  I could guess at a headline figure on the gender breakdown of books in  libraries; something like 10% of books are by women before 1922, maybe, but let's not reduce so far yet. Let's stay with the names. If more than 50% of the holders are women, call it a female name, and vice versa.&lt;br /&gt;&lt;br /&gt;How do these names stack up? The numbers will not surprise you, but it's worth putting them down because we don't really know this about our libraries, I don't think, and we clearly want to:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-9mLXqW-TBFE/T1WWMGcLxZI/AAAAAAAADA0/muU3ghchnmI/s1600/Women+Write+fewer+books+on+average" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-9mLXqW-TBFE/T1WWMGcLxZI/AAAAAAAADA0/muU3ghchnmI/s1600/Women+Write+fewer+books+on+average" /&gt;&lt;/a&gt;&lt;br /&gt;That red line at 10^0 (ie, 1) is where a name is equally frequent in the US census, and in the Open Library list of authors. 10^1 is ten times as frequent in real life, 10^-1 is ten times as frequent in books, and so on. 
So most men's names are about 2x more common in books than in the census, and the peak of the women's names falls somewhere between 8x less common and just slightly more common.&lt;br /&gt;&lt;br /&gt;And remember, each of those distributions is built up of individual names. Let's make them dots at first. Here I've arranged each first name from left to right by overall frequency: you can still see the differing patterns of women's names, usually about 9x more common in the census, and men's names, usually about 2x as common in libraries.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-dth6K73M-dQ/T1WZfqEXvTI/AAAAAAAADA8/vXO5uRVR4ko/s1600/But+some+male+names+and+over-represented%252C+and+vice-versa" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-7XmFxRi8TZM/T1WVyuaooiI/AAAAAAAADAk/XR94e2JK83E/s1600/And+that+pattern+is+independent+of+frequency" imageanchor="1"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-7XmFxRi8TZM/T1WVyuaooiI/AAAAAAAADAk/XR94e2JK83E/s1600/And+that+pattern+is+independent+of+frequency" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;But there's no story to dots. 
Split up the genders and see the names themselves, and you get something we can talk about.&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-dth6K73M-dQ/T1WZfqEXvTI/AAAAAAAADA8/vXO5uRVR4ko/s1600/But+some+male+names+and+over-represented%252C+and+vice-versa" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-dth6K73M-dQ/T1WZfqEXvTI/AAAAAAAADA8/vXO5uRVR4ko/s1600/But+some+male+names+and+over-represented%252C+and+vice-versa" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-9mLXqW-TBFE/T1WWMGcLxZI/AAAAAAAADA0/muU3ghchnmI/s1600/Women+Write+fewer+books+on+average" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;It is hard not to conflate the names with classes of people. Mary, the eldest daughter, the most common name in the country, is about 4x more common in real life than on book title pages. Annas do worse; Graces do better.&amp;nbsp; Who are those pioneers below the line: Amy, Sara, Eleanor, and Anne? Despite all odds, they are more common among authors than among women. Clearly the census taker mostly just wrote down "Francis" for Frances before asking the gender dozens of times; but could there be a bigger story, too, about female Francises being better able to push through the doors of print?&lt;br /&gt;&lt;br /&gt;And for the men. Why is John so different from William and George? How many of those over-represented single letters, which are usually men in the census, are actually women in disguise (GEM Anscombe, JK Rowling) in the libraries? 
Are those Willies all boys, those Joes all men?&lt;br /&gt;&lt;br /&gt;I find it easy to spin out stories here about access to publishing that may or may not be true: about predominantly wealthy Eleanors allowed to write novels, about poor Irish Michaels who never finished school, about rich Yankee Williams and hardscrabble, farming Johns. It would be easy to test some of these stories, harder to test others. But we could always start. &lt;br /&gt;&lt;br /&gt;We may not care about names. But for the questions we do care about, the ones that the state has been codifying and enumerating since it came of age, the questions are even easier to answer, because that's what they've been collecting on. I suspect that the solution lies in what we already have.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-7800421111516226139?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/7800421111516226139/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/03/evidence-of-absence-is-not-absence-of.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7800421111516226139'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7800421111516226139'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/03/evidence-of-absence-is-not-absence-of.html' title='Evidence of absence is not absence of evidence'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-9mLXqW-TBFE/T1WWMGcLxZI/AAAAAAAADA0/muU3ghchnmI/s72-c/Women+Write+fewer+books+on+average' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-7648479514071000587</id><published>2012-02-29T15:43:00.001-05:00</published><updated>2012-02-29T17:22:07.752-05:00</updated><title type='text'>Journal of Irreproduced results, vol. 1</title><content type='html'>I wanted to try to replicate and slightly expand &lt;a href="http://tedunderwood.wordpress.com/2012/02/26/the-differentiation-of-literary-and-nonliterary-diction-1700-1900/"&gt;Ted Underwood's recent discussion of genre formation&lt;/a&gt; over time using the Bookworm dataset of Open Library books. I couldn't, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The result, from the Bookworm dataset, doesn't show trends as clear as Ted's, but some interesting evidence remains. To track genre formation (or really, genre persistence) I compared each of 5 Library of Congress subject headings to each other. This is using the same metric of similarity as Ted's post (Spearman correlation of the top 5,000 words) with one exception (see point 2 below). 
I also took two separate samples from each genre in each year, so I can look at how similar biography is to itself without any overlap.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-uMY5AN3Y9to/T06DX3HUE2I/AAAAAAAADAU/pKDCeZI5_u0/s1600/Without+1800.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="470" src="http://2.bp.blogspot.com/-uMY5AN3Y9to/T06DX3HUE2I/AAAAAAAADAU/pKDCeZI5_u0/s640/Without+1800.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Similarity to fiction is the chart that Ted looked at the most, and my findings are roughly consistent with his, except that I lack the 18th century, and therefore don't show the pronounced decline he does. That makes my results less interesting: but it also confirms he's looking in the right place for action on these particular areas. I should be doing this inside the social sciences, probably, if I want to capture genres in action: maybe next month sometime.&lt;br /&gt;&lt;br /&gt;Nonetheless, this also isn't as smooth as Ted's charts, for a few reasons I want to address.&lt;br /&gt;&lt;br /&gt;1. (And this is really the only substantive comment I have). I very deliberately tried to avoid any smoothing before the final step. Taking 39-year slices (or one year slices) and comparing them to other 39-year slices, or even to 1-year slices, as Ted does, means there will be substantial overlap between adjacent points, and so the data will look smoother than it actually is.&amp;nbsp; So my goal was to create subsets of books that are about 1,000,000 words, that can be directly compared to each other in the same time horizon, and that have zero overlap from one to the next. The best way to do that seemed to be about a decade at a time. The code to do this (below) is pretty hairy; it's possible that it introduces some errors. 
(I notice, off the top, that I actually use an 11 year interval and a ten-year time step, so there may occasionally be a book that ends up representing poetry twice. These will be very rare, though, and poetry compared to poetry will never have overlapping elements. I hope).&lt;br /&gt;&lt;br /&gt;2. I use word stems, instead of words, as my comparison; that loses tense differences between words, which helps differentiate between genres, but which don't generally capture the differences I'm interested in.&lt;br /&gt;&lt;br /&gt;3. My LC Subject Headings may not capture genres the same way his categories do.&lt;br /&gt;&lt;br /&gt;4. Bookworm data includes books published in a year that were written earlier. (Drama, most notably, includes Shakespeare). That will haze up trends in general, though not eliminate them.&lt;br /&gt;&lt;br /&gt;Here's the size of the results I'm using, number of books and average number of words:&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;  LCSubjectHeading averageSize number&lt;br /&gt;1        biography      143351   9815&lt;br /&gt;2            drama       36132   1377&lt;br /&gt;3          fiction       85641   5273&lt;br /&gt;4           poetry       38360   3092&lt;br /&gt;5          sermons       81050   4195&lt;/pre&gt;&lt;br /&gt;And here's my R code to make these charts, just pasted inline. I'm posting this mostly because it was fun to try to comment up a longish piece of code that never uses 'if','for', a variable named 'i', or any of the other elements of normal programming that R eschews. This unfortunately probably makes it nearly unreadable as well, since the primary action is descending up and down a complex data structure that is pieced together in various places. And this is another irreproducible result, because you need a local version of Bookworm to run it. 
But: here you go!&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;rm(list=ls())&lt;br /&gt;&lt;br /&gt;#This uses the API to Bookworm to build a data.frame with metadata for all the books matching&lt;br /&gt;#the query. The list structure is so ugly because the API uses dictionary structures&lt;br /&gt;#extensively, and R is one of the less JSON-friendly languages out there.&lt;br /&gt;#This Bookworm API lets all sorts of fun questions get asked; the reason we don't expose it,&lt;br /&gt;#oddly, is because it's _too_ powerful; it's easy to accidentally submit a query that can hang the server&lt;br /&gt;#for hours.&lt;br /&gt;&lt;br /&gt;#I"m not showing all the code here: for instance, "Rbindings.R" is a &lt;br /&gt;source("Rbindings.R")&lt;br /&gt;con=dbConnect(MySQL())&lt;br /&gt;booklist = dbGetQuery(con,&lt;br /&gt;&amp;nbsp; APIcall(&lt;br /&gt;&amp;nbsp; list("method" = "counts_query",&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Groups is the most important term here: it says that we want a data.frame with years, Library of&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Congress subject headings, and unique identifiers for all the books matching the search limits.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #We'll also get the numbers of words in each book--that's what counts_query does.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; groups = list('year',"LCSH as LCSubjectHeading",'catalog.bookid as id'),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "search_limits" = list(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; LCSH=list("Poetry","Fiction","Biography","Drama","Sermons","Sermons, English","Sermons, American"),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aLanguage=list("eng")&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;br /&gt;&amp;nbsp; )&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;#As it turns out, the database results are a little buggy, so we have to clean them&lt;br /&gt;#to make them work completely consistently. (This is MySQL's fault, mostly)&lt;br /&gt;booklist$LCSubjectHeading = gsub(",.*","",booklist$LCSubjectHeading)&lt;br /&gt;booklist$LCSubjectHeading = tolower(booklist$LCSubjectHeading)&lt;br /&gt;booklist$LCSubjectHeading = gsub(" ","",booklist$LCSubjectHeading)&lt;br /&gt;booklist$LCSubjectHeading = factor(booklist$LCSubjectHeading)&lt;br /&gt;booklist = booklist[!duplicated(booklist$id),]&lt;br /&gt;&lt;br /&gt;summary(booklist)&lt;br /&gt;#Now that we know what the books are, we just set a couple variables:&lt;br /&gt;&lt;br /&gt;#Rather than require a book for the year "1850" be written in 1850 exactly, we'll do a moving window, set at&lt;br /&gt;#5 years on either side.&lt;br /&gt;smoothing = 5&lt;br /&gt;#Ted says 1,000,000 words per set works well, so I'll do that.&lt;br /&gt;wordsPerSet = 1000000&lt;br /&gt;#Ted just takes one sample of 1,000,000 words per year, but I want to be able to check the distance of&lt;br /&gt;#(for example) history from itself, so I'm going to take two.&lt;br /&gt;samplesPerYear = 2&lt;br /&gt;&lt;br /&gt;#And here's the beginning of the meat, where we decide which books we'll be looking at.&lt;br /&gt;#ddply is a beautiful function that iterates over the categories in a data frame to produce a new data frame&lt;br /&gt;#It's part of Hadley Wickham's 'plyr' package: the 'dd' at the beginning means &lt;br /&gt;#'dataframe'-&amp;gt;'dataframe', and there are lots of other ways to use it.&lt;br /&gt;require(plyr)&lt;br /&gt;booklist[1:5,]&lt;br /&gt;samples = ddply(booklist,.(LCSubjectHeading),function(LCSHframe) {&lt;br /&gt;&amp;nbsp; #This goes by each of the subject headings, and creates a new frame only consisting &lt;br 
/&gt;&amp;nbsp; #of books with that heading. We'll create the samples from those.&lt;br /&gt;&amp;nbsp; genreSamples = ldply(seq(1800,1922,by=smoothing*2),function(year) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Now we're going to create some sample years. I just put them in by hand here.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; myframe = LCSHframe[LCSHframe$year &amp;gt; year - smoothing &amp;amp; LCSHframe$year &amp;lt; year + smoothing,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c('id','count')]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #No point even considering the large books.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; myframe = myframe[myframe$count&amp;lt;2*wordsPerSet,]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; possibleSamples = llply(1:100,function(setber) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #So, 100 times we randomly reorder these books. No reason for 100 in particular; but a hundred simulations&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #to get 2 book sets it should be pretty good.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #We're going to randomly assign those books into subcorpuses. 
I am confident this is a pretty bad way to do it.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #First determine how big the corpuses can be: they can range from a single book to twice the average number required to get &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #up to the target&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corpusSizes = 1:((wordsPerSet*2)%/%mean(myframe$count))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Then assign group numbers to individual books based on some random sampling.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; group = unlist(lapply(1:100,function(n) rep(n,times = sample(corpusSizes,1))))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Only a few of these are actually needed.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; myframe$set = group[1:nrow(myframe)]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Now ddply can easily tell us the size of each of these sets we've created&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; counts = ddply(myframe,.(set),function(frame) data.frame(setcount = sum(frame$count)))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Then I just remove all but the top two sets, ordered by how close they are to our wordcount target.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; counts=counts[order(abs(wordsPerSet-counts$setcount)),][1:2,]&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #And now I write down the root mean square distance of those two sets from the target.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #(Square each difference before averaging, so that one set over and one set under can't cancel out.)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; distance = sqrt(mean((wordsPerSet-counts$setcount)^2))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corpus = myframe[myframe$set %in% counts$set,]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Now we rename 
those two sets as '1' and '2':&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corpus$set = as.numeric(factor(corpus$set))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; list(corpus = corpus[,c('id','set')],distance=distance)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; })&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #OK, so we have 100 candidate pairs of book sets, each aiming at about 1,000,000 words. Let's narrow it down.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #You can use "[[" as a function in an apply statement to pull out just one element&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #of a list--this is frequently handy. Here I create two new objects: one the distance from the word-count target &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #for each of the booklists we've created, and one the actual books in that list.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; corpusgoodness = sapply(possibleSamples,'[[',"distance") #sapply returns a vector, not a list.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; corpuses = lapply(possibleSamples,'[[',"corpus") #lapply returns a list, which we need &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #because there can be varying numbers of bookids in each group here.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Some of these simulations produce corpuses with only 40 or 50K words; I take the &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #two that are closest to 1,000,000 words in length. 
This will probably bias us towards&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #shorter texts: I don't know that that's a problem, but it could be.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; topSamples = corpuses[[which(corpusgoodness==min(corpusgoodness))[1]]]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; topSamples$year = year&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #if the mean distance is more than 10% of the whole, just scrap the whole thing.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (min(corpusgoodness,na.rm=T)&amp;gt;wordsPerSet/10) {topSamples = data.frame()}&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #We now add a column to the frame that lets us know what year the sample&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #is for.&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #And then we return the set back out. These will get aggregated across subject headings by ddply &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #automatically&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; topSamples&lt;br /&gt;&amp;nbsp; },.progress = "text")&lt;br /&gt;})&lt;br /&gt;&lt;br /&gt;#So now we have a data.frame with four columns: bookid,year,genre,and set.&lt;br /&gt;&lt;br /&gt;#I'll fill those in with the actual top 5K words for every comparison we'd like to make by doing a ddply call that fills in the words with actual&lt;br /&gt;#examples from the database. Then we can take some summary statistics from them. 
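&lt;br /&gt;&lt;br /&gt;#A quick sanity check before the slow step. This is only a sketch: it assumes 'samples' and 'booklist' exist as built above,&lt;br /&gt;#and 'sampleSizes' is a new name used just for illustration. It totals the words in each year-genre-set sample,&lt;br /&gt;#so we can see how close each one actually lands to wordsPerSet.&lt;br /&gt;sampleSizes = ddply(merge(samples,booklist[,c('id','count')],by='id'),.(LCSubjectHeading,year,set),function(frame) {&lt;br /&gt;&amp;nbsp; data.frame(setcount = sum(frame$count))&lt;br /&gt;})&lt;br /&gt;summary(sampleSizes$setcount)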
&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;words = ddply(samples,.(year),function(yearsample) {&lt;br /&gt;&amp;nbsp; #Dividing by 'year' has ddply create a frame for each year with a list of the books in each sample &lt;br /&gt;&amp;nbsp; yearResults = ddply(yearsample,.(LCSubjectHeading),function(subjectSample) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Then--and this is the most time-consuming step--we pull out the full word counts, again using&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #the Bookworm API, for every set of bookids in each year-subject heading-set combo&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ddply(subjectSample,.(set),function(myframe) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; words = dbGetQuery(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; con,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; APIcall(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; list("method" = "counts_query",&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #This 'groups' term is where I specify what's getting returned. 
I like to use stems in&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #situations like this, so that differences in verb tenses don't affect the results.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; groups = list('words1.stem as w1'),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "search_limits" = list(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #This is where I place limitations on the words returned: myframe$bookid is a vector&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #that contains all the bookids in the 1,000,000 word sample we're looking at.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #This as.list(as.integer()) stuff is ugly.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; catalog.bookid=as.list(as.integer(myframe$id))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #To test this out, I only count words that appear immediately before the word "are"&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #word2=list("are")&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ))&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; #RMySQL tends to return character vectors, but factors are much more efficient. So we change it.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; words$w1 = factor(words$w1)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; words&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; })&lt;br /&gt;&amp;nbsp; })&lt;br /&gt;&amp;nbsp; #Here's a random problem: I select on stems, but some words don't have stems in my database&lt;br /&gt;&amp;nbsp; #(Arabic numerals, for instance). So we have to drop all those.&lt;br /&gt;&amp;nbsp; yearResults=yearResults[!is.na(yearResults$w1),]&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;&amp;nbsp; #And here's where it gets a little ugly. We want to have a data.frame that has information on each&lt;br /&gt;&amp;nbsp; #of the combinations that we can look at. One line for fiction subset 1 compared to poetry subset 2,&lt;br /&gt;&amp;nbsp; #one for biography 2 to fiction 1, and so on down the chain. 
expand.grid does this in one function call.&lt;br /&gt;&amp;nbsp; variable_combinations = expand.grid(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; genre=levels(samples$LCSubjectHeading),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; comparedTo=levels(samples$LCSubjectHeading),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; group1 = unique(samples$set),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; group2 = unique(samples$set))&lt;br /&gt;&amp;nbsp; #And now the ugliest line in this script.&lt;br /&gt;&amp;nbsp; #We don't want both fiction 1&amp;lt;-&amp;gt;poetry 2 AND poetry2&amp;lt;-&amp;gt;fiction 1 comparisons: &lt;br /&gt;&amp;nbsp; #It's the same thing in a different order. There's probably a function for this,&lt;br /&gt;&amp;nbsp; #but I just sort, paste, and drop.&lt;br /&gt;&amp;nbsp; variable_combinations = variable_combinations[!duplicated(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; apply(variable_combinations,1,function(row) paste(sort(c(paste(row[1],row[3]),paste(row[2],row[4]))),collapse="") )),]&lt;br /&gt;&amp;nbsp; #less importantly, we don't want to compare fiction 1 to fiction 1; we know they're identical.&lt;br /&gt;&amp;nbsp; #So we can drop those all out.&lt;br /&gt;&amp;nbsp; variable_combinations = variable_combinations[!(variable_combinations[,1]==variable_combinations[,2] &amp;amp; variable_combinations[,3]==variable_combinations[,4]),]&lt;br /&gt;&amp;nbsp; #Then, using merge() we wrap in the word counts for every possible genre and group category.&lt;br /&gt;&amp;nbsp; comparisons = merge(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; variable_combinations,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; yearResults,by.x=c("genre","group1"),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; by.y=c("LCSubjectHeading","set"),all=T)&lt;br /&gt;&amp;nbsp; #But that's not all: we merge again to get every comparison of genre and group. 
This produces a big data.frame&lt;br /&gt;&amp;nbsp; comparisons = merge(&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; comparisons,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; yearResults,by.x=c("comparedTo","group2","w1"),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; by.y=c("LCSubjectHeading","set","w1"),all=T)&lt;br /&gt;&amp;nbsp; #Now, it would be nice to just return this comparisons set as a whole. But it's too big for that.&lt;br /&gt;&amp;nbsp; #Instead, I'm just going to reduce it down to the top 5000 words in each set. ddply works for this, too.&lt;br /&gt;&amp;nbsp; topwords = ddply(comparisons,.(comparedTo,group2,genre,group1),function(frame){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #Note that I order on a negative result--that's often easier.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; frame = frame[order(-(frame$count.x+frame$count.y)),][1:5000,]&lt;br /&gt;&amp;nbsp; })&lt;br /&gt;&amp;nbsp; #again, some NAs creep in, I know not how. Drop them all out.&lt;br /&gt;&amp;nbsp; topwords = topwords[!is.na(topwords$count.x),]&lt;br /&gt;&amp;nbsp; #I'd actually like to just return this as it is, but that's impractical with sufficiently large sets.&lt;br /&gt;&amp;nbsp; #So we reduce it down to a single score--similarity--instead&lt;br /&gt;&amp;nbsp; ddply(topwords,.(genre,group1,comparedTo,group2),function(comparison){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; comparison = comparison[!is.na(comparison$count.x),]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; comparison = comparison[!is.na(comparison$count.y),]&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; data.frame(similarity = cor(comparison$count.x,comparison$count.y,method='spearman'))&lt;br /&gt;&amp;nbsp; })&lt;br /&gt;},.progress = "text")&lt;br /&gt;&lt;br /&gt;similarities = words&lt;br /&gt;similarities = similarities[!is.na(similarities$similarity),]&lt;br /&gt;head(similarities)&lt;br /&gt;#Earlier, we made it so that we didn't have both fiction-poetry and poetry-fiction&lt;br /&gt;#Now I add them back in, to 
make the plotting work in either direction:&lt;br /&gt;inverse = similarities&lt;br /&gt;names(inverse)[match(c("comparedTo","group2","genre","group1"),names(inverse))] = c("genre","group1","comparedTo","group2")&lt;br /&gt;#Basically, we just switch the names around, and rbind matches the columns up by name&lt;br /&gt;similarities = rbind(similarities,inverse)&lt;br /&gt;#Using two sets gets two different samples. I want these to be as different as possible, so I only compare 1 to 1 and 2 to 2, except within a single genre.&lt;br /&gt;similarities = similarities[similarities$group1 == similarities$group2 | similarities$genre == similarities$comparedTo,]&lt;br /&gt;&lt;br /&gt;#And then I'll just average the results of simulations using another ddply call.&lt;br /&gt;summarysims = ddply(similarities,.(genre,comparedTo,year),function(localframe) {&lt;br /&gt;&amp;nbsp; data.frame(similarity = mean(localframe$similarity))&lt;br /&gt;})&lt;br /&gt;&lt;br /&gt;#I'm going to make a bunch of plots at once using lapply here: one for each genre.&lt;br /&gt;plots = lapply(levels(summarysims$genre),function(genre) {&lt;br /&gt;&amp;nbsp; #This first version keeps every year, 1800 included. &lt;br /&gt;&amp;nbsp; ggplot(summarysims[summarysims$comparedTo==genre,],aes(x=year,y=similarity,color=genre )) + &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; geom_smooth(se=F) + geom_point() + opts(title=paste("Similarity to",genre))&lt;br /&gt;})&lt;br /&gt;&lt;br /&gt;#do.call(grid.arrange,plots)&lt;br /&gt;&amp;nbsp; &lt;br /&gt;#The 1800 data is terrible, so I'm just dropping it. 
Partly this may be libraries labeling 18xx as 1800; partly it may be&lt;br /&gt;#typesetting mistakes.&lt;br /&gt;plots2 = lapply(levels(summarysims$genre),function(genre) {&lt;br /&gt;&amp;nbsp; ggplot(summarysims[summarysims$comparedTo==genre &amp;amp; summarysims$year != 1800,],aes(x=year,y=similarity,color=genre )) + &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; geom_smooth(se=F) + geom_point() + opts(title=paste("Similarity to",genre))&lt;br /&gt;})&lt;br /&gt;&lt;br /&gt;do.call(grid.arrange,plots2)&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;/code&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-7648479514071000587?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/7648479514071000587/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/02/journal-of-irreproduced-results-vol-1.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7648479514071000587'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7648479514071000587'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/02/journal-of-irreproduced-results-vol-1.html' title='Journal of Irreproduced results, vol. 
1'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-uMY5AN3Y9to/T06DX3HUE2I/AAAAAAAADAU/pKDCeZI5_u0/s72-c/Without+1800.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-2053597886051702769</id><published>2012-02-20T16:48:00.000-05:00</published><updated>2012-02-24T11:14:21.913-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TV watch'/><title type='text'>Downton Abbey Anachronisms, Season Finale edition</title><content type='html'>It's Monday, so let's run last night's episode of Downton Abbey through the anachronism machine.&lt;a href="http://sappingattention.blogspot.com/2012/02/making-downton-more-traditional.html"&gt; I looked for Downton Abbey anachronisms for the first time last week&lt;/a&gt;: using the Google Ngram dataset, I can check every two-word phrase in an episode to see if it's more common today than then. This 1) lets us find completely anachronistic phrases, which is fun; and 2) lets us see how the language has evolved, and what shows do the best job at it. [Since some people care about this--don't worry, no plot spoilers below].&lt;br /&gt;&lt;br /&gt;I'll start this with a chart of every two-word phrase that appears in the episode, just like last time. Left-to-right is overall frequency; top to bottom is over-representation. Higher up is representative of 1995 language; lower down, of 1917. 
Click to enlarge.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-dr-Gv6RLbMM/T0JunVzOynI/AAAAAAAAC_s/oTQcI9ME_d8/s1600/Downton+Abbey+Christmas+Episode.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="468" src="http://3.bp.blogspot.com/-dr-Gv6RLbMM/T0JunVzOynI/AAAAAAAAC_s/oTQcI9ME_d8/s640/Downton+Abbey+Christmas+Episode.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;So: how does it look?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;In short: not too bad. This was one of the best episodes of the season, anachronism-wise. Last week, "black market" was grossly, terribly wrong. This time, there are no unquestionably anachronistic two-word phrases at all. The algorithm's only suggestions, '&lt;b&gt;dogsbody&lt;/b&gt;' and 'cheese souffles,' are both plausible candidates for extremely rare spoken words that just don't make it into the written record out of chance.*&lt;br /&gt;&lt;br /&gt;&lt;i&gt;*Though to be completely pedantic: "&lt;a href="http://books.google.com/ngrams/graph?content=dogsbody%2CDogsbody&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=6&amp;amp;smoothing=3"&gt;Dogsbody&lt;/a&gt;," generalized from a naval term to mean 'menial worker,' is probably a &lt;/i&gt;&lt;i&gt;tiny bit early. It's not attested in the OED until two years later. Though it was probably already present in spoken English somewhere, it seems unlikely that Daisy, the character who says it, would be on the cutting edge of bringing seafarers' language ashore.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;How do I know it's the best episode? Well, I'll quantify that a bit more towards the end of the post, but you can actually see it just in the shape of the cloud. Here are the wordclouds for every episode of Downton so far. 
(PBS aired 7 episodes, but I have 9 here; that's because episodes 1-2 and 7-8 in the British version were condensed into a single, longer episode for American audiences, I believe). You can click to enlarge and find some of the most modern language (towards the top) and the most period-characteristic (towards the bottom) in every episode, but even in thumbnail form, you can see that there aren't that many words up high in last night's episode (lower right) compared to, say, episode 6 (middle right).&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-SLejFRVsKbM/T0Kq8exXtMI/AAAAAAAAC_0/v85sRV4Wc1s/s1600/All+of+Downton+Season+2,+anachronistic+language.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="520" src="http://3.bp.blogspot.com/-SLejFRVsKbM/T0Kq8exXtMI/AAAAAAAAC_0/v85sRV4Wc1s/s640/All+of+Downton+Season+2,+anachronistic+language.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Nonetheless, there's quite a bit that happens that's off. Even when writers do their best, the English language has drifted on in all sorts of directions.&lt;br /&gt;&lt;br /&gt;The single biggest anachronism this week is probably the phrase "&lt;b&gt;novelty value&lt;/b&gt;," which one character talks about regaining by skipping lunch. "&lt;a href="http://books.google.com/ngrams/graph?content=novelty+value&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=6&amp;amp;smoothing=3"&gt;Novelty value&lt;/a&gt;" doesn't enter British English until the 1930s. There are a very few uses before 1920, but most are part of the phrase 'novelty, value, and usefulness' in American legal language. (Which may be the origin of the phrase, but that's neither here nor there.) 
The very few uses of novelty value I can find before 1922 don't use it with weary cynicism, but &lt;a href="http://books.google.com/books?id=xmBbAAAAMAAJ&amp;amp;pg=PA36&amp;amp;dq=%22novelty+value%22&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=y6tCT6LrFqqB0QHmwp3fBw&amp;amp;ved=0CFYQ6AEwBA#v=onepage&amp;amp;q=%22novelty%20value%22&amp;amp;f=false"&gt;with enthusiasm&lt;/a&gt;. The bloom isn't yet off the rose.&lt;br /&gt;&lt;br /&gt;Premature cynicism is an interesting feature of Downton Abbey, actually. One of the outliers I noticed in the season premiere was the Earl of Grantham speaking warily about the "&lt;b&gt;brave new world&lt;/b&gt;" coming after the war. The algorithm senses a problem: Huxley's novel wasn't written until 1931, making the phrase far more popular. OK, you and I both say: but "brave new world, that has such people in it" is from &lt;i&gt;The Tempest&lt;/i&gt;, and surely the Earl knew his Shakespeare. But there's the problem: Huxley cut Shakespeare's line in half: until 1931 "has such people" is as common as "brave new world", but &lt;a href="http://books.google.com/ngrams/graph?content=has+such+people%2Cbrave+new+world&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;afterwards the latter trigram takes off&lt;/a&gt;. Accordingly, most pre-1931 uses are about new people, and most post-1931 ones are about new social arrangements. The Earl's usage is ironic, and about social arrangements: therefore I'd say the numbers are right that it's an anachronism. But what's really interesting is that a lot of the time, ironic remarks may be the places where writers are most forced to take in modern sensibilities, because irony just won't translate. &lt;br /&gt;&lt;br /&gt;Other than novelty value, though, there aren't many howling anachronisms this week. 
"&lt;b&gt;Board games&lt;/b&gt;" is not strictly anachronistic--it &lt;a href="http://books.google.com/books?id=xWQAAAAAYAAJ&amp;amp;pg=RA2-PA67&amp;amp;dq=%22board+game%22&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=rcc-T_DcMufg0QG9nYmkBw&amp;amp;sqi=2&amp;amp;ved=0CGYQ6AEwBw#v=onepage&amp;amp;q=%22board%20game%22&amp;amp;f=false"&gt;shows up in an American magazine ad&lt;/a&gt; during the war, and the novelty of the Ouija board is a big aspect of the episode, so using a rare new word might be OK. On the other hand, it takes a pretty capacious definition of 'game' to easily classify Ouija as a 'board game' (there even appear to be &lt;a href="http://books.google.com/books?id=pIZIAQAAIAAJ&amp;amp;pg=PA446&amp;amp;dq=ouija+game&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=wtI-T_GcA4bq0gHJj62kBw&amp;amp;sqi=2&amp;amp;ved=0CEgQ6AEwAA#v=onepage&amp;amp;q=ouija%20game&amp;amp;f=false"&gt;court cases about just that&lt;/a&gt;), and I sort of doubt that the phrase would have immediately jumped to mind. And the 3-word phrase "&lt;a href="http://books.google.com/ngrams/graph?content=play+board+games%2C+play+a+board+game&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;play board games&lt;/a&gt;" doesn't occur until 1960, so I guess I'll issue a warning. "&lt;b&gt;Trouble understanding&lt;/b&gt;" is another problematic but acceptable phrase: it's almost 100x as common today as in 1920, &lt;a href="http://books.google.com/ngrams/graph?content=trouble+understanding&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;but it did exist&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But what's really interesting for me are the more common words that get suggested as anachronistic. 
The big example this week is the phrase &lt;b&gt;"&lt;a href="http://books.google.com/ngrams/graph?content=make+sense&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=6&amp;amp;smoothing=3"&gt;make sense."&lt;/a&gt;&lt;/b&gt; Google Books suggests, and Bookworm confirms, that "&lt;a href="http://bookworm.culturomics.org/beta/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1888%2C1922%5D%2C%22counttype%22%3A%22Occurrences_per_Million_Words%22%2C%22words_collation%22%3A%22All_Words_with_Same_Stem%22%2C%22smoothingSpan%22%3A%2210%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22make+sense%22%5D%2C%22lc1%22%3A%5B%22BF%22%5D%7D%2C%7B%22word%22%3A%5B%22make+sense%22%5D%2C%22lc0%22%3A%5B%22P%22%5D%7D%2C%7B%22word%22%3A%5B%22make+sense%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22make+sense%22%5D%2C%22category_data%22%3A%5B%5B%5B%22country%22%2C%5B%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%22BF%22%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%2C%5B%5B%22country%22%2C%5B%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22P%22%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%2C%5B%5B%22country%22%2C%5B%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;make sense" is most common in psychology in the pre-Downton period&lt;/a&gt;; it doesn't really take off until after 1925 or 1930. A more appropriate choice than 'doesn't make sense' might be 'isn't clear' or 'is nonsense' (the latter is less common than 'make sense' today, but 100x more common in 1922.)&lt;br /&gt;&lt;br /&gt;But for me, the big prize on this chart is "&lt;b&gt;just might&lt;/b&gt;." 
I've spent the weekend asking everyone I know if there's an important semantic difference between "might just" and "just might"; I've heard a few good answers, but it seems like most of our ears can't distinguish between the two. Today, "just might" is about half as frequent as "might just"; but it was only about &lt;a href="http://books.google.com/ngrams/graph?content=just+might%2Cmight+just&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;1.1% as frequent&lt;/a&gt; in 1920. (Non-words like 'just just' and 'might might' are equally common in the Bookworm corpus). I can't for the life of me distinguish between those two; I'm not sure one even sounds more modern than the other. But the numbers are pretty clear here: it should definitely be 'might just' in 1920. No question about it.&lt;br /&gt;&lt;br /&gt;This, to be honest, is the sort of thing that I'm most interested in finding. I'm fascinated to see how the language changes in directions that we don't notice. Historical accuracy is, of course, not the primary virtue of television, but it is &lt;i&gt;one &lt;/i&gt;virtue: and every little distinction like this makes the past seem more alien, everything that changes with the passage of time more strange. We can watch TV shows to see people behaving just like they do today: but why not see just how different things were? &lt;br /&gt;&lt;br /&gt;Toward that end, I grabbed a bunch of other scripts from online for English period dramas set in the reign of George V. (In most cases, these are extracted subtitles). For each one, I extracted two different statistics: the percentage of extremely anachronistic language (fairly common today, and more than 64 times as common today as when the show/movie is set); and something else approximating the share of somewhat modern language (roughly, words 10x as common when the script was written as when it was set). 
I tossed out the most common spelling changes ("any one" to "anyone", for example), curses, and dialect like "gonna." &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To this set I added one actual Georgian drawing room drama: George Bernard Shaw's Heartbreak House (1919). Several people said in response to my last post that new words enter the spoken language before they enter the written one. True, to a point. But &lt;i&gt;plausibility &lt;/i&gt;and &lt;i&gt;accuracy&lt;/i&gt; are two different things. Maybe words enter the language through speech first. (Although in some of Downton's mistakes--"pansystolic murmur," for example--the print form probably came first). Certainly the mistakes may not require one to suspend disbelief too much. But if we want to know what the past sounded like, I can see no reason to believe Julian Fellowes has a better grasp of spoken language from 1919 than did George Bernard Shaw.&lt;br /&gt;&lt;br /&gt;Anyway, here's the result:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-pxE7cfmPFtA/T0K03Wcz_5I/AAAAAAAADAM/YZm12JP078Q/s1600/Georgian+Dramas+by+Anachronistic+language+to+compare+to+Downton+Abbey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-pxE7cfmPFtA/T0K03Wcz_5I/AAAAAAAADAM/YZm12JP078Q/s1600/Georgian+Dramas+by+Anachronistic+language+to+compare+to+Downton+Abbey.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;(The numerals show the individual episode numbers for Downton Abbey, 6000-word chunks of Heartbreak House (which is very long), and the whole movie for the rest.)&lt;br /&gt;&lt;br /&gt;What do we learn? 
&lt;i&gt;Heartbreak House &lt;/i&gt;is indeed the best on the two metrics combined (that is, closest to the lower left); but even it has a few words that are pretty extreme outliers. &lt;i&gt;The Remains of the Day&lt;/i&gt; actually has &lt;i&gt;fewer&lt;/i&gt; extreme outliers than Shaw. Checking for moderate outliers as well makes &lt;i&gt;Heartbreak House &lt;/i&gt;clock back in where it should.&lt;br /&gt;&lt;br /&gt;As for Downton Abbey: you can see that episode 9 is the closest to Heartbreak House, which is why I say it's one of the best. Also, it's nice to see that the individual chunks of Downton and of Heartbreak House are relatively coherent; that means the gaps between the shows are not just statistical noise.&lt;br /&gt;&lt;br /&gt;How does Downton Abbey compare to other scripts? Well, &lt;i&gt;Remains of the Day&lt;/i&gt; beats Downton on both scores; &lt;i&gt;Howards End&lt;/i&gt; has fewer extreme outliers, but a few more moderate ones. This may partly be because it's set a decade earlier, which I'm not completely controlling for--but I'd wager that also reflects the difference between the two. (More howling anachronisms in Downton, more overall modern language in &lt;i&gt;Howards End).&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;But most interestingly, exactly overlapping with Downton Abbey is "Gosford Park." That movie, you may know, was directed by Robert Altman, but written by the man who went on to create and write Downton Abbey: Julian Fellowes. Ten years later, the strengths and weaknesses are just the same. Even some of the mistakes are the same; just as 'trouble understanding' was one of the worst phrases in Downton Abbey this week, 'trouble sleeping' is one of the worst in Gosford Park.&lt;br /&gt;&lt;br /&gt;That's what I find fascinating about the whole thing. All of these writers are trying to speak the language of the past, but it's a foreign one; and they each have their own characteristic slip-ups. 
No one is truly a native speaker of the old tongue. (Even when, like Edith Wharton, they lived through the age themselves). &lt;br /&gt;&lt;br /&gt;Someday maybe I'll post a few more of these. The &lt;i&gt;Deadwood &lt;/i&gt;word cloud, in its anachronistic, R-rated glory, is something to behold; for me proof positive that great TV doesn't &lt;i&gt;have &lt;/i&gt;to be accurate. But that's enough for now. See you when &lt;i&gt;Mad Men&lt;/i&gt; starts up again?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-2053597886051702769?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/2053597886051702769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/02/downton-abbey-anachronisms-season.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2053597886051702769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2053597886051702769'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/02/downton-abbey-anachronisms-season.html' title='Downton Abbey Anachronisms, Season Finale edition'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/-dr-Gv6RLbMM/T0JunVzOynI/AAAAAAAAC_s/oTQcI9ME_d8/s72-c/Downton+Abbey+Christmas+Episode.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-6288293818091603526</id><published>2012-02-19T23:57:00.000-05:00</published><updated>2012-02-24T14:45:47.131-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><title type='text'>Second epistle to the intellectual historians</title><content type='html'>I. The new USIH blogger LD Burnett has a &lt;a href="http://us-intellectual-history.blogspot.com/2012/02/new-hieroglyphics.html"&gt;post up&lt;/a&gt; expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from &lt;i&gt;New York Times&lt;/i&gt; comment threads. It's a rhetorically appealing position--to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there's some mystification involved--conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week--the announcement of the first issue of the ambitiously all-digital &lt;i&gt;Journal of Digital Humanities. &lt;/i&gt;So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books. &lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;First: the driving impetus behind quite a bit of digital humanities work is precisely the concern about unavailability and central control that seem to structure Burnett's essay. DH is intensely, productively concerned with finding ways to keep gatekeepers from controlling access to texts. 
Many--most?--hate proprietary ebooks on principle. (Though they probably use them more than their peers, too). Indeed, I think it's a common grumble in DH that most historians favor prestige in publication over openness and accessibility. No one that I know of is happily trying to "speed along… the obsolesence of the book"; rather, they are actively engaged in trying to find ways to retain the freedoms allowed by print culture while also taking a new opportunity to reevaluate its shortcomings. &lt;br /&gt;&lt;br /&gt;What are these flaws? Well, Burnett says in the post "the technology for producing or reading a written text remains simple, robust, and nearly universally accessible"; while this is an important point about reading, it elides the central fact about printed text in the last 500 years, and especially the last 100. Anyone can write things on paper, a situation which is unlikely to change. The real discussion is not which is mightier, the pen or MS Word; it's about digital publishing vs. offset lithography. It has been decades since books were produced by movable type, or anything so easy to understand mechanically. I got a chance to watch a massive newspaper press in action while designing &lt;a href="http://bookworm.culturomics.org/"&gt;Bookworm&lt;/a&gt; to run on a LAMP platform, and there's no question in my mind as to which one is harder for a lonely humanist to harness. The major difference between a webserver and a modern printing press is not technological complexity; it's the access to capital required to get one. A single person can host a web site, but getting access to a printing press requires the intermediation of precisely the powerful forces Burnett claims to be worried about.*&lt;br /&gt;&lt;br /&gt;&lt;i&gt;*There is one blindingly obvious exception to this, of course, which we'll get to in a minute.&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This is something the knights of the bookshelf should keep in mind. 
It's all well and good to imagine taking your manuscript to the local academic letterpress printer, sewing up some codices, and filling your bookshelves with texts from the same. But (&lt;a href="http://www.crumpledpress.org/"&gt;with small but notable exceptions&lt;/a&gt;) that's not how academic knowledge is reproduced. Perhaps Burnett is right that technological expertise and corporate control result in knowledge being controlled by a priesthood. If so, however, that has been the case at least since the invention of lithography.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;And that's not even to make the most obvious point, which stems from the fact that a book you read on a screen is not the digital book itself, nor is it a digital copy of the book. It's just another analog publication. A Kindle, say, does not replace a codex; it replaces a piece of paper. When we use screens, we are in some ways moving back in time, replacing the technology of the codex with that of the palimpsest. We have finally created a palimpsest that can be quickly filled and endlessly erased. That is, in many ways, a problem. It's right to be concerned about preservation for born-digital primary &lt;i&gt;and&lt;/i&gt; secondary sources—and indeed, this is a massive area of concern for digital humanists. To worry about that is not to criticize the digital humanities, it's to join them. &lt;br /&gt;&lt;br /&gt;The key point here, though, is that regardless of where we publish--screen palimpsest or paper--we should remember that a reader only interacts with analog materials. When it comes to straight text, this means that digitization does not equal screens. Most of us, most of the time, do find monitors of some sort the easiest way to mediate interaction with digital texts. If you dip into elementary programming, though, you'll find that there's a thing called "stdout": standard output, the place the computer directs its responses to the user. Usually, that's the monitor. 
But that wasn't always true: there's another technology of equal importance that we can forget about that brings digital texts into the human-readable world. It's called a printer.&lt;br /&gt;&lt;br /&gt;That may sound trite. But to fully understand the relationship between digital texts and physical printers is to approach the heart of the digital humanities. A lot of the discussion on the Digital Public Library of America list-serve in the last few months has been about &lt;a href="http://ia700400.us.archive.org/31/items/images/montage_copy.jpg?cnt=0"&gt;print-on-demand stations&lt;/a&gt; that would let patrons in libraries buy or borrow-and-return otherwise inaccessible texts. (The proximate cause was a series of harangues from GNU creator Richard Stallman who, in stature, temperament, and facial hair, occupies a position in free software roughly analogous to that of William Lloyd Garrison and John Brown in abolition put together.) I actually have a handsome physical copy of &lt;a href="http://litlab.stanford.edu/?page_id=255"&gt;Pamphlet I by the Stanford Literary Lab&lt;/a&gt; that they sent to me, for free, through the US mail because they placed such a value on physical copies of work. The Center for History and New Media helped organize the creation of &lt;a href="http://anthologize.org/about/"&gt;Anthologize&lt;/a&gt;, which lets writers who use Wordpress (a blogging platform that, unlike Blogger, allows writers to control all stages of production) turn their blog into a physical book. The most visionary digital historians, people like Bill Turkel, are already plumbing out the possibilities for 3-dimensional printing. The list, I'm sure, goes on.&lt;br /&gt;&lt;br /&gt;This interest in printers seems odd, perhaps tangential. But it's tied in with just what gets humanists excited about digitization: it lets you do whatever you want with your sources. 
That might mean algorithmic manipulation, or hypertext editions, which is where many digital humanists (including myself) see the most exciting new possibilities. It might mean the emergence of new genres from cut-and-paste: this post here is that new invention, the blog comment turned standalone essay. But everything that makes those things possible &lt;i&gt;also&lt;/i&gt; makes it easy to print out a copy that will last centuries. There is no contradiction between digital texts and permanent paper texts; in fact, permanent paper texts are one of the many things that digitization can best support. And since nearly all humanists, without exception, have an irrational relationship to physical books, we tend to get excited about the possibilities of paper, too. This isn't true of a Kindle e-book, because Amazon goes to immense contortions to make that impossible; but one of the single biggest goals of the digital humanities is to ensure that the future of academic publishing is &lt;i&gt;not &lt;/i&gt;locked into corporate standards that keep us from having full control over texts. To repeat: opposition to inscrutable forms of publishing is not anathema to Digital Humanities, it is &lt;i&gt;central &lt;/i&gt;to them. &lt;br /&gt;&lt;br /&gt;I could end there. We have replaced the single factory of books with the bounteous field. Each reader is rooted in the earth to himself, free to produce what works she may. Let a thousand flowers bloom, sow their seeds up and down, see what springs from the earth: powerful oaks or armed men.&lt;br /&gt;&lt;br /&gt;But&lt;i&gt;. 
&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;II.&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-PeNgaPjNLac/T0FGeMCT_8I/AAAAAAAAC_k/dsvtx7HF700/s1600/Screen+shot+2012-02-19+at+1.57.59+PM.png" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="158" src="http://3.bp.blogspot.com/-PeNgaPjNLac/T0FGeMCT_8I/AAAAAAAAC_k/dsvtx7HF700/s400/Screen+shot+2012-02-19+at+1.57.59+PM.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;&lt;i&gt;The grass withereth, and the flower thereof falleth away: &lt;br /&gt;But the word of the Lord endureth for ever. &lt;/i&gt;&lt;/td&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;i&gt; &lt;/i&gt;&lt;br /&gt;&amp;nbsp;Stanley Fish, in his &lt;a href="http://opinionator.blogs.nytimes.com/tag/digital-humanities/"&gt;critical and insightful set of posts on digital humanities&lt;/a&gt;, says that DH is theological because it promises the transcendence of mortality. 
He seems to overestimate the provocation in that statement (as if any academic doesn't love a secularized theological concept); I've said before that I think it's &lt;a href="http://sappingattention.blogspot.com/2011/02/going-it-alone.html"&gt;millenarianism&lt;/a&gt; more than utopianism that the Digital Humanities seems to call up—all sorts of unrelated problems (adjunctification, the poor job market, the role of public history, the esteem accorded collaboration) see their resolution in the digital age.&lt;br /&gt;&lt;br /&gt;It's not wrong to identify something profoundly disturbing in the way digital humanists want to transform the book, the article, the monograph. But I think the loss of control by the individual reader over his or her texts is the wrong place to look. The printer as part of the digital world means that we each can create and keep forever any texts that we want. They are far more decentralized and free. But, they are like the bloom of grass, transitory and uncertain. The old system of production had fixed hierarchies, the &lt;i&gt;imprimatur &lt;/i&gt;and the &lt;i&gt;nihil obstat;&lt;/i&gt; but those very things gave it authority. Gutenberg didn't publish a newspaper or an autobiography: he published the Bible. The idea of print as permanent was hard won, to be sure; but we have it now.&lt;br /&gt;&lt;br /&gt;And that permanence comes from the sanction of the reviewers, the printers, the idea that someone else has marked a book as worth reading. Seen this way, we're not gaining a priesthood that controls access to reading; we're losing one that controls access to writing. If the current academic system is like the church in all its censoring, rigidly hierarchical glory, the digital field more resembles the chaos of the early church. And that's as terrifying as it is empowering. 
Removing the intermediation of publishers entirely—which &lt;i&gt;is &lt;/i&gt;something that millenarian DHers might espouse—and suddenly you have only individual voices that stand on their own. The most prominent DH bloggers, like &lt;a href="http://www.dancohen.org/2011/07/26/the-ivory-tower-and-the-open-web-introduction-burritos-browsers-and-books-draft/"&gt;Dan Cohen&lt;/a&gt; or &lt;a href="http://lenz.unl.edu/other/all_posts.html"&gt;Stephen Ramsay&lt;/a&gt;, don't have to go to publishers or conference organizers to speak to hundreds of people--they just post and readers come. And it's not because of their posts: anyone can be struck down on the road to Damascus and start writing letters to the beloved community; anyone can choose to read them.&lt;br /&gt;&lt;br /&gt;Which is great, in many ways; but it turns out we actually miss that authority. To know what to read; to know we're reading what others are reading; to know that we're right that what we just read was good. The fiercest defenders of the book sometimes claim the codex allows the solitary engagement of reader and writer; but from the reaction everyone has to digital work, I'd say that's just what we're most afraid of. Digital Humanists are desperately trying to &lt;i&gt;keep&lt;/i&gt; the elements of print publishing which allow some modern-day bishops or presbyters to intervene and determine what gets read.&lt;br /&gt;&lt;br /&gt;But that path is still unclear. The Center for History and New Media announced the first issue of the &lt;a href="http://digitalhumanitiesnow.org/the-journal-of-digital-humanities/"&gt;Journal of Digital Humanities&lt;/a&gt; this week, which undertakes a fascinating experiment in distributed reviewing and open composition. 
The idea is that rather than solicit submissions and then farm out the task of reviewing, a journal can cull its content from the open web and then, through the norms of a self-constituted community, choose several articles to be polished through crowd-sourced peer review in the form of comments. This assumes that current patterns of online citation (particularly tweeting) can serve as a rough heuristic for further exploration. It seems like the best-executed example of a new decentralized model for digital publishing that retains the authoritative benefits of the old press model. &lt;br /&gt;&lt;br /&gt;I have to admit that in reading (or in many cases, re-reading) the articles thinking of them as the first run in a new journal, I'm a little nonplussed. While the posts are intelligent, interesting, worth reading, many are recognizably creatures of the Internet: they trail off, they offer more commentary than synthesis or analysis, they evangelize for the field more than they practice humanistic reading. These are not bad things--it's all I really know how to do on a blog myself--but when I try to think what it would mean to peer review them, I'm flummoxed. It's as if I were asked to review for a journal of theology, and handed a rough draft of First Peter. Could you cut the bit about obedience to husbands? Be a little clearer about what Jesus did in hell, and the evidence for this? Just how near are you saying the end of all things is? &lt;br /&gt;&lt;br /&gt;The individual idiosyncrasies of the author make more sense online than in academic print. The basic difficulty I'm having is that the personal author is not the voice of academic prose. And each discipline has its own writing style. What's exciting and challenging about the whole endeavor is that we may be able to create one anew for this field, so that we can write and read a little bit less as individuals, and more as members of communities defined by norms of authority and hierarchy. 
It is also a hard one; so far, there is only one comment on the posts, partly because it's hard to know just where to start with such robustly individual visions. (Ironically, one of the uncommented articles is Fred Gibbs's &lt;a href="http://digitalhumanitiesnow.org/2012/01/critical-discourse-in-the-digital-humanities-by-fred-gibbs/"&gt;call for more critical discourse in the digital humanities&lt;/a&gt;.) We're for now too free to be willing to hold each other in line. We can print out our blog posts, but it doesn't necessarily look like a journal. The problem isn't the paper; it's the control. It might be time to have less authorship and more authority.&lt;br /&gt;&lt;br /&gt;III.&lt;br /&gt;&lt;br /&gt;So while we can still have paper media, we miss all the social accumulation that gathered around codex production. All that labor was instantiated into one product: the printed book. Is it any wonder those of us who love that labor and that knowledge come to value the books themselves, to want to protect them, to love them?&lt;br /&gt;&lt;br /&gt;There is a term for letting physical objects take on the life of the human social relations that produced them: commodity fetishism. If any digital humanists do seem to cheer the death of the book, it's because they finally sense the pall of mystification being lifted. Tim Hitchcock's piece in the JDH &lt;a href="http://digitalhumanitiesnow.org/2012/01/academic-history-writing-and-its-disconnects-by-tim-hitchcock/"&gt;deals with this problem&lt;/a&gt; in just those terms: "By mentally escaping the ‘book’ as a normal form and format, we can see it more clearly for what it was." He wants to hurry up and declare the book dead so that we can get on with the autopsy.&lt;br /&gt;&lt;br /&gt;I'm sympathetic to this view. But I sometimes think the real trick for us all is remembering the book was never alive to begin with; but that everything that animated it still is. 
And while there's plenty of reason to celebrate those forces, there are just as many to think long and hard about whether we can do better than the old systems of production, as well as whether we can do worse.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-6288293818091603526?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/6288293818091603526/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/02/second-epistle-to-intellectual.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6288293818091603526'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6288293818091603526'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/02/second-epistle-to-intellectual.html' title='Second epistle to the intellectual historians'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-PeNgaPjNLac/T0FGeMCT_8I/AAAAAAAAC_k/dsvtx7HF700/s72-c/Screen+shot+2012-02-19+at+1.57.59+PM.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-2419705412344024564</id><published>2012-02-13T13:49:00.002-05:00</published><updated>2012-02-20T17:07:09.291-05:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='TV watch'/><title type='text'>Making Downton more traditional</title><content type='html'>Digital humanists like to talk about what insights about the past big data can bring. So in that spirit, let me talk about &lt;i&gt;Downton Abbey &lt;/i&gt;for a minute. The show's popularity has led many nitpickers to draft up lists of mistakes. (Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don't belong for &lt;a href="http://languagelog.ldc.upenn.edu/nll/?p=3692"&gt;Language Log&lt;/a&gt;, &lt;a href="http://www.npr.org/2012/02/13/146652747/im-just-sayin-there-are-anachronisms-in-downton"&gt;NPR&lt;/a&gt; and the &lt;a href="http://articles.boston.com/2012-02-12/ideas/31049177_1_downton-abbey-expressions-slang"&gt;Boston Globe&lt;/a&gt;.) In the best British tradition, the Daily Mail even &lt;a href="http://www.dailymail.co.uk/news/article-2051332/Downton-Abbey-writers-string-language-gaffes.html"&gt;managed to cast the errors&lt;/a&gt; as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never ever start from the digital sources. That would be, as the dowager countess might say, untoward.&lt;br /&gt;&lt;br /&gt;I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? Idioms are the most colorful examples, but the whole language is always changing. There must be dozens of mistakes no one else is noticing. Google has digitized so much of written language that I don't have to rely on my ear to find what sounds wrong; a computer can do that far faster and better. 
So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English Language, c. 1917, &lt;i&gt;Downton Abbey&lt;/i&gt; really is.&lt;br /&gt;&lt;br /&gt;The results surprised me. There are, certainly, quite a few pure anachronisms. Asking for phrases that appear in no English-language books between 1912 and 1921 gives a list of 34 anachronistic phrases this season. Sorted from most to least common in contemporary books, we get a rather boring list: &lt;span class="st"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt; [1] realistic prospect funding than       specialist care    pansystolic murmur&lt;br /&gt; [5] moment decision    the rematch        relax together     basic tips        &lt;br /&gt; [9] a pansystolic      of randy           be defeatist       dress fittings    &lt;br /&gt;[13] dedicated nurse    wartime marriage   point pretending   fairly grand      &lt;br /&gt;[17] want grandchildren friendships out    shortages all      when peacetime    &lt;br /&gt;[21] liberal front      heavens name       staff luncheon     can posture       &lt;br /&gt;[25] major inheritance  those logic        fingerprinted or   little daydream   &lt;br /&gt;[29] very disfigured    having pancakes    taxing assignment  rationing now     &lt;br /&gt;[33] liar while         unicorn if&lt;/pre&gt;&lt;br /&gt;Another 26 phrases do appear rarely in the 1910s, but are at least 100x as common today (sorted by biggest difference between the teens and the 1990s to least):&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt; [1] black market      the basics        overall charge    there anymore    &lt;br /&gt; [5] feel loved        work load         most dedicated    ganging up       &lt;br /&gt; [9] gonna need        first priority    her homework      our funding      &lt;br 
/&gt;[13] you anymore       bit carried       hospital costs    likely outcome   &lt;br /&gt;[17] off limits        contact her       more traditional  exercise classes &lt;br /&gt;[21] from scratch      in overall        current situation guest bedroom&lt;br /&gt;[25] you gonna  &lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;A few of these are just rare words, plausible &lt;i&gt;&lt;a href="http://en.wikipedia.org/wiki/Hapax_legomenon"&gt;hapax legomena&lt;/a&gt; &lt;/i&gt;in the time period. But others are egregious, howling mistakes. We see here several of the phrases Zimmer discusses ('those logic pills', 'the rematch', 'contact her' for 'get in touch with'). There are also some more obvious anachronisms ('fingerprint' as a verb, 'did her homework' as a metaphor for being prepared) and a few less recognizably modern phrasings like "&lt;a href="http://books.google.com/ngrams/graph?content=realistic+prospect&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;realistic prospect&lt;/a&gt;" (which, when you think about it, is quite a mixed metaphor) and "dress fittings." Expanding it a bit reveals some more howlers: Lord Downton's complaint that his family is "ganging up" on him (the OED has it as a 1925 American coinage)&lt;span class="st"&gt;; Lady Mary's concern about losing the "moral high ground" to Sybil (&lt;a href="http://books.google.com/ngrams/graph?content=moral+high+ground&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;a creation of the 60s that didn't really take off until the early 1980s&lt;/a&gt;); &lt;/span&gt;a usage of the Americanism "cow pie" when a Briton would have said 'cow pat;' and several more. None of those sound as jarring, but they are equally inaccurate. (I particularly like 'cow pie'; we tend to think that rural language is eternal, but it can change as easily as city terms).&lt;br /&gt;&lt;br /&gt;With the full list, we can see some broader patterns of error. 
There are some areas where writers persistently drop the ball. Through much of season 2, Downton Abbey is a hospital or convalescent home, and medical vocabulary presents a particular problem. Branson escapes the draft because of a "mitral valve prolapse" (first use, c. 1965) causing a "pansystolic murmur"&lt;i&gt; &lt;/i&gt;(c. 1953); both terms suggest St. Elsewhere more than the Great War. The doctor's helpers aren't trained in 'specialist care'; hardly their fault, since the phrase was never used before 1925. The household is relieved that Carson the butler did not suffer a 'heart attack'; but that phrase was about 50x rarer in 1917 (perhaps a coronary, like the one that nearly killed Roger Sterling in season 1 of &lt;i&gt;Mad Men&lt;/i&gt;, would have been more appropriate?) and, so far as I can tell, only an 'acute heart attack' would have meant myocardial infarction to the crew at Downton.&lt;br /&gt;&lt;br /&gt;&lt;i&gt; &lt;/i&gt;&lt;br /&gt;Season 2's Great War setting opens the door for another sort of mistake: words from the Second World War showing up 20 years ahead of schedule. To most of us, the lingo from the wars is indistinguishable; but there are some major mistakes. One subplot involves Thomas setting up business selling goods on the black market; "there are shortages all around," he declares. 
He might as well be speaking Greek: the '&lt;a href="http://books.google.com/ngrams/graph?content=black+market&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;black market&lt;/a&gt;' doesn't emerge until 1941, and though&amp;nbsp; businessmen (particularly Americans) sometimes used 'shortages' as the opposite of 'surpluses,' &lt;a href="http://bookworm.culturomics.org/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1815%2C1922%5D%2C%22counttype%22%3A%22Percentage_of_Books%22%2C%22words_collation%22%3A%22Case_Sensitive%22%2C%22smoothingSpan%22%3A%225%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22shortages%22%5D%2C%22country%22%3A%5B%22UK%22%5D%2C%22lc1%22%3A%5B%22PR%22%2C%22PS%22%2C%22PZ%22%5D%7D%2C%7B%22word%22%3A%5B%22shortages%22%5D%2C%22country%22%3A%5B%22USA%22%5D%2C%22lc0%22%3A%5B%22H%22%5D%7D%2C%7B%22word%22%3A%5B%22shortages%22%5D%2C%22country%22%3A%5B%22USA%22%5D%2C%22lc1%22%3A%5B%22PR%22%2C%22PS%22%2C%22PZ%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22shortages%22%5D%2C%22category_data%22%3A%5B%5B%5B%22country%22%2C%5B%22UK%22%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%22PR%22%2C%22PS%22%2C%22PZ%22%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%2C%5B%5B%22country%22%2C%5B%22USA%22%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22H%22%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%2C%5B%5B%22country%22%2C%5B%22USA%22%5D%5D%2C%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%22PR%22%2C%22PS%22%2C%22PZ%22%5D%5D%2C%5B%22LCSH%22%2C%5B%5D%5D%2C%5B%22aLanguage%22%2C%5B%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;it is so rare in British speech that it almost never appears in UK fiction from the period&lt;/a&gt;.  
"In short supply," also used in this subplot, was about 250 times as common during the second world war as during the first. Even the now-ubiquitous ideas of 'wartime' and 'peacetime' aren't appropriate; Mrs. Bryant refers to a 'wartime marriage;' but the use of 'wartime' and 'peacetime' as adjectives didn't pick up in earnest until 1941. &lt;br /&gt;&lt;br /&gt;But we can do more than just pick nits on idiomatic speech. ("Pick nits" is &lt;a href="http://books.google.com/ngrams/graph?content=pick+nits%2Cnitpick&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;1960&lt;/a&gt;, by the way). The data lets us look more generally at what the show gets right and wrong about past language. Every episode has dozens of lines that are just slightly off, and it's in these that the patterns really look funny. In addition to the 60 phrases above, there are another 260 that are at least 10 times more common in the 1990s than in the 1910s. These are phrases like "at long last," "from scratch", and "act fast"--maybe a few could be spoken in the teens, but all of them together? &lt;br /&gt;&lt;br /&gt;Some of these are extremely common. To help me find the words, I asked R to make a chart that looks like this to find the worst mistakes in every episode of the season. (This is last night's: click to enlarge):&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-5pGBXJFpf3A/TzlRjDb6iSI/AAAAAAAAC_Q/CVwfubbuqGM/s1600/Downton+Abbey+unlikely+words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="518" src="http://2.bp.blogspot.com/-5pGBXJFpf3A/TzlRjDb6iSI/AAAAAAAAC_Q/CVwfubbuqGM/s640/Downton+Abbey+unlikely+words.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;[ed--If you want to see more, &lt;a href="http://sappingattention.blogspot.com/2012/02/downton-abbey-anachronisms-season.html"&gt;they're all in my next post on the topic&lt;/a&gt;.] 
&lt;br /&gt;&lt;br /&gt;Farther to the left means less common nowadays; higher up means more common today relative to 1917 ('be defeatist' is next to 3; it's 10^3, or 1000x as common today), and below 0 means more common in 1917. Looking at these, the new words on the upper left jump out, but some more common words that are overused by only a factor of 5 or 10 jump out as well.&lt;br /&gt;&lt;br /&gt;For example: Characters in Downton Abbey say "I must" 24 times, three times as often as they say "I need to." Books from the period, on the other hand, say "I must" &lt;i&gt;three hundred times &lt;/i&gt;as often; going by the printed literature, the Abbey's residents should "need to" do something about once every ten seasons, not once an episode. Ben Zimmer pointed out that some characters say "I'm just saying" anachronistically, but it's not just that phrase: they use "just" to modify meaning far too much. Phrases like "just wrong," "just sucking," and "just need" are frequent, and uncharacteristic. (They should be saying "only wrong," I think). &lt;br /&gt;&lt;br /&gt;This is not to say that they get everything wrong. The writers tune their ears well enough to get quite a bit right. They know to say "sympathy with" rather than "sympathy for," to use "awfully" as an intensifier, and so on. 
We can find the shining examples of period language in Downton, too: all of these phrases were at least 6x as common in the teens as today:&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt; [1] this war               the trenches           who shall             &lt;br /&gt; [4] war has                practically a          so slight             &lt;br /&gt; [7] old chap               newspaper man          civilised world       &lt;br /&gt;[10] dressing station       very feeble            for luncheon          &lt;br /&gt;[13] little chap            jolly well             stand well            &lt;br /&gt;[16] wonderful what         from Arras             1914 I                &lt;br /&gt;[19] thither as             before luncheon        our chauffeur         &lt;br /&gt;[22] soldier servant        a plutocrat            no livery             &lt;br /&gt;[25] whole bally            awfully cut            tremendous disturbance&lt;br /&gt;[28] hereafter forever      wee chap               me enlist             &lt;br /&gt;[31] bally lot              hall boys              our Arch              &lt;br /&gt;[34] dressing gong          seems jolly            little waspish        &lt;br /&gt;[37] no convalescence       shining film           beggared if        &lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In all, the language in Downton is about 50-50; half is more common in 1995, half more common in 1917. In the abstract, that doesn't sound great to me, but we need a basis for comparison. 
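&lt;br /&gt;&lt;br /&gt;The arithmetic behind that 50-50 figure, and behind the vertical axis of the chart above, is just a ratio of two frequencies. Here is a minimal sketch in Python; it is an illustration, not the R code behind these charts, and the function names and the per-million frequencies are invented for the example.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
import math


def log_ratio(freq_now, freq_then):
    """Base-10 log of (modern frequency / period frequency).

    A value of 3 means 1000x as common today; a negative value means
    the phrase was more common in the period.
    """
    return math.log10(freq_now / freq_then)


def share_more_common_then(phrases):
    """Fraction of phrases whose period frequency exceeds the modern one."""
    older = [p for p in phrases if p["then"] > p["now"]]
    return len(older) / len(phrases)


# Hypothetical frequencies per million words, NOT real Ngrams counts.
toy_phrases = [
    {"phrase": "be defeatist", "then": 0.5, "now": 500.0},
    {"phrase": "I need to", "then": 1.0, "now": 30.0},
    {"phrase": "old chap", "then": 6.0, "now": 1.0},
    {"phrase": "I must", "then": 300.0, "now": 100.0},
]

for p in toy_phrases:
    print(p["phrase"], round(log_ratio(p["now"], p["then"]), 2))

print(share_more_common_then(toy_phrases))  # 0.5 for this toy list
```
&lt;/pre&gt;
With real counts in place of the toy numbers, the last line gives the share of a script's phrases that lean toward the period rather than toward today.&lt;br /&gt;&lt;br /&gt;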
Let's take another show: the beloved "Pride and Prejudice" adaptation from the nineties.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-R47EvaOZqOk/TzlUIPBDdAI/AAAAAAAAC_Y/c9d1g334qYk/s1600/Pride+and+Prejudice+example.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="518" src="http://3.bp.blogspot.com/-R47EvaOZqOk/TzlUIPBDdAI/AAAAAAAAC_Y/c9d1g334qYk/s640/Pride+and+Prejudice+example.png" width="640" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;That band on the left is the completely new phrases from 1815 to 1995; a lot of language doesn't appear at all in books from Austen's time. Now, &lt;i&gt;Pride and Prejudice &lt;/i&gt;has a lot of obstacles to overcome; the books are worse, a lot of occurrences of 'someone' will appear as 'fomeone' in the old OCR, and it's set 100 years earlier than Downton. But nonetheless, the center of that cloud is a little lower than Downton's. Of the 6 episodes, between 60 and 67% of the words are more common in 1815 than in 1995; for Downton, only about 50% are more common in 1917 than in 1995. That's because the BBC could steal lines from Austen that &lt;i&gt;sounded authentic&lt;/i&gt; even without the writers having to think up phrases like "total want of" or "cordially wish." If you care only about gross anachronisms, &lt;i&gt;Pride and Prejudice &lt;/i&gt;will sound worse than Downton because from time to time they added words that Austen &lt;i&gt;didn't &lt;/i&gt;write; but if you care about historical accuracy overall, you'll get a much better experience of old-fashioned speech from the show that took from Austen.&lt;br /&gt;&lt;br /&gt;For a script without a source base to crib from, though, Downton doesn't do so poorly. A couple of episodes of &lt;i&gt;Mad Men &lt;/i&gt;I checked were possibly worse &lt;i&gt;[Ed.--Looking into it a little more, I take this back; they're probably better]&lt;/i&gt;; even great novelists do no better. 
Edith Wharton's "The Age of Innocence" is one of the great historical novels in the public domain (written in 1921, set in the 1870s), but dialogue in it routinely uses phrases like "marked trend" and "shoe polish" that no one in the 1870s would have known. In fact, only 40% of its words are more common in the 1870s than the 1920s; even worse than Downton. Of course, they all sound old-fashioned to us now.&lt;br /&gt;&lt;br /&gt;Do these mistakes really matter? Yes and no. Maybe the characters say it best:&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;ROBERT, EARL OF GRANTHAM                                                              &lt;br /&gt;You don't think she'd be happier with a more traditional set up?&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Nothing seems out of order here, perhaps. But "&lt;a href="http://books.google.com/ngrams/graph?content=more+traditional&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;more traditional&lt;/a&gt;" is a &lt;i&gt;profoundly &lt;/i&gt;untraditional way of describing things. Historians know that the "&lt;a href="http://scholar.google.com/scholar?q=Invention+of+tradition&amp;amp;hl=en&amp;amp;as_sdt=0&amp;amp;as_vis=1&amp;amp;oi=scholart&amp;amp;sa=X&amp;amp;ei=T0s5T6DrHMPv0gHZtqjIAg&amp;amp;ved=0CBgQgQMwAA"&gt;invention of tradition&lt;/a&gt;" was rampant in Victorian England; the practice of happily talking about "more traditional" and "less traditional" outcomes &lt;a href="http://books.google.com/ngrams/graph?content=more+traditional%2Ctraditionally&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;is even more recent&lt;/a&gt;. To a real Earl of Grantham, talking about tradition as a sliding scale would rather miss the point; either it's traditional or it's not.&lt;br /&gt;&lt;br /&gt;But today, of course, those shades of tradition--sometimes right, sometimes wrong--are exactly what the show is about. 
We think we can recapture it in little parts; that various characters in the past can stand in for us, and that we might behave just like they do.&lt;br /&gt;&lt;br /&gt;This is the real weakness of Downton Abbey, I'd say. Not just the language but the sensibilities are obviously modern, easy for us to understand, and false to the reality of the past. (I admit I skipped large parts of the second season of Downton Abbey to watch &lt;i&gt;Cheers&lt;/i&gt;, which gives a far more nuanced depiction of the way social class is used as an instrument of authority and liberation than &lt;i&gt;Downton&lt;/i&gt;.) But to imagine yourself sophisticated, fighting prejudice and eating quaint food, Downton's just the thing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-2419705412344024564?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/2419705412344024564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/02/making-downton-more-traditional.html#comment-form' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2419705412344024564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2419705412344024564'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/02/making-downton-more-traditional.html' title='Making Downton more traditional'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-5pGBXJFpf3A/TzlRjDb6iSI/AAAAAAAAC_Q/CVwfubbuqGM/s72-c/Downton+Abbey+unlikely+words.png' height='72' width='72'/><thr:total>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-4895283433829677323</id><published>2012-02-02T15:51:00.003-05:00</published><updated>2012-02-04T11:57:30.598-05:00</updated><title type='text'>Poor man's sentiment analysis</title><content type='html'>Though I usually work with the &lt;a href="http://bookworm.culturomics.org/"&gt;Bookworm&lt;/a&gt; database of Open Library texts, I've been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there's also a lot more that could be coming out of the Ngrams set than what I've seen in the last year.&lt;br /&gt;&lt;br /&gt;Most humanists respond to the raw frequency measures in Google Ngrams with some bafflement. There's a lot to get excited about internally to those counts that can help answer questions we already have, but the base measure is a little foreign. 
If we want to know about the history of capitalism, the &lt;a href="http://books.google.com/ngrams/graph?content=capitalism&amp;amp;year_start=1900&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;punctuated ascent of its Ngram&lt;/a&gt; only tells us so much:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Qis6Wq-vjfM/TyrE2EKfj3I/AAAAAAAAC-g/5uVQTqlLOmM/s1600/Capitalism+Frequency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-Qis6Wq-vjfM/TyrE2EKfj3I/AAAAAAAAC-g/5uVQTqlLOmM/s1600/Capitalism+Frequency.png" /&gt;&lt;/a&gt;&lt;/div&gt;It's certainly &lt;i&gt;interesting &lt;/i&gt;that the steepest rises, in the 1930s and the 1970s, are associated with systematic worldwide crises--but that's about all I can glean from this, and it's one more thing than I get from most Ngrams. Usually, the game is just tracing individual peaks to individual events; a solitary quiz on historical events in front of the screen. Is this all the data can tell us?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Ngrams gives us frequency, but that's just background information for a more interesting question: &lt;i&gt;how &lt;/i&gt;the word is used, not how much&lt;i&gt;.&lt;/i&gt; The full machine learning approach would be to tag all the sentences with sentiment analysis and find out whether capitalism is good or bad. I had a conversation with a Harvard professor yesterday who seemed to think that might work well.&lt;br /&gt;&lt;br /&gt;So maybe that would be useful. But historical sentiment is rarely so simple as 'good' or 'bad.' 
(Even when &lt;a href="https://www.google.com/search?q=capitalism+is+good&amp;amp;btnG=Search+Books&amp;amp;tbm=bks&amp;amp;tbo=1#q=%22capitalism+is+good%22&amp;amp;hl=en&amp;amp;tbas=0&amp;amp;sa=X&amp;amp;ei=v8YqT6q0DoihtweG5MT7Dw&amp;amp;ved=0CB0QpwUoBA&amp;amp;source=lnt&amp;amp;tbs=cdr:1%2Ccd_min%3A1%2F1%2F1929%2Ccd_max%3A12%2F31%2F1936&amp;amp;tbm=bks&amp;amp;bav=on.2,or.r_gc.r_pw.,cf.osb&amp;amp;fp=1afdd38b09f813c2&amp;amp;biw=1280&amp;amp;bih=988"&gt;those are the words&lt;/a&gt; we search for). Full sentiment analysis would allow us to do a reading of capitalism in seconds, just the way the Ngrams charts allow us to, on whatever poles ("good" vs. "bad") we could come up with. But historians have more time on their hands, and shouldn't necessarily want just that unidimensional view.&lt;br /&gt;&lt;br /&gt;In fact, the shades of sentiment about capitalism are bounded only by the capacity of language to express them. And language is just what we've got already. We may say we want sentiment analysis, but what we really want to know are the shifting contexts in which 'capitalism' is used. Before we have to hang our hats on classifying tools built for other purposes, we should see what the language itself has to say. Let's pretend that there are just as many sentiments in the English language as there are words. What then?&lt;br /&gt;&lt;br /&gt;So what I've done is load in all the 2-grams into a new database here at the Cultural Observatory, and split them up into their component words. 
That means one can easily get a list of the top 20 words that appear immediately &lt;i&gt;before &lt;/i&gt;capitalism in the Ngrams dataset, like so:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;mysql&amp;gt; SELECT word1,word2,sum(words) as count FROM 2gramcounts WHERE word2='capitalism' GROUP BY word1 ORDER BY count DESC LIMIT 20;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;+------------+------------+---------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| word1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | word2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | count &amp;nbsp; |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;+------------+------------+---------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| of&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism | 1208824 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| and&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp; 164728 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| to&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp; 139893 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| under&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp; 135168 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| industrial | capitalism |&amp;nbsp; 131933 |&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| that&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp; 125771 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| modern&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 75524 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| monopoly&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 73631 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| global&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 68416 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| American&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 66332 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| state&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 66250 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| in&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 60258 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| with&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 49600 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| late&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 49006 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| from&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | 
capitalism |&amp;nbsp;&amp;nbsp; 47667 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 44138 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| between&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 41847 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| -&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 41537 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| for&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 40552 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;| market&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | capitalism |&amp;nbsp;&amp;nbsp; 39614 |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;+------------+------------+---------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;20 rows in set (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This in itself is slightly more useful, because it gives us a hint of what individual phrases we might care about if we use 'capitalism'. 
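&lt;br /&gt;&lt;br /&gt;For readers without a database of 2-grams to hand, the same kind of query can be tried on a toy table. What follows is a minimal sketch using Python's built-in sqlite3 module rather than MySQL; the table name bigram_counts, the year column, and every count in it are assumptions made up for the example, not the real schema or data.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
import sqlite3


def top_preceding(conn, target, limit=20):
    """Return (word1, total) pairs for the words most often found before target."""
    query = (
        "SELECT word1, SUM(words) AS total FROM bigram_counts "
        "WHERE word2 = ? GROUP BY word1 ORDER BY total DESC LIMIT ?"
    )
    return conn.execute(query, (target, limit)).fetchall()


# Hypothetical table layout: one row per 2-gram per year.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bigram_counts "
    "(word1 TEXT, word2 TEXT, year INTEGER, words INTEGER)"
)
toy_rows = [  # invented counts, for illustration only
    ("industrial", "capitalism", 1930, 40),
    ("industrial", "capitalism", 1990, 25),
    ("late", "capitalism", 1990, 50),
    ("state", "capitalism", 1930, 30),
    ("late", "night", 1990, 999),
]
conn.executemany("INSERT INTO bigram_counts VALUES (?, ?, ?, ?)", toy_rows)

print(top_preceding(conn, "capitalism", limit=3))
# [('industrial', 65), ('late', 50), ('state', 30)]
```
&lt;/pre&gt;
Adding the year to the GROUP BY clause is the step that would turn a single ranked list like this into something that can be charted over time.&lt;br /&gt;&lt;br /&gt;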
That list suggests the possibility of &lt;a href="http://books.google.com/ngrams/graph?content=market+capitalism%2Cstate+capitalism&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;comparing "market" and "state" capitalism&lt;/a&gt;, for example, which is slightly more interesting and meaningful than comparing "capitalism" and "communism," though still a little opaque.&lt;br /&gt;&lt;br /&gt;But it's not historical, it's ugly, and it doesn't really shop hypotheses the way I'd like. If we solve a few of the problems (modern phrases like 'late capitalism' show up more than they should, etc.) with a little bit of basic arithmetic, &lt;a href="http://sappingattention.blogspot.com/2011/04/stopwords-to-wise.html"&gt;exclude some stopwords&lt;/a&gt;, and cluster words by their similarity in frequency and trend,* we start to get towards an ngrams chart that gives a fuller picture of how the word is used.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;*This k-means clustering is [ed.--along with the loess smoothing on the lines] the only part of this whole thing that involves any math a typical humanist doesn't know, and it's for visibility only--which quadrant a line appears in. In an earlier version I used a log-likelihood score to select the words most closely associated with 'capitalism', but that turns out not to really be necessary--all the interesting phrases pop up just by using raw frequency. 
Anytime you can avoid the tricky math, you should.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-_rIDQTeWCeA/TyrPxrQhA_I/AAAAAAAAC-o/_8abCGo3_Zk/s1600/Relative+share+of+most+frequent+words+preceding+%22Capitalism%22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="508" src="http://1.bp.blogspot.com/-_rIDQTeWCeA/TyrPxrQhA_I/AAAAAAAAC-o/_8abCGo3_Zk/s640/Relative+share+of+most+frequent+words+preceding+%22Capitalism%22.png" width="640" /&gt; &lt;/a&gt;&lt;/div&gt;(You may want to click to enlarge).&lt;br /&gt;&lt;br /&gt;For any given year here, all the displayed words sum to 100%: so 1 in 20 times that 'capitalism' is preceded by any of the above words, it's preceded by 'market,' for example. We could express the ratio as a percentage of a) all words, or b) all two-grams ending in capitalism, but there tends to be a lot of not-necessarily-important noise in both of those.&lt;br /&gt;&lt;br /&gt;Substantively, there are some interesting points here. (I should note that I looked at &lt;a href="http://sappingattention.blogspot.com/2010/12/capitalist-lackeys.html"&gt;some similar stuff in a much smaller dataset&lt;/a&gt; a while ago.) The decline of 'state' and 'private' capitalism as meaningful terms in the upper right suggests the continuing movement away from capitalism as a type of political economy; the depression-era rise of 'finance capitalism' and the lack of any return in the 1970s makes me wonder whether the modern-day rise of 'market' (including 'free-market') capitalism refers to the same thing, but with a more occluded group of actors. I particularly like the peaks for different countries ("american" peaking in 1960, "british" in the late 1930s, "Japanese" around 1970 (so early!) with a bump later). 
Put that together with the decline in 'state' capitalism, and you could start to make an interesting argument that the published literature has moved away from describing varieties of capitalism and towards seeing it more and more as an ideal type.&lt;br /&gt;&lt;br /&gt;(Of course, the normal Ngrams weird-sample caveats apply; the incredible ascent of 'late capitalism' is about the prevalence of Western Marxism in academic-press books, surely, and it's only historians who get excited about 'agrarian capitalism' in the 1980s.)&lt;br /&gt;&lt;br /&gt;Nonetheless, it's an interesting way of thinking about the paths of a lot of words. Here are a few with only minor comment, all click-to-expand ready. ***If you have some individual word or set of words you want to see this on, just let me know.***&lt;br /&gt;&lt;br /&gt;Words &lt;b&gt;following "Capitalist" &lt;/b&gt;show much the same pattern as "capitalism," but the three words in the upper left (declining, common words) more clearly show a strong Marxist influence that declines. And certain very specific phrases ("capitalist encirclement") that I'd never think to Ngram jump out.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-nhfk9C3ZAvk/TyrWuoeSQiI/AAAAAAAAC-w/yliGfeIF6xA/s1600/Relative+share+of+most+frequent+words+following+%22capitalist%22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="470" src="http://3.bp.blogspot.com/-nhfk9C3ZAvk/TyrWuoeSQiI/AAAAAAAAC-w/yliGfeIF6xA/s640/Relative+share+of+most+frequent+words+following+%22capitalist%22.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;An abstract word like &lt;b&gt;Freedom &lt;/b&gt;has a lot of contexts, and it's difficult to generalize. 
There are some interesting local peaks ('absolute freedom' around 1905, 'religious freedom' around 1850) and some obvious secular trends (declines in the freedom of the ancients, in 'perfect freedom', in 'unlimited freedom') but not an overall trend.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-SV33PhzPL7E/TyrXmh488aI/AAAAAAAAC-4/Al54EYE4V0U/s1600/Relative+share+of+most+frequent+words+preceding+%22freedom%22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="510" src="http://2.bp.blogspot.com/-SV33PhzPL7E/TyrXmh488aI/AAAAAAAAC-4/Al54EYE4V0U/s640/Relative+share+of+most+frequent+words+preceding+%22freedom%22.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;b&gt;Words following the adjective free&lt;/b&gt;, on the other hand, have to my mind a clearer set of trends. (This may be a general rule--you can learn more from the adjectival form than the noun form in the two-grams, since that returns nouns, which tend to be more easily glossed than adjectives). A lot of the decliners here, on the left side, are words like "communication," "circulation," "navigations," and "passage." That gets at one very important but rarely discussed phenomenon: the removal of freedom of movement (one of the 1848 revolutionaries' major demands) from the very heart of liberalism to its extremities. You get some interesting post-civil rights-era changes as well--the extinction of 'free men', the sudden spike in 'free association' (which generally meant freedom to discriminate), and an interesting rise in 'free press'. 
There's lots more.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-nQh_nJdsGuw/TyrsmzHEIkI/AAAAAAAAC_A/rme04No-KvY/s1600/Relative+share+of+most+frequent+words+following+%22free%22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="510" src="http://4.bp.blogspot.com/-nQh_nJdsGuw/TyrsmzHEIkI/AAAAAAAAC_A/rme04No-KvY/s640/Relative+share+of+most+frequent+words+following+%22free%22.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;b&gt;Words after "slave" &lt;/b&gt;give quite a different story, largely (I suspect) because the printed record has frequently been dominated by opponents of slavery (Boston and New York publishers in the 1850s, historians today). The risers (the two lower cells on the left, mostly) give an interesting list of the topics in slavery that only became heavily discussed in the published literature sometime around/after the modern slavery historiography began: slave women, slave quarters, slave revolts, slave society. "Slave property" and "slave states" obviously change inflection at the Civil War; the "slave market" is interestingly constant, while emphasis somehow shifts from the 'slave trade' (80% of all adjective-ish usage of 'slave' in 1805!) 
towards 'slave traders' and 'slave trading.'&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-QCQL8RzZX7U/TyrxE3-IxjI/AAAAAAAAC_I/nU8EkUNAhjw/s1600/Relative+share+of+most+frequent+words+following+%22slave%22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="510" src="http://1.bp.blogspot.com/-QCQL8RzZX7U/TyrxE3-IxjI/AAAAAAAAC_I/nU8EkUNAhjw/s640/Relative+share+of+most+frequent+words+following+%22slave%22.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;In any case, all of that is largely to prove that one can create historical texts, as it were, out of the ngrams database that are linguistically complex, promote historical interrogation and argument, and all the rest.&lt;br /&gt;&lt;br /&gt;Why doesn't Ngrams or Bookworm have this as a feature, you ask? Two reasons, which I think are somewhat useful to think about:&lt;br /&gt;&lt;br /&gt;1) It takes a lot longer to run the queries. Since this is processing through thousands of 2-grams before settling on a few, the process takes several seconds to run. That doesn't sound long, but no one--particularly not non-computer-savvy users like most humanists--waits more than one or two seconds on the web, which is a massive barrier to scholarly tools online. &lt;br /&gt;&lt;br /&gt;2) It takes more storage space. To make these queries effective at all, you have to store sorted tables. (Effectively--&lt;a href="http://sappingattention.blogspot.com/2011/03/what-historians-dont-know-about.html"&gt;actually, it's indexes, as I talked about earlier&lt;/a&gt;.) If you want to be able to search for both preceding and succeeding words, in fact, it often makes sense to store two separate copies of the data for quick lookup, one sorted by succeeding words, and one by preceding. 
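&lt;br /&gt;&lt;br /&gt;The reason the sort order matters can be seen in a few lines. This is a minimal sketch in Python, with plain sorted lists standing in for database indexes; the function names and the five 2-grams are hypothetical, and a real database index is a more elaborate structure than a list.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
from bisect import bisect_left


def starting_with(sorted_pairs, first):
    """Binary-search a list of sorted tuples for every pair whose first item is `first`."""
    start = bisect_left(sorted_pairs, (first,))
    found = []
    for pair in sorted_pairs[start:]:
        if pair[0] != first:
            break  # sorted order means the matches sit in one contiguous run
        found.append(pair)
    return found


def ending_with(sorted_pairs, second):
    """No shortcut in this ordering: every entry has to be examined."""
    return [pair for pair in sorted_pairs if pair[1] == second]


# One copy sorted by the first word of each 2-gram...
bigrams = sorted([
    ("American", "capitalism"), ("British", "capitalism"),
    ("capitalist", "lackeys"), ("capitalist", "losses"), ("free", "press"),
])
# ...and a second copy, stored with the words swapped, sorted by the second word.
swapped = sorted((w2, w1) for w1, w2 in bigrams)

print(starting_with(bigrams, "capitalist"))  # quick: the two rows are neighbours
print(ending_with(bigrams, "capitalism"))    # slow: a scan of the whole list
print(starting_with(swapped, "capitalism"))  # quick again, at the cost of a second copy
```
&lt;/pre&gt;
The second copy doubles the storage, which is the trade-off described above.&lt;br /&gt;&lt;br /&gt;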
So (with the disclaimer that I have no insider knowledge of this at all) at the Google Ngrams site, I think they store the files in an order where "capitalist lackeys" and "capitalist losses" are easily findable, but where "American capitalism" and "British capitalism" are nowhere near each other. You'd have to read every entry in the whole thing to get them both together, which means that for most practical purposes, it's not possible.&lt;br /&gt;&lt;br /&gt;[I think Fred Gibbs and Dan Cohen did in fact do this herculean task for words within 4 of 'marriage' in the Ngrams database, but it took pay-by-the-hour processing time on Amazon Web Services, which is fine for an individual research project, but unsustainable as a service--each query would probably cost ten dollars, give or take an order of magnitude. This is also probably the place to remind you that Cohen said immediately after release that &lt;a href="http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/"&gt;the longer Ngrams would be more useful to humanists than the unigrams&lt;/a&gt;; it's interesting that, over a year later, no one has taken up the infrastructural task of enabling those uses.]&lt;br /&gt;&lt;br /&gt;In any case: 2-grams are obviously only a test-bed of the real thing here. One would want to compare just certain parts of speech, to group words together at the researcher's whim, and so forth. Perfectionism would probably wait for a natural-language-processing approach to make this work perfectly; a scaled up version of &lt;a href="http://www.cs.berkeley.edu/%7Eaditi/projects/wordseer.html"&gt;wordseer&lt;/a&gt;. (Or you could just use Mark Davies' version of the Ngrams data). 
But given that what historians do is read individual words in their contexts, I'm not sure we have to wait.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-4895283433829677323?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/4895283433829677323/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/02/poor-mans-sentiment-analysis.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4895283433829677323'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4895283433829677323'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/02/poor-mans-sentiment-analysis.html' title='Poor man&apos;s sentiment analysis'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-Qis6Wq-vjfM/TyrE2EKfj3I/AAAAAAAAC-g/5uVQTqlLOmM/s72-c/Capitalism+Frequency.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8428242382324857817</id><published>2012-01-30T10:20:00.000-05:00</published><updated>2012-01-31T14:24:46.868-05:00</updated><title type='text'>Fixing the job market in two modest steps</title><content type='html'>Another January, another round of hand-wringing about the humanities job market. 
So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.&lt;br /&gt;&lt;br /&gt;Tony Grafton and Jim Grossman launched the latest exchange with what they call a "&lt;a href="http://www.historians.org/Perspectives/issues/2011/1110/1110pre1.cfm"&gt;modest proposal&lt;/a&gt;" for expanding professional opportunities for historians. &lt;a href="http://hnn.us/articles/history-worth-fighting-where-aha"&gt;Jesse Lemisch&lt;/a&gt; counters that we need to think bigger and mobilize political action. There's a big and productive disagreement there, but also a deep similarity: both agree there isn't funding inside the academy for history PhDs to find work, but think we ought to be able to get our hands on money controlled by someone else. Political pressure and encouraging words will unlock vast employment opportunities in the world of museums, archives, and other public history (Grafton) or government funded jobs programs (Lemisch). These are funny places to look for growth in a 21st-century OECD country (perhaps Bill Cronon could take the more obvious route, and make his signature initiative as AHA president creating new tenure-track jobs in the &lt;a href="http://www.economist.com/node/14845197"&gt;BRICs&lt;/a&gt;?) but the higher levels of the profession don't see much choice but to change the world. &lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;Like most non-combatants, I tend to agree in general with the concrete aims of both Lemisch and Grafton. Still, there's something I always find too transparently self-serving about their arguments. Contra Grafton and Grossman's title, they're not modest at all. Both stances can actually seem quite grandiose in their claims for history. 
If only we had the courage to proclaim how important history is, Congress would fund it; if only we were clearer about the immense benefit a Ph.D. education can give to someone seeking alt-ac employment, the world would beat a path to our doors. Never mind that the size of the profession is set not by idealism, but by a combination of university economics and a close attention to market share; every family member is a blessing, and we must care for them all. The fault may not lie in the stars, but the only fault in ourselves is that we are underlings.&lt;br /&gt;&lt;br /&gt;There used to be an alternative that didn't take ourselves so seriously: let's call it contractionism. It called for the immediate limiting of new entrants to Ph.D. programs, and a continuing balance of Ph.D. numbers with new tenure track jobs. Thanks largely to Rob Townsend's valiant data collecting at the AHA and chart-making in &lt;i&gt;Perspectives&lt;/i&gt;, this once seemed like a serious plan held back only by collective-action problems. But no longer. It probably gives Marc Bousquet too much credit to say he single-handedly made contractionism disreputable in early 2010. But since his series of posts, originally against Townsend, making arguments against it alternately flawed (&lt;a href="http://chronicle.com/blogs/brainstorm/at-the-aha-huh/19544"&gt;contractionism is discredited because, like Ronald Reagan, it focuses on the supply side&lt;/a&gt;)* and incisive (&lt;a href="http://howtheuniversityworks.com/wordpress/archives/239"&gt;it would take 20 years for contraction to filter to the job market&lt;/a&gt;), support for limiting the number of Ph.D.s seems weakened. 
(My own program, which is also Grafton's, enrolled 31 new students last year, up from 25 before the financial crisis; numbers across the board, though, are slightly down.)&lt;br /&gt;&lt;br /&gt;&lt;i&gt;*I would think that if anything is supply-side economics, it's the conviction that indeterminate production of Ph.D.s should be expected to create its own demand.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Even if contractionism won't work, I miss its modesty. So let me put forward a new proposal. It's quite a simple two-step plan. It involves only money we have, and doesn't rely on others creating miraculous new opportunities. (Though once we get that going properly, it could work hand in hand with those.) If we &lt;i&gt;did &lt;/i&gt;enact both of these reforms, we'd be far better off than we are now, and quite soon. &lt;span style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt; If you know anything equally innocent, cheap, easy, and effectual, let's hear it.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Step 1: &lt;b&gt;Eliminate all post-docs.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The job 'crisis' is now over thirty years old, but two things have changed in the last fifteen. One was the provision of living-wage stipends to large numbers of enrolled graduate students, which was a basically unalloyed good. The other was the appearance of post-doctoral opportunities, previously mostly confined to the sciences. I think a lot of humanists think (or more accurately, feel) that setting up a new postdoctoral program  helps the system in the same way that higher grad stipends do. Not so.&lt;br /&gt;&lt;br /&gt;To the contrary: they create a disastrous incentive structure that undoes the good of higher stipends.  
In the sciences, this has led to a situation where &lt;a href="http://www.miller-mccune.com/science/the-real-science-gap-16191/"&gt;the only sane reason to stay on the academic track is the desire for a green card&lt;/a&gt;; but at least in the sciences a Ph.D. can take 4 years. As post-doctoral positions in the humanities multiply, they push the paper qualifications of new applicants higher and higher (and their ages older and older). The ABD hire is rapidly disappearing; some entry-level jobs now require a Ph.D. 9 months &lt;i&gt;before &lt;/i&gt;the start date. For this to happen, there must be some large store of accredited individuals in a holding pen somewhere. And increasingly, those pens are post-doctoral programs.&lt;br /&gt;&lt;br /&gt;Insofar as the extra time lets hiring committees better distinguish between applicants, it's a fine thing. But it also creates perverse effects. First and foremost, it makes the tenure track job market more and more into a waiting game; the interplay between pigeon-holed job descriptions and endless benches to warm creates a system where many faculty can &lt;i&gt;eventually &lt;/i&gt;get a job if they don't let their resumes get too clogged by things that aren't teaching and publishing.&lt;br /&gt;&lt;br /&gt;This is called rationing by queue. It's familiar from Wal-Marts before Thanksgiving and Disneyworld after 8:30am, but the most famous examples are the post-Soviet bread lines. 
The prices for bread are not high enough to make the poorest starve, so scarce bread goes to those who are willing to spend the most time waiting for the bakery to be open or the shipment to arrive:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-aMi6f72D2qg/Tt5W_7QHsOI/AAAAAAAAC6w/GyZd7XXeS4c/s1600/Latvian+Bread+line.jpg" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="255" src="http://4.bp.blogspot.com/-aMi6f72D2qg/Tt5W_7QHsOI/AAAAAAAAC6w/GyZd7XXeS4c/s400/Latvian+Bread+line.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-weight: bold;"&gt;Fig. 1: Outside the Chicago Sheraton, Jan. 7, 2012&lt;/span&gt;&amp;nbsp; &lt;/div&gt;&lt;br /&gt;Now, rationing by queue can be a good thing if we want to protect a commodity from market pressures. Disneyworld would be less magical if you had to pay a variable rate for each ride; liver transplants are less just if the wealthiest can &lt;a href="http://www.ama-assn.org/amednews/2009/07/27/prsa0727.htm"&gt;buy themselves to the head of the line&lt;/a&gt;. The textbook economic solution to the oversupply of Ph.D.s would be to let the benefits associated with a tenured job (pay, security, workload, respect) plunge until finally everyone just gave up. To their credit, historians have not let this happen; but the continuing attack on all of those perks from administrators, politicians, etc., is a direct result of those market forces pushing on the profession. Rationing by queue isn't working at stopping market forces, and in the meantime we're rewarding waiting too strongly.&lt;br /&gt;&lt;br /&gt;Adding a new postdoctoral position is not baking more bread; it's buying concrete to make the sidewalk longer. True, more people can stand in line. 
But that just makes the situation worse for the next &lt;i&gt;babushka &lt;/i&gt;who wants a sandwich; anyone who has better things to do than spend nine years standing in line is going to give up. And we're out the money for a sidewalk.&lt;br /&gt;&lt;br /&gt;~~~&lt;br /&gt;&lt;br /&gt;So OK: we eliminate the post-docs. Great for the long-term prospects of the profession. But short term, all we're doing is  throwing a bunch of people onto the street. Some of them  will end up in adjuncting jobs (structurally similar to post-doctoral  positions, though less often defended), and some will be out of luck. But remember: this is about using our resources: and taking all that post-doc money plus some more, we can do good, not harm.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 2:&lt;/b&gt;&lt;b&gt; Using money from graduate and postdoctoral fellowships, initiate a massive program of buy-outs at different professional stages.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Instead of spending what little money we have to make the tenure-track market &lt;i&gt;worse&lt;/i&gt;, why not use it to make it better? &lt;br /&gt;&lt;br /&gt;A program of buy-outs would serve many of the same purposes as contraction, but would make the decision to leave a choice of the individual, and sidestep the problem of which departments would have to shrink. Those departments not training for the tenure track would be relatively unaffected; those that are would be only indirectly affected. Unlike contraction, it would immediately relieve the crush on the job market. &lt;br /&gt;&lt;br /&gt;Here's how it would work. Postdoctoral fellowships review committees would stay as they are, and would review applications with statements and letters of recommendation. But instead of paying you to come apply to jobs, they would pay you more (since it wouldn't include overhead like office space or health care) to simply check out of the profession. 
Instead of $30,000/year, plus another $10K (say) in building and health care overhead, for two years, they would pay $65,000 up front for three things:&lt;br /&gt;&lt;br /&gt;1) A firm commitment to never seek tenure-track employment in a history program.&lt;br /&gt;&lt;br /&gt;2) A commitment from each of your committee members and department chair to not send letters of recommendation on your behalf to tenure-track jobs. They would be compensated for this as well--probably around $5K for the primary advisor. The point would be to ensure compliance, and to promote conversations between students and advisors about whether they should take the option. (They also get all the departmental service points, if such things exist, that they would have received for advising the dissertation to completion). &lt;br /&gt;&lt;br /&gt;3) All of your dissertation-related intellectual property. It would be placed into a public domain (no attribution necessary) repository managed by the AHA. Archival photographs, out-of-copyright scans, notes and completed, unpublished chapters—it all goes into the digital commons. This a) makes it harder for you to welch on your commitment to leave academia; b) builds up a great base for open materials for future teaching, research, dissertation proposals, etc; and c) gives the selection committees a good incentive to choose well in buying people out. (This one might be a harder sell, so we could make it only possible on a sliding scale).&lt;br /&gt;&lt;br /&gt;Ph.D.s would not be required, but the point would be to pull people off the market who &lt;i&gt;look like prime candidates for tenure track jobs&lt;/i&gt;; that's how these reviews would work.&lt;br /&gt;&lt;br /&gt;If you are a currently enrolled grad student, you could apply for one of the buy-out fellowships; but you could also convert the remaining cash value of your stipend anytime after reaching ABD status. 
So if you had 2 years @ $25K with six course sections, say, you could deduct the marginal cost for a TA (call it 12K?), add 33% for health care and other overhead, and walk out with $50,500 and a master's degree. Once again, some money--5%?--would go to advisors. This would strongly increase everyone's incentives to talk clearly and openly about their academic prospects after reaching candidacy, just the point at which many students now see no alternative but staying the course.&lt;br /&gt;&lt;br /&gt;We might be able to open it up to junior professors, as well; nothing helps the job market as much as flushing out the people actually in the jobs. We'll have to get some econometricians on board to set the rates properly for them (which will rely mostly on their tenure prospects). Senior faculty, on the other hand, we'll leave out; universities already have a great incentive to force them into retirement and replace them with younger, cheaper hires.&lt;br /&gt;&lt;br /&gt;The reserve army of the unemployed used as contingent faculty will &lt;i&gt;not &lt;/i&gt;be eligible. But since many people who would have entered that pool will have been bought out, the buyouts will help them all immediately. Raising the market-clearing rate for qualified adjuncts will improve conditions for those who remain; and as their price goes up, in marginal cases it will allow departments to make stronger cases for new tenure-track lines. Bought-out Ph.D.s could still teach in adjunct positions; in many cases, they might find it rewarding personally or a way to tide over a rough financial period. But with the brass ring permanently revoked, many fewer would choose to.&lt;br /&gt;&lt;br /&gt;Objections:&lt;br /&gt;&lt;b&gt;1. This would lead to lots of people attending graduate school for two to three years, then pocketing the buyout money and leaving.&lt;/b&gt;&lt;br /&gt;Great! 
We've just achieved some of the benefits of contraction while still exposing more students to the possibilities of academic history, and shown the other students and faculty in those programs alternatives to the 7-year forced march towards tenure.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. The buyouts could take the best historians out of the pool, leaving us with a generation of assistant professors whose chief virtue is that they were unable to conceive of doing anything else with their lives.&lt;/b&gt;&lt;br /&gt;Maybe: but that's exactly what queue-rationing is doing right now. We already lose all people who want to save for their retirement in their 20s, live with their spouse every year of their 30s, or live in the same metro area as their aging parents in their 40s. If the goal is to get the best professors, our first priority needs to be to change the compensation scheme to &lt;i&gt;make those things possible, &lt;/i&gt;not to reinforce the status quo. In the short run, things might get a bit worse. But in the long run, we'll have a better chance of attracting and keeping the best people possible. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. There's not enough money in the system to make this work.&lt;/b&gt;&lt;br /&gt;We'll have to see. At the very least, we'll have the benefit of shutting down the post-docs. If it doesn't prove enough to make the market easier, then maybe we start putting some of that money towards internships or something instead. But it's not clear that there's money in the federal government for a new WPA, or in the broader society for collaborative public history jobs, either.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4. This is a neo-liberal call to submit to the market.&lt;/b&gt;&lt;br /&gt;To the contrary: this is a call to take control of the commanding heights of the market. The surest way to let market forces (adjunctification, falling salaries and number of tenure slots, etc.) 
decimate the profession is willful blindness to its operations; that has been the preferred choice of most humanists for the last thirty years. Why not try something new?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8428242382324857817?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8428242382324857817/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/12/fixing-job-market-in-two-modest-steps.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8428242382324857817'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8428242382324857817'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/12/fixing-job-market-in-two-modest-steps.html' title='Fixing the job market in two modest steps'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-aMi6f72D2qg/Tt5W_7QHsOI/AAAAAAAAC6w/GyZd7XXeS4c/s72-c/Latvian+Bread+line.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8315466629218364553</id><published>2012-01-05T10:01:00.004-05:00</published><updated>2012-01-17T16:23:06.961-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Featured'/><title 
type='text'>Practices, the periphery, and Pittsburg(h)</title><content type='html'>[This is not what I'll be saying at the AHA on Sunday morning, since I'm participating in a &lt;a href="http://aha.confex.com/aha/2012/webprogram/Session6143.html"&gt;panel discussion with Stefan Sinclair, Tim Sherratt, and Fred Gibbs, chaired by Bill Turkel&lt;/a&gt;. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I'd start with to show how much data we have, and how little things can have different meanings at big scales...]&lt;br /&gt;&lt;br /&gt;Spelling variations are not a bread-and-butter historical question, and with good reason.&amp;nbsp; There is nothing at stake in whether someone writes "Pittsburgh" or "Pittsburg." But precisely because spelling is so arbitrary, we only change it for good reason. And so it can give insights into power, center and periphery, and transmission. One of the insights of cultural history is that the history of practices, however mundane, can be deeply rooted in the history of power and its use. So bear with me through some real arcana here; there's a bit of a payoff. Plus a map.&lt;br /&gt;&lt;br /&gt;The set-up: until 1911, the proper spelling of Pittsburg/Pittsburgh was in flux. Wikipedia (always my go-to source for legalistic minutiae) has an exhaustive &lt;a href="http://en.wikipedia.org/wiki/Etymology_of_Pittsburgh"&gt;blow-by-blow&lt;/a&gt;, but basically, it has to do with decisions in Washington DC, not Pittsburgh itself (which has usually used the 'h'). The city was supposedly mostly "Pittsburgh" to 1891, when the new US Board on Geographic Names made it firmly "Pittsburg;" then they changed their minds, and made it once again and forevermore "Pittsburgh" from 1911 on. 
This is kind of odd, when you think about it: the government changed the name of the eighth-largest city in the country twice in twenty years. (Harrison and Taft are not the presidents you usually think of as kings of over-reach). But it happened; people seem to have changed the addresses on their envelopes, the names on their baseball uniforms, and everything else right on cue.&lt;br /&gt;&lt;br /&gt;Thanks to about 500,000 books from the Open Library, though, we don't have to accept this prescriptive account as the whole story; what did people actually do when they had to write about Pittsburgh?&lt;br /&gt;&lt;br /&gt;Here's the usage in American books:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-8BO8Pq-udu8/TwSkWkn3zxI/AAAAAAAAC7w/tlPbu_Oos_Y/s1600/Usage+of+Pittsburg+vs+Pittsburgh+in+the+19th+century.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-j9plN5SsgkM/TwTnDK7VXEI/AAAAAAAAC9c/Q8okNs6JzR8/s1600/Usage+of+%2522Pittsburgh%2522+vs+%2522Pittsburg%2522+spellings+in+American+books%252C+19th+century.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-j9plN5SsgkM/TwTnDK7VXEI/AAAAAAAAC9c/Q8okNs6JzR8/s1600/Usage+of+%2522Pittsburgh%2522+vs+%2522Pittsburg%2522+spellings+in+American+books%252C+19th+century.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-DOyc8_oM5JA/TwSpAW-Vh6I/AAAAAAAAC8U/OiWZ8SwlP_s/s1600/Usage+of+%2522Pittsburgh%2522+vs+%2522Pittsburg%2522+spellings+in+American+books%252C+19th+century.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt; &lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://1.bp.blogspot.com/-gr-K4FzZHJ8/TwSnVihA7HI/AAAAAAAAC78/tCz7XnlaaRw/s1600/Usage+of+%2522Pittsburgh%2522+vs+%2522Pittsburg%2522+spellings+in+American+books%252C+19th+century.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-GS8_ek0HeA0/TwSakIuGwlI/AAAAAAAAC7k/OTAXu-HZOoE/s1600/Pittsburgh+vs.+Pittsburg+spelling+usage+to+1922.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;What does this tell us about how practices change?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is telling us that usage was fairly confused up to about 1890; there seems to have been a push for 'Pittsburg' in the 1860s, but it somehow lost steam. When the Feds appeared in 1890, the 'h' appeared in about 80% as many books as the solo-G--not such a bad decision to go with the 'G' as some imply. But the rule made a difference: once the state intervened, 'Pittsburg' became about three times as popular as 'Pittsburgh'; and after it changed tacks in 1911, usage quickly tracked back towards 'gh.'*&lt;br /&gt;&lt;br /&gt;*[Quick guide for the perplexed: since we're comparing ratios, we have to use a log chart to get reasonable data. The dotted line at 1 shows where "Pittsburgh" and "Pittsburg" are used in equal numbers of books; equal distances above and below show proportionally more use of "Pittsburgh" and "Pittsburg" respectively.&amp;nbsp; Each dot is a year, with a trend line superimposed. And for all of these, I use number of books using the words, not word counts; we're interested in what people are doing, and letting how many times they do it into the data will just muddy it up. The color just reinforces that scale on a blue-to-white-to-red distribution. 
And remember those  colors, we're using them again later.]&lt;br /&gt;&lt;br /&gt;Now, I'm interested in not the &lt;i&gt;fact&lt;/i&gt; of linguistic change, but its &lt;i&gt;dynamics. &lt;/i&gt;What happens when the government tries to implement changes of spelling; does everyone respond equally? This is where library metadata starts to get useful.&lt;br /&gt;&lt;br /&gt;The first thing I checked here was the usage by age groups. &lt;a href="http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html"&gt;A lot of language changes by cohort displacement&lt;/a&gt;; is that the case for spelling reforms? Here's the same sort of chart I made in the spring with author age on the y-axis, and year on the x. (&lt;a href="http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html"&gt;Longer explanation here&lt;/a&gt;; I'm using a moving average to smooth instead of loess, for sanity reasons). Once again, blue is with an h, and red without one:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-LAu6uuO4Y7g/TwSqaiAIUtI/AAAAAAAAC8g/i9YinDntSuM/s1600/Usage+of+Pittsburgh+vs+Pittsburg+by+age+group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-LAu6uuO4Y7g/TwSqaiAIUtI/AAAAAAAAC8g/i9YinDntSuM/s1600/Usage+of+Pittsburgh+vs+Pittsburg+by+age+group.png" /&gt;&lt;/a&gt;&lt;/div&gt;This one is quick: there's almost no age effect. We're missing some data from the early period here, but pretty clearly the shifts in 1891 and 1911 happen across all generations, except maybe the very old. (And there, we expect to see some reprints skewing the data more.) &lt;br /&gt;&lt;br /&gt;The next place I looked for this was by stack location. 
Taking the headline LC classifications, do any patterns jump out?&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Q8m15bv5Bnw/TwSr65FbEcI/AAAAAAAAC84/Vt-bnSOjS28/s1600/Pittsburg+vs+Pittsburgh+by+library+genre.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-Q8m15bv5Bnw/TwSr65FbEcI/AAAAAAAAC84/Vt-bnSOjS28/s1600/Pittsburg+vs+Pittsburgh+by+library+genre.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-kRHN0aLqYJ4/TwSrtLC7AkI/AAAAAAAAC8s/yfVbIpjbtBE/s1600/Pittsburg+vs+Pittsburgh+by+library+genre.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;This one is more suggestive... K, the law, is curiously unresponsive to most shifts; and some areas (J, political science, and S, agriculture) seem more attached to the 'gh' spelling before 1891, perhaps because of individual institutions (the University of Pittsburgh, etc.); and C and E, both history genres, take their time accommodating to changes. But nothing seems overwhelming.&lt;br /&gt;&lt;br /&gt;Publication country, however, makes a dramatic difference: (still blue for Pittsburgh, red for Pittsburg)&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-CgJKxY_1QUE/TwStKzgOrPI/AAAAAAAAC9E/lDm-7j54oBI/s1600/Usage+of+Pittsburgh+and+Pittsburg+by+country.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-CgJKxY_1QUE/TwStKzgOrPI/AAAAAAAAC9E/lDm-7j54oBI/s1600/Usage+of+Pittsburgh+and+Pittsburg+by+country.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Only one of these countries flies the red, white and blue. 
The British don't get the memo--&lt;a href="http://books.google.com/ngrams/graph?content=Pittsburgh%2CPittsburg&amp;amp;year_start=1900&amp;amp;year_end=1950&amp;amp;corpus=6&amp;amp;smoothing=0"&gt;according to ngrams&lt;/a&gt;, it's not until the 1930s they make the switch for good. (The Bookworm database only goes to 1922, so we can't tell here). The Canadians, with many fewer books, show less of a pattern, but might be in between.&lt;br /&gt;&lt;br /&gt;I find this interesting; it says something about either the ability of information like lexicographic reform to travel across international boundaries, or about the ability of states to impose these constraints on others. Britons kept moving along in their own community of practice for two decades; why bother changing spelling? It's not as though they had to address many envelopes.&lt;br /&gt;&lt;br /&gt;But the most compelling for me lies in the patterns across US states, by location of publisher (only including the top 15 states, since &lt;a href="http://sappingattention.blogspot.com/2011/01/where-were-19c-us-books-published.html"&gt;there aren't many books from elsewhere&lt;/a&gt;):&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-NHDmbnWiZuY/TwS0iWhyo6I/AAAAAAAAC9Q/xdHQhLGY22g/s1600/Pittsburg+vs+Pittsburgh+usage+across+states.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-NHDmbnWiZuY/TwS0iWhyo6I/AAAAAAAAC9Q/xdHQhLGY22g/s1600/Pittsburg+vs+Pittsburgh+usage+across+states.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;This reveals two separate, interesting things:&lt;br /&gt;&lt;br /&gt;1) This points toward the possibility of an event that I didn't know about before. 1891 may not actually be the first time the government changed its spelling of 'Pittsburgh'; something happened quickly and decisively to shift it in Washington DC in 1871 as well. 
Looking at book titles, I'm pretty confident this is a real pattern, albeit &lt;a href="http://www.clpgh.org/exhibit/apology7.html"&gt;one the Pittsburgh sources don't mention&lt;/a&gt;. That mysterious plateau in the first chart from 1871 to 1891 towards 'Pittsburgh,' I suspect, can be directly attributed to whatever this is. Anyone want to put together a grant proposal to the Carnegie foundation? There's an opportunity for a major contribution to Pittsburghiana here!&lt;br /&gt;&lt;br /&gt;2) Of much broader interest; the spread seems to have a geographical pattern. Most states lack the quick and clear definition of Washington DC in their shifts, but if we squint for the white squares and try to group them by the date they cross over to using the 'gh' spelling, a pattern emerges:&lt;br /&gt;&lt;br /&gt;1911: DC,CT&lt;br /&gt;1912: PA,MD,OH&lt;br /&gt;1913: IN, IL,NY&lt;br /&gt;1914: WI, IA, MA, MI&lt;br /&gt;1916: MN&lt;br /&gt;1918: MO&lt;br /&gt;Post-1922: CA (A special case, along with Kansas and Oklahoma, since it has a 'Pittsburg' of its own. Still, they all are dwarfed by the real one; California's had 6,000 people in 1920.)&lt;br /&gt;&lt;br /&gt;With the exception of Connecticut (which already had a strange predilection for the 'h'), what do we see? I would say: a path that traces concentric circles away from Washington, DC (or maybe Pittsburgh itself). I trust this enough to make a map. 
If we add every state with at least 20 books using either spelling after 1910, and bump up the smoothing window to six years, we can see this on a map: when do they switch over to the newly mandated spelling?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Year majority of books published in state switched from "Pittsburg" to "Pittsburgh"&lt;/b&gt;&lt;!-- GeoChart generated in R 2.12.1 by googleVis 0.2.13 package --&gt;&lt;!-- Sat Jan  7 18:12:17 2012 --&gt;&lt;!-- jsHeader --&gt;&lt;script type="text/javascript" src="http://www.google.com/jsapi"&gt;&lt;/script&gt;&lt;script type="text/javascript"&gt;/* jsData */ function gvisDataGeoChartIDa822dc5 (){  var data = new google.visualization.DataTable();  var datajson =[ [ "US-AL",       1919 ],[ "US-CA",       1923 ],[ "US-CO",       1922 ],[ "US-CT",       1911 ],[ "US-DC",       1911 ],[ "US-GA",       1923 ],[ "US-IA",       1915 ],[ "US-IL",       1913 ],[ "US-IN",       1914 ],[ "US-KS",       1923 ],[ "US-KY",       1919 ],[ "US-LA",       1923 ],[ "US-MA",       1914 ],[ "US-MD",       1912 ],[ "US-MI",       1915 ],[ "US-MN",       1918 ],[ "US-MO",       1918 ],[ "US-NB",       1923 ],[ "US-NC",       1920 ],[ "US-NH",       1916 ],[ "US-NJ",       1911 ],[ "US-NY",       1914 ],[ "US-OH",       1913 ],[ "US-OR",       1923 ],[ "US-PA",       1911 ],[ "US-TN",       1915 ],[ "US-TX",       1923 ],[ "US-UT",       1922 ],[ "US-VA",       1918 ],[ "US-VT",       1915 ],[ "US-WA",       1918 ],[ "US-WI",       1915 ],[ "US-WV",       1915 ],[ "US-XX",       1914 ] ];data.addColumn('string','state');data.addColumn('number','TransitionYear');data.addRows(datajson);return(data);} /* jsDrawChart */ function drawChartGeoChartIDa822dc5() {  var data = gvisDataGeoChartIDa822dc5();  var options = {};options["width"] =    650;options["height"] =    420;options["region"] = "US";options["displayMode"] = "regions";options["resolution"] = "provinces";options["colorAxis"] = {colors:['DE2D26','FEE0D2']};     var chart = new 
google.visualization.GeoChart(       document.getElementById('GeoChartIDa822dc5')     );     chart.draw(data,options);    }  /* jsDisplayChart */ function displayChartGeoChartIDa822dc5(){  google.load("visualization", "1", { packages:["geochart"] });   google.setOnLoadCallback(drawChartGeoChartIDa822dc5);} /* jsChart */ displayChartGeoChartIDa822dc5(); &lt;!-- jsFooter --&gt;  //--&gt;&lt;/script&gt;&lt;!-- divChart --&gt;  &lt;div id="GeoChartIDa822dc5"  style="width: 650px; height: 420px;"&gt;&lt;/div&gt;(Anything that crosses later than 1922, I just count as 1923 here. And sorry for the legend; I don't know how to tell GoogleVis about years). &lt;br /&gt;&lt;br /&gt;Now the DC story is slightly less strong visually--Pennsylvania seems to be on top of the trend as well. (The story from the ground seems to be that Pittsburgundians were only too happy to go along with change). Knowing what we do, it's not unreasonable to see the practice as a joint effort of the city and the federal government. And indeed, the change does seem to radiate out from that central point through space--the farther out, the longer it takes the new practice to come into effect. (And the weaker it is when it does). &lt;br /&gt;&lt;br /&gt;Data for the 1891 transition work less well--in large part, the Western states seem to just like the simpler spelling no matter what, and likewise Pennsylvania the 'gh'. But we can make a less easy to read, but more accurate, map as well. 
It shows how the (log of the ratio of the) ratio of 'gh' to 'g' use changed from the aughts to the teens; bright blue (like Maryland) means it followed the federal mandates more strictly, purples show foot-dragging, and more red means that it departed from them.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;Strength of change in "Pittsburg(h)" spelling, 1900-1908 compared to 1913-1921&lt;/b&gt;&lt;/div&gt;&lt;script src="http://www.google.com/jsapi" type="text/javascript"&gt;&lt;/script&gt; &lt;script type="text/javascript"&gt;// jsData function gvisDataGeoChartID4dbc6a06 (){  var data = new google.visualization.DataTable();  var datajson =[ [ "US-AR",0.8472978604 ],[ "US-CA",0.2210739154 ],[ "US-CT",2.333210022 ],[ "US-GA",0.06367309486 ],[ "US-IA",2.182884274 ],[ "US-IL",1.774534387 ],[ "US-IN", 1.00867551 ],[ "US-KS",1.343003753 ],[ "US-KY",1.609437912 ],[ "US-LA",0.5959834321 ],[ "US-MA",1.520522011 ],[ "US-MD",2.373895371 ],[ "US-ME",1.914819562 ],[ "US-MI",1.844977308 ],[ "US-MN",0.4688891632 ],[ "US-MO",0.5795408943 ],[ "US-NH",1.017196897 ],[ "US-NJ", 1.01595686 ],[ "US-NY",1.650051872 ],[ "US-OH",1.393509015 ],[ "US-PA",1.866481647 ],[ "US-TN",2.430459443 ],[ "US-UT",-0.6668298722 ],[ "US-VA",1.428999398 ],[ "US-VT", 1.85629799 ],[ "US-WA",1.662547738 ],[ "US-WI",1.471964631 ],[ "US-WV", 1.70664024 ],[ "US-XX",1.356433509 ] ];data.addColumn('string','state');data.addColumn('number','LogofChange04to17');data.addRows(datajson);return(data);}// jsDrawChartfunction drawChartGeoChartID4dbc6a06() {  var data = gvisDataGeoChartID4dbc6a06();  var options = {};options["width"] =    650;options["height"] =    420;options["region"] = "US";options["displayMode"] = "regions";options["resolution"] = "provinces";options["colorAxis"] = {colors:['red','blue']};     var chart = new google.visualization.GeoChart(       document.getElementById('GeoChartID4dbc6a06')     );     chart.draw(data,options);    }  // jsDisplayChart function 
displayChartGeoChartID4dbc6a06(){  google.load("visualization", "1", { packages:["geochart"] });   google.setOnLoadCallback(drawChartGeoChartID4dbc6a06);}// jsChart displayChartGeoChartID4dbc6a06()&lt;!-- jsFooter --&gt;  //--&gt;&lt;/script&gt;       &lt;br /&gt;&lt;div id="GeoChartID4dbc6a06" style="height: 420px; text-align: center; width: 650px;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;I find it somewhat compelling that the two poles are Washington DC (not pictured on account of size), which uses the new spelling 20x more frequently after 1911; and Utah, which managed to use the federally mandated spelling proportionally &lt;i&gt;less&lt;/i&gt; in 1917 than it did before the government demanded it. (Of course it's Utah--and just after statehood, too! One wishes South Carolina, in all its fire-eating glory, would have published enough books to show up on the chart.) Perhaps this one is less clear than the straight transition, but there's still quite a noticeable trend from metropole to periphery. The influence of the government wanes as we get farther from its native sphere. (Not shown; the pattern is somewhat similar, with less data, for the 1880s to 1890s transition.)&lt;br /&gt;&lt;br /&gt;This, I'd argue, is pretty close to what you'd expect to see if the normative power of the federal government declines as you get further from its seat. Historians have well overplayed the center-periphery card in recent years, but they do exist; even in America, maybe the field of power declines with distance. (Can anyone think of some more terms that allow exactly this sort of application? British titles come to mind--Who says 'Sir David', and who merely "David Cannadine"?- but I don't have extensive geographic information for the UK.) Certainly, it seems like this one normative pattern does.&lt;br /&gt;&lt;br /&gt;Am I saying that this one orthographic example tells us basic things  about the nature of the fin-de-siecle American state? 
Well, not &lt;i&gt;really&lt;/i&gt;. (Although I probably would push it a bit farther than you'd like). This data is suggestive at best, misleading at worst; enough for a hunch, and not much more. But, quite seriously, the accumulation of practices like this, when we can figure out how to group them together, has the potential to tell us enormously compelling things about sources of power and the patterns of imitation in all sorts of cultural spheres. Not all practices are driven by government actors; some &lt;i&gt;are &lt;/i&gt;generational or disciplinary. There are vast numbers of subtle linguistic tics--like the spelling of Pittsburgh--that exist in the textual record; we can use them to see how practices reproduce, how influences spread.&lt;br /&gt;&lt;br /&gt;And they only work &lt;i&gt;because &lt;/i&gt;they're so hard to spot that even the people using them may not think about what they're doing. An author may remind himself about the up-to-date spelling of Pittsburgh; but it's very rare that he'd think about his relationship to the federal government before deciding which version to use. And yet, what he does reflects his participation in those fields nonetheless. These sorts of changes are like the dark matter of the historical universe; weak, tricky to spot, and not worth much in isolation; but they're everywhere. 
Cast a wide enough net, and we can use them for our ends.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8315466629218364553?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8315466629218364553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2012/01/practices-periphery-and-pittsburgh.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8315466629218364553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8315466629218364553'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2012/01/practices-periphery-and-pittsburgh.html' title='Practices, the periphery, and Pittsburg(h)'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-j9plN5SsgkM/TwTnDK7VXEI/AAAAAAAAC9c/Q8okNs6JzR8/s72-c/Usage+of+%2522Pittsburgh%2522+vs+%2522Pittsburg%2522+spellings+in+American+books%252C+19th+century.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-4555363870691013094</id><published>2011-12-16T17:01:00.000-05:00</published><updated>2011-12-19T13:40:03.549-05:00</updated><title type='text'>Genre similarities</title><content type='html'>When data exploration produces Christmas-themed 
charts, that's a sign it's time to post again. So here's a chart and a problem.&lt;br /&gt;&lt;div class="separator" style="clear: both;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both;"&gt;First, the problem. One of the things I like about the posts I did &lt;a href="http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html"&gt;on author age and vocabulary&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/05/predicting-publication-year-and.html"&gt;change&lt;/a&gt; in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn't just up and change--things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;There's nothing fundamental about author age for this--in fact, I think it probably captures what, at least at first, I would have thought were the least interesting&lt;i&gt; &lt;/i&gt;types of vocabulary change. But author age has two nice characteristics.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;1) It's straightforwardly linear, and so can be set against publication year cleanly.&lt;/div&gt;&lt;div style="text-align: left;"&gt;2) Librarians have been keeping track of it, pretty much accidentally, by noting the birth year of every book's author. 
&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Neither of these attributes is that remarkable; but the combination is.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;There are plenty of linear variables out there: I'd love to be able to see how vocabulary changes lie in time by linear variables like author income, years of schooling, or annual sales figures for books; but no one has been collecting that data. The stuff that has been collected, on the other hand, is essentially categorical--a book can be fiction, published in Philadelphia, about set theory, in English. Nobody keeps track of any of these as linear variables, though they could (it just barely mentions set theory, it has a lot of French words, etc.)&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;The trick is to make this categorical data more ordinal. Given something reasonably good at turning publication location into real-life places, for instance, you could turn geographical data into latitude-longitude pairs, or into any number of mildly interesting one-dimensional series. (Maybe the adoption of some vocabulary can be modeled well by miles from Muncie, or by city population at date of publication).&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;But &lt;a href="http://sappingattention.blogspot.com/2011/01/where-were-19c-us-books-published.html"&gt;book data just isn't strongly geographical enough&lt;/a&gt; to make those sorts of comparisons worth coding. (Newspaper data, on the other hand...) And I'm particularly interested in genre. What I'd really like is some way to make genre information univariate. One way to do this is to create new ordinal genre information through principal components analysis or something. 
But that doesn't use metadata, just the text, which seems somewhat wasteful. The best genre information we have is probably LC classification numbers; and they are frustratingly almost-ordinal. Q-R-S-T is all science-math-technology type stuff; D-E-F is history leading into the social sciences in G-H-K; and so on. But there's not really a continuous scale from A to Z. Right?&lt;br /&gt;&lt;br /&gt;I wanted to get a quick-ish handle on this, and just how similar or dissimilar the various LC classes are, and how that maps to the order they're shelved in.&lt;br /&gt;&lt;br /&gt;This is where the chart comes in. The easiest way to compare genres seemed to be comparing their word usage using cosine similarity. (To keep the data size manageable, I actually compared only words preceding the word 'are.' Good enough, hopefully; it shouldn't seriously compromise the data, but does mean that the variations are mostly about noun-usage, not word usage in general.)&lt;br /&gt;&lt;br /&gt;1 is perfect similarity, and anything below about .85 is 'not very close'--I've lumped those together. Every point is colored to show the similarity score of the genre immediately to the left against the genre below. Green is very similar, white is averagely similar, red is not very similar. You'll see a green line running through the middle--that's because every genre is identical to itself. 
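&lt;br /&gt;&lt;br /&gt;For anyone who wants the mechanics spelled out, here is a minimal Python sketch of that comparison. It assumes each genre has already been reduced to counts of the words that precede 'are'; the counts below are made-up toy numbers, and the function names (cosine_similarity, similarity_matrix, lump_low_scores) are purely illustrative, not the code behind the chart.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (1.0 = same proportions)."""
    shared = set(counts_a).intersection(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(genres):
    """Score every genre against every other genre (and against itself)."""
    names = sorted(genres)
    return {(a, b): cosine_similarity(genres[a], genres[b])
            for a in names for b in names}


def lump_low_scores(score, floor=0.85):
    """Treat everything below the floor as equally 'not very close'."""
    return max(score, floor)


# Hypothetical toy counts of words preceding "are", keyed by LC class.
genres = {
    "QH": Counter({"animals": 40, "plants": 35, "species": 25}),
    "QL": Counter({"animals": 50, "species": 30, "birds": 20}),
    "PZ": Counter({"you": 60, "they": 30, "we": 10}),
}

for pair, score in similarity_matrix(genres).items():
    print(pair, round(lump_low_scores(score), 3))
```

The diagonal of the resulting matrix is the green line: any genre scored against itself comes out at 1. Two genres with no vocabulary in common score 0, which the 0.85 floor then lumps in with everything else that is 'not very close'.&lt;br /&gt;&lt;br /&gt;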
The chart, in accidentally Christmas colors (click on the chart to enlarge, and to Wikipedia for a refresher on &lt;a href="http://en.wikipedia.org/wiki/Library_of_Congress_Classification"&gt;LC classifications&lt;/a&gt;):&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-ZaU1dqeuc7s/Tu9o2DzrL2I/AAAAAAAAC7Q/uUHDXwxu76k/s1600/Cosine+similarity+among+LC+classification+genres.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="470" src="http://3.bp.blogspot.com/-ZaU1dqeuc7s/Tu9o2DzrL2I/AAAAAAAAC7Q/uUHDXwxu76k/s640/Cosine+similarity+among+LC+classification+genres.png" width="640" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This is not one of those charts where the meaning just jumps out. But a few notes:&lt;br /&gt;&lt;br /&gt;1) There are roughly three big groupings that are relatively coherent: the social sciences and humanities, let's call them, A to PN; fiction, PQ-PZ; and the sciences, Q to Z. These map on to the LC classification scheme relatively well, so it's not a completely arbitrary mapping.&lt;br /&gt;&lt;br /&gt;2) Some genres are mostly red, meaning they're entirely &lt;i&gt;sui generis. &lt;/i&gt;Most notable is fiction, PZ, which is also the largest one in the collection, and the other P-categories; QA, math; and TK, electrical engineering.&lt;br /&gt;&lt;br /&gt;3) Some genres have green bands running all the way up and down. (Or left and right, since the chart's symmetric). Q, R, T, F, and G are like this; notably, those are all general classes. So the books classed as 'general science' or 'general technology' actually do have some lack of specificity, either individually or when averaged out, that makes them closer to random other books. That's sort of interesting.&lt;br /&gt;&lt;br /&gt;Still, it doesn't exactly look like the genres are placed in the best of all possible orders. 
AE (encyclopedias) looks more like science than like its nearest neighbors in the B category, psychology-philosophy-religion (although that category, which has always felt too much like a grab-bag to me, actually coheres very nicely in a sea of green). The early Ps, which are world literature and literary studies, look more like world history (the Ds) than they do like the bulk of fiction in PR, PS, and PZ. And so on.&lt;br /&gt;&lt;br /&gt;So, can we create a single best linear ordering? No, not really. The data has too many dimensions for that. That would be like trying to create a single ordering of the cities in North America from the distance grid in the corner of a AAA map. You could run a spectrum from San Diego to St John's Newfoundland, or from Vancouver to Miami; either would make sense, but neither would work, because the data is fundamentally two-dimensional. (I actually just tried this using lat-long coordinates; principal components analysis runs a spectrum from Providence RI to Eugene Oregon, one that Bangor and Vancouver end up in the &lt;i&gt;inside &lt;/i&gt;of.) Here, the data has many more than 2 dimensions, which makes a single useful ordering all the less likely.&lt;br /&gt;&lt;br /&gt;What we &lt;i&gt;can &lt;/i&gt;do, though, is create any number of somewhat useful orderings; to extend the analogy, the best ranking of the cities in North America for &lt;i&gt;me &lt;/i&gt;is going to be their distance from Somerville, MA. 
So we can rearrange this chart by showing the distance of various genres from QH, natural history:&lt;br /&gt;&amp;nbsp; &lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-BmyPfZ6tXuk/Tu9o6AnsLhI/AAAAAAAAC7Y/LJY3wqnzcqg/s1600/Cosine+similarity+among+library+of+congress+genres+ordered+by+distance+from+biology.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="470" src="http://3.bp.blogspot.com/-BmyPfZ6tXuk/Tu9o6AnsLhI/AAAAAAAAC7Y/LJY3wqnzcqg/s640/Cosine+similarity+among+library+of+congress+genres+ordered+by+distance+from+biology.png" width="640" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Reading down from the left, QH is identical to itself; next closest is Q (general science), then QL (zoology), QP (physiology), and so on. On the face of it, this doesn't look much better, or much worse, than the original one. We still get some nice groupings, but outside of a few helpful rearrangements close to QH (anthropology &lt;i&gt;is &lt;/i&gt;like natural history!) it's more arbitrary than the original LC ordering, and certainly not as good as &lt;a href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;the hierarchy I built using textual data&lt;/a&gt; a while back.&lt;br /&gt;&lt;br /&gt;What's potentially interesting, though, about that sort of ordering is that it lets us look at how transmission moves--or doesn't--across those similarity lines. 
We know that Q is statically similar to S, and not so to PR; when language changes, how do those similarities affect the changes that happen?&lt;br /&gt;&lt;br /&gt;So that's what's next.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-4555363870691013094?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/4555363870691013094/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/12/genre-similarities.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4555363870691013094'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4555363870691013094'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/12/genre-similarities.html' title='Genre similarities'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-ZaU1dqeuc7s/Tu9o2DzrL2I/AAAAAAAAC7Q/uUHDXwxu76k/s72-c/Cosine+similarity+among+LC+classification+genres.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-1040833759262748439</id><published>2011-11-18T19:01:00.001-05:00</published><updated>2011-11-28T12:27:04.904-05:00</updated><title type='text'>Treating texts as individuals vs. 
lumping them together</title><content type='html'>Ted Underwood has been &lt;a href="http://tedunderwood.wordpress.com/2011/11/09/identifying-the-terms-that-characterize-an-author-or-genre-why-dunnings-may-not-be-the-best-method/"&gt;talking up the advantages of the Mann-Whitney test over Dunning's Log-likelihood&lt;/a&gt;, which is currently more widely used. I'm having trouble getting M-W running on large numbers of texts as quickly as I'd like, but I'd say that his basic contention--that Dunning log-likelihood is frequently &lt;i&gt;not &lt;/i&gt;the best method--is definitely true, and there's a lot to like about rank-ordering tests.&lt;br /&gt;&lt;br /&gt;Before I say anything about the specifics, though, I want to make a more general point first, about how we think about comparing groups of texts. The most important difference between these two tests rests on a much bigger question about how to treat the two corpuses we want to compare. &lt;br /&gt;&lt;br /&gt;Are they a single long text? Or are they a collection of shorter texts, which have common elements we wish to uncover? This is a central concern for anyone who wants to algorithmically look at texts: how far can we ignore the traditional limits between texts and create what are, essentially, new documents to be analyzed? There are extremely strong reasons to think of texts in each of these ways.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The reasons to think in terms of many shorter texts are more obvious. In general, that seems to correspond better with the real world; in my case, for example, I &lt;i&gt;am &lt;/i&gt;looking at hundreds of books, not simply two corpuses; any divisions I introduce are imperfect. If one book is a novel with a character named Jack, "Jack" may appear hundreds of times in the second corpus; it would be vastly over-represented. 
That knowledge, though, doesn't lead us to any useful knowledge about the second corpus--there's nothing distinctively 'Jack'-y about all the other books in it.&lt;br /&gt;&lt;br /&gt;Now, the presumption that every document in a set actually stands on its own is quite frequently misplaced. Books frequently have chapters with separate authors, introductions, extended quotations. For instance: &lt;i&gt;Dubliners &lt;/i&gt;will be more strongly characterized by the 30 times the word "Henchy" appears than the 29 times "Dublin" appears, even though Henchy appears only in "Ivy Day in the Committee Room" and "Dublin" is much more evenly distributed across the set, since "Dublin" is a more common word in general.&lt;br /&gt;&lt;br /&gt;Even if we could create text-bins of just the right size, we wouldn't always want to do so, though. There are lots of occasions where it makes a lot of intuitive sense to group texts together as one large group. Year-ratio counts is one. Over the summer, I was talking about the &lt;a href="http://sappingattention.blogspot.com/2011/08/graphing-and-smoothing.html"&gt;best way to graph ratios per year&lt;/a&gt;, and settled on something that looked roughly like this:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-b9UKOXBpOJQ/Tjr6w8PTqQI/AAAAAAAAC2I/KpcFt7ZWHe4/s1600/Prettier+Evolution+and+Darwin+trends.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-b9UKOXBpOJQ/Tjr6w8PTqQI/AAAAAAAAC2I/KpcFt7ZWHe4/s1600/Prettier+Evolution+and+Darwin+trends.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;This is roughly the same as the ngrams or bookworm charts, but with a smoothing curve.&lt;br /&gt;&lt;br /&gt;If you think about it, though, what's presented as a trend line in all these charts is actually linking data points that consist of massively concatenated texts for each of the years. 
That spike for evolution in 1877 is possibly due to just one book--it makes us think something's happening in the culture, when it's really the Jack/Henchy problem all over again. Instead of lumping the individual years together, why don't we just create smoothing lines over all the books published as individual data points?&lt;br /&gt;&lt;br /&gt;Well, partly, it's just too much information:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-3Ytrcuubsmc/TtO7OBYDOAI/AAAAAAAAC6o/QZRwe8KIVn4/s1600/Evolution+plotted+at+the+book+level.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="301" src="http://4.bp.blogspot.com/-3Ytrcuubsmc/TtO7OBYDOAI/AAAAAAAAC6o/QZRwe8KIVn4/s400/Evolution+plotted+at+the+book+level.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Giving each book as a point doesn't really communicate anything more, and we get &lt;i&gt;less&lt;/i&gt; sense of the year to year variability than with the more abstract chart. That's not to say there aren't advantages to be gained here (it's good to get a reminder of just how many more books there are from 1890-1922, for instance), and I do think some distributional statistics might be better than the moving average for a lot of the work we're doing. But it also takes an order of magnitude or two longer to calculate, which is a real problem; and it's much easier to communicate what a concept like 'percentage of all words in that year' means than 'expected value from a fitted beta distribution of counts for all books in that year', even if the latter might be closer to what we're actually trying to measure.&lt;br /&gt;&lt;br /&gt;There are other cases where the idea of a text breaks down--with rare words, for instance, or extremely short texts. 
I'd love to just be able to use the median ratio percentage instead of the mean from the chart above; but since &lt;a href="http://bookworm.culturomics.org/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1815%2C1922%5D%2C%22counttype%22%3A%22Percentage_of_Books%22%2C%22words_collation%22%3A%22Case_Sensitive%22%2C%22smoothingSpan%22%3A%225%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22evolution%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22evolution%22%5D%2C%22category_data%22%3A%5B%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;3/4 of all books never use the word evolution at all&lt;/a&gt;, we won't get much use out of that. With much rarer words or terms--which we are often very interested in--lots of useful tools will break down entirely.&lt;br /&gt;&lt;br /&gt;How does this connect to these specific comparison algorithms?&lt;br /&gt;&lt;br /&gt;The Dunning log-likelihood test treats a corpus exactly the same as it treats a document. There is both computational and philosophical simplicity in this approach. It lets us radically simplify texts across multiple dimensions, and think abstractly about corporate authorship across whatever dimension we choose. I could compare Howells to Dickens on exactly the same metric I could compare LC classification E to LC classification F; the document is longer, but the comparison is the same. But it suffers from that Jack/Henchy problem.&lt;br /&gt;&lt;br /&gt;The Mann-Whitney test, on the other hand, &lt;i&gt;requires&lt;/i&gt; both corpus and document levels. By virtue of this, it can be much more sophisticated; it can also, though, be much more restricted in its application. It also takes considerably more computing power to calculate. 
It &lt;i&gt;might&lt;/i&gt; be possible to include some sort of Dunning comparisons in Bookworm using the current infrastructure, but Mann-Whitney tests are almost certainly a bridge too far. (Not to get too technical--but basically, Mann-Whitney requires you to load up and manipulate the entire list of books and counts for each word, sort it, and look at those lists, while Dunning lets you just scan through and add up the words as you find them; this is a lot simpler. I don't have too much trouble doing Dunning tests on tens of thousands of books in a few minutes, but the Mann-Whitney tests get very cumbersome beyond a few hundred.)&lt;br /&gt;&lt;br /&gt;Mann-Whitney also requires that you have a lot of texts, and that your words appear across a lot of books; if you compared all the copies of Time magazine from 1950 to the present to all the phone books in that period, it would conclude that "Clinton" was a phone book word, not a news word, since Bill and Hillary don't show up until late in the picture.&lt;br /&gt;&lt;br /&gt;So when is each one appropriate? This strikes me as just the sort of rule of thumb that &lt;a href="http://sappingattention.blogspot.com/2011/11/compare-and-contrast.html"&gt;we need and don't have&lt;/a&gt; to make corpus comparison more useful for humanists. 
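&lt;br /&gt;&lt;br /&gt;To make the contrast concrete, here is a small Python sketch of the "Jack" problem. It is only an illustration under stated assumptions: the Dunning score uses the two-term log-likelihood form common in corpus linguistics (which may differ from what any particular tool computes), the Mann-Whitney function is a brute-force U statistic with no significance test, and the counts are invented.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word; each corpus is one long text."""
    pooled = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled
    expected_b = total_b * pooled
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2


def mann_whitney_u(freqs_a, freqs_b):
    """U statistic: pairs where a document in A outranks one in B (ties count 0.5)."""
    u = 0.0
    for x in freqs_a:
        for y in freqs_b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical counts of "Jack" in five books per corpus, 10,000 words each.
doc_len = 10000
counts_a = [0, 0, 0, 0, 300]   # one novel with a character named Jack
counts_b = [1, 0, 2, 0, 1]     # scattered, incidental uses

g2 = dunning_g2(sum(counts_a), doc_len * len(counts_a),
                sum(counts_b), doc_len * len(counts_b))
freqs_a = [c / doc_len for c in counts_a]
freqs_b = [c / doc_len for c in counts_b]
u = mann_whitney_u(freqs_a, freqs_b)

print("Dunning G2:", round(g2, 1))
print("Mann-Whitney U:", u, "of a possible", len(freqs_a) * len(freqs_b))
```

With these toy numbers Dunning returns a score in the hundreds, flagging "Jack" as strongly characteristic of corpus A, while U comes out at 9 of a possible 25, below the midpoint of 12.5: at the level of individual books, A actually ranks slightly lower. The double loop is also a reminder of why the document-level test is the expensive one; real implementations sort and rank instead, but they still need every book's count in hand.&lt;br /&gt;&lt;br /&gt;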
Maybe I can get into this more later, but it seems like the relative strengths are something like this:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dunning&lt;/b&gt;:&lt;br /&gt;Very large corpuses (because of memory limitations)&lt;br /&gt;Very small corpuses (only a few documents) &lt;br /&gt;Rare words expected to distinguish corpuses (for instance, key names that may appear in a minority of documents in the distinguishing corpus).&lt;br /&gt;Very short documents&lt;br /&gt;Limited Computational resources&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Mann-Whitney:&lt;/b&gt;&lt;br /&gt;Medium-sized corpuses (hundreds of documents)&lt;br /&gt;Fairly common distinguishing words (appearing in most books in the corpus they are supposed to distinguish)&lt;br /&gt;Fairly long documents&lt;br /&gt;&lt;br /&gt;Also worth noting:&lt;br /&gt;&lt;b&gt;TF-IDF:&lt;/b&gt;&lt;br /&gt;Similar strengths to Dunning, but works with a single baseline set composed of many (probably hundreds, at least) documents used with multiple comparison sets.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-1040833759262748439?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/1040833759262748439/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/11/treating-texts-as-individuals-vs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/1040833759262748439'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/1040833759262748439'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/11/treating-texts-as-individuals-vs.html' 
title='Treating texts as individuals vs. lumping them together'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-b9UKOXBpOJQ/Tjr6w8PTqQI/AAAAAAAAC2I/KpcFt7ZWHe4/s72-c/Prettier+Evolution+and+Darwin+trends.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8937693129760345068</id><published>2011-11-14T15:57:00.043-05:00</published><updated>2012-01-17T16:23:06.964-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Comparisons'/><category scheme='http://www.blogger.com/atom/ns#' term='Dunning'/><category scheme='http://www.blogger.com/atom/ns#' term='Featured'/><title type='text'>Compare and Contrast</title><content type='html'>I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.&lt;br /&gt;&lt;br /&gt;Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.&lt;br /&gt;&lt;br /&gt;1. Find texts that use words, phrases, or names we're interested in.&lt;br /&gt;2. Compare individual texts or groups of texts against each other.&lt;br /&gt;3. Classify and cluster texts or words. 
(Where 'classifying' is assigning texts to predefined groups like 'US History', and 'clustering' is letting the affinities be only between the works themselves).&lt;br /&gt;&lt;br /&gt;These aren't, to be sure, completely different. I've argued before that in some cases, &lt;a href="http://sappingattention.blogspot.com/2011/09/bookworm-and-library-search.html"&gt;full-text search is best thought of as a way to create a new classification scheme and populate it with books&lt;/a&gt;. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them--the ranking inside them isn't very important). Clustering algorithms are built around models of cross-group comparisons; full-text searches often have faceted group comparisons. And so on.&lt;br /&gt;&lt;br /&gt;But as ideal types, these are different, and in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all). More recently, there's a lot of attention to 3. &lt;a href="http://www.scottbot.net/HIAL/?p=221"&gt;Scott Weingart has a good summary/literature review on topic modeling and network analysis&lt;/a&gt; this week--I think his synopsis that "they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.&lt;br /&gt;&lt;br /&gt;In any case: I'm worried that in the excitement about clustering, we're not sufficiently understanding the element in between: comparisons. 
It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there's not the same density of research in computer science that humanists can piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production-quality applications that include some forms of comparisons (WordHoard uses Dunning Log-Likelihood; I can only find relative ratios on the &lt;a href="http://portal.tapor.ca/portal/portal"&gt;Tapor page&lt;/a&gt;). But there isn't widespread adoption, generally used methodologies for search, or anything else like that. &lt;br /&gt;&lt;br /&gt;This &lt;i&gt;is &lt;/i&gt;a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we &lt;i&gt;do &lt;/i&gt;talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, 1960s and 1980s defenses of American capitalism. These are just what we teach in history---I in fact felt like I was coming up with exam or essay questions writing that last sentence.&lt;br /&gt;&lt;br /&gt;So why isn't this a more vibrant area? (Admitting one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing).&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I think the biggest reason for this is probably &lt;b&gt;legal-technical&lt;/b&gt;, and getting solved. A site like JSTOR (or Bookworm, for that matter) can set up full-text search much more easily than it can cross-corpus comparisons; one takes tenths of a second, and the other can take minutes. 
Minutes isn't very much, of course, and if it worked plenty of humanists would be happy to let their laptops plug away at the problem; but restrictions on downloading texts make that impossible. Add into the mix all the completely un-digitized texts we'd want to include in many comparisons, and there are only a few cases where it's possible. Topic modelling and search both work much better with a research model where one centralized research server provides on-demand service to lots of people who don't necessarily understand what's going on behind the scenes.&lt;br /&gt;&lt;br /&gt;Another reason is &lt;b&gt;algorithmic&lt;/b&gt;. To put it bluntly, Dunning Log-Likelihood doesn't work very well; not only does it over-represent common words, it also finds spurious differences based on one or two texts. Ted Underwood has been exploring some aspects of Mann-Whitney; but it too has its share of flaws, and in some cases, it can be much more difficult or inappropriate to implement. TF-IDF suffers some difficult translation problems when moving from comparing two parts to comparing a part to a whole. I started a few posts on these things, and hopefully, they'll see the light of day. But in general, I get the impression that there isn't a very good all-around corpus comparison tool any scholar could apply to their questions.&lt;br /&gt;&lt;br /&gt;I also suspect that there are some &lt;b&gt;cultural-psychological &lt;/b&gt;reasons. One of the things that's so appealing about the topic models and the networks is that they alleviate the feeling of being overwhelmed by unstructured information. (Or the print age, or whatever.) Topic models and network graphs put the world in order, which is very reassuring; they also create things that are very cool looking, which is very (too?) important in the web ecosystem where DH lives. 
(This site gets a lot of Google image search traffic to some &lt;a href="http://sappingattention.blogspot.com/2011/01/cluster-charts.html"&gt;cluster charts&lt;/a&gt; I made in the past, which are certainly among the least clear charts I've ever posted; there's something about untangling that sort of puzzle that people find very rewarding.)&lt;br /&gt;&lt;br /&gt;Comparisons just create word lists, and they aren't as rewarding as topic models--you just get a list of differences to sift through. And they don't deal in the same way with the whole corpus--you are much more restricted in what you work with. I think there's a bit of a tendency to think that as long as we're using computers to read texts, we might as well do all of the ones that are in good enough shape to work with--moving down for comparisons doesn't tell us much.&lt;br /&gt;&lt;br /&gt;I don't see any of these reasons as basically good ones. The inability to get digital texts to work with or text curators who allow multifaceted access is the biggest problem facing textual digital analysis. The lack of good algorithms is just more evidence that this is &lt;i&gt;our &lt;/i&gt;problem; that humanists need to be developing expertise and a feel for the data here themselves. And there's certainly no one who defends eye-candy for itself; they (and I) would only point to its usefulness for asking more interesting questions. 
But comparisons should let us do that too.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8937693129760345068?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8937693129760345068/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/11/compare-and-contrast.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8937693129760345068'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8937693129760345068'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/11/compare-and-contrast.html' title='Compare and Contrast'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8698578372141637072</id><published>2011-11-10T12:06:00.000-05:00</published><updated>2011-11-14T14:42:00.802-05:00</updated><title type='text'>Dunning Amok</title><content type='html'>A few points following up my &lt;a href="http://sappingattention.blogspot.com/2011/10/dunning-statistics-on-authors.html#comments"&gt;two&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html#comments"&gt;posts&lt;/a&gt; on corpus comparison using Dunning Log-Likelihood last month. Nur ein stueck Technik. 
&lt;br /&gt;&lt;br /&gt;Ted said in the comments that he's interested in literary diction.&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;I've actually been thinking about Dunnings lately too. I was put in mind of it by a great article a couple of months ago by Ben Zimmer&lt;strike&gt;Zimmerman&lt;/strike&gt; addressing the character of "literary diction" in a given period (i.e., Dunnings on a fiction corpus versus the broader corpus of works in the same period).&lt;/blockquote&gt;&lt;blockquote class="tr_bq"&gt;I'd like to incorporate a diachronic dimension to that analysis. In other words, first take a corpus of 18/19c fiction and compare it to other books published in the same period. Then, among the words that are generally overrepresented in 18/19c fiction, look for those whose degree of overrepresentation *peaks in a given period* of 10 or 20 years. Perhaps this would involve doing a kind of meta-Dunnings on the Dunnings results themselves.&lt;/blockquote&gt;&lt;br /&gt;I'm still thinking about this, as I come back to doing some other stuff with the Dunnings. This actually seems to me like a case where the Dunnings wouldn't be much good; so much of a Dunning score is about the sizes of the corpuses, so after an initial comparison to establish 'literary diction' (say), I think we'd just want to compare the percentages.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Specifically: say "mossy" appears 10 times out of a thousand in fiction in 1858, and 100 times out of ten thousand in 1868; and that the comparison set is constant at 10 out of ten thousand. That is to say, nothing's changed except that we have more fiction. The Dunning score will be higher for 1868, but nothing about the character of literary discourse has changed; I'd think in this case we'd just want to know that it appears ten times more often in fiction than in the baseline.&lt;br /&gt;&lt;br /&gt;Actually, this shouldn't be too hard to do, so maybe I'll just spin it up. 
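&lt;br /&gt;&lt;br /&gt;To make the "mossy" arithmetic concrete, here is a minimal sketch in R. It assumes the simplified two-term form of the log-likelihood (summing only over the word's own counts in each corpus), and the function names dunning and ratio are made up for this illustration; it isn't the code behind the numbers later in this post.&lt;br /&gt;&lt;pre&gt;
# Simplified two-term Dunning log-likelihood (G2) for a single word.
# k1, k2: counts of the word in corpus 1 and corpus 2 (both assumed nonzero)
# n1, n2: total number of words in corpus 1 and corpus 2
dunning = function(k1, k2, n1, n2) {
  e1 = n1 * (k1 + k2) / (n1 + n2)   # expected count in corpus 1
  e2 = n2 * (k1 + k2) / (n1 + n2)   # expected count in corpus 2
  2 * (k1 * log(k1 / e1) + k2 * log(k2 / e2))
}

# Plain ratio of relative frequencies: (k1 / n1) / (k2 / n2)
ratio = function(k1, k2, n1, n2) (k1 * n2) / (k2 * n1)

# Hypothetical "mossy" counts from the paragraph above:
# fiction first, the constant comparison set second
g1858 = dunning(10, 10, 1000, 10000)
g1868 = dunning(100, 10, 10000, 10000)
r1858 = ratio(10, 10, 1000, 10000)
r1868 = ratio(100, 10, 10000, 10000)

print(round(c(g1858 = g1858, g1868 = g1868), 1))
print(c(r1858 = r1858, r1868 = r1868))
&lt;/pre&gt;With these made-up counts the score roughly quadruples between the two years (about 22 versus about 85) while the ratio stays at ten, which is why the plain percentages seem like the better thing to track over time.&lt;br /&gt;&lt;br /&gt;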
[[Since I started this, Ted posted &lt;a href="http://tedunderwood.wordpress.com/2011/11/09/identifying-the-terms-that-characterize-an-author-or-genre-why-dunnings-may-not-be-the-best-method"&gt;some interesting stuff on using Mann-Whitney scores&lt;/a&gt; instead of Dunning; I'm not going to engage with that here since I already started down another road, but it's worth reading]]. &lt;br /&gt;&lt;br /&gt;It takes a lot of time using my current system to get all the books in really large genres, so as a temporary measure I'll just take small samples of about one to ten thousand books from each category. Running Dunning Stats on PZ, my favorite proxy for fiction (although it seems to slant more towards juvenilia than serious literature, which is probably a problem for Ted), lets us generate a list of particularly fictive words.&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;&lt;span class="GD40030COR ace_keyword"&gt;&amp;gt; &lt;/span&gt;&lt;span class="GD40030CCR ace_keyword"&gt;sort(-comparison)[1:49]&lt;br /&gt;&lt;/span&gt;        she         you         her          he         had         was        said &lt;br /&gt;-1206331.71 -1038374.23  -868169.97  -649760.38  -496750.13  -260813.56  -233680.58 &lt;br /&gt;        don        look          me         him         his          my          go &lt;br /&gt; -217721.48  -210904.69  -206869.61  -170743.53  -163188.28  -138647.28  -130070.66 &lt;br /&gt;       know          ll       could        eyes          up      little         but &lt;br /&gt; -119696.02  -110606.00  -106066.20  -102207.69   -99798.48   -97299.37   -97279.49 &lt;br /&gt;         ve        face        what        back        your         did         out &lt;br /&gt;  -93464.54   -91830.15   -89691.31   -83970.13   -83532.50   -82152.63   -80811.68 &lt;br /&gt;       girl       think       would        like        down        tell        went &lt;br /&gt;  -78758.80   -76680.73   -76310.96   -74294.19   -73538.03   -73109.93   
-71340.36 &lt;br /&gt;       knew         get          Oh         got        came     thought        come &lt;br /&gt;  -71009.17   -70952.46   -70407.68   -69993.36   -68648.85   -64853.88   -64777.57 &lt;br /&gt;       then       asked          do        want        just         why      moment &lt;br /&gt;  -64645.63   -63784.73   -63739.65   -61628.73   -60729.75   -59919.42   -59455.27&amp;nbsp;&lt;/pre&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;&amp;nbsp;&lt;/pre&gt;The scores appear beneath each word. Keep in mind that a Dunning score of something like 15 or 20 (it varies depending on the sample size) represents statistical significance; these are off-the-charts important, even using only one out of a hundred books for the sample. A few of these are interesting; most are pretty predictable, though. Pieces of contractions, conversational words... not so interesting, necessarily. Dunning's predilection for ultra-common words strikes again.&lt;br /&gt;&lt;br /&gt;At some point, I made a list of stopwords; if we exclude those, the overrepresented words look like this:&lt;br /&gt;&lt;br /&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;&lt;span class="GD40030COR ace_keyword"&gt;&amp;gt; &lt;/span&gt;&lt;span class="GD40030CCR ace_keyword"&gt;sort(-comparison[!(names(comparison)) %in% stopwords])[1:49]&lt;br /&gt;&lt;/span&gt;      look       eyes     little       face       girl      think       tell &lt;br /&gt;-210904.69 -102207.69  -97299.37  -91830.15  -78758.80  -76680.73  -73109.93 &lt;br /&gt;      went       knew        get        got      asked       want     moment &lt;br /&gt; -71340.36  -71009.17  -70952.46  -69993.36  -63784.73  -61628.73  -59455.27 &lt;br /&gt;      door       turn       Miss       didn      smile        saw      stood &lt;br /&gt; -58605.09  -58457.83  -58397.37  -57102.08  -56991.56  -56838.73  -56007.04 &lt;br /&gt;       sat       room       talk        man      cried      voice       hand &lt;br /&gt; -55573.10  -54001.07  -53350.13  
-52677.01  -52467.04  -51860.90  -50961.02 &lt;br /&gt;     woman       felt  something      laugh        old        boy      young &lt;br /&gt; -50533.76  -50105.91  -48859.61  -48198.84  -46296.65  -44681.84  -43998.46 &lt;br /&gt;      told     glance      night      seems        won       wait      heard &lt;br /&gt; -42303.72  -39290.72  -38343.88  -38311.51  -38138.96  -37179.80  -36854.77 &lt;br /&gt;       isn     mother   suddenly       sure     wouldn       walk       Lady &lt;br /&gt; -36733.91  -36520.81  -36255.29  -34485.38  -34112.25  -33219.47  -32421.47&lt;/pre&gt;&lt;pre class="GD40030CLR" tabindex="0"&gt;&amp;nbsp;&lt;/pre&gt;To take a step back--what this is revealing are generally &lt;i&gt;topical&lt;/i&gt; words, not necessarily as evocative as the ones that &lt;a href="http://tedunderwood.wordpress.com/2011/11/09/identifying-the-terms-that-characterize-an-author-or-genre-why-dunnings-may-not-be-the-best-method"&gt;Ted's been finding&lt;/a&gt; for poetry ("sweet", "thy", "fair", "cheek", etc.). I think that may be largely a function of the less distinctive style used for prose fiction; not totally sure. It might also be interesting to compare not against all books, but against just narratives; histories, say.&lt;br /&gt;&lt;br /&gt;Anyhow, I think this is what we need the Dunnings for: extracting a list of words that are worth analyzing a bit more by hand. With each of these, we know there's a real difference: we can then plot the degree of over-representation over time. I'm going to do this for the top 96 words. (Why 96? Why not?) So for instance, here's the plot for "smile." 
(Including "smiling," "smiles," etc.)&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Jyq1gVUNj5o/Trv-_GSCEII/AAAAAAAAC6I/Zom9dAxnjf0/s1600/Smile+Representation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-Jyq1gVUNj5o/Trv-_GSCEII/AAAAAAAAC6I/Zom9dAxnjf0/s1600/Smile+Representation.png" /&gt;&amp;nbsp;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;So "smile" drifts from being about 4x as common in fiction as in general literature in the 1840s up to about 5.5x by the teens.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;This, it happens, is quite similar to another word in shape: "glance"&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Deni5Kxelts/Trv_wwBWZHI/AAAAAAAAC6Q/qQchh2wzRs0/s1600/Glances+in+Fiction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-Deni5Kxelts/Trv_wwBWZHI/AAAAAAAAC6Q/qQchh2wzRs0/s1600/Glances+in+Fiction.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Ted's been talking a lot about these sorts of patterns on his blog. I've mostly held off on similar analysis because of the size of my corpus. But once Dunning scores have cut the list down to about 100 words that are interesting for another reason, it becomes more reasonable to cluster based on shape.&lt;br /&gt;&lt;br /&gt;So here's what I'm going to do. (This was developed for a slightly different project coming soon). 
For each of these 96 stems, I take the pattern of occurrences from 1823 to 1922: I then take the covariance matrix across that pattern, to find words that tend to move in tandem with one another. (Regardless of the absolute size of those swings.) Using kmeans clustering, I can group them into 12 (again, why not?) groups, and we can inspect the patterns for each of those groups to see what's going on. You'll want to click to expand this one.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-76Zo56HC15c/TrwB6KCXoaI/AAAAAAAAC6Y/FYfLRiAAliQ/s1600/Clusters+of+Fictive+discourse.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="481" src="http://1.bp.blogspot.com/-76Zo56HC15c/TrwB6KCXoaI/AAAAAAAAC6Y/FYfLRiAAliQ/s640/Clusters+of+Fictive+discourse.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;There's a lot of fairly interesting stuff going on here, although it may just be noise. Basically, we've got a bunch of wiggles getting classed together based on where there seem to be local peaks. (Although keep in mind that the loess regressions I use for the dark lines here are &lt;i&gt;not &lt;/i&gt;factoring into the clustering, just the year-to-year data. So when they look strikingly similar, as in the 'laugh'-'morning'-'talk'-'woman' cluster, that's actually a sort of confirmation that maybe something is going right.)&lt;br /&gt;&lt;br /&gt;So, is this useful? Some of these seem pretty straightforward, such as the contractions cluster. The thing I'm most surprised by is the lack of any cluster for general downward motion; the cluster starting with "Lady" and "Miss" is some of the way there, but most words seem to become more overrepresented in fiction over time. 
(As I write this, I'm realizing that it's because of a selection bias problem; I was choosing words statistically overrepresented in fiction, but since I have more books from the later 19C than the early 19C I ended up selecting for words that are particularly strong in the latter half.)&lt;br /&gt;&lt;br /&gt;I actually think this sort of clustering/grid plot may be useful for analyzing some other phenomena, but I figured this was as good a place as any to wheel it out.&lt;br /&gt;&lt;br /&gt;Maybe I'll actually do a bit of reading of the clusters, and the paths they take, later. I'm curious if someone else wants to. My gut feeling here is that this needs a little bit more sensitivity to the groups being compared than I'm bringing here--PZ is a funny stand-in for fiction, a lot of these effects could be driven by what _else_ is in the library besides what PZ is printing, and so on.&lt;br /&gt;&lt;br /&gt;But nonetheless, I still think it's fairly interesting to see how any selection criterion starts to pull out some interesting trends in discourse shifts.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8698578372141637072?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8698578372141637072/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/11/dunning-amok.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8698578372141637072'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8698578372141637072'/><link rel='alternate' type='text/html' 
href='http://sappingattention.blogspot.com/2011/11/dunning-amok.html' title='Dunning Amok'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Jyq1gVUNj5o/Trv-_GSCEII/AAAAAAAAC6I/Zom9dAxnjf0/s72-c/Smile+Representation.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-3757872659839296301</id><published>2011-11-03T10:48:00.002-04:00</published><updated>2011-11-03T10:53:57.021-04:00</updated><title type='text'>Theory First</title><content type='html'>Natalia Cecire recently started an important debate about &lt;a href="http://nataliacecire.blogspot.com/2011/10/when-dh-was-in-vogue-or-thatcamp-theory.html"&gt;the role of theory&lt;/a&gt; in the digital humanities. She's rightly concerned that the THATcamp motto--"more hack, less  yack"--promotes precisely the wrong understanding of what digital  methods offer:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;the whole reason DH is theoretically consequential is that the use of technical methods and tools &lt;i&gt;should be making us rethink&lt;/i&gt; the humanities. &lt;/blockquote&gt;Cecire wants a THATcamp theory, so that the teeming DHers can better describe the implications of all the work that's going on. Ted Underwood &lt;a href="http://tedunderwood.wordpress.com/2011/10/22/on-transitive-and-intransitive-uses-of-the-verb-to-theorize/"&gt;worries that claims for the primacy of theory can be nothing more than a power play&lt;/a&gt;, serving to reify existing class distinctions inside the academy; but he's willing to go along with a reciprocal relation between theory and practice going forward. 
&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;In short, we could say that smart people are converging on the entirely reasonable conclusion that DH and theory need to have a mutually beneficial relationship. Each should inform the other: the theorists who put big Theory before any empirical data need to explore all the new forms of evidence without prior conclusions, and the DHers who ignore theory entirely jeopardize not only their careers but the soundness of their conclusions. In practice, this probably means digital humanists can keep calm and carry on, with greater tolerance for the occasional French name tossed into the discussion; meanwhile the theory-inclined should know they have a seat at the new table, but not necessarily at the head. Even more hack, better yack.&lt;br /&gt;&lt;br /&gt;I've been flirting for a while with a much less reasonable point of view. As I try to figure out how to responsibly use the big trove of data I've been building up for an already-existing dissertation project, this issue of theory keeps cropping up. The view is based around two fairly tendentious convictions that seem reasonable enough to me that I want to try spelling them out:&lt;br /&gt;&lt;br /&gt;1) Work in digital humanities&lt;i&gt; &lt;/i&gt;should &lt;i&gt;always&lt;/i&gt; begin with a grounding in a theory from the humanistic traditions--if it doesn't, it is probably doomed from the start;&lt;br /&gt;&lt;br /&gt;2) The only satisfactory way to apply social/critical theory in humanities research today is to use massive stores of data digitally.&lt;br /&gt;&lt;br /&gt;That is to say, Theory and DH aren't two separate enterprises that can help each other along; these are fundamentally the same thing. 
Digital humanities that doesn't put theory first ends up not really being humanities; social theory that doesn't engage with the explanatory power and the potential for outreach of vast digital data fails to take seriously its own conviction that deeper structures are readable in the historical record.&lt;br /&gt;&lt;br /&gt;I've argued the second point elsewhere a bit, so let me focus on the first. (I should say that by theory, I mostly mean what we'd call social or critical theory, which is some set of independent and occasionally warring states in Europe. Just which ones is less important for now, though one does have to choose.)&lt;br /&gt;&lt;br /&gt;Let's say that at their core, the digital humanities are the practice of using technology to create new objects for humanistic interrogation. That's how I think of it, at least. A lot of  DH's focus is on public humanities for this reason; there is  justifiably enormous excitement about the creation of visualizations,  exhibits, and tools that let us get &lt;i&gt;non-humanists &lt;/i&gt;to think humanistically. (&lt;a href="http://sappingattention.blogspot.com/2011/06/whats-new.html"&gt;I've talked about this before&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;But there is just as much reason to be excited about the prospects of creating new texts &lt;i&gt;for humanists to read, &lt;/i&gt;texts that bear little relation to the sort of books that we are used to reading. Visualizations, search techniques, etc, aren't interpretations, they are texts in themselves. And they demand new sorts of mental gymnastics the same way that a newly discovered archive or poem does. The work of the Stanford Literature Lab seems the farthest down this road in a lot of ways. &lt;br /&gt;&lt;br /&gt;The trick is that we have to decide what new objects we want to read. Social networks, ngram trajectories, interactive maps; stuff that used to be prohibitively difficult is now quite easy. 
The technical fact of creating these new texts is not as important as figuring out what they should be. How do we decide what to make?&lt;br /&gt;&lt;br /&gt;As far as I can tell, we need to have prior beliefs about the ways the world is structured, and only ever use digital methods to try to create works which let us watch those things in operation. Some, I'm sure, would want to scream 'confirmation bias!' at this--but the wonderful thing about the humanities is that they have always allowed scholars to work from problem to evidence, not vice-versa. I don't think it's a good check on research to have to work with organizations created by large bureaucracies; one of the things that I find the most exciting about textual data is that for once we have a massive statistical store that &lt;i&gt;wasn't &lt;/i&gt;collected by a state, with all the Foucauldian intimations contemporary historians are right to fret about.&lt;br /&gt;&lt;br /&gt;Archives, libraries, censuses, atlases: all of these force us to read juxtapositions far &lt;i&gt;more &lt;/i&gt;aligned with historical ways of thinking than the reconfigurations possible with digital texts. Most historians, at least, are trained to think that this is fundamentally a &lt;i&gt;good &lt;/i&gt;thing, because it gets us out of the cognitive ruts of the contemporary world. &lt;a href="http://goosecommerce.wordpress.com/category/the-past-is-a-foreignsomething/"&gt;The past is a foreign... something&lt;/a&gt;, and travel broadens the mind. I agree to a point that's good; nothing's more important for the historian than realizing that categories that are now sundered apart were once the same.&lt;br /&gt;&lt;br /&gt;The promise and danger of the digital is that it lets us displace these texts, even though by only a hair's breadth, &lt;i&gt;out &lt;/i&gt;of the systems of the past. Where we want to put it: that's the question. 
Digital humanities would be a disaster if it simply rewrote our cultural heritage to fit neatly into present categories. That's why we need theory, which is all about reconfiguring the way we look at the world in terms of difficult-to-see structures that mask the truth: systems and lifeworld, doxa and habitus. There's a powerful significance there, and we need it.&lt;br /&gt;&lt;br /&gt;The reason that the digital humanities need to put theory first is not to pacify the powers-that-be, but to harness their own creativity towards productive ends. &lt;br /&gt;&lt;br /&gt;If it doesn't, skeptics of the digital humanities are right to worry that all's not on the straight and level. Something's fishy when a purportedly non-ideological movement shows up on the scene promising revolutionary change, particularly when so much about it looks suspiciously like the status quo. Why should the 'next big thing' in the  humanities come from the whitest, malest subfield this side of  diplomatic history? Why does it get more coverage in the &lt;i&gt;New York Times &lt;/i&gt;than other scholarship&lt;i&gt;? &lt;/i&gt;Why has it attracted the enthusiasm of state funders across agencies and states in a way that the humanities maybe haven't seen since the cold war? I often think: one of the things DH is potentially very, very good at is naturalizing the world as it is. And our reflexive ways of thinking about the world as it is are just what theory has always sought to get us away from; the nightmare from which it tries to jolt us awake.&lt;br /&gt;&lt;br /&gt;Ted Underwood says that "Theory" is "not a determinate object belonging to a particular team." I'm not sure that's quite right. Theory belongs to all sorts of teams, but they do have something fundamental in common: they're the losers. The winners don't need new perspectives to shift their perspective from the world's; the losers do. 
What good the humanities have ever done largely lies in helping the losers along.&lt;br /&gt;&lt;br /&gt;The digital humanities is perfectly poised at the moment to optimistically and beautifully affirm the world through all of history as it is now, full of progress and decentralized self-organizing networks and rational actors making free choices; or it might also try to take up what Adorno called the only responsible philosophy: to reveal the cracks and fissures of the world in all its contradictions with otherworldly light. That's the demand placed on DH by theory, and it needs to come first: &lt;a href="http://books.google.com/books?id=ZiD-I5vX-oMC&amp;amp;pg=PA247&amp;amp;dq=%22all+else+is+reconstruction,+mere+technique%22+adorno&amp;amp;hl=en&amp;amp;ei=HMGxTr2QE6ms0AGesuTlAQ&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=4&amp;amp;ved=0CD4Q6AEwAw#v=onepage&amp;amp;q&amp;amp;f=false"&gt;all else is mere technique&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-3757872659839296301?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/3757872659839296301/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/11/theory-first.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3757872659839296301'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3757872659839296301'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/11/theory-first.html' title='Theory First'/><author><name>Ben 
Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-4960268441324186952</id><published>2011-10-07T15:42:00.002-04:00</published><updated>2011-10-13T18:48:26.259-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data exploration and visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='Comparisons'/><category scheme='http://www.blogger.com/atom/ns#' term='Dunning'/><category scheme='http://www.blogger.com/atom/ns#' term='Howells'/><category scheme='http://www.blogger.com/atom/ns#' term='Literature'/><title type='text'>Dunning Statistics on authors</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;As promised, some quick thoughts broken off &lt;a href="http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html"&gt;my post on Dunning Log-likelihood&lt;/a&gt;. There, I looked at _big_ corpuses--two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital  humanists tend to rely on small sets of well curated, TEI texts, but even the ugly  wilds of machine OCR might be able to offer them some insights. 
(Sidenote--&lt;a href="http://tedunderwood.wordpress.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/"&gt;interesting post by Ted Underwood today&lt;/a&gt; on the mechanics of creating a middle ground between these two poles).&lt;br /&gt;&lt;br /&gt;As an example, let's compare all the books in my library by Charles Dickens and by William Dean Howells. (I have a peculiar fascination with WDH, regular readers may notice: it's born out of a month-long fascination with &lt;i&gt;Silas Lapham &lt;/i&gt;several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I chose them), which means lots of duplicate copies published in different years, perhaps some miscategorizations, certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?&lt;br /&gt;&lt;br /&gt;I'll present the results in faux-wordle form as discussed last time. That means I use wordle.net graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. 
What does that look like?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Words overrepresented in Dickens vs Howells:&lt;/b&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-BOExZoju8v0/To3w4riuN9I/AAAAAAAAC3s/eg5xomlgwC4/s1600/Dickens+vs+Howells.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="256" src="http://4.bp.blogspot.com/-BOExZoju8v0/To3w4riuN9I/AAAAAAAAC3s/eg5xomlgwC4/s640/Dickens+vs+Howells.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-rNKcmI7KFhg/To3rkG2CktI/AAAAAAAAC3k/kzUWY7VKEwQ/s1600/Dickens+vs+Howells.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&amp;nbsp;&lt;b&gt;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;We  get a bit of the British orthography, but less than I'd fear; and we do get a number of insights into Dickens's style  ('&lt;/span&gt;&lt;span style="font-size: small;"&gt;appearance',&lt;/span&gt;&lt;span style="font-size: small;"&gt;'merry','little','bright','eyes') as well as some  interesting social distinctions probably more reflective of the US vs.  Britain and the mid vs the late 19th century ('gentlemen,' 'gentleman,'  'wot', &lt;/span&gt;'coach').&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Words overrepresented in Howells vs. 
Dickens:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-TSKZWnlUFe4/To3w4y2xK5I/AAAAAAAAC3w/PWtyZK10HHQ/s1600/Howells+vs+Dickens.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="364" src="http://2.bp.blogspot.com/-TSKZWnlUFe4/To3w4y2xK5I/AAAAAAAAC3w/PWtyZK10HHQ/s640/Howells+vs+Dickens.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-pFWa0xm0EQY/To3rkfQE1HI/AAAAAAAAC3o/xlQehi0m9vM/s1600/Howells+vs+Dickens.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;Howells is dominated by two things compared to Dickens. First, that enormous looming 'she' that denotes a significantly larger proportion of female characters, and the clusters around it of words like "mother," "girl," "girls"; second, a string of short, common words that reflect the more pedestrian American style compared to Dickensian specificity, including a great number of fragments from contractions ("don" from "don't", "isn" from "isn't", and probably "ve" from "could've, should've, would've"?). One gets Howells' literary interests as well with 'literature,' 'literary', etc., although I'd have to further refine to see if they came from criticism or from the inclusion of novels themselves as plot points in books like &lt;i&gt;Silas Lapham. &lt;/i&gt;&lt;br /&gt;&lt;br /&gt;How is reading these texts the same as or different from comparing Dickens and Howells themselves? It's not quite what I'd have expected on the Howells side: he comes off with an Austenian directness (Jane, not J.L.) that doesn't match my expectations. 
In some ways, I'd say that the comparison tells us far more about Dickens than about Howells.&lt;br /&gt;&lt;br /&gt;Just how much becomes clear when we compare Howells to a more comparable figure, Henry James. Howells still overuses common words, but now appears overly &lt;i&gt;masculine &lt;/i&gt;in his character choices, and fond of the conjunction 'and' (while Dickens used 'and' significantly more than Howells...)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Words overrepresented in Howells vs. James &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-YKi0kH36lvQ/To3z0aNbHVI/AAAAAAAAC30/VDc9rQpMc0s/s1600/Howells+vs+James+literary+words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="316" src="http://4.bp.blogspot.com/-YKi0kH36lvQ/To3z0aNbHVI/AAAAAAAAC30/VDc9rQpMc0s/s640/Howells+vs+James+literary+words.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Words overrepresented in James vs. Howells&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-xl3wVF_fR9s/To34CpM0bvI/AAAAAAAAC34/s1nvLcrQWVg/s1600/James+compared+to+Howells.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="324" src="http://1.bp.blogspot.com/-xl3wVF_fR9s/To34CpM0bvI/AAAAAAAAC34/s1nvLcrQWVg/s640/James+compared+to+Howells.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;span style="font-size: small;"&gt;Again, I feel like this more closely captures the distinctive qualities of James (moment-companion-charming-extraordinary-view) than of Howells. Even Howells' distinctive points (lots of boys?) 
can be seen as reflections more of James's attributes (endless portraits of ladies). I'd use it as a sort of evidence for the phenomenal blankness of Howells; it's one of the reasons he's an interesting source. (Dan Rodgers once told me he read a lot of Howells one summer to get a better sense of the late 19th century, since James was just too &lt;i&gt;good &lt;/i&gt;to portray it blankly.)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;But: we're firmly in the fun-with-wordle camp right here. This is not even senior thesis material; I wouldn't count myself qualified to make good English dept. pronouncements about different authors. But I would say--the algorithm seems to be doing a reasonable job using hundreds of minimally processed books to make meaningful distinctions here, and that's for the good. Perfect OCR isn't necessary to get started on this if we know what we're looking for.&lt;br /&gt;&lt;br /&gt;Of course: what are we looking for? Some more database-building for me now, and we'll try to get there soon. 
If anybody can think of some great corpus-comparisons they'd like to see, let me know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-4960268441324186952?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/4960268441324186952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/10/dunning-statistics-on-authors.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4960268441324186952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4960268441324186952'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/10/dunning-statistics-on-authors.html' title='Dunning Statistics on authors'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-BOExZoju8v0/To3w4riuN9I/AAAAAAAAC3s/eg5xomlgwC4/s72-c/Dickens+vs+Howells.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-5220724476872332051</id><published>2011-10-06T15:36:00.000-04:00</published><updated>2011-10-06T15:36:04.552-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data exploration and visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='Comparisons'/><category 
 scheme='http://www.blogger.com/atom/ns#' term='Dunning'/><title type='text'>Comparing Corpuses by Word Use</title><content type='html'>Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on &lt;a href="http://bookworm.culturomics.org/"&gt;Bookworm&lt;/a&gt; are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.&lt;br /&gt;&lt;br /&gt;What are some interesting, large corpuses to compare? A lot of what we'll be interested in historically are subtle differences between closely related sets, so a good start might be the two Library of Congress subject classifications called "History of the Americas," letters E and F. The Bookworm database has over 20,000 books from each group. What's the difference between the two? The full descriptions could tell us: but as a test case, it should be informative to use only the texts themselves to see the difference.&lt;br /&gt;&lt;br /&gt;That leads to a tricky question. Just what does it mean to compare usage frequencies across two corpuses? This is important, so let me take this quite slowly. (Feel free to skip down to Dunning if you just want the best answer I've got.) I'm comparing E and F: suppose I say my goal is to answer this question: &lt;br /&gt;&lt;br /&gt;&lt;b&gt;What words appear the most times more in E than in F, and vice versa?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;There's already an ambiguity here: what does "times more" mean? In plain English, this can mean two completely different things. Say E and F are exactly the same overall length (eg, each has 10,000 books of 100,000 words). 
Suppose further "presbygational" (to take a nice, rare, American history word) appears 6 times in E and 12 times in F. Do we want to say that it appears two times more (ie, use multiplication), or six more times (use addition)? &lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It turns out that neither of these simple operations works all that well. In the abstract, multiplication probably sounds more appealing; but in practice it only catches extremely rare words. In our example set, here are the top words that distinguish E from F by multiplication, that is, by occurrences in E divided by occurrences in F. For example, "gradualism" appears 61x more often in E than in F.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; daimyo&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aftre&amp;nbsp;&amp;nbsp;&amp;nbsp; exercitum intransitive&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; castris&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1994a &lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; 114.00000&amp;nbsp;&amp;nbsp;&amp;nbsp; 103.00000&amp;nbsp;&amp;nbsp;&amp;nbsp; 101.00000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 82.33333&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 81.66667&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 77.00000 &lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; sherd&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; infti&amp;nbsp;&amp;nbsp; gradualism&amp;nbsp; 
imperforate&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; equitum&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; brynge &lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 71.71429&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 66.00000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 61.00000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 59.33333&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 57.00000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 56.00000 &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(BTW, I simply omit the hundreds of words that appear in E but never appear in F; and I don't use capitalized words because they tend to be _very_ highly concentrated, and in fictional works in particular can cause very strange results. Yes, that's not the best excuse.) &lt;br /&gt;&lt;br /&gt;So what about addition? Compensating for different corpus sizes, it's also pretty easy to find how many more occurrences a word has than we'd expect based on the other corpus. 
(For example, "not" appears about 1.4 million more times than we'd expect in E given the number of times it appears in F and the total number of words in E.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; to&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; that&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; the&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; not&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; had&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; it &lt;br /&gt;3432895.3 2666614.4 2093465.8 1427220.8 1360559.0 1342948.2 &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; be&amp;nbsp;&amp;nbsp; general&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; we&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; but&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; our&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; would &lt;br /&gt;1208340.5&amp;nbsp; 990988.4&amp;nbsp; 974849.0&amp;nbsp; 841842.6&amp;nbsp; 819680.5&amp;nbsp; 798426.0 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Clearly, neither of these is working all that well. Basically, the first group are so rare they don't tell us much: and the second group, with the intriguing addition of "general", are so common as to be uninformative. (Except maybe for "our"; more on that later). Is there any way to find words that are interesting on _both_ counts?&lt;br /&gt;&lt;br /&gt;I find it helpful to do this visually. Suppose we make a graph. We'll put the addition score on the X axis, and the multiplication one on the Y axis, and make them both on a logarithmic scale. Every dot represents a word, as it scores on both of these things. Where do the words we're talking about fall? 
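&lt;br /&gt;&lt;br /&gt;Before looking at the plot, it may help to pin the two scores down in code. This is only a minimal sketch in Python, not the code behind these charts: the function names and the toy counts are assumptions made up for illustration, and the "addition" score is taken to be the observed occurrences minus the number the other corpus's rate of use would predict.

```python
from collections import Counter


def ratio_score(count_a, count_b, total_a, total_b):
    """Multiplication measure: how many times more frequent a word is in A than in B.

    Undefined when the word never appears in B, so callers should skip those words.
    """
    return (count_a * total_b) / (count_b * total_a)


def excess_score(count_a, count_b, total_a, total_b):
    """Addition measure: occurrences in A beyond what B's rate of use would predict."""
    expected_in_a = count_b * total_a / total_b
    return count_a - expected_in_a


def score_vocabulary(counts_a, counts_b):
    """Return {word: (ratio, excess)} for every word that appears in both corpora."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word, count_a in counts_a.items():
        count_b = counts_b.get(word, 0)
        if count_b == 0:
            continue  # ratio is undefined; such words are omitted, as in the text
        scores[word] = (
            ratio_score(count_a, count_b, total_a, total_b),
            excess_score(count_a, count_b, total_a, total_b),
        )
    return scores


# Toy corpora of equal length (made-up counts, for illustration only)
f_counts = Counter({"presbygational": 12, "the": 500, "county": 88})
e_counts = Counter({"presbygational": 6, "the": 520, "army": 74})
for word, (ratio, excess) in score_vocabulary(f_counts, e_counts).items():
    print(word, round(ratio, 2), round(excess, 1))
```

The usage lines rerun the "presbygational" example (6 uses in E, 12 in F, corpora of equal length): the multiplication score says twice as often in F, and the addition score says six extra occurrences.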
&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-gjVv_rD25JE/TozQTWhVfMI/AAAAAAAAC3Q/gZHjE9MFICE/s1600/multiplication+addition+comparison.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="452" src="http://1.bp.blogspot.com/-gjVv_rD25JE/TozQTWhVfMI/AAAAAAAAC3Q/gZHjE9MFICE/s640/multiplication+addition+comparison.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;This nicely captures our dilemma. The two groups are in opposite corners, and words don't ever score highly on both. For example, "general" appears about 1,000,000 more times in class E than we'd expect from class F, but only about 1.8x as often; &lt;a href="http://en.wikipedia.org/wiki/Sherd"&gt;sherd&lt;/a&gt; appears about 60x more often in class E, but that adds up to only 500 extra occurrences compared to expectations, since it's a much rarer word overall.&lt;br /&gt;&lt;br /&gt;(BTW, log-scatter plots are fun. Those radiating spots and lines on the left side have to do with the discreteness of our set; a word can appear 1 time in a corpus or twice in a corpus, but it can't appear 1.5 times. So the leftmost line is words that appear just once in F: the single point farthest left, at about (1.1,2.0) is words that appear twice in E and once in F; a little above it to the right is words that appear three times in E and once in F; directly to its right are words that appear four times in E and twice in F; etc.)&lt;br /&gt;&lt;br /&gt;One possible solution would be to simply draw a line between "daimyo" and "that", and assume that words are interesting to the degree that they stick out beyond that line. 
That gives us the following word list, placed on that same chart:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-vNTk3GURBk0/Toz1kEFrMTI/AAAAAAAAC3U/qocFwwOT-5w/s1600/Ben%2527s+Log+Likelihood.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="455" src="http://2.bp.blogspot.com/-vNTk3GURBk0/Toz1kEFrMTI/AAAAAAAAC3U/qocFwwOT-5w/s640/Ben%2527s+Log+Likelihood.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-gjVv_rD25JE/TozQTWhVfMI/AAAAAAAAC3Q/gZHjE9MFICE/s1600/multiplication+addition+comparison.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;...which is a lot better. The words are specific enough to be useful, but common enough to be mostly recognizable. Still, though, the less frequent words seem less helpful. Are "sherd" and "peyote" and "daimyo" up there because they really characterize the difference between E and F, or because a few authors just happened to use them a lot? And why assume that "that" and "daimyo" are equally interesting? Maybe "that" actually _is_ more distinctive than daimyo, or vice-versa.&lt;br /&gt;&lt;br /&gt;To put it more formally: words to the left tend to be rarer  (for a word to have 100,000 more occurrences than we'd expect, it has to be  quite common to begin with); and there are a lot more rare words than common words. So  by random chance, we'd expect to have more outliers on the top of the graph than on the bottom. 
&lt;a href="http://bookworm.culturomics.org/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1815%2C1922%5D%2C%22counttype%22%3A%22Percentage_of_Books%22%2C%22words_collation%22%3A%22Case_Sensitive%22%2C%22smoothingSpan%22%3A%225%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22daimyo%22%5D%2C%22lc0%22%3A%5B%22E%22%5D%7D%2C%7B%22word%22%3A%5B%22daimyo%22%5D%2C%22lc0%22%3A%5B%22F%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22daimyo%22%5D%2C%22category_data%22%3A%5B%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22E%22%5D%5D%5D%2C%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22F%22%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;By using Bookworm to explore the actual texts&lt;/a&gt;, I can see that "daimyo" appears so often in large part because Open Library doesn't recognize that these &lt;a href="http://openlibrary.org/books/OL13516236M/The_early_diplomatic_relations_between_the_United_States_and_Japan_1853-1865"&gt;two&lt;/a&gt; &lt;a href="http://openlibrary.org/books/OL7047991M/The_early_diplomatic_relations_between_the_United_States_and_Japan_1853-1865"&gt;books&lt;/a&gt; are the same work. 
Conversely, that "our" appears 20% more often in E than in F is quite significant; looking at the chart, it &lt;a href="http://bookworm.culturomics.org/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1815%2C1919%5D%2C%22counttype%22%3A%22Occurrences_per_Million_Words%22%2C%22words_collation%22%3A%22Case_Insensitive%22%2C%22smoothingSpan%22%3A%220%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22our%22%5D%2C%22lc0%22%3A%5B%22E%22%5D%7D%2C%7B%22word%22%3A%5B%22our%22%5D%2C%22lc0%22%3A%5B%22F%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22our%22%5D%2C%22category_data%22%3A%5B%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22E%22%5D%5D%5D%2C%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22F%22%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;seems to actually hold true&lt;/a&gt; across a long period of time. If this is a problem with 20,000 books in each set, it will be far worse when we're using smaller sets.&lt;br /&gt;&lt;br /&gt;That would suggest we want a method that takes into account the possibility of random fluctuations for rarer words. One way to do this is a technique called, after its inventor, &lt;a href="http://wordhoard.northwestern.edu/userman/analysis-comparewords.html"&gt;Dunning's log-likelihood statistic&lt;/a&gt;. I won't explain the details, except to say that like our charts it uses logarithms and that it is much more closely related to our addition measure than to the multiplication one. 
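&lt;br /&gt;&lt;br /&gt;For readers who do want a peek at the details, here is a rough sketch of the statistic in Python. It assumes the common textbook form of the log-likelihood ratio (often written G2) over a two-by-two table: this word versus all other words, in corpus A versus corpus B. It may not match any particular tool's implementation, and the function names and toy counts are invented for illustration. The sign convention in signed_g2 (positive when the word is relatively more frequent in A) is likewise an assumption, added only so the two directions can be told apart.

```python
import math


def _cell(observed, expected):
    # Contribution of one table cell; an empty cell adds nothing (0 * ln 0 is taken as 0).
    if observed == 0:
        return 0.0
    return observed * math.log(observed / expected)


def dunning_g2(count_a, count_b, total_a, total_b):
    """Log-likelihood ratio for one word: 2 * sum of observed * ln(observed / expected)."""
    word_total = count_a + count_b
    grand_total = total_a + total_b
    other_a = total_a - count_a          # every other word in corpus A
    other_b = total_b - count_b          # every other word in corpus B
    other_total = other_a + other_b
    return 2.0 * (
        _cell(count_a, total_a * word_total / grand_total)
        + _cell(count_b, total_b * word_total / grand_total)
        + _cell(other_a, total_a * other_total / grand_total)
        + _cell(other_b, total_b * other_total / grand_total)
    )


def signed_g2(count_a, count_b, total_a, total_b):
    """Same score, but negative when the word is relatively more frequent in B."""
    direction = count_a * total_b - count_b * total_a
    return math.copysign(dunning_g2(count_a, count_b, total_a, total_b), direction)


# Two made-up words in two corpora of a billion words each
size = 10**9
print("rare word, 60x as frequent in A:", dunning_g2(60, 1, size, size))
print("common word, 20% more frequent in A:", dunning_g2(1_200_000, 1_000_000, size, size))
```

With these toy numbers the common word, used only 20% more often, gets a far larger score than the rare word used 60 times as often, which is the leaning toward the addition measure described above.&lt;br /&gt;&lt;br /&gt;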
On our E vs F comparison, it turns up the following word-positions (in green) as the 100 most significantly higher in E than F:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-v3IlGm7fXCk/Toz9qIkckmI/AAAAAAAAC3Y/8p7Htulf6GU/s1600/Dunning+Log-likelihood+demonstration.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-v3IlGm7fXCk/Toz9qIkckmI/AAAAAAAAC3Y/8p7Htulf6GU/s1600/Dunning+Log-likelihood+demonstration.png" /&gt;&lt;/a&gt;&lt;/div&gt;Dunning's log-likelihood uses probabilistic statistics to approximate a chi-square test; as a result, the words it identifies tend to come from the most additively over-represented, but it also gives some credit for multiplication. All of the common words from our initial sets of 12 additive words, and none of the rare ones, are included. It includes about half of the words my naive straight-line method produced: even "skirmisher", which seemed to clump with the more common words, isn't frequent enough for Dunning to privilege it over a blander word like "movement".&lt;br /&gt;&lt;br /&gt;Is this satisfying? I should maybe dwell on this longer, because it really matters. Dunning's is the method that seems to be most frequently used by digital humanities types, but the innards aren't exactly what you might think. In MONK, for example, the words with the highest Dunning scores are represented as bigger, which may lead users to think Dunning gives a simple frequency count. It's not--it's fundamentally a probability measure. We can represent it like it has to do with frequency, but it's important to remember that it's not. (Whence the curve on our plot). &lt;br /&gt;&lt;br /&gt;Ultimately, what's useful is defined by results. And I think that the strong showing of common words can be quite interesting. 
This ties back to my point a few months ago that &lt;a href="http://sappingattention.blogspot.com/2011/04/stopwords-to-wise.html"&gt;stopwords carry a lot of meaning in the aggregate&lt;/a&gt;. If I didn't actually find the stopwords useful, I'd be more inclined to put some serious effort into building my own log-difference comparison like the straight line above; as it is, I'm curious if anyone knows of some good ones.&lt;br /&gt;&lt;br /&gt;As for results, here's what Dunning's test turns up in series E and in series F, limiting ourselves to uncapitalized words among the 200,000 most common in English:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Significantly overrepresented in E, in order&lt;/b&gt;&lt;b&gt;:&lt;/b&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[1] "that"           "general"        "army"           "enemy"         &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[5] "not"            "slavery"        "to"             "you"           &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[9] "corps"          "brigade"        "had"            "troops"        &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[13] "would"          "our"            "we"             "men"           &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[17] "war"            "be"             "command"        "if"            &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[21] "slave"          "right"          "it"             "my"            &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[25] "could"          "constitution"   "force"          "what"          &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[29] 
"wounded"        "artillery"      "division"       "government"   &lt;/div&gt;&lt;br /&gt;&lt;b&gt;Significantly overrepresented in F, in order:&lt;/b&gt; &lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[1] "county"         "born"           "married"        "township"      &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[5] "town"           "years"          "children"       "wife"          &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[9] "daughter"       "son"            "acres"          "farm"          &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[13] "business"       "in"             "school"         "is"            &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[17] "and"            "building"       "he"             "died"          &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[21] "year"           "has"            "family"         "father"        &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[25] "located"        "parents"        "land"           "native"        &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[29] "built"          "mill"           "city"           "member"   &lt;/div&gt;&lt;br /&gt;At a first pass, that looks like military history (E) versus local history (F).&lt;br /&gt;&lt;br /&gt;At a second pass, we'll notice 'constitution' and 'government' in E and 'he' and 'parents' in F and realize that F might include biographies as well as local histories, and that E probably includes a lot of legal and other forms of national histories as well. 
The national words might not have turned up in my straight-line test, which seemed intent on finding all sorts of rarer military words ("skirmishers", for example).&lt;br /&gt;&lt;br /&gt;Looking at the official LC classification definition (&lt;a href="http://www.loc.gov/aba/cataloging/classification/lcco/lcco_ef.pdf"&gt;pdf&lt;/a&gt;), that turns out mostly to be the case. (Except for biography--that ought to mostly be in E. That it isn't is actually quite interesting.) So this is reasonably good at giving us a sense of the differences between corpuses as objectively defined. So far, so good.&lt;br /&gt;&lt;br /&gt;But these lists a) aren't engaging, and b) don't use frequency data. How can we fix that? I never thought I'd say this, but: let's &lt;a href="http://www.wordle.net/"&gt;wordle&lt;/a&gt;! Wordle in general is a heavily overrated form of text analysis; Drew Conway has a nice post from a few months ago criticizing it because it doesn't &lt;a href="http://www.drewconway.com/zia/?p=2624"&gt;use a meaningful baseline of comparison, and uses spatial arrangement arbitrarily.&lt;/a&gt; Still, it's super-engaging, and possibly useful. We can make use of the Dunning data here to solve the first problem though not the second. Unlike in a normal Wordle, where size is frequency, here size is Dunning score: and the word clouds are &lt;i&gt;paired&lt;/i&gt;, so each one represents one of the two ends of a comparison. 
Here's a graphic representing class E:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-bFq3kc1NGzs/To3UjOE22EI/AAAAAAAAC3c/4bqdPNLxmFA/s1600/LC+Classificatin+E+words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="378" src="http://3.bp.blogspot.com/-bFq3kc1NGzs/To3UjOE22EI/AAAAAAAAC3c/4bqdPNLxmFA/s640/LC+Classificatin+E+words.png" width="640" /&gt;&amp;nbsp;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;And then Class F: &lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Y6nQnDqdq8s/To3Ujui8YHI/AAAAAAAAC3g/jMPaYa5_ggA/s1600/Class+F+Distinguishing+words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="272" src="http://3.bp.blogspot.com/-Y6nQnDqdq8s/To3Ujui8YHI/AAAAAAAAC3g/jMPaYa5_ggA/s640/Class+F+Distinguishing+words.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;(We could also put them together and color-code like MONK does, but I think it's easier to get the categories straight by splitting them apart like this). One nice thing about this is that the statistical overrepresentation of 'county' in class F really comes through.&lt;br /&gt;&lt;br /&gt;On some level, this is going to seem unremarkable--we're just confirming that the LC description does, in fact, apply. But a lot of interesting thoughts can come from the unlikely events in here. For example, 'our' and 'we' are both substantially overrepresented in the national histories as opposed to the local histories. (BTW, I should note somewhere that both E and F include a fair number of historical _documents_, speeches, etc., as well as histories themselves. Here, I'm lumping them all together.) 
There's no reason this should be so--local histories are often the most intensely insular. &lt;br /&gt;&lt;br /&gt;Is there a historical pattern in the first-person plural? &lt;a href="http://bookworm.culturomics.org/?%7B%22query%22%3A%7B%22index%22%3A0%2C%22time_measure%22%3A%22year%22%2C%22time_limits%22%3A%5B1815%2C1922%5D%2C%22counttype%22%3A%22Occurrences_per_Million_Words%22%2C%22words_collation%22%3A%22Case_Sensitive%22%2C%22smoothingSpan%22%3A%225%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22we%22%2C%22our%22%5D%2C%22lc0%22%3A%5B%22E%22%5D%7D%2C%7B%22word%22%3A%5B%22we%22%2C%22our%22%5D%2C%22lc0%22%3A%5B%22F%22%5D%7D%5D%7D%2C%22terms%22%3A%5B%22we%2Cour%22%5D%2C%22category_data%22%3A%5B%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22E%22%5D%5D%5D%2C%5B%5B%22state%22%2C%5B%5D%5D%2C%5B%22lc1%22%2C%5B%5D%5D%2C%5B%22country%22%2C%5B%5D%5D%2C%5B%22lc0%22%2C%5B%22F%22%5D%5D%5D%5D%2C%22comparison%22%3A%22texts%22%7D"&gt;Bookworm says yes, emphatically&lt;/a&gt;--in a quite interesting way. "We's" and "Our's" are similar across E and F in the early republican period, and then undergo some serious wiggling starting around the Civil War; that leads to a new equilibrium around 1880 with E around its previous height, and F substantially lower.&lt;br /&gt;&lt;br /&gt;Now, there doesn't &lt;i&gt;have&lt;/i&gt; to be an interesting historical explanation for this. Maybe it's just about memoirs switching from F to E, say. But there might be: we could use this sort of data as a jumping-off point for some explorations of nation-building and sectionalism. For example, clicking on the E results around 1900 gives books that use the words 'we' and 'our' the most. One thing I find particularly interesting there is the presence of many books that, by the titles at least, I'd categorize as African-American racial uplift literature. (That's a historian's category, of course, not a librarian's one). 
If we were to generalize that, it might suggest the rise of several forms of authorial identification with national communities (class, race, international, industrial) in the late nineteenth century, and a corresponding tendency to &lt;i&gt;not&lt;/i&gt; necessarily see local history as first-person history. Once we start to investigate the mechanics of &lt;i&gt;that, &lt;/i&gt;we can get into some quite sophisticated historical questions about the relative priority of different movements in constructing group identities, connections between regionalism in the 1850s vs. (Northern?) nationalism in the 1860s, etc.&lt;br /&gt;&lt;br /&gt;We aren't restricted to questions where the genres are predefined by the Library of Congress. There is a _lot_ to do with cross-corpus comparisons in a library as large as the Internet Archive collection. We can compare authors, for example: I'll post that bit tomorrow.&lt;br /&gt;&lt;br /&gt;This isn't stuff that Martin and I could integrate into Bookworm right away, unfortunately. It simply takes too long. The database driving Bookworm can add up the counts for any individual word in about half a second; it takes more like two minutes to add up all the words in a given set of books. For a researcher, that's no time at all; but for a website, it's an eternity. 
That's true both from the user end (no one will wait that long for data to load) and from the server end (we can't handle too many concurrent queries, and longer queries mean more concurrent ones).&lt;br /&gt;&lt;br /&gt;But Wordle clouds and UI issues aside, the base idea has all sorts of applications I'll get into more later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-5220724476872332051?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/5220724476872332051/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5220724476872332051'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5220724476872332051'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html' title='Comparing Corpuses by Word Use'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-gjVv_rD25JE/TozQTWhVfMI/AAAAAAAAC3Q/gZHjE9MFICE/s72-c/multiplication+addition+comparison.png' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8422910833195120453</id><published>2011-09-30T10:58:00.000-04:00</published><updated>2011-09-30T11:07:16.394-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Bookworm'/><title type='text'>Bookworm and library search</title><content type='html'>We just launched a new website, &lt;a href="http://bookworm.culturomics.org/"&gt;Bookworm&lt;/a&gt;, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let's start with the way it helps you find library books. &lt;br /&gt;&lt;br /&gt;Google Ngrams, which Bookworm in many ways resembles, was fundamentally about words and their histories; Bookworm tries to place texts much closer to the center instead. At their hearts, Ngrams uses a large collection of texts to reveal trends in the history of words; Bookworm lets you use words to discover the history of different groups of books--and by extension, their authors and readers.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;This means that rather than presenting one particular set of texts as the best way to understand words, it starts with a library that you yourself narrow down. &lt;a href="http://languagelog.ldc.upenn.edu/nll/?p=3449"&gt;As Mark Liberman rightly notes&lt;/a&gt;, that the flaws in the catalog are foregrounded is a feature, not a problem. (It would be relatively easy to throw out most duplicate works and misdated serials, but for now I like that Bookworm accurately reflects the full catalog driving it, warts and all.)&lt;br /&gt;&lt;br /&gt;As a tool for finding and thinking about books from words, Bookworm in fact straddles the space between something like Ngrams and the more traditional library catalog. 
So one useful way of describing what it does, I think, is to think about it in that context.&lt;br /&gt;&lt;br /&gt;There are a lot of ways to find a book in the library about a topic you find interesting without ever cracking a spine. Let me walk through a few to explain where digital browsing takes us:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1) Use the subject headings in a library card catalog.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Subject headings are the best resource for a particular topic, but a lot of the time they won't work; your subject may not exist in the catalog, you may not know what it's called, and librarians may not have assigned some relevant books to the subject heading you're using.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;2) Find one book in the stacks that you find interesting, and see what's next to it on the shelves.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This, for better or worse, has always been the mainstay of my library use. For unscanned/in copyright books, I've spent a fair amount of time flipping through the index of every book on a shelf. And shelves work well; the criteria for classification are relevant enough that most readers will find a few promising leads in the stacks (particularly for non-fiction). Unlike subject catalogs, this is a relational way of finding books--rather than starting from a fixed idea of your subject, you let curation decisions guide your search outward. Next to the book you want may be another book by the same author on the same subject; on adjacent shelves are similar topics; etc.&lt;br /&gt;&lt;br /&gt;This is great for finding books to compare to the one that led to the stacks; but it's littered with problems. LC classifications are a peculiar hierarchy of knowledge, and call numbers are one-dimensional, unlike subject headings (where a book can simultaneously be about 19th-century Russia &lt;i&gt;and &lt;/i&gt;be fiction). 
And individual libraries have funny exceptions; older books at Princeton are shelved according to a different cataloging system, for example, and many are shipped to two different forms of offsite storage. These problems can lead to unexpected discoveries; but they can keep two closely related books separated forever. &lt;br /&gt;&lt;br /&gt;So this is the pre-digitized library; even electronic card catalogs don't change this balance in any important way. But the advent of full-text search for books creates a major new entrant that we're just starting to use:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3) Search the full text of thousands of books for a word or phrase of interest.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;For scholarly journals and newspapers, two fields where full-text search is older than for books, this is probably the most important way of finding texts. For most purposes, full-text search obliterates method (1) above; where before you had to find a subject vaguely connected to your interests, now you can identify your topic as precisely as you can describe it in language.&lt;br /&gt;&lt;br /&gt;I'd argue, though, that full-text searching may be less revolutionary than it seems. If, as I argued earlier, &lt;a href="http://sappingattention.blogspot.com/2011/09/is-catalog-information-really-metadata.html"&gt;we think of full-text indexes as more catalog information, not as the book itself,&lt;/a&gt; text searches seem rather less novel; in some ways, they just allow you to create a new subject heading at a moment's notice, oriented around whatever phrase you enter in. Since much historical work is based around researching subjects who only glancingly appear in the historical record, this can be incredibly useful. But the traditional list of results in response to a search query is in many ways no different than flipping through a section of a card catalog. 
(With the non-trivial exception that your search results are ranked according to relevance).&lt;br /&gt;&lt;br /&gt;This means that full-text search does far less to replace method (2) above than it does to replace method (1). It's good at finding particular things; it's far less good at revealing relational swathes. For the most part, I find talk of scholarly 'serendipity' a bit puddleheaded, but I think that the condescension at keyword searches might capture something quite real on this point. How can computers help improve on the experience of stack-browsing in the same way that they improve on the experience of subject-browsing?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4) Organize the library according to your personal principles, and browse it from arbitrary points.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is where we need to go. Bookworm presents one set of ways for reordering the library based on the principle that language is constrained by the fields of its utterance--geographical (publication place), disciplinary (LC classification), temporal (publication year), even autobiographical (author age). The line chart that a search creates is a representation of overall trends; but it is also, taken point by point, an enormous collection of books. If you search for a term by author age and publication place, Bookworm is reordering the collection of the Open Library (a lot of it, anyway) into chunks divided by author age and place, showing you information about each one of those chunks, and inviting you to dive into a particular one to find the books matching your term.&lt;br /&gt;&lt;br /&gt;There still is a bit of the search engine in ordering the results after you choose your basket; but choosing _where to start browsing_ is driven by different principles entirely. 
You might want to find peaks for word usage; you might want to see early usage of a phrase without knowing what 'early' means; or you might just want to chop up the library in a way that leads you to books you wouldn't otherwise find. (I was pleased to see Liberman, shortly after his post, using Bookworm to &lt;a href="http://languagelog.ldc.upenn.edu/nll/?p=3450"&gt;stumble onto strange documents from the past&lt;/a&gt; and comment on them; personally, I've found lots of fascinating short documents turned up by random searches for place names.)&lt;br /&gt;&lt;br /&gt;This presumes, of course, that you care about the metadata categories that we're serving. And maybe you don't. But that highlights one of the biggest challenges facing digital libraries; how better to collect and integrate different forms of metadata that will assist users in browsing their collections in ways that make sense for them while keeping an orientation that lets patrons have a sense of &lt;i&gt;exploring the collection &lt;/i&gt;to answer questions that it creates, rather than of being told which books best fulfill their awkwardly-phrased request.&lt;br /&gt;&lt;br /&gt;Though Bookworm is one way of making a library explorable on new dimensions, there should be lots of them. Some would start with maps or bubble charts instead of line charts; all of them will use different metadata in different ways. (Though all of them should be using wordcount metadata to supplement catalog metadata, which we haven't yet figured out how to integrate into Bookworm in a non-confusing way).&lt;br /&gt;&lt;br /&gt;These will be as different from traditional search engines as browsing the stacks is from using a card catalog. After a lot of discussion we decided to keep the text search box front and center on Bookworm because that's how people know how to search; but going forward, it doesn't need to be there at all. 
You should be able to search by networks of words, or by constellations that build out from particular books, by books that show the _fewest_ matches for your terms, etc. (One thing I've learned is that coding these is &lt;i&gt;far &lt;/i&gt;easier than making them available from a UI perspective).&lt;br /&gt;&lt;br /&gt;In all these cases, &lt;a href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;the ways that a library gets reorganized can tell us things about the past we might not otherwise see&lt;/a&gt;. The one thing they have in common, though, is that they will have to move away from the single search box and the ordered list towards a multifaceted way of playing all the information we have about books against each other.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8422910833195120453?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8422910833195120453/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/09/bookworm-and-library-search.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8422910833195120453'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8422910833195120453'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/09/bookworm-and-library-search.html' title='Bookworm and library search'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-5181954010505299079</id><published>2011-09-05T13:46:00.000-04:00</published><updated>2011-09-05T13:46:13.141-04:00</updated><title type='text'>Is catalog information really metadata?</title><content type='html'>We've been working on making a different type of browser using the Open Library books I've been working with to date, and it's raised an interesting question I want to think through here.&lt;br /&gt;&lt;br /&gt;I think many people looking at word counts on a large scale right now (myself included) have tended to make a distinction between wordcount data on the one hand, and catalog metadata on the other. (I know I have the phrase "catalog metadata" burned into my reflex vocabulary at this point--I've had to edit it out of this very post several times.) The idea is that we're looking at the history of words or phrases, and the information from library catalogs can help to split or supplement that. So for example, my big concern about the ngrams viewer when it came out was that it included only one form of metadata (publication year) to supplement the word-count data, when it should really have titles, subjects, and so on. But that still assumes that word data--catalog metadata is a useful binary.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I'm starting to think that it could instead be a fairly pernicious misunderstanding.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The argument for this is that words aren't the base unit of measure at all. What we really care about are texts (which &lt;a href="http://sappingattention.blogspot.com/2011/01/basic-search.html"&gt;aren't necessarily books&lt;/a&gt;, but it doesn't really hurt to think of them that way). 
Thanks to librarians, we have a number of pieces of information about each book--&lt;a href="http://sappingattention.blogspot.com/2011/01/where-were-19c-us-books-published.html"&gt;where it was written&lt;/a&gt;, &lt;a href="http://sappingattention.blogspot.com/2011/03/author-ages.html"&gt;how old its author was&lt;/a&gt;, etc. And thanks to computers, we can store thousands of new pieces of information about books that relate to their vocabulary: how many times it uses the word 'science', how many times it uses any form of the word 'evolution', how many words it has overall.&lt;br /&gt;&lt;br /&gt;All these pieces of information are variables in the same data set. You can call them metadata or data, depending on what you think a book is--but it's misleading to call one of them data and the other metadata. Pretending that word counts are 'data' and the rest are 'metadata' could promote at least four significant mistakes: &lt;br /&gt;&lt;br /&gt;1. &lt;b&gt;We end up at once getting too hung up on word percentages as, in themselves, meaningful&lt;/b&gt;. But a word percentage has a very obscure meaning outside of the book it's in. Treating each year as a long text that has its own percentages for each word is problematic--it might make more sense, for example, to take the average for all books in that year, or some characterization of a beta distribution for the spread, or something else. (I have a post on this from last month in the hopper). All the people on Twitter who searched for "love" vs. "war" as if it meant something profound about human nature are only the most obvious example of this problem. The jump from this figure to 'fame,' for example, is very problematic, because it doesn't take any of the other metadata into account besides year. 
To build up a plausible proxy for fame, you'd need either some measure of which books are more popular, which we don't have (although see my third point, below), or at least some sense of how to weight different subject categories against each other, since &lt;a href="http://ngrams.googlelabs.com/graph?content=Derrida%2CThe+Beatles%2CBeatles%2CJacques+Derrida&amp;amp;year_start=1800&amp;amp;year_end=2008&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;some subjects breed books like rabbits and others don't&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;When we talk about the history of words, we're usually using them as imperfect proxies for the history of concepts. And when we talk about the history of concepts, we're really talking about the history of groups of people--what they believed, how that changed, who influenced whom. Word percentages tell us meaningful things about these questions, for the most part, only as far as we can define the groups of people we're interested in--and that's what catalog information is good for. Eliding the actual books from analysis is a bad thing.&lt;br /&gt;&lt;br /&gt; 2. &lt;b&gt;We keep ourselves from seeing all the ways variables can interact with each other&lt;/b&gt;. In some cases, word counts are just another form of subject categorization to go alongside LC subject headings and BISAC codes. Just as I want to look at how often 'Lincoln' is used in books that are published in 1875, I might want to look at how often "Grant" is used in books that mention "Lincoln" a lot. Thinking of word-counts as a different kind of data can blind us to how well it supplements our existing catalog data. 
&lt;br /&gt;&lt;br /&gt;In the opposite direction, downplaying metadata can also keep us from seeing ways catalog data and word count data can interplay with each other--it took me too long, for example, to figure out &lt;a href="http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html"&gt;how to make author age, publication year, and count data interact with each other&lt;/a&gt;; mostly that has to do with dimensionality, but I think it was also because I was stuck seeing the metadata filtering stage as something that needed to be completed before looking at the word count information, rather than seeing the word counts as being just another element in the metadata filtering.&lt;br /&gt;&lt;br /&gt;3. &lt;b&gt;We downplay the importance of creating catalog information. &lt;/b&gt;This is a more minor point, but still gets at something. Wordcount data's usefulness is limited by how much other data is included in the same series. If we treated it as just another form of catalog information when releasing it, it would always be released tied to some kind of unique book identifier. The more additional data we have about books--what language they're in, what percentage of their words are in a foreign language, etc., the more useful they are. But if we treat word counts as important public resources but other catalog fields as something we wait for library institutions to create, we'll have less stuff to work with. Since most people still are on the fence about whether even wordcount information is a useful public resource, though, we're a ways from having to worry about this problem. Still, I think we should all be more excited about the possibilities of creating and sharing other forms of derived catalog data from texts.&lt;br /&gt;&lt;br /&gt;4. &lt;b&gt;We understate the absurdity of the deference which custodians of our digital books give to copyright holders &lt;i&gt;vis-a-vis &lt;/i&gt;word counts. 
&lt;/b&gt;(Forgive me this--I haven't had a good copyright rant in a few months.)&lt;b&gt; &lt;/b&gt;If you think of word counts as another form of catalog information, the idea that they should be protected from the public domain while other sorts of information (the title, for example) &lt;i&gt;can &lt;/i&gt;get out is absurd. If you take a few simple precautions such as not allowing ngrams to traverse sentence breaks (which keeps whole books from being reconstituted), there isn't a good argument against even letting 5-grams out into the open for each book. They're just more information that can be used to find a book you want.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-5181954010505299079?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/5181954010505299079/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/09/is-catalog-information-really-metadata.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5181954010505299079'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5181954010505299079'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/09/is-catalog-information-really-metadata.html' title='Is catalog information really metadata?'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-3520835942620668055</id><published>2011-08-28T18:11:00.001-04:00</published><updated>2011-09-06T15:04:31.780-04:00</updated><title type='text'>Wars, Recessions, and the size of the ngrams corpus</title><content type='html'>Hank wants me to post more, so here's a little problem I'm working on. I think it's a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.&lt;br /&gt;&lt;br /&gt;My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it's not--the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns--although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics &lt;a href="http://www.miller-mccune.com/media/culturomics-an-idea-whose-time-has-come-34742/"&gt;in Miller-McCune&lt;/a&gt;. (By the way, &lt;a href="http://www.miller-mccune.com/magazines/2010-07-01/the-real-science-gap-16191/"&gt;the Miller-McCune article on science PhDs&lt;/a&gt; is my favorite magazine article of the last couple of years). A corollary benefit, though, is that they help us to start understanding better just what &lt;i&gt;is &lt;/i&gt;included in our libraries, both digital and brick.&lt;br /&gt;&lt;br /&gt;Background: right now, I need a list of the most common English words. 
(Basically to build a much larger version of the database I've been working with; making it is teaching me quite a bit of computer science but little history right now). I mean 'most common' expansively: earlier I found that &lt;a href="http://sappingattention.blogspot.com/2010/11/how-many-words-are-there-in-english.html"&gt;about 200,000 words gets pretty much every word worth analyzing&lt;/a&gt;. There were some problems with the list I ended up producing. The obvious one, the one I'm trying to fix, is that words from the early 19th century, when many fewer books were published, will be artificially depressed compared to newer ones.&lt;br /&gt;&lt;br /&gt;But it turns out that a secular increase in words published per year isn't the only effect worth fretting about. The number of words in the Google Books corpus doesn't just increase steadily over time. Looking at the data series on overall growth, one period immediately jumped out at me:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-xJslCv2KfgI/Tlqqct6v4XI/AAAAAAAAC2g/GTcZPsrvWd0/s1600/depression+and+war.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-xJslCv2KfgI/Tlqqct6v4XI/AAAAAAAAC2g/GTcZPsrvWd0/s320/depression+and+war.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;('Counts' is the sum of the most common words, which I'm using as a slightly better proxy for corpus size). The number of words in the Google Books corpus declines substantially in the Great Depression, and then again in the 1940s during the war. I hadn't really thought about it, but this suggests that raw counts will be suppressed or increased by various world-historical events. 
This is getting back to one of the things I reflect on most often—that even the most basic bibliographic data can produce interesting evidence.&lt;br /&gt;&lt;br /&gt;I suspect most historians' reaction to this chart would be: well, of course. Wars and recessions both decrease the amount of resources being spent on books—wouldn't we be surprised if this wasn't the case?&lt;br /&gt;&lt;br /&gt;So let me ask a question, then. Suppose I were to tell you that only one of these patterns was actually real: recessions or wars. Which one would you think it was? I would have been wrong.&lt;br /&gt;&lt;br /&gt;Thinking about it?&lt;br /&gt;&lt;br /&gt;Suppose you said recessions: here's a scatterplot of change in words published over the previous year against change in GDP over the previous year. Farther to the right is the greatest growth in words in the corpus (1946 is 1.3 times as large as 1945), and to the top is change in GDP year-over-year (so 1942 is 1.17 or so times larger than 1941).&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-u-U-LcfOXL8/Tlqttdc0XFI/AAAAAAAAC2o/Rl1Sbm9G8RI/s1600/book+publication+against+GDP+growth.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-u-U-LcfOXL8/Tlqttdc0XFI/AAAAAAAAC2o/Rl1Sbm9G8RI/s1600/book+publication+against+GDP+growth.png" /&gt;&lt;/a&gt;&lt;/div&gt;I would have guessed the GDP effect was real, but it turns out there's no correlation between GDP change, and change in words in libraries, at all. (Actually a slightly negative correlation: -0.11).&lt;br /&gt;&lt;br /&gt;The war effect, on the other hand, I would have thought was caused by the particular circumstance of paper rationing in WWII. I wouldn't think it was a general problem. But as you can see, 1861 is the worst overall year for change in number of words in the Google Books corpus. 
(1923 is the second worst: you should know by now &lt;a href="http://sappingattention.blogspot.com/2011/01/digital-history-and-copyright-black.html"&gt;what that's about&lt;/a&gt;). That suggests a real war effect, and indeed, the two other total wars the United States has been involved in show a similar pattern to WWII.&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-phagQmI7Fhg/TlqwyY1n5II/AAAAAAAAC2w/r4g65SPEfjs/s1600/Civil+War+Books.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-phagQmI7Fhg/TlqwyY1n5II/AAAAAAAAC2w/r4g65SPEfjs/s1600/Civil+War+Books.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-JGpMCnhj4qI/TlqxGps9K9I/AAAAAAAAC20/whH1tGc_7D4/s1600/World+War+I+books.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-JGpMCnhj4qI/TlqxGps9K9I/AAAAAAAAC20/whH1tGc_7D4/s1600/World+War+I+books.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-cc5mmCyE8lU/Tlqwx4KuqlI/AAAAAAAAC2s/ozDJFWcQH14/s1600/World+War+I+books.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Now, these statistics are somewhat misleading. For one thing, I suspect the war effect is swamping the year effect a bit: a lot of the outliers are from the 1940s and the 1860s. Multiple regression might find a strong economic effect. 
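&lt;br /&gt;&lt;br /&gt;The comparison behind the scatterplot above can be sketched in a few lines of Python: turn each series into year-over-year ratios, then take a plain Pearson correlation over the years they share. This is only an illustration of the idea; the function names are mine, and the numbers in the usage lines are invented, not the real word counts or GDP figures.&lt;br /&gt;&lt;br /&gt;
```python
import math


def year_over_year(series):
    """Map each year to value / previous year's value (1.3 means 30% growth).

    `series` is a dict of year -> value; the years are assumed to have no gaps.
    """
    years = sorted(series)
    return {
        later: series[later] / series[earlier]
        for earlier, later in zip(years, years[1:])
    }


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented numbers, for illustration only (not the real series).
words = {1941: 100.0, 1942: 90.0, 1943: 80.0, 1944: 85.0, 1945: 88.0, 1946: 114.4}
gdp = {1941: 100.0, 1942: 117.0, 1943: 130.0, 1944: 140.0, 1945: 138.0, 1946: 122.0}

word_change = year_over_year(words)
gdp_change = year_over_year(gdp)
shared_years = sorted(set(word_change).intersection(gdp_change))

r = pearson(
    [word_change[year] for year in shared_years],
    [gdp_change[year] for year in shared_years],
)
print(round(r, 2))
```
&lt;br /&gt;&lt;br /&gt;A single coefficient like this can't separate a war effect from an economic one, which is why a multiple regression with a wartime indicator would be the natural next step.&lt;br /&gt;&lt;br /&gt;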
In fact, in the 1890s, which had a major recession in 1893 and a minor war in 1898, only the economic travails appear to decrease the number of books:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-9N-7D0c36uo/Tlq0rvAgQHI/AAAAAAAAC28/87XQAu0VuUo/s1600/1890s+books.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-9N-7D0c36uo/Tlq0rvAgQHI/AAAAAAAAC28/87XQAu0VuUo/s1600/1890s+books.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Fdxi88RXwEo/Tlq0fh45oBI/AAAAAAAAC24/cWH67N7J_Vg/s1600/1890s+books.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;So really, I'm not sure what's going on here. My working hypothesis for now would be that the restrictions on international shipping that emerged in the 1860s, 1910s, and 1940s are the main factor suppressing book counts, with the shift to war production probably less important. (In both the World Wars, the decrease in books starts well before the US was drawn into hostilities). But I'd have to throw British GDP figures in here to get a really good idea.&lt;br /&gt;&lt;br /&gt;For my present purposes, I don't think it's a critically important question. But it might be for some other task. For all historians, these facts about library history are important to know; and they're getting considerably easier to answer quickly.&lt;br /&gt;&lt;br /&gt;Among the most frequent concerns about digitization is that it removes the element of serendipity that used to predominate. (If Google makes a new version of ngrams specifically for academics, they should put a big "I feel serendipitous" button next to the search function to preserve the ability to find that unexpected, field-changing piece of evidence.) 
But the stack-browser who relies on serendipity might be subtly pushed towards books from peacetime and prosperity: it's better if he knows about it.&lt;br /&gt;&lt;br /&gt;If you want it, here's the full timeseries, on a log-y scale. (The title should be 1700-2008, obviously).&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-ujbCUnBAbgQ/Tlq8sI8JdPI/AAAAAAAAC3A/IMHMRfSfuHE/s1600/Full+series.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-ujbCUnBAbgQ/Tlq8sI8JdPI/AAAAAAAAC3A/IMHMRfSfuHE/s1600/Full+series.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-3520835942620668055?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/3520835942620668055/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/08/word-counts.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3520835942620668055'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3520835942620668055'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/08/word-counts.html' title='Wars, Recessions, and the size of the ngrams corpus'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-xJslCv2KfgI/Tlqqct6v4XI/AAAAAAAAC2g/GTcZPsrvWd0/s72-c/depression+and+war.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-5003952174785002230</id><published>2011-08-04T17:53:00.001-04:00</published><updated>2011-08-04T17:56:08.461-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data exploration and visualization'/><title type='text'>Graphing and smoothing</title><content type='html'>I mentioned earlier I've been rebuilding my database; I've also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.&lt;br /&gt;&lt;br /&gt;This post is mostly playing with graph formats, as a way to think through a couple issues on my mind and put them to rest. I suspect this will be an uninteresting post for many people, but it's probably going to live on the front page for a little while given my schedule the next few weeks. Sorry, visitors!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;To state the obvious: we're trying to chart change in language over time. Word frequency charts are a pretty obvious thing to chart, at least in English: I've been making them, historical language corpora make them, Google ngrams got enormous numbers of people making them for themselves. 
They can go way back: Gregory Crane presented some charts containing two thousand years of changes in Latin usage at the Digging into Data conference, which must be the record for a single time series.&lt;br /&gt;&lt;br /&gt;These all look pretty much the same, but let me take three examples for now of how this can work before I move on to thinking about how line drawing and smoothing should work.&lt;br /&gt;&lt;br /&gt;First, the default ngrams interface:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-p5Hgq1ztni4/TiCkEZp8EHI/AAAAAAAAC14/XagWtSxvYWo/s1600/chart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="233" src="http://4.bp.blogspot.com/-p5Hgq1ztni4/TiCkEZp8EHI/AAAAAAAAC14/XagWtSxvYWo/s640/chart.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Next, the &lt;a href="http://corpus.byu.edu/coha/"&gt;Corpus of Historical American English&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-5UhU1NBQIXM/TiCkFx5TxjI/AAAAAAAAC18/d3TW5K48VzA/s1600/Screen+shot+2011-07-15+at+4.32.22+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="132" src="http://3.bp.blogspot.com/-5UhU1NBQIXM/TiCkFx5TxjI/AAAAAAAAC18/d3TW5K48VzA/s640/Screen+shot+2011-07-15+at+4.32.22+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&amp;nbsp;Finally, what I've been doing (which &lt;a href="http://sappingattention.blogspot.com/2011/01/digital-history-and-copyright-black.html"&gt;stops in 1922&lt;/a&gt;). 
&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-5db6m_TwRFU/TiCk1Gqi1XI/AAAAAAAAC2A/bZOpddSNdws/s1600/wordcounts%252Bof%252Bevolution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-5db6m_TwRFU/TiCk1Gqi1XI/AAAAAAAAC2A/bZOpddSNdws/s1600/wordcounts%252Bof%252Bevolution.png" /&gt;&lt;/a&gt;&lt;br /&gt;These all have a number of things in common.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;All of them assume that the basic quantity we're interested in is occurrences as a share of all words (although ngrams uses per cent, I use per mille, and COHA uses per million; the last is probably the best).&lt;/li&gt;&lt;li&gt;All apply some sort of smoothing, although the type varies. More on that below.&lt;/li&gt;&lt;li&gt;All have some poor graphic design choices--I've got ugly tick marks, COHA has way too many grid lines and a questionable use of a bar chart for univariate data, Ngrams has a nearly unreadable y-axis thanks to all those zeros. Ngrams is probably the winner here, though.&lt;/li&gt;&lt;/ul&gt;What are some particularly good features that everyone should have?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&amp;nbsp;COHA allows you to drill directly into the source materials by clicking, and generally has a much more interactive interface. This is very useful—probably the most important supplement to the chart you can give—although for COHA it is somewhat limited by the (&lt;a href="http://sappingattention.blogspot.com/2011/01/picking-texts-again.html"&gt;relatively&lt;/a&gt;) small corpus. 
It's not necessary, I don't think, for this to be built directly into the graphical representation, although that would be nice.&lt;/li&gt;&lt;li&gt;Ngrams allows you to change the smoothing on the fly; that's nice.&lt;/li&gt;&lt;li&gt;Ngrams and I allow you to compare multiple words on the same scale: I also can compare &lt;a href="http://sappingattention.blogspot.com/2011/02/graphing-word-trends-inside-genres.html"&gt;subsets of genres&lt;/a&gt; and other arbitrary groupings of books to each other on a given word, which is really important.&lt;/li&gt;&lt;li&gt;This isn't really fair: but I do like how easily my version can include variants of words (evolution, evolutionary, evolutions), while in the public version of ngrams you can't combine even lowercase and capital versions of a word. (There is a more customizable version behind the scenes).&lt;/li&gt;&lt;/ul&gt;The big question that really arises here, though, is smoothing. Ngrams uses changeable windows of moving averages; COHA groups by decade to eliminate year-to-year noise; I use a &lt;a href="http://en.wikipedia.org/wiki/Local_regression"&gt;moving loess average&lt;/a&gt;. What's optimal?&lt;br /&gt;&lt;br /&gt;This is a tricky question. To get the easy answer out of the way first: it's not plotting by decade. Plotting by decade should always be avoided. From a statistical point of view, plotting by decade is exactly the same as taking a 10-year moving average in n-grams (actually halfway between what ngrams calls a four-year and a five-year average, but you get the point), throwing out 9/10 of the data, and making the remaining year tall and blocky. The benefit of throwing out all that data is virtually nil, and it makes underlying patterns harder to see. There's a plunge in the COHA data after the 20s--at first, I thought this might have been related to corpus composition changes related to the 1922 copyright cutoff date. 
If COHA presented information like ngrams, it would have been clearer the drop was later—more like 1926, suggesting the interesting event to look into is the Scopes trial.&lt;br /&gt;&lt;br /&gt;You might argue from a user interface perspective that decades at least block data into compartments we understand. But from a historical perspective, that's just what we should be trying to avoid; decade stereotypes are generally poorly thought through, but popularly held ideas people have about history. (I've always thought the stereotype of the "Roaring Twenties" is a big reason the labor unrest and depression in 1919-1921 is mostly forgotten). Designing an interface that reinforces a belief that national culture realigns once every ten years by the clock, and remains relatively constant inside the bins, is going to be contrary to exploration that helps us to see new things about the past. It has, to the contrary, a serious bias built in towards confirming existing patterns that are relatively spurious.&lt;br /&gt;&lt;br /&gt;What about moving average smoothing? It's better, but still has some issues. It flattens one-year aberrations into plateaus over a number of years, which is acceptable but often undesired behavior. One-year blips usually tell us very little about the general trends that ngrams seem most useful for looking into. Sometimes peaks are random noise, in which case we probably want to include them. But there can also be historical reasons for a peak (like &lt;a href="http://sappingattention.blogspot.com/2010/12/centennials-part-ii.html"&gt;presidential centennials&lt;/a&gt;), and in that case it would be better not to apply it to surrounding years.&lt;br /&gt;&lt;br /&gt;That, roughly, is why I like to use loess smoothing, which gives a very smooth underlying curve; it tends to eliminate noise much more than moving-average smoothing, while catching the underlying trends about the same. 
It's a little messy to include both the yearly variation and the smoothed version, but the aggressive smoothing of loess makes that easier.&lt;br /&gt;&lt;br /&gt;What will this look like? I've been occasionally fooling around with the &lt;a href="http://had.co.nz/ggplot2/"&gt;R package ggplot2&lt;/a&gt;, which has a very well thought-out philosophy of both graphical design and interaction with variables. Medium-term, I'd like to keep that going. Here's a first pass at presenting a year plot for two terms with loess smoothing:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-b9UKOXBpOJQ/Tjr6w8PTqQI/AAAAAAAAC2I/KpcFt7ZWHe4/s1600/Prettier+Evolution+and+Darwin+trends.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-b9UKOXBpOJQ/Tjr6w8PTqQI/AAAAAAAAC2I/KpcFt7ZWHe4/s1600/Prettier+Evolution+and+Darwin+trends.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="" style="clear: both; text-align: left;"&gt;This is generally pretty good, I think, although it's still missing the interactive element I like about COHA. 
ggplot really is quite pretty, and the philosophy of ggplot combined with some good data storage in MySQL makes it easy to put together a split by something other than word, such as LC classification (this is for usage of the word 'evolution' in two fields which are 5-10x above the norm):&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-rHKKkFqm-OM/TjsBS0SlHFI/AAAAAAAAC2Q/dAfEbWFA0wE/s1600/Evolution+trends+in+psychology+and+sociology.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-rHKKkFqm-OM/TjsBS0SlHFI/AAAAAAAAC2Q/dAfEbWFA0wE/s1600/Evolution+trends+in+psychology+and+sociology.png" /&gt;&lt;/a&gt;&lt;/div&gt;There's still some question in my mind about whether loess works better than moving averages: the following example, though, shows why it might. I've superimposed on the raw data for 'evolution' a loess average with span=.25 and what ngrams calls a 3-year moving average. The general curve for each is almost exactly the same; the difference is that loess is considerably smoother, ignoring the local variations a bit more in favor of the surroundings.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-3YFbWcHca8U/TjsFtwBQCsI/AAAAAAAAC2Y/AbPJMY-MF1o/s1600/Loess+compared+to+moving+average.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-3YFbWcHca8U/TjsFtwBQCsI/AAAAAAAAC2Y/AbPJMY-MF1o/s1600/Loess+compared+to+moving+average.png" /&gt;&lt;/a&gt;&lt;/div&gt;If a loess span looks like a moving average but smoother, that seems about perfect. 
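&lt;br /&gt;&lt;br /&gt;To make that comparison concrete, here is a minimal Python sketch on made-up data; none of it is the code or the corpus behind the charts in this post. The helper names (moving_average, loess, largest_step) are hypothetical, the loess is a bare-bones local linear fit with tricube weights rather than R's implementation, and I'm assuming that a smoothing of 3 in ngrams means three years on either side. The point is only the shape: a one-year spike comes out of the moving average as a flat-topped plateau with abrupt edges, and out of the loess as a tapered bump.&lt;br /&gt;
&lt;pre&gt;
```python
# Illustrative sketch only: hypothetical helpers and made-up data,
# not the code or corpus behind the charts in this post.

def moving_average(values, half_width):
    """Centered moving average over up to half_width neighbors on each side."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half_width)
        stop = min(len(values), i + half_width + 1)
        window = values[start:stop]
        smoothed.append(sum(window) / len(window))
    return smoothed


def loess(xs, ys, span):
    """Local linear fit with tricube weights on the nearest span * n points."""
    k = min(len(xs), max(3, round(span * len(xs))))  # neighborhood size
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        ws = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        total = sum(ws)
        mean_x = sum(w * x for w, x in zip(ws, xs)) / total
        mean_y = sum(w * y for w, y in zip(ws, ys)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


def largest_step(series):
    """Largest absolute year-to-year change, a crude roughness measure."""
    return max(abs(b - a) for a, b in zip(series, series[1:]))


# Made-up series: a steady rise with a one-year spike in 1900.
years = list(range(1880, 1921))
rates = [10 + 0.5 * (y - 1880) + (40 if y == 1900 else 0) for y in years]

ma = moving_average(rates, 3)       # 3 years on either side, 7 values in all
lo_fit = loess(years, rates, 0.25)  # same span as the chart discussed above

# Print the spike year and the years just inside and outside the plateau.
for y, raw, m, fit in zip(years, rates, ma, lo_fit):
    if abs(y - 1900) in (0, 3, 4):
        print(y, round(raw, 2), round(m, 2), round(fit, 2))
print("largest step:", round(largest_step(ma), 2), round(largest_step(lo_fit), 2))
```
&lt;/pre&gt;
The largest year-to-year step is only a crude stand-in for what looks smoother on a chart, and real word counts are far noisier than this toy series, so treat the output as a picture of the two shapes rather than as evidence for either method.&lt;br /&gt;&lt;br /&gt;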
There are some problems with loess—every once in a while, it takes momentum too seriously and predicts a negative rate of occurrences, for instance—but in general I like the aesthetics of it better than moving averages.&lt;br /&gt;&lt;br /&gt;That's all just to take me to where I am now. There are two much more interesting questions: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is occurrences per million the best metric to use? Some of the jumpiness in the data is the result of just a few books, I suspect. A metric like 'percentage of books containing "evolution"' is smoother, and in many ways better. There are other possible metrics, too—we could use a median, for instance (how often does the &lt;i&gt;average&lt;/i&gt; book use evolution), or something more complicated (I have a hunch that one of the parameters on some type of distribution—beta?—modeled for each year makes more sense than occurrences per million).&lt;/li&gt;&lt;li&gt;We can use loess or a moving average to predict the value for any given year. But how can we characterize the &lt;i&gt;error&lt;/i&gt; around that value? It's possible to get a 95% confidence interval for where the loess should be, as shown below, but that's not necessarily the value we're interested in.&lt;/li&gt;&lt;/ul&gt;These two questions are actually related—the type of error bar we would want to see, and our chosen metric for change in usage over time, both fundamentally depend on why we think a particular metric is good at explaining change in language over time. (Which is connected to what 'change' we think is significant, and what we want to filter out). 
I have a couple ideas to work out on these, but I'm going to post this before I box myself in.&lt;br /&gt;&lt;br /&gt;(Last picture: example error bars) &lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-o7SRsROJ1nY/TjsN9sgi4UI/AAAAAAAAC2c/dveEBVFgTXc/s1600/Evolution+and+Darwin+with+error+bars.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-o7SRsROJ1nY/TjsN9sgi4UI/AAAAAAAAC2c/dveEBVFgTXc/s1600/Evolution+and+Darwin+with+error+bars.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-5003952174785002230?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/5003952174785002230/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/08/graphing-and-smoothing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5003952174785002230'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5003952174785002230'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/08/graphing-and-smoothing.html' title='Graphing and smoothing'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/-p5Hgq1ztni4/TiCkEZp8EHI/AAAAAAAAC14/XagWtSxvYWo/s72-c/chart.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8830199635078417119</id><published>2011-07-15T15:26:00.003-04:00</published><updated>2011-07-17T09:45:51.786-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ngrams'/><category scheme='http://www.blogger.com/atom/ns#' term='This Blog'/><title type='text'>Moving</title><content type='html'>&lt;div style="font-family: inherit;"&gt;&lt;span id="internal-source-marker_0.28888600184765456" style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;Starting this month, I’m moving from New Jersey to do a fellowship at the &lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://www.culturomics.org/cultural-observatory-at-harvard"&gt;&lt;span style="background-color: transparent; color: #000099; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"&gt;Harvard Cultural Observatory&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;.  This should be a very interesting place to spend the next year, and I’m  very grateful to JB Michel and Erez Lieberman Aiden for the opportunity  to work on an ongoing and obviously ambitious digital humanities project. 
A few thoughts on the shift from Princeton to Cambridge:&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;1.  Although I’ve nurtured some curmudgeonly pride about running my text  analysis so far on a laptop, I’m excited to have access to a bit more  computing power. In addition to relocating to the  Cambridge/Somerville metro area from Princeton, one of the reasons I’ve  been largely silent on the blog the last couple months is that I’ve  been redoing the &lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;&lt;span style="background-color: transparent; color: #000099; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"&gt;back end of my database system &lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;to  work in Python with larger and more flexible parts of the Open  Library/Internet Archive text collection. Hopefully I’ll get that  running somewhere at Harvard soon, which should provide a good platform  to think through how I think we should be applying somewhat larger  amounts of computing power to digital libraries. (I remain a little  skeptical that  ‘non-consumptive’ reading can be truly effective at the coming Hathi or Google  research centers, and this should give me some concrete examples.) 
I’ve  been saying for a while that the choices we make in the next few years  might shape our scholarly infrastructure for considerably longer. So it's very interesting to be at a place where other people are thinking about what we  need, and to get to see some of these copyright issues closer up.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;2.  I offer myself up as more evidence that it can be a good idea for grad students to blog, and  in their own names; JB and Erez would never have connected with me if I  hadn’t been throwing stuff onto the Internet. &lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;3. 
Getting ready to start this position, I read with interest the &lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://www.nature.com/news/2011/110617/full/474436a.html"&gt;&lt;span style="background-color: transparent; color: #000099; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"&gt;profile of Erez Lieberman Aiden in Nature&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;, the push-back that it generated from Tim Hitchcock on what he sees as &lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://historyonics.blogspot.com/2011/06/culturomics-big-data-code-breakers-and.html"&gt;&lt;span style="background-color: transparent; color: #000099; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"&gt;fundamental flaws in the culturomics project&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;, and JB and Erez’s responses in the comments on that post. 
I can see the sources of Hitchcock’s frustration from the &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;Nature &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;profile—as a piece of writing, it seems to work from the presumption that &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;Nature’s &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;readers  will only be interested in an article about humanities research if it  departs from the position that the humanities are currently a bit of a  backwater. It does this by casting people like Dan Cohen and Tony Grafton as conservatives preventing new methods from entering the humanities, when in fact, from different positions, they're doing the most to advance the discussion. 
(Although I have to admit, after a long time at Princeton I  kind of enjoyed the idea that the readers of &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;Nature&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;  will know nothing about Tony Grafton except that he “uses a giant,  geared wooden reading wheel to help him manage his oversized,  Renaissance texts.” And speaking of authorial intent, I’m starting to  convince myself that Eric Hand was going for some kind of subtle &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;Ulysses &lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;homage by starting the article with a blessing on a rooftop and writing the profile as a day-in-the-life piece.)&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;But  while the culturomists’ greatest enthusiasm certainly does lie in  bringing more scientific tools and methodologies into humanistic  content, they have been talking to humanists as well as  scientists about what the most exciting research to come will be and what directions might 
need to be explored. One of  the things I think is most exciting about working with this project,  aside from proximity to the Widener and Google books, is that unlike the  many DH projects that pull in some programmers or scientists to help in humanities  divisions, it instead has to pull humanists like me into an  engineering school to get the collaboration flowing. Reaching out in that direction gives it an  entirely different set of strengths and weaknesses than more traditional (such as it is) DH centers, and to some degree I don't think we entirely know what those are yet. Having some projects like that should be good  for the diversity of work within the digital humanities field as a  whole.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;&amp;nbsp;&lt;/span&gt; &lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;And a random thought: I’d  add that Hitchcock’s comparative praise for the hypothesis testing  of cliometricians, as opposed to the big data pattern-seeking of the  culturomists, reminded me a bit of the recent Norvig/Chomsky debate  about different statistical paradigms of science. 
&lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://norvig.com/chomsky.html"&gt;&lt;span style="background-color: transparent; color: #000099; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"&gt;Norvig’s essay&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;  is pretty interesting and got a lot of circulation on the periphery of circles I follow, so it might be fun  reading if you missed it the first time around. I think it’s worth  remembering that some of what can seem like epistemic difficulties of the  sciences vs. the humanities can actually reflect very current debates  about what passes as evidence or theories within the sciences  themselves. &lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;4. I haven't been blogging as much lately, partly because of moving pains, partly because moving here somewhat challenges my ratio of things I can't talk to people about in person. 
We'll see how that develops—I have a few posts in the works, at least.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt;5. Finally: if anybody&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt; out there&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-size: small; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"&gt; living in/visiting Boston wants to get coffee sometime, drop me a line. I'd love to talk to some humanists around here, too. 
&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8830199635078417119?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8830199635078417119/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/07/moving.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8830199635078417119'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8830199635078417119'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/07/moving.html' title='Moving'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-3609527521823647748</id><published>2011-06-16T15:30:00.001-04:00</published><updated>2011-06-16T15:30:55.143-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><title type='text'>What's new?</title><content type='html'>Let me get back into the blogging swing with a (too long—this is why I can't handle Twitter, folks) reflection on an offhand comment. 
Don't worry, there's some data stuff in the pipe, maybe including some long-delayed playing with topic models.&lt;br /&gt;&lt;br /&gt;Even at the NEH's Digging into Data conference last weekend, one commenter brought out one of the standard criticisms of digital work—that it doesn't tell us anything we didn't know before. The context was some of Gregory Crane's work in describing shifting word use patterns in Latin over very long time spans (2000 years) at the &lt;a href="http://www.perseus.tufts.edu/"&gt;Perseus Project&lt;/a&gt;: Cynthia Damon, from Penn, worried that "being able to represent this as a graph instead by traditional reading is not necessarily a major gain." That is to say, we already know this; having a chart restate the things any classicist could tell you is less than useful. I might have written down the quote wrong; it doesn't really matter, because this is a pretty standard response from humanists to computational work, and Damon didn't press the point as forcefully as others do. Outside the friendly confines of the digital humanities community, we have to deal with it all the time.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Now, there are a bunch of responses to this question on the level of pure research. Just a few in passing: Knowledge of overall trends from arbitrary sampling can suffer from major confirmation biases; by definition, only the rarest research is truly groundbreaking, and confirmatory research is important and underprivileged across all fields in the academy; there's a difference between knowing the existence of a trend and knowing the magnitude and contours of that trend. It's easy to go on.&lt;br /&gt;&lt;br /&gt;But this time, the question that jumps out for me is Tonto's: What do you mean "we," kemosabe? Just who is it that already knows about these trends? The obvious answer, presumably, is that it's some academic field or subfield. 
Our expert speaks from authority to say that the research doesn't contribute to their fields. (Note that the statement can be exclusionary: an implication is that if 'you' the researcher find this discovery interesting, you must not &lt;i&gt;really &lt;/i&gt;be in the field, even if you're a professor in it.) But though the field is important, it's more complicated. I've read a lot of pieces in the ever-lively crisis-of-the-humanities/defense-of-the-humanities genre, and pretty much all of them would agree that "we" also means the culture as a whole: scholars know it in their capacity as the keepers of the flame of knowledge. And for that we, different types of knowledge reshaping do actually contribute to what 'we' know.&lt;br /&gt;&lt;br /&gt;This struck me in Damon's commentary because she mentioned elsewhere that she was working on a translation of Tacitus. I'm outside the field, obviously, but I still feel pretty confident in saying that putting Tacitus into modern English contributes &lt;i&gt;very &lt;/i&gt;little to the body of scholarly knowledge. Jack Gladney notwithstanding, scholars speak the language of their field. If we think that kind of work broadens knowledge, it's because it makes Tacitus available to the much larger group of people who can't read Latin. If translations are a worthy activity for senior scholars, why aren't data representations?&lt;br /&gt;&lt;br /&gt;I can think of a couple potentially concerning reasons that humanists don't think this work increases what 'we' know. The first is that while humanists care about non-Latin readers knowing things about Tacitus, we don't care about people who are more persuaded by quantitative data than by anecdotal impressions. Requiring numbers for proof is naive empiricism, blind to the complexities of human experience, etc. 
While there's certainly an undercurrent of this thinking, I don't think it's insuperable or always present in these critiques: at least in history, I've long been struck by how often maps and stats get used in lecture courses by faculty who would never use them in their published work.&lt;br /&gt;&lt;br /&gt;Moreover, plenty of humanists themselves are interested in this type of knowledge--that's what's driving much of the interest in reaching out to new methods by humanists today now that the data is available. Thus even within the field, there's been an undercurrent of people who don't find our conclusions from traditional reading completely persuasive: I, for one, love to see more solid evidence on a few canards of historical interpretation. (The Culturomics keynote, for example, has &lt;a href="http://ngrams.googlelabs.com/graph?content=The+United+States+are%2CThe+United+States+is&amp;amp;year_start=1800&amp;amp;year_end=1920&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;this slide&lt;/a&gt;, which helps answer some &lt;a href="http://languagelog.ldc.upenn.edu/nll/?p=1794"&gt;live questions&lt;/a&gt; about things many historians claim to simply 'know' about the transition of "the United States" from a plural to a singular subject around the Civil War). &lt;br /&gt;&lt;br /&gt;But if we do accept that new representations help persuade different groups of people, including some who aren't obviously outside the scholarly field, why don't they expand what 'we' know in a real way? I think it has to do with what one of the Digging into Data speakers (can't remember who… ) talked about as the privileging of method over questions in the humanities. Learning that language changed by looking at a chart isn't real knowledge, the argument would go, not like knowledge gained by reading lots of books. Even if I read a result off a google ngram, the only way to confirm its truth is to ask someone who's actually read all the books. 
It's easy to make a mistake off a graph, so the only real knowledge is rooted in reading techniques anyway. Humanists would fail in their obligations to students if they let them reach conclusions through charts rather than through extended reading.&lt;br /&gt;&lt;br /&gt;Now, a methodological fight may be coming, and it might be fun. I think a lot of participants would like this to be a purely epistemological issue. JB Michel briefly mentioned Viennese logical positivism in the keynote while suggesting that &lt;i&gt;now &lt;/i&gt;we can speak quantitatively about culture, although way back when it wasn't possible. Many more humanists, I suspect, would think that we still can't—that humanistic questions are by definition not tractable to purely quantitative analysis.&lt;br /&gt;&lt;br /&gt;So far as possible, I want to sidestep those issues to point out some more pragmatic problems with defending the scholarly status quo. Although for many humanists defending reading seems like a warmly resounding defense of human practices in texts, for students and outsiders it can seem much closer to an unquestionable assertion of authority. To assert a trend on the basis of experience that is not open to critical interrogation tends to enormously tighten the circle of 'we' for whom humanities knowledge is accessible. That's a mistake; trust in authority is not a core humanistic value. Neither is dismissing the relevance of particular types of learning. By making it easier to draw conclusions about the past, quantification allows us to enormously broaden the circle of people who can know things about the past.&lt;br /&gt;&lt;br /&gt;I say 'know' a bit incautiously, but I do think there is an enormous difference between someone told by an authority figure that language use changed, and someone given a graph to figure it out. Even on a printed page, a chart invites an engaged reading in a way that a simple pronouncement does not. 
It's hard to overstate how important that is. Now, a chart doesn't need to be quantitative--I actually think dynamically generated concordances might allow people to reach conclusions in the same way without statistics, although with a bit more effort and a bit more computing power. But if it changes the &lt;i&gt;number &lt;/i&gt;of people to whom basic knowledge about culture is available, even if it doesn't change the type of knowledge, that serves the purpose of the humanities better than anything. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;~~~~~&lt;br /&gt;A concluding parable: Imagine a world with no maps. Most people know only their immediate neighborhoods, but a few social misfits spent their twenties driving the long hauls between cities, sacrificing fame, fortune, and family to learn the lay of the land. If you want to get from New York to midcoast Maine, they tell you about how to take the Hutchinson to the Merritt, about the I-84 turnoff from 91 just before Hartford, about the merits of taking the coastal route north from Portland or following the interstate to Augusta. They tell you that they recall a friend who wrote a book mentioning a cutoff from Route 1 that saves a few miles by skipping Rockland—Maine route 90, route 95, something like that—that you might want to look into further. This is a useful service. Some people make careers out of it.&lt;br /&gt;&lt;br /&gt;If someone walks into this world with a stack of Hagstrom atlases, what happens? Those people go on about how those maps can't capture the rush hour traffic in Hartford or the backups near Wiscasset on summer weekends, about how you'd never know from a map to buy your gas in Massachusetts and your alcohol in New Hampshire. They say they've already driven all these roads; the maps don't tell us anything that we don't know already. Real knowledge of the terrain can only be gained by &lt;i&gt;driving&lt;/i&gt; it.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;In a way, they'd be right. 
The routes that a mapreader gets may be more interesting at times, but they will also be shallower. If they rely on a completely algorithmic solution (Google Maps!) they will frequently get terrible results. But there's no surer way to avoid more people learning the landscape than making it as inaccessible as possible. It might validate the choices and expertise of a few, but it certainly does far less for the knowledge of the land itself than opening it up to new audiences.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-3609527521823647748?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/3609527521823647748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/06/whats-new.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3609527521823647748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3609527521823647748'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/06/whats-new.html' title='What&apos;s new?'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-4524674649471234136</id><published>2011-05-10T18:04:00.001-04:00</published><updated>2011-05-10T18:04:54.853-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='Changes in language over time'/><category scheme='http://www.blogger.com/atom/ns#' term='authors'/><category scheme='http://www.blogger.com/atom/ns#' term='pca'/><title type='text'>Predicting publication year and generational language shift</title><content type='html'>Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn't happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like "outside" more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. &lt;a href="http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html"&gt;The original post&lt;/a&gt; has a more detailed explanation.&lt;br /&gt;&lt;br /&gt;Will had some good questions in the comments about how different words fit these patterns. Looking at different types of words should help find some more ways that this sort of investigation is interesting, and show how different sorts of language vary. But to look at other sorts of words, I should be a little clearer about the kind of words I chose the first time through. If I can describe the usage pattern for a "word like 'outside'," just what kind of words are like 'outside'? Can we generalize the trend that they demonstrate?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Most of the words I chose last time came out of the results of a &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;principal components analysis I did earlier&lt;/a&gt; to see how vocabulary usage changed over time. That result gave me back a particular kind of word—those that change slowly but steadily over time. 
The &lt;a href="http://ngrams.googlelabs.com/graph?content=outside&amp;amp;year_start=1830&amp;amp;year_end=1922&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;Google ngram for 'outside'&lt;/a&gt; shows a steady rise: it also finds &lt;a href="http://ngrams.googlelabs.com/graph?content=prudent%2Cacknowledged%2Cnotwithstanding%2Cdisposed%2Crendered%2Ccommencement%2Ccircumstance%2Cmode&amp;amp;year_start=1830&amp;amp;year_end=1922&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;several words&lt;/a&gt; with a steady decline. (Reminder: I'm not using the ngrams data for &lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;my own analysis&lt;/a&gt;, but it serves as a nice check to use a different set to confirm trends I find where that dataset is usable).&lt;br /&gt;&lt;br /&gt;That's to say: these are distinctive words I use. They represent a very particular sort of change in language; it happens slowly, creepingly, and probably without much notice by historical actors. It even happens, I think, without much notice by historians: we tend to care about words that are contentious, that represent major changes, and as a result they tend to have spikes and plateaus. 
Some historical keywords do have some &lt;a href="http://ngrams.googlelabs.com/graph?content=evolution%2C+liberty&amp;amp;year_start=1830&amp;amp;year_end=1922&amp;amp;corpus=5&amp;amp;smoothing=3"&gt;nice overall trend&lt;/a&gt;; but &lt;a href="http://ngrams.googlelabs.com/graph?content=pragmatism%2Cwhigs%2CJoseph+Smith&amp;amp;year_start=1830&amp;amp;year_end=1922&amp;amp;corpus=5&amp;amp;smoothing=2"&gt;plenty of others&lt;/a&gt; tremble up, down, and steady.&lt;br /&gt;&lt;br /&gt;Our standard ways of writing about causation in the historical profession deal worst, in some ways, with decades-long steady change: there will not be any single useful archival source to write about the long creep upwards of a word like outside, while there may be single people or institutions that themselves propel "Joseph Smith" or "Pragmatism" into common usage. (I think English and Comp Lit may actually have a better vocabulary for describing these long shifts not necessarily tied to individual people; Ted Underwood is doing some &lt;a href="http://tedunderwood.wordpress.com/2011/05/06/the-history-of-an-association-part-two/"&gt;neat stuff looking at century-long patterns in language use&lt;/a&gt; on his blog, for example.) In some ways, that's just what's exciting about using computers to help read books: these are shifts in language over the &lt;i&gt;longue durée, &lt;/i&gt;and they open up new prospects for being quite precise (perhaps misleadingly so) in describing changes that previously might have been murky or unknown. More on that later, hopefully. Also, I should deal more concretely with many of the other words we care about.&lt;br /&gt;&lt;br /&gt;But first, let me take advantage of steady directional drift in language to think about how it helps us in identifying when books were written. I was talking to Jamie recently about dating anonymous Latin hagiographies from the 6th-8th centuries based on their word usage. 
This is something that we might be able to do fairly well with computers. (I'm sure there's a bit of research out there that I haven't read, actually). I could probably make some headway with the open library database. A full non-linear regression on publication year using all 200,000 words I have would be a major project, and I don't see an immediate payoff for it. But one of the nice things about the pca analysis I did is that it breaks 10,000 words down to five or six dimensions that I know correlate to publication year. The data is already selected to be linear, so normal linear regression should work too.&lt;br /&gt;&lt;br /&gt;That means I can build up a simple linear model that, given the vocabulary usage of a book, makes a guess as to when it was published. To help it out, since I know &lt;a href="http://sappingattention.blogspot.com/2011/02/genres-in-motion.html"&gt;some genres lag behind the times and others ahead (second chart in linked page)&lt;/a&gt;, I give it catalog number information as a variable to work with as well. The result accurately places somewhat more than half the books within a decade of their actual publication date. (R-squared = .56). I could certainly build a better model, but don't need one right now: this is good enough for the particular sort of slow, steady change that I've been looking at, and all the error should be smoothed out in the end anyway. &lt;br /&gt;&lt;br /&gt;I'll forgo too much exploration of that model, since it's not my main focus here. (It does include some neat features, like the ability to find the most futuristic books in any given genre). But let me make a couple quick, eminently skippable notes with scatter plots to explore the error here. First, the overall error. Remember, I'm using publication year, which includes reprints, because I don't trust the metadata very much on first publication year. 
The model almost never mispredicts by more than 50 years.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-yYNcYOCosbA/Tcl5snNBdRI/AAAAAAAACxk/4RrM7aqP8TE/s1600/full+error+plot.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-yYNcYOCosbA/Tcl5snNBdRI/AAAAAAAACxk/4RrM7aqP8TE/s1600/full+error+plot.png" /&gt;&lt;/a&gt; &lt;/div&gt;&lt;div style="text-align: left;"&gt;Basically all this chart shows is that most books are written  by people under 100 years old, but I have a few very old authors. The  streak around 350 years old is Shakespeare, whose books are consistently  predicted as published in the 1880s or so. That's somewhat  interesting—it means that a sample based on vocabulary from 1830 to 1922  is terrible at extrapolating out much past those boundaries.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Let's zoom in to the section of the plot with realistically aged authors and fewer than fifty years' worth of error. &lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-vF25lYnXKNU/Tcl5sAINO-I/AAAAAAAACxg/G2aATbbDfxs/s1600/relevant+error+plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-vF25lYnXKNU/Tcl5sAINO-I/AAAAAAAACxg/G2aATbbDfxs/s1600/relevant+error+plot.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;We can see that most of the error is not  due to author age. It takes a computer to really notice the slight  downward slant in the scatter (R = –0.21). That's telling us that author age doesn't explain the bulk of missing data, but that it might help some. 
A better model might explain some of the other variation.&lt;br /&gt;&lt;br /&gt;This model, though, is good enough to explore the variations in some of my data. Last post, I looked at how percentage usage of words changed over time: now I can look at how &lt;i&gt;predicted year &lt;/i&gt;changes over time:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-hPUbVWz2bGA/TclOHdsJWwI/AAAAAAAACxc/gorbkqowgHM/s1600/predicted+year+of+publication+by+age.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-hPUbVWz2bGA/TclOHdsJWwI/AAAAAAAACxc/gorbkqowgHM/s1600/predicted+year+of+publication+by+age.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;For each combination of year and author age, the colors show what year my model thinks the books in that position were written. Let's take books written in 1880 as an example. On average, it thinks books by 30-year-olds written in 1880 were written around 1897; books by 50-year-olds look to it like they were written in 1891; and books by 70-year-olds look like they were written in 1875. To go back to the language of my last post, an Angstrom theory of language would have 0 years of change for 40 years of age difference, and a Bascombe theory would have 40 years of change; this basically splits the difference with 22 years of predicted difference in the language of different aged authors. You can see that more generally in the slope of the lines towards the northeast; they aren't horizontal (Angstrom) but neither do they match up with the diagonal aging lines (Bascombe).&lt;br /&gt;&lt;br /&gt;So that means that of language change, half is because of generational displacement, and half because of  individuals learning new words as they get older. (The extra two years towards Bascombe  (22 instead of 20), let's just pretend, are because of the relatively  few republished books in the sample). 
That sounds about right to me. I felt like I might have somewhat oversold the aging component in my previous post, because I think it's more important to notice that it's there. To say it's about half of all steady change in language (which, I should note, is not all change in language) means that it's pretty important.&lt;br /&gt;&lt;br /&gt;That, I think, more firmly establishes how important generational displacement is in driving linguistic change. I think there are still some productive questions to be asked about these charts, though, that get beyond that basic point. Here are some, from most to least technical.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;How can we describe whether a word is type-Angstrom or type-Bascombe without having to eyeball these heat maps? With the results from the linear model, it's easy to just look at slope. But most of the heat maps are a lot more splotchy--they don't simply run in a straight line. I think I could kludge together an approach that works off of the slope of contour lines; that might be good enough.&lt;/li&gt;&lt;li&gt;The Bascombe-Angstrom model I described only includes contour lines running in two of the four cardinal directions. I'm pretty confident that most words should fall roughly between those two models, but there are other possibilities nonetheless. I talked a little last time about words that remain constant in usage across time, but vary in usage by different age groups; Willy asked in the comments about the fourth kind, where changes begin among older people and then filter down into the younger generations. I suspect there are very few words like that (although 'evolution', actually, is close). Am I right? What are those words, if they do exist?&lt;/li&gt;&lt;li&gt;What other aggregations tell us things about whether generations or intrasubjective shifts drive linguistic changes? 
I'd like to bring some of my genre data into this mess, perhaps, to see if young people drive shifts towards new disciplines or if it generally requires oldsters to lead the way. Or keeping it at the linguistic level, I wonder how shifts in the prevalence of topic-model generated topics over time correspond to changing ages.&lt;/li&gt;&lt;li&gt;To what degree are generalizable claims I can make about the role of generational displacement in linguistic change universal patterns, and to what degree are they an artifact of the particular historical period I'm looking at? I've been mulling over &lt;a href="http://www.newyorkerstore.com/april-25-2011/when-i-was-your-age-things-were-exactly-the-way-they-are-now/invt/136958/"&gt;this New Yorker cartoon&lt;/a&gt;. Let's say it's right, and at one point there was no linguistic change. Once it started, did generational displacement immediately start driving half of it? Or was it across all generations at first? This would take a lot of data I don't have (and that might not exist in sufficient numbers before the 18th century or so, and that's locked inside copyright for the post-1922 period). But there are some neat ways I can think of to ask the question. &lt;/li&gt;&lt;li&gt;Finally—so what? What's the context for shifts in language we haven't noticed before, and how important is the difference between generational change and intrasubjective change? I think it says a lot, but I haven't quite articulated what sort of explanatory theoretical framework I think these descriptive changes are evidence for. 
That's worth doing.&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-4524674649471234136?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/4524674649471234136/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/05/predicting-publication-year-and.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4524674649471234136'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/4524674649471234136'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/05/predicting-publication-year-and.html' title='Predicting publication year and generational language shift'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-yYNcYOCosbA/Tcl5snNBdRI/AAAAAAAACxk/4RrM7aqP8TE/s72-c/full+error+plot.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-5172410556526235996</id><published>2011-04-18T18:19:00.002-04:00</published><updated>2011-04-19T21:29:38.919-04:00</updated><title type='text'>The 1940 election</title><content type='html'>&lt;span style="font-size: small;"&gt;A couple weeks ago, &lt;a href="http://sappingattention.blogspot.com/2011/04/stopwords-to-wise.html"&gt;I 
wrote&lt;/a&gt; about how &lt;a href="http://ancestry.com/"&gt;ancestry.com&lt;/a&gt; structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the &lt;a href="http://www.pop.umn.edu/"&gt;Minnesota population center&lt;/a&gt;&lt;/span&gt;&lt;span style="font-size: 12pt;"&gt;&lt;span style="font-size: small;"&gt; on just that topic:&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;We have an extraordinary opportunity to  partner with a leading genealogical firm to produce a microdata  collection that will encompass the entire 1940 census of population of  over 130 million cases. It is not feasible to digitize every variable  that was collected in the 1940 census. We are therefore seeking your  help to prioritize variables for inclusion in the 1940 census database.&lt;/blockquote&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;I'd assume that partner is &lt;a href="http://ancestry.com/"&gt;ancestry.com&lt;/a&gt; itself, but maybe there are other 'leading genealogical firms' out there that type in every census entry. This isn't really my beat.&lt;br /&gt;&lt;br /&gt;Given things I've said earlier about historians needing to be more involved in databases, though, let me link to &lt;a href="https://umsurvey.umn.edu/index.php?sid=92649&amp;amp;lang=um"&gt;their survey&lt;/a&gt;. If you think you might do research about the 30s/40s, you should fill it out. (Today is the last day--sorry for the short notice, and I assume they don't mind the link being circulated off their list-serv.) The 1940 census is full of employment information on the tail end of the great depression, captures a population at a unique moment in mobility, etc. Could be a tremendously valuable resource. I won't tell you what to vote for, of course. 
But if this leads to a fully downloadable set of the 1940 census with restrictions on commercial use but not much else, that would be remarkable.&lt;br /&gt;&lt;br /&gt;On the other hand, I do wonder a bit whether this isn't another example of piecemeal private-public collaboration on digitization that might keep us from realizing the full potential of this data. Wendell Willkie's revenge, you might say. But the people at IPUMS do good work (their microdata samples from 1930 and before are, I think, an underused resource by US historians who prefer to take statistics from secondary sources instead) so it's safe to say that this is what we're going to get (at least until someone perfects 19th-century handwriting OCR), it will be very useful, and it's great that they're making an effort to reach out to users.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-5172410556526235996?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/5172410556526235996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/04/1940-election.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5172410556526235996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/5172410556526235996'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/04/1940-election.html' title='The 1940 election'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-2835479119488990310</id><published>2011-04-13T12:17:00.002-04:00</published><updated>2011-04-13T14:03:21.014-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Online Databases'/><category scheme='http://www.blogger.com/atom/ns#' term='Metadata'/><category scheme='http://www.blogger.com/atom/ns#' term='Ngrams'/><category scheme='http://www.blogger.com/atom/ns#' term='HathiTrust'/><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Library'/><title type='text'>In search of the great white whale</title><content type='html'>All the cool kids are talking about shortcomings in digitized text databases. I don't have anything so detailed to say as what &lt;a href="http://goosecommerce.wordpress.com/2011/04/09/ex-readex-not-much/"&gt;Goose Commerce&lt;/a&gt; or &lt;a href="http://cliotropic.org/blog/2011/03/proquest-historical-serials-caveat-lector/"&gt;Shane Landrum&lt;/a&gt; have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. 
Here's an example.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-YDbvzXD5hvo/TaUZJv7t4jI/AAAAAAAACw8/QsOUmXEdQOM/s1600/md_213.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="127" src="http://1.bp.blogspot.com/-YDbvzXD5hvo/TaUZJv7t4jI/AAAAAAAACw8/QsOUmXEdQOM/s400/md_213.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;Thanks to a question from Hank in the comments on a previous post, I went looking in &lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;my database&lt;/a&gt; for books by Herman Melville. I noticed that most of the ones I have are published in the 20th century. That's not surprising, given Melville's obscurity in his lifetime. Still, I'd like to have the books for any study of 19th century culture. The lack of first editions has bothered me before with Mark Twain; I actually changed my publisher list last time I remade the database so it would catch some of his works with obscure presses. But even though Melville usually published with Harper, Open Library only has a few Melville texts published in his lifetime that meet my metadata criteria: two 1847 &lt;i&gt;Typees&lt;/i&gt;, one 1849 &lt;i&gt;Mardi&lt;/i&gt;, one 1856 &lt;i&gt;Piazza Tales&lt;/i&gt;: that's it. No &lt;i&gt;&lt;a href="http://openlibrary.org/works/OL102749W/Moby-Dick"&gt;Moby-Dick&lt;/a&gt;, &lt;/i&gt;no &lt;i&gt;&lt;a href="http://openlibrary.org/works/OL102746W/Billy_Budd"&gt;Billy Budd&lt;/a&gt;, &lt;/i&gt;and only a microform copy of the &lt;a href="http://openlibrary.org/books/OL7185766M/The_confidence-man"&gt;&lt;i&gt;Confidence Man &lt;/i&gt;&lt;/a&gt;without a library call number. 
HathiTrust's &lt;a href="http://catalog.hathitrust.org/Search/Home?checkspelling=true&amp;amp;type=all&amp;amp;lookfor=Moby+Dick&amp;amp;submit=&amp;amp;type=all&amp;amp;sethtftonly=true&amp;amp;sort=yearup"&gt;earliest copy of Moby Dick&lt;/a&gt;, just like OL's, is from 1892. Google Books' interface is less well &lt;a href="http://www.frbr.org/"&gt;FRBRized&lt;/a&gt;, but &lt;a href="http://www.google.com/search?q=Moby+Dick&amp;amp;hl=en&amp;amp;client=firefox-a&amp;amp;rls=org.mozilla%3Aen-US%3Aofficial&amp;amp;tbas=0&amp;amp;prmd=ivnsb&amp;amp;sa=X&amp;amp;ei=Ab6kTbzaOJG-0QHDgJXqCA&amp;amp;ved=0CBkQpwUoBA&amp;amp;source=lnt&amp;amp;tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F1851%2Ccd_max%3A12%2F31%2F1852&amp;amp;tbm=bks"&gt;a search for the first edition of Moby Dick&lt;/a&gt; shows mostly the bad initial reviews, along with a number of empty bibliographic records Google doesn't seem to realize refer to the same edition. The list that search returns for me starts with one that declares &lt;i&gt;Moby-Dick&lt;/i&gt; a joint project among Melville, Mark Twain, and Mortimer Adler. (I assume Adler wrote all the pedantic bits about whale biology, and Twain cashed his check as soon as he wrote the bit about knocking people's hats off in the street.)&lt;br /&gt;&lt;br /&gt;Why no Moby Dick in the libraries? Here's my guess. The Google book digitization project is the only source for Google Books, and the biggest for Open Library and Hathi. We've got different sources, but they're all using the same library books from the same scanning sessions. Think about how the Google scanning project worked. They set up cameras in various library collections and started scanning books shelf-by-shelf, I believe, which is a sensible way to get a bunch of digital texts in a hurry. But there's a catch: no university library would possibly still have a first edition of &lt;i&gt;Moby-Dick &lt;/i&gt;on its shelves in the 2000s. Any good library would have moved it to rare books long ago. 
If they didn't, some enterprising undergraduate would have snatched it up to pay &lt;a href="http://www.abebooks.com/servlet/BookDetailsPL?bi=457533586"&gt;a year or two's tuition&lt;/a&gt;. It's a cultural artifact, a prince among books: it's too important to leave among the plebes in the stacks. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;So &lt;i&gt;just because &lt;/i&gt;the first edition of &lt;i&gt;Moby-Dick &lt;/i&gt;is such a cultural touchstone, &lt;i&gt;just because &lt;/i&gt;we want to preserve it so much, it wasn't among the first 10 million or so volumes we put in our most important digital libraries&lt;/b&gt;. Perhaps some collection did their own scan of &lt;i&gt;Moby-Dick&lt;/i&gt; in the early days of digitization. I'd be surprised if not. But if so, it isn't easy to find: Yale seems to have &lt;a href="http://130.132.81.94/dl_crosscollex/SearchExecXC.asp?srchtype=CNO"&gt;given up after two pages&lt;/a&gt; on the copy in Beinecke, and that's the only thing I'm turning up on Google. Any academic-led scanning project might well have started with this book; but the quantity over quality approach that Google Books has used means we &lt;i&gt;still &lt;/i&gt;don't have it easily accessible. It's not the only one.  &lt;i&gt;The Adventures of Huckleberry Finn&lt;/i&gt; exists in the Bodleian Library copy of &lt;a href="http://books.google.com/books?id=-bAIAAAAQAAJ&amp;amp;printsec=frontcover&amp;amp;dq=Huckleberry+Finn&amp;amp;hl=en&amp;amp;ei=KP2kTbN36svRAcuM3f0I&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CC8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false"&gt;the 1884 first British edition&lt;/a&gt;, but the first American edition (1885) seems to be missing. 
The Bodleian's indifference about American classics seems also to be responsible for Google Books' &lt;a href="http://books.google.com/books?id=uK4BAAAAQAAJ"&gt;copy of the Confidence Man&lt;/a&gt;, not present in Open Library, but a lot of books are missing entirely: there's no &lt;i&gt;&lt;a href="http://www.google.com/search?q=Tom+Sawyer&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ZQ2lTdyrOMm10QGhlp3-CA&amp;amp;ved=0CBYQpwUoBA&amp;amp;source=lnt&amp;amp;tbs=cdr%3A1%2Ccd_min%3A1876%2Ccd_max%3A1876&amp;amp;tbm=bks"&gt;Tom Sawyer&lt;/a&gt;, &lt;/i&gt;either, no &lt;i&gt;Origin of Species &lt;/i&gt;until 1861… I'm sure the list goes on. &lt;br /&gt;&lt;br /&gt;That's ironic, but also a neat little parable about how the touchstones of the mid-century academy are approaching the Internet. We're so focused on preserving the book that most contemporary academic research in the humanities is as inaccessible as it's ever been: journal articles available only from within university campuses, and books not available online at all. Since scholars haven't been heavily involved in putting things online, not only the first edition of &lt;i&gt;Moby-Dick&lt;/i&gt; but most of the current scholarship about &lt;i&gt;Moby-Dick &lt;/i&gt;is still nearly invisible on the Internet. Protecting the culture of the past to be used like it always has been means excluding it from new currents of consumption.&lt;br /&gt;&lt;br /&gt;That's a neat story, but is it really a big deal? For some text analysis, to be sure, this is a bit of a pain. I'd really  like a complete run of Melville or Twain's works to compare to the  language of their contemporaries--but lacking the first editions, I have  to rely on later publications. 
In extreme cases of late rehabilitation like &lt;i&gt;The Confidence Man,&lt;/i&gt; Open Library has only that one oddly cataloged microfilm copy before the &lt;a href="http://sappingattention.blogspot.com/2011/01/digital-history-and-copyright-black.html"&gt;copyright cutoff&lt;/a&gt;:  and where there are public domain copies, it requires really good  FRBRization to be able to get the original publication year. And of  course, as &lt;a href="http://sappingattention.blogspot.com/2010/12/not-included-in-ngrams-tom-sawyer.html"&gt;I said about Tom Sawyer and Google ngrams&lt;/a&gt;  a while ago, it may be hard to sell humanists on the idea that text analysis measures the entire culture when it's missing its most central  documents.&lt;br /&gt;&lt;br /&gt;But for the real distant reading stuff, I don't think it matters much. For any project that actually takes advantage of what digital reading allows, quantity matters far more than quality. If we wait for nice TEI editions of all books to show up, it will be decades before anyone could leverage the most interesting techniques one can use on large bodies of texts. Either you're doing Melville studies, in which case you can just add a  copy of Moby-Dick to your database, or you're not, in which case a few  nautical terms here and a few whale skeletons there aren't going to  change the language very much. Any study whose results would be changed  by a couple books is probably trying too hard to wring evidence out of a  &lt;a href="http://www.baseball-reference.com/teams/BOS/2011.shtml"&gt;small sample size&lt;/a&gt;.  As long as I'm right that the really famous books are missing just  because they're so famous, and not because the database is completely  ridden with holes, the general picture of the language should be fine. 
At some point, I'd hope Google, Hathi, or Open Library would take the lead in scanning books from rare-books libraries—I can't imagine, honestly, that the last library will actually be missing these texts for long. But I would be surprised if there were more than a few dozen  books in the 19th century that meet Moby Dick's criteria of incredible  modern value &lt;i&gt;and &lt;/i&gt;original obscurity.&lt;br /&gt;&lt;br /&gt;Nonetheless, it's a helpful reminder that we've got a &lt;i&gt;long&lt;/i&gt; way to go before we can talk about comprehensive book digitization. Our current collection of texts is skewed in all sorts of strange ways. Not only by library collection patterns, but by where they keep their physical books, by what books are easier to enter consistent metadata for, by how certain authors' reputations waxed and waned… This is all more evidence that we're just beginning to get a sense of how our big digital libraries differ from our old stone ones. And that without checking each other's work for mistakes of the type that only specialists, or librarians, or archivists can catch, we might find ourselves in some uncomfortable situations.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-QGVRS5HyD00/TaUab0nMJUI/AAAAAAAACxE/mxchyvr8J6c/s1600/kent.jpg" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-QGVRS5HyD00/TaUab0nMJUI/AAAAAAAACxE/mxchyvr8J6c/s320/kent.jpg" width="256" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-2835479119488990310?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/2835479119488990310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://sappingattention.blogspot.com/2011/04/in-search-of-great-white-whale.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2835479119488990310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2835479119488990310'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/04/in-search-of-great-white-whale.html' title='In search of the great white whale'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-YDbvzXD5hvo/TaUZJv7t4jI/AAAAAAAACw8/QsOUmXEdQOM/s72-c/md_213.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-2683818589995376671</id><published>2011-04-11T17:33:00.002-04:00</published><updated>2011-04-13T12:50:47.383-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data exploration and visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='authors'/><category scheme='http://www.blogger.com/atom/ns#' term='Featured'/><title type='text'>Age cohort and Vocabulary use</title><content type='html'>Let's start with two self-evident facts about how print culture changes over time:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.&lt;/li&gt;&lt;li&gt; The writers using words change. 
Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.&lt;/li&gt;&lt;/ol&gt;How do (1) and (2) relate to each other? To what extent does the shifting group of authors create the changes in language, and how much do changes happen in a culture that authors all draw from?&lt;br /&gt;&lt;br /&gt;This might be a historical question, but it also might be a linguistics/sociology/culturomics one. Say there are two different models of language use: type A and type B.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Type A means a speaker drifts on the cultural winds: the language shifts and everyone changes their vocabulary every year.&lt;/li&gt;&lt;li&gt;Type B, on the other hand, assumes that vocabulary is largely fixed at a certain age: a speaker will be largely consistent in her word choice from age 30 to 70, say, and new terms will not impinge on her vocabulary.&lt;/li&gt;&lt;/ul&gt;&amp;nbsp;Both of these models are extremes, and we can assume that hardly any words are pure A or pure B. To firm this up, let me concretize this with two nicely alphabetical examples of fictional characters to warm up the subject for all you humanists out there:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Type A: John Updike's &lt;a href="http://www.amazon.com/Rabbit-Run-John-Updike/dp/0449911659"&gt;Rabbit Angstrom&lt;/a&gt;. Rabbit doesn't know what he wants to say. Every decade, his vocabulary changes; he talks like an ennui-ed salaryman in the 50s, flirts with hippiedom &lt;i&gt;and &lt;/i&gt;Nixonian silent-majorityism in the 60s, spends the late 70s hoarding gold and muttering about Consumer Reports and the Japanese. 
For Updike, part of Rabbit being an everyman is the shifts he undergoes from book to book: there's a sort of implicit type-A model underlying his transformations. He's a different person at every age because America is different in every year.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Type B: Richard Ford's &lt;a href="http://www.amazon.com/Independence-Day-Bascombe-Trilogy-2/dp/0679735186"&gt;Frank Bascombe&lt;/a&gt;. Frank Bascombe, on the other hand, has his own voice. It shifts from decade to decade, to be sure, but 80s Bascombe sounds more like 2000s Bascombe than he sounds like 80s Angstrom. What does change is internal to his own life: he's in the Existence period in the 90s and worries about careers, and in the 00s he's in the Permanent Period and worried about death. Bascombe is a dreamy outsider everywhere he goes: the Mississippian who went to Ann Arbor, always perplexed by the present.* &lt;/li&gt;&lt;/ul&gt;Anyhow: I don't have good enough author metadata right now to check this on authors (which would be really interesting), but I &lt;i&gt;can &lt;/i&gt;do it a bit on words. An Angstrom word would be one that pops up across all age cohorts in society simultaneously; a Bascombe word is one that creeps in more with each succeeding generation, but that doesn't change much over time within an age cohort.&lt;br /&gt;&lt;br /&gt;This is getting into some pretty multi-dimensional data, so we need something a little more complicated than line graphs. 
The solution I like right now is heat maps.&lt;br /&gt;&lt;br /&gt;An example: I know that "outside" is a word that shows a &lt;a href="http://ngrams.googlelabs.com/graph?content=outside&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;steady, upward trend&lt;/a&gt; from 1830 to 1922; in fact, &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;I found&lt;/a&gt; that it was so steady that it was among the best words at helping to date books based on their vocabulary usage. So how did "outside" become more popular? Was it the Angstrom model, where everyone just started using it more? Or was it the Bascombe model, where each succeeding generation used it more and more? To answer that, we need to combine author birth year with year of publication:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-k2V0pHWKpvs/TaNn2Kwgj2I/AAAAAAAACww/ussSb3_KdVs/s1600/usage+of+outside+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-k2V0pHWKpvs/TaNn2Kwgj2I/AAAAAAAACww/ussSb3_KdVs/s1600/usage+of+outside+by+age+and+year.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-klH-AtS7kOM/TaMrB_6es8I/AAAAAAAACwM/BQ6fmH2sjLY/s1600/usage+of+outside+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Any given point tells you how much, on average, a given age group used the word in a given year. The scale gives the numeric values for the colors. (This is some messy, incomplete data. There are a lot of years I have no data for at all, and the raw data is so spiky as to be almost unreadable. 
As a result, I've had to smooth the results, using the same formula I do for charts.) To read it, then, you can pick a random point—60 year olds in 1910, say. It shows you yellowish orange, which means they use the word "outside" a fair amount: maybe .12 words per thousand. &lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;This chart, for example, says that the people using the word 'outside' the &lt;i&gt;most &lt;/i&gt;are people in their 30s around 1920; and people using it the &lt;i&gt;least &lt;/i&gt;are those in their 70s around 1830. At any given age, it becomes more prevalent as time goes on—fifty-year-olds in 1900 say "outside" more than fifty-year-olds in 1850. That's unsurprising. More interesting is that in any given year, young people seem to use 'outside' more than old people. That indicates there's a substantial generation bias to the way the term enters the language, which tends towards a Bascombe interpretation of how language shifts in this case.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;What happens to a single generation over time? The diagonal lines on the chart show how those generations move. Just to the right of 1880 is a line labeled "b. 1855". That shows the path on this chart that any &lt;i&gt;individual &lt;/i&gt;or &lt;i&gt;generational cohort &lt;/i&gt;takes. Eugene Debs, for example, was born in 1855. That line shows how his life fits on this chart. Basically, it just does math for us between the two axes: the line shows how old he was at any time: 30 years old in 1885, 45 years old in 1900, 67 in 1922.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;As the 'b. 
1855' line moves through time, it shows us how Debs' generation changed its usage of "outside" as it aged. The answer: not much. They start off in orange-ish yellow and end there. Although the printed culture &lt;i&gt;as a whole &lt;/i&gt;used the word 'outside' more in the 1920s than in the 1890s, the writers of Debs' generation used it just about the same amount. Looking at some of the other diagonal lines, some generations—1835–1840, say—do seem to increase their usage of the word over time, but generally 'outside' seems to be pretty much a Bascombe word.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;That basic conclusion—that change happens &lt;i&gt;across &lt;/i&gt;generations more than &lt;i&gt;within &lt;/i&gt;them—seems to be true for most words that show a remarkably steady drift. 'Outside' is a particularly neat example, but most of the other words that show a steady change in a single direction show a type-B pattern. (By one measure, here are the words I found earlier that showed a steady &lt;a href="http://ngrams.googlelabs.com/graph?content=justly%2Ccircumstances%2Coccasion%2Cprudent%2Cacknowledged%2Cnotwithstanding%2Cdisposed%2Crendered%2Ccommencement%2Ccircumstance%2Cmode&amp;amp;year_start=1822&amp;amp;year_end=1922&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;decline&lt;/a&gt; and &lt;a href="http://ngrams.googlelabs.com/graph?content=justly%2Ccircumstances%2Coccasion%2Cprudent%2Cacknowledged%2Cnotwithstanding%2Cdisposed%2Crendered%2Ccommencement%2Ccircumstance%2Cmode&amp;amp;year_start=1822&amp;amp;year_end=1922&amp;amp;corpus=0&amp;amp;smoothing=3"&gt;ascent&lt;/a&gt;.) 
To show a couple other charts at random from that set:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-FX8BWdhJlJM/TaM6gHh7Q1I/AAAAAAAACwU/zk1cf7W9PLo/s1600/usage+of+background+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-n5y05-e3WCs/TaNnkw4IggI/AAAAAAAACwk/gqEnUeFSXUU/s1600/usage+of+prudent+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-n5y05-e3WCs/TaNnkw4IggI/AAAAAAAACwk/gqEnUeFSXUU/s1600/usage+of+prudent+by+age+and+year.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Cu7bNqgELgQ/TaNnl8MmJCI/AAAAAAAACwo/-XSocCi6ndE/s1600/usage+of+justly+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-Cu7bNqgELgQ/TaNnl8MmJCI/AAAAAAAACwo/-XSocCi6ndE/s1600/usage+of+justly+by+age+and+year.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-6_XJrvQPsLw/TaNnmlwAu7I/AAAAAAAACws/vtiV3-iNgpE/s1600/usage+of+background+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-6_XJrvQPsLw/TaNnmlwAu7I/AAAAAAAACws/vtiV3-iNgpE/s1600/usage+of+background+by+age+and+year.png" /&gt;&lt;/a&gt; &lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-TS62-70A5Zw/TaM6hLrR43I/AAAAAAAACwc/h1ihNkrnyZM/s1600/usage+of+justly+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: 
left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Most of these don't show a &lt;i&gt;complete &lt;/i&gt;diagonal orientation to the contour lines, but they're arguably closer to moving by generation than to moving by year (I actually should concoct a metric for that, but no obviously best method occurs to me).&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Are there any natural Angstrom words at all? Well, as I was saying &lt;a href="http://sappingattention.blogspot.com/2011/04/generations-vs-contexts.html"&gt;earlier&lt;/a&gt;, "evolution" might be one. Here's its heat map:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-GqFohwRfoOw/TaNoFhMOKwI/AAAAAAAACw0/h6sstncrIVc/s1600/usage+of+evolution+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-GqFohwRfoOw/TaNoFhMOKwI/AAAAAAAACw0/h6sstncrIVc/s1600/usage+of+evolution+by+age+and+year.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;It breaks in unevenly but across all age groups from 1870–1890, and in the 1890s is most heavily used by old people (in their late 60s and 70s). One age cohort—those born around 1860, coincidentally the same time as the &lt;i&gt;Origin &lt;/i&gt;was published—does seem to use evolution more heavily over time than other generations. But in general, there's very little evidence that "evolution" gained popularity through a youth movement. 
My initial suspicions that its early advocates are surprisingly elderly seem partly borne out.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;More ephemeral words tend to be type-A as well: "war", for example, breaks across all ages in 1917 (although it is surprisingly &lt;i&gt;not&lt;/i&gt; present in the 1860s: my old sample had more periodicals, but apparently books weren't as up on the Civil War as the Great one):&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-jE54cS6hWOg/TaNp1mV-I6I/AAAAAAAACw4/gTFzMH-UYn8/s1600/usage+of+war+by+age+and+year.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-jE54cS6hWOg/TaNp1mV-I6I/AAAAAAAACw4/gTFzMH-UYn8/s1600/usage+of+war+by+age+and+year.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;An aside: Looking at these charts makes me realize there is actually a third model for word behavior: there could be words that &lt;i&gt;only &lt;/i&gt;show variation by author age, and that remain constant across the years. These certainly exist in spoken language, where they correspond to life stages ("Potty", "Homework", "Dorm", "Wheelchair") but it's less clear to me that we should expect to find many in the printed record. Looking at a few dozen randomly generated charts doesn't show anything conclusive enough to share that looks like a type-C word. You can imagine it, though. 
The coasts of the landmasses on these charts would stretch from east to west, instead of north to south like they do for "war", or southwest to northeast as for "outside."&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;If I decide how best to test for words fitting each of these models, it might turn up some interesting cases. Are science words more likely to be Bascombes, while phenomenological language like&amp;nbsp; is more likely to be type-A? How do particular constructions ("pay attention") shift across generations and time? It would also be possible to test whole baskets of words (or the prevalence of certain topics created by topic-modeling…), which would somewhat correct for the small sample sizes that keep this analysis from being much more than fun at present.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;But for now let me say just that I think this is an important way of plugging back into &lt;i&gt;people&lt;/i&gt; from structures of culture which is what we get from books. Metadata about authors is critically important for this sort of work, as well as the ability to do on-the-fly parsing of the relationship between different variables (here, publication year and author birth). That need cuts directly against the desires of content providers (Google Books, JSTOR data for research) to keep researchers from being able to reconstruct their content after the fact. This is only possible because of the incredible amount of data that &lt;a href="http://openlibrary.org/"&gt;Open Library &lt;/a&gt;at the Internet Archive both makes available and links to their books. 
(Although eventually, one could do it with HathiTrust data too by just parsing the years out of the author field.)&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Of course, generations aren't people. Neither are years or genres. But if we want to understand how people relate to the changes in the cultural fabric around them, we need to look at how these variables interact, not just one at a time. If we want to just understand the culture independent of people, we might not. The former solution, I think, is a lot more pregnant with possibility.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;*Further digression on Updike and Ford: The A-B distinction maps, in a way, to the structure-agency divide. That's why it seems interesting. Angstrom is at the mercy of social forces, while Bascombe has his own path. But it's not just because Bascombe is smarter: Updike himself is a type-A, so great &lt;i&gt;just because &lt;/i&gt;he can voice the changing times. Richard Ford, on the other hand, doesn't seem to have that same need to be recognized as the greatest living American novelist, and picks his own way. 
(We can even turn this into a  dorm-room conversation: Jonathan Franzen &lt;a href="http://www.google.com/search?hl=en&amp;amp;client=firefox-a&amp;amp;hs=fp4&amp;amp;rls=org.mozilla%3Aen-US%3Aofficial&amp;amp;q=%22Franzen+knows+that+college+freshmen+are+today+called+%E2%80%9Cfirst+years%2C%E2%80%9D+like+tender+shoots+in+an+overplanted+garden%22&amp;amp;aq=f&amp;amp;aqi=&amp;amp;aql=&amp;amp;oq="&gt;is an Angstrom&lt;/a&gt;, Edith Wharton  was a Bascombe, Don Delillo's work since &lt;i&gt;Underworld&lt;/i&gt; has been so disappointing because he's a Bascombe who feels the need to be an Angstrom.)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-2683818589995376671?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/2683818589995376671/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2683818589995376671'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2683818589995376671'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html' title='Age cohort and Vocabulary use'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/-k2V0pHWKpvs/TaNn2Kwgj2I/AAAAAAAACww/ussSb3_KdVs/s72-c/usage+of+outside+by+age+and+year.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-6350155071846728448</id><published>2011-04-03T13:45:00.001-04:00</published><updated>2011-04-03T15:20:46.500-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Building a Corpus'/><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><title type='text'>Stopwords to the wise</title><content type='html'>&lt;a href="http://cliotropic.org/"&gt;Shane Landrum&lt;/a&gt; (@cliotropic) &lt;a href="http://twitter.com/?from=emailheader&amp;amp;utm_campaign=newfollow20100823&amp;amp;utm_medium=email&amp;amp;utm_source=follow#%21/cliotropic/status/54409710627061760"&gt;says&lt;/a&gt; my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.&lt;br /&gt;&lt;br /&gt;A particularly clear connection is from database structures to "categories of analysis" in our methodology. Since humanists share methods in a lot of ways, digital resources designed for one humanities discipline will carry well for others. 
But it's quite possible to design a resource that makes extensive use of certain categories of analysis nearly impossible.&lt;br /&gt;&lt;br /&gt;&lt;span&gt;One clear-cut example: &lt;a href="http://ancestry.com/"&gt;ancestry.com&lt;/a&gt;. The bulk of interest in digitized census records lies in two groups: historians and genealogists. That web site is clearly built for the latter: it has lots of genealogy-specific features built into the database for matching sound-alike names and misspellings, for example, but almost nothing for social history. (I'm pretty sure you can't use it to find German cabinet-makers in Camden in 1850, for example.) &lt;a href="http://ancestry.com/"&gt;Ancestry.com&lt;/a&gt; views &lt;/span&gt;&lt;b&gt;names&lt;/b&gt; (last names in particular) as the most important field and structures everything else around serving those up. Lots of historians are more interested in the &lt;b&gt;place&lt;/b&gt;&lt;i&gt; &lt;/i&gt;or the &lt;b&gt;profession&lt;/b&gt;&lt;i&gt; &lt;/i&gt;or the &lt;b&gt;ancestry&lt;/b&gt;&lt;i&gt; &lt;/i&gt;fields in the census: what we take as a unit of analysis affects what we want to see database indexes and search terms built around. (And that's not even getting into the question of aggregating the records into statistics.)&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;That example is too cut-and-dried, to be sure: but a similar set of concerns applies to digitized newspapers, books, archival records, and so forth. For example, if I can get the work I've been doing on &lt;a href="http://sappingattention.blogspot.com/2011/03/author-ages.html"&gt;generational&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/04/generations-vs-contexts.html"&gt; differences&lt;/a&gt; in language to point somewhere interesting, that would help demonstrate how better author metadata can cast a new light on book metadata. The boundaries between disciplines aren't necessarily firm. 
(All the examples I can think of would apply to historicist literary criticism, too.) But &lt;b&gt;generation&lt;/b&gt; is another category historians might want to use that could be lost from a database pretty easily.&lt;br /&gt;&lt;br /&gt;Those are easy examples, though. When it comes to history, some categories are more important than others. The Big Three categories are class-race-gender. And &lt;b&gt;gender&lt;/b&gt;, in particular, can be implicated in the way we structure lexical databases here and now in really nitty-gritty ways. Let me talk about one: &lt;a href="http://en.wikipedia.org/wiki/Stop_words"&gt;Stop words&lt;/a&gt;, the common function words that are removed from many algorithmic processes and some databases in textual analysis.&lt;br /&gt;&lt;br /&gt;What's a stop word? That depends on the database designer. A list of stop words is a functional category: they are words that don't meet a cost-benefit cutoff to include in a given algorithm. The 100 most common words in the language account for about 50% of the word counts in printed matter. Since they're so common, they place a big tax on databases and algorithms, and it makes sense to strip most or all of them out. Lots of common text-processing methods do so--non-quote-enclosed Google searches, topic modeling, etc. Those common words lack specific meaning so much that it's &lt;a href="http://www.sporcle.com/games/common_english_words.php"&gt;surprisingly hard&lt;/a&gt; to name them all off the top of your head. So removing them makes sense, generally: the &lt;a href="http://www.ranks.nl/resources/stopwords.html"&gt;first Google hit&lt;/a&gt; for English stopwords includes words like "it," "if," "why," and "and."&lt;br /&gt;&lt;br /&gt;But: what we think is important depends on what categories we think are relevant. The list also includes words like "he," "she," "him," and "her." 
Everything that tells us the gender of an actor or subject, in other words, is frequently presumed to be unimportant by most architects of information retrieval systems. For some purposes, you can get it out, of course; but if you use a pre-made standard topic model and want to know how frequently gendered terms appear, you're probably out of luck.&lt;br /&gt;&lt;br /&gt;That might be a problem. There's actually some really interesting information buried inside there. For example, I found that while the two most strongly correlated words in my sample by one definition were "united" and "states," the strongest negative correlation was between &lt;a href="http://sappingattention.blogspot.com/2011/01/correlations.html"&gt;"her" and "government."&lt;/a&gt; If we want to know where women are, that tells us something; if we believe there's something to structures of gendered discourse underlying language, that tells us something more. I think it's safe to say that a lot of computer database programmers aren't quite as interested in things like this as most historians.&lt;br /&gt;&lt;br /&gt;As a simple example of how stopwords can have meaning, we can look at the ratio of male to female pronouns in books. I take the number of times "his," "him," and "he" appear, and divide it by the number of times "hers," "her," and "she" appear. Here's a list of Library of Congress subclasses, sorted from most male-preponderant to least in my &lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;bigpubs database&lt;/a&gt; of 48,000 books from 1800 to 1922:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;&amp;nbsp;QD: Chemistry: 35.89 times as many male pronouns&lt;br /&gt;&amp;nbsp;TJ: Mechanical engineering and machinery: 29.64 times as many male pronouns&lt;br /&gt;&amp;nbsp;TA: Engineering (General). 
Civil engineering: 28.86 times as many male pronouns&lt;br /&gt;&amp;nbsp;QR: Microbiology: 24.8 times as many male pronouns&lt;br /&gt;&amp;nbsp;TP: Chemical technology: 22.48 times as many male pronouns&lt;br /&gt;&amp;nbsp;TN: Mining engineering. Metallurgy: 22.44 times as many male pronouns&lt;br /&gt;&amp;nbsp;B: PHILOSOPHY. PSYCHOLOGY. RELIGION: 21.2 times as many male pronouns&lt;br /&gt;&amp;nbsp;QA: Mathematics: 20.2 times as many male pronouns&lt;br /&gt;&amp;nbsp;QC: Physics: 19.64 times as many male pronouns&lt;br /&gt;&amp;nbsp;BT: Doctrinal Theology: 19.56 times as many male pronouns&lt;br /&gt;&amp;nbsp;TS: Manufactures: 16.54 times as many male pronouns&lt;br /&gt;&amp;nbsp;HB: Economic theory. Demography: 16.42 times as many male pronouns&lt;br /&gt;&amp;nbsp;BS: The Bible: 15.26 times as many male pronouns&lt;br /&gt;&amp;nbsp;HM: Sociology (General): 14.31 times as many male pronouns&lt;br /&gt;&amp;nbsp;BR: Christianity: 13.91 times as many male pronouns&lt;br /&gt;&amp;nbsp;HG: Finance: 13.22 times as many male pronouns&lt;br /&gt;&amp;nbsp;T: TECHNOLOGY: 12.71 times as many male pronouns&lt;br /&gt;&amp;nbsp;Q: SCIENCE: 12.12 times as many male pronouns&lt;br /&gt;&amp;nbsp;Z: BIBLIOGRAPHY. 
LIBRARY SCIENCE.: 11.04 times as many male pronouns&lt;br /&gt;&amp;nbsp;KF: United States: 10.51 times as many male pronouns&lt;br /&gt;&amp;nbsp;QE: Geology: 9.97 times as many male pronouns&lt;br /&gt;&amp;nbsp;LD: Individual institutions - United States: 9.43 times as many male pronouns&lt;br /&gt;&amp;nbsp;JC: Political theory: 9.36 times as many male pronouns&lt;br /&gt;&amp;nbsp;E: HISTORY OF THE AMERICAS: 9.08 times as many male pronouns&lt;br /&gt;&amp;nbsp;BX: Christian Denominations: 9.02 times as many male pronouns&lt;br /&gt;&amp;nbsp;JK: Political institutions and public administration (United States): 8.97 times as many male pronouns&lt;br /&gt;&amp;nbsp;HF: Commerce: 8.65 times as many male pronouns&lt;br /&gt;&amp;nbsp;SK: Hunting sports: 8.38 times as many male pronouns&lt;br /&gt;&amp;nbsp;TK: Electrical engineering. Electronics. Nuclear engineering: 8.35 times as many male pronouns&lt;br /&gt;&amp;nbsp;BV: Practical Theology: 8.28 times as many male pronouns&lt;br /&gt;&amp;nbsp;F: HISTORY OF THE AMERICAS: 8.26 times as many male pronouns&lt;br /&gt;&amp;nbsp;R: MEDICINE: 8.25 times as many male pronouns&lt;br /&gt;&amp;nbsp;BL: Religions. Mythology. Rationalism: 8.06 times as many male pronouns&lt;br /&gt;&amp;nbsp;QH: Natural history - Biology: 7.77 times as many male pronouns&lt;br /&gt;&amp;nbsp;NA: NA: 7.63 times as many male pronouns&lt;br /&gt;&amp;nbsp;ND: Painting: 7.58 times as many male pronouns&lt;br /&gt;&amp;nbsp;GV: Recreation. Leisure: 7.56 times as many male pronouns&lt;br /&gt;&amp;nbsp;PA: Greek language and literature. Latin language and literature: 7.56 times as many male pronouns&lt;br /&gt;&amp;nbsp;QP: Physiology: 7.43 times as many male pronouns&lt;br /&gt;&amp;nbsp;DT: Africa: 7.23 times as many male pronouns&lt;br /&gt;&amp;nbsp;LA: History of education: 6.91 times as many male pronouns&lt;br /&gt;&amp;nbsp;RM: Therapeutics. 
Pharmacology: 6.57 times as many male pronouns&lt;br /&gt;&amp;nbsp;HN: Social history and conditions. Social problems. Social reform: 6.53 times as many male pronouns&lt;br /&gt;&amp;nbsp;LB: Theory and practice of education: 6.52 times as many male pronouns&lt;br /&gt;&amp;nbsp;S: AGRICULTURE: 6.5 times as many male pronouns&lt;br /&gt;&amp;nbsp;SB: Plant culture: 6.47 times as many male pronouns&lt;br /&gt;&amp;nbsp;QB: Astronomy: 6.3 times as many male pronouns&lt;br /&gt;&amp;nbsp;PE: English language: 6.16 times as many male pronouns&lt;br /&gt;&amp;nbsp;QK: Botany: 5.94 times as many male pronouns&lt;br /&gt;&amp;nbsp;RD: Surgery: 5.86 times as many male pronouns&lt;br /&gt;&amp;nbsp;HE: Transportation and communications: 5.8 times as many male pronouns&lt;br /&gt;&amp;nbsp;SF: Animal culture: 5.79 times as many male pronouns&lt;br /&gt;&amp;nbsp;DG: Italy - Malta: 5.59 times as many male pronouns&lt;br /&gt;&amp;nbsp;DD: Germany: 5.58 times as many male pronouns&lt;br /&gt;&amp;nbsp;DA: Great Britain: 5.45 times as many male pronouns&lt;br /&gt;&amp;nbsp;N: FINE ARTS: 5.33 times as many male pronouns&lt;br /&gt;&amp;nbsp;BJ: Ethics: 5.32 times as many male pronouns&lt;br /&gt;&amp;nbsp;DS: Asia: 5.12 times as many male pronouns&lt;br /&gt;&amp;nbsp;DK: Russia. Soviet Union. Former Soviet Republics - Poland: 4.74 times as many male pronouns&lt;br /&gt;&amp;nbsp;HC: Economic history and conditions: 4.7 times as many male pronouns&lt;br /&gt;&amp;nbsp;D: WORLD HISTORY: 4.69 times as many male pronouns&lt;br /&gt;&amp;nbsp;LC: Special aspects of education: 4.64 times as many male pronouns&lt;br /&gt;&amp;nbsp;G: GEOGRAPHY. ANTHROPOLOGY. RECREATION: 4.45 times as many male pronouns&lt;br /&gt;&amp;nbsp;CS: Genealogy: 4.42 times as many male pronouns&lt;br /&gt;&amp;nbsp;HD: Industries. Land use. 
Labor: 4.41 times as many male pronouns&lt;br /&gt;&amp;nbsp;ML: Literature on music: 4.25 times as many male pronouns&lt;br /&gt;&amp;nbsp;RA: Public aspects of medicine: 4.09 times as many male pronouns&lt;br /&gt;&amp;nbsp;NK: Decorative arts: 3.91 times as many male pronouns&lt;br /&gt;&amp;nbsp;PN: Literature (General): 3.86 times as many male pronouns&lt;br /&gt;&amp;nbsp;CT: Biography: 3.79 times as many male pronouns&lt;br /&gt;&amp;nbsp;PQ: French literature - Italian literature - Spanish literature - Portuguese literature: 3.77 times as many male pronouns&lt;br /&gt;&amp;nbsp;JX: International law, see JZ and KZ (obsolete): 3.67 times as many male pronouns&lt;br /&gt;&amp;nbsp;RC: Internal medicine: 3.37 times as many male pronouns&lt;br /&gt;&amp;nbsp;BF: Psychology: 3.28 times as many male pronouns&lt;br /&gt;&amp;nbsp;QL: Zoology: 3.26 times as many male pronouns&lt;br /&gt;&amp;nbsp;PR: English literature: 3.1 times as many male pronouns&lt;br /&gt;&amp;nbsp;DC: France - Andorra - Monaco: 2.84 times as many male pronouns&lt;br /&gt;&amp;nbsp;PT: German and Germanic literature: 2.69 times as many male pronouns&lt;br /&gt;&amp;nbsp;HV: Social pathology. Social and public welfare. Criminology: 2.66 times as many male pronouns&lt;br /&gt;&amp;nbsp;PS: American literature: 2.31 times as many male pronouns&lt;br /&gt;&amp;nbsp;PZ: Fiction and juvenile belles lettres: 1.63 times as many male pronouns&lt;br /&gt;&amp;nbsp;TX: Home economics: 1.27 times as many male pronouns&lt;br /&gt;&amp;nbsp;HQ: The family. Marriage. Women: 0.88 times as many male pronouns&lt;br /&gt;&amp;nbsp;RG: Gynecology and obstetrics: 0.74 times as many male pronouns&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Of course these things aren't transparently about gender—had I not set a  cutoff of 100 books for a genre to be included, you'd see "&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;VK: Navigation. 
Merchant marine: 0.91 time…&lt;/span&gt;&lt;/span&gt;" at the bottom of the list, thanks to all those female boats. &amp;nbsp;But clearly there's some information in the stopwords, and not just at the obvious extremes. For example, more "serious" American literature (PS) has fewer feminine pronouns than 'fiction and juvenile belles lettres' (PZ: although PS also has more criticism, which is part of it), and American literature has the most women, followed by German, then English, then Romance languages, and finally classical literature (way up there). French history has a surprising surfeit of women. Practical theology has way more women than doctrinal theology (and wouldn't you like to know which denominations in 'doctrinal theology' let femininity creep into their discourse a little bit, even if a male God leads to the overwhelmingly male scores in the field as a whole?) And that doesn't even get into change over time, strategies to find citations of women or female actors engaged in specific grammatical constructions, and so on.&lt;br /&gt;&lt;br /&gt;Designing databases and running algorithms, I have to decide these things all the time. When I did a topic model of all the genres in my database a couple weeks ago, I took out all stop words (including gendered pronouns) just as the algorithm designers recommend. But now I'm really curious how gendered pronouns would have been distributed had I left them in. Curious enough to have my computer spend another 14 hours or so churning through the data with pronouns included? Ah, there's the question. We'll see. I've also been thinking about rebuilding my database around sentences, rather than books, but I think I'd have to take stop words out for that to be feasible. Should I leave gendered pronouns in nonetheless? 
I'll have to decide for my own purposes how much gendered analysis I'm going to do, and whether the significant costs that imposes (I'd have to move some more MP3s off my hard drive to make room; every query might take 5% longer…) are worth the benefit. Say I worked at ProQuest, though: would you want me to decide for you?&lt;br /&gt;&lt;br /&gt;~~&lt;br /&gt;Historians interested in gender, then, might have good reason to view gendered pronouns not as stop words at all, but as content words that might be foregrounded in research. Systems designers, however, might see them as part of a major deadweight on their database architecture. The organizations that host resources for us will be inclined to omit them in some settings unless they think it's important to preserve possibilities for people doing gender history. Unless people doing gender history show some interest in those resources, that's unlikely to happen. For any other category of analysis, similar factors may be at play--it should be up to scholars to figure out what's possible, but infrastructure places constraints on the questions they can ask.&lt;br /&gt;&lt;br /&gt;Shane had an &lt;a href="http://cliotropic.org/blog/2011/03/proquest-historical-serials-caveat-lector/"&gt;interesting post called 'caveat lector'&lt;/a&gt;&amp;nbsp;I missed while away about surprising omissions in ProQuest databases. I'd only complete the chain and say it really is caveat emptor, too: for products like ProQuest, historians are consumers, plain and simple, even if we consume by reading.&amp;nbsp;The fact that a library writes the check shouldn't blind them to that. 
And in cases like this, they could stand to learn some things from &lt;a href="http://www.ucpress.edu/book.php?isbn=9780520235908"&gt;consumer history&lt;/a&gt; about how buyers might change what's available to them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-6350155071846728448?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/6350155071846728448/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/04/stopwords-to-wise.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6350155071846728448'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6350155071846728448'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/04/stopwords-to-wise.html' title='Stopwords to the wise'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-6588621443109535580</id><published>2011-04-01T13:33:00.000-04:00</published><updated>2011-04-01T13:33:42.641-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Evolution'/><category scheme='http://www.blogger.com/atom/ns#' term='authors'/><title type='text'>Generations vs. 
contexts</title><content type='html'>When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was &lt;a href="http://www.gutenberg.org/wiki/Main_Page"&gt;Project Gutenberg&lt;/a&gt;. I quickly found out, though, that they didn't have publication years for works, somewhat to my surprise. (It's remarkable how much metadata holds this sort of work back, rather than data itself.) They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the &lt;a href="http://victorianbooks.org/"&gt;Victorian Books project&lt;/a&gt;, or the &lt;a href="http://www.culturomics.org/"&gt;Culturomists&lt;/a&gt; have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language. &lt;br /&gt;&lt;br /&gt;I've been using 'evolution' as my test phrase for a while now: but as you'll see, it turns out to be a really interesting word for this kind of analysis. Maybe that's just chance, but I think it might be a sort of indicative test case--generational shifts are particularly important for live intellectual issues, perhaps, compared to overall linguistic drift.&lt;br /&gt;&lt;br /&gt;To start off, here's a chart of the usage of the word "evolution" by share of words per year. 
There's nothing new here yet, so this is merely a reminder: &lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-xVORlwFx944/TY0HeGI0rDI/AAAAAAAACug/N_q-kZhcHzE/s1600/wordcounts+of+evolution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh6.googleusercontent.com/-xVORlwFx944/TY0HeGI0rDI/AAAAAAAACug/N_q-kZhcHzE/s1600/wordcounts+of+evolution.png" /&gt;&amp;nbsp;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Here's what's new: we can also plot by year of author birth, which shows some interesting (if small) differences: &lt;/div&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-rBY64mnEVPM/TY0JoO1IwGI/AAAAAAAACuk/M1aELgNHxeA/s1600/wordcounts+of+evolution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh6.googleusercontent.com/-rBY64mnEVPM/TY0JoO1IwGI/AAAAAAAACuk/M1aELgNHxeA/s1600/wordcounts+of+evolution.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;This shows us that authors born before about 1805 hardly use the word evolution at all, and then usage steadily climbs through authors born in 1860, after which it declines. The growth is over about 40 to 50 years, compared to about 30 years of growth to peak (1870 to 1900) for book publication date. The growth occurs on a larger scale even without the big peak in 1820 (which is, I can confirm, due to Herbert Spencer, one of the most frequent authors in my database. 
Darwin, by comparison, doesn't move the chart at all in 1809).&lt;/div&gt;&lt;br /&gt;We have two different ways of looking at vocabulary usage: immediate context, and generational context. What's the best way to compare them? Well, I found in my first post on author birth dates that the median age of authors when their books are published is about 49. To get a more direct comparison, we can look at the resemblance of these two curves by shifting the birth year forward by 49 years and plotting them together. This is a little bit of apples-to-oranges, but I think it's a road somewhere interesting. By plotting it this way and expecting them to coincide, I'm implying an  assumption that since books are written, on average,  by 49-year-olds,  the language choices of 49-year olds should be about the same as the  dominant language in a given year.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh4.googleusercontent.com/-2ShidWkTJVg/TY5edmBf00I/AAAAAAAACvI/0CXgGMFHzCY/s1600/incidence+by+book+publication+.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh4.googleusercontent.com/-2ShidWkTJVg/TY5edmBf00I/AAAAAAAACvI/0CXgGMFHzCY/s1600/incidence+by+book+publication+.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;In this case, that assumption is strikingly incorrect. Pretty obviously, in this case, 49-year-olds are 'ahead of the curve.' (That remains true even if I take Herbert Spencer, 49 in 1869, out of the sample). What does that mean, you ask? Me too. It means that people who were 49 in the 1870s, for instance, use the word "evolution" quite a bit even though it's not very popular at the time period we'd think they'd write the most books. This probably means they are using it more in their 60s and 70s than we'd expect. 
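&lt;br /&gt;&lt;br /&gt;For anyone who wants to try the same comparison, here is a minimal sketch in Python of the shift described above: re-key the birth-year series by the year each cohort turns 49, then line it up against the publication-year series. This is not the code behind the charts; the function names are hypothetical and the numbers are invented toy values, not figures from my database.

```python
# Toy illustration only: the shares below are invented, not corpus data.
MEDIAN_AUTHOR_AGE = 49  # median author age at publication, as stated in the post


def shift_birth_years(usage_by_birth_year, offset=MEDIAN_AUTHOR_AGE):
    """Re-key a birth-year series by the year each cohort reaches `offset`."""
    return {birth + offset: share for birth, share in usage_by_birth_year.items()}


def compare_curves(usage_by_pub_year, usage_by_birth_year, offset=MEDIAN_AUTHOR_AGE):
    """Return {year: (publication share, shifted cohort share)} for shared years."""
    shifted = shift_birth_years(usage_by_birth_year, offset)
    shared = sorted(set(usage_by_pub_year).intersection(shifted))
    return {year: (usage_by_pub_year[year], shifted[year]) for year in shared}


# Hypothetical shares of all words that are "evolution" (arbitrary units).
by_pub_year = {1869: 2.0, 1879: 6.0, 1889: 11.0}
by_birth_year = {1820: 9.0, 1830: 10.0, 1850: 8.0}

comparison = compare_curves(by_pub_year, by_birth_year)
for year, (pub_share, cohort_share) in comparison.items():
    # A positive gap means the cohort turning 49 that year is "ahead of the curve".
    print(year, pub_share, cohort_share, cohort_share - pub_share)
```

Only years present in both series are compared, so cohorts whose shifted year falls outside the publication range simply drop out. A real version would also want to smooth both series before comparing them, since single prolific authors can dominate a birth year.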
&lt;br /&gt;&lt;br /&gt;That's not, on the surface, particularly surprising, because the word didn't exist at all earlier--but it's still potentially interesting for what it tells us about how the term entered the language. In some ways, for example, this seems very un-Kuhnian, on a generational level; the older generation just picks up the new language of evolution and runs with it, rather than being displaced by a new generation using new words. (On the level of individuals, of course, Kuhn might be more right--this could be an interesting thing to check down the road--or all the old folks might be arguing &lt;i&gt;against&lt;/i&gt; evolution.)&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;That chart shows generational use over full lifespans; what happens if we want to know just how different generations use the word "evolution" in a particular period of time? For instance, in the period 1870-1884, when the word really started to take hold, what age groups used it the most? Let's take a look:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-MQ2R4XuaMIc/TZUPo8tlcPI/AAAAAAAACv8/mjNqQgMIjrQ/s1600/Usage+of+Evolution+by+age+group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-MQ2R4XuaMIc/TZUPo8tlcPI/AAAAAAAACv8/mjNqQgMIjrQ/s1600/Usage+of+Evolution+by+age+group.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;This should be a somewhat surprising chart, I think. It's telling us that from 1870 to 1885, the heaviest users of the term "evolution" were not the young guns--the Civil War generation born in the 30s and 40s--but their slightly older peers. 
(I should confess I've cheated a bit to make my point--if I include 30-year-olds in the sample, there's a huge spike driven by a few books that throws off the whole chart. So it's not as neat as it looks here. But this is accurate for 31- to 80-year-olds, and the high percentage by thirty-year-olds is partly driven by how few of them there are.) &lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;That's weird, right? You'd think young people would use emerging words more than do old people, but that doesn't seem to be the case here. On some level, we can explain this anecdotally--1810 is &lt;a href="http://www.huh.harvard.edu/libraries/asa/asabio.html"&gt;Asa Gray&lt;/a&gt;, 1820 is Spencer, 1825 is &lt;a href="http://en.wikipedia.org/wiki/Thomas_Henry_Huxley"&gt;Huxley&lt;/a&gt;, etc. But that's really more &lt;i&gt;description &lt;/i&gt;than explanation--it shows us that it's been staring us in the face in some ways that Darwinists are old, but thinking about it structurally puts it in a new light. (I didn't know Gray was so old, for example, though this isn't my field.)&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Is "evolution" truly odd, or is this a trend? Well, doing principal components analysis &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;I stumbled across a list of words&lt;/a&gt; that steadily increase their usage over the 19th century. Those tend to be function words, not meaning-laden ones like evolution. So how do they compare? If I dump some of those onto an (ugly! 
R's default colors aren't great) chart, evolution really sticks out; the rest of them move around, but they tend to move upward over time, while evolution clearly has a bump among 50-60 year-olds that other words lack:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-lUljIIpFzt4/TZUVFYClHTI/AAAAAAAACwA/VYu39hFch7g/s1600/Usage+of+Evolution+and+other+words+by+age+group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-lUljIIpFzt4/TZUVFYClHTI/AAAAAAAACwA/VYu39hFch7g/s1600/Usage+of+Evolution+and+other+words+by+age+group.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Now, those words all represent a particular sort of linguistic drift--the type that computers are great at noticing, and people terrible. A subtle increase in use of a word like "appreciation" instead of other synonyms is a shift in language that probably doesn't represent a shift in ideas the way "evolution" does. The tailing off at the end of the period (in which there are few authors) seems perhaps less notable than it might have initially, and the almost complete lack of evolutionary language by anyone older than Darwin (b. 1809) himself jumps out a bit more. But that hump in the 1820s remains.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;So far, I think this evidence suggests there might be something interesting about evolution's adoption in the USA being driven largely by a somewhat older generation. But to be sure, maybe we should put some other words in the mix that are more similar to evolution. 
First, let's look at directly connected words:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Fg1iaPWOA-8/TZYJU0eSotI/AAAAAAAACwE/5PQiv3K3PJo/s1600/use+of+words+for+evolutionary+advances+by+age+group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-Fg1iaPWOA-8/TZYJU0eSotI/AAAAAAAACwE/5PQiv3K3PJo/s1600/use+of+words+for+evolutionary+advances+by+age+group.png" /&gt;&amp;nbsp;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Here, we see some similar spikes in the 1820s for "Darwin" and "species", and perhaps "selection", but the more notable feature is the set of spikes for those words around 1809-1810 as well; that's the Darwin-Gray generation, and they seem to be more interested in the biological/scientific discourse than the 1820s generation who (to wildly speculate) might be branching "evolution" out more into the realm of the social, etc. To look into this further, I could do a &lt;a href="http://sappingattention.blogspot.com/2011/01/correlations.html"&gt;correlation chart&lt;/a&gt; by birth year to see how different generations use Darwinism differently, just as I saw how "heredity" and "evolution" became less heavily correlated from 1860 to 1880.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;A final question, moving forward: is evolution's adoption skewed toward older authors because of something about science/technology? 
Let's look at some other words that have to do with technological adoption in the same period to get a sense.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-B4TPCgJhHFk/TZYJVJu8r1I/AAAAAAAACwI/sVVzXVoSr_E/s1600/use+of+words+for+technological+advances+by+age+group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-B4TPCgJhHFk/TZYJVJu8r1I/AAAAAAAACwI/sVVzXVoSr_E/s1600/use+of+words+for+technological+advances+by+age+group.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;'Steel' and 'telegraph' show a basically steady ascent, but 'railroad' is actually similar to evolution in some ways--it has a founding generation in the 1800s that uses it quite heavily, after which it falls off before beginning a new rise.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;There's a lot of interesting stuff here, and I'm not actually sure which threads to chase down at the moment. One thing that seems clear is that the noise in this data is considerably louder when I try to break down birth-years over just a fifteen-year span--I'm reduced to using only 3 or 4 thousand books for some of these charts, which may not be enough. Some of this stuff about evolution is suggestive, but the numbers aren't big enough to tell us much more than that. 
Still, there may be ways to ask some interesting general questions about how different words and different types of words differ in their age-adoption patterns, which is what I'm thinking about doing next.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-6588621443109535580?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/6588621443109535580/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/04/generations-vs-contexts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6588621443109535580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6588621443109535580'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/04/generations-vs-contexts.html' title='Generations vs. 
contexts'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh6.googleusercontent.com/-xVORlwFx944/TY0HeGI0rDI/AAAAAAAACug/N_q-kZhcHzE/s72-c/wordcounts+of+evolution.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-3646145727375129108</id><published>2011-03-28T16:17:00.002-04:00</published><updated>2011-03-29T16:51:08.202-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='The Profession'/><title type='text'>Cronon's politics</title><content type='html'>Let me step away from digital humanities for just a second to say one thing about the &lt;a href="http://www.newyorker.com/online/blogs/newsdesk/2011/03/wisconsin-the-cronon-affair.html"&gt;Cronon affair&lt;/a&gt;.&lt;br /&gt;(Despite the professor-blogging angle, and that Cronon's upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton's, I don't think this has much to do with DH). The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Most people already know the &lt;a href="http://www.newyorker.com/online/blogs/newsdesk/2011/03/wisconsin-the-cronon-affair.html"&gt;basics of the Cronon case&lt;/a&gt;, so I won't go into that. 
I only have two mild corrections I'd make to the standard intro:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Historians might better explain what it means that Cronon is AHA president-elect. That might sound to outsiders like he's a political animal. The position is mostly honorary; it's a lot closer to the Nobel Prize in history than it is to the presidency of the National Education Association.&lt;/li&gt;&lt;li&gt;Although Cronon has written two highly respected books, &lt;i&gt;Nature's Metropolis&lt;/i&gt; is a better read than &lt;i&gt;Changes in the Land;&lt;/i&gt; that's what you should pick up if you haven't read either.&lt;/li&gt;&lt;/ol&gt;&amp;nbsp;I want to talk about Cronon's goals in all this. I think we could better understand the nature of his "scholar-citizen" intervention. Hank at AmericanScience just &lt;a href="http://americanscience.blogspot.com/2011/03/on-cronon-history-law-and-public-2-of-2.html"&gt;posted&lt;/a&gt; about what Cronon means by calling himself a scholar-citizen. (This post started as a comment there, but then, as you can see, got way too long.) Good stuff, but he focuses more than I would on what &lt;i&gt;historians&lt;/i&gt; in particular have to offer. Also at the core of the fight is a much broader methodological impulse that isn't unique to history: a belief in public openness, in civil conversation, and so on, that applies in some ways to &lt;i&gt;all&lt;/i&gt; academics. 
Like politicians, scholars have different public personas than private personas, and underlying both Cronon's first blog post about ALEC and the later posts is something that feels to me like an 80s-90s academic's love for &lt;a href="http://www.hup.harvard.edu/catalog.php?isbn=9780674197664"&gt;deliberative&lt;/a&gt; &lt;a href="http://press.princeton.edu/titles/7869.html"&gt;democracy&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Deliberative democracy is a sort of communitarian political project that locates the essence of democracy in reaching consensus, rather than in conflict and resolution. Deliberative democrats tend to be fuzzy about votes; they like town halls more than parties; and they view jury deliberation as a much more important democratic tradition than political canvassing. Academics love it in part because it makes the political sphere operate a lot more like the academy--little confrontation and few polar opposites, slow consensus around certain points, and an emphasis on civility and constructive criticism. Now, I don't know anything about Cronon's politics personally, but I find this a really helpful way to think about his actions and goals.&lt;br /&gt;&lt;br /&gt;What happens if we assume Cronon is taking a stance on deliberative democracy, not on the labor question? I think it explains a lot of what some have found funny about his actions. Cronon's complaint about ALEC, remember, is not that it opposes unions; it is that it deliberately obscures its actions from the public sphere. He's probably in favor of public-sector unions, but it's the latter point that led a non-partisan academic to leap headfirst into the blogosphere. From the start, he has argued for a particular type of public discourse, and he has stayed basically within its bounds at all times. His elaborate efforts to establish his Wisconsin bona fides are crucial to his standing as a member of the community. 
His blog is about entering a public sphere in which anyone (even a state employee) should be able to enforce certain norms of democratic civility; and all his protestations of independence, or centrism, are at the core of the belief that democracy should be about consensus, not conflict between parties. A lot of readers, Republican and Democrat, I think, have either been put off by the points he keeps making about the tradition of Wisconsin Republicanism, or ignored them as boilerplate: Walker's bill, they think, is a cut-and-dried partisan issue &lt;i&gt;today&lt;/i&gt;. But those points are actually central to his positioning: he's trying to intervene in the political debate as a centrist because on some level &lt;i&gt;that's the only position appropriate for political debate.&lt;/i&gt; To make his cause a &lt;a href="http://www.nytimes.com/2011/03/28/opinion/28krugman.html"&gt;rally-the-troops moment for Democrats as the party of reason&lt;/a&gt; (warning: nytimes paywall) would be, in its own way, as impermissible as ALEC's behind-the-scenes dealings. &lt;br /&gt;&lt;br /&gt;This has implications for how we talk about what's at stake. I think Cronon's defenders should possibly stop focusing on the legality of the FOIA request, and his attackers shouldn't think they've caught him red-handed. &lt;a href="http://www.slate.com/id/2289482/"&gt;Will Saletan said&lt;/a&gt; that Cronon should spend less time fighting the request, and more time trying to get the law changed--except that FOIA is a good thing. As &lt;a href="http://americanscience.blogspot.com/2011/03/on-william-cronon-history-law-and.html"&gt;Lukas pointed out on AmericanScience&lt;/a&gt;, that's a pretty shallow argument even in legal terms: you can make an interesting argument about competing goods of academic freedom and public information. A lot of people are doing just that. But the rabbit hole of public official vs. state employee, confidential records vs. 
FOIA, takes for granted a legal framework when in fact this is about political discourse. I just handed out a bunch of Bs to students who ascribed the core beliefs of political actors in the Early Republic to differing methods of constitutional interpretation. ("Jefferson was a strict constructionist, and therefore he opposed the Bank of the United States.") The really important thing to defend here is not academic freedom or political speech in general, but a particular type of political speech (deliberative democracy, encouraging openness, etc.) against another type of speech (intimidation, etc.). &lt;br /&gt;&lt;br /&gt;Cronon occasionally gets called "naive," but his naivete is strategic. He's trying to set the terms of debate around political discourse, rather than legalism. If we make this about laws and about rights, we're possibly letting him down. Cronon hasn't intimated that he wants this to play out in the courts; he pretty clearly prefers to shame the Republican party into withdrawing its request. That we tend to talk about political issues in a legal way is a sign of the impoverished discourse that deliberative democracy wants to change. Cronon is already a step ahead of that: he's trying to create the conditions under which there would be no political benefit to using FOIA against a professor, because ad hominem attacks would disgrace the deliverer more than the recipient.&lt;br /&gt;&lt;br /&gt;The commitment to changing the language also helps explain Cronon's allusion to "McCarthyism," which seems overblown to some. &lt;a href="http://us-intellectual-history.blogspot.com/2011/03/cronon-affair-and-political-culture-of.html"&gt;Ben Alpers at USIH&lt;/a&gt; draws out the comparison to moderate conservatism at length. But I think we should view it as relevant not for its critique of Republicanism today. 
More important is the other side of the coin: the McCarthy story offers, as its flipside, a tremendous example of civil discourse in American politics. Cronon himself is trying to be Joe Welch: the outsider who bravely called for decency, and who (through McCarthy's fall) got it. Just as McCarthy didn't quite realize he couldn't treat the Army as poorly as he had more political appointees, Cronon is hoping that his own integrity and standing will help lead to more moderate discourse. Will this work? It's clear, certainly, that the Wisconsin GOP has no idea what a respected figure "Cronin" is, but it's also possible that Cronon himself is overestimating the sympathy he can generate. His original intervention around the union issue was clearly done with an effort &lt;i&gt;not &lt;/i&gt;to be partisan, but Madison has already become such a partisan flash point that it might not work.&lt;br /&gt;&lt;br /&gt;The problem with deliberative democracy is that even though its principles seem like they should be universal, hardly anyone behaves in the public sphere like we'd like. Some of this is because of naked self-interest or weakness in the face of exploiting scandals, but a lot may be because individual goals seem more important. Citizens &lt;i&gt;want&lt;/i&gt; their representatives to use all the tools at their disposal to expand health coverage, reduce state debt burdens, or to resist overreach by the governing party. And that might not be crazy—although one might, like Centrist Cronon, care about political process and democracy more than anything else, one might also privilege property rights, or distributive justice, above any notion of political civility. (Although in some way that might not be rational: I'm e-mailing to try to get an orthodox Rawlsian account out of Ryan in the comments, since Rawls figures heavily in the intellectual heritage on both sides here.) 
That is to say, some people may want to defend Cronon from his left; they should be clear, though, that they're doing their own project, not necessarily his.&lt;br /&gt;&lt;br /&gt;Now, Cronon's not a card-carrying deliberative democrat, as far as I know. As a historian, he's never been a theory-first guy, and he tends to be pretty eclectic in his influences. He may turn around and say something tomorrow that puts the lie to my reading. Certainly not everything he says fits into the deliberative democracy framework. But I find this a helpful enough way to think about what's going on that I thought I'd throw it out there.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-3646145727375129108?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/3646145727375129108/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/03/cronons-politics.html#comment-form' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3646145727375129108'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/3646145727375129108'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/03/cronons-politics.html' title='Cronon&apos;s politics'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-7928286079098509685</id><published>2011-03-24T15:58:00.001-04:00</published><updated>2011-03-24T16:06:44.125-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='authors'/><title type='text'>Author Ages</title><content type='html'>Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me spend a little while looking at a new pool of book data that I think is really interesting.&lt;br /&gt;&lt;br /&gt;Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, though, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going on before that's possible.&lt;br /&gt;&lt;br /&gt;Open Library has pretty comprehensive metadata on authors. In the &lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;bigpubs&lt;/a&gt; database I made, about 40,000 books have author birth years, and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1500 books have no author listed whatsoever.)&lt;br /&gt;&lt;br /&gt;First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. 
Here's a chart of author age for all the books in my bigpubs set:&lt;br /&gt;&lt;a href="https://lh3.googleusercontent.com/-_rXyArjhXAI/TXZudKNgb_I/AAAAAAAACfM/H-sPZ8heg2c/s1600/Author+Age+Histogram.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh3.googleusercontent.com/-_rXyArjhXAI/TXZudKNgb_I/AAAAAAAACfM/H-sPZ8heg2c/s1600/Author+Age+Histogram.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It peaks in the 40s (median 49), and has a long but quite small trail of republication going off into the hundreds, which are mostly posthumous republications. There are also a few books written by child authors that are almost all errors. You can't see information on authors yet to be born, but there are a number of people publishing books at negative ages in the data as well. (Death dates may be worse: the first author I checked on their site was William James. He was listed as living from 1842 to 1827, until I fixed it.) These metadata problems, however, are quite insignificant compared to the data that we get. And the overall curve shows us something important: the vast majority of books in the Internet Archive are fairly new, not dusty reprints. That's why this works so well, even including all the republications in the sample--because although a few authors like Shakespeare or Pope keep getting republished, most don't.&lt;br /&gt;&lt;br /&gt;There's genre variation in this. At one end, the LC subclasses for European literature, history of the low countries, and biography all have a median author age of 60; on the other, municipal government, motor vehicles and aeronautics, and military engineering all have median author ages of 38. Technology classes seem in general to have the youngest authors, history and religion to have the oldest. But that variation is pretty slim. 
Here, for example, are the density curves (pretty much a smoothed version of the above chart) for three big groups: science and technology (LC classes QRST), literature (class B), and everything else (which ends up being mostly education, social sciences, and humanities). This shows, proportionately, how concentrated these different fields are at various ages, in pretty transparencies:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-D-cSVw_zKEs/TXbt4-Odg4I/AAAAAAAACfQ/o3bvKphgCVI/s1600/Comparison+of+different+genre+age+at+publication.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh3.googleusercontent.com/-D-cSVw_zKEs/TXbt4-Odg4I/AAAAAAAACfQ/o3bvKphgCVI/s1600/Comparison+of+different+genre+age+at+publication.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;That's saying that while they all peak in the forties, the sciences have a tighter structure: fewer young authors &lt;i&gt;and&lt;/i&gt; fewer old authors, while literature (red) has a lot more young guns &lt;i&gt;as well as &lt;/i&gt;a lot more late republications, and correspondingly fewer of its books written by forty-somethings. The catchall 'other' category contains the books I'm the most interested in, for the most part, and they tend to skew older--you can write a novel at 25, but even in the 19th century, evidently, people aren't getting their history books out until their mid-30s.&lt;br /&gt;&lt;br /&gt;But what's most striking is just how similar they are. 
It would be interesting to see if we could find some axes--probably not LC classification--where there were really marked dissimilarities in author age.&lt;br /&gt;&lt;br /&gt;One last point: obviously no one is writing books at 110, but I am getting some of that stuff in the data.&lt;br /&gt;Open Library maintains a separate database entry for works as well as for editions, which would let us lump every publication of Tom Sawyer back into 1880, or count just the first one. In library terms, it's FRBR-ized a bit higher. About 9,000 of the 48,000 books in my corpus are duplicated works republished in a later year. But not everything has a work year, and just as some authors have negative ages, some books are published years before their "first publication."&lt;br /&gt;&lt;br /&gt;Can we use that information to get better data? I'll keep an eye on it, but it makes surprisingly little difference at this level of aggregation. Most books that &lt;i&gt;are&lt;/i&gt; republished, Open Library thinks, are republished only a year or two later. Using that data also introduces some error for reasons not worth getting into. I might, quietly, shift over to that as my primary year information at some point instead of the year of publication data, but for now I'm not convinced that it's good enough to warrant the increased complexity of explaining what it is. Year of publication works well enough. As evidence, here's the same chart as above, but made using author's age at first publication. Where the above chart counts William James as 60 for a 1902 reprint of the &lt;i&gt;Principles of Psychology, &lt;/i&gt;the one below counts that book as having a 48-year-old author since James originally wrote it in 1890. 
But as you can see, it hardly makes a difference: the science peak is even higher, but that's the only big difference I can see:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-1dHwlLRLZLk/TXbzFW3S8II/AAAAAAAACfU/xU8fSgEhFYI/s1600/Comparison+of+different+genre+age+at+composition.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh6.googleusercontent.com/-1dHwlLRLZLk/TXbzFW3S8II/AAAAAAAACfU/xU8fSgEhFYI/s1600/Comparison+of+different+genre+age+at+composition.png" /&gt;&lt;/a&gt;&lt;/div&gt;Anyhow, enough of that. Next up is what author age can tell us about language.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-7928286079098509685?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/7928286079098509685/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/03/author-ages.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7928286079098509685'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/7928286079098509685'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/03/author-ages.html' title='Author Ages'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='https://lh3.googleusercontent.com/-_rXyArjhXAI/TXZudKNgb_I/AAAAAAAACfM/H-sPZ8heg2c/s72-c/Author+Age+Histogram.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-8498437643089686984</id><published>2011-03-02T12:21:00.000-05:00</published><updated>2011-04-11T17:55:08.015-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Building a Corpus'/><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><category scheme='http://www.blogger.com/atom/ns#' term='Featured'/><title type='text'>What historians don't know about database design…</title><content type='html'>&lt;div&gt;I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems--bad OCR, inconsistent metadata, lack of access to original materials--are present to some degree in all our texts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One of the most illuminating things I've learned in &lt;a href="http://sappingattention.blogspot.com/2011/02/technical-notes.html"&gt;trying to build up a fairly large corpus of texts &lt;/a&gt;is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using JSTOR or Google Books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply them to data, and return results. 
We ask for all the books about the Soviet Union written before 1917, and Google &lt;a href="http://www.google.com/search?q=%22Soviet+Union%22&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=eP8ETa60CsP_lgftrbzdCQ&amp;amp;ved=0CBMQpwU&amp;amp;source=lnt&amp;amp;tbs=bks:1,cdr:1,cd_min:,cd_max:12/31/1916"&gt;spits them back&lt;/a&gt;. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Historians who assume computers are boolean machines gloss over an important distinction: there's not a one-to-one connection between the boolean logic on a processing board and the boolean logic we feed in. There's a hard drive that has to be physically moved around, there's a limited amount of computing space that has to be carefully managed, and so on. All those sites have to expend incredible energy structuring the data in a way that lets them return it with any semblance of speed. HathiTrust has a technical but interesting blog about &lt;a href="http://www.hathitrust.org/blogs/large-scale-search"&gt;making their massive archive of OCR'ed texts searchable&lt;/a&gt;. Open Library has &lt;a href="http://blog.openlibrary.org/2011/02/02/search_inside_solr/"&gt;some interesting posts, too&lt;/a&gt;. Google is more reticent about their practices, but you can see them in action on their pages. 
That Google results page for the  Soviet Union before 1917 tells me there are&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: arial,sans-serif; font-size: 11px;"&gt;About 13,500 results&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: arial,sans-serif; font-size: 11px;"&gt;&lt;nobr&gt;&amp;nbsp;(0.53 seconds)&amp;nbsp;&lt;/nobr&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial,sans-serif; font-size: 11px;"&gt;&lt;nobr&gt;&lt;/nobr&gt;&lt;/span&gt;for my search. [I wrote that a couple months ago, and now I get 1,000 more books in a tenth of a second less. Progress!] If I try it  again, that number goes down to 0.14 seconds or so, because Google now  has a cached entry closer to my computer. The key thing to understand is  that Google doesn't actually look through all the text files it has  every time I perform a search. Instead, it passes the request around to  see if it already has an answer; and if not, it uses an 'index' of all  the books to find the ones that have the terms. It doesn't reach the  actual files until the very end of the process, when you click on them to read. For the ngrams viewer, for example, the whole thing is driven entirely by the  flat files you can download from their site, and by associated indexes in whatever database program they're using.&lt;br /&gt;&lt;br /&gt;The whole  business of search relies on creating indexes, caches, and fragments of  databases to get the search numbers lower and lower. Working out  indexing and storage schemes took up far more energy early in this  project than I anticipated, because the speed difference between good  indexing and bad indexing is so great that it makes the kind of research  that cuts against the indexes very difficult. So what's an index? It's like a  concordance, basically--it makes it easy to access just what the  index-maker designed, but not much else. 
A concordance to Ovid is great for understanding his work--but it's considerably less useful if you're interested in usage across the whole of Golden Age Latin poetry, and even less so if you're interested in tracing, say, the mentions of some animal from the Antonines to the Hapsburgs. And if you notice, say, that there are a lot more fish in the &lt;i&gt;Tristia &lt;/i&gt;than the &lt;i&gt;Ars Amatoria, &lt;/i&gt;there's no way at all to find out what else distinguishes the two books from each other save by reading them through.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;A more concrete example: Let's say I wanted to know how many times the word 'evolution' appeared each year from 1830 to 1922. I could have the computer read through each of the books in my catalog and note the years in which it appears, but that would take quite a while--twenty minutes to an hour, say. It's much faster for me to keep database tables in MySQL that tell me all the books that have my word; it's yet faster if I just keep the counts for each word by year in a separate table. So, for different purposes, I store both those things in the database. And to make that run quickly, I have to create separate indexes that let the computer find results quickly. That lets me get a basic wordcounts graph, n-gram style, for any word in well under a second.&lt;br /&gt;&lt;br /&gt;But when I want to start finding out about words that happen in the same &lt;i&gt;sentence &lt;/i&gt;as evolution, rather than the same book, I have to go back to the flat text files again. This means it's hard to ask questions like "&lt;a href="http://sappingattention.blogspot.com/2010/12/age-of-capital.html"&gt;What are the words that frequently appear in the same sentence as capitalism in 1920&lt;/a&gt;?" 
And it's even harder to ask questions &lt;i&gt;about those words, &lt;/i&gt;like&amp;nbsp;"&lt;a href="http://sappingattention.blogspot.com/2011/01/cluster-charts.html"&gt;How can we cluster books that use words related to evolution on the basis of words related to 'society' and words related to 'biology'?&lt;/a&gt;" or "What words appear most disproportionately in the same books as business words?" If I free up some hard drive space, I might try to index at the sentence level as well as the book level, so I can ask some of these questions better. But that comes at the cost of increasing slowness--certain sorts of select queries on my database take minutes, even hours, to run.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So what? Am I just complaining about how slow my funny little research project is? Not entirely. In many ways, these problems are generalizable. That is to say: indexing deeply affects the questions we can ask of our digital libraries. The websites for JSTOR, ProQuest, and so on force you into a certain syntax. You can find articles that use a two-word phrase, and you can find articles that have both of the words, but if you want to find articles that have&amp;nbsp;&lt;i&gt;sentences&lt;/i&gt;&amp;nbsp;that have both the words, you can't. (ProQuest lets you find "within 3", presumably because they have built an index for it. But "within 5", or "on the same page as" would take a whole extra index. That variability is the point--different databases give us different research capabilities.)&lt;br /&gt;&lt;br /&gt;Those limitations aren't just about web design, although that's part of it--they're about the databases underlying the website. Even if you let me into JSTOR headquarters and gave me all the passwords, any queries on their database different from the ones the website allows would take far longer to run. The solution to this would be to build new indexes to help those queries run faster, but that's a substantial investment. 
My relatively constrained indexes take over a day of processing to build. If I correctly understood something Erez Lieberman-Aiden said to me the other day, it's taking the Harvard culturomics team about a month to build indexes for experimental features in ngrams. That's a substantial investment of computing power, and in any given direction you quickly outstrip the capabilities of the index. Any sufficiently interested user can think of a task the index wasn't built for.&lt;br /&gt;&lt;br /&gt;As new algorithms and techniques supplement more basic keyword queries, this problem only gets worse. JSTOR has, for instance, implemented a new &lt;a href="http://www.blogger.com/"&gt;experimental topic browser&lt;/a&gt; using topic modeling. They built new indexes for it, I'm sure. But either because topic models are more complicated than simple keyword searches or because they haven't fully tuned the databases it's running on, results take longer than traditional JSTOR results when you rebalance the terms. And this is with the topics they've generated and indexed: if you wanted to generate your own, or use some combination of word-weightings you gleaned from elsewhere, you'd be out of luck.&lt;br /&gt;&lt;br /&gt;This is important now because we're in an interesting period of transition in search. Computers are fast enough now that we can get two things we didn't have before: keyword search on text archives orders of magnitude larger than we've had before without it necessarily being done through Google (HathiTrust and Internet Archive book search being the most obvious), and search that isn't just done by keyword but by more elaborate models that take many more data points. (Topic modeling being the most promising of these.) Ngrams is a fascinating example of a new way of arranging and searching data. And natural-language processing opens up new horizons as well. 
There are a few really interesting collaborative attempts, like &lt;a href="http://www.cs.berkeley.edu/%7Eaditi/projects/wordseer.html"&gt;wordseer&lt;/a&gt;, to build new search tools more in line with the way that humanists actually work. Historians want larger corpuses of texts more regularly than any other field, I think, so the limitations and possibilities of search will be as important for them as anyone.&lt;br /&gt;&lt;br /&gt;Still, all this infrastructural action is mostly behind the scenes for historians. That's a bad thing. The real scholarly infrastructure is still going up at JSTOR and Google Books and everywhere else with a primary audience of social scientists, or the book-buying public, or some vaguely defined audience that doesn't have the infrastructural needs that historians might. Any small town in the 19th century knew that the most important task it had was to get the railroad tracks running through town. Historians need to figure out just what they can get out of texts that they aren't getting now, and how they can ensure the emerging indexes and database infrastructure lead us to where we want to go. &lt;br /&gt;&lt;br /&gt;Where do we want to go? That's the question, and I'll leave a fuller answer for later. But it's something that historians, not just digital ones, need to think about more. The historical profession is not, to put it mildly, very good at dealing with collective-action issues like infrastructure-shaping. There's little reason to think they could ever get the railroad to run through town. It's possible that I'm simply wrong about this being an issue for historians, and it's really a question of librarianship. But I've read perhaps too much history of the social uses of technology to believe that entirely; the end-users really can shape the way that the technology evolves. 
But we'll need to understand it better first, and in doing so understand how it's a shared challenge.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-8498437643089686984?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/8498437643089686984/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/03/what-historians-dont-know-about.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8498437643089686984'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/8498437643089686984'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/03/what-historians-dont-know-about.html' title='What historians don&apos;t know about database design…'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-6766167849971424380</id><published>2011-02-22T15:38:00.005-05:00</published><updated>2011-04-11T17:54:57.751-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data exploration and visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='pca'/><title type='text'>Genres in Motion</title><content type='html'>Here's an animation of the PCA numbers I've been exploring &lt;a 
href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;this&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;last&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/02/vector-space-overlapping-genres-and.html"&gt;week&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There's quite a bit of data built in here, and just what it means is up for grabs. But it shows some interesting possibilities. As a reminder: at the end of my &lt;a href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;first post on categorizing genres&lt;/a&gt;, I arranged all the genres in the Library of Congress Classification in two-dimensional space using the first two principal components. PCA basically finds the combinations of variables that account for the most variation within a group. (Read more by me &lt;a href="http://sappingattention.blogspot.com/2010/12/second-principals.html"&gt;here&lt;/a&gt; or generally &lt;a href="http://en.wikipedia.org/wiki/Principal_component_analysis"&gt;here&lt;/a&gt;.) The first dimension roughly corresponded to science vs. non-science; the second separated social science from the humanities. It did, I think, a pretty good job at showing which fields were close to each other. But since I do history, I wanted to know: do those relations change? Here's that same data, but arranged to show how those positions shift over time. 
I made this along the same lines as the great &lt;a href="http://www.youtube.com/watch?v=jbkSRLYSojo"&gt;Rosling/Gapminder bubble charts&lt;/a&gt;, created with &lt;a href="http://code.google.com/apis/visualization/interactive_charts.html"&gt;Google's interactive charts&lt;/a&gt; via&amp;nbsp;&lt;a href="http://cran.r-project.org/web/packages/googleVis/index.html"&gt;the googleVis package for R&lt;/a&gt;.&amp;nbsp;To get it started, I'm highlighting psychology.&lt;br /&gt;&lt;iframe height="700px" src="http://www.princeton.edu/%7Ebschmidt/PCAgoog1.htm" width="700px"&gt;&lt;p&gt;You need iframes and javascript or something to display this content&lt;/p&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;[If this doesn't load, you can click through to the file &lt;a href="http://www.princeton.edu/%7Ebschmidt/PCAgoog1.htm"&gt;here&lt;/a&gt;]. What in the world does this mean?&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;That's partly up to you to decide, but here's a rough guide to reading it. [Note: this doesn't seem to work in Google Reader; you may need to head to the web site to see the graphics.] The chart begins in 1863 for psychology. (That's the year at the bottom: it goes back to 1861 for some fields). At that point, psychology is closest to the general philosophy/psychology LC classification "B"; its word-use patterns also align it with education, social pathology, and American history. You can look at the general space to get a sense of what PCA regards as the basic characteristics to separate on. 
The first principal component, for example, runs from south to north.&amp;nbsp;All of the high scorers are science and technology disciplines; chemistry is the most distinctive from other subclasses, and biology the least. In the south are the social sciences and humanities; three religion classes are the least "scientific" works in the sample.&amp;nbsp;(Scare quotes necessary because PCA word-choice differs from semantics in important ways).&amp;nbsp;The second component runs from right to left, and separates out the humanities (literature above all) from the law and the emerging social sciences at the bottom; it also finds a set of words that create a similar split in the north from medicine to manufacturing. I like to think of that axis as being from personal to social in some way, but that's probably an exaggeration. (Although there are nice touches to support it: for example, "Public aspects of Medicine" is the most 'social' of the medical fields, and engineering the most social of the sciences).&lt;br /&gt;&lt;br /&gt;So that's the landscape. Press play to see how it shifts over the period 1861-1922, if you haven't yet. You'll see psychology track steadily&amp;nbsp;[*See important disclaimer about smoothing at the bottom]&amp;nbsp;towards medicine in the upper left until about 1879. (Which is the date of &lt;a href="http://sappingattention.blogspot.com/2011/02/graphing-word-trends-inside-genres.html"&gt;Wundt's lab&lt;/a&gt;, traditionally the origin of scientific psychology—spooky!). The dots are quite small--hover over, and you'll see it only has a few hundred thousand words a year. That means only a couple books&amp;nbsp;annually at first&amp;nbsp;(rule of thumb: 100,000 words = one or two books), and none in a few. Once the field actually gets established, it settles into a relatively constrained area that it occupies until 1922. 
The size of the dots grows, indicating that more books are published in psychology.&lt;br /&gt;&lt;br /&gt;What other fields might be interesting? I like LB, theory and practice of education, which makes a late dash towards the social sciences after crossing paths with psychology. Rather than go through them all myself, I'm just throwing it out there. Google did a great job making the charts fully interactive; you can change any aspect of them: turn on and off trails, highlight other groups, change the color classification scheme to work by LC class instead of my made-up higher classes, etc. Click away; something's bound to happen. I've included a few additional principal components to substitute out for the axes, though I don't pretend to know just what they mean. Click on the axis name, for instance, to switch out the second principal component separating years for the fourth in the psychology graph, and you'll see a metric that finds a logic in the B classification I'm somewhat blind to.&lt;br /&gt;&lt;br /&gt;One thing that I find particularly interesting is the possibility of using a completely different set of criteria besides the PCA weights to chart on. I described earlier a &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;separate PCA analysis&lt;/a&gt; that found axes of change over time: I've put that chart in here, too. 
You can change the options on the above chart to generate it in the window above, but let me give you another example.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;iframe height="650px" src="http://www.princeton.edu/%7Ebschmidt/PCAgoog2.htm" width="650px"&gt;&lt;p&gt;You need iframes and javascript or something to display this content&lt;/p&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;[Original version hosted &lt;a href="http://www.princeton.edu/%7Ebschmidt/PCAgoog2.htm"&gt;here&lt;/a&gt;.]&lt;br /&gt;&lt;br /&gt;The essential point of this chart is that all the classes drift to the right over time. That's because the left is finding language typical of the 1850s, and the right language typical of the 1910s. But within that overall drift, there is variation. Some genres are "ahead of the times," and some behind; some travel against the current. I highlighted three interesting ones here: Chemistry, US Law, and US History. They show different possible paths. Chemistry begins ahead of the pack in 1861--it is an advanced science, it doesn't republish old books much, etc. [See the caveat at the bottom about re-publication.] But as time goes on, its lead erodes. 
After about 1908, the language of chemistry is as modern as it will be before 1920. Engineering fields are using more modern language, physics catches up from well behind, and so forth. A lot of this probably represents percentage of reprinted books, but some of it as well is about what vocabularies have currency.&lt;br /&gt;&lt;br /&gt;The law/history example shows something similar. Law moves back and forth, well behind the times in its vocabulary: while American history moves more steadily forward, with a number of notable pauses--particularly in the 60s/70s it moves backwards, and in the 90s it stays still long enough for law to catch up for a moment. Again, I'm not sure what's driving it—I'd need to build in a bit more processing to see what sort of vocabulary is causing it to take the course it does—but it's suggestive.&lt;br /&gt;&lt;br /&gt;If you want to play around more, &lt;a href="http://www.princeton.edu/%7Ebschmidt/PCAgooglarge.htm"&gt;here is a wider version&lt;/a&gt; of the charts for free clicking that I'll keep up for a little while. These charts are really great for exploration--they let you change the axes, the arbitrary categories I use for colors, they let you look at moving bar graphs of the shifting distributions over time, etc. They really give a nice way to allow a lot more exploration of somewhat complex data sets than flat files allow, in addition to the cool animations.&lt;br /&gt;&lt;br /&gt;I said this is "suggestive." That's a classic weasel word and raises the question: What's the point here? I admit I mostly just wanted to see what this would look like, and I'm going to stop with all the PCA for a bit. This is about as far as I can get on my own in a week while teaching. Still, let me think about what it's good for for a minute.&lt;br /&gt;&lt;br /&gt;The point I want to make is not necessarily that this particular PCA weighting gives us vast new insights, although there might be some that are good for spurring critical thinking. 
(The lack of any separation at all between social sciences/humanities along the axis that parcels off science is quite interesting, for example). Certainly it's not that these particular weightings on a relatively small fraction of books are an end in themselves.&lt;br /&gt;&lt;br /&gt;Rather, it's another demonstration of the sort of objects and movements we can study now that we couldn't even five years ago. Historians often want to write about Big Topics, but feel that they can't do so responsibly. Instead, we hunker down into particular archives, study small movements, etc. We often want to write about concepts, discourses, genres as subjects, but find ourselves forced to write about individuals instead because of the constraints of what we can read.&lt;br /&gt;&lt;br /&gt;There's a route into the big questions by aggregation: we can really make genres and other big groups our subjects. (That the dots breathe in and out certainly helps further the illusion). We get to talk about structures, relations between genres and between discourses while still having a way to keep our feet on the ground. Or maybe to plant our feet on the ceiling for a change--we work down from the aggregates to find the individual cases, instead of vice versa. I think there are better ways to talk about, say, the scientization of psychology than these charts: as Allen's been &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html?showComment=1298168230421#c4861180576103051343"&gt;trying to convince me&lt;/a&gt;, topic modeling is probably one.&lt;br /&gt;&lt;br /&gt;I like this the most, then, as a sort of demonstration of the new perspectives that statistical techniques can offer on our sources. We can analyze in new ways, now, the big networks of terms and words and discourses all the humanists got so excited about in the 80s. 
The structure of ideas and the flow of knowledge on the largest levels are more accessible than they've ever been.&lt;br /&gt;&lt;br /&gt;I had an interesting talk last with Erez Lieberman-Aiden, one of the authors of the culturomics paper in &lt;i&gt;Science &lt;/i&gt;and creator of ngrams, that I'm still mulling over. He thinks one of the big results of data will be in allowing a new sort of research in the humanities that focuses on falsifiability, reproducible claims, and so forth. If that does happen, though, it will at the same time open up the field for much more traditional humanistic research. When we find new angles that take big shifts as facts principally to be &lt;i&gt;explained&lt;/i&gt;, not just &lt;i&gt;unearthed&lt;/i&gt;, we'll find more and more interesting things to write about even within the traditional framing of humanistic research.&lt;br /&gt;&lt;br /&gt;~~~~~~~~~~&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Fine Print: First, the components data is smoothed by a loess regression against year, span = .5, weighted by the number of books in each year. There simply aren't enough books in each year in my 45,000-book sample to use the raw data stand in and get any sense of overall trends outside of the top 3 or 4 classes. &lt;a href="http://www.princeton.edu/%7Ebschmidt/PCAnosmoothing.htm"&gt;Here's what it looks like&lt;/a&gt;: too much popcorn maker. I could do a decade-by-decade plot, but that's just a cruder form of smoothing. I figured I'd go the pretty way and keep the year results, but it's worth noting that the future is somewhat embedded in each year's point. 
The data on number of words in each genre is not smoothed, and it provides an important corrective to assigning too much importance to the beginning or end of a discipline's trail.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;I dropped one subclass: AP--Periodicals. Periodicals aren't in any particular genre, and they aren't supposed to be in the Open Library to begin with. There are a few, though, and they come from very different fields in large numbers at increments of a few decades. The net result is that the dot sweeps around all across the screen, while not contributing to an understanding of genre relations. It's kind of neat to watch, though, and gives a little insight into how the smoothing works in the long stretches it tries to find a point while there is no data. Check it out &lt;a href="http://www.princeton.edu/%7Ebschmidt/crazyAP.html"&gt;if you like&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Finally, it's worth mentioning again that my data includes reprints. Shakespeare, for example, may be an important part of the reason that British literature lags behind American. Open Library data makes it possible to use creation dates, but I'm not sure that data is so reliable that it's worth porting over to it. 
Plus, it says something important about fields if some continue to reprint lots of books from decades earlier while others do not.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-6766167849971424380?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/6766167849971424380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/02/genres-in-motion.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6766167849971424380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/6766167849971424380'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/02/genres-in-motion.html' title='Genres in Motion'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-2357809375972434270</id><published>2011-02-20T12:35:00.000-05:00</published><updated>2011-02-20T12:35:49.343-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='search'/><category scheme='http://www.blogger.com/atom/ns#' term='Digital Humanities'/><category 
scheme='http://www.blogger.com/atom/ns#' term='pca'/><title type='text'>Vector Space, overlapping genres, and the world beyond keyword search</title><content type='html'>I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's &lt;a href="http://litlab.stanford.edu/?page_id=255"&gt;Pamphlet One&lt;/a&gt;, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books&amp;nbsp; in LCC subclasses "BF" (psychology) blue, and use red for "QE" (Geology), overlaying them on a chart of the first two principal components like I've been using for the last &lt;a href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;two&lt;/a&gt; &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html"&gt;posts&lt;/a&gt;:&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-t8ylDh3DGzQ/TWBkAknESzI/AAAAAAAACeg/hzV-GQ6y0Pk/s1600/books+example.png" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt; &lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-SfYeb_JktG8/TWCQnDXvuOI/AAAAAAAACek/-_E1P9HgTu8/s1600/books+example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-SfYeb_JktG8/TWCQnDXvuOI/AAAAAAAACek/-_E1P9HgTu8/s1600/books+example.png" 
/&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-JDEBRej9_cI/TWBgu_4jhBI/AAAAAAAACec/-f3znu04Py0/s1600/books+example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt; &lt;/div&gt;That's a little worse than I was hoping. Generally the books stay close to their term, but there is a lot of variation, and even a little bit of overlap. Can we do better? And what would that mean?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;First, let's review why we care about this at all. Categorizing and assigning texts is a big and vibrant field in information retrieval, and one that's starting to work around the temporal dimension. (Allen Riddell points to some good related sources, including digital history, in the &lt;a href="http://sappingattention.blogspot.com/2011/02/pca-on-years.html?showComment=1298154427756#c4039857298373951566"&gt;comments&lt;/a&gt;). Humanists playing around with R aren't going to make substantive contributions to those algorithms, I'm afraid. What we can do, though, is see how the existing categorization algorithms might help a) provide insight into the categorizations we work with now, and b) help us find connections or documents that you can't simply search for. I'm not interested in tightening those circles for its own sake--I just want to see how we might use this data to learn things about books we have. There's thus some fairly low-hanging fruit (for example, what are the least coherent genres under the LC classification?) 
that I'll let be.&lt;br /&gt;&lt;br /&gt;For this post, therefore, I'm just going to use a pretty simple form of categorization--Euclidean distance across the 10,000 words I'm using this week, using standard deviations from the mean term frequency for the word as the base metric. There are more subtle ways to do this--I'm not using the principal components, for example, that tell me which lexical information is a good predictor of genre ("hypothesis" vs. "premonition", say), and which isn't ("Williams" vs. "Smith", which can swing around wildly from book to book). But this should work as a first pass. &lt;br /&gt;&lt;br /&gt;The vector space model lets us see which books a given book is closest to. Those blue dots above are only showing us closeness in two dimensions: if we include all the information that exists in the other 9,998 dimensions that using 10,000 words gets us, we can find the closest genres for the 350 psychology books in my main database as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;&amp;nbsp;BF&amp;nbsp; AC&amp;nbsp; BJ&amp;nbsp; BL&amp;nbsp; LB&amp;nbsp; BD&amp;nbsp;&amp;nbsp; B&amp;nbsp; HM&amp;nbsp; PN&amp;nbsp; PS&amp;nbsp; BX&amp;nbsp; PZ&amp;nbsp;&amp;nbsp; Q&amp;nbsp; HV&amp;nbsp; PQ&amp;nbsp; PR &lt;br /&gt;212&amp;nbsp; 25&amp;nbsp; 19&amp;nbsp; 17&amp;nbsp; 14&amp;nbsp; 13&amp;nbsp;&amp;nbsp; 9&amp;nbsp;&amp;nbsp; 7&amp;nbsp;&amp;nbsp; 5&amp;nbsp;&amp;nbsp; 5&amp;nbsp;&amp;nbsp; 4&amp;nbsp;&amp;nbsp; 4&amp;nbsp;&amp;nbsp; 4&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp; 2 &lt;br /&gt;&amp;nbsp;QH&amp;nbsp; BS&amp;nbsp; BT&amp;nbsp; BV&amp;nbsp; CT&amp;nbsp; DA&amp;nbsp;&amp;nbsp; E&amp;nbsp; HN&amp;nbsp; HQ&amp;nbsp; PE&amp;nbsp; PT &lt;br /&gt;&amp;nbsp; 2&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 
1&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; 1 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;That is to say, about 2/3 are closer to BF than they are to any other genre. That's a lot better than the above diagram would indicate. The most common errors are to think a work is in AC (serials); the philosophy/religion classes BJ, BL, and BD; and LB, practice of education. Some of the mistakes aren't terrible: education and psychology are very hard to tell apart in this period, for example. And the oddballs are explicable, too. A lot of them seem to be because spiritualism is classed in with psychology. (I could just filter out everything will a call number above 1000 to get rid of them). Spiritualist tracts, depending on their bent, pop up as English history, in various religion classes, and elsewhere. I might well get &lt;i&gt;better&lt;/i&gt; results in analyzing psychology by using a filter to eliminate texts closest to PZ, novels:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; title year&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;1413&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;a href="http://www.archive.org/stream/shadowworld00garluoft#page/n13/mode/2up"&gt;shadow world&lt;/a&gt; 
1908&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;11698&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; After-death communications 1920&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;19354 &lt;a href="http://books.google.com/books?id=pCUPAAAAIAAJ&amp;amp;printsec=frontcover&amp;amp;dq=Past+and+present+with+Mrs.+Piper+1922&amp;amp;source=bl&amp;amp;ots=3B_FE9Lpe2&amp;amp;sig=ruBKItJ21kMOea9dnhDUaFG7_b0&amp;amp;hl=en&amp;amp;ei=0pZgTbuVK8qs8Abhlu2PDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CBkQ6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false"&gt;Past and present with Mrs. Piper&lt;/a&gt; 1922&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;46370&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;a href="http://www.archive.org/stream/rachelcomforted00matugoog#page/n5/mode/2up"&gt;Rachel comforted&lt;/a&gt; 1920&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: inherit;"&gt;The other mistakes are more or less explicable, too. The book mistaken for E (US history) is &lt;i&gt;The study of greatness in men&lt;/i&gt;; the one mistaken for PE, English language, is &lt;i&gt;What handwriting indicates&lt;/i&gt;. There's a definite method to the madness. 
We'd certainly want to have a better algorithm overall, but the LCC classification limits our options. Forget a computer program--no humans not already familiar with the system would be able to determine just what goes into each subclass based purely on its name.&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;And anyway, as I said, our interest probably isn't in classifying: it's in finding interesting things for ourselves. So we can, for instance, find the 5 books in psychology that fit in the least well with the category:&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&amp;nbsp;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; title year&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;1108&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; sibylline oracles, books 3-5, by the Rev. H.N. Bate, M.A. 
1918&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;5695&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Methods and results of testing school children 1920&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;20024 Some remarkable passages in the life of Dr. George de Benneville 1890&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;46272&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; What handwriting indicates 1904&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;46370&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Rachel comforted 
1920&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;Some of these we saw before. Others are just strange--I had no idea that BF 1745-1779 was for "Oracles, Sibyls, and Divinations." Still, a lot of these books are just the kooks and quacks that anyone who's spent much time in the GAPE sees constantly.&lt;br /&gt;&lt;br /&gt;What about the converse? We can see what books most perfectly mirror the overall language of the discipline by finding ones with the smallest distance:&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; title year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;3750&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Psychology 1915&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;20867&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Psychology 1910&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;21558&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Psychology 1892&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
x-small;"&gt;25253&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Psychology 1920&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;27473 The principles of psychology 1890&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-family: Times,&amp;quot;Times New Roman&amp;quot;,serif;"&gt;Why do those all look the same? The five most representative books of the BF class are five separate editions/volumes/revisions of William James's &lt;i&gt;Principles of Psychology. &lt;/i&gt;That's rather nice. It might be a bit much to try to test to see if he's so perfect because he followed the field faithfully or because that text actually set the course of research, but there's something comforting about seeing we all read the book for the right reasons. 
(Joseph LeConte's textbook occupies the analogous place in geology, FTR).&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;More usefully, we can find books that match a given genre: the five psychology books, say, that most closely resemble the education class:&amp;nbsp;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; title&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bookid&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;18009 teacher's handbook of psychology&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; teachershandbook00sull&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;24775&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Mental growth and control. 
mentalgrowthcont00opperich&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;27473&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; The principles of psychology&amp;nbsp; principlespsych12jamegoog&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;27607&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Human conduct&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; humanconducttext00pete&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: x-small;"&gt;31767&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Teaching to think&amp;nbsp; teachingtothink00boragoog&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;Or the five novels written in 1912 that share the profile of HN, Social Reform and Social movements:&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;&lt;span style="font-family: &amp;quot;Courier 
New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; title year&amp;nbsp;&amp;nbsp;&amp;nbsp; authors&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bookid&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;391&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; John Rawn 1912 OL1603793A&amp;nbsp; johnrawnpromine00compgoog&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;10458&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; London Lavender 1912&amp;nbsp; OL113754A&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; londonlavenderen00luca&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;12466 Along the King's highway 1912 OL4472940A alongkingshighwa00shooiala&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;12503&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; White ashes 1912 OL2390372A&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
whiteashes00noblgoog&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;36466&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; King John of Jingalo 1912&amp;nbsp;&amp;nbsp; OL42239A&amp;nbsp; kingjohnjingalo00housgoog&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;46610&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; man's world 1912 OL2126668A&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; amansworld00bullgoog&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: inherit;"&gt;This is starting to get interesting. &lt;i&gt;John Rawn, &lt;/i&gt;for example, carries a &lt;a href="http://www.archive.org/stream/johnrawnpromine00compgoog#page/n14/mode/2up"&gt;dedication&lt;/a&gt; to Woodrow Wilson, "one of the leaders in the third war of independence," and led to a Times article in which its author defended himself against charges of socialism; &lt;i&gt;King John of Jingalo &lt;/i&gt;is a novel of court life; etc. 
&lt;i&gt;White Ashes&lt;/i&gt; is about "the romance of fire insurance"; one &lt;a href="http://books.google.com/books?id=TZsNAQAAIAAJ&amp;amp;pg=PA255&amp;amp;lpg=PA255&amp;amp;dq=Kennedy+noble+white+ashes+review&amp;amp;source=bl&amp;amp;ots=baeKZ2U4-g&amp;amp;sig=9MEgj3-gNqk806r3Dmne9qpO62o&amp;amp;hl=en&amp;amp;ei=Q0FhTbLjDIP98Aaku-ycDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=6&amp;amp;ved=0CEAQ6AEwBQ#v=onepage&amp;amp;q=white%20ashes&amp;amp;f=false"&gt;review&lt;/a&gt; says "Big business interests enter into the plot; and while characters subordinate to theme, there is sufficient love interest to relieve the strain of insurance talk and transactions." Searches like these are a good way of turning up cultural ephemera that are typical of longer-term trends but might have been lost in the meantime.&lt;br /&gt;&lt;br /&gt;If Ted Underwood is right that search is &lt;a href="http://tedunderwood.wordpress.com/2011/02/06/why-search-was-the-killer-app-in-text-mining-and-what-we-might-learn-from-it/"&gt;the killer app of text mining&lt;/a&gt;, there's still an enormous amount of work that could be done to allow topical, rather than keyword, search. Historians love to and need to find books that cross the same boundaries as their unique research topics. At least at Princeton, that seems to be happening more and more. (Probably because of Google Books, in part.) This doesn't need to be limited to genre, incidentally--we could find the book published in Pittsburgh that most resembles books using "evolution" and "society" a lot, or list the books of Henry James by how closely they mirror his brother's overall style. If this were easy to do, historians would find it an incredibly useful tool for filling in gaps in their research. Keyword search is only one kind of search, but it's the only one the vast majority of historians have access to.&lt;br /&gt;&lt;br /&gt;So why can't you go on Amazon or Google and do these searches? 
Two basic reasons, I think:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Search engines don't work that way, and given the way book indexes are constructed right now, it would take massive computing power to do these searches. (I'm getting closer to posting my long-delayed thoughts on indexing, I promise…). Keyword search is useful across all kinds of fields: genre-cluster search has a much tinier audience.&lt;/li&gt;&lt;li&gt;Just as important: the interface would be clumsy, and that's not how Cloud Computing likes things to look. From the &lt;a href="http://chroniclingamerica.loc.gov/beta/"&gt;Chronicling America&lt;/a&gt; newspapers project to Google Books, we seem to have come to an agreement that the way for scholars to interact with databases is through opaque web forms. That makes the initial stages of contact much easier, but offers nowhere near the possibilities of SQL queries. Nor does it let you interact with other data sources. If my Zotero library were slightly better tagged, for instance, I would be able to find books in my 15,000-book psychology/education dataset that resembled the various portions of it or the library as a whole. But I can't do that on JSTOR articles or ProQuest newspapers, because aside from archive.org, basically all of our book data is trapped inside the &lt;a href="http://sappingattention.blogspot.com/.../digital-history-and-copyright-black.html"&gt;copyright black hole&lt;/a&gt; because of metadata, scanning rights, or pure indifference.&lt;/li&gt;&lt;/ol&gt;I want to do one more post of math before I get political again, but briefly: the fundamental problem is that while historians recognize the social construction of technology in historical actors, they are largely missing the opportunity to shape the wave of technological change that's sweeping across the discipline. The new wave of digitization has been coming online for a few years, but it looks on the surface quite a bit like the resources we got in the 1990s. 
(Except that now it comes with 'visualizations', which I'm as guilty of fetishizing as anyone else; fine, more guilty.) It would be tricky, but there's still some room for a different attitude toward how digital libraries can offer something better than digital card catalogs. I'm worried, though, that some of our leading figures aren't necessarily embracing &lt;i&gt;all&lt;/i&gt; the possibilities to make them better.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8929346053949579231-2357809375972434270?l=sappingattention.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sappingattention.blogspot.com/feeds/2357809375972434270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://sappingattention.blogspot.com/2011/02/vector-space-overlapping-genres-and.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2357809375972434270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8929346053949579231/posts/default/2357809375972434270'/><link rel='alternate' type='text/html' href='http://sappingattention.blogspot.com/2011/02/vector-space-overlapping-genres-and.html' title='Vector Space, overlapping genres, and the world beyond keyword search'/><author><name>Ben Schmidt</name><uri>https://profiles.google.com/108075792286211090044</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh4.googleusercontent.com/-B9WttXjJttQ/AAAAAAAAAAI/AAAAAAAACoY/l4aPQdZJxlw/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-SfYeb_JktG8/TWCQnDXvuOI/AAAAAAAACek/-_E1P9HgTu8/s72-c/books+example.png' height='72' 
width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8929346053949579231.post-79526590332057070</id><published>2011-02-17T18:07:00.001-05:00</published><updated>2011-02-21T10:57:59.910-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pca'/><title type='text'>PCA on years</title><content type='html'>I used &lt;a href="http://sappingattention.blogspot.com/2010/12/second-principals.html"&gt;principal components analysis&lt;/a&gt; at the end of my &lt;a href="http://sappingattention.blogspot.com/2011/02/fresh-set-of-eyes.html"&gt;last post&lt;/a&gt; to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-SBuRz-q0Euw/TV2fapY08zI/AAAAAAAACeQ/nvyx9WK_Qt8/s1600/LC+subclasses+in+pca+space.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-SBuRz-q0Euw/TV2fapY08zI/AAAAAAAACeQ/nvyx9WK_Qt8/s1600/LC+subclasses+in+pca+space.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;I have a professional interest in shifts in genres. But this isn't temporal--it's just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Well, the first thing to remember is that, just as a genre can be a single text for the purposes of pca (or any other analysis), so can a year. 
We can drop onto that same chart all of the years from 1822 to 1922, say, shading from yellow for the early years to red for the later ones:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-1P5tmFTBRkA/TV2fzZ8IiTI/AAAAAAAACeU/vpP6bZNzJFY/s1600/LC+subclasses+in+pca+space+with+years.png" imageanchor="1" style="margin-left: 1em; margin-rig
