Tuesday, November 3, 2015

Word embedding models

A heads-up for those with this blog on their RSS feeds: I've just posted a couple things of potential interest on one of the two other blogs (errm) I'm running on my own site.

One, "Vector Space Models for the digital humanities," describes how a newly improved class of algorithms known as word embedding models works and showcases some of their potential applications for digital humanities researchers.

The other, "Rejecting the gender binary," is a more substantive look at how the method can help us better imagine a version of English without gendered language through some tricks of linear algebra. The result is a sort of translation dictionary between the way students talk about men and the way they talk about women.
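For anyone who wants a feel for what such a trick can look like, here is a minimal sketch of vector rejection, which is one linear algebra operation the post's title suggests. The three-dimensional vectors and the function names are made up for illustration; real embeddings have hundreds of dimensions, and whether the linked post does exactly this is an assumption on my part.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def reject(v, direction):
    """Return the part of v that is orthogonal to direction (vector rejection)."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(v, direction)]


# Hypothetical toy "embeddings"; the numbers are invented, not from any trained model.
vectors = {
    "he": [1.0, 0.2, 0.1],
    "she": [-1.0, 0.2, 0.1],
    "funny": [0.4, 0.9, 0.3],
}

# Assumption: the difference between two gendered words stands in for a "gender direction".
gender_direction = [a - b for a, b in zip(vectors["he"], vectors["she"])]

# Strip the gendered component out of another word's vector.
neutral_funny = reject(vectors["funny"], gender_direction)
```

After the rejection, `neutral_funny` has no component along `gender_direction`, so comparisons against it no longer lean toward either "he" or "she".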

I'm aware that this blog is sort of withering on the vine right now. I like the politics of not using Google, and the ability to embed real JavaScript that comes with not using Blogger. Perhaps the humane thing to do would be to retire this site and direct you to http://benschmidt.org/posts/ and http://bookworm.benschmidt.org instead. But I like keeping it around, and will probably come back here next time I have something to say about, say, the hilariously inadequate college rankings the Economist just published, or just to link other stuff.

Monday, January 19, 2015

State of the Union--and corpus comparison.

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there, including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz, who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

The first part is an interactive map with every place name we could find using Stanford's natural language processing tools and some (Fraas-flavored) elbow grease. Then we got two great historians of American foreign policy, Dael Norwood and Gretchen Heefner, to explain some of the things in the maps.

The second is about the individual words presidents use: the recent rise in "freedom," the references to the Constitution predominantly in times of crisis, and so forth.

My favorite feature, and one that the Atlantic team executed beautifully, is the deep access into individual texts: click on a circle or a bar, and you are off reading the actual paragraph from the State of the Union that uses that word or mentions that place. This has always been a core feature of Bookworm on various levels--by treating paragraphs as documents for the modelling, it's easy to drill straight to the interesting stuff. One thing that's mostly missing is the Ngrams-style line chart. I've been saying for a while that I hope people see Bookworm enabling other forms of visualization. These pages are a great example of that; maps and bar charts of words are just as engaging, and sometimes things like "presidents" and "the world" are more engaging than individual years.
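To make the paragraphs-as-documents idea concrete, here is a rough sketch. It is not Bookworm's actual code: the function names, the metadata fields, and the sample text are all invented for illustration, and it assumes paragraphs are separated by blank lines.

```python
import re


def paragraphs_as_documents(speech_text, metadata):
    """Split one speech into paragraph-level documents that carry the speech's metadata."""
    docs = []
    for para in speech_text.split("\n\n"):
        para = para.strip()
        if para:
            docs.append({**metadata, "paragraph": len(docs), "text": para})
    return docs


def find_paragraphs(docs, word):
    """Return every paragraph-document that contains the given word, ignoring case."""
    target = word.lower()
    return [d for d in docs if target in re.findall(r"[a-z']+", d["text"].lower())]


# Placeholder text and metadata, not taken from any real address.
sample_speech = (
    "We meet in a time of crisis.\n\n"
    "Freedom is not free.\n\n"
    "The roads need repair."
)
docs = paragraphs_as_documents(sample_speech, {"year": 1900, "president": "Example"})
hits = find_paragraphs(docs, "freedom")
```

Because each hit is already a single paragraph with its year and speaker attached, a click on a bar or a circle can go straight to readable text instead of dumping the reader at the top of a whole speech.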

So go check those out. They speak for themselves.

But for the text analysis crowd, I also wanted to tell you a little more about the link right down at the bottom of the second (words) piece, and get a little technical about why it leads to something that, although we decided not to include it on the Atlantic site, contains the germ of something I find pretty interesting for online text analysis in general.