Sapping Attention

What's in the Hathi Trust?

2022-02-03T16:53:00.000-05:00

(This is a post I've had unpublished since writing it in 2016. Just hitting publish without reviewing right now because it's something I find myself periodically looking at the charts for).

As we get ready to launch the full Hathi Trust+Bookworm to allow tracking words across 13 million books, I've been working on fixing up the metadata from the original MARC records.

This is useful information to have for anyone using Hathi to find books; it's hard to know the general outlines of a collection like this. So what follows are some general outlines about what books are included in the Hathi Trust. This is closely related, by the way, to what books are included in Google Books; more on that below.

One hugely important question is where the books come from. Different libraries had different scanning agreements with Google, and collect differently. Medical terms show a sharp drop-off in Google Ngrams after 1922 not because they were used less in writing, but because Harvard's excellent medical library wasn't scanned by Google for post-1922 books. Hathi has some of the same issues in its scope; Harvard (dark blue) disappears after 1922 along with the New York Public, and the collection becomes largely Michigan (light orange) and California (light blue). A huge spike of misdated items from the year 1900 can be blamed mostly on California. The sharp contraction of size in the corpus after 1922 can partially be blamed on the libraries that disappear, but California too contributes fewer books. It will take some work to create a sample that is relatively consistent across the copyright line.

Books by originating library (top 10) by year.

Subject domains

Many of these libraries use the Library of Congress classification for their books: we've adopted it as the default subject classification for Bookworm because nothing else (subject headings, for example) is nearly so highly populated.

I was surprised to see how heavily represented two classifications are in the post-1920 period. DS, dark blue, is the history of Asia; PL, dark purple, is Asian Literature. After 1960 or so, both of these classes are larger than their American counterparts (E/F and PS). "Asia" is a much larger and more populated area; still, these indicate that the post-1970 Hathi collection has a less myopic view of the world than I might have expected from purely American universities.

Book Scanners

Less surprising is the universe of scanners. One of the reasons to understand Hathi is that it's as close as we can get, in many ways, to knowing what's in Google Books. The big difference is that Google includes many libraries not in Hathi (such as Oxford) and Hathi includes some books not in Google Books. As the visualization below makes clear, though, the preponderance of books were scanned by Google, and other organizations make a numerically significant contribution only before 1922: the Internet Archive at a variety of libraries (green) and Microsoft at Cornell (red). (Note that I'm limiting the time scale here to just post-1815, not 1750.)

Languages

One major question about the default search under the new format is whether we'll restrict it to just English or make it cross language. As the LC classes indicated, there are two different worlds before and after copyright; pre-1922 is basically English (green), French (light blue), and German (dark blue), while post-1922 begins to bring in significant numbers of texts in Japanese and Chinese (dark and light orange) and Russian. Chinese in particular is a pretty substantial corpus; I'm curious to see if the tokenizers worked well enough to make this a useful tool.

Relative language usage tells the same story from a different angle: here the y axis is *percent* of the corpus, not absolute number of texts. (It narrows slightly after 1960 because language diversity increases). This makes clear that the corpus becomes more English dominated over time, and better acknowledges French and Latin as significant languages early in the corpus.

How badly is Google Books search broken, and why?

2019-02-10T17:40:00.001-05:00

I periodically write about Google Books here, so I thought I'd point out something that I've noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken.

Here's an example. While writing something, I became interested in the etymology of the phrase 'set in stone.' Online essays seem to generally give the phrase an absurd antiquity--they talk about Hammurabi and Moses, as if it had been translated from language to language for decades. I thought that it must be more recent--possibly dating from printers working with lithography in the 19th century.

So I put it into Google Ngrams. As it often is, the results were quite surprising; about 8,700 total uses in about 8,000 different books before 2002, the majority of which are after 1985. Hammurabi is out, but lithography doesn't look like a likely origin for widespread popularity either.

That's much more modern than I would have thought--this was not a pat phrase until the 1990s. That's interesting, so I turned to Google Books to find the results. Of those 8,000 books published before 2002, how many show up in the Google Books search result with a date filter before 2002?

Just five. Two books that have "set in stone" in their titles (and thus wouldn't need a working full-text index), one book from 2001, and two volumes of the Congressional record. 99.95% of the books that should be returned in this search--many of which, in my experience, were generally returned four years ago or so--have vanished.

Many of these books *do* still exist in the HathiTrust index.

Changing the date does not produce the results you'd expect, either. "Set in stone" with a date filter set before 1990 returns *nothing* except a single non-book result, a 1982 Washington Post article that has wandered into the Google index. This is especially interesting, because it means that the date displayed for the two Congressional Record volumes, 1900, is *not* being used for the purpose of retrieval. This is probably wise: books listed as being published in 1900 in the library catalogs feeding into Google can be from any time. Choosing a date before 2020 (which should return all books) adds only a few books to the 2002 listing.

When you search for the term with no date restrictions, Google claims to be returning 100,000-ish results. I have no way of assessing if this is true; but scrolling through results, they do include a few pre-1990 books that didn't show up in the earlier searches.

What's going on? I don't know. I guess I blame the lawyers: I suspect that the reasons have to do with the way the Google Books project has become a sort of Herculaneum-on-the-Web, frozen in time at the moment that anti-Books lawsuits erupted in earnest 11 years ago. The site is still littered with pre-2012 branding and icons, and the still-live "project history" page ends with the words "stay tuned..." after describing their annual activity for 2007.

So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year they have legal certainty a book is released. So maybe those volumes of the congressional record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn't feel confident in library metadata for most of its other books, and doesn't want searchers using date filters to find improperly released books.

Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on "rarely used word"; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.
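The two disappearance rates are simple enough to check by hand. Here is a minimal sketch in Python that uses only the counts quoted above; the function name is mine, for illustration, and the Ngrams totals are approximate to begin with.

```python
def disappearance_rate(expected: int, found: int) -> float:
    """Percentage of the expected books that are missing from the search results."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * (expected - found) / expected


# "set in stone": about 8,000 books in Ngrams before 2002, 5 in Books search.
set_in_stone = disappearance_rate(8000, 5)

# "rarely used word": 192 uses in Ngrams before 2002, 22 in Books search.
rarely_used = disappearance_rate(192, 22)

print(f"set in stone: {set_in_stone:.2f}% missing")
print(f"rarely used word: {rarely_used:.1f}% missing")
```

Since the inputs are themselves round numbers, the results should be read as ballpark figures rather than precise measurements.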

So we can't even know how bad the uncertainty is. One intriguing possibility is that the searches I'm using are themselves caught up in the algorithms used to classify books. If I worked at Google, I would have implemented a text-based date-prediction algorithm to flag erroneously classified books. (I have actually done this and sent a list to the HathiTrust of books they may have erroneously released into the public domain. It works). If they use trigrams, it's possible that a term like "set in stone," because of its recency, might *itself* be pushing a bunch of 20th century books into the realm of uncertainty.
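To make the idea concrete, here is a toy sketch of how a trigram-based date check could flag a suspicious catalog date. It is not Google's method (which I can't see), and it is much cruder than a real classifier: it assumes a hypothetical lookup table, `first_seen`, giving the earliest year each trigram is plausibly attested, and it flags any book whose catalog year is earlier than the latest such year found in its text. The year attached to "set in stone" below is a made-up placeholder, not a measured value.

```python
import re
from typing import Dict, List, Optional


def trigrams(text: str) -> List[str]:
    """Lowercase the text, strip punctuation, and return its word trigrams."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def earliest_plausible_year(text: str, first_seen: Dict[str, int]) -> Optional[int]:
    """A book can't predate the newest phrase it contains.

    Returns the latest first-seen year among known trigrams, or None
    if the text contains no trigram from the table.
    """
    years = [first_seen[t] for t in trigrams(text) if t in first_seen]
    return max(years) if years else None


def is_suspect(text: str, catalog_year: int, first_seen: Dict[str, int]) -> bool:
    """Flag a book whose catalog date is earlier than its text allows."""
    bound = earliest_plausible_year(text, first_seen)
    return bound is not None and bound > catalog_year


# Hypothetical table; the year here is illustrative only.
demo_first_seen = {"set in stone": 1985}
print(is_suspect("Nothing here is set in stone, he said.", 1900, demo_first_seen))
```

Under a rule like this, a single recent phrase is enough to cast doubt on a catalog date, which is exactly how a search term could end up entangled with the classifier.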

Partly this is the story that we all know: Google Books has failed to live up to its promise as the company has moved away from its original mission of organizing information for people. But the particular ways that it has actually eroded, including this one, are worth documenting, because it's easy to think that search tools that worked perfectly well a few years ago won't have been consciously degraded.

Some preliminary analysis of the Texas salary-by-major data.

2018-08-23T12:57:00.001-04:00

I did a slightly deeper dive into data about the salaries by college majors while working on my new Atlantic article on the humanities crisis. As I say there, the quality of data about salaries by college major has improved dramatically in the last 8 years. I linked to others' analysis of the ACS data rather than run my own, but I did some preliminary exploration of salary stuff that may be useful to see.

That all this salary data exists is, in certain ways, a bad thing--it reflects the ongoing drive to view college majors purely through return on income, without even a halfhearted attempt to make the results valid. (Randomly assign students into college majors and look at their incomes, and we'd be talking; but it's flabbergasting that anyone thinks that business majors often make more than English majors because their education prepared them to, rather than that the people who major in business, you know, care more about money than English majors do.)

Anyway, in addition to the ACS data I worked with there, I also took a look at a newer form of information that's just coming online now. The University of Texas is pioneering a new system that will report earnings not just by major, but by school. It seems like the Census Bureau and the NCES may expand this kind of system to other schools.

This begins to solve, on the surface, a major problem with earnings statistics, which is that majors aren't offered at the same spectrum of institutions. Art history majors might actually be doing pretty well on the job market; but that might be because only rich kids at elite schools major in art history. Engineering majors make more money than anyone else; but they also come from wealthier families than anyone else, and may graduate from wealthier schools.

It also creates an enormous new problem that I want to underline before I show the charts, by not taking student characteristics into account. I've written before about how completely pernicious major earnings statistics that fail to take gender gaps into account are: since gender is a more important determinant of earnings than college major except in extreme cases, a list of majors by earnings often implicitly ranks majors by how male they are. Philosophy might end up having the highest salaries of any humanities field, say, because it's entirely dominated by men; but that doesn't mean male philosophy majors make more than male history majors.

This issue is addressed in the IHE article I link above with reference to race (not sex):

Troutman said one black male student told him he didn’t want to know what he would earn -- he wanted to know what his white peer would make. So the idea was scrapped.

Which is a fine point, in certain ways: to release the earnings of white men by college major instead of everyone would be interesting, if obviously crazy. But that isn't the data they're releasing; by mish-mashing demographic characteristics, they create crazy incentive structures. If departments at state schools are going to be targeted for contraction based on their income numbers, there is now a clear route to preserving them: discriminate against women to force them out of your major, since they're more likely to experience pay discrimination, leave the workforce, work part time, or accept lower-paying jobs with more flexibility.

Having said that, the data is interesting and more fine grained than other salary data. To explore it, I look not at the actual earnings, but at the earnings deviation for each batch of students based on a linear model. For example: graduates of UT-Austin typically make about $10,000 more than graduates of UT-El Paso; graduates ten years out typically make $25,000 more than graduates one year out. So if you're comparing the salaries of ten-year-out Austin sociologists to one-year-out El Paso engineers, you should start with the baseline that the sociologists will make $35,000 more. If it's less, that's a point in the engineers' favor; if it's more, score one for sociology.
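The baseline arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the additive logic, not the fitted model itself: the two effect sizes are the round numbers quoted above, the function names are mine, and the observed gap at the end is a made-up value for demonstration.

```python
from typing import Tuple

# Additive effects taken from the example in the text (in dollars),
# measured relative to UT-El Paso and to one year after graduation.
SCHOOL_EFFECT = {"UT-El Paso": 0, "UT-Austin": 10_000}
YEARS_OUT_EFFECT = {1: 0, 10: 25_000}

Cohort = Tuple[str, int]  # (school, years since graduation)


def expected_gap(a: Cohort, b: Cohort) -> int:
    """Expected earnings of cohort a minus cohort b, ignoring major."""
    school_a, years_a = a
    school_b, years_b = b
    baseline_a = SCHOOL_EFFECT[school_a] + YEARS_OUT_EFFECT[years_a]
    baseline_b = SCHOOL_EFFECT[school_b] + YEARS_OUT_EFFECT[years_b]
    return baseline_a - baseline_b


def deviation(observed_gap: int, a: Cohort, b: Cohort) -> int:
    """How much of the observed gap is left after removing the baseline."""
    return observed_gap - expected_gap(a, b)


sociologists = ("UT-Austin", 10)
engineers = ("UT-El Paso", 1)

print(expected_gap(sociologists, engineers))       # baseline from the text
print(deviation(30_000, sociologists, engineers))  # hypothetical observed gap
```

A negative deviation means the sociologists earn less than school and experience alone would predict, which is the "point in the engineers' favor" case.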

The following chart shows every major in the UT system by its deviation. A score of 1 means majors make more than you'd expect; a score of 2 means twice as much. The boxplot indicates the range of results for cohorts: one point is (eg.) the 25th-percentile income for UT-Brownsville economics majors of the class of 2003 five years out. Interpretation below.

1. Engineers are doing extremely well. This is probably good, because we need the job market for engineers to be artificially restricted; much like doctors, a bad engineer can kill you, and so we can't have mediocre professionals wandering around. Median income doesn't answer one interesting question; what happens to people with engineering degrees who don't become accredited engineers. (I really know very little about the engineering profession.)

2. Certain features appear Texas-specific. Becoming a petroleum engineer in Texas in the early 2000s is a good idea; it also may be a ship that has sailed. Geology majors' high incomes are probably related to the oil industry.

3. Business majors look better on this chart than on some other earnings indicators I've seen: marketing (say) often looks like a poor career choice.

4. Economics has a high set of incomes: but other than that, there isn't much of a difference between humanities and social sciences. The highest-income humanists seem to come from history and English--these are also the fields that have seen the biggest drops. East Asian studies is also doing quite well.

5. STEM is not a block, and we should not be talking about "STEM majors" as if they all do equally well. Biology-affiliated fields, especially other than general biology, all do quite poorly. (By the way: it's actually not that common for bio majors to go to med school. Out of 150,000 bio majors, 10,000 go to med school each year). Zoology and ecology have earnings comparable to the lowest-paid majors overall. If you lump *any* major in with the engineers, they'll collectively have high incomes; but so what?

6. The error bars here are fairly wide for majors offered at multiple schools. History probably does make more than psychology; but the range on both is wide. Some philosophy cohorts make more than the average. Etc.

Connection between income and change in major numbers.

One other chart is possible with this data that answers a lot of questions I've had; what's the relationship between the earnings multiplier and the change in major numbers? In other words, are students actually deserting the humanities?

Mea culpa: there is a crisis in the humanities

2018-07-27T14:54:00.000-04:00

NOTE 8/23: I've written a more thoughtful version of this argument for the Atlantic. They're not the same, but if you only read one piece, you should read that one.

Back in 2013, I wrote a few blog posts arguing that the media was hyperventilating about a "crisis" in the humanities, when, in fact, the long term trends were not especially alarming. I made two claims then: 1. The biggest drop in humanities degrees relative to other degrees in the last 50 years happened between 1970 and 1985, and degrees were steady from 1985 to 2011; as a proportion of the population, humanities majors exploded. 2. The entirety of the long term decline from 1950 to 2010 had to do with the changing majors of women, while men's humanities interest did not change.

I drew two inferences from this. The first was: don't panic, because the long-term state of the humanities is fairly stable. Second: since degrees were steady between 1985 and 2005, it's extremely unlikely that changes in those years are responsible for driving students away. So stop complaining about "postmodernism," or African-American studies: the consolidation of those fields actually coincided with a long period of stability.

I stand by the second point. The first, though, can change with new information. I've been watching the data for the last five years to see whether things really are especially catastrophic for humanities majors. I tried to hedge my bets at the time:

It seems totally possible to me that the OECD-wide employment crisis for 20-somethings has caused a drop in humanities degrees. But it's also very hard to prove: degrees take four years, and the numbers aren't yet out for the students that entered college after 2008.

But I may not have hedged it enough. The last five years have been brutal for almost every major in the humanities--it's no longer reasonable to speculate that we are fluctuating around a long term average. So at this point, I want to explain why I am now much more pessimistic about the state of humanities majors than I was five years ago. I'll show a few charts, but here's the one that most inflects my thinking.

One thing I learned in a humanities field is that people with strong opinions are always eager for a crisis because it gives a chance to trot out solutions they came up with years earlier. Little-c conservatives tend to argue that the humanities must return to some set of past practices: teaching Great Works or military history. As I said above, these arguments tend to be misguided--they often assume laughable propositions (yes, colleges still teach Shakespeare), and they don't match the contours of the humanities' decline. Practicing academics tend to write pieces arguing that some pedagogical tactic they've found to work (flipped classrooms! joint majors with CS! community integration! non-traditional assignments!) needs to be more widely adopted. Big-picture thinkers argue that we need to "make the case" for the skills taught in humanities fields, since a society where citizens lack empathy (which you get from reading novels) or a sense of their history (which is often generalized into a pan-humanities virtue) or an ability to realize a figured bass at the keyboard (which is what I spent my humanities credits on) is an impoverished and endangered one.

I don't have a solution to peddle. But the drop in majors since 2008 has been so intense that I now think there is, in the only meaningful sense of the word, a crisis. That is: we are in a moment of rapid change in which decisions are especially important, and will have continuing ramifications. If you still have the same opinions you did in 2010 or 2013, it's worth reassessing the situation. Unless current trends reverse rapidly and for several years, humanities education in the 2020s will have to be different than it was in the 2000s.

Here are the general points.

1. No matter what baseline you use, virtually every humanities major from big, old ones like English to ~~small, newer ones like gender studies~~ went into significant decline around the time of the 2008 financial crisis. (UPDATE: actually, the small fraction of the humanities classed as "cultural, gender and ethnic studies" is one of the few fields *not* to shrink. I was confusing cultural studies with area studies, which did shrink. This is a fairly significant mistake on my part. Sorry.)

2. Rather than recover with the economy, that decline accelerated around 2011-2012. That period constitutes an inflection point for a variety of majors in and out of the humanities. Though the decline may have slowed a bit in the last few years, there's little sign of a turnaround in the new post-2011 universe.

3. More humanistic social sciences like sociology or political science are also caught in the undertow: the big winners are mostly concentrated in the STEM fields.

4. These trends are so widespread across institutions that they may reflect student preferences formed before they see a college classroom.

The rest of this will spell out some of the data based on preliminary IPEDS releases from the Department of Education. In all cases I'm working directly from the raw data. The code I use for parsing is here. Particularly for 2017 statistics, there have not yet been any government publications giving aggregates. I generally use the American Academy of Arts and Sciences taxonomy of the humanities with one major exception: I don't consider communications to be a humanities discipline. Sometimes I see humanists suggesting that things must be better in the areas where we don't have measurements, often with reference to local anecdotes.

Within colleges and universities, the humanities are testing new lows

The fundamental reason I've changed my mind has to do with the two following charts. In 2013, I posted the following chart of long term trends since 1948. It showed that the long-term trends were steady. There's a small sign of a drop off after 2008: but because I mostly formed my ideas about these trends in 2005, I explained it away as unimportant.

Now look at the last 6 years of data. The slope from 2011 has become a cliff. Where in the 2000s about 7.5% of students had one of the big four humanities majors, now less than 5% do.

The big picture, I think, is: after the boom and bust of the 60s and 70s, the humanities entered a long period of stability from about 1990 to 2010 or so. That period has ended, and now we're entering a new one in which levels will be very different. We'll obviously stabilize somewhere, probably in the next few years, and maybe we'll rebound a bit, but I'd be very surprised if humanities numbers in five years were even 2/3 what they were in 2005.

There are, of course, other majors than these four. Here are all of the fields classed as "humanities" by the American Academy of Arts and Sciences. (Thanks to Rob Townsend for sending their list my way. It's worth noting here that I developed my acquaintance with this dataset working on the first version of the humanities indicators at the Academy in 2005.) According to their taxonomy, every field but humanistic subfields of communications and linguistics (the two fields with the weakest claims to actually being "humanities") has seen substantial drops in share since 2009. EDIT: Not sure why I missed this, but I just noticed that "cultural, ethnic, and gender studies" are also stable. This is a very small percentage of the humanities (all put together, maybe a tenth of English), but an important caveat nonetheless.

You can also see in these charts how recent the shift is. Although English has been in steady decline for an extraordinarily long time (practically unchecked since the mid-1990s), fields like history, philosophy, and classics were booming before 2008.

The decline is not strongly related to the expansion of higher ed.

"Share" can be a funny way to think about colleges, though, since the field of higher education itself is constantly expanding.

So here's the most optimistic take I can give. It shows the 70-year story, which is full of radical ups and downs. The y-axis gives the number of degrees per one thousand 23-year-olds in the country. Since the 2011 results, the humanities have fallen from 37.1 degrees per thousand 23-year-olds to 28.0 (about a 25% drop); the big four, for which I have data back to the 1940s, have fallen from 29.8 to 21.7 per thousand (a 27% drop). This exceeds all but one previous fall in humanities majors: the drop in the 1970s.
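For anyone who wants to check those percentages, the arithmetic is below. This is a minimal sketch using only the per-thousand figures quoted above; the function names are mine, and `degrees_per_thousand` simply spells out how the y-axis is defined, without any real population data attached.

```python
def degrees_per_thousand(degrees: float, cohort_size: float) -> float:
    """Degrees awarded per one thousand people in the reference age cohort."""
    return 1000.0 * degrees / cohort_size


def percent_drop(before: float, after: float) -> float:
    """Decline from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


# Per-thousand figures quoted in the text.
all_humanities = percent_drop(37.1, 28.0)
big_four = percent_drop(29.8, 21.7)

print(f"all humanities: {all_humanities:.1f}% drop")
print(f"big four: {big_four:.1f}% drop")
```

Both values land within a point of the rounded figures in the paragraph above.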

That 1970s drop, I argued, coincided with the opening of professional fields to women and the quick deflation of the big boomer bubble of the late 1960s, in which first-generation college students seem to have piled into humanities majors at schools across the country. The question is: why the drop since 2008?

Even looking at raw numbers, the past decade shows a sharp decline.

Even by raw numbers--which is not an especially useful baseline in a growing country--you can see a massive shift since 2011. Here I've pegged each major to its maximum year: that was 2009 for English, 2010 for history, and 2011/2012 for the other fields. This makes the shape of the changes abruptly clear--the raw number of English and History majors is down more than a quarter from these extremely recent peaks. On the other hand, all fields except the two largest are at least absolutely higher than in 2000.
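Pegging each major to its maximum year just means dividing every year's count by the count in the peak year. A minimal sketch follows; the function name is mine and the counts in the example are invented placeholders purely to show the shape of the calculation, not IPEDS figures.

```python
from typing import Dict, Tuple


def index_to_peak(counts: Dict[int, float]) -> Tuple[int, Dict[int, float]]:
    """Return the peak year and each year's count as a fraction of the peak."""
    if not counts:
        raise ValueError("counts must not be empty")
    peak_year = max(counts, key=counts.get)
    peak = counts[peak_year]
    return peak_year, {year: n / peak for year, n in counts.items()}


# Invented numbers for illustration only (not real degree counts).
toy_counts = {2000: 80, 2009: 100, 2017: 70}
peak_year, indexed = index_to_peak(toy_counts)
print(peak_year, indexed)
```

On this scale every major tops out at 1.0 in its own best year, so lines for fields of very different sizes can share one chart.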

It's not just humanities fields

I occasionally worry that talk of the "humanities crisis" could be a self-fulfilling prophecy. Fortunately or unfortunately, though, the shift seems to have more to do with content than with labels. The social science fields that most closely resemble humanistic ones--sociology, anthropology, international relations, and political science--have also seen serious drops. Here are 22 fields that I track. I've put a line at 2011 because it's frequently an inflection point--you also see many shifts around 2008 and 2015 or so.

The big winners in recent years have been health professions, including nursing; computer science and engineering; biological science and to a lesser degree, physical sciences; and what I oddly call "leisure," which includes things like sports management and exercise studies.

The drop in allied disciplines has one important implication worth spelling out. Many historians take solace that they don't primarily rely on majors. (My history department's largest courses are usually driven by students fulfilling requirements from IR and journalism). When those majors drop, there are ripple effects through all the related fields. When just one discipline is in free fall--as was the case, say, with English in the early 2000s--things are not as bad for *even that discipline*, because comp lit majors or communications majors or the like will still take their courses. When all the fields dedicated to the qualitative study of society and culture are falling, there will be ripple effects throughout.

The trend is stronger higher up the prestige chain.

Elite schools matter because they are where almost all humanities PhDs are trained, because they are the only institutions that have historically been especially focused on humanities, because they tend to unfairly dominate national discussions, and because they present a baseline impervious to the shifting landscape of higher education. Using the top 30 schools in the 2017 US News and World Report rankings as a proxy for quality, here's the breakdown in changes in humanities majors by American Academy classification of humanities and Carnegie classification of schools. (Again, this includes communications and 'liberal studies,' neither of which I personally think are truly humanities fields.)

The elite liberal arts colleges were, until 2011 or so, the only schools where humanities, social sciences, and sciences actually split up the pie evenly: now humanities are down from 35% to 22% of degrees. The drop at elite research universities is similarly steep. (At both, humanities are down to about 70% of their 2008 values.)

We can also look at the fall from peak for all the majors that the American Academy tracks at elite universities. Aside from linguistics, communications, and "general studies," all are down by more than 20%. English is down by more than 50% from its 2001 peak at these schools, and history down by almost 50% from its pre-recession peak.

We also know this trend is likely to persist for a few more years, at least. As part of its lawsuit about discrimination against Asian-American applicants, Harvard released information about the intended majors of its applicant pool. The Harvard applicant pool is certainly weird, but is also big (10,000 students) and probably about as good a proxy as we can get for the student body of elite colleges. The class of 2014 (well into the nationwide humanities collapse) had about 20% of students intending a humanities major; that share has dropped to about 12%. Only for the class of 2019 do the numbers seem to stabilize.

PhDs are steady, but not coupled to degrees in the long term.

I said up front I don't have a solution. But I should be clear that one thing I would have liked to argue isn't backed up by the data, in the spirit of mea culpas.

I hold the slightly unfashionable opinion among humanities professors that universities should award fewer PhDs in the humanities, and that efforts to "refocus" the PhD into a degree that doesn't only point to academic employment are likely to be much more beneficial to humanities professors than to the students who spend 8 years in them. (For instance, the American Academy recently found, although they didn't frame it in these terms, that humanities PhDs in non-academic employment are twice as likely to be dissatisfied with their jobs as those who stay in the academy, while in every other field of education non-academic employment is just as rewarding.)

During the 1970s drop in humanities degrees, humanities PhDs halved in number. I would have liked to exploit this crisis to argue we need to do that again. But in fact, PhDs have leveled out, and the long-term variability of PhD numbers means that the BA to PhD ratio is at an unremarkable level--it's dropped 20% from 2008, but is higher than at any point before 2000.

Am I wrong again?

I was too complacent in 2013: am I being too pessimistic now? Maybe. There are plenty of ways you can squint and think the data is leveling out. I certainly don't think the steep declines of the past decade can continue for another decade.

I may also be over-dismissive of course enrollment numbers, since they're harder to get. AHA surveys show history department enrollments dropping by about 9% since 2014. This is bad, but not as catastrophic as the major numbers. (Although: we don't know what happened from 2008 to 2014, which might matter more.)

But still: I think any empirically-inclined person needs to be more pessimistic than they were five years ago.

Appendix: Supplemental charts

A few supplemental notes that I may expand.

1. There's not a very strong racial component, but there are a couple interesting facts about African Americans. First is that HBCUs have not seen the declines that affect most other institution types (albeit from a low baseline.) Second is that black men show much less change in their humanities inclination than any other demographic group. I do not have any explanation of this.

Google Books and the open web.

2018-07-10T16:26:00.001-04:00

Historians generally acknowledge that both undergraduate and graduate methods training need to teach students how to navigate and understand online searches. See, for example, this recent article in Perspectives. Google Books is the most important online resource for full-text search; we should have some idea what's in it.

A few years ago, I felt I had some general sense of what was in the Books search engine and how it works. That sense is diminishing as things change more and more. I used to think I had a sense of how search engines work: you put in some words or phrases, and a computer traverses a sorted index to find instances of the word or phrase you entered; it then returns the documents with the highest share of those words, possibly weighted by something like TF-IDF.
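That older mental model fits in a few dozen lines. Here is a toy sketch of it in Python: an inverted index mapping each word to the documents containing it, with results ranked by a simple TF-IDF score. The function names and the smoothed weighting formula are my own choices for illustration; this says nothing about how Google Books actually works.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Map each term to {doc_id: share of that document's words}."""
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for term, n in Counter(tokens).items():
            index[term][doc_id] = n / len(tokens)
    return index


def search(query: str, index: Dict[str, Dict[str, float]], n_docs: int) -> List[str]:
    """Return ids of documents containing any query term, best score first."""
    scores: Dict[str, float] = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # a term found in no document contributes nothing
        # Rarer terms get more weight; the 1 + keeps the weight positive.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "the cat sat",
    "b": "the dog sat on the dog",
    "c": "birds fly",
}
index = build_index(docs)
print(search("dog", index, len(docs)))
```

The important property of this model is that a query whose words appear in no document returns nothing at all, which is what makes the results described below so strange.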

Nowadays it's far more complicated than that. This post is just some notes on my trying to figure out one strange Google result, and what it says about how things get returned.

What's happening nowadays has little to do with that mental model. Idly wondering whether a strange phrase in the Wikipedia article on the 19th century Supreme Court justice Benjamin Curtis was plagiarized, I put it into Google books search. Consider the results here.

First, we get a photograph from the Watertown Public Library in Massachusetts. Google Books is full of library catalog items which are not in fact books; in this case, though, it seems to be simply the top search result because the precise phrase I searched for is not present in any books. (The normal Google Books search format stuff is missing.)

The next results, though, are more interesting in what they say about the internal workings of the Google Books engine. They do not contain the phrase in question, but they are all about Benjamin Curtis. There are only a dozen or so, not a full set of Curtis-related documents; in fact, it seems that Google Books is presenting me with the "Further Reading" section (not the bibliography) of the Wikipedia page from which the quote came. This seems to be general. Search for any arbitrary phrase in Wikipedia, and you'll get a list back including some of the cited and required texts.

Why is it doing that? What mechanics of a search engine would cause this to happen? Is Books translating otherwise unmatched queries into Wikipedia pages, and returning their contents? Not in general: most unmatched phrases turn up nothing.

I can start to understand this individual rule by typing in random strings. I came up with the phrase "such a place would never be", which appears in no books but does have 6 search results on the web.

That phrase returns a series of books, starting with a book titled "John Brown to James Brown" which one Amazon reviewer described as "where such a place would never be imagined to have existed." The relationship of the rest of the books, which touch mostly on early history of the Mormons and moneymaking schemes, is less clear: I'd bet they all appeared on the Amazon page as "frequently bought," but I can't be sure. (I don't get "frequently bought" when I visit the book's page, and the Google cache doesn't have it either.)

So--there's some form of linked data about books driving Google Books search, driven by the open web possibly as a fallback, or possibly as part of the core search. It seems to work best from structured data like Amazon or Wikipedia, but also engages in some pretty wild guesses based on semantic parsing.

Basically, the web index is hooked into Books search in ways that aren't obvious or transparent. And it leads to an extremely strange world of stacking algorithms on algorithms; that an Amazon review would lead to a phrase giving some random books assembled by Amazon at some point in the past is completely inscrutable.

EDIT, one day later:

It occurs to me that part of what's going on here may be that the same algorithm is used for books as in Google's image search. If you do an image search for these phrases, it pops up images from the Wikipedia article and the Amazon page (and, now, this very blog post). Google appears to treat books both as collections of text to be searched, *and* as entities that exist on the web described through the text of web pages.

Meaning chains with word embeddings

2018-06-13T11:57:00.002-04:00

Matthew Lincoln recently put up a Twitter bot that walks through chains of historical artwork by vector space similarity. https://twitter.com/matthewdlincoln/status/1003690836150792192.
The idea comes from a Google project looking at paths that traverse similar paintings.

This reminds me that I'd been meaning for a while to do something similar with words in an embedding space. Word embeddings and image embeddings are, more or less, equivalent; so the same sorts of methods will work on both. There are--and will continue to be!--lots of interesting ways to bring strategies from convolutional image representations to language models, and vice versa. At first I thought I could just drop Lincoln's code onto a word2vec model, but the paths it finds tend to oscillate around in the high dimensional space more than I'd like. So instead I coded up a new, divide and conquer strategy using the Google News corpus. Here's how it works.

1. Take any two words. I used "duck" and "soup" for my testing.
2. Find a word that is, in cosine distance, *between* the two words: that is, one that is closer to both of them than they are to each other. Select for one as close to the midpoint as possible.* With "duck" and "soup," that word turns out to be "chicken": it's a bird, but it's also something that frequently shows up in the same context as soup.
3. Repeat the process to find words between "duck" and "chicken." That, in this corpus, turns out to be "quail." The vector here seems to be similar to the one above--quail is food relatively more often than duck, but less overwhelmingly than chicken.
4. Continue subdividing each path until no more intermediaries exist. For example, "turkey" works as a point between "quail" and "chicken"; but nothing intermediates between turkey and quail, or between turkey and chicken.
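The steps above can be sketched roughly as follows. This is a reconstruction from the description, not the code I actually ran: the function names (`between`, `chain`) are made up, "as close to the midpoint as possible" is approximated by picking the candidate whose similarities to the two endpoints are most nearly equal, and the two-dimensional toy vectors stand in for a real word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def between(a, b, vectors):
    """Find a word closer to both a and b than they are to each other.

    Among qualifying words, prefer the one whose similarity to a and to b
    is most balanced (an assumed stand-in for "closest to the midpoint").
    Returns None if no word qualifies.
    """
    base = cosine(vectors[a], vectors[b])
    best, best_gap = None, None
    for word, vec in vectors.items():
        if word in (a, b):
            continue
        sim_a, sim_b = cosine(vec, vectors[a]), cosine(vec, vectors[b])
        if sim_a > base and sim_b > base:
            gap = abs(sim_a - sim_b)
            if best is None or gap < best_gap:
                best, best_gap = word, gap
    return best

def chain(a, b, vectors):
    """Recursively subdivide the path from a to b until no pivots remain."""
    pivot = between(a, b, vectors)
    if pivot is None:
        return [a, b]
    # Drop the duplicated pivot when joining the two halves.
    return chain(a, pivot, vectors)[:-1] + chain(pivot, b, vectors)

def toy_vector(degrees):
    """Unit vector at a given angle; purely illustrative."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))

# Invented angles, chosen so the toy reproduces the example in the text.
toy = {
    "duck": toy_vector(0),
    "quail": toy_vector(20),
    "turkey": toy_vector(30),
    "chicken": toy_vector(45),
    "soup": toy_vector(90),
    "vector": toy_vector(200),
}
print(chain("duck", "soup", toy))
```

Each recursive call works on a pair that is more similar than its parent pair, so with a finite vocabulary the subdivision has to bottom out.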

The overall path then sketches out an arc between the two words. (The shape of the arc itself is a component of PCA, but it's also a useful reminder that the choice of the first pivot is quite important--it sets the entire region for the rest of the search.)

(I put a few random words as unlabelled dots in the background--this should serve mostly as a reminder of how odd the geometries of high-dimensional spaces are. With any of these paths, even between relatively similar words, it's not hard to find a perspective where they appear to be on opposite ends of the full galaxy of language).

As with any method, this can be somewhat useful as a way to start understanding the contents of the vectorspace being used. How, for example, do you get from "Trump" to "Obama?" Google News is mostly from almost a decade ago, so the answer lies primarily in New York state. Hillary Clinton has stronger connections to the New Yorkers, and Rudy Giuliani serves as a proto-Trump in that space. Between Trump and Giuliani, interestingly, is the Tea Party 2010 GOP NY gubernatorial nominee Carl Paladino (I had not thought about that name in quite a while!) who seems to serve as a kind of bridge between whatever Trump was doing in 2011 and the GOP establishment, such as it was.

As this suggests, there's a certain kind of deep cultural knowledge built into the co-occurrence networks of news articles that word2vec trains on. The path from "Seinfeld" to "Breaking Bad," for example, initially realizes that "Curb Your Enthusiasm" mediates between the 90s show and the Golden Age of television; and then sets about bridging the drama-comedy divide by finding its way into the primetime soaps through "Scrubs" out of the comedies and "Mad Men" out of the dramas.
There are some odd choices--"Everybody Loves Raymond" and the raw word "sitcom" seem unnecessary to mediate between "Curb" and "Arrested Development," but in general the paths are interesting, at least.

Other times, they simply suggest connecting angles. From "word" to "vector" moves through rasters and parsers to kiss in the land of .dll files. How supremely unlovely.

Occasionally the paths get downright baroque--this, I think, has to do with the oddness of using cosine similarity to interpolate, which makes it relatively easy to find intermediate points.** The path from "iPhone" to "garden," for example, takes a wonderfully evocative path through the cultivated landscape ("rockery", "pergola") into a home ("dining room", "china cabinet", "bookcases") and upon pulling a "hardback book" off the shelf abruptly shifts into the digital age through e-readers and forgotten formats (remember the Zune?) before landing at the iPad.

This enterprise is more evocative than useful, except in that it gives another way to understand the vector models that are an increasingly important part of the information architecture of modern life. Looking at them reminds me in an odd way of the way I was taught to read poetry in high school and college, in two senses.

The first is that it relies heavily on the ability of individual words to conjure up their surroundings. The left half of the above chart pleases (if it does) because, for example, "rockery" is such a rare and specific word--it lives primarily in the world of Beatrix Potter. The whole path to the bookcases past the credenza, above, brings to mind the small soul in the window seat in TS Eliot's Animula. The second part *doesn't* please because the brands on the right have no weight or meaning on their own.

The second is the idea of the path itself. I was taught to read (short) poems as voyages that move in a direction; exteriority to interiority, personal to social, etc. It would be interesting to take individual poems and fit their direction to a general path in the overall vectorspace. This would provide a nice analogue to plot arceology for poems; are there particular schemas especially present in certain genres? What are dissimilar poems that follow similar directional paths in different parts of the space?

Or you could just use them to seed bad nature poems; start with a drop of dew, move out through the humidity to the mites and aphids, out to the raptors and cormorants.

*1: Math/high dimensional spaces note: This turns out to be an interesting problem that makes me wish I had taken some linear algebra at some point in my life. You might think you could simply find points that are close to the midpoint of the two words, but it's almost always the case that the two words are closer to their midpoint than any other word. Similarly with differences

**It's possible that a distance measure that satisfies the triangle inequality would work better; it's also possible that what I really should do is manually choose between the top 5 pivot points, although I don't want to do that on a purely useless thing that's hard to build into a web app.

"Peer review" is younger than you think. Does that mean it can go away?

2017-09-15T11:41:00.004-04:00

This is a blog post I've had sitting around in some form for a few years; I wanted to post it today because:

1) It's about peer review, and it's peer review week! I just read this nice piece by Ken Wissoker in its defense.
2) There's a conference on argumentation in Digital History this weekend at George Mason which I couldn't attend for family reasons but wanted to resonate with at a distance.

It's still sketchy in places, but I'm putting it up as a provocation to think (and to tell me) more about the history of peer review, and how fundamentally malleable scholarly norms are, rather than as a completed historical essay in its own right. [Edit--for a longer and better-informed version of many of these points, particularly as they relate to the sciences, Konrad Lawson points out this essay by Aileen Fyfe; my old grad school colleague Melinda Baldwin has an essay in Physics Today from her forthcoming project that covers the whole shebang as well, with a particular emphasis on physics.]

It's easy, when writing about "the digital," to become foolishly besotted by the radical transformation it offers. There's sometimes a millenarian strand in the digital humanities that can be dangerous, foolish, or both, and which critics of the field occasionally seize on as evidence of its perfidy. But it's just as great a betrayal of historical thinking to essentialize the recent past as to hope that technology lets us uproot the past. We should not fall short of imagining the changes that are possible in the disciplines; and we shouldn't think that disciplines need revolve around particular ways of reviewing, arguing, or producing scholarship.

Here's a short historical story about one thing we tend to essentialize, peer review. I find it useful for illustrating two things. The first is that scholarly concepts we think of as central to the field are often far more recent than we think. This is, I think, a hopeful story; it means the window for change may also be greater than we think. The second is that they are, indeed, intricately tied up with social and technological changes in living memory; the humanities are not some wonderful time container of practices back to Erasmus or even Matthew Arnold. I'm posting it now, after delivering it as a hand-wavy talk at Northeastern in 2015.

Peer review seems to be so fundamental to scholarship that we can hardly imagine a world without it. Conventional histories of peer review suggest that it is old indeed. Kathleen Fitzpatrick starts her discussion of its history in the 1750s, although she suggests that the "history of peer review thus appears to have been both longer and shorter than we may realize," extending back to the 17th century but still imperfect by the 1940s. Wikipedia editors are more firm in their straightforward assertion that it was developed by Henry Oldenburg (1619–1677), who built on the work of Ishāq ibn ʻAlī al-Ruhāwī (854–931). (Wikipedia is less clear on just how it quietly gestated for 7 centuries.)

But even if peer review is ancient, "peer review" itself is quite new. I was surprised, a few years ago, in performing anachronism consulting for the show "Masters of Sex," set in the early 1960s, to see my algorithms reject one character's suggestion that Masters and Johnson needed to publish in peer reviewed journals as hopelessly anachronistic. But that is indeed the case. Google Ngrams shows only sporadic uses before about 1970; the adjectival form "peer reviewed," as adhering to scholarship, barely exists before 1980. (As always, you should basically ignore Google Ngrams results from after 2000, but why not include them?)

Of course the thing may exist before the word: but one thing I've found invariably in looking at these etymologies is that words usually do not march straight out of the primordial ooze into widespread use for no reason at all, particularly words describing a practice so specific and unpoetic as this one. Usually there is some reason, some new thing in the world that requires a new term to distinguish it from what has come before.

So what new thing gave rise to peer review? Reading through the texts gives some sense. JStor contains no uses of the phrase until 1965, in reference to “peer review groups” at the NIH. Through the late 1960s the phrase was only used in the context of doctors supervising medical care of other doctors. The first use outside of medicine I find is in 1969, in a library sciences context,1 also about professional self-evaluation. An early usage for grants is in 1970.2 The first usage of the phrase in the American Historical Review is in 1978, in a decidedly negative evaluation of the “peer review bureaucracies of foundations and government,” “manned by scientists of lesser achievement.”

So what is the new thing being described here? As best as I can tell, the reason for its existence is the rise of the new government funding bureaucracy in the 1960s; the NIH, the NSF, and their smaller cousins like the NEH all needed mechanisms to distribute their new government largess: and peer review was a way to ensure that government money was not handed out by the government, but by experts from the scientific community. (This story, by the way, bears an interesting relationship to the one I wrote earlier this month about government attributes versus public ones; that sentence would have a different valence if I said that scholars wished to ensure the "public money was not handed out by the public.") Its use in non-publication situations--peer review boards investigating medical malpractice, for example--is almost entirely about protecting professional organizations from state or other bureaucratic interference. And when there are large-scale discussions of peer review in science--as in a special 1985 edition of "Science, Technology, and Human Values"--they frequently take grant-making as the archetypal form, not the refereeing of scholarship.

The central technology that causes this new term to spread to scholarship, then, is the grantmaking state. But this relies on a more prosaic technology, as well. Peer review is the archetypal form of scholarly organization in the age of the xerox machine. Without it, a sheaf of grant proposals could not be easily snapped into a binder, mailed across the country, and then (perhaps) flown back to Washington with its expert reader for discussion.

Digging around a little recently, I see that historian of science Alex Csiszar wrote a short piece for Nature in 2016 (after I gave this as a talk, so not fully incorporated here) where he says this, which is very much along the same lines.

'Peer review' was a term borrowed from the procedures that government agencies used to decide who would receive financial support for scientific and medical research. When 'referee systems' turned into 'peer review', the process became a mighty public symbol of the claim that these powerful and expensive investigators of the natural world had procedures for regulating themselves and for producing consensus, even though some observers quietly wondered whether scientific referees were up to this grand calling.

All of this suggests, though it doesn't prove, that the shift to a language of "peer review" involves a model of research that draws on a nationally organized scientific funding system that merges with a series of older traditions. Most of the histories of peer review in the sciences note how late journals were to adopt it: leading British publications like the Lancet and Nature don't take up outside peer reviewers until the 1970s.

If the history of peer review in the sciences is young, the history of peer review in the humanities is even younger. The earliest usage in the front-or-back matter of the American Historical Review that I see at first glance is from 1996, in describing who can perform book reviews; the second use outside a history-of-science context is in 1997, when Sheila Fitzpatrick notes unhappily that a book has obviously "undergone extensive peer review" in a way that weakens it, serving as an "uncomfortable reminder that peer review may function not only as a gate-keeping procedure but also as a kind of censorship of unpopular opinions." (Here's a similar 2012 argument in favor of editorial review, *not* peer review, in a science journal). (Not far down the list is Ayers and Thomas's 2003 digital article "The Differences Slavery Made;" rather than showing up to challenge decades of consensus on peer review, digital scholarship was already arriving on the scene hardly after the consensus had set).

I'd have to do more research to really understand what the outside review policies of leading journals were in the (say) 1950s and 1960s. I have the impression, but have lost the reference, that at least single-copy referee reviewing was common, if not mandatory, in the period. ("Outside referee," though, is another 1960s neologism.) Still, the technology of mimeographs and carbon copies makes for different forms of outside review: In 1956, the AHR only demanded a single paper copy of each article submitted. (The ribbon copy, please; keep the carbon copy for yourself.) This seems to have been still true in 1970; I suppose by then the AHR could have been paying for photocopies on their own, but I doubt they did. By 1980, the number of required copies had increased to 2; by the late 1990s, the frontmatter demanded four, or a Microsoft-compatible disk, which is what it would take for a really modern peer review system. But surely I've said enough now that someone will chime in in the comments about how mimeographs were sent to various reviewers in serial.

I know even less about monographs; but it should be widely known that some important work in the field *continues* to be published as part of edited volumes, trade presses, and other channels subject only to editorial review, not peer review.

So what? Obviously peer review didn't *really* come into being in the historical profession in 1995. That's ridiculous! Books and journals had various forms of outside referees in the mimeograph/carbon copy age as well. But the point is this: something did change, and more recently than we might think. Peer review was created, rather than always existing in the humanities; its rhetorical adoption in those fields as a core term may be less connected to eternal ideas of scholarship, and have more to do with the effort to make them more like the sciences in the postwar university. Be skeptical of the administrative-ese, buzzword-inflected vogue for the digital humanities all you want: but also, be aware that if you say "peer review is fundamental to sound scholarship," you're speaking in the buzzwords of not long ago, yourself.

I wouldn't argue in good faith that peer review is new, so it's bad. It might be that the scholarly practices from 1970 to 2010 are indeed worth preserving.

But at the same time, I'm not sure that digital scholarship could ever fully reconcile itself to peer review in the traditional sense without transforming it enough that we'll need a whole new phrase to describe the new regime to come.

Randomly selected bibliography:

Some early citations, and some previous histories.

“Atlantic City Conference: A Great Show–in Two Parts and a Cast of Thousands.” ALA Bulletin 63, no. 7: 915–964. http://www.jstor.org/stable/25698237.

Cooper, Joseph D. “Onward the Management of Science: The Wooldridge Report.” Science 148, no. 3676. New Series: 1433–1439. http://www.jstor.org/stable/1716537.

Csiszar, Alex. “Peer Review: Troubled from the Start.” Nature News 532, no. 7599 (April 21, 2016): 306. doi:10.1038/532306a.

Population Density 2: Old and New New England

2017-07-24T18:43:00.002-04:00

Digging through old census data, I realized that Wikipedia has some really amazing town-level historical population data, particularly for the Northeast, thanks to one editor in particular typing up old census reports by hand. (And also for French communes, but that's neither here nor there.) I'm working on pulling it into shape for the whole country, but this is the most interesting part.

Browsing through it, one thing that comes out immediately is just how long it has been since many towns experienced major growth. There are a lot of ways to think about this (e.g., year of maximum population; year of greatest growth rate) but I like the compromise here, which shows the first year that a town had ~~2/3~~ 3/4 its present population.
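For concreteness, here is a minimal sketch of that measure. The function name `first_year_at_share` and the two made-up towns are purely illustrative; it assumes "present population" means the most recent census year in each town's series.

```python
def first_year_at_share(series, share=0.75):
    """Return the first census year a town reached `share` of its latest population.

    `series` maps census year -> population; the latest year is treated
    as the present.
    """
    years = sorted(series)
    threshold = share * series[years[-1]]
    for year in years:
        if series[year] >= threshold:
            return year

# Hypothetical towns: one that peaked early and one that boomed recently.
old_town = {1850: 8000, 1900: 3000, 1950: 3500, 2010: 10000}
new_suburb = {1850: 500, 1950: 2000, 2010: 10000}
print(first_year_at_share(old_town), first_year_at_share(new_suburb))
```

Note that a town that shrank and later recovered is still dated to its first crossing, which is the point of the compromise: it rewards early growth without requiring the peak to be early.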

There are a lot of features I like here. Most interesting are all the small green dots in the southern tier of upstate New York, across Vermont and Western MA, and throughout southern Maine. In all of these places, it's been a century and a half since any significant economic growth. The largest in this class is Nantucket, which hasn't seen major growth since losing the whaling industry to New Bedford.

The purple towns are mostly born of the industrial revolution; a few are scattered in the less habitable regions of New York and Maine, parts of which have more in common with the upper midwest than the rest of New England. (There are whole, depopulated Swedish towns in northern Maine; I swear I heard an accent that would sound at home in "Fargo" in Caribou.) The largest are, obviously, New York City and Boston, which stopped growing because they ran out of land; Hartford, Providence, Buffalo, and Worcester are all there too.

Most of the major industrial centers in Southern New England and along the Erie Canal are surrounded by a ring of orange towns which peaked between 1945 and 1971; those are almost suburbs. The largest of these is Hempstead, Long Island, which is obscure enough as a place name that I had to look it up even though 1) the town (not the smaller village) has a bigger population than Boston and 2) I was biking in it literally yesterday. It's the lowest third of Nassau county, just past NYC's JFK airport, and includes prototypical suburbs like Levittown. Most of the largest cities like this are New York suburbs; in New England, the largest are Framingham, MA, pretty much every municipality from Greenwich to Hartford in CT, and the Portland suburbs.

Finally, the yellow encompasses almost all the formerly rural portions of New York/New England south of Portland and within a few dozen miles of the ocean. Nantucket aside, there are almost no towns in this area that didn't grow; many of the rare green spots are Connecticut towns like Lyme that have calved off other towns over the years. The largest here, Brookhaven, has half a million people and is essentially the same as Hempstead but for Suffolk county, farther out Long Island.

Is there a grand conclusion to be drawn from this? Well, it does shake my thinking a little bit about one thing. Even stasis, let alone depopulation, sometimes feels like a trauma that the United States has never experienced before. If I blow these same legend parameters out to a wider notion of the northeast (plus a smattering of places in Canada), the legend classes cease making much sense at all--there's almost no green in the entire country outside of the region I've plotted here. Partly that's data availability; rural place-name level information from the census would be great to have, but we really only have it for these seven states right now. (Five because of that editor I mentioned earlier; the other two, Massachusetts and Rhode Island, I've pulled straight from the Wikipedia full-text dump).

Don't bother enlarging this, it's not consistently made, there are weird spots, and the legend is positioned over the map!

But it's also that there has been at least one destructive pass of the same magnitude as deindustrialization in the past; the wave of de-agriculturation that started in the 1820s in New England as farmers went west, and hit the midwestern cities and towns in full force probably starting in the agricultural depression of the 1920s and, obviously, 30s. Of course historians know that this happened, and it's a commonplace to think of the trials of the shift to a post-industrial economy as on a par with those accompanying the shift to an industrial one. But there's something odd about that phrasing that makes industrialism too central: we don't talk about a post-agricultural economy, and "de-agriculturation" is not a word.

But if depopulation is traumatic, not just loss of jobs, it helps me think about Northern New England in a slightly more complete way. If you go to almost any of these green towns today, the most striking thing in the town square is the Civil War memorial with more names on it than it seems like the town has people nowadays. How did the proximate experience of depopulation change attitudes towards that war? (That is: how does the fate of the Union look different from an atrophying town in Vermont as opposed to a booming one in Illinois?) Those questions may be well-trod in the literature, but I haven't read them: probably because at least in my grad education, post-civil-war agrarian Northeasterners were uninteresting compared to agrarian southerners of both races, westerners of all stripes, and urban and suburban Northerners.

"There are white people we don't think about enough!" is weak sauce, though, and I freely admit that not even I plan to take this data in that direction.

One weird feature of this map is that it uses a quartic-root scale. So each circle is proportional to the size of a two-dimensional manifestation of a four-dimensional hypersphere of appropriate volume. (Do you call it volume in four dimensions? I dunno). That's an idiosyncratic choice I won't repeat anytime soon. But next up population-density wise, here: maybe a return to the giant floating spheres of the 1930s.
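To make the oddness concrete, here is a small sketch comparing the quartic-root scale with the more usual square-root (area-proportional) one. The function names and the `scale` parameter are made up for illustration; the only thing taken from the map is the idea that radius grows with the fourth root of population.

```python
import math

def circle_radius(population, exponent=0.25, scale=1.0):
    """Radius of a town's circle; exponent=0.25 is the quartic-root scale."""
    return scale * population ** exponent

def circle_area(population, exponent=0.25, scale=1.0):
    """Area of the drawn circle for a given population."""
    return math.pi * circle_radius(population, exponent, scale) ** 2

small, big = 10_000, 160_000  # hypothetical towns, one 16 times the other

# Quartic root: 16x the people gives 2x the radius and only 4x the ink.
print(circle_area(big) / circle_area(small))
# Square root (the conventional choice): 16x the people gives 16x the ink.
print(circle_area(big, exponent=0.5) / circle_area(small, exponent=0.5))
```

In other words the drawn area tracks the square root of population, which compresses the big cities enough that the small towns stay visible.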

Population Density 1: Do cities have a land area? And a literal use of the Joy Division map

2017-07-11T15:08:00.001-04:00

I've been doing a lot of reading about population density cartography recently. With election-map cartography remaining a major issue, there's been lots of discussion of them: and the "Joy Plot" is currently getting lots of attention.

So I thought I'd finally post some musings I wrote up last month about population density, the built environment, and this plot I made of New York City building height:

This chart appears at the bottom of this post, but bigger!

One of the continuing strands is that cities have always presented major challenges for population density mapping. The idea of population density, at heart, imagines humans as particles that repel each other on a surface; the intense aggregation of populations in a few cities is unnatural. Walker's maps of population density (1870) in the American west exclude cities with populations over 2500; August Petermann's maps of the British population for the 1851 census do the same. (Here's one of Petermann's).

So: cities are unnatural and don't really count when mapping ordinary human population density. I just encountered the most extreme enunciation of this problem in a book by Cornell professor Francis Willcox from 1897, Density and Distribution of Population in the United States at the Eleventh Census.

Density of population may be defined as the average number of human beings living on a unit of land surface. [...] Rough volumetric measurements, such as the number of persons to a dwelling or a room, or the cubic feet of air per capita, are sometimes used in congested districts and notwithstanding the obvious objections to so vague a unit as a dwelling or room, they may give better results than the method in general use. For with modern advances in engineering, civilized people have become able to live or work at some distance above or below the surface, where the advantages of mines or cities assemble them. Under such circumstances the ratio of population to surface may be insignificant or misleading. [My emphasis]

So not only do cities present unnatural density; the widespread practice of building above or below ground means that there should be some other criterion for measuring density. Land area isn't constant but manipulable; build a two-story warehouse or dig a tunnel, and you've doubled the space for people to live in.

There are some interesting technical questions here. It got me wondering about a distorted cartogram that might distend areas not by population, but simply by surface area. Does it really make a difference?

To check, I downloaded a list of building footprints from nyc.gov and just ran some numbers on how many square feet of floor space there are. I assumed all buildings in the city had the same shape at the top floor as at the ground, which is spectacularly wrong--after the 1916 zoning code, most tall buildings were supposed to imitate the tiered floorplan of the Woolworth building--so this should be taken as an upper limit, not a description of fact. But the building stock in the city is old enough that I doubt it's an egregious overestimate: I'd guess, wildly, that this overstates floor space by five to thirty percent. (Even with this assumption, half of all floor space in the city is in buildings of five or fewer floors; I also assume that no buildings have basements, when in fact most do.)

So: how much building space is there in NYC? By my calculation, the answer comes out to about 225 square miles of indoor space; that's less than the land area of the city (300 sq mi). Which is to say: if you could tear up every piece of linoleum, concrete, and parquet from every floor above the first in Manhattan, you'd be able to lay each one on the ground on unbuilt space in the city's streets, parks, airports, and parking lots.

Subtract out first floors, and the built environment adds at most 164 square miles to the city's area, bumping it from 300 to 460 square miles. That's a lot in urban terms--construction has added to the surface of the city about the combined area of Brooklyn and Queens--but it's not an amount that would make any difference, in, say, a map of the national popular vote by area. So: even a century of development later, Willcox's critique doesn't argue for rethinking the denominator in population density.
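The arithmetic here is simple enough to sketch. This is not the script I ran: the function name `floor_space` and the three sample buildings are invented, and it bakes in the same assumptions as above (every floor has the shape of the ground-floor footprint, and nothing has a basement). Only the 300, 225, and 164 square mile figures come from the calculation itself.

```python
SQFT_PER_SQMI = 5280 ** 2

def floor_space(buildings):
    """Total and above-ground-floor space for (footprint_sqft, floors) pairs.

    Assumes each floor is as large as the footprint, so the total is an
    upper bound on real floor space.
    """
    total = sum(footprint * floors for footprint, floors in buildings)
    ground = sum(footprint for footprint, _ in buildings)
    return total, total - ground

# Hypothetical buildings: a bungalow, a walk-up, and a tower.
sample = [(1000, 1), (2000, 5), (500, 40)]
total, added = floor_space(sample)
print(total, added, added / SQFT_PER_SQMI)

# The city-wide figures from the text, in square miles.
land_area, indoor_space, upper_floors = 300, 225, 164
print(indoor_space - upper_floors)  # ground covered by first floors
print(land_area + upper_floors)     # land plus everything stacked above it
```

The last line comes out a little above the rounded 460 quoted here, and the difference between indoor space and upper floors implies that first floors cover about a fifth of the city's land.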

Still, it suggests a kind of neat way of thinking about walking across the city. Suppose you were to start at the Hudson and walk due east across the city. Every time you encounter a multistory building, you walk over each floor exactly once. How much out of your way do you go?

This suggests a kind of interesting alteration of the "Joy Division" maps that have been popular since (at least) James Cheshire's great map of population by latitude. Those charts essentially do a double encoding scheme on the y axis: it means latitude but also population. That's clever, but population height means nothing in the context of latitude and longitude.

Plotting extra land area instead of population adds the interesting additional constraint that there's an obvious connection between the primary scaling factor (latitude) and the secondary one (additional area). The additional distance in the latitude factor is on the same scale as the rest of the map.

It doesn't quite match the constraint that distance matches onto the length of travel you would go if walking across the city at that line; for that to be true, the heights should be cut in half, and there would need to be histogram bars so that at each point you climb up and then down again. (Or you could do it as a vibrating squiggle, setting both amplitude and frequency to get the line the right length--I'm curious enough about that one that I might run it.)

What is described as belonging to the "public" versus the "government?"

2017-07-05T13:54:00.002-04:00

Robert Leonard has an op-ed in the Times today that includes the following anecdote:

Out here some conservatives aren’t even calling them “public” schools anymore. They call them “government schools,” as in, “We don’t want to pay for your damn ‘government schools.’ ” They’re afraid to send their kids to them.

I'm pretty interested in the process of objects shifting from belonging to the "public" to the "government." In my 2015 interactive at the Atlantic about State of the Union addresses, I highlighted the decline of "public" from one of the most common words out of presidents' mouths into a comparatively rare one. And this is a shift that large digital libraries can help us better understand.

The switch is bad if you wish for any sort of small-d democratic politics. A distinction between things controlled by the "public" and the "government" is bad; they *should* be the same in a democratic country. If people feel they aren't, that means that they're either resigned to the idea government is controlled by interests other than theirs (this would be the "populist" view) or that they've lost some animating public spirit (this, the "David Brooks" view).

If schools become "government" run instead of "public," that's a loss. But of what sort? This seems like one of the species of questions that it's worth firing up the Google Ngrams database for: what words have shifted from government to public, or the other way around?

The methodology here is pretty simple: pull every phrase "government X" and "public X" from the dataset, filter to those that appear in both forms, and then fit a linear model to the log-ratio of the two counts over time. Essentially, this looks for words whose share of "government" uses follows a sigmoid curve over time, since a straight line in the log-ratio is a sigmoid in the share.

So, for example, it used to be that "public taxes" was 99% of uses, and "government taxes" about 1%. In the late 90s (the latest period Google Ngrams is useful) it was more like 80% "government taxes", 20% "public taxes".
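Here is a minimal sketch of that fit for a single word. The counts are invented to resemble the "taxes" example (about 1% "government" in 1800, about 80% by 2000), not pulled from Ngrams, and the 0.5 added to each count is my own guard against taking the log of zero; the actual analysis may have handled zeros differently.

```r
# Hypothetical yearly counts for one noun X; NOT real Google Ngrams data.
years <- seq(1800, 2000, by = 10)
true_share <- plogis(-4.6 + 0.03 * (years - 1800))  # share of "government X"
total <- 10000                                      # uses of either form per year
gov <- round(total * true_share)
pub <- total - gov

# Linear model on the log-ratio of the two forms.
log_ratio <- log((gov + 0.5) / (pub + 0.5))
fit <- lm(log_ratio ~ years)
slope <- unname(coef(fit)["years"])

# Back-transforming the fitted line gives a sigmoid in the share of uses.
fitted_share <- plogis(fitted(fit))
print(round(slope, 3))
print(round(range(fitted_share), 2))
```

Running the same fit for every word and sorting by the slope is one way to rank the "most heavily-changing" words: a large positive slope means a shift toward "government," a negative one a shift toward "public."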

Here are 12 of the most heavily-changing words over 200 years. Note that none of these actually cap out at 100%, probably in part because so much old stuff is reprinted.

There's a general displacement here, but some of the track is interesting. "Reports" become ascribed to the government before "income" does. "Efforts" is fairly late as well, which suggests it takes until around the New Deal for the things that public efforts might once have addressed (poverty? morals? mineral deposit mapping?) to be firmly placed in the hands of the government.

Most of these are frustratingly vague. But among the biggest changers, there is also a fair amount of physical infrastructure, which is in some ways more interesting. Do you feel a personal sense of ownership on walking into an official building? For most of the vaguely physical things I checked, the government-to-public ratio is rising (although note the differing y-axes here: although "government schools" is indeed becoming more common, it's now up to about 5% of uses as opposed to 50% for "government archives" or 25% for "government hospitals").

(That "archives" has swung so strongly is especially interesting because it suggests that we historians are deeply complicit in the shift, even if we think schools are a public good--the Google Ngrams corpus is, after all, in large part academic writing, and it certainly is when talking about archives. But when we go to archives, we represent them not as part of a public good in whatever country they are, but as a purely arbitrary state object.)

Most words shift towards "government" over this period, as the word "government" shifts towards greater use as an adjective and as it shifts to describe a more permanent entity sort of synonymous with "state." (There cannot be something as permanent as a "government hospital" in the same regime of meaning that allows the sentence "Theresa May struggles to form a government.")

But: there are *some* words that change in the opposite direction. Here are the most interesting:

"Telephones" and "reaction" are a bit of an anomaly, but "colleges," "university," and "transportation" all point to an ability of this trend to go the other direction. There is no reason not to describe the University of Michigan (say) as a "government college," yet we do not. And "public transportation" continues to be recognized as a good independent of the government that provides it, displacing whatever "government transportation" would be. (I think it would be a mistake to assume that "government transportation" means subways and not, say, military movements, but I haven't dug into the texts on any of these).

Most interesting is the "public domain," which is one of the few places where it's clearly quite easy to distinguish "public" ownership from "government" ownership, since most of the public domain is intellectual property. (Although especially in the 19th century, I think it would have been largely land.)

And, of course, there are a variety of terms that haven't shifted at all, although it's hard to separate the wheat from the chaff. We don't talk about "government parks" or "government streets" or "government golf courses."

Can we make synoptic statements about what's changed and what hasn't? Maybe we could; parks and schools are places of greater interaction and continuing ownership than are courts and administrative buildings. It's possible that a sensible rearguard action would be to abandon the adjective "public" altogether and insist on speaking about "government parks;" the rhetorical posture that allows "government spending" on "public schools" might help create the rhetorical space for imagining "waste, fraud, and abuse."

But I have to admit that the easiest story I find to tell here is a more pessimistic one in which the administrative state slowly erodes away the spaces for the practice of self-government. Lexical fixes are likely not enough.

A brief visual history of MARC cataloging at the Library of Congress.

2017-05-16T12:30:00.003-04:00

The Library of Congress has released MARC records that I'll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.

A couple notes for the technically inclined:
1. the years are pulled from field 260c (or if that doesn't exist or is unparseable, from field 008). Years in non-western calendars are often not converted correctly.
2. There are obviously books from before 1770, but they aren't included.
3. By "books", I mean items in the LC's recently-released retrospective (to 2014) "Books all" MARC files. http://www.loc.gov/cds/products/product.php?productID=5. Not the serial, map, etc. files: the total number is just over 10 million items.
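For the first note, the fallback logic looks roughly like this. It's a hypothetical helper, not the code I actually ran: it takes the first run of four digits in 260$c, and otherwise reads Date 1 from the 008 field (character positions 07-10, counting from zero). The sample field values are made up.

```r
# Hypothetical helper illustrating the 260$c -> 008 fallback; not the original script.
extract_year <- function(f260c, f008) {
  if (!is.na(f260c)) {
    # First run of four digits in 260$c, e.g. "c1984." -> 1984
    m <- regmatches(f260c, regexpr("[0-9]{4}", f260c))
    if (length(m) == 1) return(as.integer(m))
  }
  # Fall back to Date 1 in the 008 field: zero-indexed positions 07-10,
  # which are characters 8 to 11 in R's one-indexed substr().
  d <- substr(f008, 8, 11)
  if (!is.na(d) && grepl("^[0-9]{4}$", d)) return(as.integer(d))
  NA_integer_
}

extract_year("c1984.", "840112s1984    nyu")      # year taken from 260$c
extract_year("[n.d.]", "700101s1923    nyu")      # falls back to the 008 field
extract_year("5730 [1970]", "700101s1970    is")  # picks up 5730, not 1970
```

The last call shows why years in non-western calendars come out wrong: a naive four-digit match grabs the first year it sees.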

See after the break for the R code to create the chart and the initial version Jacob is talking about in the comments.


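# Packages this script relies on (the imports were not shown, so this list is an assumption).
library(dplyr)    # %>% and filter()
library(ggplot2)  # plotting
library(stringr)  # str_wrap()
library(viridis)  # magma() palette

# `data` is assumed to be a data frame with one row per combination of record_date
# (year of the book) and marc_record_created_year, plus a TextCount column of record counts.

# Wrapper around annotate() that wraps long labels and applies the chart's text styling.
my_annotate = function(lab,label,width=40,col="#EEEEEE",...)  {
  annotate(geom="text",...,label=str_wrap(label,width = width),family="OpenSans-CondensedLight", lineheight=0.75,size=3.5,col=col)
}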

plot = data %>% filter(marc_record_created_year < 2017, marc_record_created_year>1965, record_date > 1700, record_date < 2017) %>%
  ggplot() + 
  geom_raster() + 
  aes(x=record_date,y=marc_record_created_year, fill=TextCount) + 
  scale_y_continuous(position=c("right"),expand=c(0,0)) + 
  scale_x_continuous(position=c("bottom"),breaks=seq(1770,2020,by=10),limits=c(1770,2025),expand = c(0,0)) + 

           scale_fill_gradientn("Number\nof books\ncataloged", trans="log10",
                       colors = rev(magma(5)),breaks=outer(c(1,2,5,10),c(1,10,100,1000,10000),"*") %>% as.vector %>% unique) + 
  theme_bw() + 
  theme(legend.key.height = unit(1,"in"),
        plot.title = element_text(size=22)
  ) + 
  annotate("segment",y=1968,yend=2014,x=1968,xend=2014,lwd=1.5,color="white",lty=2) + 
  labs(x="Year of book",y="Year of MARC record",title="MARC cataloging at the Library of Congress",
       subtitle="A brief visual history comparing the year that records were created (left to right) with the year that the books described in them were published" %>% str_wrap) +
  coord_flip() + 
  my_annotate(y=2013,x=2017,label="Books along the dashed line were cataloged in the same year as they were published; this has always been the most common practice",hjust=1,vjust=0,col="black") + 
  my_annotate(y=1995,x=1999,label="Anything to the upper left of this line was (supposedly) cataloged before it was published: this is impossible! These are all errors of some type.",hjust=1,vjust=0,col="black") + 
  my_annotate(y=1975,x=1969,label="MARC cataloging began in 1968; in the first years, only new books were added.",hjust=0,vjust=0.5,width=40,col="#EEEEEE") + 
  annotate(geom="segment",x=1967,xend=1969,y=1967,yend=1975,col="grey") + 
  my_annotate(y=1975,x=1960,label="In the early 1970s catalogers began to input older books: by 1972, there were hundreds of books a year entered from the early twentieth century",hjust=0,vjust=0.8,width=35,col="#EEEEEE") + 
    annotate(geom="segment",x=1910,xend=1960,y=1971,yend=1974.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1950,xend=1960,y=1970,yend=1974.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1960,xend=1960,y=1969,yend=1974.5,col="#DDDDDD") + 
  my_annotate(y=2000,x=1940,label="It took until 2000 for the backlog to be (mostly) cleared: the lighter patches here show that only a few records from the mid-twentieth century were being digitized",hjust=1,vjust=0.8,width=30,col="#EEEEEE") + 
    annotate(geom="segment",x=1940,xend=1955,yend=2003,y=2000.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1940,xend=1920,yend=2003,y=2000.5,col="#DDDDDD") +
    my_annotate(y=1995,x=1902,label="There is a dark band in the year 1900, which is used as a catchall year for books published anytime in the century",hjust=1,vjust=0,width=30) + 
  annotate("rect",ymin=1968, ymax=2014, xmin=1899,xmax=1902,fill=NA,color="#DDDDDD",alpha=.5) + 
  annotate("rect",ymin=1994.8, ymax=1997.2, xmin=1850,xmax=1896,fill=NA,color="#DDDDDD",alpha=.5) + 
  my_annotate(y=1993.5,x=1880,label="A horizontal line shows that 1996 was an especially furious year of digitizing older records from the 19th and 20th centuries",hjust=1,vjust=0.5,width=30) +
      my_annotate(y=2002,x=1830,label="Staircase patterns moving up and to the right show smaller efforts that proceeded in chronological order through a subcollection. It took about 6 years to catalog 25 years worth of books from 1825 to 1850",hjust=1,vjust=0,width=35,col="#111111") + 
    annotate(geom="segment",x=1838,xend=1830,yend=2002,y=2007) +
      annotate(geom="segment",x=1800,xend=1830,yend=2002,y=1999) +

  annotate("rect",ymin=2007,ymax=2014,xmin=1848,xmax=1825,fill=NA,color="black",alpha=.5) +
  annotate("rect",ymin=1997,ymax=2001,xmin=1780,xmax=1800,fill=NA,color="black",alpha=.5)


ggsave(plot,device="png",filename="~/Pictures/MARC.png",width=7.5,height=20)

The history of looking at data visualizations

2017-04-14T15:34:00.002-04:00

One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but that "professional" historians haven't paid a great deal of attention to it yet. That's changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?

There's an interesting paradox about what the history of data visualization shows. The standards for what makes a data visualization good seem to change over time. Preferred color schemes, preferred geometries, and standards about the use of things like ideograms change over time. But, although styles change, the justifications for styles are frequently cast in terms of science or objective rules. People don't say "pie charts are out this decade"; they say, "pie charts are objectively bad at displaying quantity." A lot of the most exciting work in the computer science side of information visualization is now trying to make the field finally scientific. It works to bring scientific research into perception from mere style, like the influential and frequently acerbic work of Tableau's Robert Kosara; or to precisely identify what a visualization is supposed to do (be memorable? promote understanding?) like the work of Michelle Borkin, my colleague at Northeastern, so that the success of different elements can be measured.

I think basically everyone who's thought about it agrees that good data visualization is not simply art and not simply science, but the artful combination of both. To make a good data visualization you have to both be creative, and understand the basic perceptual limits on your viewer. So you might think that I'm just saying: the style changes, but the science of perception remains the same.

That's kind of true: but what's interesting about thinking historically about data visualization is that the science itself changes over time, so that both what's stylistically desirable and what a visualization's audience has the cognitive capacity to apprehend change over time. Studies of perception can tap into psychological constants, but they also invariably hit on cultural conditioning. People might be bad at judging angles in general, but if you want to depict a number that runs on a scale from 1 to 60, you'll get better results by using a clock face because most people spend a lot of time looking at analog clocks and can more or less instantly determine that a hand is pointing at the 45. (Maybe this example is dated by now. But that's precisely the point. These things change; old people may be better at judging clock angles than young people.)

This reminds me of the period I studied in my dissertation, the period in the 1920s-1950s when advertisers and psychologists attempted to measure the graphical properties of an attention-getting advertisement. Researchers worked to understand the rules of whether babies or beautiful drew more attention, whether the left or the right side of the page was more viewed; but whether a baby grabs attention depends as much on how many other babies are on the page as on how much the viewer loves to look at babies. The canniest copywriters did better following their instinct because they understood that the attention economy was always in flux, never in equilibrium.

So one of the most interesting historical (in some ways art-historical) questions here is: are the conditions of apprehension of data visualization changing? Crystal Lee gave a fascinating talk at the conference about the choices that Joseph Priestley made in his chart of history; I often use in teaching Joseph Priestley's description of his chart of biography, which uses several pages to justify the idea of timeline. In the extensive explanation, you can clearly see Priestley pushing back at contemporaries who found the idea of time on the x-axis unclear, or odd to understand.

This seems obvious: so why did Priestley take pages and pages to make the point?

That doesn't mean that "time-as-the-x-axis" was impossible for *everyone* to understand: after all, Priestley's timelines were sensations in the late 18th century. But there were some people who clearly found it very difficult to wrap their heads around, in much the same way that--for instance--I find many people have a lot of trouble today with the idea that the line charts in Google Ngrams are insensitive to the number of books published in each year because they present a ratio rather than an absolute number. (Anyone reading this may have trouble themselves believing that this is hard to understand or would require more than a word of clarification. For many, it does.)

That is to say: data visualizations create the conditions for their own comprehension. Lauren Klein spoke about a particularly interesting case of this, Elizabeth Peabody's mid-19th century pedagogical visualizations of history, which depict each century as a square, divided into four more squares, each divided into 25 squares, and finally divided into 9 more for a total of 900 cells.

Peabody's grid, explanation: http://shapeofhistory.net/

There's an oddly numerological aspect to this division, which is structured by the squares of the first three primes (4, 9, and 25); Manan Ahmed suggested that it drew on a medieval manuscript tradition of magic squares.

Old manuscript from pinterest: I don't really know what this is. But wow, squares within squares!

Klein has created a fully interactive recreation of Peabody's visualization online here, with original sources. Her accompanying argument (talk form here), which I think is correct, includes the idea that Peabody deliberately engineered a "difficult" data visualization because she wanted a form that would promote reflection and investment, not something that would make structures immediately apparent without a lot of cognition.

Still, one of the things that emerged again and again in the talks was how little we know about how people historically read data visualizations. Klein's archival work demonstrates that many students had no idea what to do with Peabody's visualizations; but there's an interesting open question about whether they were easier to understand then than they are now.

The standard narrative of data visualization, insofar as there is one, is of steadily increasing capacity as data visualization forms become widespread. (The more scientific you are, I guess, the more you might also believe in constant capacity to apprehend data visualizations.) Landmark visualizations, you might think, introduce new forms that expand our capacity to understand quantities spatially. Michael Friendly's timeline of milestone visualizations, which was occasionally referenced, lays out this idea fairly clearly; first we can read maps, then we learn to read timelines, then arbitrary coordinate charts, then boxplots; finally in the 90s and 00s we get treemaps and animated bubble charts, with every step expanding our ability to interpret. These techniques help expand understanding both for experts and, through popularizers (Playfair, Tufte, Rosling), the general public.

What that story misses are the capacities, practices, and cognitive abilities that were lost. (And the roads not taken, of course; but lost practices seem particularly interesting).

So could Peabody's squares have made more sense in the 19th century? Ahmed's magic squares suggest that maybe they did. I was also struck by the similarity to a conceptual framing that some 19th-century Americans would have known well: the public land survey system that, just like Peabody's grid, divided its object (most of the new United States) into three nested series of squares.

Did Peabody's readers see her squares in terms of magic squares or public lands? It's very hard--though not impossible--to know. It's hard enough to get visualization creators nowadays to do end-user testing; to hope for archival evidence from the 19th century is a bridge too far.

But it's certainly possible to hope for evidence; and it doesn't seem crazy to me to suggest that the nested series of squares used to be a first-order visualization technique that people could understand well, that has since withered away to the point where the only related modern form is the rectangular treemap, which is not widely used and lacks the mystical regularity of the squares.

I'm emphatically not saying that 'nested squares are a powerful visualization technique professionals should use more.' Unless your audience is a bunch of Sufi mystics just thawed out of a glacier in the Elburz mountains, you're probably better off with a bar chart. I am saying that maybe they used to be; that our intuitions about how much more natural a hierarchical tree is might be just as incorrect as our intuitions about whether left-to-right or right-to-left is the better direction to organize text.

From the data visualization science side, this stuff may be interesting because it helps provide an alternative slate of subjects for visualization research. Psychometry more generally knows it has a problem with WEIRD (Western, educated, industrialized, rich and democratic) subjects. The data visualization literature has to grapple with the same problem; and since Tufte (at least) it's looked to its own history as a place to find the conditions of the possible. If it's possible to change what people are good at reading, that suggests both that "hard" dataviz might be more important than "easy" dataviz, and that experiments may not run long enough (decades?) to tell if something works. (I haven't seen this stuff in the dataviz literature, but I also haven't gone looking for it. I suspect it must exist in the medical visualization literature, where there are wars about whether it's worthwhile to replace old color schemes in, say, an MRI readout that are perceptually suboptimal but which individual doctors may be )

From the historical side, it suggests a lot of interesting alignments with the literature. The grid of the survey system or Peabody's maps is also the "grid" Foucault describes as constitutive of early modern theories of knowledge. The epistemologies of scientific image production in the 19th century are the subject of one of the most influential history of science books of the last decade, Daston and Galison's Objectivity. The intersections are rich and considerably more explored, from what I've seen, well beyond history of science and into fields like communications. I'd welcome any references here, too, particularly if they're not to the established, directly relevant field of the history of cartography. (Or the equally vast field of books Tony Grafton wrote.)

That history of science perspective was well represented at Columbia, but an equally important discipline was mostly absent. These questions of aesthetics and reception in visualization feel to me a lot like art-historical questions; there's a useful analogy between understanding how a 19th century American read a population bump chart, and understanding how a thirteenth century Catholic read a stained glass window. But most of the people I know writing about visualization are exiles from studying either texts or numbers, not from art history. External excitement about the digital humanities tends to get too excited about interdisciplinarity between the humanities and sciences and not excited enough about bridging traditions inside the humanities; one of the most interesting areas in this field going forward may be bridging the newfound recognition of the significance of data visualization as a powerful form of political rhetoric and scientific debate with a richer vocabulary for talking about the history of reading images.

Some notes on corpora for diachronic word2vec

2016-12-23T14:29:00.002-05:00

I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.

This is a really interesting field right now. Hamilton et al have a nice paper that shows how to track changes using procrustean transformations: as the grad students in my DH class will tell you with some dismay, the web site is all humanists really need to get the gist.

Semantic shifts from Hamilton, Leskovec, and Jurafsky

I think these plots are really fascinating and potentially useful for researchers. Just like Google Ngrams lets you see how a word changed in frequency, these let you see how a word changed in *context*. That can be useful in all the ways that Ngrams is, without necessarily needing a quantitative, operationalized research question. I'm working on building this into my R package for building and exploring word2vec models: here, for example, is a visualization of how the use of the word "empire" changes across five time chunks in the words spoken on the floor of the British parliament (i.e., the Hansard Corpus). This seems to me to be a potentially interesting way of exploring a large corpus like this.

I would gloss this to say that "empire" remains in a relatively similar semantic space until 1945, although it drifts from a space of national "greatness", "unity", and "solidarity" towards a more technical language of "dependencies" and "possessions." But then after 1945 the parliamentarians tend to forget that they ever had an empire at all; in the post-Thatcher period, the British speak of "empire" in the same breath as "Stalinism," "Napoleonic" adventures, and the "Asiatic" empires of the past.

This visualization suggests these observations but in no way proves them; at the least, you'd have to look at the individual relationships in each decade to understand how the PCA is parcelling things out, and if you were to write a paper on British uses of empire you might want to retreat to more traditional concordance methods. But the ability to perform these kinds of explorations in sub-second times for any word is really exciting to me; given an appropriate master corpus, this would be an extremely interesting web tool to build into a large-scale discovery engine.

But "given an appropriate master corpus" is a big caveat. The Hansard corpus is a decent one, which is why I use it here. My least favorite thing about the Stanford paper is that they use the Google Ngrams 5-grams as the basis for their diachronic comparisons.

Now I see there's a new paper out by Johannes Hellrich and Udo Hahn called "Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful" that follows in their footsteps by suggesting that SGNS word2vec shouldn't be used because the random seeding produces unreliably different neighborhoods for words. This is a useful concern to raise; I like to try to treat SGNS word2vec as essentially a low-memory matrix factorization algorithm, but they suggest it varies significantly between runs. We certainly need more studies along these lines.

But they, too, use the Google Ngrams 5-grams for this, which makes it very hard to tell the difference between problems of the training corpus and problems of the word2vec algorithm. This is partly because of the well-known issues with the selection criteria, compounded by their decision to use a time period (2005-2009) which the authors of Google Ngrams highly discourage including in published studies. But it's even more because of the weird differences between Google Ngrams 5-grams and actual five-word sequences of natural language text. Ngrams only includes phrases that appear more than 40 times; that means that the vast majority of the text of books is thrown out in favor of pat or frequent phrases. Going by Hellrich and Hahn's counts, in the English Fiction corpus, three in four five-word phrases are thrown away.

I've never seen a good summary of what's lost, but I can say with certainty that the effects are much worse for infrequent words than for frequent words. I don't have a well-indexed copy of the five-grams, but some random spot checks on three-grams show that infrequent words like "preexistent" or "inebriated" (in bin 5 below) are dropped 85% of the time in the 3-grams, while moderately common words like "greed" or "upstairs" are dropped more like 60% of the time. Super common words are presumably dropped even less, and I imagine the numbers rise and the gaps widen somewhat as you approach 5-grams.
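To make "dropped N% of the time" concrete, here's a toy sketch of the calculation. The phrases and counts are invented for illustration, not taken from Google Ngrams, and `dropped_share` is a hypothetical helper rather than the spot-check code I ran; the real check compares counts in the published n-grams against counts in full text.

```r
# Invented 3-gram counts for illustration; NOT real Google Ngrams data.
trigrams <- data.frame(
  phrase = c("went upstairs to", "ran upstairs and", "upstairs in bed",
             "was inebriated and", "an inebriated man", "inebriated with joy"),
  count  = c(120, 55, 30, 45, 12, 8),
  stringsAsFactors = FALSE
)
threshold <- 40  # per the description above: phrases survive only above this count

# Share of a word's occurrences that sit in phrases below the cutoff.
dropped_share <- function(word, tab, min_count) {
  has_word <- grepl(word, tab$phrase, fixed = TRUE)
  kept <- tab$count > min_count
  sum(tab$count[has_word & !kept]) / sum(tab$count[has_word])
}

dropped_share("upstairs", trigrams, threshold)
dropped_share("inebriated", trigrams, threshold)
```

Even in this tiny example the rarer word loses a larger share of its contexts, because more of its phrases fall under the cutoff.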

So for a word like "lazzaroni," which the authors worry shouldn't be showing up in the near neighborhood of "romantic," the training data is learning on the basis of a very small number of pat phrases; the ~~third~~fifth-most common 3-gram starting with "Lazzaroni" in ngrams comes from a widely reprinted passage of George Eliot's Adam Bede, "Neither are picturesque lazzaroni or romantic criminals half so frequent as your common labourer."

In natural language, there would be a wide array of less pat phrases the word2vec algorithm would encounter with a word like that; in the ngrams five-grams, though, I doubt it ever sees Lazzaroni in more than five or ten phrases. What will the result of this be? I'm not entirely sure, but I doubt it's good. Uncommon words will be boxed into the contexts they are used in by one or a few widely-reprinted authors or pat phrases, rather than how they were used at large; and then those contexts will, at scale, potentially shift around the locations of somewhat common words in confusing ways because the signals tying them together are so strong. This applies regardless of whether the algorithm is word2vec, or PPMI-SVD, as the authors suggest.

So while I'd really like to know how word2vec reliability performs *and* I'd like to see it or whatever else (SVD-PPMI, yeah; I don't know how they handle some computational complexity issues around that, and I'd like to hear more) extended to work with diachronic historical corpora, the next step is clearly to stop doing this with Google Ngrams and find actual natural language to work with.

OCR failures in 2016

2016-12-20T14:55:00.004-05:00

This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.

When I started working intentionally with computational texts in 2010 or so, I spent a while worrying about the various ways that OCR--optical character recognition--could fail.

But a lot of that knowledge seems to have become out of date with the switch to whatever post-ABBYY, post-Tesseract state of the art has emerged.

I used to think of OCR mistakes taking place inside of the standard ASCII character set, like this image from Ted Underwood I've used occasionally in slide decks for the past few years:

But as I browse through the Google-executed OCR, I'm seeing an increasing number of character-set issues that are more like this, with handwritten numbers turned into a mix of numbers and Chinese characters.

or this set of typewriting into Arabic:

or this Hebrew into Arabic:

Or this set of English into Cherokee characters:

That last example is the key to what's going on here: the book itself *is* mostly in Cherokee, but the title page is not. Nonetheless, Google OCR translates the whole thing as being in Cherokee, making some terrible suppositions along the way: depending on your font set, you may not be able to display the following publisher information.

ᏢᎻᏆᏞᎪᎠᎬᏞᏢᎻᏆᎪ:ᎪᎷᎬᎡᏆᏨᎪᎡ ᏴᎪᏢᎢᏆᏚᎢ ᏢᏌᏴᏞᏆᏟᎪᎢᏆᏫᏒ ᏚᏫᏩᏦᎬᎢᎩ,1420 ᏨᎻᎬᏚᎢᎡᏴᎢ ᏚᎢᎡᎬᎬᎢ

Some unstructured takeaways:

1. I usually think of OCR as happening on a letter-by-letter basis with some pre-trained sense of the underlying dictionary. But the state-of-the-art is working at a much larger level where it makes inference on the basis of full pages and books, so the fact that a lot of a book is sideways can make the pages that *aren't* sideways get read as Arabic.

2. There's a cultural imperialism at work here. The Google algorithms can successfully recognize a book as Cherokee, but nonetheless only about 3 in 4 books in Arabic, Chinese, Greek, or Hebrew are in predominantly the appropriate character set. (Image). This is a little complicated, though; these are pre-1922 texts I'm looking at, so most Japanese, Chinese, and Arabic is not printed.

(This is based on a random sample of 28,075 pre-1922 books from Hathi).

But when you look at what characters are incorrectly imputed onto English and French texts, things get tricky. When books are interpreted as Arabic, the entire book tends to get read in Arabic script; the CJK unified ideograph set tends to be interspersed with other languages. And there are some baseline assumptions throughout here about what languages count: the unified Han character set gets read in a lot, as do the Japanese extensions, but the only substantial example of Thai character sets erroneously read I can find is not from English but from Sanskrit. That's an easy problem to define: Google doesn't like to recognize Thai script, and that's bad if you want to read Thai. But why does Google allow books to contain trace amounts of Greek, Cyrillic, and Chinese, but not Arabic or Hebrew? This probably has something to do with training data that includes western-style numerals in particular, I would bet. (You can see this in action above: the Arabic examples I pasted are almost straight Arabic, while the Chinese ones are interspersed.) The implications of this for large corpora are fuzzy to me.

You can read this chart if you like, but it's probably not clear what it shows.

3. Ordinary mortals don't have access to extensive visual data about books, but the decisions about character set that these algorithms make are a potentially useful shadow of overall image info. Sheet music isn't OCR'ed by Google, for instance, as anything but junk. But it looks to me like it should be possible to train a classifier to pull out sheet music from the Hathi Trust just based on a few seeds and the OCR errors that sheet music tends to be misread as.
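The character-set profiles discussed in points 2 and 3 could be computed along the following lines. This is only a minimal sketch, not the method used for the numbers above: the function names are hypothetical, and it assumes the first word of each character's Unicode name ("LATIN", "ARABIC", "CJK", "CHEROKEE") is a good enough stand-in for its script, which is a rough proxy rather than the real Unicode Script property.

```python
import unicodedata
from collections import Counter


def script_shares(text):
    """Return each script label's share of the alphabetic characters in `text`.

    Assumption: the first word of a character's Unicode name is used as a
    rough script label; this is a proxy, not the Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


def dominant_script(text):
    """Return the most common script label in `text`, or None if there is none."""
    shares = script_shares(text)
    return max(shares, key=shares.get) if shares else None
```

A per-book dictionary of shares like this is the kind of feature a classifier for sheet music, or for mislabeled title pages, might start from.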

A 192-year heatmap of presidential elections with a y axis ordering you have to see to believe

2016-12-01T15:48:00.000-05:00

Like everyone else, I've been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.

Geographical realignments are common in American history, but they're difficult to get an aggregate handle on. You can animate a map, but that makes comparison through time difficult. (One with snappy music is here). You can make a bunch of small multiple maps for every given election, but that makes it quite hard to compare a state to itself across periods. You can make a heatmap, but there's no ability to look regionally if states are in alphabetical order.

This same problem led me a while ago to try and determine the best linear ordering of US states for data visualizations. I came up with a trick for combining some research on hierarchical and traditional census regions, which yields the following order:

This keeps every census-defined region (large and small) in a block, and groups the states sensibly both within those groups and across them.

Applied to election results, this allows a visualization that can be read both at the state and regional level (like a map) but also horizontally across time. Here's what that looks like: if you know something about the candidates in the various elections, it can spark some observations. Mine are after the image. Note that red/blue (or orange/blue) here are not the *absolute* winner, but the relative winner. Although Hillary Clinton won the national popular vote, and she won New Hampshire in 2016, for example, New Hampshire is red because it was more Republican than the nation as a whole.

Click to enlarge

Starting from the most obvious features: the giant blue block in the center is the Democratic "Solid South" from 1856 to 1956. Tennessee and Kentucky are borderline in the old solid South, but active participants in the new bright red solid south of the Republican party. The story of the last 25 years is the expansion of the Democratic mass in the upper quadrant gradually farther south (into VA, NC, maybe even Florida).

The middle band of the map is the old midwest: Wisconsin through Iowa. It shows the least clear long-term alignments, and is remarkable in the 20th century for its incredible moderation. Only four times between 1932 and 2012 did a midwestern state depart from the national margin by more than 9%; Illinois for Obama once, Indiana for GW Bush once, and Minnesota and Iowa once apiece against Reagan. Both Indiana and Missouri broke that pattern this time around; and only Illinois's moderate Democratic lean towards Clinton kept this year from being the first when the entire region was unified for the same party.

There are also a number of details I've never noticed before.

I've always admired the strangeness of Carter's electoral maps in 1976 and 1980, when he carried the Christian vote and not the northern semi-liberals. But Carter's races were typical in that they helped cement the new alignment of the non-coastal West against the Democrats.
I would have expected the 1912 vertical bar to stick out, because there were two candidates splitting the Republican vote. (Since I'm showing 2-party vote by the top two vote-getters, I only show Teddy Roosevelt's progressives on the chart above). But the 1916 election is actually the really strange-looking one, in which Wilson seems to somehow activate the old Bryanite western coalition from 1896 to pull out a narrow win.
There are a lot of states I think of as pairs (the Carolinas, the Dakotas, Kentucky/Tennessee), and most tend to vote along the same lines. But despite the tiny size and deep demographic similarity of New Hampshire and Vermont, they have managed to maintain strikingly distinct political cultures for decades. I had no idea New Hampshire actually flirted with being Democratic-leaning during the 1920s.
Grover Cleveland barely lost his re-election bid in 1888 to Benjamin Harrison (winning the popular vote but falling short in the Electoral College). When he came back for a rematch in 1892, the Republicans had installed 6 new states that all voted strongly against him--SD, ND, MT, ID, WY, and WA. If I were a Democrat walking into that situation, I would be angered. For all the fears about congressional-district gerrymandering nowadays, at least we don't have the perpetual threat of actual new states being fabricated to shore up the current leading coalition.
(The motivation for the map, more than the outcome): I recall in October, people criticized Clinton for campaigning in Georgia and Arizona because they were unnecessary "reaches," when she should have focused on important swing states. But in fact, Clinton was slightly closer to winning in Arizona than in North Carolina, and much (2.8%) closer to winning Georgia than Ohio (the quintessential swing state). (She was closer to winning Texas than to winning Iowa, which I don't think anyone would have predicted ahead of time. I am curious how close some of the simulator sites got to these results.) And for all the incantation of ["Pennsylvania," "Michigan," "Wisconsin"] on the left, it's worth remembering that Florida was just as close as Pennsylvania percentage-wise.

Methodological footnotes:

1. I use orange for Whigs and other miscellaneous parties (Roosevelt's Bull Moose progressives in 1912, National Republicans in 1832) because in every election since 1828 there's some party roughly equivalent to the modern Democrats (blue), and orange and red are discernable enough to tell if you care but not if you don't.

2. Data is from Steven Wolf's spreadsheet of election results from David Leip's election atlas through 2012, and David Wasserman's spreadsheet for 2016. I only took their raw counts; it's my own version of PVI, as deviation from the national 2-party mean. (So for 1912, for instance, this means I threw out all Taft votes, and calculated every state by whether Wilson or TR did better in the head-to-head vote.)

3. All states are equal sizes, but some states have more population. One interesting elaboration here would be to use a stream graph, so that state ordering would be preserved but California could get bigger as time goes on.
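The "deviation from the national 2-party mean" in footnote 2 can be sketched as below. This is an illustration under stated assumptions, not the code behind the chart: it assumes the national figure is the pooled two-party share across all states, the function name is hypothetical, and the vote counts are invented.

```python
def two_party_lean(state_votes):
    """Map each state to its deviation from the national two-party share.

    `state_votes` maps a state name to (votes_a, votes_b) for the top two
    candidates nationally; votes for anyone else are left out entirely.
    Positive values mean the state leaned toward candidate A relative to
    the nation; negative values mean it leaned toward candidate B.
    """
    total_a = sum(a for a, _ in state_votes.values())
    total_b = sum(b for _, b in state_votes.values())
    national_share = total_a / (total_a + total_b)
    return {
        state: a / (a + b) - national_share
        for state, (a, b) in state_votes.items()
    }


# Hypothetical vote counts, invented purely for illustration.
example = {
    "State X": (510, 490),
    "State Y": (700, 300),
    "State Z": (600, 900),
}
leans = two_party_lean(example)
```

In this made-up example candidate A carries State X, but by less than A's national share, so State X gets a negative lean. That is the New Hampshire situation described above: won by one party, colored for the other.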

The efficient plots hypothesis

2016-09-09T13:07:00.002-04:00

I'm pulling this discussion out of the comments thread on Scott Enderle's blog, because it's fun. This is the formal statement of what will forever be known as the efficient plots hypothesis for plot arceology. Nobel prize in culturomics, here I come.

Brief background: Enderle shows pretty persuasively that all the fundamental plot arcs described in a paper by a math-based computational story lab can be ascribed to random (Brownian) noise. As I wrote earlier, and Hannah Walser explored in more depth recently, that this happens with their data isn't so surprising; the "stories" they are modeling are mostly random documents to begin with.

Still, there's some reason to think that maybe sentiment trajectories are random walks even in actual databases of stories like those Matt Jockers uses. Enderle finds that, well, weird: "Should we find that sentiment data from novels does indeed amount to “mere noise,” literary critics will have some very difficult questions to ask themselves about the conditions under which noise signifies." The idea that plots are random seems offensive to the idea of plot at all. Others in the field, like Jockers and Ted Underwood, have also expressed the idea that there should be some regularities to plot, particularly that map across genre.

I had earlier raised the idea that the null hypothesis for plot testing should be a random walk (Brownian noise, as Enderle calls it) but I thought of it as just that--a null hypothesis that indicates nothing interesting is going on.

But of course, it *would be interesting if nothing was going on.* It would demand explanation! And now I've got one: the efficient plots hypothesis, a corollary of the efficient markets hypothesis (EMH) for the literary world.

The EMH states that stock prices are efficient; you can't know reliably if they're about to go up or down, because if you could, someone would already have traded on it. There's been a lot of research on whether stocks move in Brownian noise; they don't, totally, but they come pretty close.

The EPH, as I imagine it, says that the ideal reader can't know if the mood of a book is about to get sunnier or darker at any given point in the plot. This is not because of market forces directly, but because the purpose of a narrative is to engross the reader. Engrossment proceeds through uncertainty. If you knew what was about to happen, you'd skim ahead or stop reading.

That is: at any moment in a story, the emotional trajectory is a random walk for the reader because anything else would be *boring.* And stories aren't boring.

This could be tested empirically by asking readers if a book will get more positive or more negative over the next five pages, and by how much. In a pure EPH world, they'll only be right about half the time. Enderle thinks the EPH is obviously wrong, particularly for genre fiction.
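To make the "right about half the time" claim concrete, here is a small simulation. It is only a sketch: it assumes a pure EPH world in which each change in sentiment is an independent Gaussian draw, and it stands in for the reader with a hypothetical "momentum" guesser who always bets the mood will keep moving the way it just moved.

```python
import random


def momentum_accuracy(n_steps, seed=0):
    """Share of steps where 'the next move repeats the last one' is right.

    The sentiment series is simulated as a pure random walk: every step is
    an independent Gaussian draw, so past moves carry no information.
    """
    rng = random.Random(seed)
    steps = [rng.gauss(0, 1) for _ in range(n_steps)]
    hits = sum(
        1 for prev, nxt in zip(steps, steps[1:])
        if (prev > 0) == (nxt > 0)
    )
    return hits / (n_steps - 1)


accuracy = momentum_accuracy(20_000)
```

Under these assumptions the guesser's accuracy hovers around one half. Real readers beating that rate by a clear margin would count as evidence against the strong form of the EPH.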

I'm not so sure. To take an example: I read some John le Carré novels over the summer. Periodically, a spy has to secretly pass from the East to the West without getting caught by the commies. (Through the Berlin Wall, over the Chinese border to Hong Kong, etc.) Do you know if they'll make it? The emotional sentiment of the next few pages depends on whether they get killed or not. I can see two models here:
1. Genre determines plot arceology: There are conventions to the spy novel that make it possible to tell in advance.
2. The EPH: The whole point of reading a spy novel is that you don't know what will happen; the job of a spy novelist is to make you unsure.

My reading experience is much closer to the latter; that the conventions of genre fiction are *precisely* that you don't know what's going to happen next; otherwise no one would read it.
For most good genre fiction, I think this holds. Will Lockhart/Gardner win the case? Is Don Draper going to hit the bottle or stay sober? The rise of "anyone can die" as the predominant trope of 2010s TV suggests that the economics are forcing stronger and stronger forms of the EPH onto us every day.

The major objection to this would be: "but there *are* genres where you know the outcomes precisely!" In a Hardy Boys novel, they'll rebound from danger and catch the bad guy every time. One response to this is: sure, *you* know that; but you don't read Hardy Boys novels. The people who do are 10-year-olds who legitimately think that, just maybe, the killer's going to drown the brothers in the quarry and the next 20 books on the shelf will turn out to be prequels.

Even if you know how certain books will *end*, that doesn't mean that you'll ever be able to predict the next two pages, which is what this is about. I think this distinction is crucially important and maybe underestimated. Sure, a romantic comedy always has a temporary breakup in the middle; but whether that happens 40% of the way through or 70% of the way through makes all the difference; and if you've made it 90% of the way through without the breakup happening, you start to think "maybe this is one of those comedies without a breakup in it."

If the EPH holds, then, it doesn't suggest that fiction is truly arbitrary; rather, that it's an elaborately constructed game between reader and writer, socially conditioned and in no way permanent. It would suggest that there are enough fundamental plots that at any point in a book you are unsure what plot you are in; and that plots tend to wear themselves out over time.

It does completely put through the wringer my analogy between musical tonality and emotional valence. Key signatures in music are highly predictable. But I think that's OK: it's really clear that there aren't underlying structures quite so strong as sonata form under novels; this would explain why.

For a lunatic idea, the EPH is actually empirically kind of testable. Just ask people to predict the direction of books as they're reading them. Someone could totally do this. Maybe some movie studios even do.

For more details, see my forthcoming book with Stephen Dubner, Jane Austen was a Derivatives Trader (Harper Collins 2017).

Language is biased. What should engineers do?

2016-08-29T09:53:00.001-04:00

Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements of one recent place where that debate has been happening inside computer science.

I've been chewing on this paper out of Princeton and Bath on bias and word embedding algorithms. (Link is to a blog post description that includes the draft). It stands in an interesting relation to this paper out of BU and Microsoft Research, which presents many similar findings but also a debiasing algorithm similar to (but better than) the one I'd used to find "gendered synonyms" in a gender-neutralized model. (I've since gotten a chance to talk in person to the second team, so I'm reflecting primarily on the first paper here).

Although it's ostensibly about word embeddings, it makes some extremely large claims that seem worth filtering over into the digital humanities community. They offer further evidence for my argument from last year that word embeddings offer digital humanists a tool particularly attuned to making broad statements about discourses and meaning.

The actual content of the paper is not especially surprising: they show that patterns mapping to positive and negative feelings in implicit association tests run by psychologists also exist in word embeddings trained on large corpora. This represents something of an advance over the published literature, I guess, in demonstrating with p-values* that it isn't just gender binaries that show up in word embeddings: they also put black names on the unpleasant side of a pleasant/unpleasant binary relative to white names.
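The kind of association being measured can be sketched with cosine similarity. This is a simplified illustration loosely modeled on an embedding association test, not the paper's exact statistic or its significance test; the function names are hypothetical and the two-dimensional vectors are invented toy values, not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word_vec, pleasant_vecs, unpleasant_vecs):
    """Mean similarity to pleasant words minus mean similarity to unpleasant ones."""
    pleasant = sum(cosine(word_vec, p) for p in pleasant_vecs) / len(pleasant_vecs)
    unpleasant = sum(cosine(word_vec, u) for u in unpleasant_vecs) / len(unpleasant_vecs)
    return pleasant - unpleasant


# Toy 2-d vectors, invented for illustration only.
pleasant = [(1.0, 0.0), (0.9, 0.1)]
unpleasant = [(0.0, 1.0), (0.1, 0.9)]
name_a = (0.8, 0.2)
name_b = (0.2, 0.8)
score_a = association(name_a, pleasant, unpleasant)
score_b = association(name_b, pleasant, unpleasant)
```

A positive score means a word sits closer to the pleasant set; comparing scores across two groups of names is the basic move behind the finding described above.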

They also include one killer example of how word embeddings act in a real-world service that's worth trotting out for those who don't think word embeddings or their like are an important piece of social infrastructure:

Translations to English from many gender-neutral languages such as Finnish, Estonian, Hungarian, Persian, and Turkish lead to gender-stereotyped sentences. For example, Google Translate converts these Turkish sentences with genderless pronouns: “O bir doktor. O bir hemşire.” to these English sentences: “He is a doctor. She is a nurse.”

The paper quickly moves, though, to some quite broad generalizations about what language, meaning and bias are. These, I think, are more debatable.

We demonstrate here for the first time what some have long suspected (Quine, 1960)—that semantics, the meaning of words, necessarily reflects regularities latent in our culture, some of which we now know to be prejudiced.

[...] Our work lends credence to the highly parsimonious theory that all that is needed to create prejudicial discrimination is not malice towards others, but preference for one’s ingroup

[...] The simplicity and strength of our results suggests a new null hypothesis for explaining origins of prejudicial behavior in humans, namely, the implicit transmission of ingroup/outgroup identity information through language. That is, before providing an explicit or institutional explanation for why individuals make decisions that disadvantage one group with regards to another, one must show that the unjust decision was not a simple outcome of unthinking reproduction of statistical regularities absorbed with language.

[...] First, our results suggest that word embeddings don’t merely pick up specific, enumerable biases such as gender stereotypes (Bolukbasi et al., 2016), but rather the entire spectrum of human biases reflected in language

[...] Bias is identical to meaning, and it is impossible to employ language meaningfully without incorporating human bias. [In the blog post: "At a high level, bias is meaning. 'Debiasing' these machine models, while intriguing and technically interesting, necessarily harms meaning."]

[...] Normally when we design AI architectures, we try to keep them as simple as possible to facilitate our capacity to debug and maintain AI systems. However, where AI is partially constructed automatically by machine learning of human culture, we may also need an analog of human explicit memory and deliberate actions, that can be trained or programmed to avoid the expression of prejudice.
Of course, such an approach doesn’t lend itself to a straightforward algorithmic formulation. Instead it requires a long-term, interdisciplinary research program that includes cognitive scientists.

Heady stuff!

What to make of it? Well, my first reaction is that many of these statements much too easily elide "similarity in vector space" with "meaning." Quine's "suspicions" (sort of an odd way to describe analytic philosophy--as a set of suspicions awaiting proof) are only proven if a word embedding is itself true. If you think of word embeddings as a neatly factorized co-occurrence matrix--which is all they actually are--they can't really prove anything, on their own.

There's also a sort of strong Whorfianism to these statements that I think the authors share with a lot of cultural historians (including myself, 4 days a week or so): the implication is strong that language determines our abilities to interpret the world. My current, non-expert understanding, though, is that linguists tend to be a bit more restrained.

Some inductions in the second part are real leaps unless you believe a very strong version of the Sapir-Whorf hypothesis, and some other things besides. Why, for example, should 'implicit transmission of ingroup/outgroup identity' through language be the null hypothesis to explain any prejudicial behavior? The jump to ingroup/outgroup seems totally unfounded (even in their three examples; I'd say that humans identify with insects more even as they find flowers more pleasant, and I don't know what ingroup-outgroup has to do with assumptions about gender and profession.) And it takes an incredible belief in the power of language to conclude that evidence of bias in language makes language the default source of bias. In a way, it's a correlation/causation problem.

Some of this seems to be coming from an underlying theory of meaning as not embodied in individual actors. One of the authors, Joanna Bryson, links in the comments to previous, more philosophical, work of hers that characterizes language in terms of memetics, in which memes evolve "more or less independently of their human-agent substrates." I'm sure there's more to say here; memetics often seems reductionist to people who do cultural studies for a living. One key question going forward, I guess, is whether tokens in vector space and memetics prove to be useful partners for each other.

But the big question here is about social responsibility and how we divvy it up.

The exciting thing about the BU/Microsoft paper is that they show how a fairly simple set of tricks can potentially make word embeddings less biased than they naturally are; and even, potentially, less biased than "ordinary people," whatever that means. (This is exciting only if you didn't share the folk assumption that "algorithms are neutral." I'd guess that while most people think that, most reading this blog don't.) They lay out parameters for eliminating the largest (largest, that is, in magnitude) form of gender bias in a word embedding. Expanded to its limit, the assumption there is that if you can name a prejudice, you can also make it disappear through linear algebra.
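The "make it disappear through linear algebra" idea can be illustrated with a single projection. This is a deliberately simplified sketch, not the BU/Microsoft algorithm itself: it only shows how to subtract the component of a vector that lies along one named direction. The function name and the three-dimensional vectors are hypothetical.

```python
def remove_direction(vec, direction):
    """Subtract from `vec` its projection onto `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(v * d for v, d in zip(vec, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vec, direction)]


# Hypothetical 3-d vectors, invented for illustration only.
gender_direction = [1.0, 0.0, 0.0]
occupation = [0.4, 0.7, 0.2]
neutralized = remove_direction(occupation, gender_direction)
```

After the projection is removed, the vector has no component along the chosen direction and is unchanged in every orthogonal direction. That is also the limit of the trick: it only removes a bias you have already named and located.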

The Princeton/Bath paper has a more tragic take; that bias and meaning are entangled (or even identical; at times their stance seems to be that meaning is the sum total of all biases that individuals have), and that elimination of prejudice isn't something we can safely leave to engineers. It needs "cognitive scientists and ethicists." (In the blog post, cognitive scientists are eliminated in favor of "domain experts." "Domain experts," I've realized, play roughly the same role in computer science as angels and scripture do in Thomistic philosophy; a source of revealed knowledge wholly outside normal channels of deduction.)

There's much to be praised in that sort of disciplinary modesty about engineering social choices. But it's also a little problematic in absolving computer engineering of social responsibility. I suppose there's a valid criticism here of the idea that all word-embeddings should be de-biased before being distributed; but that argument seems like a straw man. It doesn't take a professional ethicist to see that (say) a job-recommendation engine should be debiased. Telling engineers that they should consult ethicists before changing embeddings would, in practice, be tantamount to telling them to do nothing.

In some ways, maybe, the debate is over how seriously emerging techniques of "deep learning" are taken as real artificial intelligence. Word embeddings are an interesting place to think about this because in some ways they seem remarkably intelligent (they use "neural networks;" they perform "analogical reasoning;") while in others they're foolishly simple (they're just matrix factorization over a small window; they don't even have a hidden layer in the network, let alone any of the structures that ostensibly make deep learning the new synonym for AI.) The argument from Princeton/Bath is that word embeddings *do* know meanings (for some definition of 'meaning'), and that we need to teach them how to avoid prejudice in the same way humans do. The flip side might be that word embeddings are quite arbitrary in any case, however well they work; and it's fine to make them work straightforwardly better.

Or there might be no difference here at all: I don't find it too hard to read the Bath paper as an argument that debiasing should be done when word vectors are *used*, not when they're distributed, and that there are cases in which biased vectors are useful. The latter is obviously true; the former might be, although I think in most cases no one would do it.

Why Digital Humanists don't need to understand algorithms, but do need to understand transformations

2016-07-20T14:09:00.000-04:00

Debates in the Digital Humanities 2016 is now online, and includes my contribution, "Do Digital Humanists Need to Understand Algorithms?" (As well as a pretty snazzy cover image…) In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:

Put simply: digital humanists do not need to understand algorithms at all. They do need, however, to understand the transformations that algorithms attempt to bring about. If we do so, our practice will be more effective and more likely to be truly original.

It then moves into one case study; the Jockers-Swafford debate of 2015, large parts of which hung on whether the Fourier transform was a black box and how its use as a smoothing device might be understood. It's like a lot of what's on this blog, only better thought out and edited.

The transformation/algorithm distinction is not a completely firm one, but I have found it extremely useful in a lot of research and teaching problems I've approached over the last year. So in addition to advertising that article for your consumption/fall syllabi production, I wanted to take the occasion to put on github a tiny little germ of a project to provide one-page, transformation-oriented introductions to basic text-analysis concepts that came out of using this thinking for a workshop on text analysis at the NIH in Bethesda, and describe what's in it. I'd love for anyone else to use it, fork it, whatever.

The canonical example of transformation/algorithm I give in the article is sorting. "Sortedness" is a transformation you can understand; "quicksort" is a particular algorithm that puts a list into sorted order. If you want to use, say, a concordance to the works of Thomas Aquinas, you must understand sortedness and what it does; but the precise sorting algorithm is unimportant.
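The sorting example can be made concrete in a few lines. This is a minimal sketch: `quicksort` here is one simple textbook variant, `is_sorted` is a hypothetical name for a check of the transformation itself, and the headwords are invented stand-ins for concordance entries.

```python
def quicksort(items):
    """A simple (not in-place) quicksort: one algorithm among many."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)


def is_sorted(items):
    """The transformation: every entry is <= the one that follows it."""
    return all(a <= b for a, b in zip(items, items[1:]))


# Invented headwords, for illustration only.
headwords = ["virtus", "anima", "lex", "caritas", "ens"]
same_result = quicksort(headwords) == sorted(headwords)
```

Python's built-in `sorted` uses a different algorithm from the quicksort above, yet a reader of the concordance only ever depends on the property that `is_sorted` checks.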

Saying we don't need to understand algorithms doesn't mean we shouldn't. There are many cases where we should. But the need to understand the basic transformations attempted is a bare minimum anyone reading an article or performing an early exploratory step should be able to do.

This has shaped the way I introduce algorithmic concepts for humanities audiences.

Take, for example, how to introduce topic modeling. The genre of topic-modeling literature tends to get very quickly into the generative method that underlies LDA (the most widely used topic modeling algorithm). It stops just shy of what a Dirichlet distribution is, but puts probability front and center.

The basic transformation of topic modeling, though, is not probabilistic: it's about breaking up a series of documents into coherent topics that make them up based on co-occurrence. This framing is much larger than LDA, and need have nothing to do with probabilities; I would (maybe controversially) say that for non-insiders, we should think of matrix operations like latent semantic analysis as topic models as well, just of a different stripe. It's fine if stage 2, or stage 3, of understanding topic models is about the probabilities that drive all effective ones. But for stage 1 we should think about the general goal and simply give the basic advice for who has implemented it best. (In this case, "Use mallet with hyperparameters.")

Thinking about transformations means spending less time worrying about whether boxes are black or not, and more time thinking about how to create inputs and read outputs. Creating inputs is particularly important for something like topic modeling. One piece of advice that is given much too little in introductions to topic modeling, I think, is: the most important choice is how big your individual documents will be. Should you use paragraphs? Book chapters? Whole books? Unlike setting hyperparameters for LDA, that's immediately addressable by any humanist; and it's possible to understand why it matters without knowing anything about the particulars of a topic modeling algorithm.
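The document-size choice is easy to act on before any topic model is run. The sketch below shows two hypothetical ways of cutting the same text into "documents"; the function names are invented, and it assumes paragraphs are separated by blank lines.

```python
def by_paragraph(text):
    """Treat each blank-line-separated paragraph as one document."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def by_word_count(text, size):
    """Treat each run of `size` consecutive words as one document."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


sample = "a b c\n\nd e\n\nf"
paragraph_docs = by_paragraph(sample)
chunk_docs = by_word_count(sample, 4)
```

The same text yields three documents one way and two the other, and since topics are built from what co-occurs within a document, the model sees different co-occurrences in each case.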

This has also heavily influenced my thinking about word embedding models. In my 1-page introduction to word embeddings, I say "use word2vec" and largely leave it at that. If you want to use LSA, though, or Michael Gavin's raw vector method based on hand-selected words, that's great. As with topic models or sorting, the things you do with a word embedding model are largely independent of the algorithm that produced it.

That's all here. Here's that link to the article again.

Plot arceology 2016: emotion and tension

2016-07-18T11:46:00.000-04:00

Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers's work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers uses the Fourier transform, and the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper so thought I'd post my concerns here. These sort of ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers's original work; why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.

Basic methodology

1. They use some unspecified mechanism to limit the ~50,000 books in Project Gutenberg to 1,700 "stories" or "works of fiction." They use these terms interchangeably. But this suffers from 2 problems.

1a. First, whatever fiction/nonfiction classifier they use seems to work extraordinarily poorly--almost certainly worse than simply using the Library of Congress classifications that Gutenberg itself distributes with some of its dumps. It includes personal narratives, political essays, instructions for building bird houses, psychology texts, and so forth. If you click the "random" button on their page (which is a great thing to include), you'll see many of these.

1b. It also includes many collections of short stories or pairs of novels published in a single volume. Some of these are the highest scoring "plots" for their basic arcs: for instance, "The Wonder Book of Bible Stories" is the best instance of the inverse of plot 3, and only 1 of the top 5 representatives of (- SV 2) seems to actually be a single narrative.

I ran a spot check of 50 random texts in their browser. I counted 18 non-fiction; 20 novels or short stories; and 12 collections of short stories, or other multi-work texts. So roughly 40% of the texts used are actually what the authors say they are. This makes the conclusions only provisional at best. So many of the titles in the captions are obviously not stories that it's a little baffling they didn't bother to clean up their data set, or use one of the many *actual* fiction collections out there. [Edit: I noticed in the appendix that they classify "fiction" on the basis of length and download count. How they chose the parameters they use isn't clear to me; in any case, it's obvious that just length and download count are *terrible* inputs into a fiction/nonfiction classifier, so it's no wonder they do so poorly.]
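A sample of 50 leaves a fair amount of uncertainty around that 40% figure. As a rough worked example, the sketch below computes a Wilson score interval for 20 successes out of 50; it assumes the "random" button behaves like a simple random sample, which is an assumption about their site rather than something checked here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 20 novels or short stories out of 50 texts checked.
low, high = wilson_interval(20, 50)
```

Under that assumption the interval runs from a bit under 30% to a bit over 50%, so even at the generous end only about half of the corpus would be single works of fiction.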

2. The null hypothesis that they test against is "word salad;" a completely reshuffled set of orders. They do indeed seem to show that their stories have stronger shapes than word salads. But this is an extremely weak finding. It's akin to saying that you can predict the stock market because you can show that stock prices exhibit greater regularity than random digits. Of course stocks are not random dice rolls every second; they have a trajectory that they move from randomly. But for a time series like this, I think the null hypothesis should be at the least a random walk, not complete random words: that is, particularly when using normalized scores as here, the assumption should be that any given paragraph has the same emotional valence as the previous paragraph, not a completely new one. That is to say, it is easy to generate a "null narrative" that is distinct from a "null text." This is not to say that there isn't some benefit to checking the weaker null hypothesis first. [Although see below in the comments: Scott Enderle suggests that the random noise they get shouldn't be producing results like it is. So what's going on is yet more unclear.]

Another question about plot as a time series is: can you predict what will happen? No one working in the field, to my knowledge, has tried to do this, but it could be interesting. In terms of emotional valences, this makes clear, I think, why the word salad null hypothesis is silly; if you want to predict the end of the book from the middle and beginning, you could do better than say "It will start randomly vacillating every word from negative to positive and so on."
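To make the difference between a "null text" and a "null narrative" concrete, here is a small sketch under stated assumptions: the scores are simulated rather than taken from any real book, and the helper names (`word_salad`, `random_walk`, `lag1_autocorrelation`) are hypothetical. It is not the paper's procedure, just an illustration of how differently the two nulls behave.

```python
import random

def word_salad(scores, rng):
    """Null text: the same per-paragraph scores in a completely shuffled order."""
    shuffled = list(scores)
    rng.shuffle(shuffled)
    return shuffled

def random_walk(n, step_sd, rng):
    """Null narrative: each paragraph's score is the previous one plus noise."""
    path = [0.0]
    for _ in range(n - 1):
        path.append(path[-1] + rng.gauss(0, step_sd))
    return path

def lag1_autocorrelation(xs):
    """Correlation between each value and the one just before it."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

rng = random.Random(0)
walk = random_walk(500, 1.0, rng)   # simulated "emotional valence" per paragraph
salad = word_salad(walk, rng)       # identical values, order destroyed

print("random walk lag-1 autocorrelation:", round(lag1_autocorrelation(walk), 2))
print("word salad lag-1 autocorrelation: ", round(lag1_autocorrelation(salad), 2))
```

A random walk already has smooth, arc-like stretches with no author behind it, which is why beating the shuffled baseline says so little.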

3. They decide to test success by number of downloads, and argue that shapes (SV 3) and (-SV 3) are most successful because they "have markedly higher downloads, and somewhat higher variance." The designation as "higher" is based entirely on mean downloads, since the medians are roughly the same. If both mean and median don't tell the story, there's probably something else going on. Maybe there's simply more variance, for example, and the number of downloads varies log-normally. When the summary statistics don't agree, it's a stretch to claim any actual conclusions.
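A quick simulation shows how a mean can rise while the median stays put. This is only a sketch with invented numbers: the download counts are simulated from log-normal distributions that share a median, which is an assumption about the shape of the data and not something the paper reports.

```python
import random
import statistics

rng = random.Random(1)

def simulated_downloads(sigma, n=2000):
    """Hypothetical download counts: log-normal, all with the same median e**4."""
    return [rng.lognormvariate(4.0, sigma) for _ in range(n)]

narrow = simulated_downloads(sigma=0.5)  # low-variance group
wide = simulated_downloads(sigma=1.5)    # high-variance group, same median

for name, sample in [("narrow", narrow), ("wide", wide)]:
    print(name,
          "median:", round(statistics.median(sample)),
          "mean:", round(statistics.mean(sample)))
```

If downloads behave anything like this, "markedly higher downloads" on the mean is mostly a restatement of "somewhat higher variance."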

What are we doing here?

Next on to some bigger questions of what it means to study plot. This and Jockers are two of the more prominent things recently using sentiment analysis as a proxy for "plot." I saw Ted Underwood on Twitter arguing that the next step must be following up on David Bamman's work on experimenting on whether sentiment analysis actually works by using Mechanical Turk to annotate the "emotional trajectory of texts."

I'm basically done thinking about all of these; the combination of my paper on topic-modeling arcs and my meta-reflections on algorithms, plots, and the Jockers-Swafford affair of 2015 for Debates in the Digital Humanities 2016 give most of what I have to say formally about the issue. There are some slides from the IEEE paper that have nice interactives about the beginnings and ends of TV shows. But I thought I'd just blog out a few additional directions I'd like to see followed up.

I have some issues with the idea of validating sentiment analysis results being especially useful for literary analysis, principally because I don't think that even perfectly working sentiment analysis would be a very good way to measure plot. Citing Vonnegut is a bit of a bait-and-switch; he writes about "good fortune" and "ill fortune," not "positive sentiment" and "negative sentiment." Sentiment analysis is already trained on large numbers of human samples of whether something is positive or negative; if we want to explicitly test Vonnegut's hypothesis, we ought to be building new models that classify text as "fortunate" and "misfortunate," which should subtly differ from "positive sentiment" and "negative sentiment."

Or we should be testing theories of plot that, unlike Vonnegut's, actually have any influence beyond a web video from a few years ago. (Vonnegut doesn't even strike me as a writer who was especially good at plot, to be honest.) Train an LSTM model on human-tagged data that can accurately extract the "call to adventure" or the reaching of the innermost cave from a script, and then we might have something interesting, because there's a real interplay between the stories we consume through mass media and popularizations of Joseph Campbell.

Of course that brings me to the final problem here, which is that you *can't* use Mechanical Turk to label stories by their Campbellian archetypes because ordinary readers don't speak in those terms. Is that a problem? Can we expect to find structures that most people wouldn't recognize?

I've said before that I think formal musical analysis is the real place to look here. One could, I imagine, try to classify every Beethoven sonata movement by its emotional trajectory; in some popular understanding of music that is what actually happens. But if macro-musicologists tried to do that, they'd obviously be missing out on the actual formal elements the composer was working with. Early 19th-century European music is organized tonally; a good model of its structure would look at tonal organization, not some nebulous notion of emotionality.

I do not believe there are general story principles as firm as classical-era sonata form. I do think that some combination of Joseph Campbell, commercial organization, and three-act structure conformism leaves contemporary television and movies somewhat predisposed to one or a few narratives that could be usefully explored. Which is why I think it's a huge strategic blunder for everyone working with plots to be looking at novels--probably the least coherent narrative form in existence--instead of any of the many other forms of narrative out there.

Even if there are "master plots," I suspect they will be revealed as much in terms of tension as emotion. (Tension is also more easily analogized to classical form music, for better or worse, as dominant-tonic relationships.) A plot classifier shouldn't be looking at local emotion; it should be looking at arcs of introduction of tension and release. This requires a very different form of machine reading; every gun on every mantlepiece needs to be tracked until it goes off. (As with everything else these days, this seems structurally better suited for neural networks than the locally tokenized texts we're mostly working with.) Tension explains a wide variety of plots that none of the emotionally based mechanisms can. For example, the preponderance of plots in my TV and movie database are procedurals which are not organized around a single character's rise and fall; instead, they proceed from crime to punishment, from disease to cure, or from acquisition to sale.

I have no idea how to define "tension." You could do it through Mechanical Turk, I guess. But what's really interesting is that we may be able to define it operationally. What sort of events in texts demand resolutions? What distinguishes beginnings from ends? These are more unsupervised questions than ones about emotional trajectories, and ones that might provide us with much more interesting questions to build on as well as answers.

Nature publishes flat-earth research paper

2016-07-05T11:37:00.002-04:00

I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing "center of population" calculations made on the basis of a flat earth.

"Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000" by Reba et al presents what seems like a fairly useful digitization of two books that give fairly speculative historical estimates of city populations. But as part of the writeup, it includes the following chart calculating "Global mean centers" of population as they have shifted through history: starting somewhere near Baghdad in 1000 BCE, to heading northwest for the next 2500 years, pushing west almost to Izmir by 1900, and then veering sharply south in the 20th century into the Sahara.

"Center of population" calculations are something I've become very interested in lately. The earliest I've seen is Moses Greenleaf's calculation of the shifting center of population of Maine in his 1828 atlas; from 1870, they were laboriously calculated at the US Census bureau for the full population and a large number of regional and ethnic subgroups. (If anyone knows a pre-1900 example of center of population calculations happening outside the United States, I'd love to hear it.)

It didn't really matter for Greenleaf on a scale as small as Maine, but census cartographers were aware of the challenges of calculating centers on a round earth. In the early twentieth century they weighted calculations by miles instead of lines of longitude; more recently they use trigonometry to weight the calculations. But they knew that simply averaging latitude and longitude would misrepresent the sphere. These problems are even worse for the whole globe at once. There hasn't been much serious work put into calculating the global center, since we don't have global population data at the granularity of the US census's; but any reasonable approach would have to begin by assuming the earth was a sphere.
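The gap between averaging coordinates and treating the earth as a sphere is easy to see with two points. The sketch below is an illustration under my own assumptions, not the Census Bureau's procedure or the paper's: the points are made up, and `spherical_center` takes the weighted mean of 3D unit vectors and projects it back to the surface, which is one common way to define a center on a sphere.

```python
import math

def naive_center(points):
    """Weighted average of raw latitude and longitude (the flat-earth shortcut)."""
    total = sum(w for _, _, w in points)
    mean_lat = sum(la * w for la, _, w in points) / total
    mean_lon = sum(lo * w for _, lo, w in points) / total
    return mean_lat, mean_lon

def spherical_center(points):
    """Weighted mean of 3D unit vectors, projected back onto the sphere's surface."""
    x = y = z = 0.0
    for la, lo, w in points:
        phi, lam = math.radians(la), math.radians(lo)
        x += w * math.cos(phi) * math.cos(lam)
        y += w * math.cos(phi) * math.sin(lam)
        z += w * math.sin(phi)
    center_lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    center_lon = math.degrees(math.atan2(y, x))
    return center_lat, center_lon

# Two made-up, equally weighted points on either side of the date line.
points = [(40.0, 170.0, 1.0), (40.0, -170.0, 1.0)]
print("naive:    ", naive_center(points))      # lands near longitude 0
print("spherical:", spherical_center(points))  # lands near longitude 180
```

The naive average puts the center on the opposite side of the planet from both points; any flat projection, cut wherever its designer chose, produces some version of the same problem.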

What's particularly baffling about the Nature map here is that although they do assume a flat earth, they don't make the easy mistake of assuming a rectangular one. Instead, according to the caption above, they calculate their centers of population using the Goode Homolosine projection. It's not immediately recognizable in their map, so here's a version from Wikipedia:

The homolosine projection is great for thematic global mapping because it preserves equal area without distorting the shapes and north-south orientations of local land areas too much. It does this at the cost, though, of several gigantic cuts through the oceans and Greenland; these make it singularly inappropriate for a center of population calculation. For example, Tokyo is just about due north of Adelaide in real life: but because Goode chose splits that would pull Japan closer to the Eurasian mainland, in this projection it ends up closer to Perth. Eastern North American cities sit at nearly double the real distance across the Atlantic Ocean.

And obviously, there's no particular reason that Africa should be in the middle of the map. The Americas could be east of Asia instead of West of Europe. The interrupted homolosine projection I came up with to map shipping routes (below) splits Western Europe from the Near East; it would probably put the center moving from Iran due westward into the Pacific. And that's not even to get into maps that move the equator off the center of the map.

So why are they using a homolosine projection at all? I don't want to put too much more thought into this than they did. But it must be some combination of them acknowledging the problems with the Mercator and/or equirectangular projections, while just wanting to get along with center of population calculations that show the march of population. So they use the first equal-area projection that comes to hand, and assume it's good enough to show centers of population. Which, honestly, it is; centers of population are such a vaguely-defined thing that there isn't really any harm in presenting them in whatever light you want.

But it's still kind of a joke on all of us that we inhabit a scholarly ecosystem where a data publication has to be accompanied with lots of explanatory text and diagrams to seem respectable, but in which no one cares if you demonstrate that without worrying about the shape of the globe. (And as Seth Denbo points out, one of the first applications of the new set was an animation that places Moses leading the Israelites from Egypt as just another precisely dated historical migration).

Literary Dopplegängers and interestingness

2016-05-30T16:37:00.001-04:00

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you'll encounter them eventually. But instead let me just get to the point: here's a trite new category of analysis that wouldn't be possible without distant reading techniques that produces sometimes charmingly serendipitous results.

I'll call it dopplegänger books. A dopplegänger is, for any world-historically great work of literature, a book that shares many of the same themes, subjects, and language, but is comparatively obscure, not widely read, and--most likely--of surpassingly mediocre quality.

Edit: Ryan Cordell informs me privately and regretfully that I'm wrong in some of my conclusions here. I said, "I took a grand total of one English literature class in college; does anyone expect me to be right?" But he's worried that my wrongness might reflect poorly on the field of DH, which has a history of critics straw-manning offhand blog posts into terrible representatives of the field. So let me say up front: Persons attempting to find an argument in this post will be prosecuted; persons attempting to find political advocacy in it will be banished; persons expecting me to have anything above a high-schooler's knowledge of English literature will be shot.

Take Huck Finn. In hazy recollection (I haven't read the whole book in probably 10 years), much of what seems great about it is the purely American picaresque of a vision of America. Twain's interest "is in the boy in whose mouth he puts the story, and in this boy's view of the world as it passes under his eye." Huck "is a true child of the river," and gives us a view of America seen through the eyes of "a perfect vagabond of a youngster, wandering up and down the river at his will, taking in the passing show with open mind, finding it all for to admire."

All those quotes, as you may already have guessed, are not describing Huck Finn at all, but instead come from a review of Charles Stewart's Partners of Providence (1904).

Read the book online through Hathi

The table of contents is pretty fascinatingly close to Huckleberry Finn; the reviewers note the comparison, and it's hard to imagine that the tale of a young boy's adventures up and down the river with an entertaining ethnic (here Irish) sidekick past swindlers and exhibitions and perils wasn't somehow modeled on the most famous humorist in the country.

But there are surely differences as well; I wouldn't be surprised if an in-class discussion on the racial politics of Huckleberry Finn could benefit from a brief comparison to Partners' account of "the marooning and subsequent escape of a pair of pugnacious darkies."

Across the c. 4.5 million public domain volumes in the Hathi Trust, there are a surprising number of these, many books that seem (based on Google searches) to languish in deserved obscurity. (I've got a set of tricks that actually make finding the pairings more feasible than running 20 trillion pairwise comparisons, but the exact mechanics of that are for another day). But they're interesting; not in a "distant reading" way, but in that they provide some greater focus around the core texts we all read already.
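For readers who want a feel for what a nearest-neighbor search over vocabulary looks like, here is a toy sketch. It is emphatically not the engine behind this post, whose mechanics aren't described here: the cosine distance over raw word counts, the function names, and the two "books" are all assumptions chosen for illustration. It also checks where the 20 trillion figure comes from.

```python
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    """1 - cosine similarity between simple word-count vectors (non-empty texts)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, library):
    """Brute force: compare the query to every title. Fine for toys, not for millions."""
    return sorted((cosine_distance(query, text), title) for title, text in library.items())

# Toy stand-ins, not real book texts.
library = {
    "river book": "the boy drifted down the river on a raft",
    "whale book": "the captain chased the whale across the sea",
}
print(nearest("a boy and a raft on the big river", library))

# Why brute force doesn't scale: every volume compared against every volume.
volumes = 4_500_000
print(f"{volumes ** 2:,} ordered comparisons")  # 20,250,000,000,000
```

In this toy version lower scores mean closer matches, and the list comes back sorted with the nearest title first.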

So let me just plug a few books in here and see what comes back. My criteria are just that the original book be canonical.

Huckleberry Finn

Twain is closest to himself; Huck Finn is closer to the later Tom Sawyer books than to Tom Sawyer itself, which should perhaps not be surprising.

But nearest-neighbor searching also reveals a deep vein of western boys literature. We know that this exists; the interesting questions here would probably involve the specific ways (especially dialect: these are mostly first person narratives in highly vernacular styles) that writers imitate Twain.

Publication years also provide a point of departure. All the books here were written substantially later than Huck Finn except for "Live boys in the Black Hills." So if I were going to pick any up, maybe I'd start there.

0.628 Danny's own story, (1912)

0.662 Mr. Pratt, a novel, (1906)

(and 4 nearly identical books): 1 2 3 4

0.665 Lige Mounts: free trapper, (1922)

(and 3 nearly identical books): 1 2 3

0.679 Jim Hands / (1911)

(and 1 nearly identical books): 1

0.681 Swatty; a story of real boys, (1920)

(and 1 nearly identical books): 1

0.683 Mark Tidd in the backwoods, (1914)

0.683 Live boys in the Black Hills, or, The young Texan gold hunters : a narrative in Charley's own language, describing their adventures during a second trip over the great Texas cattle trail ... (1880)

0.685 Peace in Friendship Village, (1919)

(and 2 nearly identical books): 1 2

0.689 Billy Fortune, (1912)

Moby-Dick

This has fewer straightforward imitators; but the whaling novel is a perfectly well-represented genre.
The closest match is the romance "The Red Eric; or, The whaler's last cruise. A tale" from 1883. Some elements of the contents are provocative, at least; but the similarities are less than perfect. (Red Eric's captain's "insane resolution" is to bring his daughter on a whaling cruise with him, for example).

A few other options include a collection of sea stories,

0.615 Round the gallery fire / (1914)

0.616 A Bounty boy: being some adventures of a Christian barbarian on an unpremeditated trip round the world, (1912)

0.616 Sea-wrack, (1903)

(and 1 nearly identical books): 1

0.618 Old Jack, a man of-war's man and South-Sea whaler, (1859)

0.620 The cruise of the Cachalot round the world after sperm whales / (1911)

(and 3 nearly identical books): 1 2 3

The Cruise of the Cachalot and Sea-wrack, by Frank Bullen, offer some of the more interesting comparisons. Properly shuffled, it makes sense that Moby Dick's closest companions might include not literature at all, but piecemeal miscellanea from the magazines like this ("Sea-Wrack").

Middlemarch

Middlemarch is somewhat harder to find close matches for uninteresting reasons: since the novel is so long, it was frequently chopped into 2, 3, or 4 parts; and each one of those sections ranks highly on the list.

0.736 Hannah. (1890)

0.764 A brave lady. (1870)

0.770 Fraternity; a romance ... (1910)

The nearest novels are by Dinah Craik, who I don't know, but who seems well enough established as a poor man's George Eliot in the scholarly literature. (Googling quickly brought me to the online version of Sally Mitchell's monograph on the author.) "Hannah", the closest, is characterized by Mitchell as "a one-issue novel with a narrow legislative aim."

Fraternity; a romance ... (1910) is a harder nut to crack. It's a rural novel set in Wales and published by Macmillan around 1888, but the only surviving digital copy was (according to library metadata) published in the United States in 1910. (Galsworthy's 1911 novel Fraternity further muddies things here.) It's the subject of a strikingly positive review in the Boston press that explicitly casts it as a diamond in the rough.

I was going to let it go there, but then discovered a whole separate track via this book. The author is one Miss M. M. Holland Thomas, and the novel somehow attracted the intense admiration of JP Morgan for its message of social reform through benevolent patronage. (It is Morgan who paid for the American reprint in 1910.) Does this story have anything to do with a similarity to Middlemarch? Hmm. There's definitely something here about the connections between the English social novel and political intentions. But beyond that, I couldn't say.

The Education of Henry Adams

The absolute closest match is his brother's autobiography. Which should surprise no one, and I'm sure I've encountered the book before. "Early Memories" by Henry Cabot Lodge is also high on the list, which is probably a decent choice as well. But I'll pick as the dopplegänger Cambridge Sketches by Frank Preston Stearns, which hits a number of the same points.

0.586 Charles Francis Adams, 1835-1915; an autobiography; (1916)

(and 12 nearly identical books): 1 2 3 4 5 6 7 8 9 10 11 12

0.600 Studies of men / (1895)

(and 4 nearly identical books): 1 2 3 4

0.606 Cambridge sketches (1905)

(and 2 nearly identical books): 1 2

0.608 Early memories, (1913)

(and 3 nearly identical books): 1 2 3

0.609 Charles Sumner, (1892)

(and 3 nearly identical books): 1 2 3

0.610 History of the United States of America. (1889)

(and 5 nearly identical books): 1 2 3 4 5

0.610 Life and letters of Edwin Lawrence Godkin; (1907)

(and 4 nearly identical books): 1 2 3 4

The Souls of Black Folk

A real genre-bender of a book, even more than Moby Dick. And even less often reprinted.

The closest match is a fairly dull-seeming hagiography of Booker T. Washington. But I'll take as a shadow "Up stream: an American chronicle" by Ludwig Lewisohn. It seems to be the personal memoir of a German-born Jew who grew up in Charleston, SC before attending Columbia and (eventually) becoming a founding faculty member at Brandeis. The grounds for similarity aren't entirely clear--perhaps some odd combination of self-recognition, music, and the South?--but that's what makes it an interesting track. Some of the

Autobiography of an ex-colored man

On the topic of great Af-Am literature. This one was suggested to me as a candidate by John Reuland. For this one I'm pasting in a longer list of matches, because we were initially very disappointed at the results. (Very little African American literature on the list).

But on looking at the list, what there is is an extraordinary amount of autobiographical self-help literature about money. So maybe there's some lesson to be gleaned there.

0.610 A victorious defeat; the story of a franchise, (1906)

0.612 Banner bearers; tales of the suffrage campaigns, (1920)

(and 2 nearly identical books): 1 2

0.619 Not angels quite. (1893)

0.619 In paradise : a novel, from the German of Paul Heyse. (1878)

(and 2 nearly identical books): 1 2

0.620 Years of experience; an autobiographical narrative. (1886)

(and 1 nearly identical books): 1

0.621 The "goldfish" : being the confessions of a successful man. (1921)

(and 7 nearly identical books): 1 2 3 4 5 6 7

0.623 Philip Gerard : an individual / (1899)

0.624 Courtship under contract : the science of selection, a tale of woman's emancipation / (1910)

(and 1 nearly identical books): 1

0.624 Of one blood / (1916)

0.625 The Lawton girl, (1897)

(and 1 nearly identical books): 1

0.625 My threescore years and ten : An autobiography / (1892)

(and 2 nearly identical books): 1 2

0.626 The works of Charles Dickens ... (1898)

(and 8 nearly identical books): 1 2 3 4 5 6 7 8

0.627 A man of millions / (1901)

0.628 The writings of Mark Twain. (1899)

(and 6 nearly identical books): 1 2 3 4 5 6

0.629 Jacob Schuyler's millions. A novel. (1886)

0.630 The £1,000,000 bank-note, and other new stories, (1893)

(and 2 nearly identical books): 1 2

0.630 Bubble reputation : a story of modern life / (1906)

(and 1 nearly identical books): 1

OK, that's enough.

Portrait of the Artist as a Young Man

Again, the matches aren't as clear; a vocabulary-based approach like mine works best with thematically distinct themes like riverboats, not with "childhood."

There are some vaguely interesting similarities: at #3, I particularly like "What to read at Winter Entertainments," in which it appears the closest antecedent to Joyce is a stuffed-together hodgepodge of great British writers from the 19th century. Sounds about right.

But as a dopplegänger, I'll take Shaw Desmond's Gods, which seems to cover similar places in the Irish experience of the early 20th century.

On Interestingness

I've been thinking about Ted Underwood's "old-fashioned, shamelessly opinionated, 1000-word blog post" from yesterday. There are parts I wholeheartedly agree with, such as the section where he dances near to, but decorously avoids citing, Kieran Healy's magnum opus on what calls for nuance do in contemporary academic discourse. There are parts I don't; I'm increasingly convinced that efforts to apply and invent novel algorithmic practices should be fully central to the work of some humanists, and that calls to return to the primary questions of the disciplines are not just premature but somewhat misguided.*

(Roughly, although I should boil this up into a richer stew at some point: very few people outside a philosophy department think that only academic philosophers should do philosophy; very few people *inside* history departments think that only academic historians should do history. Just as we let political philosophy flourish in politics departments and cultural history flourish in art and music departments, computer programming shouldn't be the sole province of computer science departments.)

Is this interesting? I'm not sure. It's not here-I-come-PMLA interesting, for sure. But then again, I've never deliberately sought out much contemporary literary history written since 1980 or so. For a certain sort of Arnoldian prudish conception of literature, I kind of like the game. Much like my anachronism-searching blog posts, it's a field-and-context approach to literature where the whole is not treated as the object of study itself (the stated purpose of much "distant reading") but as a conveniently large wall on which to reposition the works of literature we're already interested in. What that means for literary history, I think I'm under no professional obligation to say.

Bonus links

A little bonus for those who read through to the end; a temporary link to a live interface to the engine I used for this thing, so you can play along at home. Just go to http://benschmidt.org/similarities/ and you can paste in any text you're interested in. Terms and conditions are: don't link to that page, because this may not scale; and e-mail me or post in the comments if you find any terrible bugs or interesting matches.

Word embedding models

2015-11-03T10:25:00.002-05:00

A heads-up for those with this blog on their RSS feeds: I've just posted a couple things of potential interest on one of the two other blogs (errm) I'm running on my own site.

One, "Vector Space Models for the digital humanities," describes how a newly improved class of algorithms known as word embedding models work and showcases some of their potential applications for digital humanities researchers.

The other, "Rejecting the gender binary," is a more substantive look at how the method can help us better imagine a version of English without gendered language through some tricks of linear algebra; that results in a sort of translation dictionary between the way students talk about men and the way they talk about women.

I'm aware that this blog is sort of twisting on the vine right now. I like the politics of not using Google, and the ability to embed real javascript that comes with not using Blogger. Perhaps the humane thing to do would be retire this site and direct you to http://benschmidt.org/posts/ and http://bookworm.benschmidt.org instead. But I like keeping it around, and will probably come back here next time I have something to say about, say, the hilariously inadequate college rankings the Economist just published, or just to link other stuff.

State of the Union--and corpus comparison.

2015-01-19T14:47:00.001-05:00

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

The first part is an interactive map with every place name we could find using the Stanford Natural Language Toolkit and some (Fraas-flavored) elbow grease. Then we got two great historians of American foreign policy, Dael Norwood and Gretchen Heefner, to explain some of the things in the maps.

The second is about individual words presidents use. So the recent rise in "Freedom," the references to the Constitution predominantly in the time of crisis, and so forth.

My favorite feature, and one that the Atlantic team executed beautifully, is the deep access into individual texts: click on a circle or a bar, and you are off reading the actual paragraph from the State of the Union that uses that word or mentions that place. This has always been a core feature of Bookworm on various levels--by treating paragraphs as documents for the modelling, it's easy to drill straight to the interesting stuff. One thing that's mostly missing is the Ngrams-style line chart. I've been saying for a while that I hope people see Bookworm enabling other forms of visualization. These pages are a great example of that; maps and bar charts of words are just as engaging, and sometimes things like "presidents" and "the world" are more engaging than individual years.

So go check those out. They speak for themselves.

But for the text analysis crowd, I also wanted to tell you a little more about the link down right at the bottom of the second (words) piece, and get a little technical about why that page, although we decided not to include it on the Atlantic site, contains the germ of something I find pretty interesting for online text analysis in general.

That page gives you an in-browser live corpus comparison using Dunning Log-Likelihood between the complete speeches of any two presidents. Click on any word on either chart and you get a bar graph showing which presidents used it the most. (I'm obligated to Isabel Meirelles for pointing out to me that this was the best approach over some awful wordcloud thing I had originally).

This visualization is a little more inside baseball--but for those who know either presidential history or text analysis, it should be pretty interesting. These aren't pre-compiled numbers: every search runs the Dunning keyword analysis anew. That means that although this front end is strictly about comparing presidents, you can use it for all sorts of other questions. I used it myself, in fact, to find interesting words to graph: I settled on "freedom" as one of my blurbs under the chart after rewriting the display to show words that showed marked differences between Republican and Democratic presidents since 1960.
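For those who haven't met it, the Dunning statistic is simple enough to sketch in a few lines. What follows is a minimal illustration using the common two-term form of the log-likelihood (G2) score on made-up snippets; the function names are hypothetical, and this is not necessarily how the Bookworm API computes it.

```python
import math
from collections import Counter

def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word's counts in two corpora."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2

def compare(text_a, text_b):
    """Score every word; positive means overused in A, negative overused in B."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    total_a, total_b = sum(a.values()), sum(b.values())
    scores = {}
    for word in a.keys() | b.keys():
        g2 = dunning_g2(a[word], total_a, b[word], total_b)
        overused_in_a = a[word] / total_a > b[word] / total_b
        scores[word] = g2 if overused_in_a else -g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up snippets standing in for two presidents' complete speeches.
speech_a = "freedom freedom freedom and the union"
speech_b = "the constitution and the union and the constitution"
ranked = compare(speech_a, speech_b)
print("most distinctive of A:", ranked[0])
print("most distinctive of B:", ranked[-1])
```

Because the score needs only four counts per word, swapping in a different pair of groups (party, era, term in office) means changing which texts get summed, not the statistic.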

I find this exciting--and I think you should too--because as I've been saying for a while, comparison is the most underused tool in the toolkit of the digital humanities. A lot of pre-2010 work did fascinating comparative work. (The MONK interface for Dunning comparisons on TEI-annotated documents, for example). This is a proof-of-concept in a very similar space, but implemented inside quite a rich environment for text analysis that makes it easier to adopt the same tools for very different applications.

People who topic model talk about the potential to generate insights. I believe them. But I also think that in the presence of rich metadata, good comparison metrics (which Dunning approaches) can be far, far more productive. For the State of the Union, there are all sorts of useful comparisons to make: president vs. president, Republican vs. Democrat, lame duck vs. recently elected, opposition congress vs. friendly crowd... And for every other corpus, there are just as many. We currently treat these kinds of analytics as things that should be run client side, requiring individuals to obtain digital texts (frequently impossible) and install and run some tools for corpus comparison (a high barrier to entry.) But libraries and other content holders can--and I would argue, should--support these things as a form of exploration out of the box.

Currently the richest tools for building comparisons like these in Bookworm are still locked away on the developer side of the matrix. (Though it really only requires tweaking a few javascript variables on an existing template.) In part, this is deliberate; comparison metrics like this involve much more processing and much more data transfer than simple search techniques. Poorly designed, they can bring down a server. (Please don't abuse mine). But we offered search in the 1980s, when computers were nowhere near so powerful as today; it's certainly realistic to serve this kind of thing now. Unlike topic modeling, say, it is realistic to offer them as a server side project; particularly when that server sits in a library. If you're a developer-librarian yourself, you can download the bookworm repos and get these sorts of comparisons running pretty easily on any metadata categories that you have. (In addition to Dunning Log Likelihood, the Bookworm API now supports TF-IDF as a comparison statistic, though both are probably going to have some changes in their execution methods. I think there's some room for improvement here.)

But where I see this ultimately going is towards real-time, fully customizable in-browser comparison across any facets of a corpus as a service libraries and other content providers can easily offer on medium-sized (c. 20,000 documents) corpora. And "ultimately" here is really quite close. There are challenges of scale at the Hathi level, but there may be ways to address that through random sampling. And for smaller corpora, individuals and institutions could start spinning up Bookworm instances running these things and developing their own interfaces right now.

Federal College Rankings: The pitfalls of a magical regression model

2014-12-30T15:43:00.002-05:00

Far and away the most interesting idea of the new government college ratings emerges toward the end of the report. It doesn't quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, race, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the guise of the student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early marriage. I don't know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they're opening the vault just for these college ranking purposes.

The challenges raised to the rankings in the report are formidable. Whether you think they can work depends on how much faith you have in the model. I think it's likely to be dicey for two reasons: it's hard to define "success" based on the data we have, and there are potentially disastrous downsides to the mix of variables that will be used as inputs.

I should say, by the way, that a lot of details about the model are unclear; it looks for the moment like the standard economist's trick of throwing every variable they have into a linear regression and hoping for the best. But there are interesting possibilities here. Is it going to be a true multilevel model allowing variable coefficients to vary by school? That would let us know if, hypothetically, Ole Miss tends to depress the post-graduation performance of African American students while being a great place for whites, or if Harvard Business School adds more value for men than for women. That would open up the prospect of truly personal college rankings. It might also enable all sorts of Title IX suits. But given the data and the ease of doing this kind of thing, I suspect someone will at least run it in their testing phase: it might be worth getting some subpoenas ready.

How will the rankings define success? It seems to be through a combination of several factors including graduation rate, cost, loan debt, and income years after graduation. None of these are especially new, and all have their problems. (Spelled out in the preliminary report.) College presidents like Drew Faust are right to worry that the particular variables being chosen constitute a bit of a federal nudge to train students vocationally and for the short term, which is directly contrary to the mission of much higher education. We don't yet know exactly how these statistics might be gamed, but it seems likely that colleges may put comparatively too much emphasis on helping students find a job by a certain date. I believe that one key component in the quick ascent of my university, Northeastern, in the US News rankings was convincing the magazine not to use 4-year graduation rate as a major component, since the typical Northeastern student takes five years to graduate with two 6-month job placements. A federal ranking will likely inadvertently punish schools that deviate from the norm even when there are good reasons.

But where things really get dicey is in the factors that are used to *offset* student success. I previously worried that rankings which used earnings measures would punish schools or disciplines with many women. A massive regression model will eliminate that concern, but will produce really strange and uncertain results.

As the coefficients vary, individual schools will shoot up and down in the rankings based on their demographic profiles. This is likely to be a quite unstable ranking from year to year, which is an unreservedly bad thing--it will encourage deans to make radical shifts and pinpoint turns on the basis of the slightest evidence. This holds promise only for the sort of people who use the word "disrupt" too much.

Most dispiriting about the tone of the report for me is the implicit optimism that they have the data to solve all of the problems of varying group performance. Even if they solve race and gender, a whole slew of other factors will persist. I joked on Twitter that the "moneyball" dean should summarily reject short people and twins from admission, since their lifetime earnings tend to be lower; although that's obviously foolish, plenty of universities and groups will suffer under a regression-based ranking.

For instance, schools with a large Asian-American population will be expected by the model to perform extremely well. But it's well known that certain Asian demographic groups don't share in the benefits. These subgroups tend to be geographically concentrated, so it's a fair bet that the University of Minnesota, say, will appear to be doing a much worse job than it actually is because the model expects its Hmong-American students to perform as well as UCSD's Chinese-American ones.
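The mechanism in that example can be made concrete with a toy calculation. This is a sketch under loud assumptions: the subgroup labels, earnings figures, enrollments, and the `school_score` function are all invented for illustration, and the report does not specify its model in anything like this detail. The only point is that when a model sees one coarse category, a school is scored against the national average for that category, whatever mix of subgroups it actually enrolls.

```python
# All numbers below are invented for illustration; none come from the report.

# Hypothetical national mean earnings for two subgroups that the model
# lumps together under a single coarse demographic category.
SUBGROUP_MEAN = {"subgroup_1": 60000, "subgroup_2": 40000}
NATIONAL_COUNTS = {"subgroup_1": 900, "subgroup_2": 100}


def weighted_mean(values, counts):
    """Mean of `values`, weighted by the head counts in `counts`."""
    total = sum(counts.values())
    return sum(values[group] * counts[group] for group in counts) / total


# What a model that only sees the coarse category expects of every student in it.
COARSE_EXPECTATION = weighted_mean(SUBGROUP_MEAN, NATIONAL_COUNTS)


def school_score(enrollment, value_added):
    """Actual mean earnings minus the coarse model's expectation.

    `value_added` is the hypothetical true boost the school gives each student.
    """
    actual_by_group = {
        group: SUBGROUP_MEAN[group] + value_added for group in enrollment
    }
    return weighted_mean(actual_by_group, enrollment) - COARSE_EXPECTATION


# Two hypothetical schools that add exactly the same value,
# but enroll different subgroups of the same coarse category.
school_x = school_score({"subgroup_1": 100}, value_added=1000)
school_y = school_score({"subgroup_2": 100}, value_added=1000)
print(school_x, school_y)
```

Both made-up schools add the same amount for every student, yet the first lands above the model's expectation and the second far below it, purely because of which subgroup each one enrolls.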

There are also obvious questions about what to include in the model. It won't include, for instance, a for-profit/not-for-profit flag, because if students attending for-profit schools do worse, that should reflect poorly on those schools. But should there be a public/private flag? This is less clear. Perhaps thorniest of all is the issue of accounting for degree composition. People training to be nurses make more than people training to be clergy. But the report vacillates, for understandable reasons, on whether degree mix should be included in the model. Most politicians, President Obama included, tend to like the idea that a model like this might give an extra nudge to schools to eliminate their art history major. (His stereotype, not mine.)

But that's the real problem. Including all these factors is a double-edged sword: although it means that the rankings will be more fair to socially disadvantaged groups and schools serving those populations, it also brings a new form of gaming into play. Although people love to complain about the distortionary effects of the US News ranking, the effects are relatively innocuous compared to what they could be. Sure, it's absurd that Princeton probably deliberately accepts underqualified applicants to improve its yield or that default enrollment caps at Northeastern are 19 and 49 students to fall just short of the minimum class sizes: but those effects aren't especially pernicious. US News doesn't include all that many variables in its list, so really evil discriminatory practices in gaming the rankings aren't as common as they could be.

If this federal ranking is adopted, it could unleash all sorts of new problems precisely because the data being used is so much richer. Depending on the exact calibration of the model, it may quickly become apparent that it makes sense to discriminate against or reward all sorts of classes (wealthy African Americans, for example). And depending on what isn't included, it may be obviously beneficial to start shuttering or discouraging enrollments in the liberal arts. It will take the actual introduction of the ranking for all the ingenuity of the managerial class to be deployed to reveal the ways that it can be gamed.

Federal college rankings: who are they for?

2014-12-30T15:32:00.002-05:00

Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it's probably better to bury that and instead point out a couple of looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I'll lay out my "constructive" solution for salvaging something out of this, which is that rather than use a three-tiered "excellent" - "adequate" - "needs improvement" scale, everyone would be better served if we switched to a two-tiered "Good"/"Needs Improvement" system. Since this is sort of long, I'll break it up into three posts: the first is below.

The first is that the draft document is unclear at its core about who the audience for these rankings is going to be. In the imagination of the authors, it seems most often to be a family sitting around their kitchen table, deciding which college offers the best "value." The report has the germ of a ranking system designed to assess that value--using really extraordinary data from every student who's ever taken out a loan to build a large regression model predicting "success." (More about that below.)

As a ranking for prospective students, it's possible to imagine something useful coming out of this. The vision driving it seems to be, roughly, an antidote to US News and World Report that doesn't include endowment size or library or alumni giving or any of the things designed to keep the same five schools at the head of the ranking. It's essentially the federal government adding its voice to the massive chorus saying "for God's sake, don't pay full tuition at USC when you have in-state at UCLA." Looking at the criteria, it seems like we'll probably get a school like the Air Force Academy (mixed socioeconomic applicants, no tuition/debts, universal employment) as #1 overall instead of Yale. It wouldn't be bad to have a ranking out there like that, so long as we're clear that it's just about some minimum threshold of employability and postgraduate debt.

But there's another set of language that keeps creeping in, about "rewarding" colleges for the good they do. This probably comes out of the constituent meetings. The greatest beneficiaries are a different audience, of deans, who want to protect the socially valuable parts of the mission from the econometricians. (Or, more cynically, who want more metrics to demonstrate the successes of their initiatives in the cover letter for their next deanship.) So the report spends pages on the boost colleges will get for having large numbers of Pell Grant awardees enrolled, because it would be unfair to punish a college for having a large number of low-income or first-generation students. In fact, a number of these people, including myself, have been screaming bloody murder about the ways that a purely value ranking would enforce existing socioeconomic disparities because of things like the gender wage gap.

The goal of this audience is to use the rankings as tools to make colleges behave more the way that Education Department stakeholders think they should. If Harvard had more poor students, the country would be better off. Therefore, we should change the rankings to reward Harvard for having fewer rich kids.

The problem is that these two goals are deeply irreconcilable. No individual student (at least not the rational-actor, wage-oriented automaton that the report assumes all prospective applicants are) benefits from those systemic alignments. If low-income Hunter College graduates do better than Lehman College graduates but Lehman has more low-income students, it would be irresponsible for the government to tell students to go to Lehman just because its mission is more socially important. But that's one of the things the proposal mulls over. (This is in addition to using income metrics as inputs into the magical model, below.)

The eagerness to build "movement" into the model is similarly baroque. Any complex system will show statistically significant movement from year to year; but to flag those movements in the ranking with the implication that they're likely to continue is a disservice both to students and to the next set of deans, who will have to keep up whatever bubble of juked stats the previous ones set in motion, whether it benefits their mission or not.

It seems likely that this tension will eventually tear the reports apart into two completely separate rankings--one for students, and one for administrators. (The cleavage is already taking place in the report.) The student one will then have all of the flaws of "punishing" socially useful institutions that the report worried about; and to cover those up, they're already insisting that this is a "rating," rather than a ranking. My suspicion is the ultimate release will be so occluded in variable measures that it will fall like a lead balloon. Or, to use a more relevant metaphor, it will fall like the 2010 NRC graduate program rankings. Those were smothered in caveats, regression models, and error bands. Any sensible consumer will turn right back to the regularly-updated US News graduate program rankings, done with a straight reputational survey. Which is, by the way, far from the worst way to handle these things. While they tend toward conservatism, reputational surveys are much harder to game than data-based ones, and much more understanding of particular institutional dynamics. And conservatism in ranking isn't a bad thing. If I remember correctly, the NRC reports ultimately scored schools (on one of the two metrics) not on their reputations but on what their predicted reputations would be based on their statistical profile: so if you ran a highly regarded program at a university with a tiny library, the ranking algorithm nudged your score back down.

But if these reports do somehow produce a ranking that manages to be clear, it will inevitably be misleading one of the two groups. Perhaps the intention of the department is to fudge the ratings just enough that they'll still look credible to students, while also nudging Harvard to admit more poor students. But this is a difficult balancing act to pull off, and it relies on a complicated ranking model that could put the distortionary effects of the US News ranking to shame. So that's what the next post is about.
