Here's a little irony I've been meaning to post. Large-scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a "book" is in ways that make the selection criterion for Ngrams—if it made it into print, it must have some significance—completely meaningless.
So as interesting as Google Ngrams is for all sorts of purposes, it seems it might always end right in 2008. (I could have sworn the 2012 update included through 2011 in some collections; but all seem to end in 2008 now.)
Lo: the Ngram chart of three major publishers, showing the percentage of times each is mentioned compared to all other words in the corpus:
Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Showing posts with label Ngrams. Show all posts
Thursday, April 3, 2014
Friday, July 15, 2011
Moving
Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:
Wednesday, April 13, 2011
In search of the great white whale
All the cool kids are talking about shortcomings in digitized text databases. I don't have anything as detailed as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.
Friday, January 21, 2011
Digital history and the copyright black hole
In writing about openness and the ngrams database, I found it hard not to reflect a little bit on the role of copyright in all this. I've said before that 1922 is the year digital history ends; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is minuscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope they adequately characterize what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the last: let me give a bit of my reasoning why.
My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?
Thursday, January 20, 2011
Openness and Culturomics
The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.
Thursday, December 30, 2010
Assisted Reading vs. Data Mining
I've started thinking that there's a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I'd call them:
- Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low-level things like counting mentions, etc.
- Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.
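To make the "assisted reading" end of that distinction concrete, here's a minimal keyword-in-context sketch. The function name and the sample sentence are mine, purely illustrative:

```python
import re

def kwic(text, term, width=30):
    """Keyword-in-context: each occurrence of `term` with surrounding text."""
    lines = []
    for m in re.finditer(re.escape(term), text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        lines.append(text[start:m.start()] + "[" + term + "]" + text[m.end():end])
    return lines

sample = "The whale surfaced. Ahab watched the whale dive again."
for line in kwic(sample, "whale", width=12):
    print(line)
```

The point of a tool like this is that a human still does the reading; the computer just narrows down where to look.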
The distinction is important because the way we use texts is tied to humanists' reactions to new work in digital humanities. Ted Underwood started an interesting blog to look at ngrams results from an English lit perspective: he makes a good point in his first post:
Thursday, December 23, 2010
What good are the 5-grams?
Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:
- Ngrams are not complete; and
- Were they complete, they wouldn't offer significant computing benefits over reading the whole corpus.
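That second point can be made concrete: once you have the raw text of a pre-1922 book, rebuilding a 1-gram table is a few lines of code. A sketch with a deliberately crude tokenizer; the sample sentence is just for illustration:

```python
import re
from collections import Counter

def unigram_counts(text):
    """Crude 1-gram table: lowercase, split on runs of letters/apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

book = "Call me Ishmael. Some years ago I went to sea, and the sea was calm."
counts = unigram_counts(book)
# From here, bigrams, concordances, or collocations are equally easy to derive,
# which is exactly what the pre-packaged ngram counts can't give you.
```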
Sunday, December 19, 2010
Not included in ngrams: Tom Sawyer
I wrote yesterday about how well the filters used to drop some books from the ngrams corpus improve the quality of date metadata and OCR compared to Google Books as a whole.
But we have no idea what books are in there. There's no connection to the texts from the data.
I'm particularly interested in how they deal with subsequent editions of books. Their methodology (pdf) talks about multiple editions of Tom Sawyer. I think it says that they eliminate multiple copies of the same edition but keep copies from different years.
I thought I'd check this. There are about 5 occasions in Tom Sawyer where the phrase "Huck said" appears with separating quotes, and 11 for "said Huck." Both are phrases that basically appear only in Tom Sawyer in the 19th century (the latter also has a tiny life in legal contracts involving huckaback, and a few other places), so we can use them as a fair proxy for different editions. The first edition of Tom Sawyer was 1876: there are loads of later ones, obviously. Here's what you get from ngrams:
Three big spikes around 1900, and nothing before. Until about 1940, the ratio is somewhat consistent with the internal usage in the book, 11 to 5, although "said Huck" is a little overrepresented, as we might expect. Note:
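The check above boils down to two fixed-phrase counts and a ratio. A sketch; the excerpt is invented for illustration, not quoted from the novel:

```python
import re

def phrase_count(text, phrase):
    """Count non-overlapping occurrences of a phrase, ignoring case."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

# Hypothetical excerpt standing in for one edition's full text:
excerpt = ('"Looky here," Huck said. "No," said Huck. '
           '"Well," said Huck, "maybe so."')

ratio = phrase_count(excerpt, "said Huck") / phrase_count(excerpt, "Huck said")
```

Run against the real text of each candidate edition, a stable ratio near 11:5 is a decent fingerprint that the scanned book really is Tom Sawyer and not something else that happens to share a phrase.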
Saturday, December 18, 2010
State of the Art/Science
As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.
Some of the theories are deeply interesting. I really like the censorship stuff. That really does deal with books specifically, not 'culture,' so it makes a lot of sense to do with this dataset. The stuff about half-lives for celebrity fame and particularly for years is cool, although without strict genre controls and a little more context I'm not sure what it actually says; it might be something as elegiac as the article's "We are forgetting our past faster with each passing year," but there are certainly more prosaic explanations. (Say: 1) footnotes are getting more and more common, and 2) footnotes generally cite more recent years than does main text. I think that might cover all the bases, too.) Yes, the big ideas, at least the ones I feel qualified to assess, are a little fuzzier—it's hard to tell what to do with the concluding description of wordcounts as "a great cache of bones from which to reconstruct the skeleton of a new science," aside from marveling at the Brooksian Freedmanian tangle of metaphor. (Sciences once roamed the earth?) But although a lot of the language of a new world order (have you seen the "days since first light" counter on their web page?) will rankle humanists, that fuzziness about the goals is probably good. This isn't quite sociobiology redux, intent on forcing a particular understanding of humanity on the humanities. It's just a collection of data and tools that they find interesting uses for, and we can too.
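That footnote explanation is easy to simulate. Below is a toy model with made-up decay rates (nothing here is calibrated to the paper's actual data): main-text mentions of a year decay slowly, footnote citations decay fast, and merely raising the share of footnotes shrinks the measured half-life without any change in how fast anyone "forgets":

```python
# Toy model of the footnote confound. Mentions of a year in books published
# t years later mix main-text references (slow decay) with footnote
# citations (strongly recent-skewed). Decay rates are invented.

def mentions(t, footnote_share):
    main = 0.99 ** t          # main text forgets slowly
    footnote = 0.70 ** t      # footnotes cite mostly recent years
    return (1 - footnote_share) * main + footnote_share * footnote

def half_life(footnote_share):
    """First t at which mentions fall below half their t=0 value."""
    t = 0
    while mentions(t, footnote_share) > 0.5:
        t += 1
    return t

# A year discussed when footnotes are rare (10%) vs. common (60%):
print(half_life(0.1), half_life(0.6))
```

The second number comes out far smaller than the first, which is the whole point: a falling half-life is exactly what you'd see if footnoting habits changed and nothing else did.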
But it's the methods that should be more exciting for people following this. Google remains ahead of the curve in terms of both metadata and OCR, which are the stuff of which digital humanities is made. What does the Science team get?
Friday, December 17, 2010
Missing humanists
(First in a series on yesterday's Google/Harvard paper in Science and its reception.)
So there are four things I'm immediately interested in from yesterday's Google/Harvard paper.
- A team of linguists, computer scientists, and other non-humanists published a paper in Science about using Google data for word counts to outline the new science of 'culturomics';
- They described the methodology they used to get word counts out of the raw metadata and scans, which presumably represents the best Google could do in 2008-09;
- Google released a web site letting you chart the shifts in words and phrases over time;
- Google released the core data powering that site containing data on word, book, and page occurrences for various combinations of words.
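For anyone who wants to touch #4 directly, the raw files are big tab-separated tables. A sketch of reading them; the column layout shown in the comment is from memory of the first release and should be verified against the dataset's own documentation, and the sample rows are invented:

```python
import csv
from collections import defaultdict

# Each row of the raw 1-gram files is, roughly:
#   ngram <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
# (column layout from memory; check the dataset docs before relying on it).

def yearly_counts(lines, target):
    """Sum match_count per year for one ngram from raw TSV lines."""
    totals = defaultdict(int)
    for row in csv.reader(lines, delimiter="\t"):
        ngram, year, match_count = row[0], int(row[1]), int(row[2])
        if ngram == target:
            totals[year] += match_count
    return dict(totals)

sample = ["whale\t1851\t300\t290\t40",
          "whale\t1852\t120\t115\t22",
          "shark\t1851\t80\t79\t15"]
print(yearly_counts(sample, "whale"))
```

Divide each year's total by that year's corpus-wide word count and you have the percentage lines the web viewer charts.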
Twitter seems largely focused on #3 as a fascinating tool/diversion, the researchers seem to hope that #1 will create a burst of serious research using #4, and anyone doing research in the field should be eagerly scanning #2 for clues about what the state of art is—how far you can get with full cooperation from Google, with money to hire programmers, etc, and with unlimited computing infrastructure.
Each of these is worth thinking about in turn. Cut through all of it, though, and I think the core takeaway should be this:
Humanists need to be more involved in how these massive stores of data are used.