Digital Humanities: Using tools from the 1990s to answer questions from the 1960s about 19th century America.
Wednesday, April 13, 2011
All the cool kids are talking about shortcomings in digitized text databases. I don't have anything as detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes assume, a caveat we all need to keep in mind. I want to point out that it's not just at the margins that we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.
Monday, April 11, 2011
Age cohort and Vocabulary use
Let's start with two self-evident facts about how print culture changes over time:
- The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
- The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.
This might be a historical question, but it also might be a linguistics/sociology/culturomics one. Say there are two different models of language use: type A and type B.
- Type A means a speaker drifts on the cultural winds: the language shifts and everyone changes their vocabulary every year.
- Type B, on the other hand, assumes that vocabulary is largely fixed at a certain age: a speaker will be largely consistent in her word choice from age 30 to 70, say, and new terms will not impinge on her vocabulary.
- Type A: John Updike's Rabbit Angstrom. Rabbit doesn't know what he wants to say. Every decade, his vocabulary changes; he talks like an ennui-ridden salaryman in the 50s, flirts with hippiedom and Nixonian silent-majorityism in the 60s, and spends the late 70s hoarding gold and muttering about Consumer Reports and the Japanese. For Updike, part of Rabbit being an everyman is the shifts he undergoes from book to book: there's a sort of implicit type-A model underlying his transformations. He's a different person at every age because America is different in every year.
- Type B: Richard Ford's Frank Bascombe. Frank Bascombe, on the other hand, has his own voice. It shifts from decade to decade, to be sure, but 80s Bascombe sounds more like 2000s Bascombe than he sounds like 80s Angstrom. What does change is internal to his own life: he's in the Existence Period in the 90s and worries about careers; in the 00s he's in the Permanent Period and worries about death. Bascombe is a dreamy outsider everywhere he goes: the Mississippian who went to Ann Arbor, always perplexed by the present.*
This is getting into some pretty multi-dimensional data, so we need something a little more complicated than line graphs. The solution I like right now is heat maps.
An example: I know that "outside" is a word that shows a steady, upward trend from 1830 to 1922; in fact, I found that it was so steady that it was among the best words at helping to date books based on their vocabulary usage. So how did "outside" become more popular? Was it the Angstrom model, where everyone just started using it more? Or was it the Bascombe model, where each succeeding generation used it more and more? To answer that, we need to combine author birth year with year of publication:
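To make the two models distinguishable, the heat map's underlying data is a matrix indexed by author birth cohort and publication period. Here's a minimal sketch in Python (standing in for the MySQL/Perl/R workflow used on this blog), with invented counts, of how such a matrix gets built:

```python
from collections import defaultdict

# Hypothetical per-book records: (author_birth_year, publication_year,
# occurrences of the target word, total words in the book).
books = [
    (1800, 1840, 12, 90_000), (1800, 1860, 15, 80_000),
    (1820, 1860, 30, 90_000), (1820, 1880, 33, 85_000),
    (1840, 1880, 60, 95_000), (1840, 1900, 66, 90_000),
]

def cohort_matrix(records):
    """Sum counts per (birth cohort, publication period) cell,
    then convert each cell to frequency per million words."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for birth, pub, n, total in records:
        cell = (birth // 20 * 20, pub // 20 * 20)
        hits[cell] += n
        totals[cell] += total
    return {cell: 1e6 * hits[cell] / totals[cell] for cell in totals}

matrix = cohort_matrix(books)
```

Under the Angstrom model, frequencies would rise left to right within every birth-cohort row; under the Bascombe model, each row would stay flat while later-born rows sit uniformly higher.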
Sunday, April 3, 2011
Stopwords to the wise
Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured either for one set of users or for a mushy middle.
A particularly clear connection is from database structures to "categories of analysis" in our methodology. Since humanists share methods in a lot of ways, digital resources designed for one humanities discipline will carry well for others. But it's quite possible to design a resource that makes extensive use of certain categories of analysis nearly impossible.
One clear-cut example: ancestry.com. The bulk of interest in digitized census records lies in two groups: historians and genealogists. That web site is clearly built for the latter: it has lots of genealogy-specific features built into the database for matching sound-alike names and misspellings, for example, but almost nothing for social history. (I'm pretty sure you can't use it to find German cabinet-makers in Camden in 1850, for example.) Ancestry.com views names (last names in particular) as the most important field and structures everything else around serving those up. Lots of historians are more interested in the place or the profession or the ancestry fields in the census: what we take as a unit of analysis affects what we want to see database indexes and search terms built around. (And that's not even getting into the question of aggregating the records into statistics.)
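The point about indexes can be made concrete. Here's a hedged sketch (SQLite standing in for whatever ancestry.com actually runs, with an invented table layout and invented rows): the same census data serves genealogists or social historians depending entirely on which columns the indexes are built over.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE census (
    surname TEXT, given TEXT, birthplace TEXT,
    occupation TEXT, city TEXT, year INTEGER)""")

# A genealogy site indexes on names; a social historian would rather
# have place, time, occupation, and origin indexed instead.
con.execute("CREATE INDEX names ON census (surname, given)")
con.execute("CREATE INDEX social ON census (city, year, occupation, birthplace)")

con.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?, ?)", [
    ("Schmidt", "Georg", "Germany", "cabinet-maker", "Camden", 1850),
    ("Miller", "John", "Pennsylvania", "farmer", "Camden", 1850),
    ("Braun", "Otto", "Germany", "cabinet-maker", "Camden", 1850),
])

# The social-history question the post says ancestry.com can't answer:
rows = con.execute("""SELECT surname FROM census
    WHERE city = 'Camden' AND year = 1850
      AND occupation = 'cabinet-maker' AND birthplace = 'Germany'""").fetchall()
```

Without the second index, that query forces a full table scan; at web scale, that's the difference between a feature a site offers and one it doesn't.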
Friday, April 1, 2011
Generations vs. contexts
When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that somewhat to my surprise, they didn't have publication years for their works. (It's remarkable how much this sort of work is held back by metadata, rather than by the data itself.) They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.
I've been using 'evolution' as my test phrase for a while now: but as you'll see, it turns out to be a really interesting word for this kind of analysis. Maybe that's just chance, but I think it might be a sort of indicative test case--generational shifts are particularly important for live intellectual issues, perhaps, compared to overall linguistic drift.
To start off, here's a chart of the usage of the word "evolution" by share of words per year. There's nothing new here yet, so this is merely a reminder:
Here's what's new: we can also plot by year of author birth, which shows some interesting (if small) differences:
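The two charts come from the same counts grouped by a different field. A Python sketch of the rearrangement, with invented numbers in place of the real corpus:

```python
from collections import defaultdict

# Hypothetical records: (publication_year, author_birth_year,
# uses of "evolution", total words).
records = [
    (1880, 1820, 5, 100_000), (1880, 1850, 25, 100_000),
    (1900, 1820, 6, 100_000), (1900, 1850, 30, 100_000),
]

def share_by(records, key_index):
    """Share of all words that are 'evolution', grouped by one field."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        hits[rec[key_index]] += rec[2]
        totals[rec[key_index]] += rec[3]
    return {k: hits[k] / totals[k] for k in totals}

by_publication = share_by(records, 0)  # the usual ngrams-style chart
by_birth_year = share_by(records, 1)   # the generational chart
```

In this toy data the publication-year series rises gently while the birth-year series splits sharply between cohorts: the signature a generationally driven (type B) word would leave.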
Monday, March 28, 2011
Cronon's politics
Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and the fact that Cronon's upcoming AHA presidency will probably have the same pro-digital-history agenda as Grafton's, I don't think this has much to do with DH.) The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.
Thursday, March 24, 2011
Author Ages
Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me spend a little while looking at a new pool of book data that I think is really interesting.
Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, though, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going on before that's possible.
Open Library has pretty comprehensive metadata on authors. In the bigpubs database I made, about 40,000 books have author birth years, and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1500 books have no author listed whatsoever).
First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. Here's a chart of author age for all the books in my bigpubs set:

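The computation behind a chart like this is just a subtraction and a tally. A sketch (in Python rather than the ggplot/R used for the actual chart) with a handful of invented metadata rows:

```python
from collections import Counter

# Hypothetical (author_birth_year, publication_year) pairs from metadata.
pairs = [(1820, 1865), (1820, 1880), (1851, 1890),
         (1830, 1862), (1845, 1901), (1840, 1889)]

# An author's age at publication is just the difference of the two years.
ages = [pub - birth for birth, pub in pairs]

# Bin into five-year bins: the kind of thing a histogram function does.
bins = Counter(age // 5 * 5 for age in ages)
```

Binned this way, the counts are exactly the bar heights a histogram function draws.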
Wednesday, March 2, 2011
What historians don't know about database design…
I've been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They're occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.
One of the most illuminating things I've learned in trying to build up a fairly large corpus of texts is how database design constrains the ways historians can use digital sources. This is something I'm pretty sure most historians using jstor or google books haven't thought about at all. I've only thought about it a little bit, and I'm sure I still have major holes in my understanding, but I want to set something down.
Historians tend to think of our online repositories as black boxes that take boolean statements from users, apply them to data, and return results. We ask for all the books about the Soviet Union written before 1917, and Google spits them back. That's what computers aspire to. Historians respond by muttering about how we could have 13,000 misdated books for just that one phrase. The basic state of the discourse in history seems to be stuck there. But those problems are getting fixed, however imperfectly. We should be muttering instead about something else.
Tuesday, February 22, 2011
Genres in Motion
Here's an animation of the PCA numbers I've been exploring this last week.
There's quite a bit of data built in here, and just what it means is up for grabs. But it shows some interesting possibilities. As a reminder: at the end of my first post on categorizing genres, I arranged all the genres in the Library of Congress Classification in two-dimensional space using the first two principal components. PCA basically finds the combinations of variables that most define the differences within a group. (Read more by me here, or generally here.) The first dimension roughly corresponded to science vs. non-science; the second separated social science from the humanities. It did, I think, a pretty good job at showing which fields were close to each other. But since I do history, I wanted to know: do those relations change? Here's that same data, but arranged to show how those positions shift over time. I made this along the same lines as the great Rosling/Gapminder bubble charts, created with this via this. To get it started, I'm highlighting psychology.
[If this doesn't load, you can click through to the file here]. What in the world does this mean?
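For readers who want the mechanics: the first principal component is the direction of greatest variance in the genre-by-word frequency matrix. Here's a self-contained Python sketch (all numbers invented; the real analysis uses the 10,000 most common words), using power iteration rather than a library routine:

```python
# Toy genre-by-word frequency matrix: rows are LCC-style genres, columns
# are relative frequencies of four hypothetical words.
genres = ["QE (geology)", "QH (biology)", "BF (psychology)", "PS (literature)"]
freqs = [
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.2, 0.1],
    [0.3, 0.2, 0.8, 0.6],
    [0.0, 0.1, 0.7, 0.9],
]

def first_component_scores(matrix, iterations=200):
    """Score each row on the first principal component, found by power
    iteration on the covariance matrix of the column-centered data."""
    rows, cols = len(matrix), len(matrix[0])
    means = [sum(r[j] for r in matrix) / rows for j in range(cols)]
    x = [[r[j] - means[j] for j in range(cols)] for r in matrix]
    cov = [[sum(x[i][a] * x[i][b] for i in range(rows)) / (rows - 1)
            for b in range(cols)] for a in range(cols)]
    v = [1.0] * cols
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(cols)) for a in range(cols)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return [sum(x[i][j] * v[j] for j in range(cols)) for i in range(rows)]

scores = first_component_scores(freqs)
```

With these toy numbers, the two science genres land on one side of the axis and psychology and literature on the other: the "science vs. non-science" dimension in miniature (the component's overall sign is arbitrary, so only the grouping matters).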
Sunday, February 20, 2011
Vector Space, overlapping genres, and the world beyond keyword search
I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. For example, we can make all the books in LCC subclass "BF" (psychology) blue and those in "QE" (geology) red, overlaying them on a chart of the first two principal components like I've been using for the last two posts:
That's a little worse than I was hoping. Generally the books stay close to their term, but there is a lot of variation, and even a little bit of overlap. Can we do better? And what would that mean?
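One crude way to quantify "how well the books fit" is to give each genre a centroid in the component space and classify every book by its nearest centroid. A Python sketch with invented two-dimensional positions (the real space has as many dimensions as words):

```python
# Invented 2-D positions (first two principal components) for genre
# centroids and for individual books labeled with those genres.
centroids = {"BF": (-1.0, 0.5), "QE": (1.2, -0.3)}
books = [
    ("BF", (-0.8, 0.6)), ("BF", (-1.1, 0.2)),
    ("QE", (1.0, -0.1)), ("QE", (0.2, 0.4)),  # a straggler near the overlap
]

def nearest(point):
    """Classify a book by its closest genre centroid (squared Euclidean)."""
    return min(centroids, key=lambda g: sum((a - b) ** 2
               for a, b in zip(point, centroids[g])))

accuracy = sum(nearest(p) == label for label, p in books) / len(books)
```

The deliberately misplaced fourth book lands nearer the wrong centroid, which is exactly the kind of overlap the post describes: good separation in aggregate, sloppiness at the level of individual books.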
Thursday, February 17, 2011
PCA on years
I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:
I have a professional interest in shifts in genres. But this isn't temporal--it's just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?
Monday, February 14, 2011
Fresh set of eyes
One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.
And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress classifications are probably the best hierarchical classification of books we'll ever get. Certainly they're the best human-done hierarchical classification. It's literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they're also a little foreign, at times, and it's not clear how well they'll correspond to machine-centric ways of categorizing books. I've been playing around with some of the data on LCC classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.
***
Everybody loves dendrograms, even if they don't like statistics. Here's a famous one, from the French Encyclopédie.
That famous tree of knowledge raises two questions for me:
Friday, February 11, 2011
Going it alone
I've spent a lot of the last week trying to convince Princeton undergrads it's OK to occasionally disagree with each other, even if they're not sure they're right. So let me make a note on one of the places where I've felt a little bit of skepticism as I try to figure out what's going on with the digital humanities.
Since I'm late to the party, I've been trying to catch up a bit on where the field is now. One thing that jumped out is how wide-ranging the hopes are for what the digital humanities might do if they take over the existing disciplines or create their own. Being a bit of a job market determinist myself, I wonder if the wreckage many see in the current structure of the humanities doesn't promote a bit of a millenarian strand about how great the reconstruction might be. I feel occasionally I've stumbled into Moscow 1919 or Paris 1968; there are manifestos, there are spontaneous leaderless youth, and in the wreckage of the old system, anything seems possible for the new technological man. Digital humanities, to exaggerate the claims, will create the mass audience academic historians have lost, will reaffirm the importance of public history in the field, will create new fields with new jobs, will break down the boundaries between disciplines, will allow collaborative history to finally emerge. And it might be in danger if it's co-opted by the powers-that-be, as John Unsworth finds many worrying (pdf).
Paris 1968 is an exciting place to be. I've been watching Al-Jazeera all week. But all these transformations promised by DH won't happen all at once, and some of them won't happen at all. As I try to write some of this up for a Princeton audience (which is why, along with the start of our term last week, I'm not blogging much right now) I'm thinking about what it takes to get skeptical historians on board, and what parts of the promised land might put them off.
The thing I'm mulling over: collaboration. A colleague said to me yesterday he thought the digital humanities will come and go before most historians ever stop working alone, and I tend to agree. I'm pretty much agnostic on the need for collaborative history, myself. Certainly, digital humanities open up fascinating new prospects for collaborative projects. But insofar as we're trying to get anyone established on board, an insistence on collaboration might be as much a liability as a benefit. I'm signing up for a THATcamp, but I have to admit a bit of trepidation about putting volunteer work into anything that isn't mine. Not just out of selfishness, but because we often have funny standards about academic work that are difficult to impose on others. I went to a talk this week where one participant said he refuses to use the words "idea" or "concept." No one can live up to all the constraints we might want to put on work, but it's often fascinating to see what people come up with when we let them do things wholly their own way. Labs aren't always amenable to humanist practices, because it's critically important for the health of our disciplines that we don't agree on methodology.
Luckily, then, what I've been most struck by in the last couple of months is how far one can go it alone right now. Unlike the early years of humanities computing (or so I gather), you don't need teams to get computing time; all the truly technical work of digitization, OCR, and cataloging has been done by groups like the Internet Archive; and free software makes it possible to get started on some forms of analysis quite quickly. It's quite possible for someone at a university without any digital humanities infrastructure to do work in text mining or GIS without having a full lab or collaborative team behind them. Sure, it's harder than firing up an iPad app; but I'm not sure it's that much worse than all the commands plenty of senior academics learned in the dark ages to check their e-mail on pine or elm.
What about all the collaborative labs and programs we already have? Clearly they do more than anything else to advance the field, and it's hard to imagine all the great work coming out of GMU or Stanford (say) happening with lone scholars. But it's equally hard for me to imagine that the digital humanities will have actually succeeded until there's a lot of good work coming out that doesn't need the collaborative model, and that answers to some of the expectations of solitary scholars about how humanistic work is produced. At least, that's what I'm thinking for now.
Wednesday, February 2, 2011
Graphing word trends inside genres
Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I'm interested in for my dissertation by using the Library of Congress classifications for the books. I'm going to start with the difference between psychology and philosophy. I've already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat clearer.
Most people know that psychology emerged out of philosophy, becoming a more scientific or experimental study of the mind sometime in the second half of the 19C. The process of discipline formation is interesting, well studied, and clearly connected to the vocabulary used. Given that, there should be something for lexical statistics in it. Also, there's something neatly meta about using the split of a 'scientific' discipline off of a humanities one, since some rhetoric in or around the digital humanities promises a bit more rigor in our analysis by using numbers. So what are the actual differences we can find?
Let me start by just introducing these charts with a simple one. How much do the two fields talk about "truth?"
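Comparisons like this boil down to relative frequencies, and once the counts are in hand you can also rank which words separate the two fields most. A hedged Python sketch with invented counts (the real version would run over the full LCC-classed word counts):

```python
import math

# Invented word totals for books classed as philosophy (B) and
# psychology (BF); only the rough proportions matter here.
philosophy = {"truth": 120, "being": 200, "stimulus": 2,
              "experiment": 5, "the": 5000}
psychology = {"truth": 40, "being": 60, "stimulus": 150,
              "experiment": 170, "the": 5200}

def distinguishing(a, b):
    """Rank words by smoothed log-odds of appearing in corpus a vs b."""
    total_a, total_b = sum(a.values()), sum(b.values())
    scores = {w: math.log((a.get(w, 0) + 1) / total_a)
                 - math.log((b.get(w, 0) + 1) / total_b)
              for w in set(a) | set(b)}
    return sorted(scores, key=scores.get, reverse=True)

ranked = distinguishing(philosophy, psychology)
```

With these toy numbers, "being" and "truth" surface as philosophy's words and "experiment" and "stimulus" as psychology's; the add-one smoothing keeps a word absent from one field from blowing up the ratio.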
Tuesday, February 1, 2011
Technical notes
I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That's mostly because the number and size of files I'm dealing with are so large that I don't trust an existing program to handle them, and because the existing packages don't necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn't necessarily think to implement on my own. Maybe I'll find some way to get on board with all that later. First, a quick note on the programs:
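For the curious, the shape of the workflow is roughly this, sketched with SQLite standing in for MySQL and with an invented table layout (the actual schema surely differs):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One row per word per book, with the publication year denormalized in;
# in the real pipeline, Perl scripts would populate a table like this
# from the raw texts.
con.execute("""CREATE TABLE wordcounts
               (word TEXT, bookid INTEGER, year INTEGER, count INTEGER)""")
con.executemany("INSERT INTO wordcounts VALUES (?, ?, ?, ?)", [
    ("evolution", 1, 1860, 2), ("evolution", 2, 1860, 1),
    ("evolution", 3, 1890, 40), ("the", 1, 1860, 5000),
    ("the", 3, 1890, 4800),
])
con.execute("CREATE INDEX word_year ON wordcounts (word, year)")

# The basic historian's query: a word's counts by publication year,
# which then gets handed to R for plotting.
trend = dict(con.execute("""SELECT year, SUM(count) FROM wordcounts
                            WHERE word = 'evolution'
                            GROUP BY year ORDER BY year"""))
```

The (word, year) index is what keeps per-word trend queries cheap even when the table runs to hundreds of millions of rows.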
Monday, January 31, 2011
Where were 19C US books published?
Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file being the best way to interact with it, not always conveniently so.) While I'm waiting for some indexes to build, it's a good chance to figure out just what's in these digital sources.
Most interestingly, it has state-level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which were published in the US. How much geographical diversity is there within that? Not much. About 70% of the books were published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google chart API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks.) Not included is Washington DC, which has 13,000 books, slightly fewer than Illinois.
I'm going to try to pick publishers that aren't just in the big three cities, but any study of "culture," not the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.
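The concentration figure is a one-line aggregation once per-state counts are extracted from the metadata dump. A sketch with invented counts on roughly the same order as the numbers above:

```python
# Hypothetical per-state book counts (the real figures come from the
# Open Library metadata dump; only the rough proportions matter here).
state_counts = {"NY": 80_000, "MA": 45_000, "PA": 33_000, "IL": 14_000,
                "OH": 16_000, "CA": 12_000, "MD": 10_000, "CT": 9_000}
dc_books = 13_000  # DC is left off the state map

total = sum(state_counts.values())
big_three_share = sum(state_counts[s] for s in ("NY", "MA", "PA")) / total
```

With these stand-in numbers the big-three share comes out near the roughly 70% reported above, and DC's count sits just under Illinois's.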