Showing posts with label authors. Show all posts

Tuesday, May 10, 2011

Predicting publication year and generational language shift

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn't happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like "outside" more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

Will had some good questions in the comments about how different words fit these patterns. Looking at different types of words should help find some more ways that this sort of investigation is interesting, and show how different sorts of language vary. But to look at other sorts of words, I should be a little clearer about the kind of words I chose the first time through. If I can describe the usage pattern for a "word like 'outside'," just what kind of words are like 'outside'? Can we generalize the trend that they demonstrate?

Monday, April 11, 2011

Age cohort and Vocabulary use

Let's start with two self-evident facts about how print culture changes over time:
  1. The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
  2. The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.
How do (1) and (2) relate to each other? To what extent does the shifting group of authors create the changes in language, and how much do changes happen in a culture that authors all draw from?

This might be a historical question, but it also might be a linguistics/sociology/culturomics one. Say there are two different models of language use: type A and type B.
  • Type A means a speaker drifts on the cultural winds: the language shifts and everyone changes their vocabulary every year.
  • Type B, on the other hand, assumes that vocabulary is largely fixed at a certain age: a speaker will be largely consistent in her word choice from age 30 to 70, say, and new terms will not impinge on her vocabulary.
Both of these models are extremes, and we can assume that hardly any words are pure A or pure B. To firm this up, here are two nicely alphabetical examples of fictional characters to warm up the subject for all you humanists out there:
  • Type A: John Updike's Rabbit Angstrom. Rabbit doesn't know what he wants to say. Every decade, his vocabulary changes; he talks like an ennui-ed salaryman in the 50s, flirts with hippiedom and Nixonian silent-majorityism in the 60s, spends the late 70s hoarding gold and muttering about Consumer Reports and the Japanese. For Updike, part of Rabbit being an everyman is the shifts he undergoes from book to book: there's a sort of implicit type-A model underlying his transformations. He's a different person at every age because America is different in every year.
  • Type B: Richard Ford's Frank Bascombe. Frank Bascombe, on the other hand, has his own voice. It shifts from decade to decade, to be sure, but 80s Bascombe sounds more like 2000s Bascombe than he sounds like 80s Angstrom. What does change is internal to his own life: in the 90s he's in the Existence Period and worries about careers, and in the 00s he's in the Permanent Period and worries about death. Bascombe is a dreamy outsider everywhere he goes: the Mississippian who went to Ann Arbor, always perplexed by the present.*
Anyhow: I don't have good enough author metadata right now to check this on authors (which would be really interesting), but I can do it a bit on words. An Angstrom word would be one that pops up across all age cohorts in society simultaneously; a Bascombe word is one that creeps in more with each succeeding generation, but that doesn't change much over time within an age cohort.
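One crude way to make the Angstrom/Bascombe distinction operational is sketched below in Python. This is an illustration, not the code behind these posts: the function names, the decade binning, and the toy rates are all assumptions. The idea is to compare how much a word's usage rate moves within a birth cohort over time (Angstrom-like) against how much it differs between cohorts publishing in the same period (Bascombe-like).

```python
from collections import defaultdict

def mean_spread(rates, group_index):
    """Average range (max - min) of usage rates within each group.

    rates maps (birth_decade, pub_decade) -> usage rate.
    group_index 0 groups cells by birth decade, 1 by publication decade.
    """
    groups = defaultdict(list)
    for key, rate in rates.items():
        groups[key[group_index]].append(rate)
    # Only groups with at least two cells can show any spread.
    spreads = [max(v) - min(v) for v in groups.values() if len(v) > 1]
    return sum(spreads) / len(spreads) if spreads else 0.0

def classify(rates):
    """Label a word by which kind of variation dominates (hypothetical rule)."""
    within_cohort = mean_spread(rates, 0)  # change over time inside one cohort
    within_period = mean_spread(rates, 1)  # gap between cohorts in one period
    return "Angstrom" if within_cohort > within_period else "Bascombe"

# Toy (birth decade, publication decade) cells, invented for illustration.
cells = [(1800, 1840), (1800, 1860), (1820, 1860),
         (1820, 1880), (1840, 1880), (1840, 1900)]

# A pure Bascombe word: the rate depends only on when the author was born.
bascombe_word = {(b, p): (b - 1780) / 10 for b, p in cells}
# A pure Angstrom word: the rate depends only on when the book was published.
angstrom_word = {(b, p): (p - 1820) / 10 for b, p in cells}

print(classify(bascombe_word))  # Bascombe
print(classify(angstrom_word))  # Angstrom
```

Real words will sit somewhere between the two, and within a cohort the publication year moves in lockstep with the author's age, so a comparison like this cannot separate period effects from aging effects.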

This is getting into some pretty multi-dimensional data, so we need something a little more complicated than line graphs. The solution I like right now is heat maps.

An example: I know that "outside" is a word that shows a steady, upward trend from 1830 to 1922; in fact, I found that it was so steady that it was among the best words at helping to date books based on their vocabulary usage. So how did "outside" become more popular? Was it the Angstrom model, where everyone just started using it more? Or was it the Bascombe model, where each succeeding generation used it more and more? To answer that, we need to combine author birth year with year of publication:
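As a rough sketch of that combination (in Python, with a made-up record layout and invented counts, not the actual database behind the heat maps), each book is assigned to a cell defined by the author's birth decade and the publication decade, and the word's rate is computed per cell. The resulting dictionary is the kind of grid a heat map would color.

```python
from collections import defaultdict

def usage_grid(books, word, bin_size=10):
    """Return {(birth_bin, pub_bin): occurrences of `word` per million words}.

    Each book is a dict with hypothetical keys: birth_year, pub_year,
    total_words, and counts (a word -> count mapping).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for book in books:
        cell = (book["birth_year"] // bin_size * bin_size,
                book["pub_year"] // bin_size * bin_size)
        hits[cell] += book["counts"].get(word, 0)
        totals[cell] += book["total_words"]
    # Pool counts within a cell before dividing, so long books weigh more.
    return {cell: 1e6 * hits[cell] / totals[cell]
            for cell in totals if totals[cell] > 0}

# Toy records, invented for illustration only.
books = [
    {"birth_year": 1803, "pub_year": 1845, "total_words": 100_000,
     "counts": {"outside": 5}},
    {"birth_year": 1808, "pub_year": 1841, "total_words": 100_000,
     "counts": {"outside": 7}},
    {"birth_year": 1845, "pub_year": 1885, "total_words": 200_000,
     "counts": {"outside": 40}},
]

grid = usage_grid(books, "outside")
for cell in sorted(grid):
    print(cell, grid[cell])
```

Reading along a row of this grid (one birth cohort across publication decades) shows Angstrom-style change; reading down a column (one publication decade across cohorts) shows Bascombe-style change.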

Friday, April 1, 2011

Generations vs. contexts

When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn't have publication years for their works, somewhat to my surprise. (It's remarkable how much metadata, rather than the data itself, holds this sort of work back.) They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.
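The two dimensions differ only in which year each book is filed under. A minimal Python sketch, assuming a hypothetical record layout and using invented counts, shows the same aggregation run once by publication year and once by author birth year; books with no value for the chosen field are skipped, which is what missing publication years force you to do.

```python
from collections import defaultdict

def share_by(books, word, year_field):
    """Share of all words that are `word`, grouped by the given year field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for book in books:
        year = book.get(year_field)
        if year is None:  # metadata gap: this book cannot be placed
            continue
        hits[year] += book["counts"].get(word, 0)
        totals[year] += book["total_words"]
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y] > 0}

# Toy records, invented for illustration only.
books = [
    {"birth_year": 1810, "pub_year": 1850, "total_words": 1000,
     "counts": {"evolution": 1}},
    {"birth_year": 1810, "pub_year": None, "total_words": 1000,
     "counts": {"evolution": 3}},
    {"birth_year": 1820, "pub_year": 1862, "total_words": 2000,
     "counts": {}},
]

by_pub = share_by(books, "evolution", "pub_year")
by_birth = share_by(books, "evolution", "birth_year")
print(by_pub)
print(by_birth)
```

In the toy data the second book has no publication year, so it drops out of the by-publication series but still contributes to the by-birth series.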

I've been using 'evolution' as my test phrase for a while now: but as you'll see, it turns out to be a really interesting word for this kind of analysis. Maybe that's just chance, but I think it might be a sort of indicative test case--generational shifts are particularly important for live intellectual issues, perhaps, compared to overall linguistic drift.

To start off, here's a chart of the usage of the word "evolution" by share of words per year. There's nothing new here yet, so this is merely a reminder:

Here's what's new: we can also plot by year of author birth, which shows some interesting (if small) differences:

Thursday, March 24, 2011

Author Ages

Back from Venice (which is plastered with posters for "Mapping the Republic of Letters," making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off looking at a new pool for book data for a little while that I think is really interesting.

Open Library metadata has author birth dates. The interaction of these with publication years offers a lot of really fascinating routes to go down, and hopefully I can sketch out a few over the next week or two. Let me start off, though, with just a quick note on its reliability, scope, etc., looking only at the metadata itself. The really interesting stuff won't come out of metadata manipulation like this, but rather out of looking at actual word use patterns. But I need to understand what's going on before that's possible.

Open Library has pretty comprehensive metadata on authors. In the bigpubs database I made, about 40,000 books have author birth years, and 8,000 do not; given that some of those are corporate authors, anonymous, etc., that's not bad at all. (About 1500 books have no author listed whatsoever).

First, a pretty basic question: how old are authors when they write books? I've been meaning to switch over to ggplot in R for basic graphing, so here's a chance to break its histogram function. Here's a chart of author age for all the books in my bigpubs set:
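The calculation behind a chart like this is just publication year minus author birth year, binned. Here is a minimal Python sketch of that step (the chart itself was made with ggplot in R; the function names, the plausibility cutoffs of 10 and 100, and the toy records below are assumptions for illustration, not values from the bigpubs set).

```python
from collections import Counter
from statistics import median

def author_ages(books, min_age=10, max_age=100):
    """Age of the author at publication for each (pub_year, birth_year) pair.

    Books with no birth year are skipped, and ages outside the assumed
    plausible range are dropped (e.g. reprints long after an author's death).
    """
    ages = []
    for pub_year, birth_year in books:
        if birth_year is None:
            continue
        age = pub_year - birth_year
        if min_age <= age <= max_age:
            ages.append(age)
    return ages

def histogram(ages, bin_width=5):
    """Count ages per bin, keyed by the lower edge of each bin."""
    counts = Counter(age // bin_width * bin_width for age in ages)
    return dict(sorted(counts.items()))

# Toy (publication year, author birth year) pairs, invented for illustration.
books = [(1850, 1801), (1851, 1810), (1860, 1812),
         (1870, None), (1900, 1750), (1880, 1825)]

ages = author_ages(books)
print(ages)             # [49, 41, 48, 55]
print(median(ages))     # 48.5
print(histogram(ages))  # {40: 1, 45: 2, 55: 1}
```

The range filter is a judgment call: tightening it removes posthumous editions, but also any genuinely late or precocious authors.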