Saturday, November 13, 2010

Infrastructure

It's time for another bookkeeping post. Read below if you want to know about changes I'm making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton's supercomputer time, and why I didn't just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Back to Darwin

Henry asks in the comments whether the decline in evolutionary thought in the 1890s is the "'Eclipse of Darwinism,' rise or prominence of neo-Lamarckians and saltationism and kooky discussions of hereditary mechanisms?" Let's take a look, with our new and improved data (and better charts, too, compared to earlier in the week--any suggestions on design?). First, three words very closely tied to the theory of natural selection.

Three rises from around 1859, Origin's publication date (obviously the numbers for Spencer are inflated by other Spencers in the world, but the trend seems like it might be driven by Herbert); and three peaks at different points from 1885 to 1900, followed by a fall and perhaps a recovery. The question is: how significant are those falls, and how can we interpret them? First, let's look at the bookcounts: are those falls a result of less intensive discussion of the subjects, or of a waning in interest across all books?

Friday, November 12, 2010

Wordcounts in starting research--what do we have now?

All right, let's put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term 'scientific method.' I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of what some of the low-hanging fruit in text analysis are.

The basic theory I'm working on here is that textual analysis isn't necessarily about answering research questions. (It's not always so good at doing that.) It can also help us channel our thinking into different directions. That's why I like to use charts and random samples rather than lists--they can help us come up with unexpected ideas, and help us make associations that wouldn't come naturally. Essentially, it's a different form of reading--just like we can get different sorts of ideas from looking at visual evidence vs. textual evidence, so can we get yet other ideas by reading quantitative evidence. The last chart in the post is good for that, I think. But first things first: the total occurrences of "scientific method" per thousand words.


This is what we've already had. But now I've finally got those bookcounts running too. Here is the number of books per thousand* that contain the phrase "scientific method":


Thursday, November 11, 2010

Bookcounts are in

I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I'm just going to call them bookcounts--I hope that's a clear enough phrase).

Roughly, wordcounts are how many times a word is said in my library, and bookcounts are how many different authors are saying it--or how many different readers are reading it. (Given prolific authors, multiple authors, etc. that's not quite true, but it's still an OK way to think about it). Since this is a quantitative blog, let's start with a chart. Here are the two different counts simply plotted against each other (please someone e-mail me if these image files don't come through as well as the earlier ones):
Each of the 200,000 points is a word--from "the," all the way up in the upper right hand corner, to a whole morass of words we've all forgotten and typos, down in the lower left. The red line is the theoretical minimum-- a word appearing in exactly as many books as is its word count.* This is abstract, I know, so let's add some of the words we've already been analyzing to the chart to humanize it a little. (By the time we get through with this, my little linguistic studies the last few entries will have taken us back to history, I promise).
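For anyone who wants the distinction spelled out mechanically, here is a minimal sketch of the two counts in Python. It is not the code behind the charts: the tokenizer, the function names, and the three toy "books" are all assumptions made up for illustration.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and pull out alphabetic tokens (a crude stand-in for real tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def word_and_book_counts(books):
    """Return (wordcounts, bookcounts) for an iterable of book texts."""
    wordcounts = Counter()
    bookcounts = Counter()
    for text in books:
        tokens = tokenize(text)
        wordcounts.update(tokens)       # every occurrence counts
        bookcounts.update(set(tokens))  # each book counts at most once per word
    return wordcounts, bookcounts

# Toy library of three "books", invented for illustration only
books = [
    "the telegraph and the railroad",
    "the railroad the railroad the railroad",
    "a telephone",
]
wordcounts, bookcounts = word_and_book_counts(books)
print(wordcounts["railroad"], bookcounts["railroad"])  # 4 2
```

A word can never appear in more books than it has occurrences, which is why no point can fall on the wrong side of the red line.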

Wednesday, November 10, 2010

digitizecr by ljooq ic

Obviously, I like charts. But I've periodically been presenting data as a number of random samples, as well.  It's a technique that can be important for digital humanities analysis. And it's one that can draw more on the skills in humanistic training, so might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own--it's just a set of coordinates. Even in the big education datasets I used to work with, the core facts that I was aggregating up from were generally very dull--one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there's real meaning embodied in every point, that we're far better equipped to understand than the computer. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can't read everything ourselves, but it's good to check up periodically--that's why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.
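The two kinds of spot check mentioned above are easy to sketch. This is only an illustration under assumptions: the function names are hypothetical, and the titles and counts would come from whatever catalog and wordcount table you already have.

```python
import random

def sample_titles(titles, k=20, seed=None):
    """Draw up to k titles at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(titles, min(k, len(titles)))

def word_at_rank(wordcounts, rank):
    """Return the word at a 1-based frequency rank, or None if the vocabulary is smaller."""
    # Sort by descending count, breaking ties alphabetically so the result is stable
    ranked = sorted(wordcounts, key=lambda w: (-wordcounts[w], w))
    return ranked[rank - 1] if 1 <= rank <= len(ranked) else None

# Toy data, invented for illustration only
titles = [f"Book number {i}" for i in range(50)]
picked = sample_titles(titles, k=20, seed=1)
second = word_at_rank({"the": 5, "railroad": 4, "a": 1}, 2)
```

Passing a seed is a design choice rather than a requirement: it lets a reader rerun the same sample, while leaving it out gives a fresh draw each time.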

So any good text processing application will let us delve into the individual data as well as giving the overall picture. I'm circling around something commenter "Jamie" said, though not addressing it directly: (quote after break)


Tuesday, November 9, 2010

How Many Words are there in the English language?

Here's what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster's); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that's too many, for reasons too technical to go into here. Suffice it to say that I'm asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though--I'll put the only big thought I have about it in another post later tonight.

Monday, November 8, 2010

More on Technologies--and, what the graphs show.

I can't resist making a few more comments on that technologies graph that I laid out. I'm going to add a few thousand more books to the counts overnight, so I won't make any new charts until tomorrow, but look at this one again.
You remember I claimed that some intellectual movements spread like technologies, and some like news. But that's only looking at the smoothed curves. As you've probably figured out, I plot two sets of data for each word--a thin solid line for the actual data, and a large dotted line for a smoothed version (for the record, a loess smoothing that looks at 15% of the total for each point). What do the peaks mean?
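For the curious, here is roughly what a loess smoother with a 15% span does, written out as a minimal pure-Python sketch. It is an approximation under assumptions (a local straight-line fit with tricube weights over the nearest 15% of points), not the routine that drew the dotted lines, and the function name and toy series are made up for illustration.

```python
import math

def loess(xs, ys, span=0.15):
    """Smooth ys against xs with locally weighted linear regression (tricube weights)."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))  # neighbours used at each point
    smoothed = []
    for x0 in xs:
        # Indices of the k points closest to x0
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        h = max(abs(xs[i] - x0) for i in nearest) or 1.0  # local bandwidth
        sw = swx = swxx = swy = swxy = 0.0
        for i in nearest:
            dx = xs[i] - x0
            w = (1 - (abs(dx) / h) ** 3) ** 3  # tricube: nearby points count most
            sw += w
            swx += w * dx
            swxx += w * dx * dx
            swy += w * ys[i]
            swxy += w * dx * ys[i]
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            smoothed.append(swy / sw)  # degenerate case: fall back to a weighted mean
        else:
            # Intercept of the weighted line fit, i.e. the fitted value at x0
            smoothed.append((swxx * swy - swx * swxy) / denom)
    return smoothed

# Toy series, invented for illustration: flat, with one spike in 1875
years = list(range(1830, 1921))
counts = [10.0 if year == 1875 else 0.0 for year in years]
trend = loess(years, counts)
```

The spike survives in the thin line and is flattened in the smoothed one, which is exactly why the two can tell different stories.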

Diffusion patterns for news and technological events

An anonymous correspondent says:
You mention in the post about evolution & efficiency that "Offhand, the evolution curve looks more like the ones I see for technologies, while the efficiency curve resembles news events."

That's a very interesting observation, and possibly a very important one if it's original to you, and can be substantiated. Do you have an example of a tech vs news event graph? Something like lightbulbs or batteries vs the Spanish-American War might provide a good test case.

Also, do you think there might be changes in how these graphs play out over a century? That is, do news events remain separate from tech stuff? Tech changes these days are often news events themselves, and distributed similarly across media.

I think another way to put the tech vs news event could be in terms of the kind of event it is: structural change vs superficial, mid-range event vs short-term.

Anyhow, a very interesting idea, of using the visual pattern to recognize and characterize a change. While I think your emphasis on the teaching angle (rather than research) is spot on, this could be one application of these techniques where it'd be more useful in research.
He or she is right that technology vs. news isn't quite the right way to describe it. Even in the 19C, some technology changes are news events, while others aren't. But let's look at some examples here.

Back to Basics

I've rushed straight into applications here without taking much time to look at the data I'm working with. So let me take a minute to describe the set and how I'm trimming it.

Sunday, November 7, 2010

Collocation

A collection as large as the Internet Archive's OCR database means I have to think through what I want well in advance of doing it. I'm only using a small subset of their 900,000 Google-scanned books, but that's still 16 gigabytes--it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I'll still have to figure out what sort of relations are worth looking for. So what are some?
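Collocation is the obvious first candidate. A minimal sketch of what counting collocates might look like, assuming a simple alphabetic tokenizer and a fixed window of words on either side of the target (the function name, the window size, and the sample sentence are all hypothetical):

```python
from collections import Counter
import re

def collocates(text, target, window=5):
    """Count the words appearing within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i])      # words before the target
            counts.update(tokens[i + 1:hi])  # words after the target
    return counts

# Toy sentence, invented for illustration only
text = "Natural selection and the theory of evolution by natural selection"
near = collocates(text, "selection", window=1)
print(near.most_common(1))  # [('natural', 2)]
```

On a real collection the tokenizing would be done once and stored, since rereading 16 gigabytes for every target word is the slow part.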

Taylor vs. Darwin

Let's start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he'd probably like, that 'business' became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.


But you can get into some interesting analysis even just at the wordcount level. Look at this chart (sorry if it's small):


The red line is the frequency of the word 'evolution'; the green one, 'efficiency.' (I'll put a post at some point on how to read these graphs better.) Each one of those terms gets a week in your typical intellectual history survey, and each one has a canonical author associated with it. (If you're a little fuzzy, Darwin's On the Origin of Species is 1859; Taylor's Scientific Management is 1911, Shop Management 1903). And sure enough, both climb out of the ocean of insignificant words around the time we'd expect them to.

But look at the differences between the two. "Evolution" starts its climb shortly after Darwin publishes (the real spike in the data seems to be 1864, which gives Americans a chance to make it through that book), and rises in prominence for decades. "Efficiency," on the other hand, becomes five times as prominent in the first fifteen years of the century, before leveling off a bit. Taylor's most famous work is at the middle of the curve, and his first one is part of the rise, not before it. Both words show a major addition to shared cultural vocabulary, but the way they are taken in shows they are two very different intellectual movements.

Saturday, November 6, 2010

Intro

I'm going to start using this blog to work through some issues in finding useful applications for digital history. (Interesting applications? Applications at all?)

Right now, that means trying to figure out how to use large amounts of textual data to draw conclusions or refine questions. I currently have the Internet Archive's OCRed text files for about 30,000 books by large American publishers from 1830 to 1920. I've done this partly to help with my own research, and partly to try a different way of thinking about history and the texts we read.

I'm putting it online to help convince one or two people (I'm looking at you, Henry) that this sort of exploration is important for research and teaching. Not necessarily that it's research itself; I'm still unimpressed by the conclusions I'm getting out of it. But at least that any historian looking at the meanings of words (which is most of us, at least around here) should make some stab at using the texts of books we haven't read. And if I can get some good graphics out of it, maybe we can start to think about how this might be useful in teaching, particularly students who respond better to data than stories.

Anyhow, on with it.

Wednesday, July 15, 2009