Showing posts with label Data exploration and visualization. Show all posts

Thursday, November 11, 2010

Bookcounts are in

I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I'm just going to call them bookcounts--I hope that's a clear enough phrase.)

Roughly, wordcounts are how many times a word is said in my library, and bookcounts are how many different authors are saying it--or how many different readers are reading it. (Given prolific authors, multiple authors, etc., that's not quite true, but it's still an OK way to think about it.) Since this is a quantitative blog, let's start with a chart. Here are the two different counts simply plotted against each other (please someone e-mail me if these image files don't come through as well as the earlier ones):
Each of the 200,000 points is a word--from "the," all the way up in the upper right hand corner, to a whole morass of typos and words we've all forgotten, down in the lower left. The red line is the theoretical minimum--a word appearing in exactly as many books as its wordcount, that is, once per book.* This is abstract, I know, so let's add some of the words we've already been analyzing to the chart to humanize it a little. (By the time we get through with this, my little linguistic studies of the last few entries will have taken us back to history, I promise.)
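
For anyone who wants to see the two counts side by side, here is a minimal sketch in Python. It is not the code behind the chart: the three tiny "books," the whitespace tokenizing, and the function name `count_words_and_books` are all made up for illustration. It only shows the definitions, and why a word's wordcount can never be smaller than its bookcount.

```python
from collections import Counter

def count_words_and_books(books):
    """Return (wordcounts, bookcounts) for a list of book texts.

    Hypothetical helper for illustration: tokenizing is just
    lowercasing and splitting on whitespace.
    """
    wordcounts = Counter()  # total occurrences of each word
    bookcounts = Counter()  # number of books containing each word
    for text in books:
        tokens = text.lower().split()
        wordcounts.update(tokens)       # every occurrence counts
        bookcounts.update(set(tokens))  # each book counts at most once
    return wordcounts, bookcounts

# Made-up toy library, not real data.
books = [
    "the telephone rang and the telegraph clicked",
    "the railroad ran past the telegraph office",
    "the telephone the telephone the telephone",
]

wordcounts, bookcounts = count_words_and_books(books)

for word in ["the", "telephone", "telegraph", "railroad"]:
    print(word, wordcounts[word], bookcounts[word])
```

In this toy library "telephone" has a wordcount of 4 but a bookcount of only 2, because one book repeats it, while "railroad" sits right on the red line with one occurrence in one book.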

Tuesday, November 9, 2010

How Many Words are there in the English language?

Here's what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster's); but including technical vocabulary, a million, give or take a few hundred thousand. For my poor computer, though, that's too many, for reasons too technical to go into here. Suffice it to say that I'm asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though--I'll put the only big thought I have about it in another post later tonight.