But now that I see some concerns about gender biases in big digital corpora, I do have a bit to say. Partly, it's that I have seen nothing to make me think social prejudices played into the scanning decisions at all. Rather, Google Books, Hathi Trust, the Internet Archive, and all the other similar projects are pretty much representative of the state of academic libraries. (With strange exceptions, of course.) You can choose where to vacuum, but not what gets sucked up into the machine; likewise the companies.
I have, though, seen everything to make me think that library collections are spectacularly biased in who they collect. This is true of gender, it's true of professions, it's true of race, it's true of class. It's even true of the time in which an author writes. I never get tired of thinking about this. Still, this doesn't bother me much. There are lots of interesting questions that can be successfully posed to library books despite their biases; there are quite a few that can be successfully posed because of their biases. The fact that a library chose to keep some books is not an inconvenient sample of some occluded truth; it's the central fact of what they, back then, chose to let us see. I deliberately say "their biases," and not "their selection biases": I try never to talk of library collections as 'samples.' Sample implies a whole; I have no idea what that would be here. Would it be every book ever written? That would show extremely similar biases against the dispossessed. Not every word ever spoken; only some are permitted to speak. Even the thoughts of historical actors are constrained by what they are permitted to know.
We have to rest somewhere. In seeking the ideas bandied about in the past, the library is not only a good place to start, it is as good a place to end as any. As for its biases: there is no getting around them, but there is understanding them. That's always been a core obligation of historians.
To come back to gender: those biases are a big reason why I like digital libraries. It's possible to get one's arms around the biases in the whole just a bit better; and since they resemble physical libraries so well, they tell us how we might have been misreading them. This applies to texts, but sometimes to their authors as well. Remember: we can only understand past actors insofar as they have attributes--gender, race, nationality--enumerated by the state, but it is those state categories in which we're so caught up today. (Perhaps to our detriment; maybe we should be looking for bias against aesthetes, or agnostics, or the victims of violence.) It's only rarely possible to escape those categories even a bit to see around the big issues. Tim Sherratt's now-canonical example of faces does it for real individuals. But in the aggregates, I think there's something about names. Names can tell us about gender; but they can also take us outside it.
A couple weeks ago, I downloaded the 1% IPUMS sample of 1910 and 1920 US census records. Those years, unlike many others, have names for each record. I already have author names as well, from the Open Library. It's Downton Abbey all over again: I can just divide between the two sets to see what names are used too little, and what names too much. I could guess at a headline figure on the gender breakdown of books in libraries; something like 10% of books are by women before 1922, maybe, but let's not reduce so far yet. Let's stay with the names. If more than 50% of the holders are women, call it a female name, and vice versa.
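The majority rule for labelling names can be sketched in a few lines of Python. This is only an illustration, not the code behind the charts: the `(name, sex)` record layout, the function name `classify_names`, and the toy records are all assumptions made up for the example, and the handling of an exact 50/50 split (left unlabelled here) is a choice the rule above does not specify.

```python
from collections import Counter


def classify_names(records):
    """Label each first name 'F' or 'M' by the majority sex of its holders.

    `records` is an iterable of (first_name, sex) pairs, with sex given
    as 'F' or 'M'. This layout is hypothetical, not the IPUMS format.
    """
    totals = Counter()
    women = Counter()
    for name, sex in records:
        key = name.strip().upper()  # fold case so 'Mary' and 'MARY' match
        totals[key] += 1
        if sex == "F":
            women[key] += 1

    labels = {}
    for name, count in totals.items():
        share_women = women[name] / count
        if share_women > 0.5:
            labels[name] = "F"
        elif share_women < 0.5:
            labels[name] = "M"
        else:
            labels[name] = None  # exact tie: left unlabelled (an assumption)
    return labels


# Toy records invented for illustration; these are not census data.
toy_census = [
    ("Mary", "F"), ("Mary", "F"),
    ("John", "M"),
    ("Willie", "M"), ("Willie", "M"), ("Willie", "F"),
    ("Marion", "F"), ("Marion", "M"),
]
name_gender = classify_names(toy_census)
```

With real records the same lookup could then be applied to author first names, though initials and pen names would need separate handling.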
How do these names stack up? The numbers will not surprise you, but it's worth putting them down because we don't really know this about our libraries, I don't think, and we clearly want to:
That red line at 10^0 (i.e., 1) is where a name is equally frequent in the US census and in the Open Library list of authors. 10^1 is ten times as frequent in real life, 10^-1 is ten times as frequent in books, and so on. So most men's names are about 2x more common in books than in the census, and the peak of the women's names falls somewhere between 8x less common and just slightly more.
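For readers who want the axis spelled out, here is a rough Python sketch of one way to compute such a ratio. It rests on assumptions of mine: each name's frequency is taken as its share of all names in its own list, names missing from either list are skipped because their ratio is undefined, and the function name `census_to_library_ratio` and the toy lists are invented for the example.

```python
import math
from collections import Counter


def census_to_library_ratio(census_names, author_names):
    """Return {name: census share / author share} for names in both lists.

    A ratio above 1 means the name is more common in the census than among
    authors; below 1 means it is more common among authors.
    """
    census = Counter(n.strip().upper() for n in census_names)
    authors = Counter(n.strip().upper() for n in author_names)
    census_total = sum(census.values())
    author_total = sum(authors.values())

    ratios = {}
    for name in census.keys() & authors.keys():
        census_share = census[name] / census_total
        author_share = authors[name] / author_total
        ratios[name] = census_share / author_share
    return ratios


# Toy lists invented for illustration; these are not real counts.
toy_census_names = ["Mary"] * 8 + ["John"] * 2
toy_author_names = ["Mary"] * 1 + ["John"] * 9

ratios = census_to_library_ratio(toy_census_names, toy_author_names)
# Position on a log10 axis: 0 is parity, positive leans toward the census.
log_ratios = {name: math.log10(r) for name, r in ratios.items()}
```

In this toy case MARY lands to the right of the parity line (more common in the census) and JOHN to the left (more common among authors), which is the reading of the axis described above.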
And remember, each of those distributions is built up of individual names. Let's make them dots at first. Here I've arranged each first name from left to right by overall frequency: you can still see the differing patterns of women's names, usually about 9x more common in the census, and men's names, usually about 2x as common in libraries.
But there's no story to dots. Split up the genders and see the names themselves, and you get something we can talk about.
And for the men. Why is John so different from William and George? How many of those over-represented single letters, which are usually men in the census, are actually women in disguise (GEM Anscombe, JK Rowling) in the libraries? Are those Willies all boys, those Joes all men?
I find it easy to spin out stories here about access to publishing that may or may not be true: about predominantly wealthy Eleanors allowed to write novels, about poor Irish Michaels who never finished school, about rich Yankee Williams and hardscrabble, farming Johns. It would be easy to test some of these stories, harder to test others. But we could always start.
We may not care about names. But the questions we do care about, the ones built on categories the state has been codifying and enumerating since it came of age, are even easier to answer, because that's what the state has been collecting data on. I suspect that the solution lies in what we already have.