Comments on Sapping Attention: Stopwords to the wise

Anonymous, 2011-04-03 15:54

Working on the 19c, the question of spelling normalization is relatively painless. But as you get back before 1800 it becomes hairy -- both because there are more variants, and because the line between "a different spelling" and "a different word" can get blurry. I'm trying to build a corpus that covers 1700-1900, and that's going to be one of the trickier aspects.

The information about gendered pronouns is interesting. Equally interesting is the implicit overview of DH, which hints that on the whole historians are more likely to emphasize big data, and lit types more likely to emphasize intensive analysis of smaller datasets. I think that's probably right, on the whole, although I can think of a number of exceptions on both sides.

Shane Landrum (https://www.blogger.com/profile/09431323570161284017), 2011-04-03 15:43

Ben, thanks for this post.

On keyword searching and the biases of databases, April Merleaux's talk "The Keyword Historian: Adventures in the Digital Archives" (http://digitalhumanities.yale.edu/pdp/2010/04/28/merleauxthe-keyword-historian-adventures-in-the-digital-archives/), given at the Past's Digital Presence conference (http://digitalhumanities.yale.edu/pdp/) at Yale last year, may be of interest. The video's online at that link. She talks about her use of the keyword "piloncillo" (a little cone of unrefined sugar) as a way to research the sugar trade on the US-Mexico border in the 19th century. She also, towards the end of the talk, mentions the biases of Ancestry.com and the limitations of its family-focused searching.

Ben (https://www.blogger.com/profile/04856020368342677253), 2011-04-03 15:35

@Caleb,

Thanks, I really liked your recent post on Lincoln (http://mcdaniel.blogs.rice.edu/?p=126).

If I actually knew how Google worked, I'd probably go cash out by starting a search-engine optimization firm. So this is written from ignorance. But: Google seems to correct for plurals sometimes, as far as I can tell. If they did it fully, it would actually take less storage and so faster queries, although more initial processing to create it--instead of having database entries for 'pay' and 'paid' and 'paying', you can just have a single one for 'pay.' I've thought about changing my database to work like that, because overall it would work faster. But obviously inflection, just like stopwords, occasionally keeps good information in place. (The N-grams guys go one farther than me in keeping capitalization, but I decided nearly doubling the size of the database to be able to tell polish remover from Polish remover wasn't worth it.)

But probably part of the labor thing is that a British site might link "child labour" to an American site even if it doesn't use that spelling; that's the behavior that Google Bombing (http://en.wikipedia.org/wiki/Political_Google_bombs_in_the_2004_U.S._Presidential_election) exploits.

But you're right, that's a case in point about how computer systems are designed for audiences. Not many people besides historians care about the difference between labor movements and labour movements, right? So Google's right to assume they should be conflated as far as possible in search results.

Caleb McDaniel (http://www.owlnet.rice.edu/~wcm1/), 2011-04-03 15:12

I recently discovered your blog and have been learning a lot. Thanks for your work!

This is a question from ignorance, but your interesting observations about stop words made me wonder about Google's decision to lump together spelling variations (for example, if I search "child labor" I get some results about "child labour"). From a database perspective, I assume that what's going on behind the scenes here is different from what happens with a stop word. To a layman like me, it would seem like returning results with multiple spellings of the search term would require more processing, algorithmic power than sticking with the user spelling. But I guess my question is whether there are database design decisions in a case like this that might limit an historian's access to questions that might be interesting--like figuring out the rate at which Anglicized spellings of a word like "labour" disappeared from American texts.
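
A minimal sketch of the trade-off raised in this thread: folding variants such as 'pay', 'paid', 'paying' (or "labor" and "labour") into one database key makes the index smaller, but it also erases exactly the signal needed to measure how fast a spelling disappeared. Everything here is an assumption made for illustration: the toy corpus is invented, the NORMALIZE table is a hand-written stand-in for a real stemmer or spelling list, and the function names are hypothetical. It does not describe how Google or the blog's own database actually works.

```python
from collections import Counter, defaultdict

# Hypothetical normalization table: maps a variant or inflected form to one key.
# A real system would use a stemmer or a much larger spelling list.
NORMALIZE = {
    "labour": "labor",
    "paid": "pay",
    "paying": "pay",
}


def build_index(docs, normalize=False):
    """Count word occurrences per year.

    docs is an iterable of (year, text) pairs. With normalize=True,
    variants are folded into a single key before counting.
    """
    index = defaultdict(Counter)
    for year, text in docs:
        for token in text.lower().split():
            word = token.strip('.,;:!?"\'()')
            if not word:
                continue
            if normalize:
                word = NORMALIZE.get(word, word)
            index[word][year] += 1
    return index


def variant_share(index, variant, standard, year):
    """Fraction of (variant + standard) occurrences in a year that use the variant.

    Returns None when neither spelling occurs in that year.
    """
    v = index.get(variant, Counter())[year]
    s = index.get(standard, Counter())[year]
    total = v + s
    return v / total if total else None


# Made-up toy corpus, for illustration only.
DOCS = [
    (1800, "Child labour and labour paid poorly."),
    (1850, "Labour unrest; labor was paying more."),
    (1900, "Labor laws and child labor."),
]

raw = build_index(DOCS)                     # keeps every spelling distinct
folded = build_index(DOCS, normalize=True)  # one entry per normalized word

if __name__ == "__main__":
    for year in (1800, 1850, 1900):
        print(year,
              variant_share(raw, "labour", "labor", year),
              variant_share(folded, "labour", "labor", year))
```

On this toy data the unfolded index shows the share of "labour" falling from year to year, while the folded index has fewer entries but reports no "labour" at all, so the question about Anglicized spellings can no longer be asked of it. One common compromise, again only a suggestion, is to store the original form alongside the normalized key.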