Friday, October 12, 2012

Logbooks and the long history of digitization

Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.

To read the data in ship's logs we first must know where the data came from. The short answer--ICOADS--might be enough. But working with digitized books has convinced me that knowing the full provenance of your data, through all its twists and turns, is one of the most important parts of any digital humanities project.

Like most humanists, the real digitization projects I care about are books, periodicals, and archives.  A major theme on this blog is the attempt to understand how particular choices in digitization history shape the books available to us.

But ship's logs are interesting because they present a wholly alternate digitization history that can help us understand the mechanics of digitization more clearly. Logs are a digitized data source that has been driving large-scale research projects for  more than 150 years: because of that, they can be a useful abstraction for reflecting on what digitization means. Logbook digitization is an interesting process in its own right; the particular cast of characters--Confederate technocrats, Nazi data thieves--in the history of shipping logs is unique. But the general problems are the same as those found in other large-scale sources of data. Unless humanists intend only to work with data digitized by our own standards, we have to be better at understanding just what can go wrong.

So before I get to those Nazis, let me lay out the basic themes that the story reinforces.



First, digitization is a spectrum, not a condition. There's no one point at which information switches from analog to digital, just as there's no single format that which is abstractly 'computer readable'. It's tempting to draw a bright line at digitization, and to have it separate the old and the modern, the virtual and the real. But the history of digitization stretches much farther back than we might think.

Second, the essence of digitization is abstraction. Abstraction necessarily entails loss; but it also enables new connections by making things directly comparable that weren't before. Humanists tend to worry much more about the 'richness' that we lose by abstracting; these old projects, operating under much greater constraints than anything we have today, can help remind us why abstraction can be useful.

Third, digitization is highly dependent on human labor; and generally lots of it. Whoever chooses the labor chooses what's digitized, and what aspects of the original object make it to the next iteration. And at every line down the chain, these choices get made again and again.

And finally: the great promise of digitization--indeed, the reason to do it all--is about transformation, not preservation. We talk too much about digital 'copies' as if they are supposed to provide mirrors the originals. In fact, preservation is almost never a good enough reason to spur all the effort that digitization entails. A digital artifact is separate from the individual, and offers its own modes of reading; if it doesn't, the work of digitization has been largely wasted. As it becomes possible to think about digitizing more and more aspects of an object, we may tempted to think of the transformation as a flaw. The history of digitization, I think, tells a different story.

For the first two points, I'll talk about the particular history of the Maury collections from which I drew the whaling logs. The third follows that story into the 20th century and across the Atlantic to Dutch and German shipping records; and the fourth point brings the now-computerized records into the present day.

The digitization spectrum: Abstract logs and sailing data.

The whaling ship logbooks I've been looking at started as incredibly detailed, analogue, accounts of a voyage. They were written without any standard forms; and for the substantial contingent of humanists who are concerned about the experiential richness of the act of reading, they have their own pleasures. They included, for example, hand-carved stamps made by the sailors themselves, saying whether a whale was seen, wounded, or killed, depending on its orientation:

From Bauman Rare Books


A single day's entries for a whaling voyage in the digitized files I'm using, on the other hand, looks like this:

1848 6 1     3723 29038 02 4    10ISABE*_N   1   5                                                           165 20779701 69 5 0 1                  FFFFFF77AAAAAAAAAAAA     99 0 790044118480601  3714N 6937W                                                                           NW     51 NW     57 NW     51                                          201A.STEWART       NEW BEDFORD             WHALING VOYAGE           2620 199

Mechanical reproduction has removed more than just the aura from this log. The final version is beyond the limits of what most humanists would consider worth the effort to read. But while it's obvious how much has been lost, it's worth remembering also what's been gained. This one entry is now directly comparable to millions of others; and that abstraction allows all sorts of uses of it other than reveling in its materiality, from tracking trade patterns to--determining the pace of global climate change.

Abstraction is the promise and peril of digitization: and perhaps the most surprising thing in the story of these logs is just how early they began their trip towards the digital.

The most important figure in the history of ship's logbooks is Lt. Matthew Maury.* Maury was the first superintendent of the US Naval Observatory, from 1842 to 1861. (Maury chose his native Virginia when war broke out: that's why, I think, the maps I've made of shipping from his data appear to show a completely collapse in 1861, rather than just a decrease.) His grand project was the creation of atlases of the ocean; from Washington he directed a stream of publications and charts using data from logbooks to increase navigation speeds, choose the route for the transatlantic telegraph cable, and describe the geographical limits of marine life.

*For Maury, as in all else involving ships, I'm leaning heavily on things I think Dael Norwood told me.

Maury didn't digitize, of course: but he did the really hard work of abstracting and reducing experience to data. At first, he hired three people (one in New York, one in Boston, and "an old experienced but retired whaling master" in New Bedford) to find old logbooks and copy over their observations at a rate 1 to 3 cents an entry. (The first version of the Maury charts, about 265,000 days, thus cost a few million dollars in contemporary money). What they produced—and ultimately what Maury began to have ships themselves create—were 'abstract logbooks'. He included one as an appendix to a publication: two contiguous pages look like this:


(click to enlarge)


With the logbook data suitably abstracted, Maury could create atlases and charts intended to speed up shipping routes. Maury's charts are not masterpieces of graphic design; but they did prove immensely useful to sailors. (In the publication I copied these abstract logs from, he bragged that ships sailing to San Francisco after the gold rush with his charts aboard took, on average, two fewer weeks than those without). So useful, in fact, that that national agencies from London to Rome were willing to send logbooks of their voyages to Maury in exchange for the charts. In a virtuous circle of abstraction, international standards emerged around such fine distinctions as the line between a 'fresh' and a 'strong' breeze.

Ship's logs were well down the path to digitization in the 1850s. Maury's standards directly shapes the logs we have available today: only the CLIWOC project extends meteorological observations into the pre-Maury period. You might think we have enormous amounts of digitized ship's logs from the 19th and 20th centuries and relatively few from before 1850 because the books are simply rarer. But in fact, it's been both easier and more valuable to digitize a logbook that was arranged in accordance with Maury's rules because he performed some of the most difficult parts of digitization—selection, standardization, abstraction—before the Civil War.

Depression-era Digitization

Abstract logs are nonetheless a far cry from that snippet of code at the beginning. They are still human-readable; they're printed on paper; and they include that catchall 'notes' column for information that stubbornly resists becoming data.

To understand where I got that snippet, you have to start at the end. I found Maury's logbooks in a US-government supported collection called ICOADS. The US Maury collection is deck 701; other decks include the CLIWOC data I charted in the spring, the movements of the US Navy in the Second World War, and millions of drifting buoy observations from recent decades.

What's a 'deck', you ask? That's where the history of digitization comes back into play. The logbook data goes back in to the era when computers stored information on paper: a deck is a stack of punch cards. (Some nice pictures, not of logs, here). The Maury data takes the form it does because it's been designed to be consistent with old punch card forms to which the abstract logs could be transferred:
(From Wallbrink-Koek)


So when were those forms digitized? Two of the older decks in ICOADS are 192 and 193; they contain records from the Dutch and German merchant marines, respectively, up through about World War II. Hendrik Wallbrink and Frits Koek of the Royal Netherlands Meteorological Institute wrote a comprehensive and at times fascinating summary of how Dutch contributions to our logbook data sets were put together. Assuming that by 'digitized' you mean "converted to a binary, machine-readable format" the answer seems to be: as early as the 1920s and the 1930s. Closer to Maury's time than ours.

Digitization is a process that largely depends on cheap labor. Maury paid old sea captains to track down logs; the Dutch digitization projects seem to have really been spurred on by work-relief programs in the Great Depression. An agreement with the Dutch minister for Social Services and Employment led to the retaining of "4 unemployed ship officers and 8 office clerks" for one year in 1936, and 7 unemployed ship's officers and 8 unemployed office clerks from 1938. (14-15). In 1940, another dozen or two were added, about half officers and half clerks.
(From Wallbrink-Koek)
It should be no surprise that we haven't preserved complete data integrity over the last 75 years. For the Dutch punchers, major problems arose with the German occupation. "48 sailing ship logbooks were sent to the prison of Utrecht for repairs and rebinding" in 1942; H. Keyser was fired in July 1943 "due to differing ideas with the German present." By 1944, only German ships were being punched, and many of the cards were taken away to Berlin.

German punchcard mischief complicated the data from across Europe. A 1983 American report blames the Nazis for spurious duplicates of the Norwegian whaling records I plotted in my whaling post:
Because of a less stringent check upon location in the current duplicate elimination plan, deck 188 was found to be a complete duplicate with deck 192 within the Atlas file. Based upon further research, it was found that the original records from deck 188 (Norwegian Whaling Ships) were probably captured by the Germans and rekeyed during the Nazi Regime under deck 192. The duplication problem came about because ship coordinates in deck 188 were keyed to tenths of a degree while those in deck 192 were only keyed to whole degrees. When data were converted to TDF 11, the tenths positions for deck 192 were placed at zero while those for deck 188 were placed at the keyed value.
Another result of the Nazi occupation: all Dutch ships are ordered not to keep any records on board that show where the ship has been, including weather logs, since they might compromise security.

These tangled histories entail all sorts of compromises in the data. Rather than try to describe everything, let me just visualize some data the Dutch may have keyed in under German occupation that highlights how messy these things can be:


Unlike the Maury or ICOADS data, the problems in data integrity quickly become apparent here: as shipping volume picks up, for example, you'll see that almost no entries were made for the North Atlantic (since the meteorology there was already well understood) in the late 19th century; instead, ship routes start out of nowhere at the equator. Patterns look more normal for the period immediately before the records break off almost entirely during World War I, with the exception of some Baltic sea routes: but new problems appear in the 1930s, where the continuous lines break off into short staccato signals where my algorithm to link a single ship together breaks down under the weight of ambiguous ship identifiers and rounded location paths.

And Deck 192 is better than most of the climatological data for historical research; many decks completely dissociate the original readings from the ships that made them, so the recreation of paths is impossible. Identifier codes are costly things.

The Maury data and the future of logbook digitization
If Deck 192 is in the middle of the pack, Deck 701--the US Maury collection--is towards the top. The records, for example, includes data about the ship's name and departure locations, and the observation locations are not as heavily rounded. This was possible because it was digitized much later than the punchcard era; but the imperatives of cheap labor and abstraction keep it far from perfect.

I've seen contradictory information on this, but I believe that the Maury logs themselves were digitized in Tianjin, China, between 1993 and 1996. The data I've seen still lacks important information like a unique identifier for the originating logbook, let alone real historical metadata about the ships. As the  article linked above points out, an emphasis on speed led to many mistakes. And the desired output format, based on those punch cards from the 1930s, didn't have room for much of the information Maury wanted to track, including those whale sightings. (The IMMA format, is one of the least convenient data formats I've ever seen--you have to read each record before you know which of several related fixed-width formats it's stored in. The expected language of processing is Fortran; I probably would have given up if I hadn't found Philip Brohan's perl module for dealing with IMMA data.) Climatologists are aware of some errors in the data, such as the differences between bucket types in different navies used to measure seawater temperature, but any closer examination yields a number of other problems that are only obvious when you try to treat the voyages as historical artifacts, such as the same ship having two different reported tracks in the same year.

The Maury data, or the CLIWOC, is--I suspect--only of borderline interest to most historians. So will we get more? On the one hand, exciting developments continue: there seem to be serious attempts to digitize the logbooks of East India Company underway; the Old Weather project is bringing crowdsourcing and richer standards of data quality to logbooks, including those of the East India Company (although at a somewhat smaller scale, so far, than the ICOADS data we already have); and somewhere out there is some beautiful data on 1850s German shipping that Maury collected.

But there's plenty of reason to be pessimistic as well. Old weather has been tearing through pages so far; but it's unclear how sustainable crowd-sourcing is as a model for future work.

More importantly, digitization requires paying for labor; and we seem quite unwilling to pay for the speculative benefits it generatezs. Part of it is a breakdown in the collaborative international systems that Maury set up. The Dutch used the Great Depression to advance their data collection efforts with work relief; but every page on the ICOADS site is plastered with a description of funding cuts halting further development.

And even ICOADS or Old Weather does pay for massive new digitization, the rules of abstraction will be designed for the same purposes Maury began with--centered around weather observations, with other uses as a happy side-effect.

For shipping logs, this is probably OK. Freedom from punch cards field-length limitations means Old Weather has all sorts of observations that should be useful for humanists, too (including some full text); and in any case, understanding historical shipping patterns is nowhere near as important as understanding climate change. But if someone else's abstractions and someone else's digitization are so useful for us, imagine if we did more of it ourselves.

Humanists are singularly bad at helping along the abstractions that make digital resources useful. We tend to be proud of seeing the complexities of the data, and easily distracted to still-impossible projects once an abstraction's force has been created by someone else. Maybe it's best to let Google digitize our books and the Nazis digitize ship logs, and then to grouse about what they omitted after the fact. But I suspect the more humanists participate in digitization as a creative project of abstraction, not a last-ditch attempt at preservation, the more we'll be able to do with the digital archives we get, and possibly the more digital materials we'll have at all.

7 comments:

  1. Hi Ben,
    loved this post (of course!) and learned a great deal. I have been looking at early 17th-c. logbooks, and have to say that what Maury calls an 'abstract logbook' is actually very standard format for the daily clean 'journals' (as opposed to the 'waste books' where various navigators and mates jot down info every 4 hours or so). The example you show isn't that different from the first printed Dutch model I've seen, from 1597! (I can send you refs/pics if it would be useful.) And men like Pepys and Colbert spearheaded extensive, fairly systematic efforts to collect navigators' logbooks--to create national repositories, correct errors, etc. Of course, very little came of those, and no one in the earlier centuries succeeded in making the kind of charts Maury did... but they had similar goals! I hadn't considered all this as an early form of digitization... so thank you!
    Another thing, related to your concluding points: I'm actually tangentially involved in a crowd-sourcing archival transcription project -- MarineLives.org -- that's begun working their way through volumes of 17th-c. High Court of Admiralty depositions... They're looking for scholars who want to provide feedback on what metadata would be most useful, etc. Let me know if you have any suggestions!
    I'd love to chat at greater length about all this. Cheers, Margaret

    ReplyDelete
  2. Hi Margaret,

    Even though I said all this in person to yesterday, for the record:

    That's very interesting that the Dutch logbooks were so normally standardized: I would have thought that the primary route of transmission here was from the military (the Beaufort scale for recording wind strength, for example, was adopted internationally in the 1850s from Napoleonic-era Royal Navy terms), and that the commercial shippers would have been generally working from all sorts of standards. Maybe I'll ring up Dael and ask what the early China trade logbooks looked like; I admit to only checking the whaling ones, which tend to look like those in this blog post; completely unstructured, closer to a diary than an accounting register. I know there's endless literature on accounting practices in this period: is there stuff out there about when more standardized logging practices emerged?

    I will check that MarineLives.org out, for sure.

    ReplyDelete
  3. I posted on this back in 2009 when I read a book on ocean racing that mentioned the Maury story. It's great to read things in more depth so your post is awesome!

    For people who want to read how it that data was situated within Clipper ship and ocean racing, here is my book review:

    http://trends.wordpress.com/2009/11/23/ships-and-tools/

    ReplyDelete
  4. Great info. (As an aside, I helped digitize books while in grad school).

    ReplyDelete
  5. Hi Ben. I can add something to all of this. I'm a whaling researcher and "discovered" the Maury Charts in the late 80s when I found a bound uncatalogued set of the Wind and Current Charts in the National Library of Australia. For many years after I searched for the elusive Tally Sheets or whale sightings data (including in the US Whaling Museums and Tianjin Data Set) hoping it would allow us to unpack the data recorded in each 5 degree square. In the end when I was in Washington in 2008 I managed to secure full browse access within the stacks to the Maury material inside the National Archives main building. I spent hours going through it. Despite my confidence that the Tally Sheets must still exist there I found none. I would love to hear from anyone who has any other ideas where they might be.

    ReplyDelete
  6. Here's a link to an example Tally Sheet http://www.google.com.au/imgres?hl=en&sa=X&tbo=d&biw=1680&bih=935&tbm=isch&tbnid=x1m1-j3BLc9jrM:&imgrefurl=http://www.historyshots.com/whalechart/backstory2.cfm&docid=-g8rYL1beUoUfM&itg=1&imgurl=http://www.historyshots.com/images/PlateIX.jpg&w=599&h=533&ei=SwS8UNXAPKudiAec0YCwBA&zoom=1&iact=rc&dur=374&sig=118325796627731548497&page=1&tbnh=141&tbnw=158&start=0&ndsp=53&ved=1t:429,r:10,s:0,i:114&tx=97&ty=61

    ReplyDelete
  7. This comment has been removed by a blog administrator.

    ReplyDelete