To read the data in ship's logs we first must know where the data came from. The short answer--ICOADS--might be enough. But working with digitized books has convinced me that knowing the full provenance of your data, through all its twists and turns, is one of the most important parts of any digital humanities project.
Like most humanists, the real digitization projects I care about are books, periodicals, and archives. A major theme on this blog is the attempt to understand how particular choices in digitization history shape the books available to us.
But ship's logs are interesting because they present a wholly alternate digitization history that can help us understand the mechanics of digitization more clearly. Logs are a digitized data source that has been driving large-scale research projects for more than 150 years: because of that, they can be a useful abstraction for reflecting on what digitization means. Logbook digitization is an interesting process in its own right; the particular cast of characters--Confederate technocrats, Nazi data thieves--in the history of shipping logs is unique. But the general problems are the same as those found in other large-scale sources of data. Unless humanists intend only to work with data digitized by our own standards, we have to be better at understanding just what can go wrong.
So before I get to those Nazis, let me lay out the basic themes that the story reinforces.
First, digitization is a spectrum, not a condition. There's no one point at which information switches from analog to digital, just as there's no single format that which is abstractly 'computer readable'. It's tempting to draw a bright line at digitization, and to have it separate the old and the modern, the virtual and the real. But the history of digitization stretches much farther back than we might think.
Second, the essence of digitization is abstraction. Abstraction necessarily entails loss; but it also enables new connections by making things directly comparable that weren't before. Humanists tend to worry much more about the 'richness' that we lose by abstracting; these old projects, operating under much greater constraints than anything we have today, can help remind us why abstraction can be useful.
Third, digitization is highly dependent on human labor; and generally lots of it. Whoever chooses the labor chooses what's digitized, and what aspects of the original object make it to the next iteration. And at every line down the chain, these choices get made again and again.
And finally: the great promise of digitization--indeed, the reason to do it all--is about transformation, not preservation. We talk too much about digital 'copies' as if they are supposed to provide mirrors the originals. In fact, preservation is almost never a good enough reason to spur all the effort that digitization entails. A digital artifact is separate from the individual, and offers its own modes of reading; if it doesn't, the work of digitization has been largely wasted. As it becomes possible to think about digitizing more and more aspects of an object, we may tempted to think of the transformation as a flaw. The history of digitization, I think, tells a different story.
For the first two points, I'll talk about the particular history of the Maury collections from which I drew the whaling logs. The third follows that story into the 20th century and across the Atlantic to Dutch and German shipping records; and the fourth point brings the now-computerized records into the present day.
The digitization spectrum: Abstract logs and sailing data.
The whaling ship logbooks I've been looking at started as incredibly detailed, analogue, accounts of a voyage. They were written without any standard forms; and for the substantial contingent of humanists who are concerned about the experiential richness of the act of reading, they have their own pleasures. They included, for example, hand-carved stamps made by the sailors themselves, saying whether a whale was seen, wounded, or killed, depending on its orientation:
|From Bauman Rare Books|
A single day's entries for a whaling voyage in the digitized files I'm using, on the other hand, looks like this:
1848 6 1 3723 29038 02 4 10ISABE*_N 1 5 165 20779701 69 5 0 1 FFFFFF77AAAAAAAAAAAA 99 0 790044118480601 3714N 6937W NW 51 NW 57 NW 51 201A.STEWART NEW BEDFORD WHALING VOYAGE 2620 199
Mechanical reproduction has removed more than just the aura from this log. The final version is beyond the limits of what most humanists would consider worth the effort to read. But while it's obvious how much has been lost, it's worth remembering also what's been gained. This one entry is now directly comparable to millions of others; and that abstraction allows all sorts of uses of it other than reveling in its materiality, from tracking trade patterns to--determining the pace of global climate change.
Abstraction is the promise and peril of digitization: and perhaps the most surprising thing in the story of these logs is just how early they began their trip towards the digital.
The most important figure in the history of ship's logbooks is Lt. Matthew Maury.* Maury was the first superintendent of the US Naval Observatory, from 1842 to 1861. (Maury chose his native Virginia when war broke out: that's why, I think, the maps I've made of shipping from his data appear to show a completely collapse in 1861, rather than just a decrease.) His grand project was the creation of atlases of the ocean; from Washington he directed a stream of publications and charts using data from logbooks to increase navigation speeds, choose the route for the transatlantic telegraph cable, and describe the geographical limits of marine life.
*For Maury, as in all else involving ships, I'm leaning heavily on things I think Dael Norwood told me.
(click to enlarge)
With the logbook data suitably abstracted, Maury could create atlases and charts intended to speed up shipping routes. Maury's charts are not masterpieces of graphic design; but they did prove immensely useful to sailors. (In the publication I copied these abstract logs from, he bragged that ships sailing to San Francisco after the gold rush with his charts aboard took, on average, two fewer weeks than those without). So useful, in fact, that that national agencies from London to Rome were willing to send logbooks of their voyages to Maury in exchange for the charts. In a virtuous circle of abstraction, international standards emerged around such fine distinctions as the line between a 'fresh' and a 'strong' breeze.
Ship's logs were well down the path to digitization in the 1850s. Maury's standards directly shapes the logs we have available today: only the CLIWOC project extends meteorological observations into the pre-Maury period. You might think we have enormous amounts of digitized ship's logs from the 19th and 20th centuries and relatively few from before 1850 because the books are simply rarer. But in fact, it's been both easier and more valuable to digitize a logbook that was arranged in accordance with Maury's rules because he performed some of the most difficult parts of digitization—selection, standardization, abstraction—before the Civil War.
Abstract logs are nonetheless a far cry from that snippet of code at the beginning. They are still human-readable; they're printed on paper; and they include that catchall 'notes' column for information that stubbornly resists becoming data.
To understand where I got that snippet, you have to start at the end. I found Maury's logbooks in a US-government supported collection called ICOADS. The US Maury collection is deck 701; other decks include the CLIWOC data I charted in the spring, the movements of the US Navy in the Second World War, and millions of drifting buoy observations from recent decades.
What's a 'deck', you ask? That's where the history of digitization comes back into play. The logbook data goes back in to the era when computers stored information on paper: a deck is a stack of punch cards. (Some nice pictures, not of logs, here). The Maury data takes the form it does because it's been designed to be consistent with old punch card forms to which the abstract logs could be transferred:
Digitization is a process that largely depends on cheap labor. Maury paid old sea captains to track down logs; the Dutch digitization projects seem to have really been spurred on by work-relief programs in the Great Depression. An agreement with the Dutch minister for Social Services and Employment led to the retaining of "4 unemployed ship officers and 8 office clerks" for one year in 1936, and 7 unemployed ship's officers and 8 unemployed office clerks from 1938. (14-15). In 1940, another dozen or two were added, about half officers and half clerks.
German punchcard mischief complicated the data from across Europe. A 1983 American report blames the Nazis for spurious duplicates of the Norwegian whaling records I plotted in my whaling post:
Because of a less stringent check upon location in the current duplicate elimination plan, deck 188 was found to be a complete duplicate with deck 192 within the Atlas file. Based upon further research, it was found that the original records from deck 188 (Norwegian Whaling Ships) were probably captured by the Germans and rekeyed during the Nazi Regime under deck 192. The duplication problem came about because ship coordinates in deck 188 were keyed to tenths of a degree while those in deck 192 were only keyed to whole degrees. When data were converted to TDF 11, the tenths positions for deck 192 were placed at zero while those for deck 188 were placed at the keyed value.Another result of the Nazi occupation: all Dutch ships are ordered not to keep any records on board that show where the ship has been, including weather logs, since they might compromise security.
These tangled histories entail all sorts of compromises in the data. Rather than try to describe everything, let me just visualize some data the Dutch may have keyed in under German occupation that highlights how messy these things can be:
Unlike the Maury or ICOADS data, the problems in data integrity quickly become apparent here: as shipping volume picks up, for example, you'll see that almost no entries were made for the North Atlantic (since the meteorology there was already well understood) in the late 19th century; instead, ship routes start out of nowhere at the equator. Patterns look more normal for the period immediately before the records break off almost entirely during World War I, with the exception of some Baltic sea routes: but new problems appear in the 1930s, where the continuous lines break off into short staccato signals where my algorithm to link a single ship together breaks down under the weight of ambiguous ship identifiers and rounded location paths.
And Deck 192 is better than most of the climatological data for historical research; many decks completely dissociate the original readings from the ships that made them, so the recreation of paths is impossible. Identifier codes are costly things.
The Maury data and the future of logbook digitization
If Deck 192 is in the middle of the pack, Deck 701--the US Maury collection--is towards the top. The records, for example, includes data about the ship's name and departure locations, and the observation locations are not as heavily rounded. This was possible because it was digitized much later than the punchcard era; but the imperatives of cheap labor and abstraction keep it far from perfect.
I've seen contradictory information on this, but I believe that the Maury logs themselves were digitized in Tianjin, China, between 1993 and 1996. The data I've seen still lacks important information like a unique identifier for the originating logbook, let alone real historical metadata about the ships. As the article linked above points out, an emphasis on speed led to many mistakes. And the desired output format, based on those punch cards from the 1930s, didn't have room for much of the information Maury wanted to track, including those whale sightings. (The IMMA format, is one of the least convenient data formats I've ever seen--you have to read each record before you know which of several related fixed-width formats it's stored in. The expected language of processing is Fortran; I probably would have given up if I hadn't found Philip Brohan's perl module for dealing with IMMA data.) Climatologists are aware of some errors in the data, such as the differences between bucket types in different navies used to measure seawater temperature, but any closer examination yields a number of other problems that are only obvious when you try to treat the voyages as historical artifacts, such as the same ship having two different reported tracks in the same year.
The Maury data, or the CLIWOC, is--I suspect--only of borderline interest to most historians. So will we get more? On the one hand, exciting developments continue: there seem to be serious attempts to digitize the logbooks of East India Company underway; the Old Weather project is bringing crowdsourcing and richer standards of data quality to logbooks, including those of the East India Company (although at a somewhat smaller scale, so far, than the ICOADS data we already have); and somewhere out there is some beautiful data on 1850s German shipping that Maury collected.
But there's plenty of reason to be pessimistic as well. Old weather has been tearing through pages so far; but it's unclear how sustainable crowd-sourcing is as a model for future work.
More importantly, digitization requires paying for labor; and we seem quite unwilling to pay for the speculative benefits it generatezs. Part of it is a breakdown in the collaborative international systems that Maury set up. The Dutch used the Great Depression to advance their data collection efforts with work relief; but every page on the ICOADS site is plastered with a description of funding cuts halting further development.
And even ICOADS or Old Weather does pay for massive new digitization, the rules of abstraction will be designed for the same purposes Maury began with--centered around weather observations, with other uses as a happy side-effect.
For shipping logs, this is probably OK. Freedom from punch cards field-length limitations means Old Weather has all sorts of observations that should be useful for humanists, too (including some full text); and in any case, understanding historical shipping patterns is nowhere near as important as understanding climate change. But if someone else's abstractions and someone else's digitization are so useful for us, imagine if we did more of it ourselves.
Humanists are singularly bad at helping along the abstractions that make digital resources useful. We tend to be proud of seeing the complexities of the data, and easily distracted to still-impossible projects once an abstraction's force has been created by someone else. Maybe it's best to let Google digitize our books and the Nazis digitize ship logs, and then to grouse about what they omitted after the fact. But I suspect the more humanists participate in digitization as a creative project of abstraction, not a last-ditch attempt at preservation, the more we'll be able to do with the digital archives we get, and possibly the more digital materials we'll have at all.