Sapping Attention: A brief visual history of MARC cataloging at the Library of Congress.

Tuesday, May 16, 2017

The Library of Congress has released MARC records that I'll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.

A couple notes for the technically inclined:
1. The years are pulled from field 260c (or, if that doesn't exist or is unparseable, from field 008). Years in non-Western calendars are often not converted correctly.
2. There are obviously books from before 1770, but they aren't included.
3. By "books", I mean items in the LC's recently released retrospective (to 2014) "Books all" MARC files (http://www.loc.gov/cds/products/product.php?productID=5), not the serial, map, etc. files. The total number is just over 10 million items.
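
For readers who want to try the fallback described in note 1, here is a minimal sketch. It is not the code used for the chart. The function name extract_year and its two string arguments are hypothetical, and it assumes the 260c and 008 values have already been read out of each record as plain strings. The 008 year is taken from Date 1, which sits at positions 07-10 of that field (characters 8 to 11 when counting from 1).

```r
# Hypothetical helper (not from the original post): find a publication year
# in the 260c string, falling back to Date 1 of the 008 field.
extract_year <- function(field_260c, field_008) {
  if (!is.na(field_260c)) {
    # First run of four digits, e.g. "c1987." gives "1987"
    m <- regmatches(field_260c, regexpr("[0-9]{4}", field_260c))
    if (length(m) == 1) return(as.integer(m))
  }
  if (!is.na(field_008) && nchar(field_008) >= 11) {
    # 008/07-10 (zero-indexed) is characters 8 to 11 in R's 1-indexed substr
    date1 <- substr(field_008, 8, 11)
    # Partly unknown dates such as "19uu" are rejected rather than guessed
    if (grepl("^[0-9]{4}$", date1)) return(as.integer(date1))
  }
  NA_integer_
}

extract_year("c1987.", "870312s1985    nyu")   # year found in 260c
extract_year("[19--?]", "870312s1985    nyu")  # falls back to 008
extract_year(NA, "870312q19uu    nyu")         # nothing parseable: NA
```

Taking the first four-digit run is a deliberately crude rule. It does nothing about non-Western calendar years, which is the conversion problem mentioned in note 1.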

See after the break for the R code to create the chart and the initial version Jacob is talking about in the comments.


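library(tidyverse)  # dplyr (%>%, filter), ggplot2, and stringr (str_wrap)
library(viridis)    # magma() palette; assumed source, since the listing loads no packages itself

# Text annotation with wrapped labels in the chart's condensed font.
my_annotate = function(label,width=40,col="#EEEEEE",...)  {
  annotate(geom="text",...,label=str_wrap(label,width = width),family="OpenSans-CondensedLight", lineheight=0.75,size=3.5,col=col)
}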

plot = data %>% filter(marc_record_created_year < 2017, marc_record_created_year>1965, record_date > 1700, record_date < 2017) %>%
  ggplot() + 
  geom_raster() + 
  aes(x=record_date,y=marc_record_created_year, fill=TextCount) + 
  scale_y_continuous(position=c("right"),expand=c(0,0)) + 
  scale_x_continuous(position=c("bottom"),breaks=seq(1770,2020,by=10),limits=c(1770,2025),expand = c(0,0)) + 

  scale_fill_gradientn("Number\nof books\ncataloged", trans="log10",
                       colors = rev(magma(5)),breaks=outer(c(1,2,5,10),c(1,10,100,1000,10000),"*") %>% as.vector %>% unique) + 
  theme_bw() + 
  theme(legend.key.height = unit(1,"in"),
        plot.title = element_text(size=22)
  ) + 
  annotate("segment",y=1968,yend=2014,x=1968,xend=2014,lwd=1.5,color="white",lty=2) + 
  labs(x="Year of book",y="Year of MARC record",title="MARC cataloging at the Library of Congress",
       subtitle="A brief visual history comparing the year that records were created (left to right) with the year that the books described in them were published" %>% str_wrap) +
  coord_flip() + 
  my_annotate(y=2013,x=2017,label="Books along the dashed line were cataloged in the same year as they were published; this has always been the most common practice",hjust=1,vjust=0,col="black") + 
  my_annotate(y=1995,x=1999,label="Anything to the upper left of this line was (supposedly) cataloged before it was published: this is impossible! These are all errors of some type.",hjust=1,vjust=0,col="black") + 
  my_annotate(y=1975,x=1969,label="MARC cataloging began in 1968; in the first years, only new books were added.",hjust=0,vjust=0.5,width=40,col="#EEEEEE") + 
  annotate(geom="segment",x=1967,xend=1969,y=1967,yend=1975,col="grey") + 
  my_annotate(y=1975,x=1960,label="In the early 1970s catalogers began to input older books: by 1972, there were hundreds of books a year entered from the early twentieth century",hjust=0,vjust=0.8,width=35,col="#EEEEEE") + 
    annotate(geom="segment",x=1910,xend=1960,y=1971,yend=1974.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1950,xend=1960,y=1970,yend=1974.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1960,xend=1960,y=1969,yend=1974.5,col="#DDDDDD") + 
  my_annotate(y=2000,x=1940,label="It took until 2000 for the backlog to be (mostly) cleared: the lighter patches here show that only a few records from the mid-twentieth century were being digitized",hjust=1,vjust=0.8,width=30,col="#EEEEEE") + 
    annotate(geom="segment",x=1940,xend=1955,yend=2003,y=2000.5,col="#DDDDDD") + 
    annotate(geom="segment",x=1940,xend=1920,yend=2003,y=2000.5,col="#DDDDDD") +
    my_annotate(y=1995,x=1902,label="There is a dark band in the year 1900, which is used as a catchall year for books published anytime in the century",hjust=1,vjust=0,width=30) + 
  annotate("rect",ymin=1968, ymax=2014, xmin=1899,xmax=1902,fill=NA,color="#DDDDDD",alpha=.5) + 
  annotate("rect",ymin=1994.8, ymax=1997.2, xmin=1850,xmax=1896,fill=NA,color="#DDDDDD",alpha=.5) + 
  my_annotate(y=1993.5,x=1880,label="A horizontal line shows that 1996 was an especially furious year of digitizing older records from the 19th and 20th centuries",hjust=1,vjust=0.5,width=30) +
      my_annotate(y=2002,x=1830,label="Staircase patterns moving up and to the right show smaller efforts that proceeded in chronological order through a subcollection. It took about 6 years to catalog 25 years worth of books from 1825 to 1850",hjust=1,vjust=0,width=35,col="#111111") + 
    annotate(geom="segment",x=1838,xend=1830,yend=2002,y=2007) +
      annotate(geom="segment",x=1800,xend=1830,yend=2002,y=1999) +

  annotate("rect",ymin=2007,ymax=2014,xmin=1848,xmax=1825,fill=NA,color="black",alpha=.5) +
  annotate("rect",ymin=1997,ymax=2001,xmin=1780,xmax=1800,fill=NA,color="black",alpha=.5)


ggsave(filename="~/Pictures/MARC.png",plot=plot,device="png",width=7.5,height=20)
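
The plotting code expects a data frame called data with one row per combination of book year and record year, holding the columns record_date, marc_record_created_year and TextCount. How that frame was built isn't shown, so the following is only a sketch under stated assumptions: the toy records are invented, the helper marc_year_from_008 is hypothetical, and reading the record year from 008/00-05 ("date entered on file", in yymmdd form) is my guess at the source rather than something the post confirms.

```r
# Hypothetical helper: year a MARC record was created, from 008/00-05 (yymmdd).
# ASSUMPTION: two-digit years 68-99 mean 19yy and 00-67 mean 20yy,
# because MARC cataloging began in 1968.
marc_year_from_008 <- function(field_008) {
  yy <- suppressWarnings(as.integer(substr(field_008, 1, 2)))
  ifelse(yy >= 68L, 1900L + yy, 2000L + yy)
}

# Toy records, invented for illustration: one row per catalog record.
records <- data.frame(
  record_date = c(1987L, 1987L, 1900L, 1987L),
  field_008 = c("870312s1987    nyu", "871105s1987    nyu",
                "960520s1900    xx ", "880104s1987    nyu"),
  stringsAsFactors = FALSE
)
records$marc_record_created_year <- marc_year_from_008(records$field_008)

# Collapse to one row per (book year, record year) cell with a record count.
# dplyr::count(records, record_date, marc_record_created_year, name = "TextCount")
# would give the same counts.
records$TextCount <- 1L
data <- aggregate(TextCount ~ record_date + marc_record_created_year,
                  data = records, FUN = sum)
print(data)
```

Base R's aggregate() is used here only to keep the sketch free of package dependencies.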

22 comments:

Jacob, May 16, 2017 at 1:25 PM
Cataloged before publication... actually these indicate MARC (pre-)records added by LoC that reflect information from pre-publication copies deposited at LoC. See this Wikipedia article for the full lowdown (https://en.wikipedia.org/wiki/Cataloging_in_Publication). You might also want to actually check with a cataloger before you publish your (wrong) conclusions. #domainexpertisestillking

Jacob, May 16, 2017 at 4:58 PM
Sorry to sound like a grumpy cat but I wouldn't assume that those dates are wrong unless you can find someone who can authoritatively state that they are. That said, there are sure to be lots of things that look suspicious and might be outliers. But it's not easy to just say something is wrong because frequently we know things will be published at some (unfixed) point in the future (case in point - George RR Martin's Winds of Winter novel). Looking at the area above your curve and noting that in the 1980s there seem to be records for the 2010s (or in the example in your reply), I'd say that not only is it possible, it's normal, because of CIP. The more interesting thing is how it begins to contract and intersect as we move through time. This may be for several reasons.
1) You might have an arbitrary cut-off point for the visualization, and the contraction is an artifact of that.
2) The practice of CIP might be ending.
3) The lead time between when a publisher provides CIP data and the actual publication date of the item is getting smaller.
4) Both 2 and 3 are occurring.

I'd be inclined to think that it's #4, but I'm surprised there isn't a noticeable change in the curve when ebooks are introduced. It is possible, though, that there are no ebooks (or MARC records for them) in your dataset. But I also don't want to discount #1.

I'm also suspicious of the 1900 line. As we've explained to you in the past, MARC records have a number of places where the reliability or contextual meaning for that date is recorded. Always recording it as 1900 is misleading because sometimes it means "20th century" but it could also mean any of "sometime between 1900 and 2000", "sometime between 1900 and 1910", "circa 1900", and similar ranges or estimated dates. The dark color is actually indicating the semantic overlap among particular values. The MARC record should have additional information which is typically distributed across several MARC fields to help disambiguate it but additional parsing and some assembly would be required to make sense of "1900". Development of a semantic classifier might be helpful here.

As for the example in your reply, I'm not sure about your Serbian MARC records, but as it looks systematic, i.e., there are a lot of them cataloged with the claim that the record is from the future, I'd suggest it might be an artifact of something at the policy level. Without contacting the source library or any of the catalogers involved in making the records, we can't tell whether it is an error or it simply violates our expectations for what the value of that data element should be.

Karen Coyle, May 16, 2017 at 9:07 PM
There was a concerted retrospective conversion effort that took LC's card catalog and converted it to MARC at a facility in Aberdeen, Scotland. It was called REMARC. You can find some data at

http://dx.doi.org/10.1108/eb046867

http://dx.doi.org/10.1300/J124v02n03_02

and undoubtedly others. This was in the mid-1980s. I wonder if this fits anywhere into your timeline. I don't think it explains the 1996 band, but I also do not know exactly when LoC integrated this data into its own catalog. Many other libraries used this data, as it was sold as a service. Would love to hear if you find evidence of this.

chachaker, May 19, 2017 at 2:00 AM
For the Serbian books above: the record date is actually the correct year of publication (according to the online catalogue maintained by the National Library of Serbia), if it helps the discussion.
Bogdan

Unknown, May 19, 2017 at 8:50 AM
Data mining MARC records isn't going to help you understand the books and their classifications. Visualizing some dates also won't help you get the history of how "digital card catalogs" (drawers full of digital 3x5 paper cards?) were created, either. You missed out on all the changes in date coding practice, subjective current practices and subjective historical applications of work that most of us have spent years and money on Master's degrees, and years of practice, understanding ourselves. DH is so much more than collecting matching data points and visualizing them!

Terrell Russell, May 19, 2017 at 11:33 PM
"A horizontal line shows that 1996..."

you mean "vertical"?