Tuesday, August 28, 2007

A true story of archival woe

There's been an interesting discussion on the ImageLib list about "archival" DVDs (although the thread is titled "LZW compression for archival masters"). The question of best media for storage of archival master images comes up frequently. In general, we have four options:
  • CD-R. Up to 700 MB of data. "Archival" CD-Rs use a gold reflective layer and a light-sensitive dye, such as cyanine (blue) or phthalocyanine (transparent), which the laser marks to record data on the disc. Manufacturers claim up to 300 years of expected life, barring heat, scratches, and other physical damage.
  • DVD-R. Up to 4.7 GB of data. Like CD-Rs, these are available with gold reflective layers and phthalocyanine dyes, although the dyes are weaker than those used in CD-Rs because of the different lasers used for high-density discs. Still, manufacturers claim lifespans of 50-100 years.
  • HDD (hard disk drives). Multi-terabyte drives are now available. The newest high-density drives often use so-called "vertical storage," which refers not to the orientation of the drive unit but to the alignment of the bits on the spinning platter. Hard drives are magnetic storage media and typically have a lifespan of 5-10 years when properly stored. Heat is a major killer of hard drives, so some sources recommend powering them down to minimize the heat caused by spinning, while others prefer to find alternate ways to dissipate heat.
  • Magnetic tape. Virtually unlimited storage. Magnetic tape is the most common type of storage for large datasets; the tape has a typical lifespan of 20-30 years depending on storage conditions (heat and humidity being the prime issues), but the equipment used to create and read the data is typically obsolete long before that, creating its own set of preservation challenges.
Tim Vitale has nicely summarized some of the available options in "Digital Imaging in Conservation: File Storage" in the January 2006 issue of AIC News. See also TASI's guide to "Using CD-R and DVD-R for Digital Preservation."

In real life, however, all of these concerns are moot. My first question when people ask me what type of storage media to use is "what do you plan to do with the images?" CD-Rs and even DVDs are relatively cheap storage, which offers the advantage that you might actually make multiple copies and store them in multiple physical locations, but it is tedious to retrieve information off them. Also note that they are most reliable when slow write speeds (2x-8x) are used! Magnetic tape is a specialized type of storage and generally requires a bit more technical savvy to store and retrieve information; the size of the datasets generally stored on tape also makes it less likely that multiple copies will be created. Hard disks are increasingly inexpensive and portable; options like RAID 5 decrease the likelihood of errors, while relatively fast read/write capabilities and high storage density make it more likely that people will actually use the images stored on disk.

Cost is not an issue. It is almost always cheaper to store the images properly than to re-image them in case of disaster. Here's a case study. The Texas State Library and Archives Commission (my employer) digitized the Republic Claims series of records, which consists of Comptroller records from the Republic of Texas era (1835-1846). A total of 182,000 images were created from microfilm masters.

The work was performed in two stages. The first set of images was created by a division of the State Library. Approximately 60,000 images were stored on 20 CD-Rs. The second set of images was produced by an outside vendor; approximately 120,000 images were stored on 19 CD-Rs. Without knowing any specific bid details, I would guess that there were differences in resolution between the two vendors; all the master images were supposedly uncompressed B&W TIFFs.

Both vendors burned the images to CD-R; State Library staff created PDFs from the master images for use in an online database. Five years later, that database had to be redeveloped, and staff decided to go back through the masters and create thumbnails and access images. At that time, they discovered that the entire batch of CD-Rs from the second vendor was unreadable.

I've looked at some of the bad CD-Rs. They seem to be ordinary consumer-grade CD-Rs, not archival gold, but they were stored properly and have no obvious physical damage. It is not clear whether the discs are unreadable due to "disc rot" (see the Wikipedia article on CD rot) or if the adhesive used by the vendor to affix labels to the discs caused physical damage. Either way, the discs are unusable. Because the State Library kept the original documents and had preservation-quality microfilm, preservation of the digital images was not a primary consideration, and no one lost sleep over the lost images, but...

Currently, digitization from microfilm costs approximately $0.10 per image. To recreate the masters for this particular collection will cost approximately $12,000, which is money that could have been used to digitize material that is in need of preservation. The vendor saved a few pennies per disc, and the state saved more by not making a copy of the master images. Congratulations.
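For anyone who wants to run the same comparison on their own collection, here is a rough sketch of the arithmetic. The image count, per-image price, and disc count come from the figures above; the price of a blank archival disc is purely a hypothetical placeholder, and the function names are mine.

```python
def reimaging_cost(image_count: int, cents_per_image: int) -> float:
    """Dollars needed to re-digitize image_count images from microfilm."""
    return image_count * cents_per_image / 100


def duplicate_set_cost(disc_count: int, cents_per_disc: int) -> float:
    """Dollars needed to burn one extra copy of a set of discs (media only)."""
    return disc_count * cents_per_disc / 100


# Figures taken from this post.
second_batch_images = 120_000   # approximate size of the unreadable batch
cents_per_image = 10            # $0.10 per image from microfilm
second_batch_discs = 19         # CD-Rs delivered by the outside vendor

# ASSUMPTION: a hypothetical price for one archival-grade blank disc.
assumed_cents_per_disc = 200

redo = reimaging_cost(second_batch_images, cents_per_image)
backup = duplicate_set_cost(second_batch_discs, assumed_cents_per_disc)

print(f"Cost to re-image the lost batch: ${redo:,.2f}")
print(f"Cost of one duplicate disc set (assumed price): ${backup:,.2f}")
```

Swap in your own numbers; the sketch ignores staff time for burning and checking the copies, so treat the duplicate-set figure as a floor rather than a full estimate.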

I don't know what we'll eventually do with this collection. Because the digital images were never great (low-resolution black and white images made from microfilm), I think that we can batch-process the "use" PDFs to create perfectly adequate thumbnails and access images. However, this was a great lesson, and I'll be using it frequently in presentations: the true cost of digital preservation should be measured in terms of the cost to do the work over again, not the cost of material and supplies.

1 comment:

Ross Cunniff said...

Hey, Danielle,

I just stumbled across your blog. And, coincidentally, stumbled across this article on Slashdot. Interesting, although I don't think the TGDaily article actually goes into enough detail to decide whether what the UCSC researchers are proposing will actually work.