Wednesday, August 29, 2007

Ancestry.com and the Internet Biographical Collection Debacle

I'm very interested in searching across genealogical databases and websites, since one of the focuses of my Texas Heritage Online search application is resources for genealogists. This makes sense, as so many of the resources from the Texas State Library and Archives Commission and other state agencies are developed for genealogists.

With libraries, archives, and museums, I pretty much know what's out there and I can get them to index it for me in a form that's searchable. However, there's a TON of material out there in the form of static web pages -- for example, the Texas State Library has developed a series of web exhibits, such as Texas Treasures, that I cannot search currently. Google can, but I can't. The Texas State Library has signed an agreement with the Archive-It service from the Internet Archive to archive copies of state agency web pages for posterity as part of our TRAIL program, and I will eventually be able to search those pages from Texas Heritage Online. This made me start thinking about whether I could do the same thing for content from non-state agencies. A big target here would be the materials posted to the Texas GenWeb.

Over the weekend, Ancestry.com rolled out a search application that uses technology similar to Archive-It's. Like Archive-It, Ancestry.com's "Internet Biographical Collection" (free, but registration required as of this writing) presents you by default with an archived, or "cached," copy of each page. The difference is that IA keeps multiple versions of a page, so you can see what it contained on any particular date. Ancestry adds some value by indexing names and dates (how, exactly, I'm not sure) and by rolling the collection into their multi-database search engine.

The outcry from genealogists has been fierce. They feel like their pages -- and, presumably, their family histories, given the nature of the material -- have been "stolen" or "hijacked." For discussions, see Kimberly Powell's "The Legality of Caching" on About.com; Susan Kitchens has uncovered some of the technical details of Ancestry.com's bot and posted them on her Family Oral History blog.

I don't know how much of this is anti-Ancestry.com backlash (which reminds me -- the Ancestry Insider blog is worth reading), but it makes me wonder if I'm going to need to do any future web search integration into Texas Heritage Online as an opt-in program. I was going to be very selective, anyway, about sites to include, as another audience is K-12 educators and students, and I can't allow age-inappropriate material in my search results! Sounds like a focus group is needed -- which will significantly delay any implementation of this type of tool.

Update: Ancestry.com pulled the collection down this afternoon, according to 24/7 Family History Circle, which is, more or less, the official Ancestry.com blog.

Tuesday, August 28, 2007

Bringing the Open Library to SXSW

As an update to my post about "Collaborative Cataloging," I'm pleased to say that the Texas State Library and Archives Commission has proposed a panel discussion for the 2008 SXSW Interactive conference in Austin. Here's our proposal:

Why Do We Need Libraries Anyway?

On June 25, 2007, California recognized the Internet Archive as an official library. As digital resources become a formal part of our civic structure we ask: How are physical and virtual libraries used, what are the emotional connotations of being a library, and what do we do with librarians?

Our perceptions of libraries and librarians are often based in childhood nostalgia or media stereotypes (The Music Man's Marian the Librarian, Hogwarts' Madam Pince, Noah Wyle's Flynn Carsen, "The Librarian"), but today's library is as much about bytes as about books. Come join us for a discussion about the future of libraries and information generally in a networked world!

We have lined up speakers Aaron Swartz (Raw Thought blog), leader of the Internet Archive's Open Library project, and Lorcan Dempsey (blog), Vice President for Research and Chief Strategist for OCLC, the Online Computer Library Center, home of WorldCat, to offer their thoughts on libraries, both physical and virtual, and on the services that librarians provide. The panel will be moderated by Danielle Cunniff Plumer (The Darchivist), coordinator of the Texas Heritage Digitization Initiative at the Texas State Library and Archives Commission and project manager of Texas Heritage Online.

If this sounds like a fun program (and it will be!), be sure to vote for it on SXSW Interactive's Panel Picker.

A true story of archival woe

There's been an interesting discussion on the ImageLib list about "archival" DVDs (although the thread is titled "LZW compression for archival masters"). The question of best media for storage of archival master images comes up frequently. In general, we have four options:
  • CD-R. Up to 700 MB of data. "Archival" CD-Rs use a gold reflective layer and some kind of light-sensitive dye, such as cyanine (blue) or phthalocyanine (transparent), which the laser marks to record the data optically. Manufacturers claim up to 300 years of expected life, barring heat, scratches, and other physical damage.
  • DVD-R. Up to 4.7 GB of data. Like CD-Rs, these are available with gold reflective layers and phthalocyanine dyes, although the dyes are weaker than those used in CD-Rs because of the different lasers used for high-density discs. Still, manufacturers claim lifespans of 50-100 years.
  • HDD (hard disk drives). Multi-terabyte drives are now available. The newest high-density drives often use so-called "vertical storage" (more formally, perpendicular recording), which refers not to the orientation of the drive unit but to the alignment of the bits on the spinning platter. Hard drives are magnetic storage media and typically have a lifespan of 5-10 years when properly stored. Heat is a major killer of hard drives, so some sources recommend powering them down to minimize the heat caused by spinning, while others prefer to find alternate ways to dissipate heat.
  • Magnetic tape. Virtually unlimited storage. Magnetic tape is the most common type of storage for large datasets; the tape has a typical lifespan of 20-30 years depending on storage conditions (heat and humidity being the prime issues), but the equipment used to create and read the data is typically obsolete long before that, creating its own set of preservation challenges.
Tim Vitale has nicely summarized some of the available options in "Digital Imaging in Conservation: File Storage" in the January 2006 issue of AIC News. See also TASI's guide to "Using CD-R and DVD-R for Digital Preservation."

In real life, however, all of these concerns are moot. My first question when people ask me what type of storage media to use is "what do you plan to do with the images?" CD-Rs and even DVD-Rs are relatively cheap storage, which offers the advantage that you might actually make multiple copies and store them in multiple physical locations, but it is tedious to retrieve information from them. Also note that they are most reliable when slow write speeds (2x-8x) are used! Magnetic tape is a specialized type of storage and generally requires a bit more technical savvy to store and retrieve information; the size of the datasets generally stored on tape also makes it less likely that multiple copies will be created. Hard disks are increasingly inexpensive and portable; options like RAID 5 decrease the likelihood of losing data when a single drive fails, while relatively fast read/write capabilities and high storage density make it more likely that people will actually use the images stored on disk.

Cost is not an issue. It is almost always cheaper to store the images properly than to re-image them in case of disaster. Here's a case study. The Texas State Library and Archives Commission (my employer) digitized the Republic Claims series of records, which consists of Comptroller records from the Republic of Texas era (1835-1846). A total of 182,000 images were created from microfilm masters.

The work was performed in two stages. The first set of images was created by a division of the State Library; approximately 60,000 images were stored on 20 CD-Rs. The second set was created by an outside vendor; approximately 120,000 images were stored on 19 CD-Rs. Without knowing any specific bid details, I would guess that there were differences in resolution between the two sets; all the master images were supposedly uncompressed B&W TIFFs.
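The disc counts above allow a quick sanity check on that guess. The following is only a rough sketch: it assumes every disc was filled to the nominal 700 MB CD-R capacity, which is not known, so the results are upper bounds on the average file size rather than measurements. The function name is made up for illustration.

```python
# Upper bound on the average image size in each set, ASSUMING every
# CD-R was filled to its nominal 700 MB capacity. The real fill level
# of the discs is unknown, so these are ceilings, not measurements.
CD_R_CAPACITY_MB = 700


def max_avg_image_kb(discs: int, images: int) -> float:
    """Largest possible average file size (KB) if `discs` full CD-Rs hold `images` files."""
    return discs * CD_R_CAPACITY_MB * 1000 / images


# Approximate figures from the post.
first_set_kb = max_avg_image_kb(discs=20, images=60_000)
second_set_kb = max_avg_image_kb(discs=19, images=120_000)

print(f"First set:  at most ~{first_set_kb:.0f} KB per image")
print(f"Second set: at most ~{second_set_kb:.0f} KB per image")
```

If both sets of discs were filled to a similar degree, the second set averaged roughly half as much data per image as the first, which would be consistent with a lower resolution (or with compression, despite the "uncompressed" label).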

Both groups burned the images to CD-R; State Library staff created PDFs from the master images for use in an online database. Five years later, that database had to be redeveloped, and staff decided to go back through the masters and create thumbnails and access images. At that time, they discovered that the entire batch of CD-Rs from the outside vendor was unreadable.

I've looked at some of the bad CD-Rs. They seem to be ordinary consumer-grade CD-Rs, not archival gold, but they were stored properly and have no obvious physical damage. It is not clear whether the discs are unreadable due to "disc rot" (see the Wikipedia article on CD rot) or if the adhesive used by the vendor to affix labels to the discs caused physical damage. Either way, the discs are unusable. Because the State Library kept the original documents and had preservation-quality microfilm, preservation of the digital images was not a primary consideration, and no one lost sleep over the lost images, but...

Currently, digitization from microfilm costs approximately $0.10 per image. To recreate the masters for this particular collection will cost approximately $12,000, which is money that could have been used to digitize material that is in need of preservation. The vendor saved a few pennies per disc, and the state saved more by not making a copy of the master images. Congratulations.
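For anyone who wants to plug in their own numbers, the arithmetic can be sketched as below. The per-image rate and image count are the approximate figures from this case; the per-disc price and number of extra copies are purely hypothetical placeholders, not what anyone actually paid.

```python
# Compare the cost of re-imaging lost masters with the cost of having
# made extra copies in the first place. Amounts are in whole cents so
# the arithmetic stays exact.
COST_PER_IMAGE_CENTS = 10      # ~$0.10 per image from microfilm (from the post)
LOST_IMAGES = 120_000          # approximate size of the unreadable set (from the post)
DISCS_IN_SET = 19              # discs delivered by the vendor (from the post)

# HYPOTHETICAL values, for illustration only:
PRICE_PER_DISC_CENTS = 200     # assumed price of one archival-grade CD-R
EXTRA_COPIES = 2               # assumed number of duplicate sets


def reimaging_cost_usd(images: int, cents_per_image: int) -> float:
    """Cost in dollars to digitize `images` frames again."""
    return images * cents_per_image / 100


def duplication_cost_usd(discs: int, copies: int, cents_per_disc: int) -> float:
    """Media cost in dollars of burning `copies` extra sets of `discs` discs."""
    return discs * copies * cents_per_disc / 100


redo = reimaging_cost_usd(LOST_IMAGES, COST_PER_IMAGE_CENTS)
copies = duplication_cost_usd(DISCS_IN_SET, EXTRA_COPIES, PRICE_PER_DISC_CENTS)

print(f"Re-imaging the lost set:     ${redo:,.2f}")
print(f"Extra copies (hypothetical): ${copies:,.2f}")
```

The duplication figure covers media only and ignores staff time, but even with a much higher disc price the gap between the two numbers stays large.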

I don't know what we'll eventually do with this collection. Because the digital images were never great (low-resolution black and white images made from microfilm), I think that we can batch-process the "use" PDFs to create perfectly adequate thumbnails and access images. However, this was a great lesson, and I'll be using it frequently in presentations: the true cost of digital preservation should be measured in terms of the cost to do the work over again, not the cost of material and supplies.

On the road....

I have a number of trips planned for the next few months. I'm always happy to make arrangements to meet with folks in the area if I have time, so send me an email if I'll be in your part of the state!
  • August 21-23: Arlington & Dallas
  • September 19-22: Abilene
  • September 27-30: El Paso
  • October 2-7: Denver, CO