I'm very interested in searching across genealogical databases and websites, since resources for genealogists are one of the focuses of my Texas Heritage Online search application. That makes sense: so many of the resources from the Texas State Library and Archives Commission and other state agencies are developed with genealogists in mind.
With libraries, archives, and museums, I pretty much know what's out there and I can get them to index it for me in a form that's searchable. However, there's a TON of material out there in the form of static web pages -- for example, the Texas State Library has developed a series of web exhibits, such as Texas Treasures, that I cannot search currently. Google can, but I can't. The Texas State Library has signed an agreement with the Archive-It service from the Internet Archive to archive copies of state agency web pages for posterity as part of our TRAIL program, and I will eventually be able to search those pages from Texas Heritage Online. This made me start thinking about whether I could do the same thing for content from non-state agencies. A big target here would be the materials posted to the Texas GenWeb.
Over the weekend, Ancestry.com rolled out a search application using technology similar to Archive-It. Like Archive-It, Ancestry.com's "Internet Biographical Collection" (free, but registration required as of this writing) presents you by default with an archived, or "cached," copy of a page (the difference being that the Internet Archive keeps multiple versions of a page, so you can see what it contained on any particular date). Ancestry adds some value by indexing names and dates (how, exactly, I'm not sure) and by rolling the collection into its multi-database search engine.
The outcry from genealogists has been fierce. They feel that their pages -- and, presumably, their family histories, given the nature of the material -- have been "stolen" or "hijacked." For discussion, see Kimberley Powell's "The Legality of Caching" on About.com; Susan Kitchens has uncovered some of the technical details of Ancestry.com's bot and posted them on her Family Oral History blog.
I don't know how much of this is anti-Ancestry.com backlash (which reminds me -- the Ancestry Insider blog is worth reading), but it makes me wonder whether any future web search integration in Texas Heritage Online will need to be run as an opt-in program. I was going to be very selective about which sites to include anyway, since another audience is K-12 educators and students, and I can't allow age-inappropriate material in my search results! Sounds like a focus group is needed -- which will significantly delay any implementation of this type of tool.
Update: Ancestry.com pulled the collection down this afternoon, according to 24/7 Family History Circle, which is, more or less, the official Ancestry.com blog.
Wednesday, August 29, 2007
Tuesday, August 28, 2007
Bringing the Open Library to SXSW
As an update to my post about "Collaborative Cataloging," I'm pleased to say that the Texas State Library and Archives Commission has proposed a panel discussion for the 2008 SXSW Interactive conference in Austin. Here's our proposal:
Why Do We Need Libraries Anyway?
On June 25, 2007, California recognized the Internet Archive as an official library. As digital resources become a formal part of our civic structure we ask: How are physical and virtual libraries used, what are the emotional connotations of being a library, and what do we do with librarians?
Our perceptions of libraries and librarians are often based in childhood nostalgia or media stereotypes (The Music Man's Marian the Librarian, Hogwarts' Madam Pince, Noah Wyle's Flynn Carsen, "The Librarian"), but today's library is as much about bytes as about books. Come join us for a discussion about the future of libraries and information generally in a networked world!
We have lined up speakers Aaron Swartz (Raw Thought blog), leader of the Internet Archive's Open Library project, and Lorcan Dempsey (blog), Vice President for Research and Chief Strategist for OCLC, the Online Computer Library Center, home of WorldCat, to offer their thoughts on libraries, both physical and virtual, and on the services that librarians provide. The panel will be moderated by Danielle Cunniff Plumer (The Darchivist), coordinator of the Texas Heritage Digitization Initiative at the Texas State Library and Archives Commission and project manager of Texas Heritage Online.

If this sounds like a fun program (and it will be!), be sure to vote for it on SXSW Interactive's Panel Picker.
A true story of archival woe
There's been an interesting discussion on the ImageLib list about "archival" DVDs (although the thread is titled "LZW compression for archival masters"). The question of best media for storage of archival master images comes up frequently. In general, we have four options:
- CD-R. Up to 700 MB of data. "Archival" CD-Rs use a gold reflective layer and a light-sensitive dye, such as cyanine (blue) or phthalocyanine (nearly transparent), in which the data are written optically. Manufacturers claim a lifespan of up to 300 years, barring heat, scratches, and other physical damage.
- DVD-R. Up to 4.7 GB of data. Like CD-Rs, these are available with gold reflective layers and phthalocyanine dyes, although the dyes are weaker than those used in CD-Rs because of the different lasers used for high-density discs. Still, manufacturers claim lifespans of 50-100 years.
- HDD (hard disk drives). Multi-terabyte drives are now available. The newest high-density drives often use so-called "vertical storage" (perpendicular recording), which refers not to the orientation of the drive unit but to the alignment of the bits on the spinning platter. Hard drives are magnetic storage media and typically have a lifespan of 5-10 years when properly stored. Heat is a major killer of hard drives, so some sources recommend powering them down to minimize the heat generated by spinning, while others prefer to find alternate ways to dissipate heat.
- Magnetic tape. Virtually unlimited capacity. Magnetic tape is the most common type of storage for large datasets; the tape itself has a typical lifespan of 20-30 years depending on storage conditions (heat and humidity being the prime issues), but the equipment used to write and read the data is typically obsolete long before that, creating its own set of preservation challenges.
In real life, however, all of these concerns are moot. My first question when people ask me what type of storage media to use is "what do you plan to do with the images?" CD-Rs and even DVDs are relatively cheap storage, which offers the advantage that you might actually make multiple copies and store them in multiple physical locations, but retrieving information from them is tedious. Also note that they are most reliable when slow write speeds (2x-8x) are used! Magnetic tape is a specialized type of storage and generally requires a bit more technical savvy to write and retrieve information; the size of the datasets typically stored on tape also makes it less likely that multiple copies will be created. Hard disks are increasingly inexpensive and portable; options like RAID 5 decrease the likelihood of errors, while relatively fast read/write capabilities and high storage density make it more likely that people will actually use the images stored on disk.
Cost is not an issue. It is almost always cheaper to store the images properly than to re-image them in case of disaster. Here's a case study. The Texas State Library and Archives Commission (my employer) digitized the Republic Claims series of records, which consists of Comptroller records from the Republic of Texas era (1835-1846). A total of 182,000 images were created from microfilm masters.
The work was performed in two stages. The first set of images was created by a division of the State Library; approximately 60,000 images were stored on 20 CD-Rs. The second set was produced by an outside vendor; approximately 120,000 images were stored on just 19 CD-Rs. Without knowing any specific bid details, I would guess that there were differences in resolution between the two batches; all the master images were supposedly uncompressed B&W TIFFs.
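Some quick back-of-the-envelope arithmetic supports that guess. The sketch below (plain Python, using only the disc and image counts from this post plus the nominal 700 MB CD-R capacity from the list above, and assuming each disc was more or less full) computes the images per disc and the implied ceiling on average file size for each batch:

```python
# Implied average image size per batch, assuming each CD-R held
# roughly its nominal 700 MB capacity (an assumption, not a measurement).
CDR_CAPACITY_MB = 700

batches = {
    "State Library division": {"images": 60_000, "discs": 20},
    "outside vendor": {"images": 120_000, "discs": 19},
}

for name, b in batches.items():
    images_per_disc = b["images"] / b["discs"]
    max_avg_size_kb = CDR_CAPACITY_MB * 1024 / images_per_disc
    print(f"{name}: ~{images_per_disc:,.0f} images per disc, "
          f"at most ~{max_avg_size_kb:,.0f} KB per image")
```

The vendor packed roughly twice as many images onto each disc (about 6,300 versus 3,000), which caps the average file size at roughly 110 KB versus 240 KB, consistent with lower resolution, or with compression despite the uncompressed-TIFF spec.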
In both cases the images were burned to CD-R; State Library staff created PDFs from the master images for use in an online database. Five years later, that database had to be redeveloped, and staff decided to go back through the masters to create thumbnails and access images. At that point they discovered that the entire batch of CD-Rs from the outside vendor was unreadable.
I've looked at some of the bad CD-Rs. They seem to be ordinary consumer-grade CD-Rs, not archival gold, but they were stored properly and show no obvious physical damage. It is not clear whether the discs are unreadable due to "disc rot" (see the Wikipedia article on CD rot) or whether the adhesive the vendor used to affix labels to the discs caused the damage. Either way, the discs are unusable. Because the State Library kept the original documents and had preservation-quality microfilm, preservation of the digital images was not a primary consideration, and no one lost sleep over the lost images, but...
Currently, digitization from microfilm costs approximately $0.10 per image. Recreating the masters for this particular collection will therefore cost approximately $12,000 (120,000 lost images at $0.10 each), money that could have been used to digitize material that actually needs preservation. The vendor saved a few pennies per disc, and the state saved a bit more by not making a second copy of the master images. Congratulations.
I don't know what we'll eventually do with this collection. Because the digital images were never great (low-resolution black-and-white images made from microfilm), I think we can batch-process the "use" PDFs to create perfectly adequate thumbnails and access images. Still, this was a great lesson, and I'll be using it frequently in presentations: the true cost of digital preservation should be measured in terms of the cost of doing the work over again, not the cost of materials and supplies.
On the road....
I have a number of trips planned for the next few months. I'm always happy to make arrangements to meet with folks in the area if I have time, so send me an email if I'll be in your part of the state!
- August 21-23: Arlington & Dallas
- September 19-22: Abilene
- September 27-30: El Paso
- October 2-7: Denver, CO
Thursday, July 19, 2007
Collaborative Cataloging
This week, two new projects that attempt to provide a framework for collaborative cataloging came to my attention. The first, Freebase, describes itself as "a global knowledge base: a structured, searchable, writeable and editable database built by a community of contributors, and open to everyone," and my impression is that it's trying to build a grass-roots version of the Semantic Web. I just got an invite to try the alpha version today, so I'll post more about it when I've had a chance to experiment. For now, the O'Reilly Radar piece from March 2007 has a bit of information.
The second project is the Open Library project from the Internet Archive. Their vision: "Imagine a library that collected all the world's information about all the world's books and made it available for everyone to view and update." Pretty ambitious!
The big announcement came on Aaron Swartz's Raw Thought. For those folks who don't know who Aaron is, the summary at LISNews.org is hilarious:
What are you supposed to feel about Aaron Swartz? He co-authored RSS, served on the W3C's RDF Core Working Group, helped the wonderful John Gruber design the amazing Markdown, and developed and gave away software like rss2email that many of us use every day... and then he graduated high school.
Open Library got mentioned on a few blogs and lists -- Jessamyn posted to librarian.net about it, saying "[Open source cataloging is] a weird juxtaposition, the idea of authority and the idea of a collaborative project that anyone can work on and modify," and quite a few other blogs picked up on the discussion.
I thought the best discussions happened on non-librarian blogs, frankly, particularly on Slashdot, where Swartz popped up to explain the vision. Deborah Richman posted the following blurb to the Search Engine Watch blog:
So who is quietly trying to solve your search and discovery problem? Librarians. This week, a new searching mechanism was announced by the OpenLibrary project, with the audacious goal of providing information about every book on the planet. No ordinary catalog here, as OpenLibrary relies on the considered librarianship of everyone who uses or contributes to it.
As usual, librarians are experimenting with access, resources and usability. We’re happy to follow their lead. In this case, it’s digital librarian and archivist Brewster Kahle, who started the Wayback Machine and has been thinking about open access for years. Yet almost no one heard about this effort, and it’s pretty interesting!
http://blog.searchenginewatch.com/blog/070718-032552
You can't buy publicity like this!
So what is Open Library? Essentially, it's a wiki (built on Infogami). Slashdotters compared it (repeatedly) to IMDb and Project Gutenberg, but I think it's more like a WorldCat.org that anyone can edit. Like WorldCat, it's based on authority records, primarily from the Library of Congress, to which books digitized as part of the Open Content Alliance project have been added.
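To make the "wiki of catalog records" idea concrete, here is a purely hypothetical sketch in Python -- the field names, the edit function, and the revision structure are my own illustration, not Open Library's actual schema or API:

```python
import datetime

# Illustrative only: a wiki-style bibliographic record that anyone can edit,
# with every change recorded so bad edits can be rolled back.
record = {
    "title": "The Canterbury Tales",
    "author": "Geoffrey Chaucer",
    "lccn": None,       # identifier carried over from a Library of Congress record
    "editions": [],     # FRBR-ish grouping: individual editions under one work
    "revisions": [],    # full edit history, wiki-style
}

def edit_record(record, editor, changes):
    """Apply a contributor's changes while preserving the previous values."""
    previous = {field: record.get(field) for field in changes}
    record.update(changes)
    record["revisions"].append({
        "editor": editor,
        "when": datetime.datetime.utcnow().isoformat(),
        "previous": previous,
    })

edit_record(record, "some_contributor",
            {"editions": [{"publisher": "William Caxton", "year": 1476}]})
print(len(record["revisions"]), "revision(s) on file")
```

The edit history is what makes open editing survivable (bad changes can be reverted), and grouping editions under a single work record is exactly the kind of structure that FRBRization would formalize.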
There are some major hurdles to overcome. As any cataloger knows, the records at WorldCat are hardly perfect; what happens when authority control goes Wikipedia? How do you deal with editions? Can the records be FRBRized? How do you prevent vandalism?
Despite these questions, I think this is pretty exciting stuff. I'm always interested in how information forms evolve, and I tend to think wikis in general are the next evolution of the book (see, for example, the WikiBooks project); they're almost infinitely better for collaboration than other forms of editing. This could be the next evolution of cataloging, particularly when we can start plugging some web services onto it. Hmmm...
Monday, July 9, 2007
Turning the Page
One of my favorite memories of a 2002 trip to England with my husband was a visit to the British Library at St Pancras. The St Pancras building, which opened in 1997, looks much too modern for my tastes, but the inside is a medievalist's fantasy: the King's Library, four levels of stacks enclosed in glass, holding works collected during the reign of George III, including one particularly meaningful to me, Caxton's first edition of Chaucer's Canterbury Tales, printed in 1476.
I was reminded of that trip because last week I caught an announcement on Resource Shelf about a public release of Turning the Pages 2.0, the software used by the British Library, developed by Armadillo Systems with support from Microsoft. In 2004, the digitized books at St Pancras were almost as mind-blowing to me as the King's Library itself. I'd followed Kevin Kiernan's Electronic Beowulf project, but seeing digitized versions of the Lindisfarne Gospels, the Magna Carta, and other great medieval works made me realize that This Was What I Wanted to Do With My Life.
Turning the Pages 2.0 is a nice system. It allows annotations and pan and zoom, as well as the 3D page-turning effect that is possibly its most memorable feature. Unlike the previous Shockwave version of Turning the Pages, it runs only in Internet Explorer on Windows Vista or Windows XP SP2 with the latest version of the .NET Framework installed. For that reason (and because it is a commercial, proprietary product), it may not be the right tool for every project. Here are a few other products worth considering:
3-D "page-turners":
- LuraTech offers LuraWave, a proprietary system for viewing and manipulating JPEG2000 images. It provides pan and zoom in addition to page-turning animations.
- Flash is the current standard for page-turn effects. To learn how to create a Flash page-turn applet, see the tutorial by Sham Bhangal, author of Flash Hacks.
- Microsoft's new Silverlight platform also offers page-turn effects. Microsoft is challenging Adobe's dominance of the rich-media market, but it's not clear whether they'll succeed.
- Adobe's Digital Editions is an end-user solution, which moves the burden of technology from the digital library to the user (although digital libraries will want to test to make sure their books display correctly). You might want to look at the review on if:book, though, before choosing to use it (or forcing your users to).
Pan and Zoom:
- Zoomify is a proprietary solution, but it offers a free option, Zoomify EZ. The Enterprise option adds annotation capabilities. I know that the folks at UNT's Portal to Texas History are using it.
- PowerWeb Zoom from Dart Communications is a proprietary Ajax-based solution; the company also offers a free version.
- LizardTech's GeoExpress supports both the proprietary MrSID format and the free DjVu format, two different image-compression solutions. MrSID works best with large images, such as maps, while DjVu works better with images containing text.
- If you're not put off by the name, GSV (Giant-Ass Image Viewer) offers a JavaScript alternative with an open-source license. A companion Python library assists with tiling large images (see the sketch after this list).
- Of course, there's the PDF option as well. It requires Adobe Reader or another PDF viewer, but pretty much everyone has a PDF plug-in nowadays.
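To show roughly what that image-tiling step involves -- this is a generic sketch using the Pillow imaging library, not GSV's own tiling script, and the file paths are placeholders -- a large master image is cut into fixed-size tiles at several zoom levels so the viewer only ever fetches the pieces currently on screen:

```python
import os
from PIL import Image  # Pillow imaging library

def make_tiles(source_path, out_dir, tile_size=256, zoom_levels=3):
    """Cut a large image into fixed-size tiles at several zoom levels.

    A generic illustration of the tiling behind pan-and-zoom viewers;
    real viewers (GSV, Zoomify, etc.) each expect their own layout and naming.
    """
    master = Image.open(source_path).convert("RGB")
    full_w, full_h = master.size
    for level in range(zoom_levels):
        # Level 0 is the most zoomed-out; each subsequent level doubles the detail.
        scale = 2 ** (zoom_levels - 1 - level)
        w, h = max(full_w // scale, 1), max(full_h // scale, 1)
        scaled = master.resize((w, h))
        level_dir = os.path.join(out_dir, f"level_{level}")
        os.makedirs(level_dir, exist_ok=True)
        for top in range(0, h, tile_size):
            for left in range(0, w, tile_size):
                tile = scaled.crop((left, top,
                                    min(left + tile_size, w),
                                    min(top + tile_size, h)))
                tile.save(os.path.join(level_dir, f"{top}_{left}.jpg"), quality=85)

# Example call (hypothetical filenames):
# make_tiles("master_page.tif", "tiles/page001")
```

The viewer then requests only the tiles covering the region and zoom level currently in view, which is why these tools stay responsive even for very large maps and page scans.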