Wednesday, September 09, 2009

Digital Preservation Matters - 09 September 2009

Harvard's Web Archive Collection Service (WAX). Website. September 2009.

This site began as a pilot project to address the management of web sites by collection managers for long-term archiving. It is designed to capture, manage, store and display web sites in an archive. “With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” It is built using open source tools such as the Heritrix web crawler; the Wayback index and rendering tool; and the NutchWAX index and search tool. Documents concerning WAX are available at:

Glossary of Web Archiving Terms. Molly Bragg. Internet Archive website. August 06, 2009.

Part of the Archive-It help section. This page has a glossary of web archiving terms. The rest of the wiki has some good information about archiving web sites.

Missing links: the enduring web. Web Archiving Consortium Workshop. 21 July 2009.

Web pages are at risk. The early web pages are of similar historical importance to prehistoric writings, and both are at risk. “Key issues for long-term access and preservation remain unresolved.” This site includes the presentations from the workshop. Some of these include:

  • Web Archive and Citation Repository in One: DACHS
  • The future of researching the past of the Internet
  • Web Archiving Tools: An Overview
  • Context and content: Delivering Coordinated UK Web Archive to User Communities
  • Capture and Continuity: Broken links and the UK Central Government Web Presence
  • Diamonds in the Rough: Capturing and Preserving Online Content from Blogs
  • Beyond Harvest: Long Term Preservation of the UK Web Archive
  • From Web Page to Living Web Archive
  • Emulating access to the web
  • What we want with web-archives; will we win?

The following items are a few of the presentations at the Web Archiving Consortium Workshop:

Web Archive and Citation Repository in One: DACHS. Hanno Lecher.


  1. Capturing and archiving relevant resources as primary source for later research
  2. Providing citation repository for authors and publishers

When citing online resources:

  • Verify URL references
  • Evaluate reliability of online resources
  • Use PURLs
  • Tools include: Snagit, Zotero, WebCite, DACHS Citation Repository

“the best current solution to improve access to Internet references is for publishers to require capture and submission of all Internet information at the time of manuscript consideration.”

The future of researching the past of the Internet. Eric T. Meyer.

It may not be desirable to capture an entire web site, so consider which sub-links to follow. A ‘seed’ is a site from which other sites can be discovered through its links. Other points to consider: annotating the web sites; moving from snapshots of a site to more continuous data capture; and how to share the results in a meaningful way.
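As a rough illustration of how a seed leads to other pages, the sketch below pulls the links out of a seed page's HTML and keeps only those inside a chosen part of the site. It uses only the Python standard library. The sample HTML, the `http://example.org/` URLs, and the names `LinkCollector` and `links_in_scope` are assumptions made for illustration; they do not come from the presentation, and a real crawler applies far richer scoping rules than a prefix match.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_in_scope(seed_url, html, scope_prefix):
    """Return absolute links found in `html` that start with `scope_prefix`.

    Hypothetical helper: scope is decided by a simple URL prefix.
    """
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(seed_url, href)
        # Drop fragments so page#a and page#b count as one page.
        absolute = absolute.split("#", 1)[0]
        if absolute.startswith(scope_prefix) and absolute not in found:
            found.append(absolute)
    return found


# Made-up seed page content, for illustration only.
SAMPLE_HTML = """
<a href="/blog/2009/post-1.html">Post 1</a>
<a href="/blog/2009/post-1.html#comments">Comments</a>
<a href="/about.html">About</a>
<a href="http://other.example.com/">Elsewhere</a>
"""

if __name__ == "__main__":
    for link in links_in_scope("http://example.org/", SAMPLE_HTML,
                               "http://example.org/blog/"):
        print(link)
```

Restricting the scope to a prefix is one simple way to capture part of a site rather than all of it; the links that fall outside the scope could instead be reviewed as candidate seeds.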

Web Archiving Tools: An Overview. Helen Hockx-Yu.

  1. Selection: have a policy, decide what to capture.
  2. Collect data files (snapshot), examine for other sites to be collected, add to collection list
  3. Store the archived files on disk, virus check, integrity check
  4. Make accessible, index, add metadata, render the files, ensure long term access
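The integrity check in step 3 is commonly done by recording a checksum when a file is stored and recomputing it later. The following is a minimal sketch of that idea, assuming SHA-256 and an in-memory manifest. The function names and the manifest layout are illustrative assumptions, not something the presentation specifies.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths):
    """Build a manifest mapping each path to its checksum at storage time."""
    return {path: sha256_of(path) for path in paths}


def verify_fixity(manifest):
    """Return the paths that are missing or whose checksum has changed."""
    return [path for path, expected in manifest.items()
            if not os.path.exists(path) or sha256_of(path) != expected]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "snapshot.html")
        with open(path, "wb") as handle:
            handle.write(b"<html>archived page</html>")
        manifest = record_fixity([path])
        print(verify_fixity(manifest))  # nothing has changed yet
        with open(path, "ab") as handle:
            handle.write(b"corruption")
        print(verify_fixity(manifest))  # the altered file is reported
```

In practice the manifest would be stored separately from the archived files, so that damage to one does not silently affect the other.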

Heritrix is the most commonly used tool; the Web Curator Tool is also used.

The Web ARChive (WARC) format is coming into use. Other tools needed:

  • Rendering, such as the open source Wayback Machine
  • Full-text search, such as Nutch/NutchWAX or Hanzo tools
  • Provide other search/retrieval options (subject, collection, site name, change over time)

There is no consensus on strategy, practices or specific tools. Crawlers work with HTML, but not with more advanced designs or tools. Archives need to handle problem sites and decide what to duplicate.

Context and content: Delivering Coordinated UK Web Archive to User Communities. Cathy Smith.

The presentation starts with two questions:

  1. What audiences should web archives anticipate and what does this mean for selection, ingest and preservation?
  2. What will the web be like as an historical source, and what use will be made of archived web sites by future generations?


  1. Institutions continue to provide access to their individual collections, where appropriate, to support researchers; assure integrity of collections; allow integration with the institution’s other, non-web holdings;
  2. Coordinate with other institutions by sharing collection development policies; defining the metadata standard, and developing technical interfaces.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Richard Davis.

“New genres of publications are becoming increasingly important to participants. For example, blogs are cited as a good window into what expert practitioners are doing. This material is not duplicated in traditional sources, yet it is important to consult”. Perceived barriers of web archiving are the cost of implementation and the complexity of available tools. Institutional blog archives are part of the institutional record. They should go through a selection process; support authenticity and fixity; be persistent and citable. Blogs seem to be an area where the content is of primary importance and design is secondary. Create an institutional or thematic archive by using a WordPress database to gather and store the posts and comments and provide access.

Chinese HD DVD Successor Outsells Blu-Ray Discs in China. Anton Shilov. X-bit labs. July 27, 2009.

A Chinese HD DVD standard (CBHD) is outselling Blu-ray in China. Optical disc manufacturers that produce Blu-ray are not planning to support it, because they see little support for this standard outside China.

Wednesday, September 02, 2009

Digital Preservation Matters - 02 September 2009

Archival Masters - An RUcore Case Study. Ron Jantz, Isaiah Beard. Duraspace Case Studies. September 2009.

This case study summarizes the practices that Rutgers University Libraries has developed over a period of years for handling archival masters in their Fedora system. The practices are recognized as compromises between preservation theory and practice, and will be valuable for others dealing with similar problems. The case study looks at topics such as policies, critical technologies, persistent IDs, normalizing archival masters, using checksums, documenting architectures, generating presentation files, content models, file formats, and others. Video files have been their greatest challenge.

A Data Deluge Swamps Science Historians. Robert Lee Hotz. The Wall Street Journal. August 28, 2009.

The first curator of e-Manuscripts in the British Library struggles with archiving the flood of computer materials. “Never have so many people generated so much digital data or been able to lose so much of it so quickly.” More technical data has been collected in the past year than in all previous years combined. “The problem is forcing historians to become scientists, and scientists to become archivists and curators.” People are overwhelmed with all the data. “What you keep and how you pay for it are difficult issues.”

Time to clean up your digital closet. Chris O'Brien. Mercury News. August 3, 2009.

What will happen to data you have stored on devices that become outdated? People don’t really think about it. There isn’t an easy solution, and there may never be one due to the dynamic nature of computers. There are some strategies you can put in place. “You will need to start thinking like a librarian and become an active curator of your files. That means relentlessly organizing, labeling and tagging, backing up and deleting.” Keep only the essential data. Develop a system for organizing files online and offline and remember where they are. Label every file and tag it with as much information as you can. Make multiple copies. Investigate ways to keep track of all this and update it regularly.

Think Tank: Google must let us forget. James Harkin. The Sunday Times. August 9, 2009.

With all the data that is now being stored online, there needs to be a way to purge unwanted information. Some companies are gathering information about people from public sites and storing it in a single database. Some data about individuals may be posted by other people. Some say we are creating a “digital memory that vastly exceeds the capacity of our collective human mind”, that there needs to be a way of forgetting the unimportant elements. One way suggested is to put an expiry date on data, then to remove the information on that date.
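The expiry-date idea can be sketched as a store that attaches an expiry time to each item and forgets anything past it. This is only a minimal illustration, assuming an in-memory dictionary. The class `ExpiringStore` and its methods are hypothetical and are not described in the article.

```python
from datetime import datetime, timedelta


class ExpiringStore:
    """Hypothetical in-memory store where every item carries an expiry date."""

    def __init__(self):
        self._items = {}  # key -> (value, expires_at)

    def put(self, key, value, expires_at):
        self._items[key] = (value, expires_at)

    def get(self, key, now):
        """Return the value, or None if it is missing or has expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._items[key]  # forget expired data on access
            return None
        return value

    def purge(self, now):
        """Remove every expired item and return how many were removed."""
        expired = [k for k, (_, exp) in self._items.items() if now >= exp]
        for key in expired:
            del self._items[key]
        return len(expired)


if __name__ == "__main__":
    store = ExpiringStore()
    created = datetime(2009, 8, 9)
    store.put("post", "holiday photo caption", created + timedelta(days=30))
    print(store.get("post", created + timedelta(days=1)))   # still available
    print(store.get("post", created + timedelta(days=31)))  # expired
```

A store like this can only forget its own copy; anything already copied elsewhere is outside its control.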

This article will self-destruct: A tool to make online personal data vanish. Hannah Hickey. University of Washington website. July 21, 2009.

Computers have made it difficult for data to be left behind, but the University of Washington has developed a way to make data expire with a system called Vanish. Vanish, a free, open-source tool that works with Firefox, can place a time limit on text uploaded to any Web service through a Web browser. “After a set time period, electronic communications such as e-mail, Facebook posts and chat messages would automatically self-destruct, becoming irretrievable from all Web sites, inboxes, outboxes, backup sites and home computers. Not even the sender could retrieve them.” It is intended to make information as private as a “phone conversation”.

The Norwegian National Digital Library. Marianne Takle. Ariadne. July 2009.

The National Library of Norway is establishing itself as a digital national library. It plans to digitize its entire collection and has added other practices and strategies. Resources have been redistributed to give priority to digitization, documents are being deposited in digital format, and agreements are in place for digital deposits. It is making collections available to users over the Internet. The three Guiding Principles of Selection for the library are:

  1. A strategy and priority for different collections: books (oldest information); newspapers (those in demand); photos (donations); music (endangered sound formats).
  2. The thematic selection of material across all media types
  3. Follow up enquiries from other users and institutions and co-operate with them

The greatest obstacle to making information available is copyright.