Wednesday, September 09, 2009

Digital Preservation Matters - 09 September 2009

Harvard's Web Archive Collection Service (WAX). Website. September 2009.

This site began as a pilot project to address the management of web sites by collection managers for long-term archiving. It is designed to capture, manage, store and display web sites in an archive. “With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” It is built using open source tools such as the Heritrix web crawler; the Wayback index and rendering tool; and the NutchWAX index and search tool. Documents concerning WAX are available at:

Glossary of Web Archiving Terms. Molly Bragg. Internet Archive website. August 06, 2009.

Part of the Archive-It help section. This page has a glossary of web archiving terms. The rest of the wiki has some good information about archiving web sites.

Missing links: the enduring web. Web Archiving Consortium Workshop. 21 July 2009.

Web pages are at risk. Early web pages are of similar historical importance to prehistoric writings, and both are at risk. “Key issues for long-term access and preservation remain unresolved.” This site includes the presentations from the workshop. Some of these include:

  • Web Archive and Citation Repository in One: DACHS
  • The future of researching the past of the Internet
  • Web Archiving Tools: An Overview
  • Context and content: Delivering Coordinated UK Web Archive to User Communities
  • Capture and Continuity: Broken links and the UK Central Government Web Presence
  • Diamonds in the Rough: Capturing and Preserving Online Content from Blogs
  • Beyond Harvest: Long Term Preservation of the UK Web Archive
  • From Web Page to Living Web Archive
  • Emulating access to the web
  • What we want with web-archives; will we win?

The following items are a few of the presentations at the Web Archiving Consortium Workshop:

Web Archive and Citation Repository in One: DACHS. Hanno Lecher.


  1. Capturing and archiving relevant resources as primary sources for later research
  2. Providing a citation repository for authors and publishers

When citing online resources:

  • Verify URL references
  • Evaluate reliability of online resources
  • Use PURLs
  • Tools include: Snagit, Zotero, WebCite, DACHS Citation Repository
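As an illustration of the "verify URL references" step, here is a minimal Python sketch. It is not taken from the presentation: the function names `http_status` and `check_references`, the three labels, and the example URLs are all assumptions made for illustration. The demonstration uses a stand-in fetcher so it runs without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status for a URL, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_references(urls, fetch=http_status):
    """Label each cited URL as 'ok', 'broken', or 'unreachable'."""
    report = {}
    for url in urls:
        status = fetch(url)
        if status is None:
            report[url] = "unreachable"
        elif 200 <= status < 400:
            report[url] = "ok"
        else:
            report[url] = "broken"
    return report


# Offline demonstration: a stand-in fetcher with made-up statuses.
fake_statuses = {
    "http://example.org/paper.html": 200,
    "http://example.org/moved.html": 404,
}
report = check_references(
    list(fake_statuses) + ["http://no-such-host.invalid/"],
    fetch=lambda url: fake_statuses.get(url),
)
```

Calling `check_references(urls)` without the `fetch` argument would issue real HEAD requests. A status check only shows that a URL still resolves, not that the page still says what was cited, which is why capture tools such as those listed above remain necessary.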

“the best current solution to improve access to Internet references is for publishers to require capture and submission of all Internet information at the time of manuscript consideration”

The future of researching the past of the Internet. Eric T. Meyer.

You may not want to capture an entire web site, so consider which sub-links to follow. A ‘seed’ is a site from which other sites can be discovered through its links. Other points to look at: annotating the web sites; moving from snapshots of a site to more continuous data capture; and how to share the results in a meaningful way.
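To make the idea of a seed concrete, here is a small Python sketch that reads one seed page and separates the links that stay on the same site from those that lead elsewhere. It is only an illustration under stated assumptions: the names `LinkCollector` and `discover_links` and the example page are hypothetical, and a real crawler such as Heritrix does far more (scoping rules, politeness, scheduling).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(seed_url, html):
    """Split the links on a seed page into same-site and external URLs."""
    parser = LinkCollector()
    parser.feed(html)
    seed_host = urlparse(seed_url).netloc
    internal, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(seed_url, href)  # resolve relative links
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if urlparse(absolute).netloc == seed_host else external
        if absolute not in target:
            target.append(absolute)
    return internal, external


# Hypothetical seed page, held in memory so the sketch needs no network access.
SEED_URL = "http://example.org/index.html"
SEED_HTML = """
<a href="/about.html">About</a>
<a href="news/2009.html">News</a>
<a href="http://example.com/partner">Partner</a>
<a href="mailto:info@example.org">Contact</a>
"""

internal, external = discover_links(SEED_URL, SEED_HTML)
```

Here `internal` holds the sub-links a collection manager might choose to capture, while `external` holds candidate sites discovered through the seed.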

Web Archiving Tools: An Overview. Helen Hockx-Yu.

  1. Selection: have a policy, decide what to capture.
  2. Collect data files (snapshot), examine for other sites to be collected, add to collection list
  3. Store the archived files on disk, virus check, integrity check
  4. Make accessible, index, add metadata, render the files, ensure long term access
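The integrity check in step 3 is commonly done with checksums. The Python sketch below is one possible approach, not the method described in the presentation: it records a SHA-256 digest for each stored file at ingest time and later reports any file whose digest has changed or which has gone missing. The function names and the temporary directory are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's relative path to its checksum at ingest time."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(directory, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(directory)
    problems = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return problems


# Demonstration on a throwaway directory standing in for archive storage.
with tempfile.TemporaryDirectory() as store:
    snapshot = Path(store) / "snapshot.html"
    snapshot.write_bytes(b"<html>captured page</html>")
    manifest = build_manifest(store)
    clean_report = verify_manifest(store, manifest)
    snapshot.write_bytes(b"<html>altered page</html>")  # simulate corruption
    damaged_report = verify_manifest(store, manifest)
```

In practice the manifest would be stored separately from the files it describes, so that damage to the storage does not also damage the record used to detect it.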

Heritrix is the most commonly used tool; the Web Curator Tool is also used.

The Web ARChive (WARC) format is coming into use. Other tools needed:

  • Rendering, such as the open source Wayback Machine
  • Full-text search, such as Nutch/NutchWAX or Hanzo tools
  • Provide other search/retrieval options (subject, collection, site name, change over time)

There is no consensus on strategy, practices, or specific tools. Crawlers work with HTML, but not with more advanced designs or tools. Archives need a way to handle problem sites and must decide what to duplicate.

Context and content: Delivering Coordinated UK Web Archive to User Communities. Cathy Smith.

The presentation starts with two questions:

  1. What audiences should web archives anticipate and what does this mean for selection, ingest and preservation?
  2. What will the web be like as an historical source, and what use will be made of archived web sites by future generations?


  1. Institutions continue to provide access to their individual collections, where appropriate, to support researchers; assure integrity of collections; allow integration with the institution’s other, non-web holdings;
  2. Coordinate with other institutions by sharing collection development policies; defining the metadata standard, and developing technical interfaces.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Richard Davis.

“New genres of publications are becoming increasingly important to participants. For example, blogs are cited as a good window into what expert practitioners are doing. This material is not duplicated in traditional sources, yet it is important to consult”. Perceived barriers to web archiving are the cost of implementation and the complexity of available tools. Institutional blog archives are part of the institutional record. They should go through a selection process; support authenticity and fixity; and be persistent and citable. Blogs seem to be an area where the content is of primary importance and design is secondary. An institutional or thematic archive can be created by using a WordPress database to gather and store the posts and comments and to provide access.
