Web-Archiving. Maureen Pennock.
DPC Technology Watch Report 13-01. March 2013. Publicly released 24 May 2013.
This report is intended for those wanting to develop a better understanding of the issues and options for archiving web content, and for those intending to set up a web archive. Web archiving technology allows valuable web content to be preserved and managed for future generations.
Web content is lost at an alarming rate, and our digital cultural memory and organizational accountability are at risk. Organizational needs and resources must be considered when choosing web archiving tools and services. Issues with web archiving include selection of content, authenticity and integrity, quality assurance, duplication of content, legal rights, viruses, and the long-term preservation of resources. Web archiving is not a single action; it often involves a suite of applications used in various ways at different stages of the archiving process. Archiving tools may include commercial services, Web Curator Tool, NetarchiveSuite, the Heritrix web crawler, Wget, and the Wayback access interface. Archiving a simple website may be straightforward, but archiving large numbers of websites for the long term is much more complicated and requires a more complex solution. The International Internet Preservation Consortium has played a key role in developing both archiving tools and standards such as WARC.
There are three main technical approaches:
1. Client-side archiving, using web crawlers such as Heritrix or HTTrack
2. Transactional archiving, which addresses the capture of client-side transactions
3. Server-side archiving, which requires active participation from publishing organizations
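As a rough illustration of the first approach, a client-side crawler repeatedly fetches a page, extracts the links it contains, and queues those that fall within the crawl scope. The sketch below shows only the link-extraction and scoping step, using the Python standard library. It is not how Heritrix or HTTrack work internally; the names `LinkExtractor` and `in_scope_links`, the same-host scoping rule, and the example URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from href/src attributes in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def in_scope_links(base_url, html):
    """Return de-duplicated links on the same host as base_url, in page order.

    Same-host scoping is an assumed policy; real crawlers use richer rules.
    """
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in parser.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical page content; no network access is needed for this step.
page = (
    '<a href="/about">About</a>'
    '<img src="logo.png">'
    '<a href="http://other.example/">Elsewhere</a>'
    '<a href="/about">About again</a>'
)
found = in_scope_links("http://example.org/index.html", page)
print(found)
```

A full crawler would add fetching, politeness delays, robots.txt handling, and storage of the captured responses, all of which are omitted here.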
Another option being explored is the use of RSS feeds to identify and pull content into a web archive.
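The RSS idea can be sketched in a few lines: read a feed, collect the link of each item, and pass those links to whatever capture tool is in use. The example below assumes a plain RSS 2.0 feed and parses an inline sample rather than fetching one; the function name `feed_item_links` and the sample feed content are hypothetical.

```python
import xml.etree.ElementTree as ET


def feed_item_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


# Hypothetical feed used in place of a live one.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://example.org/</link>
    <item>
      <title>First story</title>
      <link>http://example.org/news/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>"""

seeds = feed_item_links(SAMPLE_FEED)
print(seeds)
```

In practice the resulting list would serve as seed URLs for a crawler, with the feed polled on a schedule so that new items are captured soon after publication.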
Despite all the efforts to capture and manage web content, web archives still face significant challenges, such as quality assurance issues, the need for more capable tools, and the need for better legislation. "The technical challenges of web archiving cannot, and should not, be addressed in isolation."