Wednesday, March 02, 2016

Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives

Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives. Justin F. Brunelle, et al. D-Lib Magazine. January/February 2016.
     This is a case study about using two open-source, web-scale web archiving tools, Heritrix and the Wayback Machine, on a corporate Intranet. Public web archives cannot reach Intranet-based resources, such as corporate content. Past research has shown that "web pages' reliance on JavaScript to construct representations leads to a reduction in archivability". The Internet Archive uses Heritrix and the Wayback Machine to archive web resources and replay mementos on the public web.
The article recommends that content authors use robots.txt and noarchive HTTP response headers to keep sensitive information from being archived. Accidentally archiving sensitive information can result in the loss of other mementos stored in the same WARC. Recommendations include:
  • Use smaller storage devices to limit the problems if sensitive information is crawled;
  • Develop a way to remove a sensitive memento from a WARC file;
  • Identify high-risk vs. low-risk archival targets within the Intranet.
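
As a rough illustration of the robots.txt and noarchive recommendation above, the sketch below shows how a crawl pipeline might decide whether a page may be archived. It is not taken from the article and is not how Heritrix itself is configured: the function name may_archive, the user agent "intranet-archiver", the intranet.example host, and the use of the X-Robots-Tag response header to carry the noarchive directive are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a content author might publish on the Intranet.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hr/
"""


def may_archive(url, robots_txt, response_headers, user_agent="intranet-archiver"):
    """Return True only if neither robots.txt nor a noarchive header objects.

    robots_txt is the text of the site's robots.txt file;
    response_headers is a dict of HTTP response headers for the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return False

    # Header names are case-insensitive, so normalise before the lookup.
    headers = {name.lower(): value for name, value in response_headers.items()}
    # Assumption: the directive arrives as a plain comma-separated list,
    # e.g. "X-Robots-Tag: noindex, noarchive".
    directives = {d.strip().lower() for d in headers.get("x-robots-tag", "").split(",")}
    return "noarchive" not in directives


if __name__ == "__main__":
    page = "http://intranet.example/projects/index.html"
    print(may_archive(page, ROBOTS_TXT, {}))
    print(may_archive(page, ROBOTS_TXT, {"X-Robots-Tag": "noarchive"}))
    print(may_archive("http://intranet.example/hr/salaries.html", ROBOTS_TXT, {}))
```

Checking both signals before a page is written to a WARC is one way to reduce the chance that a sensitive resource has to be removed afterwards; whether a given crawler honors either signal depends on how it is configured.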
Archiving Intranet content needs to fit within a larger documentation plan, which requires knowing which key resources and elements must be preserved in order to preserve corporate memory. There is value in a corporation having a web-crawling archiving strategy, although crawling is not always the best fit: it "may make more sense for a corporate archives to preserve information about its corporation's projects that is tracked in a database and served to an Intranet through an export directly from the database rather than crawling the Intranet for the project data".

The case study and its proposed next steps can help archive corporate memory, improve information longevity, and support corporate archivists in implementing web archiving strategies.

