This is a case study of using two open-source, web-scale archiving tools, Heritrix and the Wayback Machine, inside a corporate Intranet. Public web archives cannot crawl Intranet-based resources, such as corporate content. Past research has shown that "web pages' reliance on JavaScript to construct representations leads to a reduction in archivability". The Internet Archive uses Heritrix to archive web resources and the Wayback Machine to replay mementos on the public web.
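The article recommends that content authors use robots.txt and noarchive HTTP response headers to keep sensitive information out of the archive. Accidentally archiving sensitive information can result in the loss of other mementos stored in the same WARC, since there is not yet a way to remove a single memento from a WARC file. The recommendations include: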
- Use smaller storage devices to limit the problems if sensitive information is crawled;
- Develop a way to remove a sensitive memento from a WARC file;
- Identify high-risk vs. low-risk archival targets within the Intranet.
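As a rough illustration of the robots.txt and noarchive advice above, the sketch below shows how a crawler might decide whether a page may be archived. It is not how Heritrix implements these checks. The function names, the crawler name, and the example URLs are hypothetical, and it assumes the "noarchive HTTP response header" is the commonly used `X-Robots-Tag` header, which the article summary does not spell out.

```python
from typing import Mapping
from urllib.robotparser import RobotFileParser


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def header_forbids_archiving(headers: Mapping[str, str]) -> bool:
    """Return True if an X-Robots-Tag header carries a noarchive directive.

    Assumption: only the simple comma-separated form is handled,
    not the per-crawler "botname: directive" form.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = {d.strip().lower() for d in value.split(",")}
            if "noarchive" in directives:
                return True
    return False


def may_archive(robots_txt: str, user_agent: str, url: str,
                headers: Mapping[str, str]) -> bool:
    """Archive only if robots.txt allows the fetch and no noarchive header is set."""
    return robots_allows(robots_txt, user_agent, url) and not header_forbids_archiving(headers)


if __name__ == "__main__":
    # Hypothetical Intranet robots.txt that keeps crawlers out of /hr/.
    robots = "User-agent: *\nDisallow: /hr/\n"
    print(may_archive(robots, "intranet-crawler",
                      "http://intranet.example/wiki/page", {}))
    print(may_archive(robots, "intranet-crawler",
                      "http://intranet.example/hr/salaries", {}))
    print(may_archive(robots, "intranet-crawler",
                      "http://intranet.example/wiki/draft",
                      {"X-Robots-Tag": "noarchive"}))
```

Both checks depend on content authors marking sensitive pages ahead of time, which is why the article pairs them with recommendations for limiting the damage when something sensitive is crawled anyway.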
The case study and its proposed next steps will help preserve corporate memory, improve information longevity, and guide corporate archivists in implementing web archiving strategies.