The Smithsonian Institution Archives preserves the Institution’s history, including its large web presence. The Archives crawls each website using Heritrix, an open-source tool created by the Internet Archive, to capture content in an archival format. The purpose is to preserve the appearance, behavior, and content of digital objects. The Archives tailors crawl configurations to each website to capture as much of it as possible while adhering to the collections policy, but sometimes the structure of the site itself makes a perfect crawl difficult or impossible.
Here are five suggestions for web developers that can help make their websites easier to crawl, more accessible, and easier to preserve:
- Follow accessibility standards
- Avoid proprietary formats for important content or provide alternate versions
- Maintain stable URLs and redirect when necessary. Avoid link rot, where links point to resources that are no longer available: carefully plan and implement a URL design scheme with a policy of persistence. The Archives has found websites with as many as 40% broken links.
- Design navigation carefully and include a sitemap. The crawler is usually set to capture only six levels deep, so to help others discover your entire website, provide a sitemap and a “view all” link for documents.
- Allow browsing of collections, not just searching, such as by arranging images by genre.
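The stable-URL suggestion above amounts to keeping a persistent mapping from every old path to its current home and answering requests to retired paths with a redirect rather than a 404. A minimal sketch of such a redirect table, with hypothetical paths invented for illustration:

```python
# Sketch of a URL-persistence policy: when a page moves, record its old
# path so requests to it can be redirected (HTTP 301) instead of breaking.
# All paths below are hypothetical examples, not real Smithsonian URLs.
REDIRECTS = {
    "/exhibits/2010/overview.html": "/exhibitions/2010/",
    "/staff/smith.htm": "/people/smith/",
}

def resolve(path):
    """Follow the redirect table until a stable target is reached.

    The `seen` set guards against accidental redirect loops.
    """
    seen = set()
    while path in REDIRECTS and path not in seen:
        seen.add(path)
        path = REDIRECTS[path]
    return path
```

A crawler (or a visitor with an old bookmark) that requests `/staff/smith.htm` would then land on `/people/smith/` instead of a dead link.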
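A sitemap such as the one suggested above is just an XML list of page URLs in the sitemaps.org format, which a crawler can use to find pages deeper than its level limit. A sketch using only the Python standard library, with a hypothetical example URL:

```python
# Build a minimal sitemap in the sitemaps.org XML format from a list of
# page URLs. The example URL in the test is hypothetical.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = u  # <loc> holds the page address
    return ET.tostring(urlset, encoding="unicode")
```

A real sitemap would enumerate every page on the site (optionally with last-modified dates) and be linked from robots.txt or the site root.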
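The browse-not-just-search suggestion means exposing collection items under static groupings (such as genre) that a crawler can follow as plain links, since a crawler cannot type queries into a search box. A sketch of the grouping step, with hypothetical records:

```python
# Group image records by genre so each genre can be rendered as a
# browsable listing page. The records in the test are hypothetical.
from collections import defaultdict

def group_by_genre(records):
    """Return {genre: [titles]} preserving each record's input order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["genre"]].append(rec["title"])
    return dict(groups)
```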