The tool we chose to investigate was the California Digital Library's Web Archiving Service (WAS).
Our existing model was becoming unsustainable, and we needed to move to a new one if we were to continue capturing and archiving campaign websites. Our reluctance to move away from our labor-intensive manual process was rooted in the high-quality capture results that method produced. Thus, finding an automated tool that could match, or come close to matching, the quality of our manual captures was the most important criterion as we evaluated our options.
The Web Archiving Service, which is based on the Heritrix crawler, is essentially a "What You See Is What You Get" (WYSIWYG) tool. WAS offers a limited set of options that allow curators to adjust the settings used to capture a particular website, but curators cannot edit or modify the final capture results. Ultimately, the decision as to whether WAS was a viable alternative to our current method would rest on the quality of the captures (the WYG).
We analyzed the robots.txt files from a preliminary list of 181 websites and found the following (a sketch of how such a check could be automated appears after the list):
- 27 sites (15%) would have been entirely blocked or would have resulted in unusable captures. Robots.txt blocked access to whole sites or to key directories required for site navigation.
- 45 sites (25%) would have experienced at least minor capture problems, such as loss of CSS files, images, or drop-down menus. Robots.txt blocked access to directories containing ancillary file types such as images, CSS, or JavaScript that provided much of the "look and feel" of the site.
- For 9 sites (5%), the effect on the capture was unknown. This category applied to sites with particularly complicated robots.txt files and/or uncommon directory names, where it was not clear what files were located in the blocked directories.
- 100 sites (55%) would have experienced no effect. The robots.txt file was absent, contained no actual blocks, or blocked only specific crawlers.
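As a rough illustration of this triage, the sketch below uses Python's standard urllib.robotparser module to sort a site into the first, second, or fourth category above. The directory names it checks (for example /css/, /images/, /js/) and the example URL are assumptions chosen for illustration, not the exact paths examined in our review, and the "unknown effect" category still requires human judgment.

```python
# A minimal sketch of automating the robots.txt triage described above,
# using only Python's standard library. The path lists are illustrative
# assumptions, not the exact directories examined in our review.
import urllib.robotparser

NAV_PATHS = ["/"]                            # blocks here suggest an unusable capture
ASSET_PATHS = ["/css/", "/images/", "/js/"]  # blocks here suggest "look and feel" loss

def classify_site(robots_url, user_agent="*"):
    """Return a rough impact category for one site's robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()                            # a missing robots.txt is treated as allow-all
    except OSError:
        return "no effect (robots.txt could not be retrieved)"

    if not all(rp.can_fetch(user_agent, path) for path in NAV_PATHS):
        return "blocked or unusable capture"
    if not all(rp.can_fetch(user_agent, path) for path in ASSET_PATHS):
        return "minor capture problems (CSS, images, or scripts blocked)"
    return "no effect"

if __name__ == "__main__":
    # Hypothetical example; substitute a candidate site's robots.txt URL.
    print(classify_site("https://www.example.org/robots.txt"))
```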