More researchers are taking an interest in web archives. The post author says their archive has "tried to our best to capture as much of our own crawl context as we can." In addition to the WARC request and response records, they store other information that can answer how and why a particular resource has been archived:
- links that the crawler found when it analysed each resource
- the full crawl log, which records DNS results and other situations
- the crawler configuration, including seed lists, scope rules, exclusions etc.
- the versions of the software they used
- rendered versions of original seeds and home pages and associated metadata.
No archive is perfect. Archives "can only come to be understood through use, and we must open up to and engage with researchers in order to discover what provenance we need and how our crawls and curation can be improved." There are problems that need to be documented, but researchers "can’t expect the archives to already know what they need to know, or to know exactly how these factors will influence your research questions."