Monday, November 23, 2015

The Provenance of Web Archives

The Provenance of Web Archives. Andy Jackson; Jason Webber. UK Web Archive blog. 20 November 2015.
     More researchers are taking an interest in web archives.  The post author says their archive has "tried to our best to capture as much of our own crawl context as we can." In addition to the WARC request and response records, they store other information that can answer how and why a particular resource has been archived:
  • links that the crawler found when it analysed each resource 
  • the full crawl log, which records DNS results and other situations
  • the crawler configuration, including seed lists, scope rules, exclusions etc.
  • the versions of the software we used  
  • rendered versions of original seeds and home pages and associated metadata.
The archive doesn't "document every aspect of our curatorial decisions, e.g. precisely why we choose to pursue permissions to crawl specific sites that are not in the UK domain. Capturing every mistake, decision or rationale simply isn’t possible, and realistically we’re only going to record information when the process of doing so can be largely or completely automated". In the future, there "will be practical ways of summarizing provenance information in order to describe the systematic biases within web archive collections, but it’s going to take a while to work out how to do this, particularly if we want this to be something we can compare across different web archives."

No archive is perfect. They "can only come to be understood through use, and we must open up to and engage with researchers in order to discover what provenance we need and how our crawls and curation can be improved." There are problems that need to be documented, but researchers "can’t expect the archives to already know what they need to know, or to know exactly how these factors will influence your research questions."
