Thursday, June 11, 2015

What We've Saved (2004-2014)

What We've Saved (2004-2014). Andy Jackson. UK Web Archive. June 11, 2015. [PPT slides]
After 10 years of the UK Web Archive, what has been saved? Three collections, over 8 billion resources, and 160 TB of compressed data. "Looking inward is not enough: To understand the value of our collection, we need to look beyond our walls and put it in context." A review shows how much has been lost from the web. Almost 100% of the crawled URLs in the UK Web Archive are gone or missing on the live web, and even among URLs crawled in 2013, about 40% are gone or missing. Link rot & content drift dominate:
  • 50% of resources unrecognisable or gone after 1 year
  • 60% after 2 years, 65% after 3 years (islands of stability)
  • Noticeably higher rot rate than results for legal/academic web

Simple similarity measures provide some insight, but more work is needed to look for old content in new locations.
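
To make the idea of a "simple similarity measure" concrete, here is a minimal sketch of how an archived copy might be compared with what the live web returns today. It is an illustration only, not the method used in the talk: the function names `classify` and `rot_rate`, the 0.5 threshold, and the choice of Python's `difflib.SequenceMatcher` are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical labels and threshold, chosen for illustration only.
def classify(archived_text, live_text, threshold=0.5):
    """Label a URL by comparing its archived text with the live text."""
    if live_text is None:
        # The live fetch failed entirely: this is link rot.
        return "gone"
    ratio = SequenceMatcher(None, archived_text, live_text).ratio()
    if ratio == 1.0:
        return "identical"
    if ratio >= threshold:
        # Some content drift, but still recognisably the same page.
        return "similar"
    return "unrecognisable"

def rot_rate(labels):
    """Fraction of URLs that are gone or unrecognisable."""
    if not labels:
        return 0.0
    lost = sum(1 for label in labels if label in ("gone", "unrecognisable"))
    return lost / len(labels)

# Made-up sample of (archived, live) text pairs.
sample = [
    ("hello world", "hello world"),
    ("hello world", "hello world!"),
    ("aaaa", "zzzz"),
    ("old page", None),
]
labels = [classify(archived, live) for archived, live in sample]
print(labels, rot_rate(labels))
```

A comparison like this only checks the same URL over time. It would not find old content that has moved to a new location, which is the gap the talk points to.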
