Digital Preservation Matters: How many of the EOT2008 PDF files were harvested in EOT2012

Monday, March 21, 2016

How many of the EOT2008 PDF files were harvested in EOT2012

How many of the EOT2008 PDF files were harvested in EOT2012.  Mark Phillips. mark e. phillips journal. February 23, 2016.
Post about the author looking at some of the data from the End of Term 2012 Web Archive snapshot at the UNT Libraries. From the EOT2008 Web archive, 4,489,675 unique (by hash) PDF files were extracted and recently compared to see how many of those nearly 4.5 million PDFs were still around in 2012, when the federal Web was crawled again as part of the EOT2012 project. The findings:

After the numbers finished running, it looks like the following.

              PDFs    Percentage
Found      774,375           17%
Missing  3,715,300           83%
Total    4,489,675          100%

So 83% of the PDF files that were present in 2008 are not present in the EOT2012 archive. It is possible that some of these items were still available in 2012, when the federal Web was harvested again, but at an entirely different URL. The original URL might no longer be available while the content could still be available at another location.
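The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration of a hash-set comparison, not the author's actual process: the function names `compare_hash_sets` and `percentage` are hypothetical, and it assumes each crawl can be reduced to a collection of content hashes, one per unique PDF.

```python
def compare_hash_sets(hashes_2008, hashes_2012):
    """Count how many hashes from the earlier crawl reappear in the later one.

    Both arguments are assumed to be iterables of content-hash strings
    (hypothetical inputs; the original post does not describe its data format).
    """
    earlier = set(hashes_2008)
    later = set(hashes_2012)
    found = len(earlier & later)   # hashes present in both crawls
    total = len(earlier)           # unique hashes in the earlier crawl
    return {"found": found, "missing": total - found, "total": total}


def percentage(part, total):
    """Return part/total as a whole-number percentage (0 if total is 0)."""
    return round(100 * part / total) if total else 0


if __name__ == "__main__":
    # Figures reported in the post, used here only to check the percentages.
    found, total = 774_375, 4_489_675
    missing = total - found
    print(f"Found:   {found:>9,}  {percentage(found, total)}%")
    print(f"Missing: {missing:>9,}  {percentage(missing, total)}%")
```

Because a comparison like this only looks at hashes, it says nothing about URLs; checking whether a 2008 URL still resolved in 2012 would need a separate lookup.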
