Saturday, November 21, 2015

How Much Of The Internet Does The Wayback Machine Really Archive?

How Much Of The Internet Does The Wayback Machine Really Archive? Kalev Leetaru. Forbes.  November 16, 2015.
     "The Internet Archive turns 20 years old next year, having archived nearly two decades and 23 petabytes of the evolution of the World Wide Web. Yet, surprisingly little is known about what exactly is in the Archive’s vaunted Wayback Machine." The article looks at how the Internet Archive archives sites and suggests "that far greater understanding of the Internet Archive’s Wayback Machine is required before it can be used for robust reliable scholarly research on the evolution of the web." It requires a more "systematic assessment of the collection’s holdings." Archiving the open web uses enormous technical resources.

Perhaps the most important lesson is that we have little understanding of what is actually in the data we use, and few researchers really explore questions about that data. The archival landscape of the Wayback Machine is far more complex than originally realized, and it is unclear how the Wayback Machine has been constructed. This insight is critical. "When archiving an infinite web with finite resources, countless decisions must be made as to which narrow slices of the web to preserve." The selection can be either random or prioritized by some element. Each approach has distinct benefits and risks.
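
As a rough illustration of the two selection approaches, here is a minimal Python sketch. It is not how the Wayback Machine actually chooses pages, which the article says is unclear. The URLs, the inbound-link scores, and the function names select_random and select_prioritized are all hypothetical, made up for this example.

```python
import random

def select_random(urls, budget, seed=0):
    """Pick up to `budget` URLs uniformly at random; no page is favored."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(urls, min(budget, len(urls)))

def select_prioritized(urls, budget, score):
    """Pick the `budget` URLs with the highest priority score."""
    return sorted(urls, key=score, reverse=True)[:budget]

# Hypothetical crawl frontier: URL -> inbound link count (invented numbers).
frontier = {
    "http://example.com/home": 120,
    "http://example.com/news": 45,
    "http://example.org/blog": 3,
    "http://example.net/obscure": 1,
}

urls = list(frontier)
budget = 2  # finite resources: only two pages can be archived

random_pick = select_random(urls, budget)
priority_pick = select_prioritized(urls, budget, score=frontier.get)
```

With these invented scores, the prioritized pick always keeps the two most-linked pages and never reaches the obscure one, while the random pick gives every page the same chance but may miss the popular ones. That is the trade-off between the two approaches in miniature.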

Libraries have formalized over time how they make collection decisions. Web archives must adopt similar processes. The web is "disappearing before our very eyes," which can be seen in the fact that "up to 14% of all online news monitored by the GDELT Project is no longer accessible after two months." We must "do a better job of archiving the online world and do it before this material is lost forever."
