Tuesday, July 30, 2013

Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality

Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality. Gabriella Gray and Scott Martin. D-Lib Magazine. May/June 2013.

The tool we chose to investigate was the California Digital Library's Web Archiving Service (WAS).
Our existing model was becoming unsustainable and we needed to move to a new model if we were to continue capturing and archiving campaign websites. Our reluctance to move away from our existing labor-intensive manual process was rooted in the high quality capture results our method produced. Thus, finding an automated tool that could match, or come close to matching, the quality of our manual captures was the most important element we considered as we evaluated our options.

The Web Archiving Service (WAS), which is based on the Heritrix crawler, is essentially a "What You See Is What You Get" (WYSIWYG) tool. WAS includes various limited options which allow curators to adjust the settings used to capture a particular website, but they cannot edit or modify the final capture results. Ultimately the decision as to whether WAS was a viable alternative to our current method would rest on the quality of the captures (the WYG).

We analyzed the robots.txt files from a preliminary list of 181 websites and discovered the following results:
  • 27 (15%) would have been entirely blocked or resulted in unusable captures. Robots.txt blocked access to whole sites or to key directories required for site navigation.
  • 45 (25%) would experience at least minor capture problems such as loss of CSS files, images, or drop-down menus. Robots.txt blocked access to directories containing ancillary file types such as images, CSS, or JavaScript which provided much of the "look and feel" of the site.
  • 9 (5%) would have unknown effects on the capture. This case was applied to sites with particularly complicated robots.txt files and/or uncommon directory names where it was not clear what files were located in the blocked directories.
  • 100 sites (55%) would have no effect. The robots.txt file was not present, contained no actual blocks, or blocked only specific crawlers.
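
As a rough illustration of this kind of robots.txt triage, here is a minimal Python sketch built on the standard library's urllib.robotparser. It is not the authors' procedure: the function name, the probe paths, and the three mechanical labels are assumptions made for the example, and the article's "unknown effects" category would still need a person to inspect the site.

```python
from urllib.robotparser import RobotFileParser


def classify_robots(robots_txt, probe_paths, user_agent="*"):
    """Roughly classify how a robots.txt file would affect a crawl.

    robots_txt  -- the text of the robots.txt file ("" if the site has none)
    probe_paths -- hypothetical sample paths to test, e.g. "/" and "/css/site.css"
    user_agent  -- the crawler name to test against (an assumption here)
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Collect every probe path the crawler would not be allowed to fetch.
    blocked = [p for p in probe_paths if not parser.can_fetch(user_agent, p)]

    if "/" in blocked:
        return "blocked"    # the site root itself is off limits
    if blocked:
        return "partial"    # some ancillary paths (CSS, images, ...) are off limits
    return "no effect"      # nothing we probed is blocked for this crawler


probes = ["/", "/css/site.css", "/images/logo.png"]
print(classify_robots("User-agent: *\nDisallow: /", probes))
print(classify_robots("User-agent: *\nDisallow: /css/", probes))
print(classify_robots("", probes))
```

The result depends entirely on which probe paths are supplied, so a real survey would take those paths from the links and assets each site actually uses.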
The results of our comparison, that the core content gathered by WAS and by our manual capture and editing method was overall equivalent, provided the impetus we needed to officially make the decision to transition to WAS for our web archiving needs. As capture tools evolve, more attention is being paid to enhancing their quality assurance tools.

Sunday, July 14, 2013

Archive.is Supports Memento

Archive.is Supports Memento. Web Science and Digital Libraries Research Group. July 9, 2013.
Archive.is is a new page-at-a-time personal web archiving utility. It archives a single page on request. Features include a simple search/upload interface, a bookmarklet to push pages into the archive while reading, thumbnails and full-sized images of captured pages, and it now supports Memento.
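
For readers unfamiliar with Memento, the protocol lets a client ask a "TimeGate" for the archived copy of a page closest to a given date by sending an Accept-Datetime request header. The Python sketch below only builds such a request and does not send it. The TimeGate address is a placeholder, not a real Archive.is endpoint, and the function names are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_value(when):
    """Format a timezone-aware datetime as an HTTP date for Accept-Datetime."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def build_timegate_request(timegate_base, target_url, when):
    """Build (but do not send) a Memento TimeGate request.

    timegate_base -- hypothetical TimeGate prefix; real archives publish their own
    target_url    -- the original page whose archived copy is wanted
    when          -- timezone-aware datetime the archived copy should be closest to
    """
    headers = {"Accept-Datetime": accept_datetime_value(when)}
    return Request(timegate_base + target_url, headers=headers)


wanted = datetime(2013, 7, 9, tzinfo=timezone.utc)
request = build_timegate_request(
    "http://timegate.example.org/",  # placeholder address, an assumption
    "http://example.com/",
    wanted,
)
print(request.full_url)
print(accept_datetime_value(wanted))
```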


The age of data: Strategies for response

The age of data: Strategies for response. John W. Thompson. Computerworld. June 14, 2013.
The scale of data growth today is so massive it can be numbing. A recent study shows that "in the last minute there were 204 million emails sent, 61,000 hours of music listened to on Pandora, 20 million photo views and 3 million uploads to Flickr, 100,000 tweets, 6 million views and 277,000 Facebook logins, and 2 million plus Google searches." Data is continuing to grow at a phenomenal pace. The total of all digital data created and replicated will reach 4 zettabytes in 2013, almost 50 percent more than in 2012. The growth of data also gives organizations an opportunity to analyze the information being gathered and use it to their advantage. One thing that has helped is technology that reduces the amount of data by managing it and eliminating dozens of redundant copies.

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online. Lucas Mearian. Computerworld. July 12, 2013.
Over 800 oral essays from Edward R. Murrow's 1950s radio series, This I Believe, have been placed online for public use by Tufts University. The audio collection comes from almost 800 reel-to-reel tape recordings "that were nearly lost forever due to natural wear and tear from more than 50 years in less than ideal storage." The engineers captured the analogue recordings using a 96K, 24-bit high resolution WAV format.

Friday, July 12, 2013

NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies

NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies. Micah Altman, et al. D-Lib Magazine. June 2013.

The structure and design of digital storage systems is a cornerstone of digital preservation.  To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This article reports on the findings of the survey. 

Key Findings

The key findings from the survey were:
  • 90% of respondents are distributing copies of at least part of their content geographically.
  • 88% of respondents are responsible for their content for an indefinite period of time.
  • 80% of respondents use some form of fixity checking for their content.
  • 75% of respondents report a strong preference to host and control their own technical infrastructure for preservation storage.
  • 69% of respondents are considering, or currently participating in, a distributed storage cooperative or system (ex. LOCKSS alliance, MetaArchive, Data-PASS).
  • 64% of respondents are planning to make significant technological changes in their preservation storage architecture in the next three years.
  • 51% of respondents are considering or already using a cloud storage provider to keep one or more copies of their content.
  • 48% of respondents are considering, or currently contracting out, storage services to be managed by another organization or company.

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy.

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy. July 10, 2013.
A goal of the Web Science and Digital Libraries Research Group is to assist in making web preservation accessible to regular users instead of just power users.  A few digital preservation software packages that were created by WS-DLers include:
  • Warrick - a utility for reconstructing/recovering a website using various archives and caches
  • Synchronicity - a Firefox extension that supports rediscovering missing web pages
  • mcurl - a command-line memento client
  • WARCreate - a Google Chrome extension that can create WARC files from any webpage 
  • Web Archiving Integration Layer (WAIL) - a re-packaged Wayback and Heritrix that aims to be "One-Click User Instigated Preservation"