Monday, October 14, 2013

The WARC File Format (ISO 28500) - Information, Maintenance, Drafts

The WARC File Format (ISO 28500) - Information, Maintenance, Drafts.

I confirm - as convenor of the WARC format ISO working group - that there are no substantial modifications between the version on and the ISO standard, apart from a few minor editorial changes. So it may be used as a trustworthy reference.

As far as I know, ISO's resources come largely from selling its standards, so it is not possible to make them freely available, except in some cases. ISO/IEC standards are one of these exceptions, because they are developed by two organizations with different publication rules (ISO and IEC).
Even as convenor, I had no free copies of the standard.
I will check again with the ISO secretariat, but I doubt it will be legal to make the official version freely available.

This is one reason why it is common practice to publish draft standards, as we did on the BnF website.

Best regards,

Saturday, August 10, 2013

Game Walkthroughs As A Metaphor for Web Preservation

Game Walkthroughs As A Metaphor for Web Preservation. Michael Nelson. Web Science and Digital Libraries Research Group. May 25, 2013.
Some things can't really be preserved digitally, such as computer games, even though it would be possible to create emulators. So for some, the best way to experience the game is through walkthroughs on YouTube.
"I think game walkthroughs can provide us with an interesting metaphor for web archiving, not simply walkthroughs of web instead of game sessions (though that is possible), but in the sense of capturing a series of snapshots of dynamic services and archiving them.  Given "enough" snapshots, we might be able to reconstruct the output of a black box"

Google Maps is another site that has preservation issues. 

"There are a number of issues to be researched to make this easy enough for people to do (many of which our group is investigating), but the popularity of game walkthroughs and their preservation side-effects suggests to me that the web archiving community should be informed by them."

Tuesday, July 30, 2013

Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality

Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality. Gabriella Gray and Scott Martin. D-Lib Magazine. May/June 2013.

The tool we chose to investigate was the California Digital Library's Web Archiving Service (WAS).
Our existing model was becoming unsustainable and we needed to move to a new model if we were to continue capturing and archiving campaign websites. Our reluctance to move away from our existing labor-intensive manual process was rooted in the high quality capture results our method produced. Thus, finding an automated tool that could match, or come close to matching, the quality of our manual captures was the most important element we considered as we evaluated our options.

The Web Archiving Service (WAS), which is based on the Heritrix crawler, is essentially a "What You See Is What You Get" (WYSIWYG) tool. WAS includes various limited options which allow curators to adjust the settings used to capture a particular website, but they cannot edit or modify the final capture results. Ultimately the decision as to whether WAS was a viable alternative to our current method would rest on the quality of the captures (the WYG).

We analyzed the robots.txt files from a preliminary list of 181 websites and discovered the following results:
  • 27 (15%) would have been entirely blocked or resulted in unusable captures. Robots.txt blocked access to whole sites or to key directories required for site navigation.
  • 45 (25%) would experience at least minor capture problems such as loss of CSS files, images, or drop-down menus. Robots.txt blocked access to directories containing ancillary file types such as images, CSS, or JavaScript which provided much of the "look and feel" of the site.
  • 9 (5%) would have unknown effects on the capture. This case was applied to sites with particularly complicated robots.txt files and/or uncommon directory names where it was not clear what files were located in the blocked directories.
  • 100 sites (55%) would have no effect. The robots.txt file was not present, contained no actual blocks, or blocked only specific crawlers.
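
The four categories above can be approximated programmatically. The following is a minimal sketch, not the authors' method: it uses Python's standard `urllib.robotparser` and assumes a hypothetical crawler name (`examplebot`) and hypothetical probe paths for ancillary files. Real sites use their own directory names, which is why the article needed an "unknown effects" category that this sketch does not attempt to detect.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical probe paths (an assumption for illustration only).
ANCILLARY_PATHS = ["/css/site.css", "/images/logo.png", "/js/menu.js"]


def classify_robots(robots_txt: str, user_agent: str = "examplebot") -> str:
    """Roughly classify how a robots.txt file would affect a capture."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # The site root is blocked for this crawler: capture is unusable.
    if not parser.can_fetch(user_agent, "/"):
        return "blocked"

    # Stylesheets, images, or scripts are blocked: "look and feel" suffers.
    if any(not parser.can_fetch(user_agent, path) for path in ANCILLARY_PATHS):
        return "minor problems"

    # No file, no real blocks, or blocks aimed only at other crawlers.
    return "no effect"


if __name__ == "__main__":
    print(classify_robots("User-agent: *\nDisallow: /"))
    print(classify_robots("User-agent: *\nDisallow: /css/"))
    print(classify_robots("User-agent: otherbot\nDisallow: /"))
```

A check like this only flags candidates for review; deciding whether a blocked directory actually matters still requires looking at the site.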
The results of our comparison, that the core content gathered by WAS and our manual capture and editing method was overall equivalent, provided the impetus we needed to officially make the decision to transition to WAS for our web archiving needs. As capture tools evolve, more attention is being paid to enhancing their quality assurance tools.

Sunday, July 14, 2013

Supports Memento

Supports Memento. Web Science and Digital Libraries Research Group. July 9, 2013.
A new page-at-a-time personal web archiving utility. It archives a single page on request. Features include a simple search/upload interface, a bookmarklet to push pages into the archive while reading, thumbnails and full-sized images of captured pages, and it now supports Memento.


The age of data: Strategies for response

The age of data: Strategies for response. John W. Thompson. Computerworld. June 14, 2013.
The scale of data growth today is so massive it can be numbing. A recent study shows that "in the last minute there were 204 million emails sent, 61,000 hours of music listened to on Pandora, 20 million photo views and 3 million uploads to Flickr, 100,000 tweets, 6 million views and 277,000 Facebook logins, and 2 million plus Google searches." Data is continuing to grow at a phenomenal pace. The total of all digital data created and replicated will reach 4 zettabytes in 2013, almost 50 percent more than in 2012. The growth of data also gives organizations an opportunity to analyze the information being gathered and use it to their advantage. One thing that has helped is technology that reduces the amount of data by managing it and eliminating dozens of redundant copies.

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online. Computerworld. Lucas Mearian. July 12, 2013.
Over 800 oral essays from Edward R. Murrow's 1950s radio series, This I Believe, have been placed online for public use by Tufts University. The audio collection comes from almost 800 reel-to-reel tape recordings "that were nearly lost forever due to natural wear and tear from more than 50 years in less than ideal storage." The engineers captured the analogue recordings in a high-resolution WAV format at 96 kHz, 24-bit.

Friday, July 12, 2013

NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies

NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies. Micah Altman, et al. D-Lib Magazine. June 2013.

The structure and design of digital storage systems is a cornerstone of digital preservation.  To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This article reports on the findings of the survey. 

Key Findings

The key findings from the survey were:
  • 90% of respondents are distributing copies of at least part of their content geographically.
  • 88% of respondents are responsible for their content for an indefinite period of time.
  • 80% of respondents use some form of fixity checking for their content.
  • 75% of respondents report a strong preference to host and control their own technical infrastructure for preservation storage.
  • 69% of respondents are considering, or currently participating in, a distributed storage cooperative or system (e.g., LOCKSS Alliance, MetaArchive, Data-PASS).
  • 64% of respondents are planning to make significant technological changes in their preservation storage architecture in the next three years.
  • 51% of respondents are considering or already using a cloud storage provider to keep one or more copies of their content.
  • 48% of respondents are considering, or currently contracting out, storage services to be managed by another organization or company.

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy.

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy.  July 10, 2013.
A goal of the Web Science and Digital Libraries Research Group is to assist in making web preservation accessible to regular users instead of just power users.  A few digital preservation software packages that were created by WS-DLers include:
  • Warrick - a utility for reconstructing or recovering a website using various archives and caches
  • Synchronicity - a Firefox extension that supports rediscovering missing web pages
  • mcurl - a command-line memento client
  • WARCreate - a Google Chrome extension that can create WARC files from any webpage 
  • Web Archiving Integration Layer (WAIL) - a re-packaged Wayback and Heritrix that aims to be "One-Click User Instigated Preservation"

Friday, June 21, 2013

JHOVE 1.10b3

JHOVE 1.10b3. Gary McGath. File Formats Blog.

Saturday, June 15, 2013

EPUB for archival preservation: an update

EPUB for archival preservation: an update. Johan van der Knijff's blog on Open Planets.
In 2012 the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report's findings and conclusions have become outdated, particularly the observations on EPUB 3 and the support of EPUB by characterisation tools. This blog post provides an update to those findings:
  • Use of EPUB in scholarly publishing
  • Adoption and use of EPUB 3
  • EPUB 3 reader support
  • Support of EPUB by characterisation tools
The use of EPUB is increasing, and a number of publishers are using EPUB 2. A number of organisations representing the publishing industry also support EPUB 3, though the actual use of EPUB 3 is still limited. The 2012 report concluded that EPUB was not optimally supported by characterisation tools. This situation has improved quite a lot since that time: EPUB is now included in PRONOM and DROID. Overall, EPUB's credentials as a preservation format appear to have improved quite a bit over the last year.

Friday, June 14, 2013

EPUB for archival preservation

EPUB for archival preservation. Johan van der Knijff. KB/National Library of the Netherlands. 20 July 2012.
The EPUB format has become increasingly popular in the consumer market. A number of publishers have indicated their wish to use EPUB for supplying their electronic publications to the KB. This document looks at the characteristics and functionality of the format, and whether or not it is suitable for preservation.  Conceptually, an EPUB file is just an ordinary ZIP archive which includes one or more XHTML files, in one or more directories.  Cascading Style Sheets are used to define layout and formatting. A number of XML files provide metadata.
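
Because an EPUB is conceptually a ZIP archive, its contents can be inspected with ordinary ZIP tooling. The following is a minimal sketch using Python's standard `zipfile` module; it is not taken from the KB report. The helper names (`summarize_epub`, `build_sample_epub`) and the file names inside the sample archive are hypothetical, and the sample is a toy container for demonstration, not a valid EPUB publication.

```python
import io
import zipfile


def summarize_epub(data: bytes) -> dict:
    """Group the entries of an EPUB (a ZIP archive) by their role."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        mimetype = None
        if "mimetype" in names:
            # EPUB containers carry a "mimetype" entry naming the format.
            mimetype = archive.read("mimetype").decode("ascii").strip()
    return {
        "mimetype": mimetype,
        "xhtml": sorted(n for n in names if n.endswith((".xhtml", ".html"))),
        "css": sorted(n for n in names if n.endswith(".css")),
        "xml": sorted(n for n in names if n.endswith((".xml", ".opf", ".ncx"))),
    }


def build_sample_epub() -> bytes:
    """Build a tiny EPUB-like archive in memory (hypothetical file names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html/>")
        archive.writestr("OEBPS/style.css", "body {}")
    return buffer.getvalue()


if __name__ == "__main__":
    for role, value in summarize_epub(build_sample_epub()).items():
        print(role, value)
```

A listing like this is only a first look; validating an EPUB against the specification is a job for a dedicated characterisation or validation tool.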

EPUB has a number of strengths that make it attractive for preservation. It is an open format that is well documented, and there are no known patents or licensing restrictions. The format's specifications are freely available. It is largely based on well‐established and widely‐used standards, so it scores high marks for transparency and re‐usability. For situations where authenticity is crucial (e.g. legal documents), all or parts of a document can be digitally signed. Also, EPUB 2 is a popular format with excellent viewer support, including several open source implementations. There is concern that its role is limited because the current e‐book market is dominated by proprietary formats, and EPUB 3 is currently less stable. The report includes a chart of recommendations for using EPUB.

Strategy for archiving digital records at the Danish National Archives

Strategy for archiving digital records at the Danish National Archives. Statens Arkiver. January 2013.
Their aim is to ensure the preservation of records that are of historical value, or that serve as documentation of significant administrative matters or legal importance for citizens and authorities. The vision is to ensure that digital records are preserved so as to maintain their authenticity, and so that they can be found and reused. Preserving digital information for the long term, in a form that makes it reusable, requires some deliberate choices to be made in terms of methods, technologies and documentation. Digital preservation must also take economic considerations into account.

The basic strategy choice faced by preservation institutions is whether to pursue an emulation strategy or a migration strategy. This will determine how digital preservation in the institution is organised. The Danish National Archives have chosen a migration strategy, which requires the Archives to migrate digital records to a few well-defined standard formats and, from time to time, to migrate them again to new formats and structures.

The Danish National Archives’ strategy must not be dependent on continuous access to the system in which the data was originally created. It must be possible to interpret and re-use data in other systems. The term “original” cannot be applied in the same way to digital records. Whether data is extracted from tables in a database or digital documents, a representation of the content is preserved in the preservation format. A digital archive primarily preserves data or information. The key aspect is the preservation of authentic information. The implementation of the strategy requires:
  1. Early identification and approval of systems for submission purposes
  2. Frequent submission in non-system dependent format
  3. Ongoing planning of preservation and periodical migration to a new preservation format
The Archives uses distributed digital preservation by keeping several identical copies on several different types of media, both optical and magnetic, at several different geographical locations. The Archives also conducts ongoing preservation planning and continuously adjusts the implementation of its strategy so that the vision remains attainable and within its reach.
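
Keeping several identical copies is only useful if divergence between them can be detected. The strategy summary does not say how the Archives verifies its copies, so the following is a generic sketch under stated assumptions: SHA-256 checksums computed with Python's standard `hashlib`, with the most common digest treated as the reference. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest without loading the whole file in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose digest differs from the most common one.

    Assumption: the majority digest is the correct one. A real archive
    would normally compare against a checksum recorded at ingest instead.
    """
    digests = {path: sha256_of(path) for path in copies}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return [path for path, value in digests.items() if value != majority_digest]
```

With three copies of a record on different media, `find_divergent_copies([copy_a, copy_b, copy_c])` returns an empty list when all three match and the odd one out when a single copy has been damaged.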

Wednesday, June 12, 2013


Web-Archiving

Web-Archiving. Maureen Pennock. DPC Technology Watch Report 13-01. March 2013. Publicly released
This report is intended for those wanting to develop a better understanding of the issues and options for archiving web content, and for those intending to set up a web archive. Web archiving technology allows valuable web content to be preserved and managed for future generations.

Web content is lost at an alarming rate, and our digital cultural memory and organizational accountability are at risk. Organizational needs and resources must be considered when choosing web archiving tools and services. Issues with web archiving include selection of content, authenticity and integrity, quality assurance, duplication of content, legal rights, viruses, and the long term preservation of resources. Web archiving is not a single action but often a suite of applications used in various ways at different stages of the archiving process. Archiving tools may include commercial services, Web Curator Tool, Netarchive Suite, the Heritrix web crawler, Wget, and the Wayback access interface. Archiving a simple website may be straightforward, but archiving large numbers of websites for the long term becomes much more complicated and requires a complex solution. The International Internet Preservation Consortium has played key roles in developing standards, such as the WARC standard, and archiving tools.

There are three main technical approaches:
1. Client-side archiving, using web crawlers such as Heritrix or HTTrack
2. Transactional archiving, which addresses the capture of client-side transactions
3. Server-side archiving, which requires active participation from publishing organizations
Another option being explored is the use of RSS feeds to identify and pull content into a web archive.

In spite of all of the efforts to capture and manage web content, web archives still face significant challenges, such as quality assurance issues, the need for more capable tools, and the need for better legislation. "The technical challenges of web archiving cannot, and should not, be addressed in isolation."

Tuesday, June 04, 2013

Cerf sees a problem: Today's digital data could be gone tomorrow.

Cerf sees a problem: Today's digital data could be gone tomorrow. Patrick Thibodeau. Computerworld. June 4, 2013.
Vinton Cerf is concerned that much of the data that has been created in the past few decades and for years still to come, will be lost to time. Digital materials from today, such as spreadsheets, documents, presentations as well as mountains of scientific data, won't be readable in the years and centuries ahead. Software backward compatibility is very hard to preserve over very long periods of time, and the data objects are only meaningful if the software programs are available to interpret them. "The scientific community collects large amounts of data from simulations and instrument readings. But unless the metadata survives, which will tell under what conditions the data was collected, how the instruments were calibrated, and the correct interpretation of units, the information may be lost. If you don't preserve all the extra metadata, you won't know what the data means. So years from now, when you have a new theory, you won't be able to go back and look at the older data."

What is needed is a "digital vellum," a digital medium that is as durable and long-lasting as the material that has successfully preserved written content for more than 1,000 years. If a company goes out of business and there is no provision for its software to become accessible to others, all the products running that software may become inaccessible. The cloud computing environment may help; it may be able to emulate older hardware on which we can run operating systems and applications. We need to preserve the bits, but also a way of interpreting them.

The CODATA Mission: Preserving Scientific Data for the Future

The CODATA Mission: Preserving Scientific Data for the Future. Jeanne Kramer-Smyth. Spellbound Blog. February, 2013.
This is a post (and a link to the slides) about a session that was part of The Memory of the World in the Digital Age: Digitization and Preservation conference. The aim was to describe the initiatives of the Data at Risk Task Group (DARTG), which is part of the International Council for Science Committee on Data for Science and Technology (CODATA).

The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf life. The task group is seeking out sources of such data worldwide, since many are irreplaceable for research into the long-term trends that occur in the natural world. One speaker talked about two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. Only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It is the analog data which are considered to be “at risk” and which are the task group’s immediate concern. Some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot be located at all easily.

How can such “data at risk” be recovered and made useable? An inventory website has been set up where one can report data-at-risk. The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records. Some of the data mentioned: oceanographic, climate, satellite, and other scientific data sets, as well as born-digital maps. With digital preservation initiatives there is a lot of rhetoric, but not so much action. There have been many consultations, studies, reports and initiatives, but not very much has translated into action.

Monday, May 27, 2013

National Library of Australia’s Digital Preservation Policy

Digital Preservation Policy 4th Edition (2013). National Library of Australia.  May 26, 2013.
This site outlines the National Library of Australia’s policy on preserving its digital collections, and collaborating with others to preserve digital information resources. The primary objective of their digital preservation activities is maintaining the ability to meaningfully access digital collection content over time. The primary concern is preserving the ability to access the Preservation Master File from which derivative files may be created or re-created over time. To this end, preservation of digital library material includes:
  •     Bit-level preservation of all digital objects, i.e. keeping the original files intact;
  •     Ensuring that authenticity and provenance are maintained;
  •     Ensuring that appropriate preservation information is maintained;
  •     Understanding and reporting on risks which affect ongoing access;
  •     Performing appropriate actions to ensure that objects remain accessible;
  •     Periodic review of preferred formats and digital metadata standards
Preservation of the Library's digital collections involves four main goals:
  1.     Maintaining access to reliable data at bit-stream level;
  2.     Maintaining access to content encoded in the bit streams;
  3.     Maintaining access to the intended content; and
  4.     Maintaining the stated preservation intent for all digital material over time.
While specific preservation activities may focus on one or more of these goals, the Library’s preservation responsibility is only fulfilled when all four goals have been adequately addressed.

The Library uses the concepts in the Open Archival Information Systems (OAIS) Reference Model and other international standards and best practices, such as PREMIS and Open Planets Foundation.

Sunday, May 19, 2013

Digital Preservation Tool Grid

Digital Preservation Tool Grid. Preserving Objects With Restricted Resources. May 15, 2013.
This is a grid, created by POWRR, that looks at 24 different features, such as ingest, processing, access, storage, maintenance, and cost, for about 50 digital preservation tools. The tools range from simple tools to full digital preservation systems, from ACE to Xena. This tool is very informative.