Sunday, July 14, 2013

Supports Memento

Supports Memento. Web Science and Digital Libraries Research Group. July 9, 2013.
A new page-at-a-time personal web archiving utility that archives a single page on request. Features include a simple search/upload interface, a bookmarklet to push pages into the archive while reading, and thumbnails and full-sized images of captured pages. It now supports Memento.
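
As a rough illustration of what Memento support means for a client: under the Memento protocol (later published as RFC 7089), a client asks a "TimeGate" for a page as it existed around a given time by sending an Accept-Datetime header. The sketch below only builds such a request with the Python standard library and does not send it. The TimeGate address and the helper name `memento_request` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(timegate_url: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request for a TimeGate."""
    # Memento expresses the desired time as an HTTP date in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_url, headers={"Accept-Datetime": stamp})


# Hypothetical TimeGate address; a real archive publishes its own.
req = memento_request(
    "http://archive.example/timegate/http://example.com/",
    datetime(2013, 7, 9, 12, 0, tzinfo=timezone.utc),
)

# urllib normalises the stored header name to "Accept-datetime";
# HTTP header names are case-insensitive on the wire.
print(req.get_header("Accept-datetime"))  # Tue, 09 Jul 2013 12:00:00 GMT
```

A Memento-aware archive would typically answer such a request by pointing to the capture closest to the requested time.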


The age of data: Strategies for response

The age of data: Strategies for response. John W. Thompson. Computerworld. June 14, 2013.
The scale of data growth today is so massive it can be numbing. A recent study shows that "in the last minute there were 204 million emails sent, 61,000 hours of music listened to on Pandora, 20 million photo views and 3 million uploads to Flickr, 100,000 tweets, 6 million views and 277,000 Facebook logins, and 2 million plus Google searches." Data continues to grow at a phenomenal pace: the total of all digital data created and replicated will reach 4 zettabytes in 2013, almost 50 percent more than in 2012. This growth also gives organizations an opportunity to analyze the information being gathered and use it to their advantage. One thing that has helped is technology that reduces the amount of data by managing it and eliminating dozens of redundant copies.

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online

Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online. Computerworld. Lucas Mearian. July 12, 2013.
Over 800 oral essays from Edward R. Murrow's 1950s radio series, This I Believe, have been placed online for public use by Tufts University. The audio collection comes from almost 800 reel-to-reel tape recordings "that were nearly lost forever due to natural wear and tear from more than 50 years in less than ideal storage." The engineers captured the analogue recordings in a 96 kHz, 24-bit high-resolution WAV format.
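
To give a feel for what that capture format costs in storage, here is a small arithmetic sketch. It reads "96K" as a 96 kHz sampling rate and assumes uncompressed PCM; the channel count (mono) and the five-minute essay length are assumptions for illustration, not figures from the article.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


# 96 kHz, 24-bit; mono is an assumption, the article does not say.
rate = pcm_bytes_per_second(96_000, 24, 1)

minutes = 5  # hypothetical essay length
size_mb = rate * 60 * minutes / 1_000_000

print(rate)     # 288000 bytes per second
print(size_mb)  # 86.4 (megabytes, ignoring the small WAV header)
```

Doubling the channel count to stereo would double both figures.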

Friday, July 12, 2013

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy.

WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy.  July 10, 2013.
A goal of the Web Science and Digital Libraries Research Group is to help make web preservation accessible to regular users, not just power users. Digital preservation software packages created by WS-DLers include:
  • Warrick - a utility for reconstructing or recovering a website using various archives and caches
  • Synchronicity - a Firefox extension that supports rediscovering missing web pages
  • mcurl - a command-line Memento client
  • WARCreate - a Google Chrome extension that can create WARC files from any webpage
  • Web Archiving Integration Layer (WAIL) - a re-packaged Wayback and Heritrix that aims to be "One-Click User Instigated Preservation"

Friday, June 21, 2013

JHOVE 1.10b3

JHOVE 1.10b3. Gary McGath. File Formats Blog.

Saturday, June 15, 2013

EPUB for archival preservation: an update

EPUB for archival preservation: an update. Johan van der Knijff's blog on Open Planets.
In 2012 the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report's findings and conclusions have become outdated, particularly the observations on EPUB 3 and the support of EPUB by characterisation tools. This blog post provides an update to those findings:
  • Use of EPUB in scholarly publishing
  • Adoption and use of EPUB 3
  • EPUB 3 reader support
  • Support of EPUB by characterisation tools
The use of EPUB is increasing, and a number of publishers are using EPUB 2. A number of organisations representing the publishing industry also support EPUB 3, though actual use of EPUB 3 is still limited. The 2012 report concluded that EPUB was not optimally supported by characterisation tools. This situation has improved considerably since then: EPUB is now included in PRONOM and DROID. Overall, EPUB's credentials as a preservation format appear to have improved quite a bit over the last year.

Friday, June 14, 2013

EPUB for archival preservation

EPUB for archival preservation. Johan van der Knijff. KB/National Library of the Netherlands. 20 July 2012.
The EPUB format has become increasingly popular in the consumer market. A number of publishers have indicated their wish to use EPUB for supplying their electronic publications to the KB. This document looks at the characteristics and functionality of the format, and whether or not it is suitable for preservation.  Conceptually, an EPUB file is just an ordinary ZIP archive which includes one or more XHTML files, in one or more directories.  Cascading Style Sheets are used to define layout and formatting. A number of XML files provide metadata.
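
The "ordinary ZIP archive" description can be made concrete with a toy container built in memory. This is a sketch, not a valid EPUB: the file contents are placeholders and the package document is empty. The member names under OEBPS/ are illustrative choices; the uncompressed `mimetype` entry placed first is what the EPUB container format expects.

```python
import io
import zipfile


def build_toy_epub() -> bytes:
    """Assemble an EPUB-shaped ZIP archive in memory (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # XML files carrying metadata (placeholder bodies only).
        zf.writestr("META-INF/container.xml", "<!-- placeholder -->",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/content.opf", "<!-- placeholder -->",
                    compress_type=zipfile.ZIP_DEFLATED)
        # One XHTML content file and one stylesheet.
        zf.writestr("OEBPS/chapter1.xhtml",
                    "<html xmlns='http://www.w3.org/1999/xhtml'>"
                    "<body><p>Hello</p></body></html>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/style.css", "p { margin: 0; }",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_members(epub_bytes: bytes) -> list:
    """Return the member names of the archive, in stored order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()


for name in list_members(build_toy_epub()):
    print(name)
```

Because the container is plain ZIP, any general-purpose ZIP tool can list or extract its members, which is part of what makes the format transparent.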

EPUB has a number of strengths that make it attractive for preservation. It is an open format that is well documented, and there are no known patents or licensing restrictions. The format's specifications are freely available. It is largely based on well-established and widely-used standards, so it scores high marks for transparency and re-usability. For situations where authenticity is crucial (e.g. legal documents), all or parts of a document can be digitally signed. EPUB 2 is also a popular format with excellent viewer support, including several open source implementations. However, there is concern that its role is limited because the current e-book market is dominated by proprietary formats, and EPUB 3 is currently less stable. The report includes a chart of recommendations for using EPUB.

Strategy for archiving digital records at the Danish National Archives

Strategy for archiving digital records at the Danish National Archives. Statens Arkiver. January 2013.
The aim of the Danish National Archives is to ensure the preservation of records that are of historical value, or that serve as documentation of matters of significant administrative or legal importance for citizens and authorities. The vision is to ensure that digital records are preserved so as to maintain their authenticity, and so that they can be found and reused. Preserving digital information for the long term, in a form that makes it reusable, requires deliberate choices of methods, technologies and documentation. Digital preservation must also take economic considerations into account.

The basic strategic choice faced by preservation institutions is whether to pursue an emulation strategy or a migration strategy, and this choice determines how digital preservation in the institution is organised. The Danish National Archives have chosen a migration strategy, which requires the Archives to migrate digital records to a few well-defined standard formats and, from time to time, to migrate them again to new formats and structures.

The Danish National Archives’ strategy must not be dependent on continuous access to the system in which the data was originally created. It must be possible to interpret and re-use data in other systems. The term “original” cannot be applied in the same way to digital records. Whether data is extracted from tables in a database or from digital documents, a representation of the content is preserved in the preservation format. A digital archive primarily preserves data or information. The key aspect is the preservation of authentic information. The implementation of the strategy requires:
  1. Early identification and approval of systems for submission purposes
  2. Frequent submission in a system-independent format
  3. Ongoing planning of preservation and periodical migration to a new preservation format
The Archives uses distributed digital preservation, keeping several identical copies on several different types of media, both optical and magnetic, at several different geographical locations. The Archives also conducts ongoing preservation planning and continuously adjusts the implementation of its strategy so that the vision remains attainable.
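
Keeping several identical copies only helps if the copies are periodically compared. A minimal sketch of such a comparison follows, assuming SHA-256 checksums and a simple majority vote; the strategy document does not describe the Archives' actual procedure, and the location names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Checksum used to compare copies without comparing them byte by byte."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: Dict[str, bytes]) -> List[str]:
    """Return the locations whose copy disagrees with the majority checksum."""
    digests = {location: sha256_hex(data) for location, data in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, digest in digests.items() if digest != majority)


# Hypothetical locations and media; the third copy simulates bit rot.
copies = {
    "site-a/optical": b"record 42",
    "site-b/magnetic": b"record 42",
    "site-c/magnetic": b"record 4?",
}

print(find_damaged_copies(copies))  # ['site-c/magnetic']
```

With only two copies, or an even split, a majority vote cannot tell which copy is the damaged one, which is one reason to keep a stored checksum alongside the copies.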

Wednesday, June 12, 2013


Web-Archiving.

Web-Archiving. Maureen Pennock. DPC Technology Watch Report 13-01. March 2013. Publicly released
This report is intended for those wanting to develop a better understanding of the issues and options for archiving web content, and for those intending to set up a web archive. Web archiving technology allows valuable web content to be preserved and managed for future generations.

Web content is lost at an alarming rate, and our digital cultural memory and organizational accountability are at risk. Organizational needs and resources must be considered when choosing web archiving tools and services. Issues with web archiving include selection of content, authenticity and integrity, quality assurance, duplication of content, legal rights, viruses, and the long-term preservation of resources. Web archiving is not a single action; it often involves a suite of applications used in various ways at different stages of the archiving process. Archiving tools may include commercial services, Web Curator Tool, NetarchiveSuite, the Heritrix web crawler, Wget, and the Wayback access interface. Archiving a simple website may be straightforward, but archiving large numbers of websites for the long term is much more complicated and requires a complex solution. The International Internet Preservation Consortium has played key roles in developing standards, such as the WARC standard, and archiving tools.

There are three main technical approaches:
1. Client-side archiving, using web crawlers such as Heritrix or HTTrack
2. Transactional archiving, which addresses the capture of client-side transactions
3. Server-side archiving, which requires active participation from publishing organizations
Another option being explored is the use of RSS feeds to identify and pull content into a web archive.
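
The RSS idea can be sketched briefly: parse a feed, collect the item links, and queue the ones not yet captured. The feed below is a made-up sample and the function names are illustrative; the report does not prescribe an implementation.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example site</title>
    <link>http://example.com/</link>
    <item><title>Post one</title><link>http://example.com/post-1</link></item>
    <item><title>Post two</title><link>http://example.com/post-2</link></item>
  </channel>
</rss>"""


def item_links(feed_xml: str) -> list:
    """Return the link of every item in an RSS 2.0 feed, in feed order."""
    root = ET.fromstring(feed_xml)
    return [link.text.strip()
            for link in root.iterfind("./channel/item/link")
            if link.text]


def urls_to_capture(feed_xml: str, already_archived: set) -> list:
    """Keep only the feed links that have not been captured yet."""
    return [url for url in item_links(feed_xml) if url not in already_archived]


print(urls_to_capture(SAMPLE_FEED, {"http://example.com/post-1"}))
# ['http://example.com/post-2']
```

The resulting list would then be handed to whatever capture tool the archive uses.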

In spite of all the efforts to capture and manage web content, web archives still face significant challenges, such as quality assurance issues, the need for more capable tools, and the need for better legislation. "The technical challenges of web archiving cannot, and should not, be addressed in isolation."

Tuesday, June 04, 2013

Cerf sees a problem: Today's digital data could be gone tomorrow.

Cerf sees a problem: Today's digital data could be gone tomorrow. Patrick Thibodeau. Computerworld. June 4, 2013.
Vinton Cerf is concerned that much of the data created in the past few decades, and for years still to come, will be lost to time. Digital materials from today, such as spreadsheets, documents and presentations, as well as mountains of scientific data, won't be readable in the years and centuries ahead. Software backward compatibility is very hard to preserve over very long periods of time, and the data objects are only meaningful if the software programs are available to interpret them. "The scientific community collects large amounts of data from simulations and instrument readings. But unless the metadata survives, which will tell under what conditions the data was collected, how the instruments were calibrated, and the correct interpretation of units, the information may be lost. If you don't preserve all the extra metadata, you won't know what the data means. So years from now, when you have a new theory, you won't be able to go back and look at the older data."

What is needed is a "digital vellum," a digital medium that is as durable and long-lasting as the material that has successfully preserved written content for more than 1,000 years. If a company goes out of business and there is no provision for its software to become accessible to others, all the products running that software may become inaccessible. The cloud computing environment may help; it may be able to emulate older hardware on which we can run operating systems and applications. We need to preserve the bits, but also a way of interpreting them.

The CODATA Mission: Preserving Scientific Data for the Future

The CODATA Mission: Preserving Scientific Data for the Future. Jeanne Kramer-Smyth. Spellbound Blog. February 2013.
This is a post (and a link to the slides) about a session that was part of The Memory of the World in the Digital Age: Digitization and Preservation conference. The aim was to describe the initiatives of the Data at Risk Task Group (DARTG), which is part of the International Council for Science Committee on Data for Science and Technology (CODATA).

The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf life. The task group is seeking out sources of such data worldwide, since many are irreplaceable for research into the long-term trends that occur in the natural world. One speaker talked about two forms of knowledge that are of concern here: the memory of the world and the forgettery of the world. Only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It is the analog data which are considered to be “at risk” and which are the task group’s immediate concern. Some of the early digital data are also insufficiently described, or the format is out of date and unreadable, or the records cannot easily be located at all.

How can such “data at risk” be recovered and made usable? An inventory website has been set up where one can report data at risk. The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records. Data mentioned include oceanographic, climate, satellite and other scientific data sets, as well as born-digital maps. With digital preservation initiatives there is a lot of rhetoric: there have been many consultations, studies, reports and initiatives, but not very much has translated into action.

Monday, May 27, 2013

National Library of Australia’s Digital Preservation Policy

Digital Preservation Policy 4th Edition (2013). National Library of Australia.  May 26, 2013.
This site outlines the National Library of Australia’s policy on preserving its digital collections and collaborating with others to preserve digital information resources. The primary objective of its digital preservation activities is maintaining the ability to meaningfully access digital collection content over time. The primary concern is preserving the ability to access the Preservation Master File from which derivative files may be created or re-created over time. To this end, preservation of digital library material includes:
  • Bit-level preservation of all digital objects, i.e. keeping the original files intact;
  • Ensuring that authenticity and provenance are maintained;
  • Ensuring that appropriate preservation information is maintained;
  • Understanding and reporting on risks which affect ongoing access;
  • Performing appropriate actions to ensure that objects remain accessible;
  • Periodic review of preferred formats and digital metadata standards.
Preservation of the Library's digital collections involves four main goals:
  1. Maintaining access to reliable data at bit-stream level;
  2. Maintaining access to content encoded in the bit streams;
  3. Maintaining access to the intended content; and
  4. Maintaining the stated preservation intent for all digital material over time.
While specific preservation activities may focus on one or more of these goals, the Library’s preservation responsibility is only fulfilled when all four goals have been adequately addressed.

The Library uses the concepts in the Open Archival Information System (OAIS) Reference Model and other international standards and best practices, such as PREMIS and the work of the Open Planets Foundation.

Sunday, May 19, 2013

Digital Preservation Tool Grid.

Digital Preservation Tool Grid. Preserving Objects With Restricted Resources. May 15, 2013.
This is a grid, created by POWRR, that looks at 24 different features, such as ingest, processing, access, storage, maintenance, and cost, for about 50 digital preservation tools. The tools range from simple tools to full digital preservation systems, from ACE to Xena. This tool is very informative.

Sunday, May 12, 2013

ZENODO. Research. Shared.

ZENODO. Research. Shared. Website. May 12, 2013.
ZENODO is a new open digital repository service that enables researchers, scientists, projects and institutions to share and showcase multidisciplinary research results (data and publications) that are not part of existing institutional or subject-based repositories. The repository was created by OpenAIRE and CERN and is supported by the European Commission. It promotes peer-reviewed, openly accessible research; all items have a DOI, so they are citable. All formats are allowed, and each file is limited to 1 GB. Data files are versioned, but records are not. Files may be deposited under closed, open, embargoed or restricted access.
It is named after Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history. ZENODO is provided free of charge for educational and informational use.

Saturday, May 11, 2013

British Library Digital Preservation Strategy.

British Library Digital Preservation Strategy. The British Library. March 2013.
The British Library has published its digital preservation strategy for 2013-2016. Its vision is that by 2020 it will have in place end-to-end workflows that deliver and preserve its digital collections in a trusted long-term digital repository so that they may be accessed by future users.
This is not a strategy for the digital preservation team alone, but for the whole Library. The Library is working to confidently, reliably and cost-effectively manage and preserve all types of digital content destined for long-term preservation, and to embed best practice in digital collection content management in all areas of the Library. With acquisition comes responsibility: the Library must preserve this content and make it accessible for future users. It also recognizes a benefit from collaboration with other national and international institutions on digital preservation initiatives.

This strategy outlines four strategic priorities to be met by 2016:
  1. Ensure our digital repository can store and preserve our collections for the long term
  2. Manage the risks and challenges associated with digital preservation throughout the digital collection content lifecycle
  3. Embed digital sustainability as an organisational principle for digital library planning and development: not just technical solutions but also clear organisational commitment and resources
  4. Benefit from collaboration with other national and international institutions on digital preservation initiatives

The NDSA Levels of Digital Preservation: An Explanation and Uses.

The NDSA Levels of Digital Preservation: An Explanation and Uses. Megan Phillips, et al. National Digital Stewardship Alliance. February 28, 2013.

The National Digital Stewardship Alliance (NDSA) is refining a set of recommendations and guidelines for those involved in preserving digital assets for the long term. The guidelines are organized into five functional areas that are at the heart of digital preservation systems: storage and geographic location, file fixity and data integrity, information security, metadata, and file formats.

The tiered, matrix approach of the Levels of Digital Preservation features multiple levels and content areas that can be adapted over time. The flexible approach allows users to achieve different levels in different content areas according to their unique needs and resources.

The guidelines were initially developed as a reference for prioritizing enhancements to digital preservation systems. They are also useful for developing guidelines for content creators, validating local preservation guidance, setting minimum requirements for preservation services, and assessing compliance with best practices.
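
One way to picture the matrix is as a level recorded per functional area, from which the weakest areas fall out as candidates for the next enhancement. The sketch below is illustrative only: the function name is hypothetical, the four-level scale follows the published NDSA chart, and using 0 for an area that has not yet reached level 1 is this sketch's own convention.

```python
AREAS = (
    "storage and geographic location",
    "file fixity and data integrity",
    "information security",
    "metadata",
    "file formats",
)
MAX_LEVEL = 4  # levels 1-4; 0 here means "not yet at level 1" (sketch convention)


def weakest_areas(assessment: dict) -> tuple:
    """Return (lowest level reached, areas at that level) for a self-assessment."""
    missing = set(AREAS) - set(assessment)
    if missing:
        raise ValueError(f"no level recorded for: {sorted(missing)}")
    for area in AREAS:
        if not 0 <= assessment[area] <= MAX_LEVEL:
            raise ValueError(f"level out of range for {area!r}")
    lowest = min(assessment[area] for area in AREAS)
    return lowest, [area for area in AREAS if assessment[area] == lowest]


# Hypothetical self-assessment for one institution.
assessment = {
    "storage and geographic location": 2,
    "file fixity and data integrity": 1,
    "information security": 3,
    "metadata": 1,
    "file formats": 2,
}

print(weakest_areas(assessment))
# (1, ['file fixity and data integrity', 'metadata'])
```

Because each area is scored separately, an institution can be at a high level in one area and a low level in another, which is the flexibility the guidelines describe.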

Tor Books says cutting DRM out of its e-books hasn’t hurt business.

Tor Books says cutting DRM out of its e-books hasn’t hurt business. Megan Geuss. Ars Technica. May 4, 2013.

Tor Books announced last April that it would only retail e-books in DRM-free formats because its customers are “a technically sophisticated bunch, and DRM is a constant annoyance to them. It prevents them from using legitimately-purchased e-books in perfectly legal ways, like moving them from one kind of e-reader to another.”

This week, Julie Crisp, editorial director at Tor UK, wrote that the publisher has seen “no discernible increase in piracy on any of our titles, despite them being DRM-free for nearly a year.”

Tor's 2012 decision was largely applauded by its customers and authors. The authors agreed to a scheme which would allow their readers greater freedom with their novels.

PDF/A, PDF for Long-term Preservation.

PDF/A, PDF for Long-term Preservation. Library of Congress.  March 21, 2013.
This section on PDF/A is part of the Library of Congress website on sustainable formats. The page includes description of PDF/A, sustainability factors, quality and functionality factors, format specifications, and useful references.

PDF/A is a family of ISO standards that attempt to provide sustainable formats through device independence, self-containment, and self-documentation. The PDF/A standards are developed and maintained by a working group with representatives from government, industry, and academia, with active support from Adobe Systems Incorporated.

PDF/A-1, the first PDF/A standard, was based on PDF version 1.4 and published in 2005.
PDF/A-2 extends the capabilities of PDF/A-1 and is based on PDF version 1.7.
PDF/A-3 allows files in any other format, not just other PDF/A files, to be embedded in a PDF/A file.

Restrictions on PDF/A files include:
  • Audio and video content are forbidden
  • JavaScript and executable file launches are prohibited
  • All fonts must be legally embeddable for unlimited, universal rendering
  • Color spaces must be specified in a device-independent manner
  • Encryption is disallowed
  • Use of standards-based metadata is mandated
The PDF/A standards define levels of conformance: conformance level A satisfies all requirements in the specification; level B and level U are lower levels of conformance, still satisfying the requirements of ISO 19005 regarding the visual appearance of electronic documents, but less demanding as to representation of structural or semantic properties.
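
The combination of part and conformance level is usually written as a short label such as "PDF/A-1b". As a small sketch, the helper below parses such a label and checks that the level exists for that part. The function name is hypothetical, and the table relies on a detail not stated above: level U was introduced with PDF/A-2, so PDF/A-1 defines only levels A and B.

```python
import re

# Conformance levels defined for each part of the standard.
ALLOWED_LEVELS = {
    1: {"A", "B"},
    2: {"A", "B", "U"},
    3: {"A", "B", "U"},
}

_LABEL = re.compile(r"^PDF/A-(\d)([A-Za-z])$")


def parse_flavour(label: str) -> tuple:
    """Split a label such as 'PDF/A-2u' into (part, level), validating both."""
    match = _LABEL.match(label.strip())
    if not match:
        raise ValueError(f"not a PDF/A label: {label!r}")
    part, level = int(match.group(1)), match.group(2).upper()
    if level not in ALLOWED_LEVELS.get(part, set()):
        raise ValueError(f"PDF/A-{part} has no conformance level {level}")
    return part, level


print(parse_flavour("PDF/A-2u"))  # (2, 'U')
```

This only validates the label; checking that a file actually conforms to the level it claims requires a dedicated validator.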

Monday, May 06, 2013

The APTrust Architecture Presentation.

The APTrust Architecture Presentation. Scott Turnbull. Academic Preservation Trust. May 6, 2013.
The Academic Preservation Trust (APTrust) consortium is developing a preservation environment.  The website includes slides presenting the APTrust Phase I Architecture.  It gives a general look at the components being developed.  The APTrust repository will serve as a replicating node for the Digital Preservation Network (DPN). At the local level, APTrust will provide a preservation environment for participating members, including disaster recovery services.

Light, Dark and Dim Archives: What are they?

Light, Dark and Dim Archives: What are they?  May 2013.
The following is a compilation of a few definitions or examples of Light, Dark and Dim archives to better understand what they are.

The notion of "dark archives", supporting little or no access to archived materials, has met with scant enthusiasm in the library community. This suggests that digital repositories will function not just as guarantors of the long-term viability of materials in their custody, but also as access gateways. Lavoie

A secure digital repository sometimes referred to as a "dark archive". Kirsch

Dark Archive: An archive that does not grant public access and only preserves the information it contains. This can refer to a digital archive or repository as well as brick & mortar archive. Michigan

Dark archive: The purpose of a dark archive is to function as a repository for information that can be used as a failsafe during disaster recovery. UCPress

The Dark Archive is a secret place for storing archival material with restricted user access. Tufts

We chose to create a “dark” archive to focus our efforts on securing and preserving large volumes of content important to libraries and their users; however, it is not exclusively dark. Participating libraries experience the archive as a “light” or accessible archive in two ways: auditing the archive to ensure we are prepared to support eventual use and accessing of content that has been made available as the result of a “trigger event” or post-cancellation access claim. Portico

Dark archives are certainly misunderstood both inside and outside the industry. So, what is a dark archive? It is, simply put, an archive of information that is not used for public access. Most often it serves as a failsafe copy of a light archive, i.e. a publicly available version of the information, for use in disaster recovery operations. Dark archives need not be a fully operational copy of an information system, rather just the content behind the information system. This is an important distinction because maintaining an exact operational copy of an information system is a much more complex and expensive undertaking than maintaining only the content the information system operates on. Metaphorically, at its base definition, a dark archive will require more than a flip of the switch to make a light archive. OSTI Blog

Dark Archive: An archive that is inaccessible to the public. It is typically used for the preservation of content that is accessible elsewhere. See also dim archive, light archive.
Dim Archive: An archive that is inaccessible to the public, but that can easily be made accessible if required. It's typically used for the preservation of content that is accessible elsewhere. See also dark archive, light archive.
Light Archive: An archive that is accessible to the public. See also dim archive and dark archive.  CDL

A DDP network may be an open archive, or it may reside somewhere on the spectrum from dim to dark archive. That is, it may be open to only the contributors’ servers for ingesting (dark archive); it may be open to specified users, such as the contributing institutions’ communities (dim archive); or it may provide unrestricted access (open archive). This status will determine whether contributors will focus solely on long-term preservation issues, or some combination of preservation and public access issues. MetaArchive

Dark Archive: Digital archive for which access to content is limited to organizational custodians.
Dim Archive: Digital archive that incorporates elements of both the Dark and Open Archive models. Access for some materials is restricted to organizational custodians, while access for others may be open to a broad user community.
Open Archive:  A digital archive that is publicly accessible. MetaArchive
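
The MetaArchive definitions above can be read as a simple rule over who may access each item. The sketch below encodes that reading; the class and function names are illustrative, and reducing access to a single public/restricted flag per item is a simplifying assumption.

```python
from enum import Enum


class ArchiveKind(Enum):
    DARK = "dark"
    DIM = "dim"
    OPEN = "open"


def classify(item_is_public) -> ArchiveKind:
    """Classify an archive from one boolean per item (True = publicly accessible)."""
    flags = list(item_is_public)
    if not flags:
        raise ValueError("an empty archive cannot be classified")
    if all(flags):
        return ArchiveKind.OPEN   # everything is publicly accessible
    if not any(flags):
        return ArchiveKind.DARK   # access limited to organizational custodians
    return ArchiveKind.DIM        # a mix of restricted and open material


print(classify([False, False]))  # ArchiveKind.DARK
print(classify([True, False]))   # ArchiveKind.DIM
print(classify([True, True]))    # ArchiveKind.OPEN
```

The CDL definition of a dim archive differs: it describes an archive that is closed now but can easily be opened, which a per-item flag does not capture.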

Below are two figures from the OAIS Model (2012) showing the Access functions.

Figure 4-1: OAIS Functional Entities

Figure 4-7: Functions of the Access Functional Entity