Friday, March 26, 2010

Digital Preservation Matters - March 26, 2010

Archiving Britain's web: The legal nightmare explored. Katie Scott. Wired. 05 March 2010.

Websites are increasingly recognized as being culturally valuable, but there are concerns about the ability to preserve them because of current copyright requirements. Over the past 6 years the British Library has archived over 6,000 culturally significant websites. Currently it must contact every copyright holder of these sites, and it has only a 24% response rate. Some feel there is a "'digital black hole' in the nation's memory" because of the difficulty in archiving websites. There is a proposal to change the law to allow the copy deposit act to include websites. Some are looking at an opt-out option. The BBC has a "no take-down" rule.


Canterbury Tales manuscript to be digitized. Medieval news. March 22, 2010.

The University of Manchester Library is planning to digitize the Canterbury Tales manuscript. This is part of a JISC funded project. The Centre of Digital Excellence supports universities, colleges, libraries and museums which lack the resources to digitize important works. In addition to the digitizing work, “they will also be exploring business models for the long term viability of digitisation.”


ISO Releases Archival Standards. eContent. Mar 23, 2010.

Two documents from the International Organization for Standardization (ISO) aim to provide guidelines for archiving patient information. "Health informatics-Security requirements for archiving of electronic health records-Principles" and "Health informatics-Security requirements for archiving of electronic health records-Guidelines" look at topics of records maintenance, retention, disclosure, and eventual destruction. Electronic medical data must be stored for the life of the patient; there are legal, ethical, and privacy concerns.


Elsevier and PANGAEA Data Archive Linking Agreement. Neil Beagrie. Blog. 03 Mar 2010.

Elsevier and the data library PANGAEA (Publishing Network for Geoscientific & Environmental Data) have agreed to reciprocal linking of their content in earth system research. Research data sets deposited at PANGAEA are now automatically linked to the corresponding articles in Elsevier journals on ScienceDirect. Science is better supported through the cooperation and the flow of data into trusted archives. “This is the beginning of a new way of managing, preserving and sharing data from earth system research.”


Duplicating Federal Videos for an Online Archive. Brian Stelter. The New York Times. March 14, 2010.

The International Amateur Scanning League plans to upload the National Archives’ collection of 3,000 DVDs in an “experiment in crowd-sourced digitization” using a DVD duplicator and a YouTube account. This is a small demonstration that volunteers can sometimes achieve what bureaucracies can’t or won’t. Although the DVDs are all technically available to the public, they are hard to see unless a person visits the archive or pays for a copy. The volunteers duplicate the DVDs, then upload them to YouTube, the Internet Archive Web site and an independent server.


Uncompressed Audio File Formats. JISC Digital Media. 10 February 2010.

This looks at the main features of uncompressed audio file types, including WAV, AIFF and Broadcast WAV (BWF). “Uncompressed audio files are the most accurate digital representation of a soundwave” but they also take the most resources. Digital audio recording measures the level of a sound wave at regular intervals and records that value as a number. “This bitstream is the ‘raw’ audio data, expressing the sound wave in its closest digital analogue.” These uncompressed audio file types are ‘wrapper’ formats that take the original data and combine it with additional data to make it compatible with other systems.
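The sampling process described above can be illustrated with a short sketch. This is only an illustration, not code from the JISC guide: the function name `sample_sine`, the 1 kHz test tone, and the choice of 48 samples are assumptions made for the example.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, bit_depth, n_samples):
    """Measure a full-scale sine wave at regular intervals and
    store each measurement as a signed integer (PCM-style)."""
    max_level = 2 ** (bit_depth - 1) - 1  # largest positive value, e.g. 8388607 for 24-bit
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                        # time of the n-th measurement
        level = math.sin(2 * math.pi * freq_hz * t)   # "analogue" level in [-1, 1]
        samples.append(round(level * max_level))      # the level recorded as a number
    return samples

# One millisecond of a 1 kHz tone at 48 kHz, 24-bit (hypothetical test signal).
pcm = sample_sine(freq_hz=1000, sample_rate_hz=48000, bit_depth=24, n_samples=48)
```

The list `pcm` is the kind of raw number sequence that a wrapper format such as WAV then surrounds with header information.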

The most common is the Waveform Audio File Format (WAV), which is limited to a 4 GB file size. The European Broadcasting Union created the Broadcast Wave Format (BWF), which is functionally identical to the WAV file except that it has an extra header chunk for metadata. This is a recommended archive format and also has a 4 GB file size limit. The European Broadcasting Union has recently added the Multichannel Broadcast Wave Format (MBWF), which combines the RF64 audio format (surround sound, MP3, AAC, etc.) with a 64-bit address header and has a file size limit of 18 billion GB. It is backward compatible with WAV and BWF. The Audio Interchange File Format (AIFF) is the native format for audio on Mac OS X.

“The International Association of Sound and Audiovisual Archives (IASA) recommend Broadcast WAV as a suitable archival format, for reasons of its wide compatibility and support, and its embedded metadata capability. For surround-sound or multichannel audio the MBWF format should be used. For archive PCM audio, bit depth should be a minimum of 24-bit, and sample rate a minimum of 48kHz to comply with IASA standards.” If compression is needed, lossless compression (which requires an additional encoding/decoding stage, a codec) is the least destructive alternative. Some open-source lossless compression codecs are available, such as FLAC.
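To get a feel for what the 4-gigabyte WAV/BWF limit means at the IASA minimum settings, here is a rough calculation. It assumes the limit is 2^32 bytes (the size fields in a WAV file are 32 bits wide), assumes a stereo recording, and ignores the small header overhead; the function names are made up for this sketch.

```python
def bytes_per_second(sample_rate_hz, bit_depth, channels):
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * (bit_depth // 8) * channels

def max_duration_seconds(size_limit_bytes, sample_rate_hz, bit_depth, channels):
    """Longest recording that fits under a file size limit (header ignored)."""
    return size_limit_bytes / bytes_per_second(sample_rate_hz, bit_depth, channels)

WAV_LIMIT_BYTES = 2 ** 32  # assumed limit from 32-bit size fields

rate = bytes_per_second(48000, 24, 2)  # 48 kHz, 24-bit, stereo
hours = max_duration_seconds(WAV_LIMIT_BYTES, 48000, 24, 2) / 3600
print(f"{rate} bytes per second, about {hours:.1f} hours per file")
```

Under these assumptions a single file holds a little over four hours of stereo audio, which is one reason longer or multichannel recordings point toward MBWF.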


Court Orders Producing Party to "Unlock" PDF Since Not in a "Reasonably Usable" Form. Michael Arkfeld. Electronic Discovery and Evidence - blog. February 15, 2010.

In this contractual action, the defendants disclosed an 11,757-page summary in a "locked" PDF format, precluding the plaintiff from being able to edit or manage the summary without retyping it. The Court found that the defendants' locked format made it "completely impractical for use" and ordered that the defendants "unlock" the files.


Tuesday, March 16, 2010

Digital Preservation Matters - March 16, 2010

Fending Off Digital Decay, Bit by Bit. Patricia Cohen. The New York Times. March 15, 2010.

This looks at the archival material, including digital material, from an author that is on display at Emory University. It highlights what research libraries and archives are discovering: that “born-digital” materials are much more complicated and costly to preserve than anticipated. The “archivists are finding themselves trying to fend off digital extinction at the same time that they are puzzling through questions about what to save, how to save it and how to make that material accessible.” Computers have now been used for over two decades, but their digital materials are just now finding their way into archives. The curator said “We don’t really have any methodology as of yet to process born-digital material. We just store the disks in our climate-controlled stacks, and we’re hoping for some kind of universal Harvard guidelines.” The challenges include cataloging the material and acquiring the equipment and expertise to access the data stored on obsolete media. Do they try to save the look and feel of the material or just save the content? The computer editing meant that there are no manuscripts with pages with “lots of crossings-out and scribbling”. The display is providing the “emulation to a born-digital archive” similar to reproducing the author’s work environment. Emory is providing $500,00 to produce a computer forensics lab to do this kind of work. Others are impressed with the emulation, but their focus is storage and preservation of digital content. One center is trying to raise money to hire a digital collections coordinator. Until then, the digital materials are unavailable to researchers.


More on using DROID for Appraisal. Chris Prom. Practical E-Records. March 10, 2010.

The information that DROID supplies is useful, but the output is not optimally organized for reuse. By regularizing the DROID CSV output, the information became sortable and more useful. DROID was also useful in identifying files that did not use the standard file extension for an application, and in finding files that needed attention or needed to be converted. It was very useful in the appraisal process: with it, the major migration problems could be identified, and it helped to weed out inappropriate, duplicate, or private content.
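As a rough illustration of the kind of clean-up described, the sketch below sorts a simplified CSV and flags files whose extension does not match the identified format. The column names (`file_path`, `format_name`), the sample rows, and the extension table are all hypothetical; real DROID exports have different columns depending on the version, and this is not the script used in the original post.

```python
import csv
import io
import os

# Hypothetical, simplified export; real DROID CSV columns differ by version.
SAMPLE_CSV = """file_path,format_name
reports/budget.doc,Microsoft Word Document
reports/minutes.txt,Microsoft Word Document
scans/letter.pdf,Portable Document Format
scans/photo.dat,JPEG File Interchange Format
"""

# Assumed mapping from identified format to the extensions normally used for it.
EXPECTED_EXTENSIONS = {
    "Microsoft Word Document": {".doc"},
    "Portable Document Format": {".pdf"},
    "JPEG File Interchange Format": {".jpg", ".jpeg"},
}

def load_rows(csv_text):
    """Parse the CSV and sort it by format, then by path, so it is easy to scan."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: (r["format_name"], r["file_path"]))

def mismatched_extensions(rows):
    """Return paths whose extension is not one expected for the identified format."""
    flagged = []
    for row in rows:
        ext = os.path.splitext(row["file_path"])[1].lower()
        expected = EXPECTED_EXTENSIONS.get(row["format_name"], set())
        if expected and ext not in expected:
            flagged.append(row["file_path"])
    return flagged

rows = load_rows(SAMPLE_CSV)
flagged = mismatched_extensions(rows)
```

With the sample data, `flagged` contains the two files that would need a closer look during appraisal: the image saved as `.dat` and the Word document saved as `.txt`.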


Data, data everywhere. Economist. February 25, 2010.

The world contains an unimaginably vast amount of digital information which is increasing rapidly. This makes it possible to do many things that previously could not be done but it is also creating a host of new problems. The proliferation of data is making them increasingly inaccessible. The way that information is managed touches all areas of life. The data-centered economy is still new and the implications are not yet understood.


Archon™: The Simple Archival Information System. Website. 15 February 2010.

Version 3 of this software has been released. The software is for archivists and manuscript curators. It publishes archival descriptive information and digital archival objects to a user-friendly website. Functionality includes:

· Create standards-compliant collection descriptions and full finding aids using web forms.

· Describe the series, subseries, files, items, etc. within each collection.

· Upload digital objects/electronic records or link archival descriptions to external URLs.

· Batch import data

· Export MARC and EAD records


Deluge of scientific data needs to be curated for long-term use. Carole L. Palmer. February 24, 2010.

Data curation is the active and ongoing management of data through their lifecycle. It is an important part of research. Data are a valuable asset to institutions and to the scientific enterprise. Saving the publications that report the results of research isn't enough; researchers also need access to data. Data curation begins long before the data are generated; it needs to start at the proposal stage. Without the data there is the issue of replicating and validating a research project's conclusions. "Digital content, including digital data, is much more vulnerable than the print or analog formats we had before." Selecting, appraising and organizing data to make them accessible and interpretable takes a lot of work and expense. "The bottom line is that many very talented scientists are spending a lot of time and effort managing data. Our aim is to get scientists back to doing science, where their expertise can make a real difference to society."


Is copyright getting in the way of us preserving our history? Victor Keegan. The Guardian. 25 February 2010.

In theory, future historians will have a lot of information about our age. In reality, much of it may be lost. Much of the information is on web pages, and they have a short life expectancy. The British Library has launched the UK Web Archive, which will guarantee longevity to thousands of hand-picked UK websites. But this is only a small part. “The issue of copyright is a global nightmare for anyone interested in digital preservation.”


"Zubulake Revisited: Six Years Later": Judge Shira Scheindlin Issues her Latest e-Discovery Opinion. Electronic Discovery Law. January 27, 2010.

This is a review of a case that addresses the issue of parties’ preservation obligations. Check here for the full opinion. The case revisits an earlier decision concerning e-discovery (finding electronic documents, emails, etc., in court cases), preservation obligations, and negligence for failure to keep records correctly. Some statements from the court opinion:

  • By now, it should be abundantly clear that the duty to preserve means what it says and that a failure to preserve records, paper or electronic, and to search in the right places for those records, will inevitably result in the spoliation of evidence.
  • While litigants are not required to execute document productions with absolute precision, at a minimum they must act diligently and search thoroughly at the time they reasonably anticipate litigation.
  • The following failures support a finding of gross negligence, when the duty to preserve has attached: to issue a written litigation hold; to identify all of the key players and to ensure that their electronic and paper records are preserved; to cease the deletion of email or to preserve the records of former employees that are in a party's possession, custody, or control; and to preserve backup tapes when they are the sole source of relevant information or when they relate to key players, if the relevant information maintained by those players is not obtainable from readily accessible sources.
  • The case law makes crystal clear that the breach of the duty to preserve, and the resulting spoliation of evidence, may result in the imposition of sanctions by a court because the court has the obligation to ensure that the judicial process is not abused.

Friday, March 12, 2010

A New Approach to Web Archiving

At the Marriott Library, we’ve recently begun looking into what it would take to archive websites that are important to the University. During some research into this area, I came across the proceedings of the 2009 International Web Archiving Workshop (IWAW).

An interesting project is taking place in France that may change the way web archiving is approached. At University P. and M. Curie in Paris, researchers are developing a web crawler that will not only detect changes to a website but one that will be able to detect which changes are unimportant (changing ads on a page, etc.) versus which are important to the page’s content. If successful, this might greatly improve the effectiveness of the web archiving system because digital archives would no longer be gumming up bandwidth and storage space with needless data.

This project is taking place in conjunction with the French National Audio-Visual Institute (INA). The institute would like to archive French television and radio station websites. The visual component of the institute’s pages is very important to the project, not just the content.

According to the workshop proceedings, the project idea is to “use a visual page analysis to assign importance to web pages parts, according to their relative location. In other words, page versions are restructured according to their visual representation. Detecting changes on such restructured page versions gives relevant information for understanding the dynamics of the web sites. A web page can be partitioned into multiple segments or blocks and, often, the blocks in a page have a different importance. In fact, different regions inside a web page have different importance weights according to their location, area size, content, etc. Typically, the most important information is on the center of a page, advertisement is on the header or on the left side and copyright is on the footer. Once the page is segmented, then a relative importance must be assigned to each block…Comparing two pages based on their visual representation is semantically more informative than with their HTML representation.”
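To make the idea of block importance concrete, here is a toy sketch of my own (not the researchers' algorithm). It assumes a page has already been segmented into named blocks, and the block names, weights, and sample content are invented for the example. A change is scored by the share of total importance carried by the blocks that differ.

```python
def change_importance(old_blocks, new_blocks, weights):
    """Score a page change by the importance of the blocks that differ.

    old_blocks / new_blocks: dict of block name -> block content
    weights: dict of block name -> relative importance (assumed values)
    Returns (score between 0 and 1, list of changed block names).
    """
    changed = [name for name in weights
               if old_blocks.get(name) != new_blocks.get(name)]
    total = sum(weights.values())
    score = sum(weights[name] for name in changed) / total
    return score, changed

# Hypothetical weights: the center of the page matters most.
WEIGHTS = {"header": 1, "main": 7, "sidebar": 1, "footer": 1}

v1 = {"header": "ad: shoes", "main": "Evening news", "sidebar": "links", "footer": "(c) 2010"}
v2 = {"header": "ad: cars", "main": "Evening news", "sidebar": "links", "footer": "(c) 2010"}
v3 = {"header": "ad: cars", "main": "Morning news", "sidebar": "links", "footer": "(c) 2010"}

ad_score, ad_changed = change_importance(v1, v2, WEIGHTS)      # only the ad rotated
news_score, news_changed = change_importance(v2, v3, WEIGHTS)  # the main content changed
```

A crawler using something like this could skip re-archiving when the score stays below a chosen threshold, which is the bandwidth and storage saving the project is after.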

The main concept and hopeful contribution to the world of web archiving is summed up by the presenters as follows:

• A novel web archiving approach that combines three concepts: visual page analysis (or segmentation), visual change detection and importance of web page’s blocks.

• An extension of an existing visual segmentation model to describe the whole visual aspect of the web page.

• An adequate change detection algorithm that computes changes between visual layout structures of web pages with a reasonable complexity in time.

• A method to evaluate the importance of changes occurred between consecutive versions of documents.

• An implementation of our approach and some experiments to demonstrate its feasibility.

It will be interesting to follow up with this project as it reaches its conclusion and see how its results will affect current web archiving players as well as fellow research endeavors like the Memento Project.

You can read about this project in much more technical detail at the IWAW website (unless it’s been taken down and hasn’t been properly archived).

Thursday, March 11, 2010

Digital Preservation Matters - March 9, 2010

Accelerated Life Cycle Comparison of Millenniata Archival DVD [corrected link]. Ivan Svrcek. Naval Air Warfare Center. March 2010. [75 p. PDF]
The Life Cycle and Environmental Engineering branch at the China Lake installation performed an accelerated aging test comparing Millenniata discs with current archival grade DVDs (Delkin, MAM-A, Mitsubishi, Taiyo Yuden, and Verbatim). The test evaluated disc stability when exposed to combined light, heat and humidity. Besides using the standard tests for predicting the lifetime of a disc, the test included looking at the initial write quality and exposure to the full spectrum of light. The test also looked at the drives used to burn the discs, and which drives worked best with which discs. One conclusion about the drives was that “the device used to record an optical media can have a great impact upon the write quality and should be considered in all data storage situations.” According to the ECMA standards, “All dye-based discs failed.” That is in contrast to the Millenniata discs: “none of the Millenniata media suffered any data degradation at all. Every other brand tested showed large increases in data errors after the stress period. Many of the discs were so damaged that they could not be recognized as DVDs by the disc analyzer.”

“Ensuring that valuable digital assets will be available for future use is not simply a matter of finding sufficient funds. It is about mobilizing resources—human, technical, and financial—across a spectrum of stakeholders.” Major questions are what should we preserve, who is responsible, and who will pay for it. This looks at scholarly publications, research data, commercially owned culture content, and collectively produced web content. Three important components in developing preservation strategies :
  1. When talking about preservation, make the case for use of the materials. A decision to preserve something now does not mean a permanent commitment of resources. The value and use may be clearer later.
  2. Incentives to preserve must be clearly shown as being in the public interest.
  3. There must be agreement on the roles and responsibilities of all concerned: the information creators, owners, preservers, and users.
It is important to reduce the cost of preservation as digital information increases. The areas for priority action include:
Organizational: Develop partnerships; ensure access to skilled personnel; sustain stewardship chain.
Technical: Build capacity to support stewardship in all areas; lower the cost of preservation overall.
Policy: Create incentives; clarify rights of web materials; empower organizations.
Educational: Promote education and training; raise awareness of the urgency of timely preservation actions.
  • Sustainable preservation strategies are not built all at once, nor are they static. Sustainable preservation is a series of timely actions taken to anticipate the dynamic nature of digital information.
  • Commitments made today are not commitments for all time. But actions must be taken today to ensure flexibility in the future.
  • Sustainable digital preservation requires a compelling value proposition, incentives to act, and well-defined roles and responsibilities.
  • Decisions about longevity are made throughout the digital lifecycle.
  • A sustainable preservation strategy must be flexible enough to span generations of data formats, access platforms, owners, and users.
  • Preservation decisions can often be seen as an incremental cost, and are often the same as decisions made to meet current demand.
Five conditions required for economic sustainability are:
  1. recognition of the benefits of preservation by decision makers;
  2. a process for selecting digital materials with long-term value;
  3. incentives for decision makers to preserve in the public interest;
  4. appropriate organization and governance of digital preservation activities; and
  5. mechanisms to secure an ongoing, efficient allocation of resources to digital preservation activities.
AAC Audio and the MP4 Media Format. JISC Digital Media. 12 February 2010.
From the JISC advice site: This is a guide to creating and using the AAC compressed audio resources. AAC is the successor to the MP3 format; this site explains the advantages of AAC over MP3. AAC offers significant reduction of audio file size while still retaining good sound quality. The AAC audio standard is a subsection of the MPEG-4 standard and the MP4 file type is often used to deliver content. Apple added the .m4a and .m4p extensions to designate audio content. AAC requires a compatible codec for the final user to be able to listen to it. AAC uses a lossy compression; so for standards-compliant sound archiving, Broadcast WAV format should be used according to the guidelines of the International Association of Sound and Audiovisual Archives (IASA). More on BWAV at the BBC site.
If you don’t need standards compliance or absolute fidelity for your archive, or if you don’t have the storage space for the much larger uncompressed BWAV files, “then you may want to consider AAC as the overall best currently available lossy compression method.” This is an excellent site for information and contains much more on the container, encoding, versions, filetypes, bitrate, metadata, the iTunes schema, and a simplified visual representation of an MP4.

Tuesday, March 02, 2010

Digital Preservation Matters - March 2, 2010

A Guide to Distributed Digital Preservation. Katherine Skinner, Matt Schultz. Educopia Institute. February 2010. [156 p. PDF]

Excellent guide created by MetaArchive, who developed the first private LOCKSS network in 2004. This work examines distributed digital preservation, successful strategies and new models. It will help others to join or establish a private LOCKSS network. It discusses the network architecture, technical and organizational considerations, content selection and ingest, administration and copyright practices in the network. A distributed digital preservation system must preserve, not just back up. The preservation process of contributing, preserving, and retrieving content depends upon the institution’s diligence. Ingested content is preserved not just through replication, but by the caches through a set of polling, voting, and repairing processes. Distributed digital preservation, by definition, requires communication and collaboration across multiple locations and between numerous staff.
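The polling, voting, and repairing idea can be sketched in a few lines. This is a greatly simplified illustration and not the actual LOCKSS protocol: it assumes every cache simply reports a SHA-256 digest of its copy, the majority digest wins, and any disagreeing cache is overwritten from an agreeing copy. The cache names and content are invented for the example.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    """Fingerprint of one cache's copy of the content."""
    return hashlib.sha256(content).hexdigest()

def poll_and_repair(caches):
    """caches: dict of cache name -> bytes held by that cache.

    Each cache 'votes' with the digest of its copy. If a strict majority
    agrees, caches holding a different copy are repaired from an agreeing one.
    Returns the names of the caches that were repaired.
    """
    votes = Counter(digest(content) for content in caches.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(caches) // 2:
        raise ValueError("no majority among caches; needs human attention")
    good_copy = next(c for c in caches.values() if digest(c) == winning_digest)
    repaired = []
    for name, content in list(caches.items()):
        if digest(content) != winning_digest:
            caches[name] = good_copy  # repair from a copy the majority agrees on
            repaired.append(name)
    return repaired

# Three hypothetical caches, one of which has suffered bit rot.
caches = {"cache_a": b"thesis.pdf bytes", "cache_b": b"thesis.pdf bytes", "cache_c": b"thesis.pdf bytez"}
repaired = poll_and_repair(caches)
```

The point of the sketch is the difference between backup and preservation: copies are not just stored, they are repeatedly compared against each other and fixed when they drift.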

The software provides bit-level preservation for digital objects of any file type or format, but it can also provide a set of services to make the preserved files usable in the future, such as normalizing and migrating. The MetaArchive network is a dark archive with no public interface; communication between caches is secure. Organizations collaborating on preserving digital content must examine the roles and responsibilities of members, address essential management, policy, and staffing questions, develop standards, and define the network’s sphere of activity. Ingest, monitoring, and recovery of content are critical steps for preserving the content.

Some interesting quotes from the guide:

  • Paradoxically, there is simultaneously far greater potential risk and far greater potential security for digital collections
  • many cultural memory organizations are today seeking third parties to take on the responsibility for acquiring and managing their digital collections. The same institutions would never consider outsourcing management and custodianship of their print and artifact collections;
  • A great deal of content is in fact routinely lost by cultural memory organizations as they struggle with the enormous spectrum of issues required to preserve digital collections,
  • A true digital preservation program will require multi-institutional collaboration and at least some ongoing investment to realistically address the issues involved in preserving information over time.
  • One of the greatest risks we run in not preserving our own digital assets for ourselves is that we simultaneously cease to preserve our own viability as institutions.


Encouraging Open Access. Steve Kolowich. Inside Higher Ed. March 2, 2010.

Conversations about open access to journal articles currently revolve around policy, not technology; about whether the content should be made available, not how. “Without content, an IR is just a set of empty shelves.” A new model of repository focuses on giving researchers an online “workspace” within the repository where they can upload and preserve different versions of an article they are working on. The idea is to make publishing articles to the open repository a natural extension of the creative process. This is based on a survey in which professors said they wanted:

  • to be able to work with co-authors easily,
  • to keep track of different versions of the same document, and
  • to make their work more visible
  • all while doing as little extra work as possible.


In the digital age, librarians are pioneers. Judy Bolton-Fasman. The Boston Globe. February 10, 2010.

Book review of This Book Is Overdue: How Librarians and Cybrarians Can Save Us All By Marilyn Johnson.

  • Among information professionals, Johnson notes there are librarians and archivists: “Librarians were finders [of information]. Archivists were keepers.’’ But the information revolution is affecting both.
  • The digital age is making possible the creation of searchable databases of archives, but it’s also making information, especially on the Internet, more ephemeral and harder to collect.
  • Information archivists are “capturing history before it disappears because of a broken link or outdated software.”
  • in a world where technology moves life at a breathtaking pace, “where information itself is a free-for-all, with traditional news sources going bankrupt and publishers in trouble, we need librarians more than ever’’ to help point the way to the best, most reliable sources.


Installing OAIS Software: Archivematica. Chris Prom. Practical E-Records. February 1, 2010.

One of several reports on open source tools the blog author is evaluating to help with ingest, storage, and access processes in archives. This post looks at Archivematica, and he likes the supportable model for facilitating archival work with electronic records. It is an Ubuntu-based virtual appliance which can exist alongside preservation tools on other systems. It can be installed locally and in a variety of ways. Worth looking into.


IBM announces massive NAS array for the cloud. Lucas Mearian. Computerworld. February 11, 2010.

IBM has announced SONAS, an enterprise-class network-attached storage array capable of scaling from 27TB to 14 petabytes under a single name space. It is designed to provide access to data anywhere any time. The policy-driven automation storage software allows an institution to predefine where data is placed, when it is created, where and when it moves to in the storage hierarchy, where it's copied for disaster recovery, and when it will be eventually deleted.