Friday, February 24, 2006

Weekly readings - 24 February 2006

The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections. Richard Anderson, et al. D-Lib Magazine. December 2005.

The Stanford Digital Repository is a set of services intended eventually to preserve any digital content deemed institutionally significant or valuable. The early process focuses on preserving a subset of the potential content. The group focused on tiered levels of service for digital preservation, which they hoped would provide the flexibility to adapt to changing technologies for managing and preserving digital materials and their descriptions. Technical metadata seemed easier to create automatically, but contextual metadata would be difficult. JHOVE was used for technical analysis and metadata creation. Sustainability and quality & functionality are the primary forces making any format "preferred". Their preferred formats by type are:

Plain text: ASCII, UTF-8;
Marked-up text: XML 1.0;
Image: TIFF 5.0 and above (uncompressed);
Page-Viewer: PDF (any version);
Audio: WAVE (linear pulse code modulation); and
the Video format was yet to be determined.

“An automated assessment process is clearly the only efficient means to collect technical information about large numbers of files.” The knowledge will help to measure risk, to negotiate with depositors, and to make decisions about the long term preservation. In general, their experience showed that practical tests are valuable for understanding the needs and making informed decisions.


UKWAC: Building the UK's First Public Web Archive. Steve Bailey, Dave Thompson. D-Lib Magazine. December 2005.

We depend on the internet in many ways, but we pay little attention to the long-term preservation of websites. Invaluable scholarly, cultural and scientific resources are being lost to future generations. Six leading UK institutions are working on a project to test selective archiving of UK websites. The group has chosen a modified version of the PANDAS software, developed by the National Library of Australia. Their goals include:

· To work collaboratively in the achievement of a common searchable archive of selected web sites, investigating solutions to issues such as selection, rights management and digital preservation

· To evaluate the development of the collaborative infrastructure for web archiving with regards to assessing the permanence and long-term feasibility of such a collaborative enterprise

Archiving web sites follows the basic archival principles of Selection, Acquisition, Description and Access. Individual partners in the group select web sites to be archived, but there are some additional steps to this process. The partners will check that someone else has not already selected a particular web site for archiving. When a new site is selected, basic metadata is entered into the central database; a group member then becomes responsible for that site's life cycle management. They seek explicit written permission from site owners before archiving a site. There are difficulties and challenges in archiving web sites. “Web archiving is not an exact science.” In spite of the difficulties, this has been an important project for digital preservation. It has shown that selective web archiving can be done through a consortium, and it has highlighted the fragility of web-based materials while offering a workable solution.


Mind the gap: assessing digital preservation needs in the UK. Martin Waller, Robert Sharpe. Digital Preservation Coalition, 2006.

This 'state of the nation' report shows that fewer than 20% of UK organizations surveyed have a strategy in place to deal with the risk of loss or degradation to their digital resources, even though there is a very high level of awareness of the risks and potential economic penalties. The survey shows that digital data loss is commonplace and seen by some as inevitable.

· Over 70% of respondents said data had been lost in their organization

· 87% recognized that key material could be lost

· 60% said that their organization would lose financially

· 52% of the organizations said there was management commitment to digital preservation

· Only 18% had a digital preservation strategy in place

Some high-profile instances include the decision that Morgan Stanley must pay over $1 billion for failure to preserve and hand over some documents required by the courts, and the 1975 Viking Lander mission data tapes, which have deteriorated despite careful storage and whose formats scientists are unable to decode. The principal risks to digital material are:

· the deterioration of the storage medium;

· obsolescence of hardware, software or storage format;

· failure to save crucial document format information, such as preserving tables of numbers without preserving an explanation of their meaning.

The report identifies 18 core needs and recommendations to address those needs. The needs include:

· Increase awareness of digital preservation issues especially among data creators;

· Take stock of digital materials (55% of those surveyed did not know what digital material they had);

· Fund digital preservation aspects of projects from the beginning;

· Increase funding for digital archives.

“Gone are the days when archives were dusty places that could be forgotten until they were needed. The digital revolution means all of us – organisations and individuals – must regularly review and update resources to ensure they remain accessible. Updating need not be expensive, but the report is a wake-up call to each one of us to ensure proper and continuing attention to our digital records.”

It is important to create long-term, pro-active preservation plans and to allocate adequate budget and resources to implementing practical solutions. “Organisations that create large volumes of digital information need to recognise the benefits of retaining long-term information in digital form so that these can be balanced against the costs of active preservation.”

Friday, February 17, 2006

Weekly readings - 17 February 2006

The Archive Ingest and Handling Test: The Johns Hopkins University Report. Tim DiLauro, et al. D-Lib Magazine. December 2005.

Johns Hopkins University (JHU) performed the Archive Ingest and Handling Test with two repository applications, DSpace and Fedora. Their model consisted of two classes of objects (data-group and data-item), each of which consisted of an identifier, a name, metadata, and group and item IDs. They used METS as a wrapper for the metadata for a digital object. They generated the Submission Information Package (SIP) and then ingested it into the repositories. The bulk ingestion was extremely slow; DSpace and Fedora had constraints on the process. Errors, memory limits, and server issues caused crashes. The size of the collection was also a factor. In the second phase, JHU exported the data to Stanford and imported the data from Harvard. Each of the participants chose a different approach for their dissemination package, but there would have been an advantage to having common elements in the process. In the format transformation phase, their goal was to create a flexible mechanism to allow systematic migration of specified content. They chose to migrate JPEGs to TIFFs; they added metadata about this conversion to the TIFF itself, in addition to the item metadata. The problems with DSpace and Fedora were reported.

Lessons learned included:
· The log files produced during ingest, export, and migration are important and should be structured so they can be used later.
· After a bulk operation there should be an easy way to roll back the changes.
· Memory consumption was a big issue.
· When processing the objects, it was necessary to access objects one at a time and write any intermediate results to disk.
· The export was complicated by storing the METS in each DataObject; instead it should have been assembled from the content and then reassembled during export.
· The configuration information for the repositories should have been stored in one config file and referred to symbolically.
· Be prepared for tools not working perfectly.
· The metadata provided was inconsistent.
· Format registries will form the basis for future automated processes for ingestion and management / migration of content already in repositories.
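
Two of the lessons above (structured log files, and handling objects one at a time with results written out as they are produced) can be illustrated with a minimal sketch. This is not the JHU code: the function name process_items, the handler callback, and the JSON-lines log layout are all assumptions made for illustration.

```python
import json
from typing import Callable, Iterable, TextIO


def process_items(items: Iterable[str],
                  handler: Callable[[str], str],
                  log: TextIO) -> int:
    """Process items one at a time and log one JSON record per item.

    `handler` is a hypothetical per-object operation (ingest, export,
    or migration). Returns the number of items that failed.
    """
    failures = 0
    for item_id in items:  # only one object is held at a time
        try:
            record = {"item": item_id, "status": "ok",
                      "result": handler(item_id)}
        except Exception as exc:  # keep going; one bad object should not stop the batch
            failures += 1
            record = {"item": item_id, "status": "error", "error": str(exc)}
        # One self-describing line per object, written immediately,
        # so the log can be parsed later or used to resume a run.
        log.write(json.dumps(record) + "\n")
        log.flush()
    return failures
```

Because each log line is a complete JSON record, a later script could read the file line by line to list failed objects or to decide which changes to undo.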


Archive Ingest and Handling Test: The Old Dominion University Approach. Michael L. Nelson, et al. D-Lib Magazine. December 2005.

Old Dominion University (ODU) was the only non-library to participate in the test. The focus was on
· self-archiving objects
· archive models & granularity
· archive export & import

They used the MPEG-21 Digital Item Declaration Language (DIDL) complex object format. They do not have an institutional repository, so the process is more of a pre-ingest phase. The metadata descriptors follow the general Dublin Core structure. The file processing workflow is represented in the article. The imported items are given a new identifier based on the MD5 of the file name. There is an additional MD5 checksum generated on the contents of the object for verification needs. Each object was processed with JHOVE to provide technical metadata, and also checked in the Format Registry Demonstrator. Format conversion is handled as a batch process for archived items, and the workflow is outlined. Through this process, they regarded the repository as “a preservation threat the data must survive.”
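
The two MD5 uses described above (one digest of the file name as an identifier, one of the contents for verification) can be sketched roughly as follows. This is an illustration only, not the ODU implementation: the function names, the UTF-8 encoding of the name, and the chunk size are assumptions.

```python
import hashlib
from pathlib import Path


def name_identifier(file_name: str) -> str:
    """Hex MD5 digest of the file name, used here as an identifier.

    Assumption: the name is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(file_name.encode("utf-8")).hexdigest()


def content_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Hex MD5 digest of the file contents, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_checksum: str) -> bool:
    """Return True if the file's current contents still match the stored checksum."""
    return content_checksum(path) == expected_checksum
```

A checksum recorded at import time can be compared against verify() after each transfer or migration to detect changed or damaged content.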


Plextor Ships Industry's First 18x DVD±R Burner. Press Release. Computer Technology Review. February 13, 2006.

Plextor has announced a DVD±R/RW CD-R/RW drive aimed at users who require reliability, high performance, and premium recording features. This drive has a recording speed of 18X DVD±R on certified 16X DVD±R media. It also supports lower speeds.


Plumbing and storing e-archives: an industry blooms. Brian Bergstein. January 31, 2006.

With the increase of digital communications, memos, presentations and other bits of data are increasingly finding their way into "electronic discovery" centers. "The big risk for companies is too much data that there's really no business need for, being kept in ways that if they had to go looking for it, would be uneconomic." "In litigation today, if e-discovery is done wrong, it can have huge implications." Other times evidence comes not from what's in a file, but from its "metadata" -- the automatically applied labels that explain such things as when a file was made, reviewed, changed or transferred.

Friday, February 10, 2006

Weekly readings - 10 February 2006

AIHT: Conceptual Issues from Practical Tests. Clay Shirky. D-Lib Magazine. December 2005.

The Archive Ingest and Handling Test (AIHT) is a project of the National Digital Information Infrastructure and Preservation Program (NDIIPP), sponsored by the Library of Congress. The idea is that by giving a complete digital archive to a variety of participants, we can better understand which aspects of digital preservation are general and which are institution-specific. The test was also intended to:

· test and assess the feasibility of transferring digital archives from one institution to another,
· document useful practices,
· discover which parts of the handling of digital material can be automated, and
· identify areas that require further research or development.

The George Mason University's collection of 9/11 materials was selected. Some of the lessons learned are:

· In ingest, the identifiers must be independently verified, such as determining whether files labeled .jpg are really in JPEG format.
· The desirability of a digital object will be more closely related to the value of its contents than to the quality of its metadata.
· Donors are likely to force preserving institutions to choose between accepting imperfectly described data or receiving no data at all.
· Even a small percentage of exceptions in operations on a large archive can create a large problem.
· Maintaining digital materials for the short term is relatively simple, but maintaining them for the long term is more difficult.
· It is possible for a preserved bit stream to become unusable because the complex system around it changes over time: hardware, software, OS, etc.
· Multiple preservation strategies provide the best hedge against unforeseen systemic failure.
· Using a variety of strategies to preserve data may be a better strategy than putting all efforts toward one single preservation system.
· There is a need for continual comparative testing of preservation tools and technologies.
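
The first lesson above, checking whether a file labeled .jpg is really in JPEG format, can be sketched by comparing the file's leading bytes against known signatures. This is a minimal illustration with a deliberately small signature table, not the method used by any AIHT participant; the function and table names are assumptions, and a real workflow would rely on a fuller tool such as JHOVE.

```python
from pathlib import Path
from typing import Optional

# Leading byte signatures for a few common formats (small, illustrative table).
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "pdf": b"%PDF-",
}

# Which format each file extension claims to be.
EXTENSION_TO_FORMAT = {
    ".jpg": "jpeg", ".jpeg": "jpeg",
    ".png": "png", ".gif": "gif", ".pdf": "pdf",
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return the format whose signature the header starts with, or None."""
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def extension_matches_content(path: Path) -> bool:
    """Return True only if the extension's claimed format matches the leading bytes."""
    claimed = EXTENSION_TO_FORMAT.get(path.suffix.lower())
    with path.open("rb") as handle:
        detected = sniff_format(handle.read(16))
    return claimed is not None and claimed == detected
```

A signature check of this kind only confirms the first few bytes; it says nothing about whether the rest of the file is well-formed or valid.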


Harvard's Perspective on the Archive Ingest and Handling Test. Stephen Abrams, et al. D-Lib Magazine. December 2005.

Harvard has operated a preservation repository for over 5 years which contains over 3 million objects, and takes over 12 TB of space. The repository is intended for “highly ‘curated’ digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats.” They intend to eliminate the restrictions and make it an institutional repository. Some of the issues they will face in the future include:

· Automated extraction of technical metadata from digital objects
· Automated generation of Submission Information Packages (SIPs)
· Systematic preservation migrations
· Post-migration quality assurance testing
· Metadata models for capturing provenance information

The input method used the JHOVE program to validate file types and to generate some of the technical metadata. They also chose to migrate image files to JPEG 2000. The data model did not provide a way to capture provenance metadata, but they decided they would look at PREMIS for this purpose. The format migration process was fully automated once the appropriate specifications were developed. In general, the test showed that in spite of scaling problems, “digital content can be transferred without loss between institutions utilizing radically different preservation architectures and technologies.” The limiting factor in transferring large amounts of data appears to be the number of objects, rather than their individual or total size.


Research books its place in the library of the future. IST Results. 1 Feb 2006.

Digital preservation is one of the three major research areas of the European Digital Library. Audiovisual material is particularly vulnerable to being lost due mostly to technological obsolescence. In the PrestoSpace project, they found 60 different video formats, which increases the problems. This project, with Media Matters, has created a method of transferring the obsolete media into digital data. Other parts of the project include a database listing all the known characteristics of types and years of video tapes, and an algorithm for the restoration of video and audio materials.


Technology victim: Western Union sends its last telegram. Todd R. Weiss. Computerworld. February 03, 2006.

Western Union has delivered its last telegram messages. This means of communication began over 155 years ago but has been replaced by other means of communication. Over 200 million telegrams were delivered in 1929, and only 20,000 were delivered last year.


Interview as learning tool… Michael Yunkin. digitize everything. February 3rd, 2006.

Some comments from an interview about digitization and preservation that are worth reading. Excerpts include: Digitization is not preservation, except possibly when the source material is fragile and using a digital surrogate can help avoid overuse. "The increased access that comes with digitization IS added value."

Friday, February 03, 2006

Weekly readings - 3 February 2006

Libraries fear digital lockdown. Ian Youngs. BBC News. 3 February 2006.

The British Library has warned that digital rights management (DRM) controls may block some legitimate uses of digital books and journals and that these works may not be accessible in the future when technology evolves. They estimate that by 2020, 90% of newly published work will be available digitally. If these works are locked, it could cause problems in the future. “It is probable that no key would still exist to unlock the DRMs. For libraries this is serious.” Libraries, as “custodians of human memory”, would need to “keep digital works in perpetuity and may need to be able to transfer them to other formats in order to preserve them and make the content fully accessible and usable once out of copyright.”


Film preservation. Stephen Galloway. Hollywood Reporter. Dec. 20, 2005.

The Academy of Motion Picture Arts and Sciences' Pickford Center for Motion Picture Study is housed in a 118,000-square-foot former television studio that has become a state-of-the-art hub for film preservation and restoration. The activity has increased dramatically recently because of government and private funding that has provided the means to keep film from becoming extinct. "There is more money and more public awareness about preservation in general." The center, which houses over 100,000 films, works on about 200-300 films per year and has improved or re-mastered 2,000 titles, 1,000 of which have had full restorations. There is also a commercial market for maintaining these films, so some of the studios are beginning to invest in preservation instead of turning the material over to archives or the public. Creating an inventory and finding the materials is the most challenging part of the process. Restoring the sound can be more challenging than restoring the picture on some movies. A major issue in preserving movies would be the creation of a standard to store the movies digitally instead of on celluloid. The Digital Motion Picture Archival Project, comprised of studio and archival experts, met in June.


Lossless JP2 Meta-Standard for Video Archiving. Websites. January 18, 2006.

These two websites contain information on JPEG 2000 and video archiving. The first is a new forum for discussing the documents and issues. The second site contains some documents that explain the MPEG-A process, an overview of the approach for creating a new Meta-Standard for Long-Term Digital Video Preservation, and the first draft of the standard.
Follow-on activity - Towards an MPEG-A meta-standard for lossless JPEG-2000-based video compression.


Trends in archiving. Raju Buddharaju. Computerworld (Singapore). December 2005.

There are a number of issues to be considered when archiving web sites. The technological issues include obsolescent formats and access mechanisms. Legal and organizational issues include copyright and the management of intellectual property. Economic issues include long-term viability and the cost of web archiving. There are some archiving models that look at these factors; the models look at closed and open archives, and also look at collection schemes such as selective, thematic, or deposit and domain archives.

There are four major archiving tasks: a search and selection process, an ingest process, a storage facility, and access mechanisms. The tools used for each of these tasks vary for each model. For example, for closed web archives, the archived web sites are pre-selected, and with a closed group, resolving copyright issues is simpler. Open archives need policy guidelines on which web sites to archive, how much to gather, what the copyright processes are, and how often to get the web site information. Ingest mechanisms could include converting original formats into archival formats to make storage and access easier.

There are a number of web archiving initiatives in place, such as Pandora in Australia, the Austrian Registration of On-Line Archive Project, Preservation and Access to Czech Resources on the Internet, Net Archive Project in Denmark, EVA project in Finland, BnF’s project for web archiving in France, Warp project in Japan, Paradigma project in Norway, Kulturarw3 project in Sweden, the Domain UK project, and Minerva in the US. Some focus on archiving the country’s specific domains. But in general at least 40 per cent of the links in these web sites refer to web sites outside of the country’s specific domains. As an alternative, international or consortium-based web archiving approaches are used. These projects archive websites by either making an archival copy or taking a snapshot of the web site periodically.
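
The point about links leaving a country's domains can be made concrete with a small sketch that measures the share of a page's absolute links pointing outside a given national suffix. This is an illustration only, not how the cited figure was produced; the function name external_link_share and the choice to ignore relative links are assumptions.

```python
from typing import Iterable
from urllib.parse import urlparse


def external_link_share(links: Iterable[str], national_suffix: str) -> float:
    """Fraction of absolute links whose host lies outside `national_suffix`.

    `national_suffix` is a domain suffix such as "uk" or "org.uk".
    Relative links (no host) are ignored; returns 0.0 if none are absolute.
    """
    suffix = national_suffix.lower().lstrip(".")
    total = 0
    external = 0
    for link in links:
        host = urlparse(link).hostname
        if host is None:  # relative link: stays on the same site
            continue
        total += 1
        if host != suffix and not host.endswith("." + suffix):
            external += 1
    return external / total if total else 0.0
```

A domain-restricted crawl that measured a high value here would be missing much of the context its archived pages depend on, which is the case for the consortium-based approaches mentioned above.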