Digital Preservation Matters: September 2008

Friday, September 26, 2008

Digital Preservation Matters - 26 September 2008

iPres 2008 Web archiving. Digital Curation Blog. 30 September 2008.

Thorsteinn Hallgrimsson: Web tools include the Heritrix crawler, Web Curator from BL/NLNZ, and Netarchive Curator Tool Suite from Denmark, plus access tools including NutchWAX for indexing and an open source version of the Wayback Machine. There are three main approaches to web archiving: bulk, selective based on criteria, and event-based, such as around an election or disaster.

Helen Hockx-Yu: The biggest problem is that legal deposit legislation is not yet fully implemented. Without a legislative mandate, permission-based archiving is slow, and the typical success rate is 25%.

Birgit Henriksen: Access to the web archive is only for research and statistical purposes, which is hard to implement. They do bulk harvests quarterly, harvest selected sites more frequently (sometimes daily), and also do event-based archiving.

Gildas Illien: Challenge of change: digital curator is not a new job; librarianship itself needs to change.

Colin Webb: Challenges are interconnected: what we want to collect, what we’re allowed to collect, what we’re able to collect, and what we can afford to collect.

Preservation Of Web Resources: The JISC PoWR Project. Brian Kelly, et al. UKOLN at iPres conference. 30 September 2008. Slides.

Challenges of web archiving: How do you select material? Is it the information or the ‘experience’ of the web page that is important? How can you move web documents between curatorial environments? “Even those who care about information persistence don’t necessarily do a good job of it on their Web sites.” Not everything on the web needs to be kept. The JISC PoWR (Preservation of Web Resources) project has created a blog and workshops to help develop best practices for web archiving. There are also legal challenges, which bring some risks.

Universities have business continuity interests that need to be protected, and an interest in protecting, managing and preserving certain types of web content: “websites may be a unique repository for evidence of institutional activity which is unrecorded elsewhere, and this is often unacknowledged”. If unique records are being created, stored and published on the web, then we must establish their authenticity as records and determine if they are trustworthy versions of pages.

There is also a responsibility to staff and students for things put on the web by the university. “Research interests are reflected in the increasing number of Web resources that have potential longevity and re-use value, a category that may include scientific research outputs and e-learning objects.” Web managers and record managers should cooperate on preserving the web content.

Concerning how to preserve the web environment: look at data import/export; what is the cost of migration; is this sustainable; what are the risks of loss or of not using the service. We need to raise awareness of these important issues. The project will deliver a handbook of web archiving.

A change in New Zealand’s copyright law may affect who owns software. An amendment to the Copyright Act was introduced that would repeal the commissioning rule for software developers.

The general rule is that the creator of an artistic work or software holds the copyright to it. The commissioning rule is an exception which means that the commissioner of a work is the default copyright holder. Under the current rule, software developers have no rights to code developed for clients unless there is a contract in place saying otherwise. If enacted, the amendment could make significant changes to the industry.

Friday, September 19, 2008

Digital Preservation Matters - 19 September 2008

When to shred: Purging data saves money, cuts legal risk. Mary Brandel. Computerworld. September 18, 2008.

Many organizations never throw away data unless they run out of storage space, and they increase the amount of data they hold by 20% - 50% each year. Not everything can or should be saved, so it is important to decide what should be kept and for how long. Many organizations should be saving less data. Data is growing faster than the cost of storage is declining. The cost of storing and backing up data, including multiple copies of data, is increasing, as is the cost of e-discovery for lawsuits, which can range from $1 million to $3 million per terabyte of data. Electronic records management can help value the data and determine the retention period.
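The growth and cost figures above compound quickly. As a rough sketch only, the following Python applies the quoted 20% to 50% annual growth and the quoted $1 million to $3 million per terabyte e-discovery range. The 10 TB starting volume, the five-year horizon, and the function names are assumptions for illustration and do not come from the article. It also assumes, simplistically, that every stored terabyte is subject to discovery.

```python
def projected_volume(start_tb, annual_growth, years):
    """Return the data volume in TB after compounding annual growth."""
    return start_tb * (1 + annual_growth) ** years


def ediscovery_cost_range(volume_tb, low_per_tb=1_000_000, high_per_tb=3_000_000):
    """Return (low, high) e-discovery cost in dollars for a volume in TB.

    The per-terabyte defaults are the range quoted in the article; applying
    them to the whole volume is a simplifying assumption.
    """
    return volume_tb * low_per_tb, volume_tb * high_per_tb


if __name__ == "__main__":
    start_tb = 10.0  # hypothetical starting volume
    years = 5        # hypothetical planning horizon
    for growth in (0.20, 0.50):
        volume = projected_volume(start_tb, growth, years)
        low, high = ediscovery_cost_range(volume)
        print(f"{growth:.0%} growth: {volume:.1f} TB after {years} years, "
              f"e-discovery exposure ${low:,.0f} to ${high:,.0f}")
```

Running the script prints one line per growth rate, which makes it easy to compare the effect of a shorter retention period by lowering `years` or `start_tb`.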

Edinburgh Repository Fringe. Website. August 2008.

This is a website of a ‘repository festival’ in Edinburgh which looks at repository issues, ideas, new perspectives, new projects, and interaction about repositories. It includes some documents, slides, and video streams of the discussions. A few items from the sessions:

Faculty repositories: variety of sources, aware they need to make data available, most stored on department servers or desktops, sharing is often by email, large datasets are a problem. They want them published on the web and find linking very useful.
They want a secure and user-friendly way to store and share research data, as well as the infrastructure to publish and preserve data.
We need to gather requirements, look at current and planned services, meet needs.
Promote favorable information: 87% said items found at the top of search results are seen as more authoritative

Poor E-Mail Archive Habits Plague Businesses. Leo King. Computerworld. August 31, 2008.

Research shows that employees do not properly archive e-mails because they are either too busy or are unsure how. Most employees do not receive guidance on how they should be archiving their email; many organizations do not have a policy.

30% said they had lost important documents
50% say email archiving is too time consuming
30% say it is too complicated
41% leave files attached to e-mails forever
50% have an enforced limit on their email storage
Over 25% save the files to the company system
28% save them to their hard drive

A lax approach or a failure to communicate policy wastes storage space and puts the organization at risk of losing important information.

Thursday, September 18, 2008

IMLS funds TIPR Demonstration Project

From: Priscilla Caplan
Thursday, September 18, 2008 9:54 AM

The Cornell University Library, New York University Libraries and the Florida Center for Library Automation are happy to announce the receipt of an IMLS National Leadership Grant for the demonstration project:
Towards Interoperable Preservation Repositories (TIPR).

The task of preserving our digital heritage for future generations far exceeds the capacity of any government or institution. Responsibility must be distributed across a number of stewardship organizations running heterogeneous and geographically dispersed digital preservation repositories. For reasons of redundancy, succession planning and software migration, these repositories must be able to exchange copies of archived information packages with each other. Practical repository-to-repository transfer will require a common, standards-based transfer format capable of transporting rich preservation metadata as well as digital objects, and repository systems must be capable of exporting and importing information packages utilizing this format.

The three TIPR partners run three technically heterogeneous, geographically distributed digital preservation repositories. Cornell University Library runs CUL-OAIS based on aDORe, New York University Libraries' Preservation Repository is based on DSpace, and the FCLA's Florida Digital Archive uses DAITSS. The TIPR partners will:
* design a shared transfer format based on METS and PREMIS schemas;
* enhance each of their preservation repository systems to support import and export of this information;
* test the actual transfer of processed and enriched archival information packages between the three repository systems.

The goals of the project are to:
* demonstrate the feasibility of repository-to-repository transfer of rich archival information packages;
* advance the state of the art by identifying and resolving issues that impede such transfers;
* develop a usable, standards-based transfer format, building on prior work;
* disseminate these results to the international preservation community and the relevant standards activities.

This two-year project will begin October 1, 2008.

Friday, September 12, 2008

Digital Preservation Matters - 12 September 2008

It’s Happening Now: This is the Tera Era of Data Storage. Larry Swezey. Computer Technology Review. 16 September 2008.

New visual and audio media drive storage capacities upward, but the new digital data explosion is very different. More data is being produced and it is becoming a more important part of all aspects of our lives. But the large files we see now are just the beginning. The size of files and the amount of data are increasing dramatically. The sizes are moving into the terabyte range [already there in many cases]. More data is being retained. AV items demand more storage. People expect large amounts of information to be available almost immediately.

Using METS, PREMIS and MODS for Archiving eJournals. Angela Dappert, Markus Enders. D-Lib Magazine. September/October 2008.

Many decisions need to be made about metadata, including structural and preservation metadata. The British Library is developing a system for ingest, storage, and preservation of digital material, with eJournals as the first content stream, and is developing a common format for the eJournal OAIS Archival Information Package (AIP). eJournals are complex and outside the control of the digital repository, so the repository cannot dictate the structure of submission packages, format standards and the like. This article shows one approach to defining an eJournal Archival Information Package. The system has a database that provides an interface for resource discovery and delivery. An archival store is a long-term storage component that supports preservation activities. All archival metadata is linked to the content and placed into the archival store. The archival metadata is represented as a hierarchy of METS files with PREMIS and MODS components that reference all content. Each manifestation of an article is stored in a separate METS file. There is no existing metadata schema that covers all the descriptive, preservation and structural metadata, so they use a combination of METS, PREMIS and MODS to create an eJournal Archival Information Package.
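To make the idea of a METS file carrying MODS and PREMIS components more concrete, here is a minimal Python sketch of a METS wrapper for one article manifestation. It is not the British Library's actual profile and is not schema-valid: the function name, IDs, title, file path, and the heavily simplified PREMIS and MODS content are all hypothetical, and the PREMIS namespace shown is assumed to be the version 2 one.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
PREMIS = "info:lc/xmlns/premis-v2"  # assumed PREMIS 2 namespace
XLINK = "http://www.w3.org/1999/xlink"

for prefix, uri in (("mets", METS), ("mods", MODS),
                    ("premis", PREMIS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def q(ns, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{ns}}}{tag}"


def wrap_xml(parent, mdtype):
    """Add mdWrap/xmlData under parent and return the xmlData element."""
    wrap = ET.SubElement(parent, q(METS, "mdWrap"), MDTYPE=mdtype)
    return ET.SubElement(wrap, q(METS, "xmlData"))


def build_manifestation_mets(title, file_href, format_name):
    """Return a METS root element describing one manifestation (illustrative)."""
    mets = ET.Element(q(METS, "mets"))

    # Descriptive metadata: MODS embedded in a dmdSec.
    dmd = ET.SubElement(mets, q(METS, "dmdSec"), ID="DMD1")
    mods = ET.SubElement(wrap_xml(dmd, "MODS"), q(MODS, "mods"))
    title_info = ET.SubElement(mods, q(MODS, "titleInfo"))
    ET.SubElement(title_info, q(MODS, "title")).text = title

    # Preservation metadata: a simplified PREMIS object in a techMD.
    amd = ET.SubElement(mets, q(METS, "amdSec"), ID="AMD1")
    tech = ET.SubElement(amd, q(METS, "techMD"), ID="TECH1")
    obj = ET.SubElement(wrap_xml(tech, "PREMIS"), q(PREMIS, "object"))
    chars = ET.SubElement(obj, q(PREMIS, "objectCharacteristics"))
    fmt = ET.SubElement(chars, q(PREMIS, "format"))
    desig = ET.SubElement(fmt, q(PREMIS, "formatDesignation"))
    ET.SubElement(desig, q(PREMIS, "formatName")).text = format_name

    # Content files, linked to their preservation metadata via ADMID.
    file_sec = ET.SubElement(mets, q(METS, "fileSec"))
    grp = ET.SubElement(file_sec, q(METS, "fileGrp"))
    f = ET.SubElement(grp, q(METS, "file"), ID="FILE1", ADMID="TECH1")
    ET.SubElement(f, q(METS, "FLocat"),
                  {"LOCTYPE": "URL", q(XLINK, "href"): file_href})

    # Structural metadata: ties the description to the file.
    struct = ET.SubElement(mets, q(METS, "structMap"))
    div = ET.SubElement(struct, q(METS, "div"), TYPE="article", DMDID="DMD1")
    ET.SubElement(div, q(METS, "fptr"), FILEID="FILE1")
    return mets


if __name__ == "__main__":
    root = build_manifestation_mets("Example article", "content/article.pdf", "PDF")
    print(ET.tostring(root, encoding="unicode"))
```

In a hierarchy like the one the article describes, a higher-level METS file for the article or issue would then point to one such file per manifestation.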

Introducing djatoka: A Reuse Friendly, Open Source JPEG 2000 Image Server. Ryan Chute, Herbert Van de Sompel. D-Lib Magazine. September/October 2008.

Support for the JPEG 2000 format is emerging in major consumer applications, and many consider it suitable for digital preservation. This article introduces djatoka, an open source JPEG 2000 image server with basic features, and the authors urge others to help develop it. Often the TIFF format is used for the high-resolution master and a derivative image is made available on the web. JPEG 2000 offers multiple resolutions, region extraction, lossless and lossy compression, and display can start without waiting for the entire file to be loaded. djatoka improves performance, supports many formats, allows manipulation of the image (such as watermarking), and works with OpenURL.

Friday, September 05, 2008

Digital Preservation Matters - 05 September 2008

Preserving Government Web Sites at the End-of-Term. Library of Congress Newsletter. September 3, 2008.

When political offices change, the websites often change dramatically in the transition. “Digital government information is considered at-risk.” The Internet Archive will undertake a comprehensive crawl of the .gov domain. The Library of Congress has been preserving congressional Web sites each month since December 2003 and will focus on developing this collection for the project. Others will focus on in-depth crawls of specific government agencies or will help select and prioritize the web sites to be included in the collection, as well as determine the frequency and depth of collecting.

Poor E-Mail Archive Habits Plague Businesses. Leo King. PCWorld. August 31, 2008.

Employees are failing to properly archive e-mails, according to research, because they are often too busy or too unsure of their IT skills. Most employees received no guidance on the requirements and methods for archiving e-mail; one third said their company has no e-mail policy. Also, a third of employees said they had lost important electronic documents and never recovered them. More than half said e-mail archiving is too time-consuming, and thirty percent find it "complicated" or "unreliable." This suggests that the organizations either do not archive emails or that they do not communicate the methods to their employees.

"Digital Preservation" term considered harmful? Chris Rusbridge. Digital Curation Blog. 29 July 2008.

The term ‘digital preservation’ may not be a useful term with decision makers. “The digital preservation community has become very good at talking to itself and convincing ‘paid-up’ members of the value of preserving digital information, but the language used and the way that the discourse is constructed is unlikely to make much impact on either decision-makers or the creators of the digital information (academics, administrators, etc.).” Part of the problem is that digital preservation describes a process, and not an outcome. We value the outcomes, not necessarily the processes used to reach them, and terminology that reflects this is more persuasive. Digital preservation has been over-sold as difficult, complex and expensive over the long term, while the term itself contains no notion of its own value. Phrases like "long term accessibility" or "usability over time" are better than the process-oriented phrase "digital preservation".

European Archive. Website. September 2008.

The European Archive is a digital library of cultural artifacts in digital form. They provide free access to researchers, historians, scholars, and the general public. The site contains web archives, videos, and plans to add audio recordings. The Living Web Archives project will carry Web archiving beyond the current approach, characterized by static snapshots, to one that fully accounts for the dynamics and interrelations of Web content.