This blog contains information related to digital preservation, long term access, digital archiving, digital curation, institutional repositories, and digital or electronic records management. These are my notes on what I have read or been working on. I enjoyed learning about Digital Preservation but have since retired and I am no longer updating the blog.
Wednesday, November 16, 2016
A Doomsday Scenario: Exporting CONTENTdm Records to XTF
Because of budgetary concerns, the Illinois State Library asked Andrew Bullen to explore how their CONTENTdm collections could be migrated to another platform. (The Illinois Digital Archives repository is based on CONTENTdm.) He chose methods that would allow him to quickly migrate the collections using existing tools, particularly PHP, Perl, and XTF, which they use as the platform for a digital collection of electronic Illinois state documents. The article shows the Perl code he wrote, the metadata, and record examples, and walks through the process. He started A Cookbook of Methods for Using CONTENTdm APIs. Each collection presented different challenges and required custom programming. He recommends reviewing the metadata elements of each collection, normalizing like elements as much as possible, and planning which elements can be indexed and how faceted browsing could be implemented. The test was to see whether the data could be reasonably converted, so not all parts were implemented. In a real migration, CONTENTdm's APIs could be used as a data transfer medium.
Tuesday, June 28, 2016
Protecting the Long-Term Viability of Digital Composite Objects through Format Migration
The poster discusses work done at Emory University’s Manuscript, Archives, and Rare Book Library to "review policy on disk image file formats used to capture and store digital content in our Fedora repository". The goal was to migrate existing disk images to formats more suitable for long-term digital preservation. Trusted Repositories Audit & Certification (TRAC) requires that digital repositories monitor changes in technology in order to respond to them. Advanced Forensic Format offered a good solution for capturing forensic disk images along with disk image metadata, but Libewf by Joachim Metz, a library of tools to access the Expert Witness Compression Format (EWF), has replaced it. They have decided to acquire raw disk images, or when that is not possible, to use tar files, because these disk images may be less vulnerable to obsolescence.
In attempting to migrate formats, they had to develop methods for migrating the files and set up the repository to accept the new files. They also rely on PREMIS metadata. The migration of disk images from a proprietary or unsupported format to a raw file format has made it easier for them to manage and preserve these objects and mitigates the threat of obsolescence for the near term. There have been some consequences. Some metadata is no longer available. Also, the process will be more complicated and require other workflows, and files will no longer contain embedded metadata. "The migration to a raw file format has made the digital file itself easier to preserve."
Wednesday, April 06, 2016
Validating migration via emulation
"Automated migration of content between files of different formats can often lead to content being lost or altered." Verifying the migration of content is mostly a manual process, and when done for a large number of objects it is not cost-effective. A possible way to do this is to automatically migrate to preferred formats as much as possible and give users the option of working with the object in the “original” software through an emulation service. The users could look at both the migrated and emulated versions and verify that the migrated object is valid. By involving multiple users, the migrated object becomes a trusted object.
If this were done together with migration or emulation on demand, then validated digital objects could be separately ingested into a digital preservation system and preserved along with the original version. This could reduce the storage of migrated versions by "only preserving 'validated' migrated versions" and also ensure that trusted content was "available and properly preserved".
Saturday, February 27, 2016
Back in a Flash
Flashback is a proof of concept project run by the British Library’s Digital Preservation Team that examines emulation and migration solutions as methods for preserving the content on CD, DVD, 3.5” and 5.25” disks. The team acquired original hardware for their legacy lab to analyze and deal with content from those formats. They have found that the old hardware can have problems. The first step is a capture process which extracts data from the storage media, characterizes its physical components, and lists the files on the media. The content can then be placed in a controlled environment that ensures that the bits are retained regardless of deteriorating storage media. The technical information about the content is important for preservation planning.
For less complex content such as text, the solution is to migrate the file from old or obsolete formats to more contemporary and reliable formats. The large majority of the content, though, is so "tightly bound up with its original environment that it cannot be migrated", which is the case for software. For these, the option is to emulate the item’s original hardware and software environment, which was supplied by the University of Freiburg via BwFLA – Emulation As A Service. Flashback is gathering data about the performance and viability of emulating groups and comparing characteristics of the software on original hardware and emulators.
Friday, November 13, 2015
Alternatives for Long-Term Storage Of Digital Information
This is the poster and abstract that Dr. Lunt and I created and presented at iPres 2015. The most fundamental component of digital preservation is storing the digital objects in archival repositories. Preservation repositories must archive digital objects and associated metadata on an affordable and reliable type of digital storage. There are many storage options available; each institution should evaluate them in order to determine which options are best for its particular needs. This poster examines three criteria in order to help preservationists determine the best storage option for their institution:
- Cost
- Longevity
- Migration Time frame
Thursday, October 22, 2015
Preparing for format migration
The presentation begins with terms and definitions of digital preservation, obsolescence, fixity, migration, refreshing, and formats. Formats include hardware, software, media, and systems. The purpose of migration is:
- Avoid media failure
- Avoid obsolescence
- Benefit from new technologies
- “Data migration success rates are never 100%”
- Successive storage/migration cycles accumulate failures, data corruption and loss.
- Even if data migration is flawless, repeated migrations will take their toll on the data: “the nearly universal experience has been that migration is labor-intensive, time-consuming, expensive, error-prone, and fraught with the danger of losing or corrupting information.”
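To make the point about accumulating failures concrete, here is a minimal sketch in Python. It assumes that each migration cycle preserves a given object intact with the same probability, independently of the other cycles; that independence assumption, the function name, and the 99.9% rate in the usage note are my own illustrations, not figures from the presentation.

```python
def survival_probability(per_cycle_success: float, cycles: int) -> float:
    """Chance that one object comes through every migration cycle intact.

    Assumes (hypothetically) that each cycle succeeds independently with
    the same probability, so the per-cycle rates simply multiply.
    """
    if not 0.0 <= per_cycle_success <= 1.0:
        raise ValueError("per_cycle_success must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return per_cycle_success ** cycles


def expected_losses(object_count: int, per_cycle_success: float, cycles: int) -> float:
    """Expected number of objects lost or corrupted after all cycles."""
    return object_count * (1.0 - survival_probability(per_cycle_success, cycles))
```

For example, calling expected_losses(1_000_000, 0.999, 10) estimates how many of a million objects would be damaged after ten cycles at a hypothetical 99.9% per-cycle success rate. Even a rate that looks very good for a single migration compounds into a noticeable loss over several cycles.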
Tuesday, September 22, 2015
Taking Control: Identifying Motivations for Migrating Library Digital Asset Management Systems
"Digital asset management systems (DAMS) have become important tools for collecting, preserving, and disseminating digitized and born digital content to library patrons." This article looks at why institutions are migrating to other systems and in what direction. Often migrations happen as libraries refine their needs. The literature on the migration process and its implications is limited; this article provides several case studies of repository migration. A presentation by Lisa Gregory "demonstrated the important role digital preservation plays in deciding to migrate from one DAMS to another and reiterated the need for preservation issues and standards to be incorporated into the tools and best practices used by librarians when implementing a DAMS migration". Repository migration gives institutions the opportunity to move from one type of repository, such as home grown or proprietary, to another type. Some of the reasons that institutions migrated to other repositories (by those ranked number 1) are:
- Implementation & Day-to-Day Costs
- Preservation
- Extensibility
- Content Management
- Metadata Standards
Response | Num. | % |
28 | 98 | |
JPEG | 26 | 90 |
MP3 | 22 | 76 |
JPEG2000 | 21 | 72 |
TIFF | 21 | 72 |
MP4 | 19 | 66 |
MOV | 17 | 59 |
CSV | 16 | 55 |
DOC | 13 | 45 |
DOCX | 12 | 41 |
For metadata, they wanted the new system to support multiple metadata schema; administrative, preservation, structural, and/or technical metadata standards; local and user created metadata, and linked data. In addition, METS and PREMIS were highly desirable.
The new system should support, among others:
- RDF/XML
- Ability to create modules/plugins/widgets/APIs, etc.
- Support DOIs and ORCIDs
- generate checksum values for ingested digital assets.
- perform fixity verification for ingested digital assets.
- assign unique identifiers for each AIP
- support PREMIS or local preservation metadata schema.
- produce AIPs.
- integrate with other digital preservation tools.
- synchronize content with other storage systems (including off site locations).
- support multiple copies of the repository — including dark and light (open and closed) instances.
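The checksum and fixity requirements in the list above can be sketched with Python's standard library. This is only an illustration of the idea, not how any particular DAMS implements it; the function names, the choice of SHA-256, and the manifest layout (relative path mapped to hex digest) are my assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Generate checksum values at ingest: map each file's relative path to its digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_fixity(root: Path, manifest: dict[str, str]) -> list[str]:
    """Re-check stored files against the manifest; return paths that are missing or changed."""
    failures = []
    for relative_name, expected in manifest.items():
        target = root / relative_name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_name)
    return sorted(failures)
```

In use, build_manifest would run once when assets are ingested and the manifest would be stored with the preservation metadata; verify_fixity would then run on a schedule, and an empty list means every file still matches its recorded checksum.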
Tuesday, August 11, 2015
Digital Heritage: Semantic Challenges of Long-term Preservation
This is an excellent article on long-term preservation. It argues that a period of 100 years constitutes an appropriate temporal frame of reference for addressing the problem of semantic aging. Ongoing format migration currently constitutes the best option for temporal scaling at the semantic level.
Digital preservation focuses on finding solutions that scale well along the temporal dimension. In the pre-digital world, the preservation of written records over long periods of time depended on several factors:
- The record needs to be preserved physically
- The ability to read the record and language need to exist
- There must be a community that still shows interest in the record
- Media aging: Any medium that carries a digital encoding will physically deteriorate until it is no longer possible to recover the original bit stream.
- Semantic aging: The evolution of data formats and the fact that knowledge about data semantics quickly disappears if not specified explicitly.
- Cultural aging: The community loses interest in some content; once documents are no longer retrieved, maintained, and transmitted, their loss is almost unavoidable.
- find strategies to access digital contents from the past 50 years in spite of aging factors
- plan the preservation of currently accessible digital content for future use over the next 50 years
"Taking cultural ageing seriously means to abandon the idea that digital preservation operates like a time capsule. The picture of content that is enclosed in a digital capsule to be opened at some moment in the future is misleading because it is not the past that sends messages to the future. Rather, it is the present that makes choices, selecting content from the past and linking to it. This ongoing process of linking from the present into the past makes up digital heritage."
Related posts:
- The Twelve Principles of Digital Preservation (and a cartridge in a repository…)
- Investing in Curation. A Shared Path to Sustainability. Final RoadMap
- Preserving Our Digital Heritage
- How Long Is Long-Term Data Storage?
- Storage Trends Around Computex 2015
- Fighting entropy and ISIL, one image at a time
- Dataliths vs. the digital dark age
- Millenniata Announces Results of ISO/IEC 10995 Standard Tests
- Start-up to release 'stone-like' optical disc that lasts forever
Monday, August 10, 2015
One downside to digital innovation: as formats die, we lose our past
Flash, proprietary animation software made by Adobe, used to be a leading platform for multimedia, but it has fallen out of favor due to security and compatibility issues. The death of Flash may destroy Flash content by making it not only obsolete but irretrievable. When formats become obsolete, something is often lost in the changeover. "That’s not an inevitable factor of age – books, an unusually obsolescence-resistant format, have remained accessible for hundreds of years. But for many other technologies, continued survival means shedding the past." This has happened to audio and video before, such as with VHS tapes. "Access to obsolete video formats will always be constrained by the fact that they require an older, tricky-to-source piece of hardware."
"But as early users move into middle age and beyond, we can’t expect our youth – digital or otherwise – to be accessible forever. We’re aging, but the internet’s aging too." We may need to migrate the digital content to other locations to be more shareable and compatible. "But changes in look and outlook don’t erase the past; it remains as a monument to an obsolete age. Changes in technology sometimes do."
Related posts:
- Why Media Preservation Can’t Wait: the Gathering Storm
- Enjoy your digital films and videos while you can... before they disappear
- Video Games and the Curse of Retro
- Meeting the Challenge of Media Preservation: Strategies and Solutions.
Friday, July 31, 2015
Rosetta Customer Testimonial - Jennifer L. Thoegersen, University of Nebraska–Lincoln
Jennifer Thoegersen, Data Curation Librarian at the University of Nebraska–Lincoln, talks about her experience using Rosetta to manage and preserve different types of digital content, and its impact at UNL. The challenge they faced was that digital materials throughout the library and the campus were being backed up, but they wanted to do more to actively preserve and manage the materials far into the future. Libraries have been tasked to be the gatekeepers for the information. They have many different types of content, such as research data, audiovisual content, born digital content, websites, and digitized images. They have moved content from CONTENTdm into Rosetta.
One of the things she really likes about being a Rosetta user is that the Rosetta User Community is very helpful. The group provides insights into working with different types of situations and challenges, and they share code as well. The major benefit for UNL is the ability to validate their content, monitor their digital assets over an extended period of time, and tailor the system to meet their needs. Rosetta is an open, extendable, and customizable digital preservation system. The implementation team worked well, and they have also been able to work with the system developers to suggest improvements and have those changes added to the system.
Friday, May 15, 2015
What Do We Mean by ‘Preserving Digital Information’? Towards Sound Conceptual Foundations for Digital Stewardship
Preserving digital information is a fundamental concept in digital and data stewardship. This dissertation explains what successfully ‘preserving information’ really is, and provides a framework for understanding when and why failures might happen and how to avoid them. The lack of a formal analysis of digital preservation is problematic. Some notes and quotes from the dissertation:
- "At a high level of generality, bit preservation means enabling the possibility for the same (set of) bit sequence(s) to be discriminated at different points in time, and, potentially, across changes in the underlying storage technology."
- Bit level preservation is a means, not the goal, in digital stewardship.
- As suggested by the OAIS definition of digital preservation, successful digital preservation is about “maintaining” or “preserving” information.
- Preserving information appears to be a metaphorical expression where a complex set of requirements needs to be satisfied in order for an agent to be presented with intended information
- The best contemporary theories of digital preservation do not focus on the preservation of any sort of object, but rather on preserving access.
- It is impossible to preserve a digital document as a physical object. One can only preserve the ability to reproduce the document.
- “You cannot prove that you have preserved the object until you have re–created it in some form that is appropriate for human use or for computer system applications.”
- “digital records are not stable artefacts”; they last only when certain circumstances are met
- Bit preservation is only the first required step for successful digital stewardship. Interpreting the bits such that an intended digital material obtains through appropriate performances is essential as well.
- Successful digital preservation of information can be conceived as sustained and reliable communication mediated by digital technology and agents involved in the communication process.
Saturday, February 28, 2015
OxGarage Conversion
An interesting web tool from the University of Oxford for converting documents to different formats. OxGarage is a web-based, RESTful service for transforming documents between a variety of formats, which uses the Text Encoding Initiative format as a pivot format. The initial option is to select:
- Documents
- Presentations
- Spreadsheets
Monday, December 08, 2014
Agreement Elements for Outsourcing Transfer of Born Digital Content.
The article Swatting the Long Tail of Digital Media: A Call for Collaboration (2012) held that few institutions would be able to have the hardware, software, and expertise to be able to read all digital media types. A group of archival practitioners started a pilot project to test outsourcing of the transfer of content from physical media they couldn’t read in-house. They realized the need for agreements between repositories and service providers to spell out the terms of such collaboration. The group began compiling a list of elements that should be considered when creating these agreements.
This article suggests elements to consider when creating an agreement for outsourcing the transfer of born-digital content from a physical medium, while encouraging adherence to both archival principles and technical requirements. The main areas are:
- General Provisions: desired outcome, description of work, responsibilities and liabilities
- Information Supplied by Service Provider: handling instructions
- Information Supplied by Client: content, inventory
- Statement of Work: processing, exceptions, documentation, delivery, acceptance
- Cost and Liability: schedule of costs and charges, responsibilities of each party
Friday, October 17, 2014
Safeguard the Future of Your Data: Digital Preservation Technology for the U.S. Federal Market.
Hitachi’s Digital Preservation Platform (HDPP) is a non-magnetic storage solution that has the ability to preserve unlimited amounts of data for decades on end with minimal migration. The projected capacity of the storage solution is 1 PB per rack by the end of 2014. Offline media is also supported.
Cost-efficiency is another factor when considering long-term preservation. Traditional archives use a migration strategy that requires regular media refreshing which has proven to be costly over time. Migration is an ongoing process that takes a significant amount of resources.
Blu-ray optical media and M-DISC media ensure longevity and compatibility across generations of technology so the data can still be accessible as formats continue to evolve. Blu-ray discs are projected to reach 1 TB per disc. M-DISC capacity is currently 25 GB per disc, with plans for 300 GB per disc. The brochure also includes quick specs and diagrams.
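As a rough way to relate the per-disc and per-rack figures, the arithmetic can be sketched as below. The brochure does not say which disc capacity the 1 PB per rack projection is based on, so the pairing of figures here is my own assumption, as are the function name and the use of decimal units (1 GB = 10^9 bytes).

```python
BYTES_PER_GB = 10**9   # decimal units, as storage vendors usually quote them
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15


def discs_needed(total_bytes: int, disc_capacity_bytes: int) -> int:
    """Smallest number of discs whose combined capacity covers total_bytes."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if disc_capacity_bytes <= 0:
        raise ValueError("disc_capacity_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large values.
    return -(-total_bytes // disc_capacity_bytes)
```

For instance, discs_needed(BYTES_PER_PB, 25 * BYTES_PER_GB) gives the number of 25 GB discs needed to hold 1 PB, and swapping in 300 * BYTES_PER_GB or BYTES_PER_TB shows how much the count shrinks at the planned capacities. This ignores file system overhead and any redundancy copies.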
Sunday, July 14, 2013
Edward R. Murrow's audio essays with the famous -- and not-so-famous -- have been digitized and put online
Over 800 oral essays from Edward R. Murrow's 1950s radio series, This I Believe, have been placed online for public use by Tufts University. The audio collection comes from almost 800 reel-to-reel tape recordings "that were nearly lost forever due to natural wear and tear from more than 50 years in less than ideal storage." The engineers captured the analogue recordings in high-resolution WAV format at 96 kHz, 24-bit.
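To get a feel for what that capture setting means for storage, here is a small sketch of the size of uncompressed PCM audio. It assumes the quoted figures mean a 96 kHz sample rate at 24 bits per sample; the channel count and durations are not given in the article, so those parameters and the function name are hypothetical, and the small WAV header is ignored.

```python
def pcm_audio_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> int:
    """Approximate size in bytes of uncompressed PCM audio data."""
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a whole number of bytes")
    if min(sample_rate_hz, bit_depth, channels) <= 0 or seconds < 0:
        raise ValueError("parameters must be positive (seconds may be zero)")
    bytes_per_sample = bit_depth // 8
    # One sample per channel at every tick of the sample clock.
    return round(sample_rate_hz * bytes_per_sample * channels * seconds)
```

For example, pcm_audio_bytes(96_000, 24, 1, 60) gives the size of one minute of mono audio at this setting. Multiplying by the length and number of recordings gives a first estimate of the storage a collection like this needs for its preservation masters.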
Thursday, September 20, 2012
Swatting the Long Tail of Digital Media: A Call for Collaboration.
Archiving born digital content stored on a wide range of physical media types requires specialized knowledge, expertise, and equipment to read and preserve the content on physical media, ranging from punched cards to flash drives. In general, transferring content from a particular physical medium requires a compatible computer that can read the data in the format that is stored on the medium, but also other hardware and software components, such as cables and drivers. A community-based approach could establish software and workstations for antiquated technology (SWAT) sites where a few institutions acquire and maintain the technology and expertise to read data and transfer content from particular types of obsolete media.
Tuesday, May 01, 2012
Preserving Moving Pictures and Sound.
This excellent report is for anyone with responsibility for collections of sound or moving image content and an interest in preservation of that content. For audiovisual materials, digitization is critical to the survival of the content because of the obsolescence of playback equipment and decay and damage of physical items, whether analogue or digital.
The basic technology issue for audio/visual content is to digitize all items on the shelves, either for preservation or access. The risk of loss is high. Another issue is moving content from the current media to digital files. A third issue is preserving the digital files. This report describes the techniques for preservation planning, digitization and digital preservation of audiovisual content, and describes the technologies. Preservation of these materials is difficult because they are physically, culturally, and economically different.
The report explains signals and carriers. "Digital technology produces recordings that are independent of carriers. Carrier independence is liberation". Digital preservation of the digitized signal means preserving the numbers, but also the technology to decode the numbers. ‘Maximum integrity’ means keeping the full quality of the audio and video: ‘As far as possible, the new preservation copy should be an exact replica of the original: the content should not be modified in any way’. This may be difficult to achieve.
The two basic kinds of preservation action are: 1) changing the audiovisual content within a collection, or normalization; 2) changing the system that holds the collection.
There are four main factors in an analogue or digital conservation program:
- packaging (wrappers), handling and storing;
- environmental conditions;
- protecting the masters; and
- condition monitoring, maintaining quality.
The four PrestoPRIME requirements for effective access to time-based media are:
- granularity: division of the content into meaningful units;
- navigation: the ability to select and use just one unit;
- citation: the ability to cite a point on the time dimension of an audio or video file, with a permanent link
- annotation: the ability of a user of content to make time-based contributions
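For the citation requirement in the list above, one existing convention for pointing at a time range inside an audio or video file is the temporal dimension of W3C Media Fragments URIs (a `#t=start,end` fragment, in seconds). The report summary does not say that PrestoPRIME uses this convention, so the sketch below is only an illustration; the function name and the example.org address are hypothetical.

```python
from typing import Optional
from urllib.parse import urldefrag


def _format_seconds(value: float) -> str:
    """Render seconds with up to millisecond precision and no trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def cite_time_range(url: str, start_seconds: float, end_seconds: Optional[float] = None) -> str:
    """Build a link to a point or range on the timeline of a media file."""
    if start_seconds < 0:
        raise ValueError("start_seconds must be non-negative")
    if end_seconds is not None and end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    base, _ = urldefrag(url)  # drop any fragment already on the link
    fragment = "t=" + _format_seconds(start_seconds)
    if end_seconds is not None:
        fragment += "," + _format_seconds(end_seconds)
    return f"{base}#{fragment}"
```

For example, cite_time_range("https://example.org/interview.mp4", 90, 120.5) produces a link to the passage from 1:30 to 2:00.5. The citation is only as permanent as the base link, so this complements a persistent identifier for the file rather than replacing one.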
Thursday, December 08, 2011
A literature review: What exactly should we preserve? How scholars address this question and where is the gap
There are generally two approaches to long-term preservation of digital materials:
- preserving the object in its original form as much as possible along with the accompanying systems,
- migration or transformation: transforming the object to make it compatible with more current systems but retaining the original “look and feel.”
The characteristics of digital objects that must be preserved over time in order to ensure the continued accessibility, usability, and meaning of the objects, and their capacity to be accepted as evidence of what they purport to record.
An important goal of digital preservation is more than just retrieving the objects; it is to ensure the authenticity of the information. A digital object can change as long as the final output is what it is expected to be. The properties to preserve come from the purpose of the object, and at least one purpose for the object needs to be defined. Archivists have created standards that look at records in the context of their creation, intended use and preservation. It is important to ask what features of the object are important when delivering to the user. There may be many uses by many communities that were not intended by the object creator, so we should not let the ideal limit the reasonable.
Friday, April 30, 2010
Digital Preservation Matters - April 30, 2010
Digital Preservation: An Unsolved Problem. Jonathan Shaw. Harvard Magazine. April 27, 2010.
With the advantages of digital, why do libraries not embrace the digital future now? One of the main obstacles is the issue of preservation. For books: "the greatest risks to printed material are the environment, wear and tear, security, and custodial neglect." For digital: using data is one of the best ways to preserve it because you know it is usable; digital data must be read and checked constantly to ensure integrity. Another concern about digital is that current formats may not be readable in the future (reference to June 2009 New Yorker cover). Born digital materials are not as easy to save since they have many different formats. This is difficult for librarians keeping records of the university's intellectual life, because of both the legal and digital challenges. "We are in a period of unprecedented lack of documentation of academic output."
---
Gutenberg 2.0. Harvard's libraries deal with disruptive change. Jonathan Shaw. Harvard Magazine. April 27, 2010.
In the scientific disciplines, information, from online journals to databases, must be recent to be relevant. Books in libraries to some seem more like a museum. Some think that massive digital projects will make research libraries irrelevant. The future of libraries is clearly digital. "Yet if the format of the future is digital, the content remains data. And at its simplest, scholarship in any discipline is about gaining access to information and knowledge." Access to the information will mean different things and be done in different ways. In the meantime, "Who has the most scientific knowledge of large-scale organization, collection, and access to information? Librarians."
How do we deal with large scale collections and the access to the information? "We ought to be leveraging that expertise to deal with this new digital environment. That's a vision of librarians as specialists in organizing and accessing and preserving information in multiple media forms, rather than as curators of collections of books, maps, or posters." The role of libraries isn't going away, but it is changing.
The idea that libraries will be stewards of vast data collections raises very serious concerns about the long-term preservation of digital materials. The worry is that the longevity of the resources has not been tested. There are 3 copies of the 109 TB Harvard repository. It is in a constant process of checking and refreshing to make sure everything is readable.
---
The Floppy is Dead: Time to Move Memories to the Cloud. Lance Ulanoff. PC Magazine. Apr 26, 2010.
The decision by Sony to stop producing 3.5-inch disks marks an end to that format. The end of any popular format can have a ripple effect on the technology world. If the data is not migrated to later formats it could be "trapped on its obsolete format". All media will become obsolete sometime; it is the natural progression of technology. Since change is inevitable, the article suggests everyone consider cloud-based backup storage options. It suggests that this is better than storing data on eventually-to-be-obsolete media.
---
Google is not the last word in information. Lia Timson. Sydney Morning Herald. April 29, 2010.
Interesting article concerning primary and secondary sources, what is on the internet and how it gets there, special collections, etc.
- "Better still is the lesson and the realisation that information and history don't just appear on Google. Someone has to publish it onto the web, put it there in the first place."
- "As educators we must ask that assignment bibliographies include more than just "three websites". We must insist on a variety of media as sources, including interviews with real people, be they witnesses, historians or surviving relatives, and even insist on trips to the local library."
- … researching is much wider and deeper than searching online.
---
A Gentle Reminder to Special-Collections Curators. Todd Gilman. The Chronicle of Higher Education. April 29, 2010.
Article about a librarian's experience trying to use special collections. The "job is not to keep readers from your books but just the opposite: to facilitate readers' use of the collections."
---
Tuesday, February 09, 2010
Digital Preservation Matters - February 8, 2010
Online Recordkeeping: It's All in a Name. Mimi Dionne. Internet Evolution. February 2, 2010.
The born-digital record lifecycle has five stages, in chronological order: creation; distribution and use; storage and maintenance; retention; and disposition or archival preservation. All five stages are important. One of the best practices for born-digital records is uniform file naming protocols, including location, to encourage strong content management. These should align with the records retention policies. Organizations are better off if they select the information they need to retain and destroy what they don’t need. “The benefits of implementing a records program that includes regular records destruction have far-reaching influence not only on compliance issues and maintenance of a company’s IT environment but also the health of its budget.”
---
SPIE to Preserve E-Books in Portico. Press Release. Portico. 2 February 2010.
Portico has agreed with SPIE (the international society for optics and photonics) to preserve its collection of e-books, currently 93 items. It already participates with Portico to preserve its e-journals. Portico now holds over 34,000 e-books and over 10,000 e-journals. The SPIE has also announced the launch of their digital library, which includes 120 SPIE Press titles from the Field Guides, Monographs, and Tutorial Texts series.
---
Long-Term Preservation Of Web Archives – Experimenting With Emulation And Migration Methodologies. Andrew Stawowczyk Long. IIPC. December 2009. [54 p. PDF]
The decision to emulate or migrate is largely based on personal beliefs, rather than on any particular evidence. We do not know which of these is more useful in the long term. All objects change over time, so ensuring long-term, useful access to collections requires that we first define the most important aspects of an object that need to be preserved. The “Preservation Intent” may be useful for this: it is what the institution intends to preserve for any given digital object and for how long. Also needed are the creator’s intent, the contextual information, and the technical information.
Two possible approaches for institutions may be:
- preserve digital objects over the next twenty years;
- find means of preserving objects for longer.
Or an approach may include both: preserve items for 20 years while the search for longer preservation mechanisms continues. “Significant properties” means the properties of a digital object that are essential to the representation of the intended meaning of that object.
The author does not recommend either emulation or migration as a perfect solution to the problem at this current time. Also, their findings and recommendations include:
- There are no tools suitable for long-term preservation of very large web archives
- All preservation actions need to be based on a clearly defined “Preservation Intent”
- Migration and emulation offer some time extensions for short-term access to digital objects.
- Emulation seems to present higher risks as a long-term preservation methodology.
It is not possible to preserve it all. Priorities need to be established for practical, long-term preservation solutions. The best hope for adequate long-term preservation lies in continuous and systematic work, researching various preservation methodologies, and improving our understanding of the future use of web archives.
---
Is NAND flash about to hit a dead end? Lucas Mearian. Computerworld. February 4, 2010.
IM Flash Technologies has said that shrinking the technology much further may not be possible because of problems with bit errors and reliability. The number of electrons that can be stored in the memory cell decreases with each generation of flash memory, making it more difficult for the cells to reliably retain data.
---
CNRI Digital Object Repository™. Corporation for National Research Initiatives. 19 January 2010.
CNRI has developed a new version of its Digital Object Repository Software. It is open source, flexible, scalable, and secure, and has a suite that provides a common interface for accessing all types of digital objects. Redundancy is supported by a mirroring system with software to ensure that replicated objects are kept in sync.