Showing posts with label formats. Show all posts

Thursday, July 25, 2019

The Library of Congress 2019-2020 Recommended Formats Statement


     The Library of Congress has released the 2019-2020 Recommended Formats Statement. This version provides some valuable updates to the sections on Moving Image Works and Audio Works in particular. The goal of the Recommended Formats has always been to provide useful information furthering the shared goal of ensuring the preservation of and long-term access to creative works.  By providing up-to-date information about the file types, physical and technical characteristics and associated metadata which support these worthy goals, the Statement hopes to provide the building blocks upon which libraries can build their collections, now and for the future.

The Library remains committed to acquiring and preserving digital works and to providing whatever support it can to other similarly committed stakeholders.  "We shall continue to build our collections with their preservation and long-term access firmly in mind; and we shall continue to engage with others in the community in efforts such as the Recommended Formats Statement".  "And we shall continue to engage in an annual review process to ensure that it meets the needs of all stakeholders in the preservation and long-term access of creative works."


Friday, December 14, 2018

In-House Digitization with the Lossless FFV1 Codec At the University of Notre Dame Archives. AMIA Poster

In-House Digitization with the Lossless FFV1 Codec At the University of Notre Dame Archives. Erik Dix and Angela Fritz, University of Notre Dame Archives. AMIA 2018. Poster. [pptx].
     An interesting poster at AMIA which shows their digitization workflow and processing steps from accessioning to preservation system.
WHY FFV1 as a codec for Digital Preservation Masters?
1. Lossless compression (no quality loss)
2. A Standard Definition FFV1 file is ca. 46 % of the size of the uncompressed file.
    A High Definition FFV1 file is ca. 57 % of the size of the uncompressed file.
3. FFV1 is part of the FFmpeg project and open source
4. It is safe for long term preservation.
5. Encoding into FFV1 can be done with low cost Windows PCs.
6. The video is captured in FFV1 in real time.
7. Standard definition FFV1 files can be played with the VLC media player
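
The size ratios in point 2 can be turned into a rough storage estimate. The sketch below is only illustrative: it treats the poster's 46 % and 57 % figures as averages, and the function name and the `sd`/`hd` keys are assumptions made for this example.

```python
# Ratios quoted on the poster; treat them as rough averages, not guarantees.
FFV1_RATIO = {"sd": 0.46, "hd": 0.57}


def estimate_ffv1_gb(uncompressed_gb: float, definition: str) -> float:
    """Estimate the FFV1 file size in GB from the uncompressed size in GB."""
    return uncompressed_gb * FFV1_RATIO[definition]


if __name__ == "__main__":
    for definition in ("sd", "hd"):
        size = estimate_ffv1_gb(100, definition)
        print(f"100 GB uncompressed {definition.upper()} -> about {size:.0f} GB as FFV1")
```

Actual ratios depend on the content of each tape, so figures like these are best used for planning storage, not for predicting individual file sizes.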

Digitization Workflow

Accessioning as Processing:
  • Archives conducts a preliminary inventory, assigns collection code,  creates CMS record 
  • AV materials transferred to AV Archivist for a preservation and digitization assessment
  • Descriptive and technical metadata gathered
  • Analog materials reorganized and stabilized for long-term storage.
Basic Metadata Creation:
  • AV Archivist creates item–level metadata
  • Descriptive and technical metadata promote access and discoverability
  • Descriptive metadata added to finding aid and uploaded to the Archives, the IR, and ILS
Inspection & Prep of A-V materials:
  • Only requested AV items or at-risk items will be digitized 
  • Videotapes often require baking or splicing
VCRs without SDI output:
  • The digitization capture card uses SDI [Serial Digital Interface] connections. 
  • VHS, Betamax, and older professional formats, e.g. 1” type C, U-matic don’t have SDI outputs. 
  • A DPS-575 frame synchronizer is used to create an SDI signal from the S-video or output of these items
  • Basic color correction is done at this step if necessary. 
  • The SDI signal from the frame synchronizer is then split in two to feed a Windows PC for the creation of the FFV1 preservation file and to feed a Mac computer to create an Apple ProRes 422 mezzanine file.
VCRs with SDI output:
  • They have VCRs for the DV tape family from Mini DV up to DVCPro HD, DVCam, and HDV, as well as the Betacam tape family from Betacam to HDCAM that can output an SDI or HD-SDI signal. 
  • The signal is also split in two to simultaneously create an FFV1 file on a Windows PC and an Apple ProRes 422 file on a Mac.
Digital Preservation System:
  • They use an LTO tape library for the storage of their digitized files. 
  • Currently, the Archives is evaluating digital preservation systems for implementation. 
  • Archives capabilities will be expanded to provide digital preservation micro-services to ensure continued access to its digital collections.


Wednesday, December 12, 2018

Preservation of AV Materials in Manuscript Collections. Training for AV format identification and risk assessment. Actions to take


Preservation of AV Materials in Manuscript Collections; Internal Training.  Ben Harry. Brigham Young University. November 2018.
     The presentation is not yet available on the internet. Some notes from the training:
“There is now consensus among audiovisual archives internationally that we will not be able to support large-scale digitisation of magnetic media in the very near future. Tape that is not digitised by 2025 will in most cases be lost.”  -NFSA.gov.au, Oct. 2018

The problem with AV is Fragility:
  • Playback equipment is disappearing
  • Knowledgeable experts are disappearing
  • Materials breaking down
  • Untrained handling easily destroys materials
The solution to the fragility is to address materials in a timely manner:
  • Priority and Speed and Efficiency
  • Train transfer operators
  • Untrained handling easily destroys materials
A Challenge of AV is Neglect:
  • Unable to describe AV Content adequately in finding aids or catalogs. 
  • Requires certain level of specific knowledge of formats and physical carriers.
  • Requires machine to read information that may not be available
  • Time-consuming process for little reward
  • Expensive, unstructured, uncoordinated
To overcome the challenge:
  • Digitize material for description in basic processing
  • Time-consuming process for little reward
  • MUST be a lean process to minimize the effect upon processing
Audio-video preservation requires a certain level of specific knowledge. Staff must be trained to recognize and report AV formats. It is also important to have risk assessment guidelines to help make informed decisions. Efforts and resources should be coordinated to reduce confusion, prioritize and set goals, and unify proposals for equipment and manpower.


Actions to take:
  • Prioritize Formats for Migration / Reformatting
  • Maintain Transparent Records on Preservation and Access
  • Link Preservation and Access (one does not happen without the other)
  • Provide Curators with AV Assessment tools
  • Organize a Queue System to keep things equitable (what about 12 items per month, per curator? Adjust as Necessary)
  • Create Digital File Naming guidelines
  • Establish Access and Preservation format standards for AV materials:
 For Access and Preservation, the following standards will be used:

Audio Preservation
  • Preservation Format: PCM / wav, 96 kHz sampling, 24-bit depth. 1 GB/Hour
  • Access Copy: mp3.  Music: 256 kbps. Voice: 192 kbps.
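
As a rough check on the figures above, uncompressed PCM size follows directly from sample rate, bit depth and channel count. The sketch below assumes decimal gigabytes and takes the channel count as a parameter, because the training notes do not state it; the function names are made up for this example. With one channel, 96 kHz / 24-bit audio comes to a little over 1 GB per hour, which is consistent with the figure quoted, while stereo would be roughly double.

```python
def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bytes needed for one hour of uncompressed PCM audio."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * 3600


def compressed_bytes_per_hour(bitrate_kbps: int) -> int:
    """Bytes needed for one hour of constant-bitrate audio such as mp3."""
    return bitrate_kbps * 1000 // 8 * 3600


if __name__ == "__main__":
    for channels in (1, 2):
        gb = pcm_bytes_per_hour(96_000, 24, channels) / 1e9
        print(f"PCM 96 kHz / 24-bit, {channels} channel(s): {gb:.2f} GB per hour")
    for kbps in (256, 192):
        mb = compressed_bytes_per_hour(kbps) / 1e6
        print(f"mp3 at {kbps} kbps: {mb:.1f} MB per hour")
```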

Video Preservation: Standard Def
  • Preservation Format: ffv1 / mkv 720 x 486. 33 GB/Hour  
  • Access Copy: H.264 / mp4

Video Preservation: Hi Def
  • Preservation Format: ffv1 / mkv Native: 1080i / 1080p. 100 GB/Hour?  
  • Access Copy: H.264 / mp4

Film Preservation
  • Preservation Format: RGB ffv1 / mkv 1080i scan (MPS capability ceiling). 100 GB/Hour?  
  • Access Copy: H.264 / mp4

Archive and delivery methods:
  • Preservation: Rosetta
  • Access: various options are available. 


Monday, November 26, 2018

Preservation of AV Materials in Manuscript Collections. Training for AV format identification and risk assessment

Preservation of AV Materials in Manuscript Collections; Internal Training.  Ben Harry. Brigham Young University. November 2018.
     Ben Harry, Curator of Audiovisual Materials and Media Arts History at Brigham Young University, provided some internal training concerning AV format identification and risk assessment. Here are some assessment tools for AV materials.

Saturday, October 14, 2017

Personal Digital Archiving Guide Part 2: Media Types and File Formats

Personal Digital Archiving Guide Part 2: Media Types and File Formats.  Scott David Witmer. Bits and Pieces. August 15, 2017.
     This helpful follow-up post focuses on “born-digital” files created on the computer, and the "characteristics of digital file formats that you should consider when deciding how to preserve your digital materials".  For information on digitizing, it links to guides and handouts, including scanning, recommendations, audio conversion, video conversion, storage, and others.

"The best time to think about preservation is before you create your files." Making decisions early, including organization and metadata, will make it easier to preserve digital files over time.  The post reviews:
  • The trade-off between quality and size of digital files
  • Lossless versus lossy compression
  • File Formats by Media Type
  • Formats for Text, Image, email, audio, video
 Metadata is also important. Be consistent and descriptive when naming or grouping files.

Friday, September 15, 2017

Preservation with PDF/A

Preservation with PDF/A (2nd Edition). Betsy A Fanning. DPC Technology Watch Report 17-01. July 2017. [PDF 34pp.]  [Link updated]
     This report is an updated edition of the original Technology Watch Report 08-02, Preserving the Data Explosion: Using PDF (Fanning, 2008). It looks at PDF/Archive as a digital document file format for long-term preservation. The PDF/A versions of the PDF format have been developed as a family of open ISO Standards to address preservation of PDF files by removing features that pose preservation risks. It is important for preservation purposes to know how closely a file conforms to the requirements defined in the standard. There are preservation risks that may exist in the standard PDF file format:
  • any file type can be embedded;
  • the primary document can be conformant as a static document, but the embedded files may not be static;
  • embedded files may be infected by computer viruses;
  • embedded files may have extended metadata requirements, may introduce unexpected dependencies or be subject to format obsolescence;
  • embedded files may complicate matters relating to information security, data protection or the management of intellectual property rights.
By restricting some risk features and thus reducing preservation risks, the PDF/A format seeks to maximize:
  • device independence;
  • self-containment;
  • self-documentation.
Some reasons why an organization might use PDF/A to preserve its digital documents include:
  • its standardized format for storing digital documents for long periods of time;
  • it allows for digitally signed documents using the very latest digital signature software;
  • it reliably displays special characters for mathematics and languages since all are embedded within the file;
  • it displays correctly on any device as the author intended, including the reading order;
  • platform independence;
  • provision of fully searchable documents through Optical Character Recognition.
History and Features of PDF and PDF/A. The Standard was drafted in multiple parts in order to make it easier to implement. "Unfortunately, the committee’s philosophy of multiple parts resulted in confusion in the market place, making it more difficult for users to select the optimum file format." Users may need to do a file format assessment based on their requirements to help them decide which PDF/A Standard to implement.

Metadata helps to manage a file effectively throughout its life cycle, as well as assisting in document discovery searches. "Establishing a long-term digital document preservation system requires careful consideration of the metadata that will be needed to locate and render documents years from now." Collecting metadata for PDF/A documents is optional in the standard, except for the identifier, which is generated when the PDF/A file is created. Preservation metadata should:
  • be appropriate to the materials;
  • support interoperability;
  • use standardized controlled vocabulary;
  • include clear statements on the conditions and terms of use;
  • be authoritative and verifiable;
  • support the long-term management of the document.
Just because a file purports to be a PDF/A does not necessarily mean that it is. Format validation of a file can increase confidence that a viewer will be able to render the file correctly.  A number of PDF/A validators are available. The development work on the PDF Standards is a continuing effort. There are additional preservation challenges in the format that are in the process of being addressed.
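
The difference between claiming and conforming can be made concrete. A PDF/A file declares its part and conformance level in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below only reads that declaration from the raw bytes; it assumes the metadata is stored as readable text in the file, the function name is made up for this example, and it does not validate anything. A file that passes this check still needs a real PDF/A validator.

```python
import re

# Match both the attribute form (pdfaid:part="1") and the
# element form (<pdfaid:part>1</pdfaid:part>) of the XMP properties.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONFORMANCE = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes):
    """Return the PDF/A level a file claims (e.g. 'PDF/A-2b'), or None.

    This reports the claim only; it says nothing about actual conformance.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONFORMANCE.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return f"PDF/A-{level}"
```

In a workflow, the claimed level could be recorded next to the validator's result, so that files that claim PDF/A but fail validation are easy to find.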

The report lists some recommendations, which are directed at groups that use the standard. They include:
  • For those evaluating PDF/A as a digital preservation solution:
    • Before adopting PDF/A as a preservation solution it is "essential to understand the organizational requirements and how PDF/A will support" the organization needs.
    • PDF/A is not a preservation solution on its own but a part of a wider preservation strategy. It must be consistent with other components of the preservation infrastructure, such as backups, integrity checks and documentation.
    • Different versions of PDF/A have different purposes, with different capabilities as well as different preservation risks. These should be understood and decisions should be documented and explained.
    • Different vendors offer different tools to manage PDF/A, and these should be compared against your requirements.
  • For organizations collecting and preserving digital data:
    • While it may not be possible to control or restrict how documents are produced, it may be useful to give document creators guidance on what is desired.
    • Embed PDF/A validation tools into preservation workflows and record the results to help manage the digital preservation risks associated with PDF/A files received.

Thursday, July 13, 2017

Integrating Research Data management and digital preservation systems at the University of Sheffield

Integrating Research Data management and digital preservation systems at the University of Sheffield. Chris Loftus. Digital Preservation Coalition. 31 May 2017.
     The University Library is leading the active management and curation of research data within the institution. This includes implementing a research data catalogue and repository powered by Figshare. They safeguard library collections and University assets using Rosetta, a digital preservation platform from Ex Libris. "We are now working with figshare and Ex Libris to integrate both services to provide seamless preservation of published research data across the research lifecycle." This will:
  • provide a complete lifecycle data management service for the university’s research community; 
  • identify, understand and act on risks associated with preserving data sets; 
  • better inform advice and guidance around use of data formats for sharing and preservation purposes; and 
  • encourage researchers to share their data more openly with others by guaranteeing the long term sustainability of that data.
Initial integration work uses the OAI-PMH protocol and METS packages to transfer content efficiently. Rosetta will be the dark archive, with figshare the interface for researchers and external users.
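
For readers unfamiliar with OAI-PMH, a harvest is driven by plain HTTP requests with a `verb` parameter. The sketch below builds a `ListRecords` request URL. The base URL and the `mets` metadata prefix in the usage line are placeholders chosen for this example; the post does not give the actual endpoint or prefix used at Sheffield.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and prefix, for illustration only.
    print(list_records_url("https://repository.example.org/oai", "mets", "2017-01-01"))
```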

File format issues: Research data is often in niche and proprietary formats. Of the material currently deposited in the archive, only a small percentage was recognised by a DROID survey. They will need to invest some time to identify and plan for these formats, and hopefully the work will be of use to the wider digital preservation community.

Metadata: They plan to improve the quality and volume of metadata accompanying research data. Material from researchers often lacks needed metadata, which can cause future data access issues. They are investigating solutions.

Monday, April 10, 2017

Encoding and Wrapper Decisions and Implementation for Video Preservation Master Files

Encoding and Wrapper Decisions and Implementation for Video Preservation  Master Files. Mike Casey. Indiana University. March 27, 2017.
     "There is no consensus in the media preservation community on best practice for encoding and wrapping video preservation master files." Institutions preserving video files long term generally choose from three options:
  • 10-bit, uncompressed, v210 codec, usually with a QuickTime wrapper
  • JPEG 2000, mathematically lossless profile, usually with an MXF wrapper
  • FFV1, a mathematically lossless format, with an AVI or Matroska wrapper
The few institutions digitizing and preserving video for the long term are roughly evenly divided between the three options above. This report examines in detail a set of choices and an implementation that has worked well for their institution. Originally they chose the first option, but with recent advances in FFV1, they reopened this decision and initiated a research and review process:
  • Exit strategy research and testing
  • Capture research (use FFmpeg within their system to generate FFV1 files).
  • Comparison of issues
  • Consultation with an outside expert
Results: In their research into exit strategies, they were able to move FFV1 files to another lossless codec with no loss of data. They decided to capture using FFmpeg, which requires developing a simple capture tool, and developed specifications for a minimal capture interface with FFmpeg for encoding and wrapping the video data.

Technical: They identified a number of key advantages to FFV1, including:
  • roughly 65% less data than a comparable file using the v210 codec
  • open source, non-proprietary, and hardware independent
  • largely designed for the requirements of digital preservation
  • employs CRCs for each frame, allowing any corruption to be associated with a much smaller digital area than the entire file
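
To make the encoding and wrapping choice concrete, the sketch below assembles an FFmpeg command line that transcodes an existing file to FFV1 version 3 in a Matroska wrapper with checksums enabled. It is not Indiana University's capture tool: the function name, the slice count and the choice to copy the audio stream are assumptions made for this example. In FFmpeg's FFV1 encoder the checksums are switched on with `-slicecrc` and are computed per slice within each frame.

```python
from typing import List


def ffv1_matroska_args(source: str, target: str) -> List[str]:
    """Build an ffmpeg argument list for an FFV1 / Matroska transcode."""
    return [
        "ffmpeg",
        "-i", source,
        "-c:v", "ffv1",
        "-level", "3",      # FFV1 version 3
        "-g", "1",          # every frame is a keyframe (intra-only)
        "-slices", "16",    # slice count per frame; an illustrative choice
        "-slicecrc", "1",   # store a CRC with each slice
        "-c:a", "copy",     # leave the audio stream untouched
        target,             # a .mkv extension selects the Matroska wrapper
    ]


if __name__ == "__main__":
    # Printing instead of running keeps the sketch free of side effects;
    # pass the list to subprocess.run() to perform the transcode.
    print(" ".join(ffv1_matroska_args("capture.mov", "preservation_master.mkv")))
```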
FFV1 appears to be "trending upwards among developers and cultural heritage organizations engaged in preservation work". They also chose the Matroska wrapper, which is an audiovisual container or wrapper format in use since 2002, and which is a more flexible wrapper option.

As more and more archives undertake video digitization, they will not accept older and limited formats (AVI or MOV), but will look for standards-based, open source options developed specifically for archival preservation. "Both FFV1 and Matroska are open source and are more aligned with preservation needs than some of the other choices and we believe they will see rapidly increasing adoption and further development."

Implementation: They developed a quality control program that checks the FFV1/Matroska preservation master files and validates that the output meets their specification for long-term preservation. These files are viewed using the VLC media player, a free open source cross-platform multimedia player that supports FFV1 and Matroska.

Currently, they have created over 38,000 video files using FFV1 and Matroska. "We have chosen two file formats that are open source, developed in part with preservation in mind, and on the road to standardization with tools in active development. We have aligned ourselves with the large and active FFmpeg community rather than a private company. While the future is ultimately unknowable, we believe that this positions us well for long-term preservation of video-based content."


Saturday, April 08, 2017

New Home and Features for Sustainability of Digital Formats Site

New Home and Features for Sustainability of Digital Formats Site.  Kate Murray, Jaime Mears. The Signal. April 6, 2017.
     The Library of Congress web site, Sustainability of Digital Formats, contains "the technical aspects of digital formats with a focus towards strategic planning regarding formats for digital content, especially collection policies." The formats are divided into the type of object, which includes:
  • still image, sound, textual, moving image, web archive, datasets, geospatial and generic formats
The website shows the relationships between formats, including the sustainability factors and the quality and functionality for each content category.
  • Disclosure
  • Adoption
  • Transparency
  • Self-documentation
  • External dependencies
  • Impact of patents
  • Technical protection mechanisms
The new website is at loc.gov/preservation/digital/formats and it now includes
  • The PRONOM ID and the Wikidata Title ID, both of which help to document the formats, and 
  • The Library of Congress Recommended Formats Statement
The digital formats site continues to evolve to meet the Library’s and the digital preservation community’s changing needs.

Wednesday, December 14, 2016

PDF/A as a preferred, sustainable format for spreadsheets?

PDF/A as a preferred, sustainable format for spreadsheets?  Johan van der Knijff. johan's Blog. 9 Dec 2016.
     National Archives of the Netherlands published a report on preferred file formats, with an overview of their ‘preferred’ and ‘acceptable’ formats for 9 categories. The blog post concerns the ‘spreadsheet’ category for which it lists the following ‘preferred’ and ‘acceptable’ formats:
  • Preferred:  ODS, CSV, PDF/A     
  • Acceptable: XLS, XLSX
And the justification / explanation for using PDF:
PDF/A – PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A
There are some problems with the choice of PDF/A and its justification:
  • Displayed precision not equal to stored precision
  • Loss of precision after exporting to PDF/A
    • Also loss of precision after exporting to CSV
    • Use of cell formatting to display more precise data is possible but less than ideal
  • Interactive content
  • Reading PDF/A spreadsheets: This may be difficult without knowing the intended users, the target software, the context, or how the user intends to use the spreadsheet. 
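
The precision problem is easy to reproduce outside a spreadsheet. The sketch below assumes that an export writes each cell as it is displayed, which is the situation the blog post describes for PDF/A and CSV; the function names and the sample value are made up for this example.

```python
def export_as_displayed(value: float, decimals: int) -> str:
    """Render a stored value the way a cell formatted to `decimals` places shows it."""
    return f"{value:.{decimals}f}"


def precision_lost(value: float, decimals: int) -> bool:
    """True if reading the displayed text back does not recover the stored value."""
    return float(export_as_displayed(value, decimals)) != value


if __name__ == "__main__":
    stored = 3.14159265358979
    shown = export_as_displayed(stored, 2)
    print(f"stored: {stored!r}, displayed: {shown}, lost: {precision_lost(stored, 2)}")
```

Widening the cell format before export reduces the loss, which is the "possible but less than ideal" workaround noted in the list above.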
The justification states that some interactive functionality "will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A." However, deciding what functionality is ‘essential’ depends on the context and intended user base. In addition, the interactive aspect may imply that any spreadsheets that do not take any interaction with a user can be safely converted to PDF/A. But it may also be better to make a distinction between ‘static’ and ‘dynamic’ spreadsheets.

There may be situations where PDF/A is a good or maybe the best choice, but choosing a preferred format should "take into account the purpose for which a spreadsheet was created, its content, its intended use and the intended (future) user(s)."


Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context and also consider provenance information to determine actions. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; they use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about the digital objects as possible before ingest. But there are issues that occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction which is also format related
The use cases vary depending on the customers, types of material, and formats. They range from safeguarding research data for a limited period of time (ten years at minimum) to publishing and preserving data in the long term. Understanding the use cases’ characteristics helps provide "a better understanding of what actually matters most in each case."

Ideally, format identification should yield reliable and unambiguous information on the format of a given file; however, a number of problems make the process more complicated. Handling files on an individual basis does not scale well. This may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for lower preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats where no applications exist?
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file or compromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, and institutions need to decide if they will correct the issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.

Tuesday, November 22, 2016

Every little bit helps: File format identification at Lancaster University

Every little bit helps: File format identification at Lancaster University.  Rachel MacGregor. Digital Archiving at the University of York. 21 November 2016
     The post is about Rachel's work on identifying research data. It follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of research data and to inform their approaches to preservation. The summary of the statistics:
Of 24,705 files: 

  • 11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
  • 99.3% were given one file identification and 76 files had multiple identifications. 
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications. 
    • 50 of these were either 8-bit or 7-bit ASCII text files.  
    • The remaining 26 were identified by container as various types of Microsoft files.

Of the 11008 identified files:

  • 89.34% were identified by signature
  • 9.2% were identified by extension
  • 1.46% identified by container
When adjusted for the 7,000 gzip files, the percentages identified were:
  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
These results were different from York's results but not so dramatically.
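
The difference between identification by signature and by extension can be illustrated with a toy identifier. This is not how DROID works internally: DROID uses the PRONOM signature registry and also inspects containers, whereas the sketch below uses a handful of well-known magic numbers and a small extension table chosen for this example.

```python
import os

# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x1f\x8b": "GZIP",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP container",
}

# Fallback used only when no signature matches; much weaker evidence.
EXTENSIONS = {".csv": "CSV", ".txt": "Plain text", ".xml": "XML"}


def identify(name: str, head: bytes):
    """Return (format, method) for a file name and its first bytes."""
    for magic, fmt in SIGNATURES.items():
        if head.startswith(magic):
            return fmt, "signature"
    extension = os.path.splitext(name)[1].lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension], "extension"
    return None, "unidentified"
```

A file such as `cell` with no extension and no known signature falls through to `unidentified`, which mirrors the largest group in the Lancaster results.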

Only 38 were identified as having a file extension mismatch (0.3%), but closer inspection may reveal more.  Of these, most were Microsoft files with multiple identifications, along with a set of lsm files identified as TIFFs. 

In all, 59 different file formats were identified. GZIP was the most frequently occurring, followed by XML.

Files that weren't identified
  • There were 13697 files not identified by DROID of which 4947 (36%) had file extensions.  
  • The remaining 64% had no file extension. 
  • Top counts of unidentified file extensions: dat, data, cell, param
Gathering this information helps contribute towards our overall understanding of file format types. "Every little bit helps."

Wednesday, October 26, 2016

Research data is different

Research data is different. Simon Wilson. Digital Archiving blog. 5 August 2016.
     A blog post about some born digital archives at Hull.  It is not academic research data but instead comes from a variety of sources. By using DROID to look at 270,867 accessioned files they discovered the following:
  • 97.96% of files were identified by DROID 
  • 228 different format types were identified 
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%).  
  •   The top formats they found were:
    Microsoft Word Document (97-2003)               44.52%
    Microsoft Word for Windows (2007 and later)      5.63%
    Microsoft Excel 97 Workbook                      5.08%
    Graphics Interchange Format                      4.15%
    Acrobat PDF 1.4 - Portable Document Format       3.12%
    JPEG File Interchange Format (1.01)              2.72%
    Microsoft Word Document (6.0 / 95)               2.46%
    Acrobat PDF 1.3 - Portable Document Format       2.39%
    JPEG File Interchange Format (1.02)              1.83%
    Hypertext Markup Language (v4)                   1.67%
 The number of and type of formats they found in their collections was different from other institutions that had research data.  An important step is to then look at the identified file formats and determine a strategy to migrate that format. Knowing the number and frequency of the formats in the collections will allow efforts to be prioritized.
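
A ranking like the one above can be produced from any list of per-file format identifiers, for example the format column of an identification tool's export. The sketch below assumes the identifiers have already been read into a Python list; the function name is made up for this example.

```python
from collections import Counter


def format_shares(format_ids):
    """Return (format, count, percentage) tuples, most common first."""
    counts = Counter(format_ids)
    total = sum(counts.values())
    return [(fmt, n, 100 * n / total) for fmt, n in counts.most_common()]


if __name__ == "__main__":
    # Placeholder data; real input would hold one identifier per file.
    sample = ["fmt/40", "fmt/40", "fmt/40", "other"]
    for fmt, n, share in format_shares(sample):
        print(f"{fmt:10} {n:6} {share:6.2f}%")
```

Sorting by frequency puts the formats that cover the most files at the top, which is where migration planning effort pays off first.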


Monday, September 05, 2016

Preservation Challenges in the Digital Age

Preservation Challenges in the Digital Age. Bernadette Houghton. D-Lib Magazine. July/August 2016.
     The rapidly evolving digital preservation field has many preservation challenges:
  • Digital materials are more at risk than analogue
  • Preserving digital materials also means providing access to them
  • Ensuring the infrastructure that renders the file is preserved or replicated
  • Focal areas changing and best practices still under debate.
"The optimal preservation strategy for individual organisations will differ according to their requirements, resources and data type. Each strategy comes with its own set of challenges, many of which are dependent on, or impacted in some way by, other challenges. This article will cover what the author sees as the major challenges for digital preservation at this point in time, covering a range of technical, administrative, logistical and legal aspects."

Other challenges:
  • Data volumes. Digital storage is becoming cheaper, but not every file and every version of it can and should be stored or preserved. Selecting what to preserve and when to take preservative action becomes more complex with a larger volume of data and a wider range of storage media. This increases the risk of failing to preserve materials of historical value. There is also a higher risk of not finding data because of poor metadata.
  • Archivability. One of the most fundamental challenges in archiving is determining what should be preserved and the extent of preservation.
  • Multiplicities. Materials born digital today are likely to have multiple copies in multiple versions stored in multiple locations, possibly under multiple filenames and in multiple file formats.
  • Hardware and storage. Obsolescence, deterioration of media and hardware mechanical failure increase the risk of loss. The cloud is increasingly used for storage, but there are also significant issues with using it.
  • File formats. File formats were considered a big risk in digital preservation, but they have not proven to be the overwhelming danger they were initially perceived to be. Proprietary file formats continue to pose a challenge.
  • Metadata. Metadata is probably the most important aspect of digital preservation. Materials with poor metadata may be undiscoverable, and their authenticity, verifiability and their context unclear.
  • Legalities. Digital preservation presents some complex legal issues.
  • Privacy. Material chosen for preservation may contain private and confidential information, and its unauthorised release may lead to legal action.
  • Resourcing. Preservation costs involve not just the actual digitisation, but also storage, infrastructure, staff resourcing and training, ongoing maintenance and auditing of the digitised materials. There are also costs associated with providing access.
The challenge is to use scarce resources to preserve the most important materials, using the most cost-effective and efficient methods. Even choosing not to preserve materials involves costs. Those who will benefit most from current preservation programs are future generations, which makes it difficult to justify expenditure on digital preservation, since there is little current benefit. The "best that the preservation community can do with digital material is to make educated guesses based on a few decades of mostly anecdotal experience".

"The challenges in digital preservation involve dealing with not just the technologies of the past, but also those to come". The digital preservation field is developing rapidly and the people working with digital materials need to keep up with the changes.


Thursday, September 01, 2016

Digital Preservation: Keep calm and get on with it!

Digital Preservation: Keep calm and get on with it! Matthew Addis. Archives and Records Association 2016. 30 August 2016.
     This is a presentation about simple and practical steps towards digital preservation using open source tools and best practices. The benefits of a digital preservation strategy are increasingly clear, but implementing the strategy can be overwhelming. The presentation lists resources and tools, such as the Digital Preservation Coalition handbook, the COPTR tool website, DROID, and the Data Assessment Framework. Sometimes complex resources can also be overwhelming and make decisions more difficult. "If you think that you’re not able to ‘do enough’ or ‘do it properly’, then this can result in doing nothing because this feels like the next best thing." But doing nothing has serious consequences in the digital world. "It’s almost always better to get on and do something than it is to do nothing." The presentation also refers to ‘parsimonious preservation’, or starting with minimal actions. Understand what you have and try to keep it safe through safe copies. It is important to understand formats and to use the tools to keep the content safe. "File format identification gives the information needed to make decisions." Another important part is to start simple and add functionality as you go. The maturity model from the National Digital Stewardship Alliance is a good guide.
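
One minimal way to act on the "safe copies" advice is to confirm that each copy still matches the original by comparing checksums. The sketch below is an assumption about how that could be done with Python's standard library; it is not taken from the presentation, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copies):
    """Map each copy's path to True if its checksum equals the original's."""
    expected = sha256_of(original)
    return {str(copy): sha256_of(copy) == expected for copy in copies}
```

Running `copies_match` on a schedule and recording the results is a small first step that can be extended later, in keeping with the "start simple" approach.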

Tuesday, June 28, 2016

Protecting the Long-Term Viability of Digital Composite Objects through Format Migration

Protecting the Long-Term Viability of Digital Composite Objects through Format Migration. Elizabeth Roke, Dorothy Waugh. iPres 2015 Poster. November, 2015.
     The poster discusses work done at Emory University’s Manuscript, Archives, and Rare Book Library to "review policy on disk image file formats used to capture and store digital content in our Fedora repository". The goal was to migrate existing disk images to formats more suitable for long-term digital preservation. Trusted Repositories Audit & Certification (TRAC) requires that digital repositories monitor changes in technology in order to respond to them. Advanced Forensic Format offered a good solution for capturing forensic disk images along with disk image metadata, but Libewf by Joachim Metz, a library of tools to access the Expert Witness Compression Format (EWF), has replaced it. They have decided to acquire raw disk images or, when that is not possible, to use tar files, because these formats may be less vulnerable to obsolescence.

In attempting to migrate formats, they had to develop methods for migrating the files and set up the repository to accept the new files. They also rely on PREMIS metadata. The migration of disk images from a proprietary or unsupported format to a raw file format has made it easier for them to manage and preserve these objects and mitigates the threat of obsolescence for the near term. There have been some consequences. Some metadata is no longer available. Also, the process will be more complicated and require other workflows, and files will no longer contain embedded metadata. "The migration to a raw file format has made the digital file itself easier to preserve."



Friday, June 24, 2016

File-format analysis tools for archivists

File-format analysis tools for archivists. Gary McGath. LWN. May 26, 2016.
     Preserving files for the long term is more difficult than just copying them to a drive. Other issues are involved. "Will the software of the future be able to read the files of today without losing information? If it can, will people be able to tell what those files contain and where they came from?"

Digital data is more problematic than analog materials, since file formats change. Detailed tools can check the quality of digital documents, analyze the files and report problems. Some concerns:

  • Exact format identification: Knowing the MIME type isn't enough.
  • Format durability: Software can fade into obsolescence if there isn't enough interest to keep it updated.
  • Strict validation: Archiving accepts files in order to give them to an audience that doesn't even exist yet. This means it should be conservative in what it accepts.
  • Metadata extraction: A file with a lot of identifying metadata, such as XMP or Exif, is a better candidate for an archive than one with very little. An archive adds a lot of value if it makes rich, searchable metadata available.
Some open-source applications address these concerns, such as:
  • JHOVE (JSTOR-Harvard Object Validation Environment)
  • DROID and PRONOM
  • ExifTool
  • FITS File Information Tool Set
"Identifying formats and characterizing files is a tricky business. Specifications are sometimes ambiguous."  There are different views on how much error, if any, is acceptable. "Being too fussy can ban perfectly usable files from archives."

"Specialists are passionate about the answers, and there often isn't one clearly correct answer. It's not surprising that different tools with different philosophies compete, and that the best approach can be to combine and compare their outputs"


Wednesday, June 22, 2016

Five Star File Format Signature Development

Five Star File Format Signature Development. Ross Spencer. Open Preservation Foundation blog. 14 Jun 2016.
     Discussion about formats and the importance of developing identification techniques for text formats. DROID is a useful tool but it has its limitations. For those wanting to be involved in defining formats, there are five principles of file format signature development:
  1. Tell the community about your identification gaps
  2. Share sample files
  3. Develop basic signatures
  4. Where feasible, engage openly with the community
  5. Seek supporting evidence
Developing file format signatures is really reverse engineering.
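
To make "basic signatures" concrete, here is a minimal sketch of byte-sequence matching at a fixed offset. It uses a few widely documented leading byte sequences; the table layout and the `identify` function are illustrative only and are not the signature syntax used by DROID or PRONOM.

```python
# (label, offset, byte sequence) for a few well-known binary formats.
SIGNATURES = [
    ("PDF", 0, b"%PDF-"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("JPEG", 0, b"\xff\xd8\xff"),
    ("GIF", 0, b"GIF8"),
]


def identify(data):
    """Return the labels of all signatures that match the given bytes."""
    return [
        label
        for label, offset, magic in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]


print(identify(b"%PDF-1.3\n%..."))
print(identify(b"name,date\nfoo,2016-06-14\n"))
```

The second call returns an empty list: plain-text formats usually have no fixed leading bytes, which is one reason text identification needs other techniques.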

Saturday, March 26, 2016

Caring for file formats

Caring for file formats. Ange Albertini. Presentation at Troopers 2016. March 17, 2016. [PDF]
     The risk to preserving digital objects is very high. The "attack surface with file formats is too big". The specifications of formats are a nice guide, but they don't represent reality; they are useless for managing the formats. "We can’t deprecate formats because we can’t preserve and we can’t define how they really work."
The formats need good documentation to show the landscape and "to express the reality of file format".  Once they are better understood, then "we can preserve and deprecate older format, which reduces attack surface". Then people can focus on making the present formats more secure.

What is a file format? A computer dialect to communicate between communities; file formats are community connectors. People don't care about the format itself; they care about the characteristics and how easy it is to use. We don't need new formats, since reality will diverge from the specs anyway. The need is for up-to-date, traceable specs. Formats are constantly being updated with new features added. That doesn't solve the problem. Specs should reflect reality and be "updated, enforced, realistic, freely available". Deprecation is a natural cycle, but people are afraid to deprecate because "no file format is fully preserved". Formats should be open and the specs kept up to date. But it won’t happen until "we experience a great disaster".


Wednesday, March 16, 2016

File identification ...let's talk about the workflows

File identification ...let's talk about the workflows. Jenny Mitcham. Digital Archiving at the University of York. 27 November 2015.
     When adding files to a digital archive, an important question is "What file formats have we got here?" Knowing this can:
  • determine the right software to open the file and view the contents 
  • start the conversation with the data provider about what formats are best to use for archiving
  • discuss the risks of the format and define a migration pathway for preservation and/or access
There are many tools for working with formats; each tool has strengths and weaknesses. Defining a workflow can help determine how best to use these tools, how to interact with them, or if manual steps should be taken instead. File identification tools are often incorporated into digital preservation systems that may determine the workflow in using the tools. Additional workflow questions around format tools include:
  • what should happen if ingested data can't be identified?  
  • should the curator/digital archivist be able to over-ride file identifications?
  • what should happen if there is more than one possible identification for a file?
  • is there a sustainable manual identification process if tools cannot identify a file? 
  • how to contribute to file format registries such as PRONOM
  • is the digital preservation system configurable enough to resolve these questions? 
Their Archivematica development work is focusing in the first instance on allowing the digital curator to see a report of the files that are not identified in order to understand the problem.

[Our Rosetta system has a format library that handles these questions, as well as a user driven Format Working Group that helps resolve questions and interacts with PRONOM if there are questions, changes or new additions. - Chris]
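
The workflow questions above can be sketched as a simple routing rule. This is a hypothetical illustration, not how Archivematica or Rosetta behave: the action names and the `route_file` function are invented for the example.

```python
def route_file(candidates, curator_override=None):
    """Decide what to do with a file based on its identification results.

    `candidates` is the list of possible formats reported by the tools;
    `curator_override` is an optional format chosen manually by a curator.
    Returns an (action, format) pair.
    """
    if curator_override is not None:
        # A manual decision takes precedence over tool output.
        return ("accept", curator_override)
    if not candidates:
        # Nothing matched: list the file in a report for follow-up.
        return ("report_unidentified", None)
    if len(candidates) == 1:
        return ("accept", candidates[0])
    # More than one possible identification: ask a person.
    return ("manual_review", None)


print(route_file([]))
print(route_file(["PDF 1.4", "PDF 1.3"]))
print(route_file(["PDF 1.4", "PDF 1.3"], curator_override="PDF 1.4"))
```

Writing the rule down this way makes each policy choice explicit, for example whether a curator override should always win.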