
Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context, taking provenance information into account, to determine what actions to take. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; the authors work with Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about digital objects as possible before ingest. But issues occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction which is also format related
 The use cases vary depending on the customers, types of material, and formats, ranging from safeguarding research data for a limited period (ten years at minimum) to publishing and preserving data for the long term. Understanding the use cases' characteristics helps provide "a better understanding of what actually matters most in each case."
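
The poster does not prescribe any particular tooling, but a small pre-ingest triage script illustrates the kind of assessment involved. The sketch below is hypothetical: it reads a DROID CSV export and flags files that need manual attention before ingest. Column names such as PUID, FORMAT_COUNT, and EXTENSION_MISMATCH follow DROID's CSV export but should be checked against the version you run.

    # Hypothetical triage of a DROID CSV export: flag files with no
    # format match, multiple candidate formats, or an extension that
    # contradicts the identified format.
    import csv

    def triage(droid_csv_path):
        todo = []
        with open(droid_csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if row.get("TYPE") != "File":
                    continue  # skip folder rows
                if not row.get("PUID"):
                    todo.append((row["FILE_PATH"], "no format match"))
                elif int(row.get("FORMAT_COUNT") or 1) > 1:
                    todo.append((row["FILE_PATH"], "multiple candidate formats"))
                elif row.get("EXTENSION_MISMATCH", "").lower() == "true":
                    todo.append((row["FILE_PATH"], "extension mismatch"))
        return todo

    for path, reason in triage("droid_export.csv"):
        print(f"{reason}: {path}")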

Ideally, format identification should yield reliable and unambiguous information on the format of a given file; in practice, a number of problems complicate the process. Handling files on an individual basis does not scale well, which may mean that unsatisfactory decisions must be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for lower preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats where no applications exist?
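
The poster offers no algorithm for answering these questions, but as a thought experiment the criteria above can be encoded as a repeatable triage rule, which at least makes a policy explicit and auditable. In the minimal sketch below, all field names, decisions, and their ordering are hypothetical; a real policy would be set per institution and per use case.

    # A minimal, hypothetical encoding of the decision criteria.
    from dataclasses import dataclass

    @dataclass
    class FormatIssue:
        usable: bool          # file opens as expected in standard software
        tool_error: bool      # error known to be caused by the tool itself
        understood: bool      # the error itself is actually understood
        documented_fix: bool  # a documented, low-risk correction exists
        long_term: bool       # long-term preservation, not 10-year retention

    def decide(issue: FormatIssue) -> str:
        if issue.tool_error:
            return "ignore: document the tool error, report it upstream"
        if not issue.understood:
            return "investigate: do not alter the file yet"
        if issue.documented_fix and issue.long_term:
            return "correct: apply the documented fix, keep the original"
        if issue.usable and not issue.long_term:
            return "accept: document the issue and ingest as-is"
        return "escalate: ask the data producer / manual review"

    print(decide(FormatIssue(usable=True, tool_error=False,
                             understood=True, documented_fix=False,
                             long_term=False)))
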
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file orcompromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, but institutions need to decide whether they will correct such issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.
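
One cautious pattern consistent with that stance is to extract embedded metadata read-only and store it beside the file, rather than rewriting the file itself. A minimal sketch using ExifTool's standard -json output (assuming exiftool is on the PATH; the sidecar naming convention is my own):

    # Read embedded metadata without touching the original file, then
    # record it in an external sidecar file.
    import json
    import subprocess

    def extract_metadata(path):
        out = subprocess.run(["exiftool", "-json", path],
                             capture_output=True, text=True, check=True)
        return json.loads(out.stdout)[0]  # one dict per input file

    meta = extract_metadata("example.tif")
    with open("example.tif.metadata.json", "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)  # original file stays untouched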

Tuesday, August 11, 2015

Digital Preservation Tools on Github.

Digital Preservation Tools on Github. Chris Erickson. Blog. August 2015.
     While looking for a particular tool, I came across several others that look interesting. I have not yet tried them, but this is a reminder that I need to check into them.
  • epubcheck: a tool to validate EPUB files. It can detect many types of errors in EPUB: OCF container structure, OPF and OPS mark-up, and internal reference consistency are all checked. EpubCheck can be run as a standalone command-line tool or used as a Java library (a minimal batch-checking sketch follows this list).
  • preservation-tools: bundles a number of preservation tools for various file types in a modular way. Includes:
    • PdfHeaderChecker (detects the software used to create a PDF),
    • PdfAValidator (checks via PDFBox whether a PDF/A file is valid; runs through a folder and processes only PDF/A files),
    • iTextRepairPdf (takes a PDF file and copies its content page by page to a new, PDF/A-1-conformant PDF file),
    • PdfToImageConverter (converts the PDF files in a given folder to JPEGs, page by page),
    • PdfTwinTest (compares two PDF files line by line and outputs the differences; handy for post-migration quality checking)
  • wail: Web Archiving Integration Layer (WAIL). A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
  • db-preservation-toolkit: allows conversion between database formats, including connection to live systems, for the purpose of digitally preserving databases. The toolkit can convert live or backed-up databases into preservation formats such as DBML or SIARD, XML-based formats created for the purpose of database preservation, and can convert the preservation formats back into live systems to restore the full functionality of the databases. For example, it supports a specialized export into MySQL, optimized for phpMyAdmin, so the database can be fully explored through a web interface.
  • DPFManager: an open source, modular TIFF conformance checker that is easy to use, to integrate with existing and new projects, and to deploy in a multitude of different scenarios. It is designed to help archivists and digital content producers ensure that TIFF files are fit for long-term preservation, and it can automatically suggest improvements and correct preservation issues. The team developing it has decades of experience with image formats and digital preservation, and has leveraged the support of 60+ memory institutions to draft a new ISO standard proposal (TIFF/A) specifically designed for long-term preservation of still images. An open source community will be grown over the project's lifetime, and additional commercial services will be offered to make DPF Manager self-sustaining and increase its adoption.
  • PreservationSimulation: provides baseline data for librarians and researchers about long-term survival rates of document collections. The project has developed computer simulations to estimate document failure rates over a wide variety of conditions; the resulting data should be useful to stewards of such collections in planning and budgeting the storage and bandwidth needed to protect their collections.
  • flint: facilitates configurable file/format validation. Its underlying architecture is based on the idea that file/format validation almost always has a specific use case with concrete requirements that may differ from validation against the official industry standard for a given format, and its principal design ideas were implemented to match such requirements.
  • excel: a tool published on GitHub by data curator John McGrory (University of Minnesota) that addresses how best to retain formulas and other essential components of spreadsheets such as Excel. In their data repository, they use the tool each time a dataset is submitted and zip the resulting files as the "Archival Version of the Data." Download the software at http://z.umn.edu/exceltool; a description of what the tool does is at http://hdl.handle.net/11299/171966
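
As an example of putting one of these tools into a workflow, here is a minimal sketch that batch-checks a folder of EPUBs with EpubCheck, mentioned above. The jar path and folder name are placeholders; EpubCheck's documented command-line usage is java -jar epubcheck.jar <file>, and it exits non-zero when errors are found.

    # Batch-validate EPUB files with EpubCheck and report pass/fail.
    import pathlib
    import subprocess

    EPUBCHECK_JAR = "epubcheck.jar"  # placeholder path to the jar

    def check_epub(path):
        result = subprocess.run(
            ["java", "-jar", EPUBCHECK_JAR, str(path)],
            capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    for epub in pathlib.Path("incoming").glob("*.epub"):
        ok, report = check_epub(epub)
        print(f"{'PASS' if ok else 'FAIL'}: {epub}")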

Friday, March 27, 2015

Siegfried v 1.0 released (a file format identification tool)

Siegfried v 1.0 released (a file format identification tool). Richard Lehane. Open Preservation Foundation. 25th Mar 2015. Siegfried, a file format identification tool, is now available in version 1.0. Its key features are:
  • complete implementation of PRONOM (byte and container signatures)   
  • reliable results
  • fast matching without limiting the number of bytes scanned
  • detailed information about the basis for format matches
  • simple command line interface with a choice of outputs (YAML, JSON, CSV)
  • a built-in server for integrating with workflows 
  • options for debug mode, signature modification, and multiple identifiers.
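
The JSON output and simple command-line interface make Siegfried easy to wrap in scripts. A minimal sketch, assuming the sf binary is on the PATH; the exact JSON field names ("files", "matches", "id", "warning") should be verified against the Siegfried version you run:

    # Identify a file with siegfried and print its PRONOM matches.
    import json
    import subprocess

    def identify(path):
        out = subprocess.run(["sf", "-json", path],
                             capture_output=True, text=True, check=True)
        report = json.loads(out.stdout)
        for f in report.get("files", []):
            for m in f.get("matches", []):
                print(f["filename"], m.get("id"), m.get("format"),
                      "warning:", m.get("warning") or "none")

    identify("example.pdf")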

Friday, March 20, 2015

When checksums don't match...

When checksums don't match... Digital Archiving at the University of York. 2 February 2015.
Post about an example of files that had MD5 errors. Various utilities were used to generate checksums with both MD5 and SHA-1. One program showed a change while another did not; the SHA-1 checksums, however, confirmed that the files differed. Possibly an example of bit rot.
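
Incidents like this are a good argument for computing more than one digest per file. A minimal fixity check using only Python's standard library, reading each file once and feeding both hashes (file names here are placeholders):

    # Compute MD5 and SHA-1 in a single pass and compare against
    # previously stored values.
    import hashlib

    def digests(path, chunk_size=1 << 20):
        md5, sha1 = hashlib.md5(), hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
                sha1.update(chunk)
        return md5.hexdigest(), sha1.hexdigest()

    stored_md5, stored_sha1 = digests("master.tif")
    # ...later, re-check the same file:
    current_md5, current_sha1 = digests("master.tif")
    if (current_md5, current_sha1) != (stored_md5, stored_sha1):
        print("fixity failure: file content has changed")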

Saturday, February 07, 2015

Digital Tools and Apps

Digital Tools and Apps. Chris Erickson. Presentation for ULA. 2014. [PDF]
This is a presentation I created for ULA to briefly outline a few tools that I find helpful. There are many useful tools, and more are being created all the time. Here are a few that I use.
  • Copy & Transfer Tools: WinSCP; TeraCopy
  • Rename Tools: Bulk Rename Utility
  • Integrity & Fixity Tools: MD5Summer; MD5sums 1.2; Quick Hash; Hash Tool
  • File Editing Tools: BabelPad; Notepad++; XML Notepad;
    • ExifTool; BWF MetaEdit; BWAV Reader
  • File Format Tools: DROID
  • File Conversion: Calibre; Adobe Portfolio
  • Others: a whole list of other tools that I use or suggest you look at:
    • PDF/A tools
    • Email tools
Please let me know what tools you find helpful.

Monday, January 19, 2015

Ensuring long-term access: PDF validation with JHOVE?

Ensuring long-term access: PDF validation with JHOVE? Yvonne Friese. ZBW - Leibniz Information Centre for Economics.  PDF Association. December 17, 2014.
JHOVE is an open source tool for identifying, characterizing, and validating twelve common formats such as PDF, TIFF, JPEG, AIFF, and WAVE. Pages within a PDF file are usually stored as a page tree, allowing the user to reach a given page as quickly as possible. Common advice for long-term archiving is to prefer the PDF/A format; however, this no longer matches the day-to-day reality of many workflows that use JHOVE for validation tests. The differences between PDF and PDF/A mean that there can be validation errors. JHOVE's PDF module is in principle capable of validating PDF/A files, but the feature does not work well: the process does not analyze the content of the data streams, so it cannot validate PDF/A compliance in line with the ISO standards. JHOVE is thus not suited to PDF/A validation, but there are currently no alternatives to JHOVE for validating standard PDFs.

JHOVE can still be useful, provided users understand its error reports and are aware of ways to resolve them. Even with these problems, JHOVE remains an excellent option for providing initial guidance.
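
One way to use JHOVE for that initial guidance is to run its PDF module and flag anything whose status is not "Well-Formed and valid". A minimal sketch, assuming jhove is on the PATH; -m PDF-hul and -h XML are standard JHOVE options, though the XML output details can vary by version:

    # Run JHOVE's PDF module and extract the reported status.
    import re
    import subprocess

    def jhove_pdf_status(path):
        out = subprocess.run(["jhove", "-m", "PDF-hul", "-h", "XML", path],
                             capture_output=True, text=True, check=True)
        m = re.search(r"<status>([^<]+)</status>", out.stdout)
        return m.group(1) if m else "unknown"

    status = jhove_pdf_status("report.pdf")
    if status != "Well-Formed and valid":
        print(f"flag for manual review: {status}")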

[In our own institution, we have found JHOVE to be useful in identifying PDF files that have potential problems. Each problem for each source needs to be examined to decide if there is a preservation risk.]