Showing posts with label TIFF. Show all posts

Monday, August 31, 2015

Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container

Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container. Yan Han. Library Hi Tech. 2015.
     PDF/A can be used as a file format, but it can also serve as an OAIS SIP container. The PDF/A open standards can "simplify digitization process, reduce digitization cost, improve production substantially and build more confidence for preservation and access." PDF/A can also be used as an Archival Information Package container.

The three main goals of PDF/A are to:
  • provide a way to present the appearance of documents independent of the tools and systems used
  • provide a framework for recording the context and history of electronic documents in the metadata
  • define a framework for representing the logical structure of electronic documents within conforming files

A typical SIP may consist of a directory containing the following information:
  • Content: 
    • Preservation master files (such as TIFF image files).
    • Access files (such as PDF or JPEG / JPEG2000 files).
    • Other content (such as OCR data).
  • Preservation description: 
    • Preservation metadata in the TIFF header
    • Other structural and technical metadata
    • Checksum files.
  • Packaging information: 
    • Directory and file naming, structural metadata.
  • Descriptive information: 
    • Descriptive metadata saved in digital management system, catalog, or textual/XML files.
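The directory layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subdirectory names and the manifest file name `manifest-sha256.txt` are hypothetical (the article names the four categories, not a concrete layout), and SHA-256 is an assumed choice of checksum.

```python
import hashlib
from pathlib import Path

# Hypothetical subdirectory names; the source lists categories, not paths.
SIP_LAYOUT = ("content/masters", "content/access", "content/other", "metadata")


def create_sip_skeleton(root: Path) -> None:
    """Create the assumed subdirectories of an SIP under root."""
    for sub in SIP_LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large master images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest_name: str = "manifest-sha256.txt") -> Path:
    """Write one 'checksum  relative/path' line per file found under root."""
    manifest = root / manifest_name
    lines = []
    for path in sorted(root.rglob("*")):
        # Skip directories and the manifest itself.
        if path.is_file() and path != manifest:
            lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `create_sip_skeleton(Path("sip"))`, copying the files in, and then calling `write_manifest(Path("sip"))` yields the "checksum files" part of the preservation description. The descriptive and structural metadata would still have to be supplied separately.
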
"The key requirement of PDF/A is that it is self-described and self-contained so that it can be reproduced exactly the same way with different software in various platforms." It will include all information needed to display the content in the PDF/A file (text, images, fonts, and color profiles).

Master file formats should be non-proprietary, open and documented international standards that are commonly used. The files should be unencrypted, and should be uncompressed or else use lossless compression. The author of the article recommends using PDF/A as the preferred file format for text and image files, and possibly using it as an OAIS SIP container. The author shows how PDF/A is a better file format than the currently preferred TIFF or JPEG2000 formats.

There are several issues with PDF/A naming and implementation. The most critical need is reliable open source software for producing and validating PDF/A files.

Tuesday, August 11, 2015

Digital Preservation Tools on Github.

Digital Preservation Tools on Github. Chris Erickson. Blog. August 2015.
     While looking for a particular tool I came across several others that look interesting. I have not yet tried them, but this is a reminder that I need to check into them. 
  • epubcheck: a tool to validate EPUB files. It can detect many types of errors in EPUB. OCF container structure, OPF and OPS mark-up, and internal reference consistency are checked. EpubCheck can be run as a standalone command-line tool or used as a Java library.
  • preservation-tools: Bundles a number of preservation tools for all file types in a modular way. Includes:
    • PdfHeaderChecker (detects the software used to create a PDF),
    • PdfAValidator (checks via PDFBox whether a PDF/A is valid; runs through a folder and picks out only the PDF/A files),
    • iTextRepairPdf (takes a PDF file and copies the content page by page to a new, PDF/A-1-conformant PDF file),
    • PdfToImageConverter (converts the PDF files in a given folder to JPEGs, page by page),
    • PdfTwinTest (compares two PDFs line by line and outputs the differences; handy for quality checking after migration).
  • wail: Web Archiving Integration Layer (WAIL). A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
  • db-preservation-toolkit. The Database Preservation Toolkit allows conversion between database formats, including connection to live systems, for purposes of digitally preserving databases. The toolkit allows conversion of live or backed-up databases into preservation formats such as DBML or SIARD, XML-based formats created for the purpose of database preservation. The toolkit also allows conversion of the preservation formats back into live systems to allow the full functionality of databases. For example, it supports a specialized export into MySQL, optimized for PhpMyAdmin, so the database can be fully experimented with using a web interface.
  • DPFManager. DPF Manager is an open source modular TIFF conformance checker that is extremely easy to use, to integrate with existing and new projects, and to deploy in a multitude of different scenarios. It is designed to help archivists and digital content producers ensure that TIFF files are fit for long term preservation, and is able to automatically suggest improvements and correct preservation issues. The team developing it has decades of experience working with image formats and digital preservation, and has leveraged the support of 60+ memory institutions to draft a new ISO standard proposal (TIFF/A) specifically designed for long term preservation of still-images. An open source community will be created and grown through the project lifetime to ensure its continuous development and success. Additional commercial services will be offered to make DPF Manager self-sustainable and increase its adoption.
  • PreservationSimulation. This project is to provide baseline data for librarians and researchers about long-term survival rates of document collections. We have developed computer simulations to estimate document failure rates over a wide variety of conditions. The data from these simulations should be useful to stewards of such collections in planning and budgeting for storage and bandwidth needs to protect their collections.
  • flint. Facilitates configurable file/format validation. Its underlying architecture is based on the idea that file/format validation almost always has a specific use case with concrete requirements that may differ from validation against the official industry standard of a given format.
  • excel. A tool that helps retain formulas and other essential components of spreadsheets such as Excel files, published on GitHub by John McGrory, a data curator at the U of Minnesota. Their data repository runs the tool each time a dataset is submitted and zips the resulting files as the "Archival Version of the Data." Download the software at http://z.umn.edu/exceltool. See also a description of what the tool does: http://hdl.handle.net/11299/171966
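
The "runs through a folder and picks out only PDF/A files" step described for PdfAValidator can be approximated with a short sketch. This is not how that tool works internally (it validates via PDFBox). The sketch only looks for the `pdfaid:part` identifier that PDF/A files carry in their XMP metadata, so it detects a claim of conformance, not actual validity. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches both XMP serializations:
#   <pdfaid:part>1</pdfaid:part>   and   pdfaid:part="1"
PDFA_PART = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")


def claimed_pdfa_part(path: Path):
    """Return the PDF/A part number a file claims in its XMP metadata, or None."""
    # Simplification: reads the whole file and expects the header at byte 0.
    data = path.read_bytes()
    if not data.startswith(b"%PDF-"):
        return None
    match = PDFA_PART.search(data)
    return int(match.group(1)) if match else None


def find_pdfa_files(folder: Path) -> dict:
    """Map each *.pdf under folder that claims PDF/A to its claimed part number."""
    found = {}
    for path in sorted(folder.rglob("*.pdf")):
        part = claimed_pdfa_part(path)
        if part is not None:
            found[path.relative_to(folder).as_posix()] = part
    return found
```

Files that pass this filter would still need to go through a real validator, since a file can declare itself PDF/A without meeting the standard's requirements.
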

Monday, July 06, 2015

TIFF/A

TIFF/A. Gary McGath. File Formats Blog. July 3, 2015.
   The TIFF format has been around for a long time. There have been many changes and additions, such that "TIFF today is the sum of a lot of unwritten rules". A group of academic archivists has been working on a version that remains readable over the long term, calling it TIFF/A. A white paper discusses the technical issues. Discussions starting in September aim to produce a version to submit for ISO consideration.

Friday, June 12, 2015

TIFF/A Standard Initiative

TIFF/A Standard Initiative. Website. June, 2015.
The TIFF/A standard initiative intends to create an ISO specification of an archival TIFF format. TIFF is a widely used format, but it is complex and has some features not suited for long-term preservation. The TIFF/A specification will add mandatory and forbidden tags for archival purposes, similar to PDF/A. "This standard will be created in parallel with DPF Manager, an open source TIFF format validator that, in addition to the current TIFF ISO Standards, will be the first conformance checker for the TIFF/A new standard." The group looks to create a community of experts interested in discussing the initiative in order to prepare a proposal to submit to the ISO.
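
To illustrate what checking "mandatory and forbidden tags" could look like, here is a minimal sketch that reads the tag numbers from the first image file directory of a classic TIFF file and compares them with a policy. The two policy tables are hypothetical placeholders chosen for illustration; they are not the TIFF/A tag lists, and this is not how DPF Manager is implemented.

```python
import struct

# Hypothetical policy for illustration only; NOT the actual TIFF/A rules.
MANDATORY_TAGS = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}
FORBIDDEN_TAGS = {347: "JPEGTables"}


def first_ifd_tags(data: bytes) -> set:
    """Return the tag numbers found in the first image file directory (IFD)."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: magic number is not 42")
    (entry_count,) = struct.unpack_from(byte_order + "H", data, ifd_offset)
    tags = set()
    for index in range(entry_count):
        # Each IFD entry is 12 bytes and starts with a 2-byte tag number.
        (tag,) = struct.unpack_from(byte_order + "H", data, ifd_offset + 2 + 12 * index)
        tags.add(tag)
    return tags


def check_tags(data: bytes) -> dict:
    """Report mandatory tags that are missing and forbidden tags that are present."""
    tags = first_ifd_tags(data)
    return {
        "missing": sorted(name for tag, name in MANDATORY_TAGS.items() if tag not in tags),
        "forbidden": sorted(name for tag, name in FORBIDDEN_TAGS.items() if tag in tags),
    }
```

A file would pass this toy policy when both lists come back empty, for example `check_tags(open("scan.tif", "rb").read())` with a hypothetical file name. A real conformance checker would also inspect tag values and every IFD in the file, not only the first.
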