Showing posts with label PDF/A. Show all posts

Friday, September 15, 2017

Preservation with PDF/A

Preservation with PDF/A (2nd Edition). Betsy A Fanning. DPC Technology Watch Report 17-01. July 2017. [PDF 34pp.]  [Link updated]
     This report is an updated edition of the original Technology Watch Report 08-02, Preserving the Data Explosion: Using PDF (Fanning, 2008). It looks at PDF/Archive as a digital document file format for long-term preservation. The PDF/A versions of the PDF format have been developed as a family of open ISO Standards to address preservation of PDF files by removing features that pose preservation risks. It is important for preservation purposes to know how closely a file conforms to the requirements defined in the standard. There are preservation risks that may exist in the standard PDF file format:
  • any file type can be embedded;
  • the primary document can be conformant as a static document, but the embedded files may not be static;
  • embedded files may be infected by computer viruses;
  • embedded files may have extended metadata requirements, may introduce unexpected dependencies or be subject to format obsolescence;
  • embedded files may complicate matters relating to information security, data protection or the management of intellectual property rights.
By restricting some risk features and thus reducing preservation risks, the PDF/A format seeks to maximize:
  • device independence;
  • self-containment;
  • self-documentation.
Some reasons why an organization might use PDF/A to preserve its digital documents include:
  • its standardized format for storing digital documents for long periods of time;
  • it allows for digitally signed documents using the very latest digital signature software;
  • it reliably displays special characters for mathematics and languages since all are embedded within the file;
  • it displays correctly on any device as the author intended, including the reading order;
  • platform independence;
  • provision of fully searchable documents through Optical Character Recognition.
History and Features of PDF and PDF/A. The Standard was drafted in multiple parts in order to make it easier to implement. "Unfortunately, the committee’s philosophy of multiple parts resulted in confusion in the market place, making it more difficult for users to select the optimum file format." Users may need to do a file format assessment based on their requirements to help them decide which PDF/A Standard to implement.

Metadata helps to manage a file effectively throughout its life cycle, and also assists in document discovery searches. "Establishing a long-term digital document preservation system requires careful consideration of the metadata that will be needed to locate and render documents years from now." Collecting metadata for PDF/A documents is optional in the standard, except for the identifier, which is generated when the PDF/A file is created. Preservation metadata should:
  • be appropriate to the materials;
  • support interoperability;
  • use standardized controlled vocabulary;
  • include clear statements on the conditions and terms of use;
  • be authoritative and verifiable;
  • support the long-term management of the document.
Just because a file purports to be a PDF/A does not necessarily mean that it is. Format validation of a file can increase confidence that a viewer will be able to render the file correctly. A number of PDF/A validators are available. The development work on the PDF Standards is a continuing effort. There are additional preservation challenges in the format that are in the process of being addressed.
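
A PDF/A file states which flavour it claims to be in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below only reads that claim; it does not validate anything, which is exactly why a validator is still needed. It assumes the XMP packet is stored as plain text in the file, and the function name is mine, not something from the report.

```python
import re

# Matches pdfaid:part / pdfaid:conformance written either as XML elements
# (<pdfaid:part>1</pdfaid:part>) or as attributes (pdfaid:part="1").
_PART = re.compile(rb"pdfaid:part(?:>|=['\"])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=['\"])\s*([A-Za-z])")


def claimed_pdfa_flavour(data: bytes):
    """Return the claimed flavour, e.g. "1b", or None if no claim is found."""
    part = _PART.search(data)
    conf = _CONF.search(data)
    if not part or not conf:
        return None
    return part.group(1).decode() + conf.group(1).decode().lower()
```

For a file on disk, something like `claimed_pdfa_flavour(Path("report.pdf").read_bytes())` would do (the file name is hypothetical). A result such as `"1b"` means only that the file says it is PDF/A-1b.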

The report lists some recommendations, which are directed at groups that use the standard. They include:
  • For those evaluating PDF/A as a digital preservation solution:
    • Before adopting PDF/A as a preservation solution it is "essential to understand the organizational requirements and how PDF/A will support" the organization needs.
    • PDF/A is not a preservation solution on its own but a part of a wider preservation strategy, and it must be consistent with other components of the preservation infrastructure, such as backups, integrity checks and documentation.
    • Different versions of PDF/A have different purposes, with different capabilities as well as different preservation risks. These should be understood and decisions should be documented and explained.
    • Different vendors offer different tools to manage PDF/A, which should be compared against your requirements.
  • For organizations collecting and preserving digital data:
    • While it may not be possible to control or restrict how documents are produced, it may be useful to give document creators guidance on what is desired.
    • Embed PDF/A validation tools into preservation workflows and record the results to help manage the digital preservation risks associated with PDF/A files received.
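
As a rough sketch of the last recommendation, the function below runs a validator over incoming files and appends one JSON line per file to a log. The `validate` argument is a hypothetical wrapper around whichever validator an institution actually uses; the log layout and field names are my own assumptions, not something the report prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable


def record_validation(
    files: Iterable[Path],
    validate: Callable[[Path], bool],
    log_path: Path,
) -> int:
    """Validate each file, append one JSON line per file, return the failure count."""
    failures = 0
    with log_path.open("a", encoding="utf-8") as log:
        for path in files:
            ok = validate(path)  # hypothetical wrapper around a real validator
            failures += 0 if ok else 1
            entry = {
                "file": str(path),
                # Checksum ties the result to the exact bytes that were checked.
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                "valid": ok,
                "checked_at": datetime.now(timezone.utc).isoformat(),
            }
            log.write(json.dumps(entry) + "\n")
    return failures
```

Storing the checksum next to the result makes it possible to tell later whether a recorded result still applies to the file as it now exists.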

Friday, September 25, 2015

veraPDF releases prototype validation library for PDF/A-1b

veraPDF releases prototype validation library for PDF/A-1b. News release. veraPDF consortium. 16 September 2015.
     Version 0.4 of the veraPDF validation library is now available. This release delivers a working validation model and validator; an initial PDF/A-1b validation profile; and a prototype of the PDF feature reporting. This early version allows users to test this implementation of PDF/A-1b validation on single files. The roadmap for 2015 - 2017 is available.

Friday, September 04, 2015

Preserving Documents Forever: When is a PDF not a PDF?

Preserving Documents Forever: When is a PDF not a PDF?  Digital Preservation Coalition. July 15, 2015.
     This was a briefing day on preserving PDF at Oxford University. Presentations include:
  • An introduction to PDF, Sarah Higgins, Aberystwyth University
    • Portable Document Format (PDF)
    • Developed to enable document sharing across platforms while retaining “look and feel”
    • Originally a proprietary format - Adobe Systems
    • Specification available free of charge from 1993
    • Became an open standard in 2008: ISO 32000-1:2008 (PDF 1.7)
    • Many flavors, PDF/A, PDF/X, PDF/E, PDF/VT, PDF/UA
    • PDF/A is a sub-set for the Long Term Preservation of multi-media page documents that may contain a mixture of text, raster images and vector graphics. Self contained, robust, predictable, no encryption, no interactivity, limited color space
    • Flavours of PDF/A: PDF/A-1, PDF/A-2, PDF/A-3 and different levels of conformance
    • A Document is not the same as a Record, which is
      • Authentic
      • Reliable
      • Has integrity
      • Usable
  • Understanding PDF risks in preservation, Johan van der Knijff, National Library of the Netherlands 
  • PDF: Myths vs facts, Ange Albertini, Corkami
    • Graphical fact sheet about PDF. Shows the structure
    • Many myths about PDF
    • Many possible malformations handled specifically by each reader
    • It’s a complex patchwork!
    • PDF is very useful, but it has many issues of all kinds to deal with. It is far from perfect. 
    • What if Adobe stopped supporting PDF (like Flash) and we were just left with the specs?
  • Preserving PDF at the coalface, Tim Evans, Archaeology Data Service
    • PDF to PDF/A 1B conversions were problematic
    • A lot of the PDF problems can be fixed with manual intervention 
    • Used PDF/A Manager created by PDFTron for batch processing, automated fix-ups, with 80% success rate.
    • Still using a mixture of PDFTron and Preflight
    • Concern over incoming PDFs checked by DROID showing false positives
    • Third party tools: CutePDF, OmniPage Capture SDK, Nitro PDF, PDFCreator, Acrobat PDFMaker for Word (v.8)
    • Current practice: Use of PDF/A 1 and PDF/A 2; adopt a best fit of the two.
    • Use Callas PDF Toolbox
    • Now tied to a mixed economy of softwares and tools (some free, some commercial) to ensure consistent and accurate creation and validation.
  • Introducing veraPDF, Carl Wilson, Open Preservation Foundation
    • veraPDF is a project, a consortium, and a software product
    • Plan to produce a conformance checker 
    • Keep up with Developments on Github: https://github.com/verapdf 

Monday, August 31, 2015

Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container

Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container. Yan Han. Library Hi Tech. 2015.
     PDF/A can be used as a file format, but it can also be used as an OAIS SIP container. The PDF/A open standards can "simplify digitization process, reduce digitization cost, improve production substantially and build more confidence for preservation and access." PDF/A can also be used as an Archival Information Package container.

The three main goals of PDF/A are to:
  • provide a way to present the appearance of documents independent of the tools and systems used
  • provide a framework for recording the context and history of electronic documents in the metadata
  • define a framework for representing the logical structure of electronic documents within conforming files

A typical SIP may consist of a directory containing the following information:
  • Content: 
    • Preservation master files (such as TIFF images files). 
    • Access files (such as a PDF or JPG / JPG2000 files).
    • Other content (such as OCR data).
  • Preservation description: 
    • Preservation metadata in the TIFF header
    • Other structural and technical metadata
    • Checksum files.
  • Packaging information: 
    • Directory and File naming, structural metadata.
  • Descriptive information: 
    • Descriptive metadata saved in digital management system, catalog, or textual/XML files.
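
The checksum files in such a SIP directory can be produced with a few lines of code. This is a minimal sketch under my own assumptions (SHA-256, one manifest file at the top of the SIP, a hypothetical manifest name); the article does not specify an algorithm or layout.

```python
import hashlib
from pathlib import Path


def write_manifest(sip_dir: Path, manifest_name: str = "manifest-sha256.txt") -> Path:
    """Write '<hex digest>  <relative path>' for every file under sip_dir."""
    manifest = sip_dir / manifest_name
    lines = []
    for path in sorted(sip_dir.rglob("*")):
        if path.is_file() and path != manifest:
            # Reads the whole file into memory; large masters may need chunked hashing.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(sip_dir).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(manifest: Path) -> list:
    """Return the relative paths that are missing or whose checksum has changed."""
    changed = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, rel = line.split("  ", 1)
        target = manifest.parent / rel
        if not target.is_file() or hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            changed.append(rel)
    return changed
```

Running `verify_manifest` after a transfer or at intervals gives a simple integrity check over the masters, access files and metadata files alike.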
"The key requirement of PDF/A is that it is self-described and self-contained so that it can be reproduced exactly the same way with different software in various platforms." It will include all information needed to display the content in the PDF/A file (text, images, fonts, and color profiles).

Master file formats should be non-proprietary, open and documented international standards that are  commonly used. The files should be unencrypted, and should be uncompressed or else use lossless compression. The author of the article recommends using PDF/A as the preferred file format for text and image files, and possibly using it as an OAIS SIP container. The author shows how PDF/A is a better file format than the currently preferred TIFF or JPEG2000 formats.

There are several issues with PDF/A naming and implementation. The most critical need is reliable open source software for producing and validating PDF/A files.

Friday, August 14, 2015

PDF/A Flyer

PDF/A. PDF/A Competence Center. June 18, 2015. (PDF version of the Flyer).
     PDF/A is the ISO standard for archiving electronic documents using the PDF format. It required years of cooperation between software developers, industry associations and government agencies. There are three parts to the standard which were published between 2005 and 2012. Parts 2 and 3 add options for combining several PDF/A files into one PDF/A collection, embedding the PDF’s source files or other data, support for transparency, and more.

PDF/A-2 is defined by ISO 19005, which "provides a mechanism for displaying electronic documents in such a way that the visual image is maintained over time, irrespective of the tools and systems used for their production, storage and reproduction". It does not define an archiving strategy; instead, PDF/A specifies technical requirements for PDF electronic documents to ensure reliability after the file’s creation.

Since 2005 PDF/A has become the preferred format for archiving electronic documents. The PDF format, standardized in 2008 as ISO 32000-1, is used by many people. PDF will continue to ensure reliable access to PDF documents, making PDF ideal for long-term archiving.


Tuesday, August 11, 2015

Digital Preservation Tools on Github.

Digital Preservation Tools on Github. Chris Erickson. Blog. August 2015.
     While looking for a particular tool I came across several others that look interesting. I have not yet tried them, but this is a reminder that I need to check into them. 
  • epubcheck: a tool to validate EPUB files. It can detect many types of errors in EPUB. OCF container structure, OPF and OPS mark-up, and internal reference consistency are checked. EpubCheck can be run as a standalone command-line tool or used as a Java library.
  • preservation-tools: Bundles a number of preservation tools for all file types and tools in a modular way. Includes:
    • PdfHeaderChecker (able to detect the software used to create a PDF),
    • PdfAValidator (Checks via PDFBox if a PDF/A is valid. Runs through a folder and picks out only PDF/A-files),
    • iTextRepairPdf (takes a PDF file and copies the content page by page to a new, PDF/A-1-conformant PDF file)
    • PdfToImageConverter (Converts PDF Files in a certain folder to JPEGs page-per-page)
    • PdfTwinTest (compares two PDFs line by line and outputs the differences. This is handy for quality checking after migration)
  • wail: Web Archiving Integration Layer (WAIL). A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
  • db-preservation-toolkit. The Database Preservation Toolkit allows conversion between database formats, including connection to live systems, for purposes of digitally preserving databases. The toolkit allows conversion of live or backed-up databases into preservation formats such as DBML or SIARD, XML-based formats created for the purpose of database preservation. The toolkit also allows conversion of the preservation formats back into live systems to restore the full functionality of databases. For example, it supports a specialized export into MySQL, optimized for PhpMyAdmin, so the database can be fully explored using a web interface.
  • DPFManager. DPF Manager is an open source modular TIFF conformance checker that is extremely easy to use, to integrate with existing and new projects, and to deploy in a multitude of different scenarios. It is designed to help archivists and digital content producers ensure that TIFF files are fit for long term preservation, and is able to automatically suggest improvements and correct preservation issues. The team developing it has decades of experience working with image formats and digital preservation, and has leveraged the support of 60+ memory institutions to draft a new ISO standard proposal (TIFF/A) specifically designed for long term preservation of still-images. An open source community will be created and grown through the project lifetime to ensure its continuous development and success. Additional commercial services will be offered to make DPF Manager self-sustainable and increase its adoption.
  • PreservationSimulation. This project is to provide baseline data for librarians and researchers about long-term survival rates of document collections. We have developed computer simulations to estimate document failure rates over a wide variety of conditions. The data from these simulations should be useful to stewards of such collections in planning and budgeting for storage and bandwidth needs to protect their collections.
  • flint. Facilitates configurable file/format validation. Its underlying architecture is based on the idea that file/format validation almost always has a specific use case with concrete requirements that may differ from validation against the official industry standard of a given format.
  • excel. A tool to help retain formulas and other essential components of spreadsheets such as Excel files, published on GitHub by John McGrory, a data curator at the U of Minnesota. Their data repository runs the tool each time a dataset is submitted and zips the resulting files as the "Archival Version of the Data." Download the software at http://z.umn.edu/exceltool. See also a description of what the tool does: http://hdl.handle.net/11299/171966

Monday, August 10, 2015

Why PDF/A validation matters, even if you don’t have PDF/A

Why PDF/A validation matters, even if you don’t have PDF/A. Johan van der Knijff.  KB Research, National Library of the Netherlands. July 7, 2015.
     The PDF format has a number of features that don’t fit with the aims of long-term preservation and accessibility, such as encryption, password protection, external fonts and reliance on external software. One example is PDFs that use QuickTime content: Acrobat cannot render this format natively and relies on an external player. Other examples are files that use Linux fonts and files with 3D content.
Institutions may want to check their PDF files for similar examples. Reasons for doing this include:
  • Check compliance with institutional policy (e.g. do not accept PDFs with passwords)
  • Check collections for preservation risks (e.g. embedded multimedia content)
Some useful software tools are available, such as:
  • qpdf gives detailed information about encryption and password protection
  • pdffonts tool that is part of xpdf is useful for checking whether fonts in a PDF are embedded
  • The professional version of Adobe Acrobat has a PDF/A validator built into its Preflight tool
  • PDF/A validator that is part of the open-source Apache PDFBox library
  • VeraPDF has the potential to develop into a full-fledged PDF validator
"The PDF/A standards are nothing more than a set of profiles that impose some restrictions on a PDF, ruling out features that are not well-suited to long-term accessibility."  These features are encryption, non-embedded fonts, multimedia content, and so on. Several tools exist that compare a PDF against PDF/A and report any deviations. These PDF/A validators are typically used to verify PDF/A files but can also be used to detect user-specified risky features in regular PDFs. It is possible to automatically evaluate PDFs against a user-defined set of features. But it is important to check the file because a PDF may satisfy all requirements of PDF/A, and still be broken.
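
To give a feel for what checking against a user-defined set of features can mean, here is a crude sketch that looks for a few PDF name tokens in the raw bytes of a file. It is not how the tools above work and it is no substitute for them: names inside compressed object streams will be missed, and a token can occur without the feature really being used. The token list is my own illustrative selection.

```python
import re

# PDF name tokens that often indicate features ruled out by PDF/A.
# The selection is illustrative; adjust it to institutional policy.
RISK_TOKENS = {
    "encryption": b"/Encrypt",
    "javascript": b"/JavaScript",
    "embedded files": b"/EmbeddedFiles",
    "launch action": b"/Launch",
    "multimedia": b"/RichMedia",
    "movie": b"/Movie",
}


def risky_features(data: bytes, tokens=RISK_TOKENS) -> list:
    """Return the labels of all tokens found in the raw PDF bytes."""
    found = []
    for label, token in tokens.items():
        # The lookahead stops a token from matching a longer name
        # that merely starts with it (e.g. /EncryptMetadata).
        if re.search(re.escape(token) + rb"(?![A-Za-z0-9])", data):
            found.append(label)
    return found
```

A hit from a scan like this is best treated as a prompt to examine the file with a proper tool, not as a verdict.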


Friday, April 24, 2015

PREFORMA Starts Prototyping Phase

PREFORMA Starts Prototyping Phase. OPF Blog. 22 April 2015.
The PREFORMA prototyping phase has started with three groups that will work on:
  1. the compliance checker for the PDF/A standard for documents; 
  2. the TIFF standard for digital still images; and
  3. a set of open source standards for moving images
This phase will last until December 2016. It is important that libraries and archives understand what is in the digital objects they are preserving.  These tools will increase the knowledge about these formats.


Monday, January 19, 2015

Ensuring long-term access: PDF validation with JHOVE?

Ensuring long-term access: PDF validation with JHOVE? Yvonne Friese. ZBW - Leibniz Information Centre for Economics.  PDF Association. December 17, 2014.
JHOVE is an open source tool for identifying, characterizing and validating twelve common formats, such as PDF, TIFF, JPEG, AIFF and WAVE. Pages within a PDF file are usually stored as a page tree, allowing the user to reach a given page as quickly as possible. Common advice for long-term archiving is to prefer the PDF/A format. However, this no longer matches the day-to-day reality of many workflows, which use JHOVE for validation tests. The differences between PDF and PDF/A mean that there can be validation errors. JHOVE’s PDF module nominally supports validating PDF/A files, but the feature does not work well. The process does not analyze the content of the data streams, meaning that it cannot validate PDF/A compliance in line with ISO standards. JHOVE is not suited to PDF/A validation, but there are currently no alternatives to JHOVE for validating standard PDFs.

JHOVE can still be useful, provided users understand its error reports and are aware of ways to resolve them. Even with the problems JHOVE remains an excellent option for providing initial guidance.

[In our own institution, we have found JHOVE to be useful in identifying PDF files that have potential problems. Each problem for each source needs to be examined to decide if there is a preservation risk.]

Sunday, November 02, 2014

ARMA 2014: The Convergence of Records Management and Digital Preservation

ARMA 2014: The Convergence of Records Management and Digital Preservation. Howard Loos, Chris Erickson. October 2014. [PDF]
Presentation on records management and digital preservation given at the ARMA 2014 conference.
Notes:
  • Records Management mission: To assist departments in fulfilling their responsibility to identify and manage records and information in accordance with legal, regulatory, and operational requirements
  • RIM Life Cycle to DP Life Cycle
  • Challenges and successful approaches
  • Storing records permanently with M-Discs
  • Introduction to Digital Preservation: challenges, format sustainability, media obsolescence, metadata, organizational challenges
  • Life of digital media
  • Best practices and processes
  • OAIS model
  • Rosetta Digital Preservation System
  • Library of Congress Digital Preservation Outreach & Education (DPOE) Network

Saturday, May 11, 2013

PDF/A, PDF for Long-term Preservation.

PDF/A, PDF for Long-term Preservation. Library of Congress.  March 21, 2013.
This section on PDF/A is part of the Library of Congress website on sustainable formats. The page includes description of PDF/A, sustainability factors, quality and functionality factors, format specifications, and useful references.

PDF/A is a family of ISO standards that attempt to provide sustainable formats through device independence, self-containment, and self-documentation. The PDF/A standards are developed and maintained by a working group with representatives from government, industry, and academia and active support from Adobe Systems Incorporated.

PDF/A-1, the first PDF/A standard, was based on PDF version 1.4 and published in 2005.
PDF/A-2 extends the capabilities of PDF/A-1 and is based on PDF version 1.7.
PDF/A-3 allows files in any other format, not just other PDF/A files, to be embedded in a PDF/A file.

Restrictions on PDF/A files include:
  • Audio and video content are forbidden
  • JavaScript and executable file launches are prohibited
  • All fonts must be legally embeddable for unlimited, universal rendering
  • Color spaces must be specified in a device-independent manner
  • Encryption is disallowed
  • Use of standards-based metadata is mandated
The PDF/A standards define levels of conformance: conformance level A satisfies all requirements in the specification; level B and level U are lower levels of conformance, still satisfying the requirements of ISO 19005 regarding the visual appearance of electronic documents, but less demanding as to representation of structural or semantic properties.
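
The parts and conformance levels combine into flavours such as PDF/A-1b or PDF/A-2u. A small lookup table makes the valid combinations explicit. To my knowledge level U was introduced with PDF/A-2 and is not defined for PDF/A-1; the table covers only the three parts described above, and the names are mine.

```python
# Conformance levels defined for each part of ISO 19005 covered in this post.
PDFA_LEVELS = {
    1: {"a", "b"},
    2: {"a", "b", "u"},
    3: {"a", "b", "u"},
}


def is_defined_flavour(part: int, level: str) -> bool:
    """Return True if the part/level combination is one the standard defines."""
    return level.lower() in PDFA_LEVELS.get(part, set())
```

For example, `is_defined_flavour(2, "U")` is true, while `is_defined_flavour(1, "u")` is false.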