Preserving the Data Explosion: Using PDF. Betsy Fanning. Digital Preservation Coalition & AIIM. February 2008. [PDF]
This report looks at PDF standards activities and their relevance to digital preservation. The PDF Reference is an open specification made freely available by Adobe. The various versions are listed; in 2000 subsets were created, including PDF/A for archiving, which is being developed by AIIM and an ISO group. The group looked at a variety of formats for long-term preservation, and "PDF was chosen as the file format best suited for long-term preservation due to its wide adoption in numerous applications and ease of creating PDF files from digitally born documents." Long term is defined as "the period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository, which may extend into the indefinite future."
PDF is an open file format but is considered proprietary because Adobe Systems owns patents on the format. However, Adobe allows developers to use the specification royalty-free. The objectives are to find a format that:
- is device-independent
- is self-contained for rendering and description
- has no restrictive elements that would prevent rendering the document in the long term
- is in widespread use
PDF does not meet all of these objectives and has issues that need to be resolved. PDF/A restricts some functions, and there are two conformance levels:
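PDF/A-1a (Level A): full conformance; the file must meet all requirements of the specification, including those for document structure,
PDF/A-1b (Level B): minimal conformance; the file must meet the requirements needed to reproduce its visual appearance reliably.
Files at either level may include any features of PDF version 1.4 except those forbidden by the specification.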
Adobe products will conform to the ISO PDF standard when it is approved. But the PDF format alone is not enough to ensure accurate preservation. Organizations must have appropriate policies, procedures and records management in place. It is important to know that files conform to PDF/A, so validation tools are needed. "It is safe to say that correctly implementing the PDF/A file format should result in reliable, predictable, and unambiguous access to the full information content of electronic documents long-term." Education and training on PDF/A are needed. "Due to the specific nature of long-term preservation of electronic documents, the field of available file formats that can be used for preservation purposes is very small." Other formats often considered are TIFF, XML, ODF, OOXML, and XPS.
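As a rough illustration of the kind of tool meant here, the sketch below reads the PDF/A level that a file declares in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. It is only a first-pass check and not a validator: it reports what the file claims, assumes the metadata is stored uncompressed, and does not test the file against the standard. The function name `declared_pdfa_level` is made up for this example.

```python
import re

# The XMP packet may write the properties as elements
# (<pdfaid:part>1</pdfaid:part>) or as attributes (pdfaid:part="1").
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')


def declared_pdfa_level(data: bytes):
    """Return the PDF/A level a file claims (e.g. '1b'), or None.

    Hypothetical helper: it only reads the declaration and says
    nothing about whether the file really conforms.
    """
    part = _PART.search(data)
    conf = _CONF.search(data)
    if part is None or conf is None:
        return None
    return part.group(1).decode() + conf.group(1).decode().lower()


if __name__ == "__main__":
    # Made-up metadata fragment standing in for the bytes of a real file.
    sample = (b"<rdf:Description>"
              b"<pdfaid:part>1</pdfaid:part>"
              b"<pdfaid:conformance>B</pdfaid:conformance>"
              b"</rdf:Description>")
    print(declared_pdfa_level(sample))
```

For a real file, the bytes could be read with `open(path, "rb").read()`. A file that passes this check still needs a full conformance validator before it can be trusted as PDF/A.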
Significant Properties of Digital Objects. Andrew Wilson. JISC Workshop. 7 April 2008.
The fundamental challenge is to preserve the accessibility and authenticity of digital objects over time and across changing technical environments. We must accept the separation of logical information of an object from its physical environment. There are different models of digital preservation that focus on the technology, the data, the processes, or restoring objects later (digital archaeology). Authenticity comes from integrity and accuracy (no unauthorized changes), being able to trust that the item is what it is supposed to be, and the ability to use and view it later. That does not mean that it has not been changed, but that the message it was meant to communicate is unaltered. The model needs to ensure that the essence or significant properties are preserved.
Investigating the significant properties of electronic content over time. Stephen Grace. JISC Workshop. 7 April 2008.
The project looks at the properties of digital content. The framework is to catalog the significant properties of a digital object, determine the relative value of each property for the re-creation of the object, designate its level of significance, and determine the user community and restrictions. Some properties are more important than others, and a judgment has to be made on their value. A numbered scale measures the significance, from essential to not important.
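One way to picture such a catalog is as a small data structure. The sketch below is an illustration only: the five-point scale, the class and function names, and the example properties are all assumptions made here and are not taken from the project.

```python
from dataclasses import dataclass
from enum import IntEnum


class Significance(IntEnum):
    """Hypothetical numbered scale; the project's real scale may differ."""
    NOT_IMPORTANT = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3
    ESSENTIAL = 4


@dataclass(frozen=True)
class SignificantProperty:
    name: str
    significance: Significance
    community: str  # the user community the judgment was made for


def properties_to_preserve(props, threshold=Significance.HIGH):
    """Return names of properties at or above the threshold, most significant first."""
    ranked = sorted(props, key=lambda p: p.significance, reverse=True)
    return [p.name for p in ranked if p.significance >= threshold]


# Made-up catalog for a text document.
catalog = [
    SignificantProperty("font family", Significance.LOW, "general readers"),
    SignificantProperty("text content", Significance.ESSENTIAL, "general readers"),
    SignificantProperty("page layout", Significance.MODERATE, "general readers"),
    SignificantProperty("creation date", Significance.HIGH, "archivists"),
]

if __name__ == "__main__":
    print(properties_to_preserve(catalog))
```

The threshold is where the value judgment described above is made: raising or lowering it changes which properties a preservation action has to carry forward.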
The Significant Properties of Vector Images. David A. Duce. JISC Workshop. 7 April 2008.
They use the data-centric approach, which focuses on maintaining digital objects in current formats, rather than the process-centric approach, which keeps objects in their original form and attempts to emulate the original environment. The strategy is to transform the original object, together with related information, into a transformed source that retains the essence of the original. It is a challenge to identify the significant properties and keep them through the transformation process. We need to document why something is being preserved and why the particular methods were used. Some possible formats for these types of graphics are WebCGM (mostly engineering), SVG (an XML application with font and animation capability) and PDF/A. More research is needed.