Digital Preservation Matters: PDF

Monday, March 09, 2020

The Future of Past Email is PDF

The Future of Past Email is PDF. Chris Prom. Information and Data Manager (IDM). March 6, 2020.
The article reports on a group examining the question: How should governments, universities, businesses, and archives ensure that future generations can access and render email? The group looks at ways to capture, preserve, and render email, building on an earlier report:

The Future of Email Archives: A Report from the Task Force on Technical Approaches for Email Archives. CLIR Publications. August 2018. [PDF]

Email is an increasingly important part of the historical record, yet it is particularly difficult to preserve, putting future access to this vast resource at risk. The report looks at what makes email archiving so complex and describes emerging strategies to meet the challenge.

Addressing the challenges will require commitment from stakeholders, as well as tool support, testing, and development.

Some institutions preserve emails with MBOX, EML or PST; maintain or emulate old email environments; or transform them to XML. All these ways require a high level of technical support. Others simply store email archives.

The group suggests the PDF format could be used for email, though there are gaps and risks.

PDF includes data structures that could fully accommodate the diversity of email content and metadata. It is completely self-contained and designed to capture text and graphical content for archival purposes.
Email-to-PDF provides a migration pathway for email messages independent of email applications and could preserve essential attributes of the message.
A standardized application of PDF technology could provide source data, universally usable archival-quality renderings including attachments, and provenance metadata.
It could use existing standards and a diverse vendor community for preserving, searching and reusing email.
Using PDF could integrate with existing preservation tools for ingesting, storing, preserving and disseminating content from established repository systems already in use in government, academic, public, and corporate archives and libraries.
Since the PDF format is so widely implemented, there would already be a common understanding of best practices for archiving email with PDF.

"In short, the "email archiving in PDF" concept seeks to build on widely implemented standards and technologies. It would allow individuals and institutions a pathway to migrate email into the most widely used format for the distribution of text documents."

Currently there is a drawback to using PDF for email preservation: "attachments, metadata, context, and sometimes, even searchable text are missing. Simply "printing to PDF" fails to meet the specific needs of institutions archiving volumes of complex email messages, at least as currently implemented." So how can "institutions ensure authenticity, completeness, privacy, security and other needs, especially when working with thousands or millions of messages, when most header metadata and attachments are lost in the conversion?"
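To make the concern about lost header metadata and attachments concrete, here is a minimal sketch of how one message's headers and attachments could be inventoried before any conversion. It assumes Python and its standard email package; the function name inventory and the sample addresses are made up for illustration and do not come from the article or the working group.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def inventory(raw: bytes) -> dict:
    """List the header fields and attachments carried by one raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    return {
        "headers": [(name, str(value)) for name, value in msg.items()],
        "attachments": [
            (part.get_filename(), part.get_content_type())
            for part in msg.iter_attachments()
        ],
    }


# Build a small sample message so the sketch is self-contained.
sample = EmailMessage()
sample["From"] = "sender@example.org"
sample["To"] = "archive@example.org"
sample["Subject"] = "Quarterly report"
sample["Message-ID"] = "<1234@example.org>"
sample.set_content("Report attached.")
sample.add_attachment(
    b"a,b\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="data.csv",
)

report = inventory(sample.as_bytes())
for name, value in report["headers"]:
    print(f"{name}: {value}")
print(report["attachments"])
```

An inventory like this could be compared against what a converted PDF actually retains, which is one way to see what "printing to PDF" drops.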

The group identified and documented the essential characteristics and technical requirements for converting email into PDF, which will soon be published as a set of fundamental requirements for archiving email.

Digital Preservation Matters.

Friday, September 15, 2017

Preservation with PDF/A

Preservation with PDF/A (2nd Edition). Betsy A Fanning. DPC Technology Watch Report 17-01. July 2017. [PDF 34pp.] [Link updated]
This report is an updated edition of the original Technology Watch Report 08-02, Preserving the Data Explosion: Using PDF (Fanning, 2008). It looks at PDF/A (PDF/Archive) as a digital document file format for long-term preservation. The PDF/A versions of the PDF format have been developed as a family of open ISO Standards to address preservation of PDF files by removing features that pose preservation risks. It is important for preservation purposes to know how closely a file conforms to the requirements defined in the standard. There are preservation risks that may exist in the standard PDF file format:

any file type can be embedded;
the primary document can be conformant as a static document, but the embedded files may not be static;
embedded files may be infected by computer viruses;
embedded files may have extended metadata requirements, may introduce unexpected dependencies or be subject to format obsolescence;
embedded files may complicate matters relating to information security, data protection or the management of intellectual property rights.

By restricting some risk features and thus reducing preservation risks, the PDF/A format seeks to maximize:

device independence;
self-containment;
self-documentation.

Some reasons why an organization might use PDF/A to preserve its digital documents include:

its standardized format for storing digital documents for long periods of time;
it allows for digitally signed documents using the very latest digital signature software;
it reliably displays special characters for mathematics and languages since all are embedded within the file;
it displays correctly on any device as the author intended, including the reading order;
platform independence;
provision of fully searchable documents through Optical Character Recognition.

History and Features of PDF and PDF/A. The Standard was drafted in multiple parts in order to make it easier to implement. "Unfortunately, the committee’s philosophy of multiple parts resulted in confusion in the market place, making it more difficult for users to select the optimum file format." Users may need to do a file format assessment based on their requirements to help them decide which PDF/A Standard to implement.

Metadata helps manage a file effectively throughout its life cycle, as well as assisting in document discovery searches. "Establishing a long-term digital document preservation system requires careful consideration of the metadata that will be needed to locate and render documents years from now." Collecting metadata for PDF/A documents is optional in the standard, except for the identifier, which is generated when the PDF/A file is created. Preservation metadata should:

be appropriate to the materials;
support interoperability;
use standardized controlled vocabulary;
include clear statements on the conditions and terms of use;
be authoritative and verifiable;
support the long-term management of the document.

Just because a file purports to be a PDF/A does not necessarily mean that it is. Format validation of a file can increase confidence that a viewer will be able to render the file correctly. A number of PDF/A validators are available. The development work on the PDF Standards is a continuing effort. There are additional preservation challenges in the format that are in the process of being addressed.
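The gap between what a file claims and what it is can be illustrated with a small sketch that reads only the claim. PDF/A files declare their part and conformance level in XMP metadata using the pdfaid:part and pdfaid:conformance properties. The Python below assumes the XMP packet is stored uncompressed in the file and simply searches for those properties; the function name claimed_pdfa is hypothetical. It reports the claim only and validates nothing.

```python
import re
from typing import Optional, Tuple


def _field(name: str) -> "re.Pattern[str]":
    # Matches both XMP serializations:
    #   <pdfaid:part>1</pdfaid:part>   and   pdfaid:part="1"
    return re.compile(r"pdfaid:" + name + r"""\s*(?:=\s*["']|>)\s*([^"'<\s]+)""")


def claimed_pdfa(data: bytes) -> Optional[Tuple[str, Optional[str]]]:
    """Return (part, conformance) as declared in XMP, or None if no claim is found."""
    text = data.decode("latin-1")
    part = _field("part").search(text)
    conf = _field("conformance").search(text)
    if part is None:
        return None
    return (part.group(1), conf.group(1).upper() if conf else None)


element_form = b"<pdfaid:part>1</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
attribute_form = b'<rdf:Description pdfaid:part="2" pdfaid:conformance="u"/>'
print(claimed_pdfa(element_form))
print(claimed_pdfa(attribute_form))
print(claimed_pdfa(b"%PDF-1.4 no metadata here"))
```

A result from this function says only what the file asserts about itself; whether the assertion holds is the job of a PDF/A validator.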

The report lists some recommendations, which are directed at groups that use the standard. They include:

For those evaluating PDF/A as a digital preservation solution:

Before adopting PDF/A as a preservation solution it is "essential to understand the organizational requirements and how PDF/A will support" the organization needs.
PDF/A is not a preservation solution on its own; it is one part of a wider preservation strategy and must be consistent with other components of the preservation infrastructure, such as backups, integrity checks and documentation.
Different versions of PDF/A have different purposes, with different capabilities as well as different preservation risks. These should be understood and decisions should be documented and explained.
Different vendors offer different tools to manage PDF/A that should be compared against your requirements.

For organizations collecting and preserving digital data:

While it may not be possible to control or restrict how documents are produced, it may be useful to give document creators guidance on what is desired.
Embed PDF/A validation tools into preservation workflows and record the results to help manage the digital preservation risks associated with PDF/A files received.
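
The recommendation to record validation results can be sketched as a small logging step. The Python below is an illustration under stated assumptions, not a method from the report: record_validation is a hypothetical helper, and looks_like_pdf is a stand-in for a real PDF/A validator, which a workflow would plug in instead.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Callable


def record_validation(name: str, data: bytes, validate: Callable[[bytes], bool]) -> str:
    """Run a validator on one file and return a JSON line describing the outcome."""
    entry = {
        "file": name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "valid": bool(validate(data)),
    }
    return json.dumps(entry, sort_keys=True)


# Hypothetical stand-in validator: it only checks for the PDF signature
# and says nothing about PDF/A conformance.
def looks_like_pdf(data: bytes) -> bool:
    return data.startswith(b"%PDF-")


line = record_validation("report.pdf", b"%PDF-1.4\n...", looks_like_pdf)
print(line)
```

Storing the checksum next to the result ties each recorded outcome to the exact bytes that were checked, so a later integrity check can tell whether the result still applies.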

Friday, September 04, 2015

Preserving Documents Forever: When is a PDF not a PDF?

Preserving Documents Forever: When is a PDF not a PDF? Digital Preservation Coalition. July 15, 2015.
This was a briefing day on preserving PDF at Oxford University. Presentations include:

An introduction to PDF, Sarah Higgins, Aberystwyth University

Portable Document Format (PDF)
Developed to enable document sharing across platforms while retaining “look and feel”
Originally a proprietary format - Adobe Systems
Specification available free of charge from 1993
Became an open standard in 2008: ISO 32000-1:2008 (PDF 1.7)
Many flavors: PDF/A, PDF/X, PDF/E, PDF/VT, PDF/UA
PDF/A is a sub-set for the Long Term Preservation of multi-media page documents that may contain a mixture of text, raster images and vector graphics. Self-contained, robust, predictable, no encryption, no interactivity, limited color space
Flavors of PDF/A: PDF/A-1, PDF/A-2, PDF/A-3 and different levels of conformance
A Document is not the same as a Record, which is
- Authentic
- Reliable
- Has integrity
- Usable

Understanding PDF risks in preservation, Johan van der Knijff, National Library of the Netherlands

Why PDF/A validation matters, even if you don’t have PDF/A
Identify preservation risks of any PDF by assessing against PDF/A standard with a validator

PDF: Myths vs facts, Ange Albertini, Corkami

Graphical fact sheet about PDF. Shows the structure
Many myths about PDF
Many possible malformations handled specifically by each reader
It’s a complex patchwork!
PDF is very useful, but it has many issues of all kinds to deal with. It is far from perfect.
What if Adobe stopped supporting PDF (like Flash) and we were just left with the specs?

Preserving PDF at the coalface, Tim Evans, Archaeology Data Service

PDF to PDF/A-1b conversions were problematic
A lot of the PDF problems can be fixed with manual intervention
Used PDF/A Manager created by PDFTron for batch processing, automated fix-ups, with 80% success rate.
Still using a mixture of PDFTron and Preflight
Concern over incoming PDFs checked by DROID showing false positives
Third party tools: CutePDF, OmniPage Capture SDK, Nitro PDF, PDFCreator, Acrobat PDFMaker for Word (v.8)
Current practice: Use of PDF/A 1 and PDF/A 2; adopt a best fit of the two.
Use Callas PDF Toolbox
Now tied to a mixed economy of software and tools (some free, some commercial) to ensure consistent and accurate creation and validation.

Introducing veraPDF, Carl Wilson, Open Preservation Foundation

veraPDF is a project, a consortium, and a software product
Plan to produce a conformance checker
Keep up with developments on GitHub: https://github.com/verapdf

Friday, August 14, 2015

PDF/A Flyer

PDF/A. PDF/A Competence Center. June 18, 2015. (PDF version of the Flyer).
PDF/A is the ISO standard for archiving electronic documents using the PDF format. It required years of cooperation between software developers, industry associations and government agencies. There are three parts to the standard which were published between 2005 and 2012. Parts 2 and 3 add options for combining several PDF/A files into one PDF/A collection, embedding the PDF’s source files or other data, support for transparency, and more.

PDF/A-2 is defined by ISO 19005, which "provides a mechanism for displaying electronic documents in such a way that the visual image is maintained over time, irrespective of the tools and systems used for their production, storage and reproduction". It does not define an archiving strategy; instead, PDF/A specifies technical requirements for PDF electronic documents to ensure reliability after the file’s creation.

Since 2005 PDF/A has become the preferred format for archiving electronic documents. The PDF format, standardized in 2008 as ISO 32000-1, is used by many people. PDF will continue to ensure reliable access to PDF documents, making PDF ideal for long-term archiving.

Tuesday, August 11, 2015

Digital Preservation Tools on Github.

Digital Preservation Tools on Github. Chris Erickson. Blog. August 2015.
While looking for a particular tool I came across several others that look interesting. I have not yet tried them, but this is a reminder that I need to check into them.

epubcheck: a tool to validate EPUB files. It can detect many types of errors in EPUB. OCF container structure, OPF and OPS mark-up, and internal reference consistency are checked. EpubCheck can be run as a standalone command-line tool or used as a Java library.

preservation-tools: Bundles a number of preservation tools for various file types in a modular way. Includes:

PdfHeaderChecker (detects the software used to create a PDF)
PdfAValidator (checks via PDFBox whether a PDF/A is valid; runs through a folder and picks out only PDF/A files)
iTextRepairPdf (takes a PDF file and copies the content page by page to a new, PDF/A-1-conformant PDF file)
PdfToImageConverter (converts PDF files in a certain folder to JPEGs page by page)
PdfTwinTest (compares two PDFs line by line and outputs the differences; handy for quality checking after migration)
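
The line-by-line comparison idea behind PdfTwinTest can be sketched with Python's standard difflib module. This is not the tool's own code (which I have not inspected); twin_diff is a hypothetical name, and the bytes are decoded as Latin-1 purely so that every byte maps to a character.

```python
import difflib
from typing import List


def twin_diff(old: bytes, new: bytes) -> List[str]:
    """Return unified-diff lines describing how two files differ, line by line."""
    a = [line.decode("latin-1") for line in old.splitlines()]
    b = [line.decode("latin-1") for line in new.splitlines()]
    return list(difflib.unified_diff(a, b, "before", "after", lineterm=""))


before = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
after = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

for line in twin_diff(before, after):
    print(line)
```

A byte-level comparison like this flags any change, including harmless ones such as a rewritten cross-reference table, so its output is a starting point for review rather than a verdict on the migration.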

wail: Web Archiving Integration Layer (WAIL). A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.

db-preservation-toolkit. The Database Preservation Toolkit allows conversion between database formats, including connection to live systems, for purposes of digitally preserving databases. The toolkit allows conversion of live or backed-up databases into preservation formats such as DBML or SIARD, XML-based formats created for the purpose of database preservation. The toolkit also allows conversion of the preservation formats back into live systems to allow the full functionality of databases. For example, it supports a specialized export into MySQL, optimized for PhpMyAdmin, so the database can be fully experimented with using a web interface.

DPFManager. DPF Manager is an open source modular TIFF conformance checker that is extremely easy to use, to integrate with existing and new projects, and to deploy in a multitude of different scenarios. It is designed to help archivists and digital content producers ensure that TIFF files are fit for long term preservation, and is able to automatically suggest improvements and correct preservation issues. The team developing it has decades of experience working with image formats and digital preservation, and has leveraged the support of 60+ memory institutions to draft a new ISO standard proposal (TIFF/A) specifically designed for long term preservation of still-images. An open source community will be created and grown through the project lifetime to ensure its continuous development and success. Additional commercial services will be offered to make DPF Manager self-sustainable and increase its adoption.

PreservationSimulation. This project is to provide baseline data for librarians and researchers about long-term survival rates of document collections. We have developed computer simulations to estimate document failure rates over a wide variety of conditions. The data from these simulations should be useful to stewards of such collections in planning and budgeting for storage and bandwidth needs to protect their collections.

flint. Facilitates configurable file/format validation. Its underlying architecture is based on the idea that file/format validation almost always has a specific use-case with concrete requirements that may differ from a validation against the official industry standard of a given format. The following are the principal ideas we've implemented in order to match such requirements.

excel. Regarding the second issue, how to best retain formulas and other essential components of spreadsheets like Excel: one of our data curators, John McGrory (U of Minnesota), just published a tool on GitHub that can help. In our data repository, we use the tool each time a dataset is submitted and zip the resulting files as the "Archival Version of the Data." Download the software at http://z.umn.edu/exceltool. See also a description of what the tool does: http://hdl.handle.net/11299/171966

Monday, August 10, 2015

Why PDF/A validation matters, even if you don’t have PDF/A

Why PDF/A validation matters, even if you don’t have PDF/A. Johan van der Knijff. KB Research, National Library of the Netherlands. July 7, 2015.
The PDF format has a number of features that don’t fit with the aims of long-term preservation and accessibility, such as encryption, password protection, external fonts and reliance on external software. Some examples are PDFs that use Quicktime content. Acrobat cannot render this format natively, and relies on an external player. Also files that use Linux fonts, or files with 3D content.
Institutions may want to check their PDF files for similar examples. Reasons for doing this include:

Check compliance with institutional policy (e.g. do not accept PDFs with passwords)
Check collections for preservation risks (e.g. embedded multimedia content)

Some useful software tools are available, such as:

qpdf gives detailed information about encryption and password protection
pdffonts tool that is part of xpdf is useful for checking whether fonts in a PDF are embedded
The professional version of Adobe Acrobat has a PDF/A validator built into its Preflight tool
PDF/A validator that is part of the open-source Apache PDFBox library
VeraPDF has the potential to develop into a full-fledged PDF validator
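
Alongside the tools listed above, the idea of checking PDFs against a user-defined set of risky features can be sketched as a crude first-pass byte scan. This is a heuristic under stated assumptions, not a validator: the RISKY table and risk_report name are made up for illustration, the markers are PDF name objects (/Encrypt, /JavaScript, /EmbeddedFile, /RichMedia), and the scan will miss anything stored inside compressed object streams or written with escaped names.

```python
from typing import Dict, List

# User-defined policy: label -> PDF name object whose presence is treated as a risk.
RISKY: Dict[str, bytes] = {
    "encryption": b"/Encrypt",
    "javascript": b"/JavaScript",
    "embedded files": b"/EmbeddedFile",
    "multimedia": b"/RichMedia",
}


def risk_report(data: bytes, features: Dict[str, bytes] = RISKY) -> List[str]:
    """Return the sorted labels of every configured marker found in the raw bytes."""
    return sorted(label for label, marker in features.items() if marker in data)


sample = b"%PDF-1.6\ntrailer\n<< /Encrypt 5 0 R >>\n<< /S /JavaScript >>\n"
print(risk_report(sample))
print(risk_report(b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"))
```

Keeping the policy in a plain dictionary means an institution could adjust the set of features it screens for without touching the scanning code.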

"The PDF/A standards are nothing more than a set of profiles that impose some restrictions on a PDF, ruling out features that are not well-suited to long-term accessibility." These features are encryption, non-embedded fonts, multimedia content, and so on. Several tools exist that compare a PDF against PDF/A and report any deviations. These PDF/A validators are typically used to verify PDF/A files but can also be used to detect user-specified risky features in regular PDFs. It is possible to automatically evaluate PDFs against a user-defined set of features. But it is important to check the file because a PDF may satisfy all requirements of PDF/A, and still be broken.

Monday, March 30, 2015

Tabula

Tabula. Website. March 27, 2015.
Tabula is a tool for working with text-based data tables inside PDF files. There's no easy way to copy-and-paste rows of data out of PDF files. This tool allows you to extract that data into Excel spreadsheets, CSV, or JSON using a simple interface. Tabula works on Mac, Windows and Linux.

Monday, January 19, 2015

Ensuring long-term access: PDF validation with JHOVE?

Ensuring long-term access: PDF validation with JHOVE? Yvonne Friese. ZBW - Leibniz Information Centre for Economics. PDF Association. December 17, 2014.
JHOVE is an open source tool for identifying, characterizing and validating twelve common formats such as PDF, TIFF, JPEG, AIFF and WAVE. Pages within a PDF file are usually stored as a page tree, allowing the user to reach a given page as quickly as possible. Common advice for long-term archiving is to preferentially use the PDF/A format. However, this no longer matches the day-to-day reality of many workflows which use JHOVE for validation tests. The differences between PDF and PDF/A mean that there can be validation errors. JHOVE’s PDF module is certainly capable of validating PDF/A files but the feature does not work well. The process does not analyze the content of the data streams, meaning that it cannot validate PDF/A compliance in line with ISO standards. JHOVE is not suited to PDF/A validation, but there are currently no alternatives to JHOVE for validating standard PDFs.

JHOVE can still be useful, provided users understand its error reports and are aware of ways to resolve them. Even with these problems, JHOVE remains an excellent option for providing initial guidance.

[In our own institution, we have found JHOVE to be useful in identifying PDF files that have potential problems. Each problem for each source needs to be examined to decide if there is a preservation risk.]
