Friday, June 24, 2016

File-format analysis tools for archivists

File-format analysis tools for archivists. Gary McGath. LWN. May 26, 2016.
     Preserving files for the long term is more difficult than just copying them to a drive; other issues are involved. "Will the software of the future be able to read the files of today without losing information? If it can, will people be able to tell what those files contain and where they came from?"

Digital data is more problematic than analog materials, since file formats change. Detailed tools can check the quality of digital documents, analyze the files and report problems. Some concerns:

  • Exact format identification: Knowing the MIME type isn't enough.
  • Format durability: Software can fade into obsolescence if there isn't enough interest to keep it updated.
  • Strict validation: An archive accepts files in order to give them to an audience that doesn't even exist yet. This means it should be conservative in what it accepts.
  • Metadata extraction: A file with a lot of identifying metadata, such as XMP or Exif, is a better candidate for an archive than one with very little. An archive adds a lot of value if it makes rich, searchable metadata available.
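
To illustrate the first concern, here is a minimal sketch of identifying a format from a file's leading bytes rather than from its MIME type. It is not how JHOVE or any other tool named in this post works; the function name sniff_format and the small signature table are assumptions made for illustration, and the table covers only a handful of well-known signatures.

```python
import re

# A few well-known leading-byte signatures (an illustrative subset, not a registry).
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF 87a"),
    (b"GIF89a", "GIF 89a"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]

# A PDF header declares its version, e.g. "%PDF-1.4".
_PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def sniff_format(data: bytes) -> str:
    """Return a format label more specific than a bare MIME type (hypothetical helper)."""
    pdf = _PDF_HEADER.match(data)
    if pdf:
        return "PDF " + pdf.group(1).decode("ascii")
    for magic, label in _SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Two files that would both be labeled application/pdf can come back as "PDF 1.4" and "PDF 1.7" here. The header is only the file's own claim about itself, so a real validator has to check far more than this.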
Some open-source applications address these concerns, such as:
  • JHOVE (JSTOR-Harvard Object Validation Environment)
  • ExifTool
  • FITS (File Information Tool Set)
"Identifying formats and characterizing files is a tricky business. Specifications are sometimes ambiguous."  There are different views on how much error, if any, is acceptable. "Being too fussy can ban perfectly usable files from archives."

"Specialists are passionate about the answers, and there often isn't one clearly correct answer. It's not surprising that different tools with different philosophies compete, and that the best approach can be to combine and compare their outputs."
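
As a rough sketch of what combining and comparing outputs could look like, the function below takes format labels that several tools have already reported for one file and summarizes where they agree. It does not run any real tool; compare_reports, the tool names, and the labels in the usage lines are all hypothetical, and simple majority voting is just one possible policy.

```python
from collections import Counter


def compare_reports(reports: dict[str, str]) -> dict:
    """Summarize agreement among per-tool format labels for one file (hypothetical helper)."""
    if not reports:
        raise ValueError("need at least one report")
    counts = Counter(reports.values())
    _, top_votes = counts.most_common(1)[0]
    # If several labels share the top vote count, there is no consensus.
    leaders = [label for label, votes in counts.items() if votes == top_votes]
    consensus = leaders[0] if len(leaders) == 1 else None
    # With no consensus, every tool is listed as a dissenter.
    dissenters = sorted(tool for tool, label in reports.items() if label != consensus)
    return {
        "consensus": consensus,
        "unanimous": len(counts) == 1,
        "dissenters": dissenters,
    }


# Hypothetical tool names and labels, for illustration only.
summary = compare_reports(
    {"tool_a": "PDF 1.4", "tool_b": "PDF 1.4", "tool_c": "PDF/A-1b"}
)
```

Here summary names "PDF 1.4" as the consensus and lists tool_c as the dissenter. The disagreement is worth keeping rather than discarding, since a dissenting tool may be the one that is right.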
