Tuesday, November 22, 2016

Every little bit helps: File format identification at Lancaster University

Every little bit helps: File format identification at Lancaster University.  Rachel MacGregor. Digital Archiving at the University of York. 21 November 2016
   The post describes Rachel's work on identifying the file formats of research data. It follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of the research data and to inform approaches to preservation. A summary of the statistics:
Of 24,705 files: 

  • 11008 (44.5%) were identified by DROID and 13697 (55.5%) were not.
  • Of the identified files, 99.3% were given a single identification and 76 files had multiple identifications:
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications. 
    • Of the 76 files with multiple identifications, 50 were either 8-bit or 7-bit ASCII text files.
    • The remaining 26 were identified by container as various types of Microsoft files.
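
Counts like these can be tallied from an identification report with a few lines of Python. The following is only a minimal sketch, not the method used in the original post: it assumes each file is represented as a dict with a hypothetical "FORMAT_COUNT" key giving the number of formats matched, where 0 or an empty value means the file was not identified.

```python
from collections import Counter

def summarise_identification(rows):
    """Count identified/unidentified files and tally multiple identifications.

    Assumption: each row is a dict with a hypothetical "FORMAT_COUNT" key
    holding the number of formats matched (0 or empty means unidentified).
    """
    identified = 0
    unidentified = 0
    multiple = Counter()  # number of identifications -> number of files
    for row in rows:
        count = int(row.get("FORMAT_COUNT") or 0)
        if count == 0:
            unidentified += 1
        else:
            identified += 1
            if count > 1:
                multiple[count] += 1
    total = identified + unidentified
    pct_identified = round(100 * identified / total, 1) if total else 0.0
    return {
        "total": total,
        "identified": identified,
        "unidentified": unidentified,
        "pct_identified": pct_identified,
        "multiple": dict(multiple),
    }

# Made-up sample rows, for illustration only.
sample = [
    {"FORMAT_COUNT": "1"},
    {"FORMAT_COUNT": "2"},
    {"FORMAT_COUNT": "0"},
    {"FORMAT_COUNT": ""},
]
summary = summarise_identification(sample)
```

Keeping the tally of multiple identifications separate from the identified count makes it easy to see how many files need a closer look.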

Of the 11008 identified files:

  • 89.34% were identified by signature
  • 9.2% were identified by extension
  • 1.46% were identified by container
With the 7,000 gzip files excluded, the numbers identified by each method were:
  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
These results differed from York's, but not dramatically.
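
The effect of one dominant format on the method breakdown can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than the calculation used in the original post: the "METHOD" and "FORMAT_NAME" keys, and the format names in the sample data, are hypothetical.

```python
from collections import Counter

def method_breakdown(rows, exclude_format=None):
    """Return {method: (count, percent)} for identified files.

    Assumption: rows are dicts with hypothetical "METHOD" and "FORMAT_NAME"
    keys. Rows whose format equals exclude_format are left out of the tally,
    as are rows with no method (unidentified files).
    """
    counts = Counter(
        row["METHOD"]
        for row in rows
        if row.get("METHOD") and row.get("FORMAT_NAME") != exclude_format
    )
    total = sum(counts.values())
    return {
        method: (n, round(100 * n / total, 1))
        for method, n in counts.items()
    }

# Made-up sample rows, for illustration only.
rows = [
    {"METHOD": "Signature", "FORMAT_NAME": "GZIP Format"},
    {"METHOD": "Signature", "FORMAT_NAME": "GZIP Format"},
    {"METHOD": "Signature", "FORMAT_NAME": "Plain Text"},
    {"METHOD": "Extension", "FORMAT_NAME": "XML"},
]
all_files = method_breakdown(rows)
without_gzip = method_breakdown(rows, exclude_format="GZIP Format")
```

In the sample, dropping the two gzip rows moves the signature share from 75% to 50%, which is the same kind of shift seen in the figures above.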

Only 38 files (0.3%) were flagged as having a file extension mismatch, but closer inspection may reveal more. Most of these were Microsoft files with multiple identifications, along with a set of lsm files identified as TIFFs.

In all, 59 different file formats were identified. GZIP was the most frequently occurring, followed by XML.

Files that weren't identified
  • DROID did not identify 13697 files, of which 4947 (36%) had file extensions.
  • The remaining 64% had no file extension.
  • The most common extensions among the unidentified files: dat, data, cell, param
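
A tally of extensions among unidentified files can be sketched as follows. This is a minimal example that assumes the input is simply a list of file names for the unidentified files; the function name and sample names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath

def unidentified_extensions(names, top=4):
    """Return (most common extensions, number of files with no extension).

    Assumption: names is a list of file names for files that were not
    identified. Extensions are compared case-insensitively.
    """
    extensions = Counter()
    no_extension = 0
    for name in names:
        suffix = PurePath(name).suffix.lower().lstrip(".")
        if suffix:
            extensions[suffix] += 1
        else:
            no_extension += 1
    return extensions.most_common(top), no_extension

# Made-up file names, for illustration only.
names = ["run1.dat", "run2.DAT", "mesh.cell", "README", "output"]
top_extensions, missing = unidentified_extensions(names)
```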
Gathering this information helps contribute towards our overall understanding of file format types. "Every little bit helps."
