This post covers Rachel's work on identifying research data. It follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of research data and to inform their approaches to preservation. A summary of the statistics:
Of 24,705 files:
- 11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
- Of the identified files, 99.3% were given a single identification and 76 files had multiple identifications.
- 59 files had two possible identifications
- 13 had 3 identifications
- 4 had 4 possible identifications.
- 50 of the 76 files with multiple identifications were either 8-bit or 7-bit ASCII text files.
- The remaining 26 were identified by container as various types of Microsoft files.
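Counts like these can be tallied from a DROID CSV export. The following is a minimal sketch rather than the method used in the post: it assumes an export with one row per file and columns named `TYPE` and `FORMAT_COUNT`, and it treats a blank or zero `FORMAT_COUNT` as "not identified". The function name and the sample rows are hypothetical.

```python
from collections import Counter

def summarise_identification(rows):
    """Tally identified vs unidentified files and identifications per file.

    `rows` is an iterable of dicts, e.g. from csv.DictReader over a DROID export.
    Assumption: one row per file, with TYPE and FORMAT_COUNT columns.
    """
    total = 0
    per_file = Counter()  # number of identifications -> number of files
    for row in rows:
        if row.get("TYPE") == "Folder":
            continue  # folders are not files, so leave them out of the totals
        total += 1
        per_file[int(row.get("FORMAT_COUNT") or 0)] += 1
    unidentified = per_file[0]
    return {
        "total": total,
        "identified": total - unidentified,
        "unidentified": unidentified,
        "multiple": sum(n for k, n in per_file.items() if k > 1),
        "per_file": dict(per_file),
    }

# Hypothetical rows standing in for a real export
sample_rows = [
    {"TYPE": "Folder", "FORMAT_COUNT": ""},
    {"TYPE": "File", "FORMAT_COUNT": "1"},
    {"TYPE": "File", "FORMAT_COUNT": "2"},
    {"TYPE": "File", "FORMAT_COUNT": "0"},
    {"TYPE": "File", "FORMAT_COUNT": ""},
]
summary = summarise_identification(sample_rows)
```

For a real export, the rows could come from `csv.DictReader(open(path, newline=""))`, where `path` points at the exported file.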
Of the 11008 identified files:
- 89.34% were identified by signature
- 9.2% were identified by extension
- 1.46% were identified by container
- 68% (2505) by signature
- 27.5% (1013) by extension
- 4.5% (161) by container
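A by-method breakdown of this kind can be produced by counting the identification method recorded for each identified file. This sketch assumes a `METHOD` column holding values such as "Signature", "Extension" or "Container", and it skips rows where that column is blank. The function name and sample rows are hypothetical.

```python
from collections import Counter

def method_breakdown(rows):
    """Return {method: (count, percentage)} for rows with a recorded METHOD.

    Assumption: unidentified files and folders have a blank METHOD value.
    """
    methods = Counter()
    for row in rows:
        method = (row.get("METHOD") or "").strip().lower()
        if method:
            methods[method] += 1
    total = sum(methods.values())
    if total == 0:
        return {}
    return {m: (n, round(100 * n / total, 2)) for m, n in methods.items()}

# Hypothetical rows standing in for a real export
sample_methods = [
    {"METHOD": "Signature"},
    {"METHOD": "Signature"},
    {"METHOD": "Signature"},
    {"METHOD": "Extension"},
    {"METHOD": ""},
]
breakdown = method_breakdown(sample_methods)
```

Percentages are calculated against identified files only, which matches how the figures in the lists above are expressed.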
Only 38 files (0.3%) were identified as having a file extension mismatch, but closer inspection may reveal more. Most of these were Microsoft files with multiple identifications, along with a set of lsm files identified as TIFFs.
In all, 59 different file formats were identified. GZIP was the most frequently occurring, followed by XML.
Files that weren't identified
- There were 13697 files not identified by DROID of which 4947 (36%) had file extensions.
- The remaining 64% had no file extension.
- Top counts of unidentified file extensions: dat, data, cell, param,
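A list of the most common unidentified extensions can be sketched in the same way. This example assumes `TYPE`, `NAME` and `PUID` columns, treats a blank `PUID` as "not identified", and takes the extension from the file name. The function name, the "(none)" label and the sample rows are hypothetical.

```python
import os
from collections import Counter

def top_unidentified_extensions(rows, n=10):
    """Count file extensions among rows that have no PUID.

    Files with no extension are grouped under the label "(none)".
    """
    exts = Counter()
    for row in rows:
        if row.get("TYPE") == "Folder":
            continue  # only files are of interest
        if (row.get("PUID") or "").strip():
            continue  # already identified
        ext = os.path.splitext(row.get("NAME") or "")[1].lstrip(".").lower()
        exts[ext or "(none)"] += 1
    return exts.most_common(n)

# Hypothetical rows standing in for a real export
sample_files = [
    {"TYPE": "File", "NAME": "run1.dat", "PUID": ""},
    {"TYPE": "File", "NAME": "run2.DAT", "PUID": ""},
    {"TYPE": "File", "NAME": "model.param", "PUID": ""},
    {"TYPE": "File", "NAME": "README", "PUID": ""},
    {"TYPE": "File", "NAME": "notes.xml", "PUID": "fmt/101"},
]
top_exts = top_unidentified_extensions(sample_files)
```

Lower-casing the extension means that `dat` and `DAT` are counted together, which may or may not be wanted depending on how the data was created.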