The output of format identification needs to be assessed within an institutional context, taking provenance information into account, to determine what action to take. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; the authors themselves use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about digital objects as possible before ingest. In everyday practice, however, issues arise, such as:
- how to proceed without compromising preservation options
- how to make efforts scalable
- issues with different types of data
- issues related to the tool's internal logic
- metadata extraction, which is also format-related
Ideally, format identification should yield reliable and unambiguous information on the format of a given file; in practice, a number of problems complicate the process. Handling files on an individual basis does not scale well, which may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
- Usability: can the file be used as expected with standard software?
- Tool errors: is an error known to be tool-related?
- Understanding: is the error actually understood?
- Seriousness: does the error concern the format's significant properties?
- Correctability: is there a documented solution to the error?
- Risk of correcting: what risks are associated with correcting the error?
- Effort: what effort is required to correct the error?
- Authenticity: is the file’s authenticity more relevant than format identification?
- Provenance: can the data producer help resolve this and future errors?
- Intended preservation: what solution is acceptable for shorter preservation periods?
- Should format identification be handled at ingest or as a pre-ingest activity?
- How should measures taken to resolve identified problems be documented?
- Can unknown formats be admitted to the archive?
- Should the format identification be re-checked later?
- Do we rely on PRONOM or do we need local registries?
- How can formats be preserved for which no applications exist?
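One way to keep the effort scalable is to assess classes of errors rather than individual files. The following is a minimal sketch, not the authors' workflow: it assumes that tool output has already been parsed into simple records, and the names `Finding`, `group_findings`, and `largest_groups`, as well as the sample identifiers and messages, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One identification or validation result for one file (hypothetical record)."""
    path: str
    puid: str      # format identifier as reported by the identification tool
    tool: str      # name of the tool that reported the message
    message: str   # error or warning text, copied verbatim from the tool


def group_findings(findings):
    """Group file paths by (format, tool, message) so each class is assessed once."""
    groups = defaultdict(list)
    for finding in findings:
        groups[(finding.puid, finding.tool, finding.message)].append(finding.path)
    return dict(groups)


def largest_groups(groups, limit=10):
    """Return the error classes that affect the most files, largest first."""
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Illustrative values only; identifiers and messages are made up for the sketch.
    sample = [
        Finding("a.pdf", "fmt/18", "validator", "example message 1"),
        Finding("b.pdf", "fmt/18", "validator", "example message 1"),
        Finding("c.tif", "fmt/353", "validator", "example message 2"),
    ]
    for key, paths in largest_groups(group_findings(sample)):
        print(len(paths), key)
```

Each resulting group can then be checked once against the criteria above (usability, seriousness, correctability, and so on), and the decision recorded for the whole group.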
A failure to extract significant properties has no immediate consequences, so institutions need to decide whether to correct such issues at all. This applies especially when the metadata is embedded and the file itself must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.
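Where a correction is attempted at all, one cautious pattern is to alter only a working copy and to record checksums so that the change is documented and the original remains verifiably untouched. The sketch below illustrates this under stated assumptions: `correct_on_copy` and the `fix` callback are hypothetical names, the working directory is assumed to differ from the original's directory, and the actual repair step is left to whatever tool an institution trusts.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def correct_on_copy(original, workdir, fix):
    """Apply `fix` to a copy of `original` and return a record of the action.

    `fix` is a hypothetical callable that takes the path of the copy and
    modifies it in place; the original file is never passed to it.
    """
    original = Path(original)
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / original.name
    shutil.copy2(original, copy)       # copy content and file timestamps
    before = sha256_of(original)
    fix(copy)                          # only the copy is altered
    return {
        "original": str(original),
        "corrected": str(copy),
        "sha256_original": before,
        "sha256_corrected": sha256_of(copy),
        "original_untouched": sha256_of(original) == before,
    }
```

The returned record could be stored with the object's provenance metadata. Note that checksums only show that a file changed, not whether the change was limited to the intended correction; that still requires re-validation or comparison of the extracted properties.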