This is a report on phase 3 of the Filling the Digital Preservation Gap project. The project considers
how to incorporate digital preservation functionality into Research Data Management workflows.
- Phase 1: addressed the need for digital preservation as part of the research data management infrastructure
- Phase 2: covered practical steps to enhance the preservation system for research data
- Phase 3 has the following aims:
  - To establish proof of concept implementations of Archivematica at the Universities of Hull and York, integrated with other research data systems at each institution
  - To investigate the problem of unidentified research data file formats and consider practical steps for increasing the representation of research data formats in PRONOM
  - To continue to disseminate the outcomes of the project both nationally and internationally and to a variety of different audiences
"Preserving digital data isn’t solely reliant on the implementation of a digital preservation system, it is also necessary to think about related challenges that will be encountered and how they may be addressed." Work with file formats made two limitations clear: DROID does not look inside zip files, and not all files were assigned a file format identification. Of the 3752 files analysed at York, only 1382 (37%) were assigned a file format identification by DROID. A similar exercise at the University of Hull had quite different results, with 89% of files assigned an identification by DROID, while at Lancaster University the identification rate was 46%. Of the files, 70% were TIFF images. Among the files that were not automatically identified, files with no extension made up 26% of the total.
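The identification rates quoted above are simple proportions, and the "no extension" finding comes from tallying the unidentified files by extension. The following is a minimal sketch of both calculations in Python. It is not the project's own tooling: the function names and the sample file names are hypothetical, and only the York counts (1382 of 3752) are taken from the text.

```python
from collections import Counter
from pathlib import PurePath


def identification_rate(identified: int, total: int) -> float:
    """Return the percentage of files that received a format identification."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * identified / total


def tally_extensions(unidentified_paths):
    """Count unidentified files per lower-cased extension ('' means no extension)."""
    return Counter(PurePath(p).suffix.lower() for p in unidentified_paths)


# York figures quoted in the text: 1382 of 3752 files identified by DROID.
york_rate = identification_rate(1382, 3752)
print(f"York identification rate: {york_rate:.0f}%")

# Hypothetical unidentified file names, for illustration only.
sample = ["run01.dat", "run02.DAT", "README", "notes", "scan.xyz"]
counts = tally_extensions(sample)
for ext, n in counts.most_common():
    print(repr(ext), n)
```

A tally like this shows which extensions dominate the unidentified set, which is one way to decide where signature development effort might be most useful.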
"One possible solution to the file format problem as described would be to limit the types of files that would be accepted within the digital repository. This is a tried and tested approach for certain disciplines and data archives" and follows the NDSA level one recommendation to “... encourage use of a limited set of known open formats ...”. This may be a problem when preserving research data: researchers use a wide range of specialist hardware and software, so it will be "hard for the repository and research support staff to provide appropriate advice on suitable formats. For much of the data there will be no obvious preservation format for that data."
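To make the trade-off concrete, here is a minimal sketch of what limiting accepted file types could look like at deposit time. It assumes a purely extension-based check, and the allowlist and file names are hypothetical. A real repository would choose its own formats and would more likely rely on signature-based identification than on extensions.

```python
from pathlib import PurePath

# Hypothetical allowlist; not taken from the report or from any repository policy.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".tif", ".tiff", ".pdf"}


def partition_deposit(filenames, accepted=ACCEPTED_EXTENSIONS):
    """Split a deposit into files whose extension is on the allowlist and the rest."""
    accepted_files, rejected_files = [], []
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        if ext in accepted:
            accepted_files.append(name)
        else:
            rejected_files.append(name)
    return accepted_files, rejected_files


# Hypothetical deposit mixing common formats with specialist or extensionless files.
ok, rejected = partition_deposit(
    ["results.csv", "image.TIF", "spectrum.opj", "LOGFILE"]
)
print("accepted:", ok)
print("rejected:", rejected)
```

In this sketch the specialist and extensionless files are the ones turned away, which illustrates why a strict allowlist may suit some data archives but sits awkwardly with the range of formats researchers produce.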
The University of York encourages researchers (in training sessions and on webpages) to consider file formats throughout their project, and the longevity and accessibility of the formats they select, but the researcher decides what formats to deposit their data in. The university accepts these formats and will preserve them on a best efforts basis. "Understanding the file format moves us one step closer to preservation and reuse over the longer term." To help the research data community, the report's recommendations include:
- For data curators:
  - Greater engagement with researchers on the value and necessity of recognising and recording the file formats they will use/generate to inform effective data curation.
- For researchers:
  - Supply adequate metadata about submitted datasets. Clear and accurate metadata about file formats and hardware/software dependencies will aid file format identification and future preservation work.
  - Be open to sharing sample files for testing and to aid signature development where appropriate.
Appendix 2 contains a draft PCDM-based data model for datasets.