Showing posts with label DROID. Show all posts

Tuesday, November 22, 2016

Every little bit helps: File format identification at Lancaster University

Every little bit helps: File format identification at Lancaster University.  Rachel MacGregor. Digital Archiving at the University of York. 21 November 2016
   The post describes Rachel's work on identifying research data file formats. It follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of research data and to inform their approaches to preservation. The summary of the statistics:
Of 24,705 files: 

  • 11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
  • 99.3% were given one file identification and 76 files had multiple identifications. 
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications. 
    • 50 of these were either 8-bit or 7-bit ASCII text files.  
    • The remaining 26 were identified by container as various types of Microsoft files.

Of the 11008 identified files:

  • 89.34% were identified by signature
  • 9.2% were identified by extension
  • 1.46% identified by container
When adjusted for the 7,000 gzip files, the percentages identified were:
  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
These results differed from York's results, though not dramatically.
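A breakdown like the one above can be tallied from a DROID CSV export. The following is only a minimal sketch: it assumes the export has TYPE, METHOD and PUID columns, and the sample rows and the function name `method_breakdown` are made up for illustration, not taken from the Lancaster data.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a DROID CSV export; real exports carry more columns.
SAMPLE = """NAME,TYPE,METHOD,PUID
a.gz,File,Signature,x-fmt/266
b.xml,File,Signature,fmt/101
c.csv,File,Extension,x-fmt/18
d.docx,File,Container,fmt/412
e.param,File,,
"""

def method_breakdown(csv_text):
    """Return {method: (count, percent of identified files)}."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["TYPE"] == "File"]
    # Only rows with a PUID count as identified; the rest are ignored here.
    methods = Counter(r["METHOD"] for r in rows if r["PUID"])
    identified = sum(methods.values())
    return {m: (n, round(100 * n / identified, 1)) for m, n in methods.items()}

print(method_breakdown(SAMPLE))
```

Percentages are taken over identified files only, which matches how the figures above are reported.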

Only 38 files (0.3%) were identified as having a file extension mismatch, but closer inspection may reveal more. Most of these were Microsoft files with multiple identifications, along with a set of lsm files identified as TIFFs.

In all, 59 different file formats were identified. GZIP was the most frequently occurring, followed by XML.

Files that weren't identified
  • Of the 13697 files not identified by DROID, 4947 (36%) had file extensions.
  • The remaining 64% had no file extension.
  • Top counts of unidentified file extensions: dat, data, cell, param,
Gathering this information helps contribute towards our overall understanding of file format types. "Every little bit helps."
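Counting the extensions of unidentified files, as in the list above, is straightforward once the DROID export is loaded as rows. This is a sketch under assumptions: rows are plain dictionaries with NAME, TYPE and PUID keys, an empty PUID means "not identified", and the sample file names are invented.

```python
from collections import Counter
from pathlib import PurePath

def unidentified_extensions(rows):
    """Count extensions among files with no PUID; '' means no extension."""
    exts = Counter()
    for row in rows:
        if row["TYPE"] == "File" and not row["PUID"]:
            # Normalise case so .DAT and .dat are counted together.
            exts[PurePath(row["NAME"]).suffix.lstrip(".").lower()] += 1
    return exts

# Hypothetical rows standing in for a parsed DROID export.
rows = [
    {"NAME": "run1.dat", "TYPE": "File", "PUID": ""},
    {"NAME": "run2.DAT", "TYPE": "File", "PUID": ""},
    {"NAME": "lattice.cell", "TYPE": "File", "PUID": ""},
    {"NAME": "README", "TYPE": "File", "PUID": ""},
    {"NAME": "notes.xml", "TYPE": "File", "PUID": "fmt/101"},
]

counts = unidentified_extensions(rows)
print(counts.most_common())
```

The resulting counts give a starting list of extensions to investigate for signature development.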

Wednesday, October 26, 2016

Research data is different

Research data is different. Simon Wilson. Digital Archiving blog. 5 August 2016.
     A blog post about some born-digital archives at Hull. The material is not academic research data but comes from a variety of sources. By using DROID to look at 270,867 accessioned files they discovered the following:
  • 97.96% of files were identified by DROID 
  • 228 different format types were identified 
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%).  
  • The top formats they found were:
    Microsoft Word Document (97-2003)                44.52%
    Microsoft Word for Windows (2007 and later)       5.63%
    Microsoft Excel 97 Workbook                       5.08%
    Graphics Interchange Format                       4.15%
    Acrobat PDF 1.4 - Portable Document Format        3.12%
    JPEG File Interchange Format (1.01)               2.72%
    Microsoft Word Document (6.0 / 95)                2.46%
    Acrobat PDF 1.3 - Portable Document Format        2.39%
    JPEG File Interchange Format (1.02)               1.83%
    Hypertext Markup Language (v4)                    1.67%
 The number and types of formats they found in their collections differed from those at other institutions that hold research data. An important next step is to look at the identified file formats and determine a migration strategy for each. Knowing the number and frequency of the formats in the collections will allow efforts to be prioritized.
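One way to use format frequencies for prioritization is to ask how many of the most common formats must be handled to cover a given share of the collection. The sketch below uses the top-ten percentages listed in the post; the function name `formats_to_cover` and the 60% target are illustrative assumptions.

```python
# Top formats and their share of files (percent), as listed in the post.
TOP_FORMATS = [
    ("Microsoft Word Document (97-2003)", 44.52),
    ("Microsoft Word for Windows (2007 and later)", 5.63),
    ("Microsoft Excel 97 Workbook", 5.08),
    ("Graphics Interchange Format", 4.15),
    ("Acrobat PDF 1.4 - Portable Document Format", 3.12),
    ("JPEG File Interchange Format (1.01)", 2.72),
    ("Microsoft Word Document (6.0 / 95)", 2.46),
    ("Acrobat PDF 1.3 - Portable Document Format", 2.39),
    ("JPEG File Interchange Format (1.02)", 1.83),
    ("Hypertext Markup Language (v4)", 1.67),
]

def formats_to_cover(shares, target_percent):
    """Return the most common formats needed to reach target_percent, and the share covered."""
    chosen, covered = [], 0.0
    for name, share in sorted(shares, key=lambda item: item[1], reverse=True):
        if covered >= target_percent:
            break
        chosen.append(name)
        covered += share
    return chosen, covered

chosen, covered = formats_to_cover(TOP_FORMATS, 60)
print(len(chosen), "formats cover", round(covered, 2), "percent of files")

top_ten_total = sum(share for _, share in TOP_FORMATS)
print("top ten together:", round(top_ten_total, 2), "percent")
```

If the target cannot be reached with the formats supplied, the function simply returns all of them together with the share they cover.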


Wednesday, October 19, 2016

Filling the Digital Preservation Gap. Phase Three report - October 2016.

Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase Three report - October 2016. 19 October 2016. Jenny Mitcham, et al. [PDF]
     This is a report of phase 3 of the Filling the Digital Preservation Gap project.  It is important to consider how we incorporate digital preservation functionality into our Research Data Management workflows.
  • Phase 1: addressed the need for digital preservation as part of the research data management infrastructure
  • Phase 2: practical steps to enhance their preservation system for research data 
  • Phase 3 has the following aims:
    • To establish proof of concept implementations of Archivematica at the Universities of Hull and York, integrated with other research data systems at each institution
    • To investigate the problem of unidentified research data file formats and consider practical steps for increasing the representation of research data formats in PRONOM
    • To continue to disseminate the outcomes of the project both nationally and internationally and to a variety of different audiences

"Preserving digital data isn’t solely reliant on the implementation of a digital preservation system, it is also necessary to think about related challenges that will be encountered and how they may be addressed."  In working with formats it became clear that DROID does not look inside zip files, and that not all files were assigned a file format identification. Of the 3752 files analysed at York, only 1382 (37%) were assigned a file format identification by DROID. At the University of Hull a similar exercise had quite different results, with 89% of files assigned an identification by DROID. At Lancaster University the identification rate was 46%. Of the files, 70% were TIFF images. Of the files that were not automatically identified, files with no extension made up 26% of the total.

"One possible solution to the file format problem as described would be to limit the types of files that would be accepted within the digital repository. This is a tried and tested approach for certain disciplines and data archives" and follows the NDSA level one recommendations, to “... encourage use of a limited set of known open formats ...”. This may be a problem with preserving research data, since researchers use a wide range of specialist hardware and software and it will be "hard for the repository and research support staff to provide appropriate advice on suitable formats. For much of the data there will be no obvious preservation format for that data."

The University of York encourages researchers (in training sessions and webpages) to consider file formats throughout their project, and the longevity and accessibility of the formats they select, but the researcher decides what formats to deposit their data in. The university accepts these formats and will preserve them on a best efforts basis. "Understanding the file format moves us one step closer to preservation and reuse over the longer term." To help the research data community, their recommendations include:
  • For data curators: 
    • Greater engagement with researchers on the value and necessity of recognising and recording the file formats they will use/generate to inform effective data curation.
  • For researchers:
    • Supply adequate metadata about submitted datasets. Clear and accurate metadata about file formats and hardware/software dependencies will aid file format identification and future preservation work. 
    • Be open to sharing sample files for testing and to aid signature development where appropriate.

Appendix 2 contains a draft PCDM-based data model for datasets.


Tuesday, August 25, 2015

Hero or Villain? A Tool to Create a Digital Preservation Rogues Gallery

Hero or Villain? A Tool to Create a Digital Preservation Rogues Gallery. Ross Spencer. Open Preservation Foundation blog. 25 Aug 2015.
     The tool, droid-sqlite-analysis, will create a 'rogues gallery' out of any digital collection for which you have a DROID report. This identifies files that pose a digital preservation risk. It can also be used to:
  • enable users to work on copies of content that requires immediate attention
  • clone the directory structures (context) containing rogue content  
  • provide ingest and delivery of a 'clean' collection independent of a rogues collection to promote immediate access while file format issues are worked on in an isolated treatment environment
  • create working copies of only those files of immediate interest
  • reduce collection complexities and issues to show patterns in collection
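The idea of separating 'rogue' content from a 'clean' collection can be sketched as below. This is not the logic of droid-sqlite-analysis itself: the risk criteria, the function names, and the use of plain dictionaries with PUID, METHOD, EXTENSION_MISMATCH and FORMAT_COUNT keys are all assumptions made for illustration.

```python
def rogue_reasons(row):
    """Return the (assumed) reasons a DROID row might be treated as a rogue."""
    reasons = []
    if not row.get("PUID"):
        reasons.append("unidentified")
    if row.get("METHOD") == "Extension":
        reasons.append("extension-only identification")
    if row.get("EXTENSION_MISMATCH", "").lower() == "true":
        reasons.append("extension mismatch")
    if int(row.get("FORMAT_COUNT") or 0) > 1:
        reasons.append("multiple identifications")
    return reasons

def split_collection(rows):
    """Split file rows into (rogues, clean); rogues keep their reasons."""
    rogues, clean = [], []
    for row in rows:
        reasons = rogue_reasons(row)
        if reasons:
            rogues.append((row["FILE_PATH"], reasons))
        else:
            clean.append(row["FILE_PATH"])
    return rogues, clean

# Hypothetical rows standing in for a DROID report.
rows = [
    {"FILE_PATH": "a/report.doc", "PUID": "fmt/40", "METHOD": "Signature",
     "EXTENSION_MISMATCH": "false", "FORMAT_COUNT": "1"},
    {"FILE_PATH": "a/scan.lsm", "PUID": "fmt/353", "METHOD": "Signature",
     "EXTENSION_MISMATCH": "true", "FORMAT_COUNT": "1"},
    {"FILE_PATH": "a/run.param", "PUID": "", "METHOD": "",
     "EXTENSION_MISMATCH": "false", "FORMAT_COUNT": "0"},
]

rogues, clean = split_collection(rows)
print(rogues)
print(clean)
```

Keeping the reasons alongside each path makes it easier to see patterns in the rogue set before any files are copied or moved.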


Tuesday, July 28, 2015

detect-empty-folders

detect-empty-folders. Ross Spencer. Github. 22 July 2015.
A tool to detect empty folders in a DROID CSV. A blacklist allows you to simulate the deletion of non-record objects, which may render a folder empty.  The heuristics used here can be implemented in any language; this tool is in Python.
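A heuristic of this kind might look like the sketch below. It is not the code of detect-empty-folders; it assumes DROID rows with ID, PARENT_ID, TYPE, NAME and FILE_PATH values, and the ignore list of non-record file names is only an example.

```python
def empty_folders(rows, ignore=frozenset({"Thumbs.db", ".DS_Store"})):
    """Return paths of folders that hold no records once ignored names are dropped."""
    folders = {r["ID"]: r for r in rows if r["TYPE"] == "Folder"}
    has_content = set()
    for r in rows:
        if r["TYPE"] == "Folder" or r["NAME"] in ignore:
            continue
        parent = r["PARENT_ID"]
        # Mark every ancestor folder as holding at least one record.
        while parent in folders and parent not in has_content:
            has_content.add(parent)
            parent = folders[parent]["PARENT_ID"]
    return sorted(folders[i]["FILE_PATH"] for i in folders if i not in has_content)

# Hypothetical rows standing in for a parsed DROID CSV.
rows = [
    {"ID": "1", "PARENT_ID": "", "TYPE": "Folder", "NAME": "root", "FILE_PATH": "root"},
    {"ID": "2", "PARENT_ID": "1", "TYPE": "Folder", "NAME": "docs", "FILE_PATH": "root/docs"},
    {"ID": "3", "PARENT_ID": "2", "TYPE": "File", "NAME": "report.doc", "FILE_PATH": "root/docs/report.doc"},
    {"ID": "4", "PARENT_ID": "1", "TYPE": "Folder", "NAME": "cache", "FILE_PATH": "root/cache"},
    {"ID": "5", "PARENT_ID": "4", "TYPE": "File", "NAME": "Thumbs.db", "FILE_PATH": "root/cache/Thumbs.db"},
    {"ID": "6", "PARENT_ID": "1", "TYPE": "Folder", "NAME": "old", "FILE_PATH": "root/old"},
]

result = empty_folders(rows)
print(result)
```

In this sketch a folder that contains only empty subfolders is itself reported as empty, and ignored files simulate deletion without touching anything on disk.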

Monday, July 20, 2015

File identification tools, part 7: Apache Tika

File identification tools, part 7: Apache Tika. Gary McGath. Mad File Format Science Blog.  July 1, 2015.
     Apache Tika is a Java-based open source toolkit that can identify a wide range of formats and extract metadata from others. It doesn’t distinguish variants as much as DROID. Plugins can be added for formats that it does not regularly support.

Friday, June 12, 2015

File identification tools, part 3: DROID and PRONOM

File identification tools, part 3: DROID and PRONOM. Gary McGath. File Formats Blog.  June 1, 2015.
DROID (Digital Record Object IDentification) is an open source, Java-based tool from the UK National Archives that is designed to identify and verify files for digital repositories.  It relies on file format information from the National Archives' format registry, PRONOM. "DROID depends on files that describe distinctive data values for each format". It can verify single files or large batches of files, or it can be integrated into other applications. DROID generates reports on the files and their identification and verification, and reports when it cannot match a file's type. Sometimes it may report that a file has more than one matching signature, for example when there is more than one version of a format.
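The general idea of matching "distinctive data values" can be shown with a few well-known magic numbers. This is a greatly simplified sketch, not DROID's implementation or PRONOM's signature syntax; the `SIGNATURES` table and `identify` function are illustrative names.

```python
# (format name, byte offset, expected bytes): a tiny hand-made signature table.
SIGNATURES = [
    ("GZIP", 0, b"\x1f\x8b"),
    ("PDF", 0, b"%PDF-"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("TIFF (little-endian)", 0, b"II*\x00"),
    ("TIFF (big-endian)", 0, b"MM\x00*"),
]

def identify(data):
    """Return every format whose signature matches; an empty list means unidentified."""
    return [
        name
        for name, offset, magic in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]

print(identify(b"%PDF-1.4\n%..."))
print(identify(b"just some plain text"))
```

Returning a list rather than a single name leaves room for the cases described above: no match at all, or more than one matching signature.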


Saturday, February 07, 2015

Digital Tools and Apps

Digital Tools and Apps. Chris Erickson. Presentation for ULA. 2014. [PDF]
This is a presentation I created for ULA to briefly outline a few tools that I find helpful. There are many useful tools, and more are being created all the time. Here are a few that I use.
  • Copy & Transfer Tools: WinSCP; Teracopy
  • Rename Tools: Bulk Rename Utility
  • Integrity & Fixity Tools: MD5Summer; MD5sums 1.2; Quick Hash; Hash Tool
  • File Editing Tools: Babelpad; Notepad++; XML Notepad
    • ExifTool; BWF MetaEdit; BWAV Reader
  • File Format Tools: DROID
  • File Conversion: Calibre; Adobe Portfolio
  • Others: A whole list of other tools that I use or suggest you look at.
    •  PDF/A tools
    • Email tools
 Please let me know what tools you find helpful.