
Thursday, November 29, 2018

The File Discovery Tool - A simple tool to gather file and filepath information and prepare it for ingest into our Rosetta Digital Archive


The File Discovery Tool. Chris Erickson. Brigham Young University. November 29, 2018.
     We have created a File Discovery Tool that analyzes directories of objects and prepares a spreadsheet of all the files it discovers for preservation and ingest. This spreadsheet allows the curators to review and work with the materials, select those that need to be preserved, and then add collection and other metadata. The tool fits our workflow, but the source code may be useful for others trying to accomplish a similar task.

A sample command to run the tool:
>> java -jar FileDiscovery.jar [path name of files to check] [output path name for saving the report]
>> java -jar C:\FileDiscovery\FileDiscovery.jar "R:\test\objects"  C:\output\files
 The commands and syntax are outlined in a brief document: File Discovery Outline
  
The spreadsheet that is created has the following column headings:
 FILENAME, ITEM ID, FILEPATH, BYTESIZE, SIZE, COLLECTION, IE_LEVEL, DATE_CREATED, DATE_MODIFIED, TITLE, CREATOR, DESCRIPTION, RIGHTS_POLICY
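As an illustration of the kind of directory walk behind such a report, here is a minimal, hypothetical Java sketch (not the actual FileDiscovery source; the class name and the subset of columns are our own choices) that lists the files under a root folder and prints a few of these columns as CSV rows. Presumably the remaining columns (COLLECTION, TITLE, RIGHTS_POLICY, and so on) are left for curators to fill in, as described below.

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.stream.Stream;

    public class FileListingSketch {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get(args[0]);  // directory of objects to analyze
            System.out.println("FILENAME,FILEPATH,BYTESIZE,DATE_CREATED,DATE_MODIFIED");
            try (Stream<Path> paths = Files.walk(root)) {
                paths.filter(Files::isRegularFile).forEach(p -> {
                    try {
                        BasicFileAttributes a = Files.readAttributes(p, BasicFileAttributes.class);
                        // One row per discovered file (no CSV escaping; a real tool would quote values).
                        System.out.printf("%s,%s,%d,%s,%s%n",
                                p.getFileName(), p.toAbsolutePath(), a.size(),
                                a.creationTime(), a.lastModifiedTime());
                    } catch (IOException e) {
                        System.err.println("Could not read attributes for " + p);
                    }
                });
            }
        }
    }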

Metadata can be added as needed before ingesting the content into Rosetta.

The files and the metadata can then be submitted to Rosetta using the csv option in the Rosetta File Harvester tool by adding a second row of Dublin Core field names to map the columns. A standard template has been created to help prepare the file for ingest; it is found on the Resources page: Rosetta File Ingest template for Excel, or (PDF)
The source is available at https://bitbucket.org/byuhbll/filediscovery


The File Harvester tool - Our tool for ingesting content into our Rosetta Digital Archive


The File Harvester tool. Chris Erickson. Brigham Young University. November 29, 2018. 
     We have created a tool for harvesting, processing, and submitting content to Rosetta. Our Library IT department has made it open source. The tool fits our workflow, but the source code may be useful for others trying to accomplish a similar task.

The File Harvester tool gathers content from several different sources:
  • Our hosted CONTENTdm (cdm)
  • Open Journal System (ojs)
  • Internet Archive (ia)
  • Unstructured files in a folder with metadata in a spreadsheet (csv)
The tool creates SIPs by gathering objects and metadata from the specified source, creating a Rosetta METS XML file and a Dublin Core XML file, and arranging them in the structure our Rosetta system expects. The objects can either be on the hosted system or in a source folder. The harvest tool can also submit the content to Rosetta for ingest.

The structure is (a minimal sketch of building this layout follows the list):
  1. Folder: collection-itemid, which contains dc.xml and the sub-folder content
  2. Sub-folder: content, which contains mets.xml and the sub-folder streams
  3. Sub-folder: streams, which contains the file objects
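As a rough sketch of assembling that layout in Java (hypothetical code, not the harvester itself; the XML strings are placeholders supplied by the caller):

    import java.io.IOException;
    import java.nio.file.*;

    public class SipLayoutSketch {
        // Builds collection-itemid/content/streams with dc.xml and mets.xml placeholders.
        static void buildSip(Path sipRoot, String collection, String itemId,
                             String dcXml, String metsXml, Path sourceFile) throws IOException {
            Path top = sipRoot.resolve(collection + "-" + itemId);
            Path content = top.resolve("content");
            Path streams = content.resolve("streams");
            Files.createDirectories(streams);                         // creates all three levels

            Files.writeString(top.resolve("dc.xml"), dcXml);          // Dublin Core metadata
            Files.writeString(content.resolve("mets.xml"), metsXml);  // Rosetta METS
            // Copy the file object into the streams folder (repeat for multi-file objects).
            Files.copy(sourceFile, streams.resolve(sourceFile.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }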
The commands and syntax are outlined in a brief document on the Resources page:
Rosetta File Harvester outline

The source is available at: https://bitbucket.org/byuhbll/rosetta-tools


Tuesday, November 20, 2018

Audiovisual Metadata Platform Planning Project: Progress Report and Next Steps

Audiovisual Metadata Platform (AMP) Planning Project: Progress Report and Next Steps. Jon W. Dunn, et al. Indiana University. March 28, 2018.
     This is a report of a workshop which was part of a planning project for design and development of an audiovisual metadata platform. "The platform will perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives." 

Libraries and archives hold massive collections of audiovisual recordings from a diverse range of timeframes, cultures, and contexts that are of great interest across many disciplines and communities. Galleries, Libraries, Archives, and Museums (GLAM) face difficulty in creating access to their audiovisual collections, due to high costs, difficulty in managing the objects, and the lack of sufficiently granular metadata for audio/video content to support discovery, identification, and use. Text materials can use full-text indexing to provide some degree of discovery, but "without metadata detailing the content of the dynamic files, audiovisual materials cannot be located, used, and ultimately, understood". Metadata generation for audiovisual recordings relies almost entirely on manual description performed by experts in a variety of ways. The AMP will need to process audio and video files to extract metadata, and also accept and incorporate metadata from supplementary documents. One major challenge is processing and moving large files around, both in terms of time and bandwidth costs.

The report goes into depth on the AMP business requirements, some of which are:
  • Automate analysis of audiovisual content and human-generated metadata in a variety of formats to efficiently generate a rich set of searchable, textual attributes
  • Offer streamlined metadata creation by leveraging multiple, integrated, best-of-breed software tools in a single workflow
  • Produce and format metadata with minimal errors 
  • Build a community of developers in the cultural heritage community who can develop and support AMP on an ongoing basis 
  • Scale to efficiently process multi-terabyte batches of content 
  • Support collaborative efforts with similar initiatives
The following formats are possible sources for AMP processing:
  • Audio (.mp3, .wav) 
  • Image (.eps, .jpg, .pdf, .png, .tif) 
  • Data (.xlsx, .csv, .ttl, .json) 
  • Presentation (.key, .pptx) 
  • Video (.mov, .mp4, .mkv, .mts, .mxf) 
  • Structured text (.xml, with or without defined schemas, such as TEI, MODS, EAD, MARCXML) 
  • Unstructured text (.txt, .docx)
The report continues by looking at the Proposed System Architecture, functional requirements, and workflows.
Outcome: "The AMP workshop successfully gathered together a group of experts to talk about what would be needed to perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives. The workshop generated technical details regarding the software and computational components needed and ideas for tools to use and workflows to implement to make this platform a reality."

Friday, July 21, 2017

ePADD 4.0 Released

ePADD 4.0 Final. July 21, 2017.
    This is the latest release of ePADD, a software tool "developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives."

The software comprises four modules:
  1. Appraisal: Allows users to gather and review email archives.
  2. Processing: Tools to arrange and describe email archives.
  3. Discovery: Tools to share a view of email archives with users through web discovery.
  4. Delivery: Enables repositories to provide access within a reading room environment.
System Requirements:
  • OS: Windows 7 SP1 / 10, Mac OS X 10.10 / 10.11 
  • Memory: 8 GB RAM (4 GB RAM allocated to the application by default) 
  • Browser: Chrome 50/51, Firefox 47/48 
  • Windows installations: Java Runtime Environment 64-bit, 8u101 or later required
ePADD Installation and User Guide
ePADD Github website

Saturday, May 13, 2017

Design Requirements for Better Open Source Tools

OSS4Pres 2.0: Design Requirements for Better Open Source Tools. Heidi Elaine Kelly. bloggERS! April 25, 2017.
     Free and open-source software needs to "integrate easily with digital preservation institutional systems and processes." The FOSS Development Requirements Group created a design guide to ensure easier adoption of open-source tools and their integration with other software and tools.

Minimum Necessary Requirements for FOSS Digital Preservation Tool Development. The premise is that "digital preservation is an operating system-agnostic field."

Necessities
  • Provide publicly accessible documentation and an issue tracker
  • Have a documented process so people can contribute to development, report bugs, and suggest new documentation
  • Every tool should do the smallest possible task really well; if you are developing an end-to-end system, develop it in a modular way in keeping with this principle
  • Follow established standards and practices for development and use of the tool
  • Keep documentation up-to-date and versioned
  • Follow test-driven development philosophy
  • Don’t develop a tool without use cases, and stakeholders willing to validate those use cases
  • Use an open and permissive software license to allow for integrations and broader use
Recommendations
  • Have a mailing list or other means for community interaction
  • Establish community guidelines
  • Provide a well-documented mechanism for integration with other tools/systems
  • Provide functionality of the tool as a library, separating the UI from the actual functions (see the sketch after this list)
  • Package the tool in an easy-to-use way that supports any dependencies
  • Provide examples of functionality for potential users
  • Consider the long-term sustainability of the tool
  • Consider a way for internationalization of the tool  
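As a hypothetical illustration of the "functionality as a library" recommendation above, the core logic can live in a class with no UI code, and the command-line front end becomes a thin wrapper that a GUI or another system could bypass entirely:

    // Core functionality as a reusable library class: no console or GUI code here.
    public class FixityChecker {
        public boolean matches(String expectedChecksum, String actualChecksum) {
            return expectedChecksum.equalsIgnoreCase(actualChecksum);
        }
    }

    // Thin command-line wrapper; other tools can call FixityChecker directly instead.
    class FixityCheckerCli {
        public static void main(String[] args) {
            FixityChecker checker = new FixityChecker();
            boolean ok = checker.matches(args[0], args[1]);
            System.out.println(ok ? "MATCH" : "MISMATCH");
            System.exit(ok ? 0 : 1);
        }
    }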

Tuesday, May 09, 2017

Using Open-Source Tools to Fulfill Digital Preservation Requirements

OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements. Marty Gengenbach, et al. Code4Lib. 2016-10-25.
     Open-source software, such as LOCKSS, DSpace, and DROID, has played an increasingly prominent role in digital preservation. As the number and variety of such tools has increased, there has been a growing need among preservationists to assess how and when to adopt particular tools so that they can better support their institutions’ specific requirements and workflows. Open-source projects allow the user community to contribute by developing and documenting tools.

There are some challenges with adopting open-source tools:
  • Perceptions of instability:  One challenge is the perception that these tools are "inherently unstable and therefore present a risk". 
  • Resources and funding: Administrators often are reluctant to commit resources to an open source project. Funding problems can threaten the long-term sustainability of open source tools.
  • System updates: Open source tools require regular patches, updates, and upkeep. Without this, the tool would be outdated, and open to security holes. "The choice to maintain an unsupported version of a particular open-source tool simply because it meets (or has been customized to meet) an organization’s needs is problematic. For what an institution may stand to gain from this tool in terms of functionality and local integration, it may stand to lose in terms of the stability of a mainstream code release, the risk to information security, and the likelihood that the tool in question will become increasingly less functional and reliable as it ages".
  • Integration: Integrating open-source tools into institutional workflows can be a challenge, taking into account the software dependencies, system requirements, and local configuration needed to put the tools into a production environment. This can require considerable time and resources.
One of the possible benefits is that institutions can customize open-source tools for use within a specific context, but that comes with its own hurdles, such as reducing the ability to draw on the user community. The digital preservation open-source landscape has evolved from a scattered set of standalone tools into complex software environments. "Nevertheless, these tools still are not watertight." There are real concerns about open-source tools that can pose serious risks to collections.

Monday, April 10, 2017

Encoding and Wrapper Decisions and Implementation for Video Preservation Master Files

Encoding and Wrapper Decisions and Implementation for Video Preservation  Master Files. Mike Casey. Indiana University. March 27, 2017.
     "There is no consensus in the media preservation community on best practice for encoding and wrapping video preservation master files." Institutions preserving video files long term generally choose from three options:
  • 10-bit, uncompressed, v210 codec, usually with a QuickTime wrapper
  • JPEG 2000, mathematically lossless profile, usually with an MXF wrapper
  • FFV1, a mathematically lossless format, with an AVI or Matroska wrapper
The few institutions digitizing and preserving video for the long term are roughly evenly divided among the three options above. This report examines in detail a set of choices and an implementation that has worked well for their institution. Originally they chose the first option, but with recent advances in FFV1 they reopened this decision and initiated a research and review process:
  • Exit strategy research and testing
  • Capture research (use FFmpeg within their system to generate FFV1 files).
  • Comparison of issues
  • Consultation with an outside expert
Results: In researching exit strategies, they were able to transcode FFV1 files to another lossless codec with no loss of data. They decided to capture using FFmpeg, which required developing a simple capture tool, and they developed specifications for a minimal capture interface that uses FFmpeg for encoding and wrapping the video data.
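The report does not reproduce their capture command here, but as a rough illustration, an FFV1/Matroska transcode can be driven from Java by invoking FFmpeg; the specific encoder options shown are common community choices, not Indiana University's published settings:

    import java.io.IOException;

    public class Ffv1TranscodeSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            String input = args[0];   // e.g. an existing v210/QuickTime master
            String output = args[1];  // the .mkv extension selects the Matroska wrapper
            ProcessBuilder pb = new ProcessBuilder(
                    "ffmpeg", "-i", input,
                    "-c:v", "ffv1", "-level", "3",  // FFV1 version 3
                    "-g", "1",                      // every frame is a keyframe
                    "-slicecrc", "1",               // per-slice CRCs for corruption detection
                    "-c:a", "copy",                 // keep the audio stream as-is
                    output);
            pb.inheritIO();                         // show FFmpeg's progress in the console
            System.exit(pb.start().waitFor());
        }
    }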

Technical: They identified a number of key advantages of FFV1, including:
  • roughly 65% less data than a comparable file using the v210 codec
  • open source, non-proprietary, and hardware independent
  • largely designed for the requirements of digital preservation
  • employs CRCs for each frame, allowing any corruption to be associated with a much smaller digital area than the entire file
FFV1 appears to be "trending upwards among developers and cultural heritage organizations engaged in preservation work". They also chose the Matroska wrapper, which is an audiovisual container or wrapper format in use since 2002, and which is a more flexible wrapper option.

As more and more archives undertake video digitization, "they will not accept older and limited formats" (AVI or MOV), but will be looking for standards-based, open-source options developed specifically for archival preservation. "Both FFV1 and Matroska are open source and are more aligned with preservation needs than some of the other choices and we believe they will see rapidly increasing adoption and further development."

Implementation: They developed a quality control program that checks the FFV1/Matroska preservation master files and validates that the output meets their specification for long-term preservation. These files are viewed using the VLC media player, a free, open-source, cross-platform multimedia player that supports FFV1 and Matroska.

Currently, they have created over 38,000 video files using FFV1 and Matroska. "We have chosen two file formats that are open source, developed in part with preservation in mind, and on the road to standardization with tools in active development. We have aligned ourselves with the large and active FFmpeg community rather than a private company. While the future is ultimately unknowable, we believe that this positions us well for long-term preservation of video-based content."


Friday, April 07, 2017

How a Browser Extension Could Shake Up Academic Publishing

How a Browser Extension Could Shake Up Academic Publishing. Lindsay McKenzie. The Chronicle of Higher Education. April 06, 2017
     There are several open-access initiatives. One initiative, called Unpaywall, is just a browser extension. Unpaywall is an open-source, nonprofit organization "dedicated to improving access to scholarly research". It has created a browser extension that aims to do one thing really well: instantly deliver legal, open-access, full text as you browse. "When an Unpaywall user lands on the page of a research article, the software scours thousands of institutional repositories, preprint servers, and websites like PubMed Central to see if an open-access copy of the article is available. If it is, users can click a small green tab on the side of the screen to view a PDF." A legally uploaded open-access copy is delivered to users more than half the time.

"It’s the scientists who wrote the articles, it’s the scientists who uploaded them — we’re just doing that very small amount of work to connect what the scientists have done to the readers who need to read the science." Open-access papers have the information but don’t always look like the carefully formatted articles in academic journals. Some users might not feel comfortable citing preprints or open-access versions obtained through Unpaywall, "without the trappings and formatting of traditional paywalled publishing," even if the copy is credible.

Monday, March 06, 2017

Electric WAILs and Ham

Electric WAILs and Ham. John Berlin. Web Science and Digital Libraries Research Group. February 13, 2017.
     Web Archiving Integration Layer (WAIL) is a one-click configuration and utilization tool that fills the gap between institutional and individual archiving tools, running from a user's personal computer. Changing the tool from a Python application into an Electron application has brought with it many improvements, especially the ability to update and package it for Linux, macOS, and Windows.

WAIL is now collection-centric and provides users with the ability to curate personalized web archive collections, similar to Archive-It, but on their local machines. It also adds the ability to monitor and archive Twitter content automatically. WAIL is now available from the project's release page on Github.  More information about WAIL is available on their wiki.

Saturday, December 10, 2016

Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?

Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?  Yvonne Tunnat. Yvonne Tunnat's Blog. 29 Nov 2016.
     A post that describes an examination of the findings of two validation tools: JHOVE (version 1.14.6) and Bad Peggy (version 2.0), which scans image files for damage using the Java Image IO library. The goal of the test is to compare the findings from these validation tools and to know what to expect for digital curation work. There were 3,070 images in the test, which included images from Google's publicly available Imagetestsuite. Of these, 1,007 files had problems.

The JHOVE JPEG module can determine 13 different error conditions; Bad Peggy can distinguish at least 30 errors. The results of each are in tables in the post. The problem images could not be opened and displayed, or had missing parts, mixed-up parts, and colour problems. The conclusion is that Bad Peggy was able to detect all of the visually corrupt images, while the JHOVE JPEG module missed 7 corrupt images out of 18.
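Since Bad Peggy builds on the Java Image IO library, the basic idea of this kind of check can be sketched in a few lines of Java (a simplified illustration, not Bad Peggy itself, and far less thorough than either tool):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public class JpegDecodeCheck {
        // Returns true if the image decodes fully; a decode failure suggests corruption.
        static boolean decodes(File imageFile) {
            try {
                BufferedImage img = ImageIO.read(imageFile);
                return img != null;  // null means no reader could handle the data
            } catch (IOException | RuntimeException e) {
                return false;        // truncated or damaged files often fail here
            }
        }

        public static void main(String[] args) {
            for (String name : args) {
                System.out.println(name + ": " + (decodes(new File(name)) ? "decoded" : "FAILED to decode"));
            }
        }
    }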

Wednesday, November 09, 2016

Autonomous Preservation Tools in Minimal Effort Ingest

Autonomous Preservation Tools in Minimal Effort Ingest. Asger Askov Blekinge, Bolette Ammitzbøll Jurik, Thorbjørn Ravn Andersen. Poster, iPres 2016. (Proceedings p. 259-260 / PDF p. 131).
     This poster presents the concept of Autonomous Preservation Tools developed by the State and University Library, Denmark. It is an expansion of their idea of Minimal Effort Ingest, in which incoming data is secured quickly, even when resources are sparse, and most preservation actions are handled within the repository later, when resources are available, rather than by a static ingest workflow.

From these concepts they created the idea of Autonomous Preservation Tools, which are more like software agents than a static workflow system. The process is more flexible and allows for easy updates or changes to the workflow steps. A fixed workflow is replaced with a decentralised implicit workflow, which defines the set of events that an AIP must go through. Rather than a static workflow that must process AIPs in a fixed way, the Autonomous Preservation Tools "can discover AIPs to process on their own". Because AIPs maintain a record of past events, tools can determine whether an AIP has been processed or whether other tool actions must be performed first. The workflow, then, is the tools finding and processing items correctly until every item has been processed. This becomes an alternative method of processing.
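A toy sketch of the idea (hypothetical types and event names, not the library's actual code): each tool inspects an AIP's recorded events and acts only when its prerequisites are met and its own event is not yet present.

    import java.util.List;
    import java.util.Set;

    // Simplified model of an autonomous preservation tool choosing its own work.
    class AutonomousToolSketch {
        record Aip(String id, List<String> events) {}          // events must be a mutable list

        static final String PREREQUISITE = "VIRUS_CHECKED";    // assumed earlier event
        static final String MY_EVENT = "FORMAT_IDENTIFIED";    // event this tool records

        static boolean shouldProcess(Aip aip) {
            Set<String> seen = Set.copyOf(aip.events());
            return seen.contains(PREREQUISITE) && !seen.contains(MY_EVENT);
        }

        static void run(List<Aip> repository) {
            for (Aip aip : repository) {
                if (shouldProcess(aip)) {
                    // ... perform the preservation action on the AIP here ...
                    aip.events().add(MY_EVENT);  // record the event so the work is not repeated
                }
            }
        }
    }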

Tuesday, October 25, 2016

Checksum 101: A bit of information about Checksums

Checksum 101: A bit of information about Checksums. Ross Spencer. Archives NZ Workshop. 2 October 2016.
    A slide presentation providing very good information on checksums. Why do we use checksums?
  • Policy: Provides Integrity
  • Moving files: Validation after the move
  • Working with files: Uniquely identifying what we’re working with
  • Security:  a by-product of file integrity
An algorithm does the computing, and there are a variety of types: MD5, CRC32, SHA, etc. A checksum algorithm is a one-way function that cannot be reversed. DROID can handle MD5, SHA1, and SHA256. Why use multiple checksums? Doing so helps to avoid potential collisions, though the probabilities are low. The presentation shows the different types of checksums and how they are generated.
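For a sense of scale, computing a checksum takes only a few lines of Java; this sketch prints a file's SHA-256 (MD5 or SHA-1 work the same way by changing the algorithm name, and very large files would be streamed rather than read whole):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;

    public class ChecksumSketch {
        public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
            byte[] data = Files.readAllBytes(Paths.get(args[0]));  // fine for modest file sizes
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(data);                     // one-way: cannot be reversed
            System.out.println(HexFormat.of().formatHex(hash));    // hex string for a fixity record
        }
    }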

Checksums will ensure uniqueness. We can automate processes better with file checksums. Some people may have a preference for which checksums to use. Using checksums will help future-proof the systems and provide greater security.

Thursday, September 01, 2016

Digital Preservation: Keep calm and get on with it!

Digital Preservation: Keep calm and get on with it! Matthew Addis. Archives and Records Association 2016. 30 August 2016.
     This is a presentation about simple and practical steps towards digital preservation using open-source tools and best practices. The benefits of a digital preservation strategy are increasingly clear, but implementing the strategy can be overwhelming. The presentation lists resources and tools, such as the Digital Preservation Coalition handbook, the COPTR tool website, DROID, and the Data Assessment Framework. Sometimes complex resources can also be overwhelming and make decisions more difficult. "If you think that you’re not able to ‘do enough’ or ‘do it properly’, then this can result in doing nothing because this feels like the next best thing." But doing nothing has serious consequences in the digital world. "It’s almost always better to get on and do something than it is to do nothing." The presentation also refers to ‘parsimonious preservation’, or starting with minimal actions. Understand what you have and try to keep it safe through safe copies. It is important to understand formats and to use the tools to keep the content safe. "File format identification gives the information needed to make decisions." Another important part is to start simple and add functionality as you go. The maturity model from the National Digital Stewardship Alliance is a good guide.

Friday, June 24, 2016

File-format analysis tools for archivists

File-format analysis tools for archivists. Gary McGath. LWN. May 26, 2016.
     Preserving files for the long term is more difficult than just copying them to a drive; other issues are involved. "Will the software of the future be able to read the files of today without losing information? If it can, will people be able to tell what those files contain and where they came from?"

Digital data is more problematic than analog materials, since file formats change. Detailed tools can check the quality of digital documents, analyze the files and report problems. Some concerns:

  • Exact format identification: Knowing the MIME type isn't enough (see the sketch after this list).
  • Format durability: Software can fade into obsolescence if there isn't enough interest to keep it updated.
  • Strict validation: Archiving accepts files in order to give them to an audience that doesn't even exist yet. This means it should be conservative in what it accepts.
  • Metadata extraction: A file with a lot of identifying metadata, such as XMP or Exif, is a better candidate for an archive than one with very little. An archive adds a lot of value if it makes rich, searchable metadata available.
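To make the first concern concrete (as referenced above), here is a tiny sketch of signature-based identification of the kind DROID and PRONOM do far more rigorously: it checks a file's leading "magic" bytes rather than trusting its extension. The two signatures shown are purely illustrative.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class MagicByteSketch {
        static final byte[] JPEG_SIG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
        static final byte[] PNG_SIG  = {(byte) 0x89, 'P', 'N', 'G'};

        public static void main(String[] args) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                byte[] head = in.readNBytes(8);  // signatures sit at the start of the file
                if (startsWith(head, JPEG_SIG))      System.out.println("JPEG (by signature)");
                else if (startsWith(head, PNG_SIG))  System.out.println("PNG (by signature)");
                else                                 System.out.println("unidentified by this sketch");
            }
        }

        static boolean startsWith(byte[] data, byte[] sig) {
            return data.length >= sig.length
                    && Arrays.equals(Arrays.copyOfRange(data, 0, sig.length), sig);
        }
    }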
Some open-source applications address these concerns, such as:
  • JHOVE (JSTOR-Harvard Object Validation Environment)
  • DROID and PRONOM
  • ExifTool
  • FITS File Information Tool Set
"Identifying formats and characterizing files is a tricky business. Specifications are sometimes ambiguous."  There are different views on how much error, if any, is acceptable. "Being too fussy can ban perfectly usable files from archives."

"Specialists are passionate about the answers, and there often isn't one clearly correct answer. It's not surprising that different tools with different philosophies compete, and that the best approach can be to combine and compare their outputs"