Thursday, November 29, 2018

The File Discovery Tool - A simple tool to gather file and filepath information for ingest into our Rosetta Digital Archive

The File Discovery Tool. Chris Erickson. Brigham Young University. November 29, 2018.
     We have created a File Discovery Tool that analyzes directories of objects and prepares a spreadsheet of all the files it discovers for preservation/ingest. This spreadsheet allows curators to review and work with the materials, select those that need to be preserved, and then add collection and other metadata. The tool fits our workflow, but the source code may be useful for others trying to accomplish a similar task.

A sample command to run the tool:
>> java -jar FileDiscovery.jar [path name of files to check] [output path name for saving the report]
>> java -jar C:\FileDiscovery\FileDiscovery.jar "R:\test\objects"  C:\output\files
 The commands and syntax are outlined in a brief document: File Discovery Outline
The spreadsheet that is created has the following column headings:

Metadata can be added as needed before ingesting the content into Rosetta.
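The core of such a discovery pass can be sketched as a recursive directory walk that emits one spreadsheet row per file. This is an illustration only, not the tool's actual source; the column headings here are assumptions:

```python
import csv
import hashlib
from pathlib import Path

def discover(source_dir, report_path):
    """Walk source_dir recursively and write one CSV row per file found."""
    with open(report_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # Illustrative column headings; the real tool defines its own set.
        writer.writerow(["File Name", "File Path", "Size (bytes)", "MD5"])
        for path in sorted(Path(source_dir).rglob("*")):
            if path.is_file():
                # Read the whole file to checksum it (fine for a sketch;
                # large files would be hashed in chunks).
                digest = hashlib.md5(path.read_bytes()).hexdigest()
                writer.writerow([path.name, str(path), path.stat().st_size, digest])
```

Curators could then open the resulting CSV in Excel, flag the files to preserve, and append the collection metadata columns before ingest.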

The files and the metadata can then be submitted to Rosetta using the csv option in the Rosetta File Harvester tool, by adding a second row of Dublin Core names in order to map the columns. A standard template has been created to help in preparing the file for ingest and is found on the resources page: RosettaFile Ingest template for Excel, or (PDF)
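The two-row header arrangement can be pictured as a small script that writes such a file. The column names and Dublin Core mappings shown are invented examples, not the actual template:

```python
import csv

# Row 1: local column headings; Row 2: the Dublin Core element each column
# maps to (a blank cell means the column is not mapped to metadata).
rows = [
    ["Title", "Creator", "Date", "File Name"],
    ["dc:title", "dc:creator", "dc:date", ""],
    ["Sample item", "Erickson, Chris", "2018-11-29", "sample.tif"],
]
with open("ingest.csv", "w", newline="", encoding="utf-8") as out:
    csv.writer(out).writerows(rows)
```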
The source is available at:

The File Harvester tool - Our tool for ingesting content to our Rosetta Digital Archive

The File Harvester tool. Chris Erickson. Brigham Young University. November 29, 2018. 
     We have created a harvester tool for harvesting, processing, and submitting content to Rosetta. Our Library IT department has made this open source. The tool fits our workflow, but the source code may be useful for others trying to accomplish a similar task.

The File Harvester tool gathers content from several different sources:
  • Our hosted CONTENTdm (cdm)
  • Open Journal System (ojs)
  • Internet Archive (ia)
  • Unstructured files in a folder with metadata in a spreadsheet (csv)
The tool creates SIPs by gathering objects and metadata from the specified source, creating a Rosetta METS XML file and a Dublin Core XML file, and placing them in the structure required by our Rosetta system. The objects can either be on the hosted system or in a source folder. The harvest tool can also submit the content to Rosetta for ingest.

The structure is:
  1. Folder: collection-itemid, which contains dc.xml and the sub-folder content
  2. Sub-folder: content, which contains mets.xml and the sub-folder streams
  3. Sub-folder: streams, which contains the file objects
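The three-level layout above can be sketched as a routine that builds the SIP skeleton. This illustrates the folder structure only; the real harvester also generates the METS and Dublin Core content:

```python
from pathlib import Path

def build_sip_skeleton(base_dir, collection, item_id):
    """Create the folder layout the harvester produces for one SIP."""
    sip = Path(base_dir) / f"{collection}-{item_id}"  # 1. collection-itemid
    content = sip / "content"                         # 2. content sub-folder
    streams = content / "streams"                     # 3. streams sub-folder
    streams.mkdir(parents=True)
    (sip / "dc.xml").touch()        # Dublin Core descriptive metadata
    (content / "mets.xml").touch()  # Rosetta METS structural metadata
    return sip
```

The harvested file objects would then be copied into streams/ before the SIP is submitted for ingest.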
The commands and syntax are outlined in a brief document on the Resources page:
RosettaFile Harvester outline

The source is available at:

Wednesday, November 28, 2018

Cambridge University Libraries inaugural Digital Preservation Policy

Cambridge University Libraries inaugural Digital Preservation Policy. Somaya Langley. Digital Preservation at Oxford and Cambridge. 26 November 2018.
     The inaugural Cambridge University Libraries Digital Preservation Policy has been published. This can be compared to the Oxford Bodleian Libraries Digital Preservation Policy, which was published earlier in the year. “Long-term preservation of digital content is essential to the University’s mission of contributing to society through the pursuit of education, learning, and research.” There is a "dearth of much-needed policies".

A gap analysis found that a few key policies existed, but there were gaps or duplication. The policy process is never-ending: "The policies should be a ‘live and breathing’ process, with the policy document itself purely being there to keep a record of the agreed upon decisions and principles."

The digital preservation policy process may need to also review other relevant policies (such as the Collection Care and Conservation Policy) and add digital preservation statements. "In the longer term, while it might be ideal to combine a preservation policy into one (encompassing the conservation and preservation of physical and digital collection items), CUL’s digital preservation maturity and skill capabilities are too low at present. Focus needed to be really drawn to how to manage digital content, hence the need for a separate Cambridge University Libraries Digital Preservation Policy." There is a need to include statements in the policy to "support better care for digital (and audiovisual) content still remaining on carriers (that are yet to be transferred)."

Monday, November 26, 2018

Preservation of AV Materials in Manuscript Collections. Training for AV format identification and risk assessment

Preservation of AV Materials in Manuscript Collections; Internal Training.  Ben Harry. Brigham Young University. November 2018.
     Ben Harry, Curator of Audiovisual Materials and Media Arts History at Brigham Young University, provided some internal training concerning AV format identification and risk assessment. Here are some assessment tools for AV materials.

Tuesday, November 20, 2018

Audiovisual Metadata Platform Planning Project: Progress Report and Next Steps

Audiovisual Metadata Platform (AMP) Planning Project: Progress Report and Next Steps. Jon W. Dunn, et al. Indiana University. March 28, 2018.
     This is a report of a workshop which was part of a planning project for design and development of an audiovisual metadata platform. "The platform will perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives." 

Libraries and archives hold massive collections of audiovisual recordings from a diverse range of timeframes, cultures, and contexts that are of great interest across many disciplines and communities. Galleries, Libraries, Archives, and Museums (GLAM) face difficulty in creating access to their audiovisual collections, due to high costs, difficulty in managing the objects, and the lack of sufficiently granular metadata for audio/video content to support discovery, identification, and use. Text materials can use full-text indexing to provide some degree of discovery, but "without metadata detailing the content of the dynamic files, audiovisual materials cannot be located, used, and ultimately, understood". Metadata generation for audiovisual recordings relies almost entirely on manual description performed by experts in a variety of ways. The AMP will need to process audio and video files to extract metadata, and also accept and incorporate metadata from supplementary documents. One major challenge is processing and moving large files around, both in terms of time and bandwidth costs.

The report goes into depth on the AMP business requirements, some of which are:
  • Automate analysis of audiovisual content and human-generated metadata in a variety of formats to efficiently generate a rich set of searchable, textual attributes
  • Offer streamlined metadata creation by leveraging multiple, integrated, best-of-breed software tools in a single workflow
  • Produce and format metadata with minimal errors 
  • Build a community of developers in the cultural heritage community who can develop and support AMP on an ongoing basis 
  • Scale to efficiently process multi-terabyte batches of content 
  • Support collaborative efforts with similar initiatives
The following formats are possible sources for AMP processing:
  • Audio (.mp3, .wav) 
  • Image (.eps, .jpg, .pdf, .png, .tif) 
  • Data (.xlsx, .csv, .ttl, .json) 
  • Presentation (.key, .pptx) 
  • Video (.mov, .mp4, .mkv, .mts, .mxf) 
  • Structured text (.xml, with or without defined schemas, such as TEI, MODS, EAD, MARCXML) 
  • Unstructured text (.txt, .docx)
The report continues by looking at the Proposed System Architecture, functional requirements, and workflows.
Outcome: "The AMP workshop successfully gathered together a group of experts to talk about what would be needed to perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives. The workshop generated technical details regarding the software and computational components needed and ideas for tools to use and workflows to implement to make this platform a reality."

Saturday, November 17, 2018

The PKP Preservation Network: A Free, Sustainable Preservation Service for OJS Journals

The PKP Preservation Network: A Free, Sustainable Preservation Service for OJS Journals. Bronwen Sprout and Mark Jordan. Poster, iPres 2018.
     The PKP Preservation Network offers free preservation to any journal running Open Journal Systems (OJS) that has an ISSN. As of September 2018, 856 journals have deposited 22,549 issues into the network. The network is administered by the Public Knowledge Project and supported by partners who are running preservation nodes, along with an Advisory Panel. Future development will allow it to preserve supplemental and linked content.

Journal deposit into the dark archive is fully automated through an OJS plugin. The content is harvested and processed by a staging server and then stored in a LOCKSS network.  In case of a trigger event, a journal's content will be republished for public access.

Friday, November 16, 2018

How State CIOs Should Preserve Digital Records -- Electronic records are at risk and vulnerable

How State CIOs Should Preserve Digital Records. Phil Goldstein. November 05, 2018.
     States are not well prepared for long-term preservation of digital records, which means the electronic records are at risk and vulnerable. State governments are living in the world of digital records, which has many challenges with preserving content. The records are essential for state governments and must be preserved.

  • Electronic records require attention to ensure they are preserved and accessible. They are more complex to preserve than paper records.
  • “Sustained attention and resources are needed to ensure the long-term management and accessibility of our nation’s electronic records.”
  • Collaboration is key, since digital records management involves multiple organizations. “Collaborative effort is key to developing and adopting best practices and sustainable models for the long-term preservation of electronic records.”
  • “Adequate employee awareness and training activities are keys to ensuring that employees correctly carry out new or existing policies and procedures and understand how to use any new technologies associated with improved electronic records management.”
  • “Establishing fixity, or the property of a digital file or object being fixed or unchanged, is a critical part of confirming evidentiary status of electronic records.”
  • Some systems for document management don’t preserve the content, structure, context, and integrity of the record over time. “States must select technologies that properly manage and store electronic records, while ensuring that the inevitable obsolescence of the technology does not compromise the records’ integrity or accessibility.”
  • “The need for digital preservation of state electronic records will outlast commercial service providers and current technological infrastructures. The state needs to clearly understand its rights regarding its data and how the preservation provider is helping it perform its obligations to its citizens.”
  • State CIOs can help state archives and records management personnel perform a cost-benefit analysis about outsourcing preservation services in relation to data security, the report says.
  • Contracts with third-party digital preservation service providers should “establish responsibility for functions that are critical to ensuring the integrity of state data including fixity checking and audits or compliance with state government legal responsibilities,” according to the report.
  • State CIOs and archivists should also establish audit trails when working with a third-party preservation service provider. “A verifiable audit trail of the activities involved in the processing of digital records ensures that the reliability and authenticity of the data is secure,” the report says.
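The fixity point above boils down to recording a checksum when a record is ingested and re-computing it later to confirm the file is unchanged. A minimal sketch (SHA-256 is an assumed choice of algorithm, not one named in the report):

```python
import hashlib

def fixity(path):
    """Compute a SHA-256 checksum for one file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, recorded_digest):
    """True if the file still matches the checksum recorded at ingest."""
    return fixity(path) == recorded_digest
```

Repeating this check on a schedule, and logging each result, is one way to build the kind of verifiable audit trail the report calls for.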

Thursday, November 15, 2018

Announcing the Digital Processing Framework

Announcing the Digital Processing Framework. Erin Faulder, et al. bloggERS! November 13, 2018.   [PDF]
     The Digital Processing Framework suggests a minimum processing method for digital archival content. The framework brings together archival processing practice and digital preservation activities. The intention is to promote consistent practices and to establish common terminologies. A few of the 23 framework activities are:
  • Survey the collection
  • Capture digital content off physical media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Organize electronic files according to intellectual arrangement
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
There is a reusable Excel version of the framework as well. The framework is for people who "process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections."  It was designed to be practical, usable, and adaptable to local institutional settings.

Wednesday, November 14, 2018

Perspectives on the Changing Ecology of Digital Preservation

Perspectives on the Changing Ecology of Digital Preservation. Oya Y. Rieger. Ithaka S+R New Issue Brief. October 29, 2018.
     Our cultural, historic, and scientific heritage is increasingly being produced and shared in digital forms, which raises questions about the role of research libraries and archives in digital preservation. The purpose of this report is to share some common themes and provide an opportunity for broader community involvement. Interviews were conducted to identify opportunities and needs and focuses on gaps and challenges, in order to explore how we can strengthen collaborations. "Digital preservation involves the management and maintenance of digital objects to ensure the authenticity, accuracy, and functionality of content over time in the face of technological and administrative changes."  A critical issue expressed: “The main risk is that one assumes that ‘somebody else’ will take care of the digital information.”

The following questions framed the discussions:
  • What seems to be working well now (in which areas have we seen significant progress)?
  • What are your thoughts on how the preservation community is preparing for new content types and formats?
  • Do you have any observations about new research workflows and practices and their potential impact on the future of the scholarly record?
  • What do you see as gaps or areas that need further attention? 
  • If you were writing a new preservation research or implementation grant, what would you focus on?
Some notes and quotes from the article that are important to consider:
  • The digital preservation community is getting larger, representing deeper expertise around a wide range of digital content types. 
  • Through several digital preservation and repository conferences and organizations, there is a robust exchange of best practices, standards, and preservation techniques.
  • There is now significant experience in implementing preservation strategies such as normalization, refreshing, migration, and emulation as the community of practitioners successfully moved these techniques from theory to practice.
  • The development and adoption of shared standards (such as OAIS, PREMIS, and PRONOM) have helped the access, discovery, management, and preservation of digital resources.
  • There are now a range of digital repository architectures and open source collaborations to provide open and scalable technical infrastructures for libraries and archives.

Challenges in Need of Further Research and Action:

Organizations and Leadership

  • The role of research libraries is unclear, as academic libraries are no longer perceived as critical drivers and leaders of digital preservation.
  • How to provide sufficient levels of digital preservation to meet the community’s needs.
  • The role of research libraries in digital preservation needs to be redefined.
  • It is difficult to preserve content that is not “owned” or “controlled” by libraries.
  • Library leaders have “shifted their attention from seeing preservation as a moral imperative to catering to the university’s immediate needs.”
  • "Several wondered what arguments could convince provosts and other senior university leaders to invest in digital preservation."
  • With the increasing influence of commercial and industrial actors, “the digital preservation community is becoming more diverse and the distinctive requirements of research libraries are not as dominant as they perhaps once were in the community.”
  • “Expertise is increasingly fragmented as web archiving, digital curation, research data, repositories, and special collections are often placed in different library units without a common preservation mandate.”
  • There seems to be some disconnect between how the top leadership level (University Librarians, Associate University Librarians) perceives preservation priorities and needs versus curators, digital collection specialists, archivists, and other specialists.
  • It is important for specialists, such as curators and archivists, to have a grounded understanding of how their specific roles and priorities fit into the overall strategies of libraries and cultural heritage institutions.
  • There is concern about how digital preservation activities are being slowed down or impeded due to politics and conflicts both within and outside of organizations.

Preservation Services and Program Areas

  • There is confusion about the purpose and business models of preservation services, and about how such services fit together in a comprehensive preservation service framework.
  • Understanding what is being preserved, along with the associated technical, organizational, and policy issues, is important for effective planning and implementation of a digital preservation program.
  • Storage does not equate to preservation.
  • There is a need to better understand the current storage options and costs, especially cloud storage.
  • "We need to be careful about relying on the university IT unit for building storage," since they will focus more on providing platforms and less on providing commodity storage.
  • There are problems with legacy content that have not yet been resolved, including ejournals, ebooks, and Electronic Theses and Dissertations (ETDs).
  • There is concern about the long-term sustainability and preservation of open access content, which is diverse and problematic.
  • Initiatives tend to focus on the initial identification, ingest, and description stages without a sufficient emphasis on how the archived content will be discovered, accessed, and used by scholars at the point of need in a usable and meaningful way.
  • "It is difficult to justify collecting and preserving things if they aren’t providing value to your stakeholders.”

Assessment, Evaluation, and Risk Management

  • There are questions about certification and self-audit, so that we have a systematic and recurrent way of assessing progress and gaps.
  • There do not seem to be sufficient collaborative approaches to explore what constitutes success and how we identify and measure outcomes associated with digital preservation.
  • “More candid discussions around loss and failure will promote openness and transparency in our community and help us with risk management."
The key to digital preservation is sustaining interactivity and variability to support future uses in addition to considering the core archival principles such as authenticity, fixity, and integrity.

Three overarching issues that may be fruitful to explore are:
  1. A roadmap to guide the international community in understanding what digital preservation comprises, defining the key problems, identifying barriers and limitations, and developing an action agenda accordingly.
  2. Understanding the ownership, control, discovery, and access of materials.
  3. Identifying the measurable benefits of digital preservation that can be presented as a communal responsibility that deserves funding.

Wednesday, November 07, 2018

Metadata for audio and videos

Metadata for audio and videos. Karen Smith-Yoshimura. OCLC: Hanging Together blog. October 29, 2018.
     This post discusses a topic that a number of groups are currently examining. Our libraries are repositories of large amounts of audiovisual materials, which often represent unique, local collections, and these issues need to be addressed. Chela Scott Weber: “For decades, A/V materials in our collections were largely either separated from related manuscript material (often shunted away to be dealt with at a later date) or treated at the item level. Both have served to create sizeable backlogs of un-quantified and un-described A/V materials.” The result is that today, much of this audiovisual material is in dire need of preservation, digitization, clarification of conditions of use, and description.

AV materials, skill-sets and stakeholders are part of a complex environment. Managing AV resources requires knowledge of the use context and the technical metadata issues, in order to think through programs of description and access. It may help for libraries to identify the issues by the category of the AV materials:
  • Commercial AV: Licensing issues, old formats, and the quality of vendor records
  • Unique archival collections: Often deteriorating formats, large backlogs, lack of resources, and rare and expensive equipment that may be required to access (and assess) the files
  • Locally generated content: Desire for content-creators to describe their own resources
How does a library decide the amount of effort to invest in describing these AV materials? Finding aids can provide useful contextual information for individual items within a specific collection, but they often lack important details needed for discovery of the items, specifically for legacy data. Some hope that better discovery information will reduce the need to repeat the same information in different databases, but this would require using consistent access points across systems.

Institutions commonly prioritize which of their AV materials are to be described and preserved, assessing their importance through surveys and assigning priorities from inventories. These are often multi-divisional efforts.  Rights management issues can be very complex, but they are easier for new AV files acquired since rights management has become part of normal workflows. However, older materials may lack rights information.

Metadata for AV materials often include important technical information. Some have systems that have implemented PREMIS to support the preservation of digital objects, which helps with their AV materials.

This is an opportunity for institutions who have developed their own assessments and templates to share them with others and identify common practices and criteria.