
Tuesday, November 20, 2018

Audiovisual Metadata Platform Planning Project: Progress Report and Next Steps

Audiovisual Metadata Platform (AMP) Planning Project: Progress Report and Next Steps. Jon W. Dunn, et al. Indiana University. March 28, 2018.
     This is a report of a workshop which was part of a planning project for design and development of an audiovisual metadata platform. "The platform will perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives." 

Libraries and archives hold massive collections of audiovisual recordings from a diverse range of timeframes, cultures, and contexts that are of great interest across many disciplines and communities. Galleries, Libraries, Archives, and Museums (GLAM) face difficulty in creating access to their audiovisual collections, due to high costs, difficulty in managing the objects, and the lack of sufficiently granular metadata for audio/video content to support discovery, identification, and use. Text materials can use full-text indexing to provide some degree of discovery, but "without metadata detailing the content of the dynamic files, audiovisual materials cannot be located, used, and ultimately, understood". Metadata generation for audiovisual recordings relies almost entirely on manual description performed by experts in a variety of ways. The AMP will need to process audio and video files to extract metadata, and also accept and incorporate metadata from supplementary documents. One major challenge is processing and moving large files around, in terms of both time and bandwidth costs.

The report goes into depth on the AMP business requirements, some of which are:
  • Automate analysis of audiovisual content and human-generated metadata in a variety of formats to efficiently generate a rich set of searchable, textual attributes
  • Offer streamlined metadata creation by leveraging multiple, integrated, best-of-breed software tools in a single workflow
  • Produce and format metadata with minimal errors 
  • Build a community of developers in the cultural heritage community who can develop and support AMP on an ongoing basis 
  • Scale to efficiently process multi-terabyte batches of content 
  • Support collaborative efforts with similar initiatives
The following formats are possible sources for AMP processing:
  • Audio (.mp3, .wav) 
  • Image (.eps, .jpg, .pdf, .png, .tif) 
  • Data (.xlsx, .csv, .ttl, .json) 
  • Presentation (.key, .pptx) 
  • Video (.mov, .mp4, .mkv, .mts, .mxf) 
  • Structured text (.xml, with or without defined schemas, such as TEI, MODS, EAD, MARCXML) 
  • Unstructured text (.txt, .docx)
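A platform like AMP would need to route incoming files by type before analysis. The mapping below is a hypothetical Python sketch based on the format list above; the category names and the routing function are illustrative assumptions, not part of the report.

```python
# Hypothetical sketch: routing files to AMP processing categories by extension.
# The mapping mirrors the format list above; category labels are assumptions.
from pathlib import Path

CATEGORY_BY_EXTENSION = {
    ".mp3": "audio", ".wav": "audio",
    ".eps": "image", ".jpg": "image", ".pdf": "image", ".png": "image", ".tif": "image",
    ".xlsx": "data", ".csv": "data", ".ttl": "data", ".json": "data",
    ".key": "presentation", ".pptx": "presentation",
    ".mov": "video", ".mp4": "video", ".mkv": "video", ".mts": "video", ".mxf": "video",
    ".xml": "structured text",
    ".txt": "unstructured text", ".docx": "unstructured text",
}

def route(filename: str) -> str:
    """Return the AMP source category for a file, or 'unknown'."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")
```

In a real pipeline the category would decide which automated tools (speech-to-text, OCR, image analysis) the file is sent to.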
The report continues by looking at the Proposed System Architecture, functional requirements, and workflows.
Outcome: "The AMP workshop successfully gathered together a group of experts to talk about what would be needed to perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives. The workshop generated technical details regarding the software and computational components needed and ideas for tools to use and workflows to implement to make this platform a reality."

Wednesday, November 07, 2018

Metadata for audio and videos

Metadata for audio and videos. Karen Smith-Yoshimura. OCLC: Hanging Together blog.
October 29, 2018.
     This post addresses a topic currently under discussion by a number of groups.
Our libraries are repositories of large amounts of audiovisual materials, which often represent unique, local collections. These issues need to be addressed. Chela Scott Weber: “For decades, A/V materials in our collections were largely either separated from related manuscript material (often shunted away to be dealt with at a later date) or treated at the item level. Both have served to create sizeable backlogs of un-quantified and un-described A/V materials.”
The result is that today, much of this audiovisual material is in dire need of preservation, digitization, clarification of conditions of use, and description.

AV materials, skill-sets and stakeholders are part of a complex environment. Managing AV resources requires knowledge of the use context and the technical metadata issues, in order to think through programs of description and access. It may help for libraries to identify the issues by the category of the AV materials:
  •     Commercial AV: Licensing issues, old formats, and the quality of vendor records
  •     Unique archival collections: Often deteriorating formats, large backlogs, lack of resources, and rare and expensive equipment that may be required to access (and assess) the files
  •     Locally generated content: Desire for content-creators to describe own resources
How does a library decide the amount of effort to invest in describing these AV materials? Finding aids can provide useful contextual information for individual items within a specific collection, but they often lack important details needed for discovery of the items, especially for legacy data. Some hope that better discovery information will reduce the need to repeat the same information in different databases, but this would require using consistent access points across systems.

Institutions commonly prioritize which of their AV materials are to be described and preserved, assessing their importance through surveys and assigning priorities from inventories. These are often multi-divisional efforts.  Rights management issues can be very complex, but they are easier for new AV files acquired since rights management has become part of normal workflows. However, older materials may lack rights information.

Metadata for AV materials often includes important technical information. Some institutions have implemented PREMIS in their systems to support the preservation of digital objects, which helps with their AV materials.

This is an opportunity for institutions that have developed their own assessments and templates to share them with others and identify common practices and criteria.


Thursday, October 05, 2017

Exploring Metadata Interoperability in the United States and United Kingdom

Exploring Metadata Interoperability in the United States and United Kingdom. Charlotte Kostelic. bloggERS! March 28, 2017.    
     A post about international perspectives on digital preservation. It looks at a comparative analysis of descriptive metadata for collections, and specifically at understanding how metadata can aid in providing access to digitized collections and interoperable access for the collections. One goal of this analysis was to find a common data model for the various collections.
  • The standards used by the partner institutions include: 
    • Encoded Archival Description (EAD) with DACS for archival collections in the United States; 
    • ISAD(G) for archival collections in the United Kingdom; 
    • MARC for bibliographic, map, serial, and print collections; and 
    • Dublin Core employed for certain digital collections records. 
    • There are additional library and museum standards that need to be analyzed further.
  • Key access points include: subject headings; dates; languages; and place, personal, and corporate names.
  • The level of description varies based on whether the materials are from archival collections or library collections.
There is a need for interoperability between collections that use different data models, especially in an institution that intends to make all collections accessible in a single viewer.

Friday, August 18, 2017

Evaluating Your DPN Metadata Approach

Evaluating Your DPN Metadata Approach.  DPN Preservation Metadata Standards Working Group. July 27, 2017. [PDF, 6 pp.]
     This brief guide can help determine a clear metadata approach to recovering data "in the far future among unpredictable circumstances". The document can help users create a sound approach to preserving their institution’s data and make decisions that fit their own institutional needs.

The first section is:
What information is needed to understand and contextualize an object? It examines both descriptive and structural metadata.

Descriptive Metadata: for the purpose of identification and discovery of an object. Dublin Core, MODS, and VRA Core are common standards used for descriptive metadata.

Structural Metadata: describes relationships between objects, such as pages in a book. The METS Structural Map can express hierarchical or parent/child relationships. The PREMIS "relationship" element can express version relationships.
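As a rough illustration of the parent/child relationships a METS structural map expresses, here is a hedged Python sketch that builds a minimal structMap for a book and its pages. The element names follow the METS schema, but the file identifiers are hypothetical and a real METS document would contain much more.

```python
# Minimal sketch of a METS structural map: a book <div> containing page <div>s,
# each pointing at a file via <fptr>. File IDs here are hypothetical.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def page_struct_map(book_label: str, page_file_ids: list[str]) -> ET.Element:
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", TYPE="physical")
    book = ET.SubElement(struct_map, f"{{{METS_NS}}}div", TYPE="book", LABEL=book_label)
    for order, file_id in enumerate(page_file_ids, start=1):
        page = ET.SubElement(book, f"{{{METS_NS}}}div", TYPE="page", ORDER=str(order))
        ET.SubElement(page, f"{{{METS_NS}}}fptr", FILEID=file_id)
    return struct_map
```

Serializing the element with `ET.tostring` would yield the nested `<mets:div>` hierarchy that expresses the book-to-page relationship.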

The document also looks at how to:
  • understand and contextualize a collection; 
  • connect/relate objects to a collection; 
  • connect/relate versions to each other; 
  • connect metadata records to associated objects and collections;
  • ensure the authenticity of an object; and
  • ensure that the essential characteristics of the original are maintained in a data migration.

Thursday, July 13, 2017

Integrating Research Data management and digital preservation systems at the University of Sheffield

Integrating Research Data management and digital preservation systems at the University of Sheffield. Chris Loftus. Digital Preservation Coalition. 31 May 2017.
     The University Library is leading the active management and curation of research data within the institution. This includes implementing a research data catalogue and repository powered by Figshare. They safeguard library collections and University assets using Rosetta, a digital preservation platform from Ex Libris. "We are now working with figshare and Ex Libris to integrate both services to provide seamless preservation of published research data across the research lifecycle." This integration will:

  • provide a complete lifecycle data management service for the university’s research community; 
  • identify, understand and act on risks associated with preserving data sets; 
  • better inform advice and guidance around use of data formats for sharing and preservation purposes; and 
  • encourage researchers to share their data more openly with others by guaranteeing the long term sustainability of that data.
Initial integration work uses the OAI-PMH protocol and METS packages to transfer content efficiently. Rosetta will be the dark archive, with figshare the interface for researchers and external users.

File format issues: Research data is often in niche and proprietary formats. Of the material currently deposited in the archive, only a small percentage was recognised by a DROID survey. They will need to invest some time to identify and plan for these formats, and hopefully the work will be of use to the wider digital preservation community.
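A DROID survey of this kind can be summarized from DROID's CSV export. The sketch below is a hedged illustration that assumes the default export columns, where an empty PUID (PRONOM identifier) marks a file DROID could not identify.

```python
# Hedged sketch: compute the share of files in a DROID CSV export that
# received a PRONOM identifier (PUID). Assumes the default export columns.
import csv

def identification_rate(csv_path: str) -> float:
    """Fraction of rows whose PUID column is non-empty (0.0 for an empty file)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    identified = sum(1 for row in rows if (row.get("PUID") or "").strip())
    return identified / len(rows)
```

Running this over a profile export gives a quick sense of how much of a research-data deposit is in recognised formats.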

Metadata: They plan to improve the quality and volume of metadata accompanying research data. Material from researchers often lacks needed metadata, which can cause future data access issues. They are investigating solutions.

Tuesday, April 18, 2017

Understanding PREMIS

Understanding PREMIS. Priscilla Caplan. Library of Congress Network Development and MARC Standards Office. 2017.
     PREMIS stands for "PREservation Metadata: Implementation Strategies". This document is a relatively brief overview of the PREMIS preservation metadata standard. It can also serve as a "gentle introduction" to the much larger document PREMIS Data Dictionary for Preservation Metadata. PREMIS defines preservation metadata as "the information a repository uses to support the digital preservation process." Preservation metadata also supports activities "intended to ensure the long-term usability of a digital resource."

The Data Dictionary defines a core set of metadata elements needed in order to perform preservation functions, so that digital objects can be read from the digital media, and can be displayed or played. It includes a definition of each element; the reason it is part of the metadata; and examples and notes about how the value might be obtained and used. The elements address information needed to manage files properly, and to document any changes made. PREMIS only defines the metadata elements commonly needed to perform preservation functions on the materials to be preserved. The focus is on the repository and its management, not on the content authors or the associated staff, so it can be a guide or checklist for those developing or managing a repository or software applications. Some information needed is:
  • Provenance: The record of the chain of custody and change history of a digital object. 
  • Significant Properties: Characteristics of an object that should be maintained through preservation actions. 
  • Rights: knowing what you can do with an object while trying to preserve it.
The Data Model defines several kinds of Entities:
  • Objects (including Intellectual Entities)
  • Agents
  • Events
  • Rights
PREMIS provides an XML schema that "corresponds directly to the Data Dictionary to provide a straightforward description of Objects, Events, Agents and Rights."
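As a taste of that schema, here is a hedged sketch that builds a minimal PREMIS Object description with Python's xml.etree. The element names follow the PREMIS v3 schema, but the identifier value is hypothetical, and a real record would carry many more elements (objectCharacteristics, fixity, and so on).

```python
# Minimal, not schema-complete, PREMIS Object sketch.
# Element names follow the PREMIS v3 XML schema; the identifier is hypothetical.
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)

def premis_object(identifier: str) -> ET.Element:
    obj = ET.Element(f"{{{PREMIS_NS}}}object")
    oid = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectIdentifier")
    ET.SubElement(oid, f"{{{PREMIS_NS}}}objectIdentifierType").text = "local"
    ET.SubElement(oid, f"{{{PREMIS_NS}}}objectIdentifierValue").text = identifier
    return obj
```

Events, Agents, and Rights entities would be expressed as sibling elements in the same namespace.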

Monday, March 13, 2017

What Makes A Digital Steward: A Competency Profile Based On The National Digital Stewardship Residencies

What Makes A Digital Steward: A Competency Profile Based On The National Digital Stewardship Residencies. Karl-Rainer Blumenthal, et al. Long paper, iPres 2016. (Proceedings p. 112-120 / PDF p. 57-61).
       Digital stewardship is the active and long-term management of digital objects with the intent to preserve them for long term access. Because the field is relatively young, there has not yet been "sufficient scholarship performed to identify a competency profile for digital stewards". A profile details the specific skills, responsibilities, and knowledge areas required, and this study attempts to describe a competency profile for digital stewards by using a three-pronged approach:
  1. reviewing literature on the topics of digital stewardship roles, responsibilities, expected practices, and training needs
  2. qualitatively analyzing current and completed project descriptions
  3. quantitatively analyzing the results from a survey that identified competencies needed to successfully complete projects
"This study had two main outputs: the results of the document analysis (qualitative), and the results of the survey (quantitative)."  Seven coded categories of competence emerged from the analysis:
  1. Technical skills;
  2. Knowledge of standards and best practices;
  3. Research responsibilities;
  4. Communication skills;
  5. Project management abilities;
  6. Professional output responsibilities; and
  7. Personality requirements.
Based on the responses for Very important and Essential, a competency statement representing this profile would suggest that "effective digital stewards leverage their technical skills, knowledge of standards and best practices, research opportunities, communication skills, and project management abilities to ensure the longterm viability of the digital record." They do this by:
  • developing and enhancing new and existing digital media workflows
  • managing digital assets
  • creating and manipulating asset metadata
  • committing to the successful implementation of these new workflows
  • managing both project resources and people
  • soliciting regular input from stakeholders
  • documenting standards and practices
  • creating policies, professional recommendations, and reports
  • maintaining current and expert knowledge of standards and best practices for metadata and data management
  • managing new forms of media
The study suggests that, in practice, technical skills are not always as essential in digital stewardship as job postings suggest. Hardware/software implementation and Qualitative data analysis skills were important to only half of the respondents. Workflow management is a universally important skill, deemed "Essential" by almost all respondents. Other categories appeared as Somewhat Important, or as areas that need further research.

The study suggests that "although specific technical skills are viewed as highly important in different settings, a much larger majority of projects required skills less bound to a particular technology or media, like documentation creation and workflow analysis." Digital stewards should possess not only a deep understanding of their field, but also the ability to "effectively disseminate their work to others."

Thursday, March 02, 2017

A lifetime in the digital world

A lifetime in the digital world. Helen Hockx. Blog: Things I cannot say in 140 characters.
February 15, 2017.
     A very interesting post about papers donated to the University of Notre Dame in 1996, and how the library has been dealing with the collection. The collection includes a survey that is possibly “the largest, single, data gathering event ever performed with regard to women religious”. The data was stored on “seven reels of 800 dpi tapes, ]rec]120, blocksize 12,000, approximately 810,000 records in all”, extracted from the original EBCDIC tapes and converted to newer formats in 1996, transferred to CDs then to computer hard disk in 1999. The 1967 survey data has fortunately survived the format migrations. Some other data in the collection had been lost: at least 3 tape reels could not be read during the 1996 migration exercise and at least one file could not be copied in 1999. "The survey data has not been used for 18 years since 1996 – nicely and appropriately described by the colleague as “a lifetime in the digital world”."

The dataset has now been reformatted and stored in .dta and .csv formats. They also recreated the “codebook” of all the questions and pre-defined responses and put it in one document. The dataset is now in the best possible format for re-use. The post gives examples of digital collection items that require intervention or preservation actions. A few takeaways:
  • Active use seems to be the best way for monitoring and detecting digital obsolescence.
  • Metadata really is essential. Without the notes, finding aid and scanned codebook, we would not be able to make sense of the dataset.
  • Do not wait a lifetime to think about digital preservation. 
  • The longer you wait, the more difficult it gets.

Tuesday, January 24, 2017

The UNESCO/PERSIST Guidelines for the selection of digital heritage for long-term preservation

The UNESCO/PERSIST Guidelines for the selection of digital heritage for long-term preservation. Sarah CC Choy, et al. UNESCO/PERSIST Content Task Force. March 2016.
     The survival of digital heritage is much less assured than that of its traditional counterparts. “Identification of significant digital heritage and early intervention are essential to ensuring its long-term preservation.” This project was created to help preserve our cultural heritage, and to provide a starting point for institutions creating their policies. Preserving and ensuring access to its digital information is also a challenge for the private sector. Acquiring and collecting digital heritage requires significant effort and resources. It is vital that organizations accept digital stewardship roles and responsibilities. Some thoughts and quotes from the document:
  • There is a strong risk that the restrictive legal environment will negatively impact the long-term survival of important digital heritage.
  • The challenge of long-term preservation in the digital age requires a rethinking of how heritage institutions identify significance and assess value.
  • new forms of digital expression blur boundaries and lines of responsibility and challenge past approaches to collecting.
  • libraries, archives, and museums have a common interest in preserving heritage
  • heritage institutions must be proactive to identify digital heritage and information for long-term preservation before it is lost.
  • Selection is essential, as it is economically and technically impossible, and often legally prohibited, to collect all current digital heritage. Selecting for long-term preservation will thus be a critical function of heritage institutions in the digital age.
  • Selecting digital heritage for long-term preservation may focus primarily on evaluating publications already in their collection, originally acquired for short-term use, rather than assessing new publications for acquisition. 
  • Rapid obsolescence in digital formats, storage media, and systems is collapsing the window of opportunity for selection, and increases the risk of losing records that have not yet “proved” their significance over time.
Institutions should address strategies for collecting digital heritage and develop their own selection criteria. Four possible steps:
  1. Identify the material to be acquired or evaluated
  2. Determine the legal obligation to preserve the material
  3. Assess the material using three selection criteria: significance, sustainability, and availability
  4. Compile the above information and make a decision based on the results
Management of long-term digital preservation and metadata is important. There are five basic functional requirements for digital metadata:
  1. Identification of each digital object
  2. Location of each digital object so that it can be found and retrieved
  3. Description of digital object is needed for recall and interpretation, both content and context
  4. Readability and encoding, in order to remain legible over time.
  5. Rights management, including conditions of use and restrictions of each digital item
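The five requirements above could be modeled, purely as an illustration, as fields of a minimal record structure; the class and field names below are assumptions of this summary, not from any standard.

```python
# Hypothetical sketch: the five functional requirements for digital metadata
# as fields of a minimal record. Names are illustrative, not standardized.
from dataclasses import dataclass

@dataclass
class DigitalObjectMetadata:
    identifier: str   # 1. unique identification of the digital object
    location: str     # 2. where the object can be found and retrieved
    description: str  # 3. content and context, for recall and interpretation
    encoding: str     # 4. format/encoding, so the object remains legible over time
    rights: str       # 5. conditions of use and restrictions
```

Each field maps one-to-one onto a numbered requirement, which makes gaps (say, a missing rights statement) easy to audit.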
“The long-term preservation of digital heritage is perhaps the most daunting challenge facing heritage institutions today.”

Wednesday, December 21, 2016

We Are Surrounded by Metadata--But It’s Still Not Enough

We Are Surrounded by Metadata--But It’s Still Not Enough. Teresa Soleau. In Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016.
     Many of their digital collections end up in their Rosetta digital preservation repository. Descriptive and structural information about the resources comes from many sources, including the physical materials themselves as they are being reformatted. "Metadata abounds. Even file names are metadata, full of clues about the content of the files: for reformatted material they may contain the inventory or accession number and the physical location, like box and folder; while for born-digital material, the original file names and the names of folders and subfolders may be the only information we have at the file level."

A major challenge is that the collection descriptions must be at the aggregate level because of the volume of materials, "while the digital files must exist at the item level, or even more granularly if we have multiple files representing a single item, such as the front and back of a photograph". The question is how to provide useful access to all the digital material with so little metadata. This can be overwhelming and inefficient if the context and content are difficult to recognize and understand. And "anything that makes the material easier to use now will contribute to the long-term preservation of the digital files as well; after all, what’s the point of preserving something if you’ve lost the information about what the thing is?"

Technical information about the files themselves acts as a fingerprint that helps verify a file hasn’t changed over time, in addition to tracking what has happened to the files after entering the archive. Software preservation, such as with the Software Preservation Network, is now being recognized as an important effort. Digital preservationists are working out who should be responsible for preserving which software. There are many preservation challenges yet to be solved in the years ahead.
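The fingerprint role of technical metadata is usually implemented with checksums. The sketch below uses SHA-256 as a stand-in for whatever fixity algorithm a repository actually records at ingest.

```python
# Fixity sketch: compare a file's current digest against the digest
# recorded when it entered the archive. SHA-256 is used for illustration.
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the given byte stream."""
    return hashlib.sha256(data).hexdigest()

def fixity_ok(data: bytes, recorded_digest: str) -> bool:
    """True if the current digest matches the one recorded at ingest."""
    return sha256_of(data) == recorded_digest
```

A repository would run this check on a schedule, logging each comparison as a preservation event.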


Tuesday, December 20, 2016

File Extensions and Digital Preservation

File Extensions and Digital Preservation. Laura Schroffel. In Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016.
     The post looks at metadata challenges with digital preservation. Most of the born-digital material they work with exists on outdated or quickly obsolescing media, such as floppy disks, compact discs, hard drives, and flash drives that are transferred into their Rosetta digital preservation repository, and accessible through Primo.

"File extensions are a key piece of metadata in born-digital materials that can either elucidate or complicate the digital preservation process". The extensions describe format type, provide clues to file content, and indicate a file that may need preservation work. The extension is an external label that is human readable, often referred to as external signatures. "This is in contrast to internal signatures, a byte sequence modelled by patterns in a byte stream, the values of the bytes themselves, and any positioning relative to a file."

Their born-digital files are processed on a Forensic Recovery of Evidence Device (FRED), which can acquire data from many types of media, such as Blu-Ray, CD-ROM, DVD-ROM, Compact Flash, Micro Drives, Smart Media, Memory Stick, Memory Stick Pro, xD Cards, Secure Digital Media and Multimedia Cards. The workstation also has Forensic Toolkit (FTK) software, which can process a file and indicate the file format type and often the software version. There are challenges, since file extensions are not standardized or unique, such as naming conflicts between types of software, or older Macintosh systems that did not require file extensions. Also, because FRED and FTK originated in law enforcement, challenges arise when using them to work with cultural heritage objects.


Monday, December 19, 2016

Metadata Specialists Share Their Challenges, Defeats, and Triumphs

Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016.
     "Metadata is a common thread that unites people with resources across the web—and colleagues across the cultural heritage field. When metadata is expertly matched to digital objects, it becomes almost invisible. But of course, metadata is created by people, with great care, time commitment, and sometimes pull-your-hair-out challenge."  At the Getty there are a number of people who work with metadata "to ensure access and sustainability in the (digital) world of cultural heritage—structuring, maintaining, correcting, and authoring it for many types of online resources." Some share their challenges, including:
Some notes from some of the challenges:
  • The metadata process had to be re-thought when they started publishing digitally, because the metadata machinery was designed specifically for print books. That machinery proved mostly useless for their online publications, so they started from scratch to find the best ways of sharing book metadata to increase discoverability.
  • "Despite all of the standards available, metadata remains MESSY. It is subject to changing standards, best practices, and implementations as well as local rules and requirements, catalogers’ judgement, and human error." 
  • Another challenge with access is creating relevancy in the digital image repository 
  • Changes are needed in skills and job roles to make metadata repositories truly useful. 
  • "One of the potential benefits of linked open data is that gradually, institutional databases will be able speak to each other. But the learning curve is quite large, especially when it comes to integrating these new concepts with traditional LIS concepts in the work environment."

Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context and also consider provenance information to determine actions. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; they use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about the digital objects as possible before ingest. But there are issues that occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction which is also format related
The use cases vary depending on the customers, types of material, and formats. A broad range of use cases apply, from safeguarding research data for a limited period of time (ten years at minimum) to publishing and preserving data in the long term. Understanding the use cases’ characteristics helps provide "a better understanding of what actually matters most in each case."

Ideally, format identification should yield reliable and unambiguous information on the format of a given file; however, a number of problems make the process more complicated. Handling files on an individual basis does not scale well, which may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for lower preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats where no applications exist?
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file orcompromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, and institutions need to decide if they will correct the issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.
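Purely as an illustration of how the assessment criteria listed above (usability, tool errors, seriousness, correctability, risk, effort) might combine, here is a hypothetical decision rule in Python. The rule is an assumption of this summary, not the poster authors' actual policy.

```python
# Hypothetical decision sketch built from the poster's criteria.
# The keys and the rule itself are illustrative assumptions.
def should_correct(error: dict) -> bool:
    """Decide whether to fix a format/validation error before ingest."""
    if error.get("tool_error"):  # known tool artefact: nothing to actually fix
        return False
    if not error.get("affects_significant_properties"):
        return False             # cosmetic deviation: accept the file as-is
    # act only when a documented, low-risk, affordable fix exists
    return (error.get("documented_solution", False)
            and error.get("risk", "high") == "low"
            and error.get("effort", "high") != "high")
```

In practice such a rule would only triage files; borderline cases would still go to a human, as the poster's emphasis on institutional context suggests.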

Monday, September 05, 2016

Preservation Challenges in the Digital Age

Preservation Challenges in the Digital Age. Bernadette Houghton. D-Lib Magazine. July/August 2016.
     The rapidly evolving digital preservation field has many preservation challenges:
  • Digital materials are more at risk than analogue materials
  • Preserving digital materials also means providing access to them
  • Ensuring the infrastructure that renders the file is preserved or replicated
  • Focal areas are changing and best practices are still under debate.
"The optimal preservation strategy for individual organisations will differ according to their requirements, resources and data type. Each strategy comes with its own set of challenges, many of which are dependent on, or impacted in some way by, other challenges. This article will cover what the author sees as the major challenges for digital preservation at this point in time, covering a range of technical, administrative, logistical and legal aspects."

Other challenges:
  • Data volumes. Digital storage is becoming cheaper, but not every file and every version of it can and should be stored or preserved. Selecting what to preserve and when to take preservative action becomes more complex with a larger volume of data and a wider range of storage media. This increases the risk of failing to preserve materials of historical value. There is also a higher risk of data not being found because of poor metadata.
  • Archivability. One of the most fundamental challenges in archiving is determining what should be preserved and the extent of preservation.
  • Multiplicities. Materials born digital today are likely to have multiple copies in multiple versions stored in multiple locations, possibly under multiple filenames and in multiple file formats.
  • Hardware and storage. Obsolescence, deterioration of media and hardware mechanical failure increase the risk of loss. The cloud is increasingly used for storage, but there are also significant issues with using it.
  • File formats. File formats were once considered a major risk in digital preservation, but they have not proven to be the overwhelming danger they were initially perceived to be. Proprietary file formats continue to pose a challenge.
  • Metadata. Metadata is probably the most important aspect of digital preservation. Materials with poor metadata may be undiscoverable, and their authenticity, verifiability and their context unclear.
  • Legalities. Digital preservation presents some complex legal issues.
  • Privacy. Material chosen for preservation may contain private and confidential information, and its unauthorised release may lead to legal action.
  • Resourcing. Preservation costs involve not just the actual digitisation, but also storage, infrastructure, staff resourcing and training, ongoing maintenance and auditing of the digitised materials. There are also costs associated with providing access.
The challenge is to use scarce resources to preserve the most important materials, using the most cost-effective and efficient methods. Even choosing not to preserve materials involves costs. Those who will benefit most from current preservation programs are future generations, which makes it difficult to justify expenditure on digital preservation, since there is little current benefit. The "best that the preservation community can do with digital material is to make educated guesses based on a few decades of mostly anecdotal experience".

"The challenges in digital preservation involve dealing with not just the technologies of the past, but also those to come". The digital preservation field is developing rapidly and the people working with digital materials need to keep up with the changes.


Wednesday, April 13, 2016

Beyond the Binary: Pre-Ingest Preservation of Metadata

Beyond the Binary: Pre-Ingest Preservation of Metadata. Jessica Moran, Jay Gattuso. iPres 2015. Nov. 2015.
     This paper describes some of the challenges the National Library of New Zealand has faced to maintain the authenticity of born digital collections (objects and metadata) from the time they are first received until they are ingested into their Rosetta digital preservation system. Two specific challenges relate to contextual metadata of filenames and file dates.

"The digital preservation analyst is responsible for technical assessment of digital content going into the digital preservation system, and troubleshooting digital content that fails validation checks". The digital archivists serve as archival and content subject matter experts; the digital preservation analyst is the subject matter expert for technical concerns. The two perspectives allow for robust workflows that better preserve the content.  They are especially interested in "file system metadata such as filenames and dates that are not embedded with the objects themselves, but rather are stored externally in the file system". Filename and date metadata have "challenged us to think critically about what constitutes acceptable, reversible, and recordable change and where and how this metadata should be stored for preservation and later for delivery to users".

Proper handling rules mean that, for digital preservation, files need to be treated slightly more sensitively. We might want to know what the original file extension was, as it is an important part of a file’s provenance.

Most born digital objects they receive have three dates: created date, last modified date, and last accessed date. They can be used to confirm an object is what it says it is. They have a practice of "touching the original file as little as possible and only as much as needed to get the file into the preservation environment".
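The three dates are readable directly from the file system. A minimal sketch of capturing them with the Python standard library (the function name is hypothetical; note that the meaning of `st_ctime` varies by platform, which matters when using dates as evidence):

```python
import datetime
import os

def file_dates(path):
    """Return the three file-system dates commonly used to help
    confirm that a born-digital object is what it claims to be."""
    st = os.stat(path)
    to_iso = lambda ts: datetime.datetime.fromtimestamp(ts).isoformat()
    return {
        # On Windows st_ctime is the creation time; on Unix it is
        # the time of the last metadata change.
        "created": to_iso(st.st_ctime),
        "last_modified": to_iso(st.st_mtime),
        "last_accessed": to_iso(st.st_atime),
    }
```

Such dates should be recorded before any copy or transfer, since copying typically resets them.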

One solution is the creation of forensic disk images as a first step in the transfer process. Another solution would be to create "a tool to help us automate the original and any subsequent transfers of born digital content, ensure the capture of original filename and date metadata and any preconditioning actions we performed, and at the same time create a log of that activity that is auditable and both human and machine readable."  They have been developing a script to accomplish what they need.
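The paper does not publish its script, but the requirements it describes can be sketched: copy a file into a staging area while recording its original name, file-system dates, and a checksum in a log that is both human and machine readable. All names below are hypothetical, and a real tool would need error handling and configuration:

```python
import hashlib
import json
import os
import shutil
import time

def transfer_with_log(src, dest_dir, log_path):
    """Copy a born-digital file into a staging area, recording its
    original filename, file-system dates, and SHA-256 checksum in an
    auditable JSON-lines log (one entry per transfer)."""
    st = os.stat(src)
    sha256 = hashlib.sha256()
    with open(src, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy2(src, dest)  # copy2 preserves the modification time
    entry = {
        "original_path": src,
        "original_name": os.path.basename(src),
        "created": st.st_ctime,
        "last_modified": st.st_mtime,
        "last_accessed": st.st_atime,
        "sha256": sha256.hexdigest(),
        "transferred_to": dest,
        "transferred_at": time.time(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```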

Their ongoing questions concern whether the delivery of objects from the digital preservation system should include proof of the integrity and authenticity of the binary object, through delivery of the associated metadata.

Thursday, March 17, 2016

Guidelines for the selection of digital heritage for long-term preservation

Guidelines for the selection of digital heritage for long-term preservation. UNESCO/PERSIST Content Task Force. March 2016.
     Libraries, archives, and museums traditionally have the responsibility of preserving the intellectual and cultural resources produced by society, but this is in jeopardy because of the amount of information created every day in digital form. Digital content is doubling in size every two years. The digital content is also in danger because much of it is ephemeral; it lacks the longevity of physical objects. The challenge of keeping digital content "requires a rethinking of how heritage institutions identify significance and assess value". Institutions must proactively identify and preserve digital heritage and information before it is lost. The roles of libraries, archives, and museums are blurring in the digital age, but they still have a major interest in preserving heritage.

Libraries face the challenge of selecting digital content for long-term preservation. Many focus on short-term use of content already in their collection, rather than assessing new publications for acquisition. Archives have traditionally "relied on the passage of time between their creation and their acquisition by an archive to lend historical perspective in making selection decisions". However, the window of opportunity for selection is now shorter, given the rapid obsolescence of digital formats, storage media, and system hardware and software. Some strategies for selecting digital content:

Acting locally 1: Strategies for collecting digital heritage.
  • Comprehensive collecting to acquire all of the material produced on a given subject area, time period, or geographic region.
  • Representative sampling to capture a representative picture, making selection and preservation more manageable and less resource-intensive.
  • Selecting material for addition to their collections based on specific criteria, such as
    • Subject/Topic.
    • Creator/Provenance.
    • Type/Format.
  • Institutions could also defer selection by capturing all the digital heritage material now and applying selection criteria later.

Acting locally 2: Developing selection criteria for a single institution
How should institutions select, identify, and prioritize digital heritage before it is lost? Evaluating and assessing digital content should be based on the principles that underlie traditional selection, but with a long-term perspective on use and access as defined by the institution's mandate and users.
Decision Tree for Selection in an individual Institution
  1. Identification. Identify the material to be acquired or evaluated. 
  2. Legal framework. Does the institution have a legal obligation to preserve the material?
  3. Application of three selection criteria to determine if content should be preserved: significance, sustainability, and availability
  4. Decision. Make a decision based on the three items above, then document the rationale and justification for the evaluation or decision.
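The decision tree above can be sketched as a small function. This is an illustration only: the function name and the threshold of "enough criteria met" are assumptions, not part of the UNESCO/PERSIST guidelines, which leave the weighing of criteria to each institution:

```python
def select_for_preservation(has_legal_obligation,
                            significance, sustainability, availability,
                            threshold=2):
    """Sketch of the four-step decision tree: a legal obligation
    mandates preservation; otherwise preserve when enough of the
    three selection criteria (significance, sustainability,
    availability) are met. Returns (decision, rationale) so the
    rationale can be documented, per step 4."""
    if has_legal_obligation:
        return True, "legal obligation to preserve"
    met = sum([significance, sustainability, availability])
    return met >= threshold, f"{met}/3 selection criteria met"
```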
"The long-term preservation of digital heritage is perhaps the most daunting challenge facing heritage institutions today. Developing and implementing selection criteria and collecting policies is the first step to ensuring that vital heritage material is preserved for the benefit of current and future generations."

Appendix 1: Management of long-term digital preservation and metadata. If the digital heritage is the “content”, then the metadata provides the “context”.

"Selection of digital heritage is closely connected with issues related to long-term preservation and access. Some losses of important digital heritage may be unavoidable, but the risk can be mitigated by following best practices in digital preservation, including redundancy, active management, and metadata management."

Three key types of metadata crucial to long-term preservation:
  • Structural (required for the technical capacity to read digital content)
  • Descriptive (containing bibliographic, archival, or museum contextual information, which can be system-generated or created by heritage professionals, content creators, and/or users)
  • Administrative (documenting the management of a digital object while in its collection).

Five basic functional requirements for digital metadata:
  1. Identification: The metadata must identify each digital object uniquely and unambiguously.
  2. Location: The metadata must allow each digital object to be located and retrieved.
  3. Description: A description of the digital object, as well as data about its content and context.
  4. Readability: Metadata about the structure, format, and encoding of digital objects.
  5. Rights management: Rights and conditions of use and restrictions must be recorded.
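One way to make the five requirements concrete is a record structure with one field per requirement. A minimal sketch (the class and field names are hypothetical; real systems use schemas such as PREMIS or METS rather than an ad hoc record):

```python
from dataclasses import dataclass, field

@dataclass
class PreservationMetadata:
    """Illustrative record covering the five functional requirements
    for digital metadata listed above."""
    identifier: str                 # 1. unique, unambiguous identification
    location: str                   # 2. where the object can be located and retrieved
    description: str                # 3. content and context of the object
    format_info: dict = field(default_factory=dict)  # 4. readability: structure, format, encoding
    rights: str = "unknown"         # 5. rights, conditions of use, restrictions
```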

Saturday, March 05, 2016

The M Word: The Good, the Bad, the Ugly

The M Word: The Good, the Bad, the Ugly. Robert H. McDonald and Juliet L. Hardesty. Educause Review. January 11, 2016.
     Metadata is a part of all aspects of the research process. "The expectations are that metadata will be clean and understandable, secure and accessible when appropriate, and easily shareable." In reality this doesn’t happen naturally or without "concerted effort and cooperation" across policy, design, and practice. Metadata created by hand can be problematic; academic research librarians work with metadata to "ensure good storage, maintainability, shareability, and most importantly, accessibility."

Libraries are often the right agency to serve as a neutral mediator for collaborations among researchers. New directions in the research process are "creating new roles and opportunities for libraries to help in preserving, managing, publishing, and accessing data."

Friday, February 26, 2016

Having FITS Over Digital Preservation?

Having FITS Over Digital Preservation? Jeffrey Erickson. NDSR Boston. February 11, 2016.
     FITS (File Information Tool Set) is an open source digital preservation tool designed to identify and validate a wide assortment of file formats, determine technical characteristics, and extract embedded metadata. FITS outputs the technical metadata it gathers as XML. Digital preservation repositories contain a growing number of file formats. "Proper identification of a file’s format and the extraction of embedded technical metadata are key aspects of preserving digital objects. Proper identification helps determine how digital objects will be managed and extracting embedded technical metadata provides information that future repository staff or users need to render, transform, access and use the digital objects." The current version of FITS bundles many other tools together and makes them all easier to use; among them: Droid; ExifTool; ffident; Jhove; MediaInfo (video files); the National Library of New Zealand Metadata Extraction Tool. Using multiple tools can help verify the file information.
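Because FITS consolidates the bundled tools' results into one XML report, the agreement between tools can be checked programmatically. A hedged sketch of parsing the identification section with the standard library; the sample XML here is illustrative, and the element and attribute names should be checked against the output of your FITS version:

```python
import xml.etree.ElementTree as ET

# Namespace used by FITS output XML (verify against your FITS version).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Illustrative sample of a FITS report, trimmed to the identification section.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf">
      <tool toolname="Droid" toolversion="6.3"/>
    </identity>
  </identification>
</fits>"""

def identified_formats(fits_xml):
    """Return (format, mimetype, tools) tuples from a FITS report,
    so agreement between the bundled tools can be inspected."""
    root = ET.fromstring(fits_xml)
    results = []
    for identity in root.findall(".//fits:identification/fits:identity", FITS_NS):
        tools = [t.get("toolname") for t in identity.findall("fits:tool", FITS_NS)]
        results.append((identity.get("format"), identity.get("mimetype"), tools))
    return results
```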

FITS consolidates and normalizes the output, providing a homogenized data set that is easier to interpret. The output can be inserted into other files, such as METS files, that can provide digital preservation documentation about the file. FITS can assist with quality control, improving metadata, and format migration. FITS sites: fitstool.org; GitHub

Monday, January 18, 2016

Exploring the potential of Information Encapsulation techniques

Exploring the potential of Information Encapsulation techniques. Anna Eggers. PERICLES Blog. 30 November 2015.
     Information Encapsulation is the aggregation of information that belongs together and can be implemented at different states of the information life cycle. For Digital Preservation this usually means pairing a digital object with its metadata. The PeriCAT open-source tool provides encapsulation techniques and mechanisms that help ensure the information remains accessible even if the digital object leaves its creation environment. The tool supports the creation of self-describing objects and the long-term reusability of information.

The two main categories are Information Embedding and Packaging. Packaging refers to the aggregation of information entities, like files or streams, as equal entities stored in an information container. In contrast, information embedding needs a carrier information entity in which the payload information is embedded.

Packaging techniques: adding files containing the information to simple archive packages such as BagIt, zip, and tar. Metadata files, such as METS and OAI-ORE, can be added to the archive packages. This ensures that the packaged objects can be restored so that the restored objects are identical to the originals and can be verified by checksum.
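A minimal packaging sketch using only the standard library: bundle objects and their metadata into one tar archive together with a checksum manifest, so restored copies can be verified. This is a BagIt-style idea for illustration, not an implementation of the BagIt specification or of PeriCAT; the function name is hypothetical:

```python
import hashlib
import os
import tarfile

def package_with_manifest(paths, archive_path):
    """Package files plus a SHA-256 manifest into a gzipped tar
    archive, so restored copies can be verified against the
    recorded checksums."""
    manifest_lines = []
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        manifest_lines.append(f"{digest}  {os.path.basename(p)}")
    manifest_path = archive_path + ".manifest.txt"
    with open(manifest_path, "w") as m:
        m.write("\n".join(manifest_lines) + "\n")
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in paths:
            tar.add(p, arcname=os.path.basename(p))
        tar.add(manifest_path, arcname="manifest-sha256.txt")
    os.remove(manifest_path)  # the manifest now lives inside the archive
    return archive_path
```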

Embedding techniques: a distinction is made between the carrier information (the format of the item) and the message information, which is embedded into the object itself. Techniques include digital watermarking, steganography (hiding messages), and attaching files or text to objects.

PeriCAT (PERICLES Content Aggregation Tool) is a framework that allows the encapsulation and decapsulation of information. (Decapsulation is the process of separating encapsulated entities from each other.) Each of the techniques has different features; the technique to be used should be chosen based on the specified requirements.

PERICLES

Monday, November 23, 2015

Introduction to Metadata Power Tools for the Curious Beginner

Introduction to Metadata Power Tools for the Curious Beginner. Maureen Callahan, Regine Heberlein, Dallas Pillen. SAA Archives 2015. August 20, 2015.   PowerPoint  Google Doc 
      "At some point in his or her career, EVERY archivist will have to clean up messy data, a task which can be difficult and tedious without the right set of tools." A few notes from the excellent slides and document:

Basic Principles of Working with Power Tools
  • Create a Sandbox Environment: have backups; it is OK to break things
  • Think Algorithmically: break a big problem down into smaller steps
  • Choosing a Tool: the best tool is one that works for your problem and skill set
  • Document: successes, failures, procedures
Dare to Make Mistakes
  • as long as you know how to recognize and undo them!
  • view mistakes as an opportunity
  • mistakes can teach you as much about your data as about your tool
  • share your mistakes so others may benefit
  • realize that everybody makes them
General Principles
  • Know the applicable standards
  • Know your data
  • Know what you want
  • Normalize your data before you start a big project
  • The problem is intellectual, not technical
  • Use the tools available to you
  • Don’t do what a machine can do for you
  • Think about one-off operations vs. tools you might re-use or re-purpose
  • Think about learning tools in terms of raising the level of staff skill
Tools
  • XPath
  • Regex
  • XQuery
  • XQuery Update
  • XSLT
  • batch
  • Linux command line
  • Python
  • AutoIt
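Regular expressions are among the most broadly applicable of these tools for cleaning messy metadata. A small hypothetical example of the kind of one-off normalization the presenters describe, rewriting inconsistent date strings to ISO 8601 (the patterns are illustrative, not exhaustive):

```python
import re

def normalize_date(raw):
    """Normalize a few common messy date forms to ISO 8601
    (YYYY-MM-DD); unrecognized values are returned unchanged
    so a human can review them."""
    raw = raw.strip()
    # e.g. "20/08/2015" or "20-08-2015" (day first) -> "2015-08-20"
    m = re.fullmatch(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4})", raw)
    if m:
        day, month, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    # e.g. "2015.08.20" (year first) -> "2015-08-20"
    m = re.fullmatch(r"(\d{4})[./](\d{1,2})[./](\d{1,2})", raw)
    if m:
        year, month, day = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return raw
```

Ambiguous day/month order is exactly the sort of case where "know your data" applies: the first pattern assumes day-first input.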