Tuesday, March 31, 2015

PERICLES Environment Extraction Project

PERICLES Environment Extraction Project and tools. Website. 12/11/2014.
The PERICLES project aims to keep digital content accessible as the digital environment continues to change. The website discusses environment information: what it is, why it is important, and how to collect it.

Digital objects are created and exist in environments, and information about those environments can be important to the current and long-term use and re-use of the content. This information, which needs to be collected at creation and throughout the object's life cycle, is highly relevant to preserving the data long-term. Most metadata standards describe the object but ignore the environment. Examples of environmental information include dependencies (what you need in order to use the object), environment reconstruction, resource status, validation, monitoring, and extraction techniques.

The PERICLES Extraction Tool (PET), as discussed in an article in D-Lib Magazine by Fabio Corubolo, has been created to extract environmental information from the environments where objects are created and modified. It analyses how the data is used within its environment, capturing information that may not be available later.

Sheer curation (as in lightweight or transparent) depends on data capture being embedded within the data creators’ working practices so that it is automatic and invisible to them.

Monday, March 30, 2015

Digital Preservation Challenges with an ETD Collection: A Case Study at Texas Tech University

Digital Preservation Challenges with an ETD Collection — A Case Study at Texas Tech University. Joy M. Perrin, Heidi M. Winkler, Le Yang. The Journal of Academic Librarianship. January 2015.
The potential risk of loss seems distant and theoretical until it actually happens. The "potential impact of that loss increases exponentially" for a university when the loss is part of the research output. This excellent article looks at a case study of the challenges one university library encountered with its electronic theses and dissertations (ETDs).  Many institutions have been changing from publishing paper theses and dissertations to accepting electronic copies. One of the challenges that has not received as much attention is that of preserving these electronic documents for the long term.  The electronic documents require more hands-on curation.

Texas Tech University encountered difficulties with preserving their ETD collection. They hope the lessons learned from these data losses will help other organizations looking to preserve ETDs and other types of digital files and collections. Some of the losses were:
  1. Loss of metadata edits. Corrupted database and corrupted IT backups required a rebuild of the database, but the entered metadata was lost.
  2. Loss of administrative metadata: embargo periods. The ETD-db files imported into DSpace did not include the embargoed files. Plans were not documented and personnel changed before the problem was discovered. Some items were found accidentally on a personal drive years later.
  3. Loss of scanned files. The scanning server was also the location to store files after scanning. Human error beyond the backup window resulted in the deletion of over a thousand scanned ETDs, which were eventually recovered.
  4. Failure of policies: loss of embargo status changes. The embargo statement recorded in the ETD management system did not match what was published in DSpace.
The library started on real digital preservation for the ETD collection. Funds were set aside to increase the storage of the archive space and provide a second copy of the archived files. A digital resources unit was created to handle the digital files which finally brought the entire digital workflow, from scanning to preservation, under one supervisor. The library joined DPN in hopes that it would yield a level of preservation far beyond what the university would be able to accomplish alone. The clean-up of the problems has been difficult and will take years to accomplish. Lessons learned:
  1. Systems designed for managing or publishing documents are not preservation solutions
  2. System backups are not reliable enough to act as a preservation copy. Institutions must make digital preservation plans beyond backups
  3. Organizations with valuable digital assets should invest in storing those items outside of the display system alone.
  4. Multiple copies of digital items must reside on different servers in order to guarantee that files will not be accidentally deleted or lost through technical difficulties. 
  5. All metadata, including administrative data, should be preserved outside of the display system. The metadata is a crucial part of the digital item.
  6. Digital items are collections of files and metadata.
  7. Maintaining written procedures and documentation for all aspects of digital collections is vital.
  8. The success of digital preservation will require collaboration between curators and the IT people who maintain the software and hardware, and consistent terminology (e.g., what "archived" means).
 "Even though this case study has primarily been a description of local issues, the grander lessons gleaned from these crises are not specific to this institution. Librarians are learning and re-learning every day that digital collections cannot be managed in the same fashion as their physical counterparts. These digital collections require more active care over the course of their lifecycles and may require assistance from those outside the traditional library sphere...."


Tabula. Website. March 27, 2015.
Tabula is a tool for working with text-based data tables inside PDF files. There is no easy way to copy and paste rows of data out of PDF files. This tool lets you extract that data into an Excel spreadsheet, CSV, or JSON using a simple interface. Tabula works on Mac, Windows and Linux.

Friday, March 27, 2015

National Institutes of Health: Plan for Increasing Access to Scientific Publications and Digital Scientific Data from NIH Funded Scientific Research

National Institutes of Health: Plan for Increasing Access to Scientific Publications and Digital Scientific Data from NIH Funded Scientific Research. February 2015.

This document describes NIH’s plans to build upon and enhance its longstanding efforts to increase access to scholarly publications and digital data resulting from NIH-funded research. Sections relevant to digital preservation and long-term management:

NIH intends to make public access to digital scientific data the standard for all NIH funded research. Following adoption of the final plan, NIH will:
  • Explore steps to require data sharing.
  • Ensure that all NIH-funded researchers prepare data management plans and that the plans are evaluated during peer review.
  • Develop additional data management policies to increase public access to designated types of biomedical research data.
  • Encourage the use of established public repositories and community-based standards.
  • Develop approaches to ensure the discoverability of data sets resulting from NIH-funded research to make them findable, accessible, and citable.
  • Promote interoperability and openness of digital scientific data generated or managed by NIH.
  • Explore the development of a data commons. NIH will explore the development of a commons, a shared space for basic and clinical research output including data, software, and narrative, that follows the FAIR principles (Findable, Accessible, Interoperable, Reusable).

Preservation is one of the Public Access Policy’s primary objectives. The policy aims to ensure that publications and metadata are stored in an archival solution that:
  • provides for long-term preservation and access to the content without charge; 
  • uses standard, widely available and, to the extent possible, nonproprietary archival formats for text and associated content (e.g., images, video, supporting data); 
  • provides access for persons with disabilities
The content in the NIH database is actively curated using XML records, an approach that is future-proof in that XML is technology-independent and can be easily and reliably migrated as technology evolves.
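The plan does not include examples, but the appeal of XML for curation is easy to illustrate: records stay readable with generic tooling, independent of the system that produced them. A minimal sketch (the record structure here is invented, not NIH's actual schema):

```python
import xml.etree.ElementTree as ET

# Invented example record; the point is that XML metadata stays
# parseable with any standard XML tooling, independent of the
# system that created it.
record = """<article>
  <pmcid>PMC123456</pmcid>
  <title>Example study</title>
  <year>2015</year>
</article>"""

root = ET.fromstring(record)
title = root.findtext("title")
year = int(root.findtext("year"))
```

Any future system that can parse XML can recover these fields, which is the sense in which the format is technology-independent.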

The first principle behind the plan for increasing access to digital scientific data is: The sharing and preservation of data advances science by broadening the value of research data across disciplines and to society at large, protecting the integrity of science by facilitating the validation of results, and increasing the return on investment of scientific research.

Data Management Plans
Data management planning should be an integral part of research planning.  NIH wants to ensure that all extramural researchers receiving Federal grants and contracts for scientific research and intramural researchers develop data management plans describing how they will provide for long-term preservation of, and access to, scientific data in digital formats resulting from federally funded research, or explaining why long-term preservation and access cannot be justified. In order to preserve the balance between the relative benefits of long-term preservation and access and the associated cost and administrative burden, NIH will continue to expect researchers to consider the benefits of long-term preservation of data against the costs of maintaining and sharing the data.

NIH will assess whether the appropriate balance has been achieved in data management plans between the relative benefits of long-term preservation and access and the associated cost and administrative burden. It will also develop guidance with the scientific community to decide which data should be prioritized for long-term preservation and access. NIH will also explore and fund innovative tools and services that improve search, archiving, and disseminating of data, while ensuring long-term stewardship and usability.

Assessing Long-Term Preservation Needs
NIH will provide for the preservation of scientific data and outline options for developing and sustaining repositories for scientific data in digital formats.  The policies expect long-term preservation of data.
Long-term preservation and sustainability will be included in data management plans, and NIH will collaborate with other agencies on how best to develop and sustain repositories for digital scientific data.

Siegfried v 1.0 released (a file format identification tool)

Siegfried v 1.0 released (a file format identification tool). Richard Lehane. Open Preservation Foundation. 25th Mar 2015. Siegfried is a file format identification tool that is now available. The key features are:
  • complete implementation of PRONOM (byte and container signatures)   
  • reliable results
  • fast matching without limiting the number of bytes scanned
  • detailed information about the basis for format matches
  • simple command line interface with a choice of outputs (YAML, JSON, CSV)
  • a built-in server for integrating with workflows 
  • options for debug mode, signature modification, and multiple identifiers.
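PRONOM signatures are far richer than simple magic numbers (they include container signatures and byte sequences at arbitrary offsets), but the core idea of byte-signature matching can be sketched in a few lines. This is a loose illustration of the concept, not Siegfried's code, and the signature table is invented rather than drawn from PRONOM:

```python
# Minimal sketch of byte-signature file format identification,
# loosely illustrating the idea behind tools like Siegfried.
# The signatures below are illustrative, not taken from PRONOM.
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "PDF": b"%PDF-",
    "GIF": b"GIF8",
}

def identify(data: bytes) -> str:
    """Return the first format whose leading byte signature matches."""
    for fmt, magic in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return "UNKNOWN"
```

Real tools also report *why* a match was made (which signature, at which offset), which is the "detailed information about the basis for format matches" the feature list refers to.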

Thursday, March 26, 2015

Letter to the editor concerning digital preservation of government information

DttP letter to the editor re digital preservation of government information. James R. Jacobs.  ALA Connect. January 26, 2015.
Digital preservation is an incredibly important topic for government information professionals. This letter, written in response to a previous article, includes several important points for all libraries.
  1. Preservation of born-digital information is a very real and important topic that the government documents community needs to understand and address. In a single year, more government information is born-digital than all the printed government information accumulated by all Federal Depository libraries in over 200 years.
  2. Digitization of print information is not a preservation solution. Instead it creates new digital preservation challenges and is really just the first of many costly and technically challenging steps needed to ensure long-term access to content.
  3. Access is not preservation; it does not guarantee preservation or long-term access. 
    1. Access without preservation is temporary, at best. 
    2. Preservation without access is an illusion.
  4. Digital preservation is an essential activity of libraries. It cannot be dismissed as the responsibility of others. Digital preservation requires:
    1. resources, 
    2. a long-term commitment,
    3. an understanding of the long-term value of information (even information that is not popular or used by many people), 
    4. a commitment to the users of information.  
  5. Relying solely on the government or others to preserve its information is risky. “Who is responsible for this preservation?” Libraries should take this responsibility. Libraries can take actions now to promote the preservation of digital information
  6. Preserve Paper copies. The Federal Depository Library Program (FDLP) is successfully preserving paper and micro-form documents. "We often hear that “digitizing” paper documents will “preserve” them, but we do not need to convert these documents to digital in order to preserve them". While digitization can provide better access, usability, and re-usability of many physical documents, it does not guarantee the preservation of the content. Worse, there are repeated calls for digitizing paper collections so that the paper collections can be discarded and destroyed. Such actions will endanger preservation of the content if they do not include adequate steps to ensure digital preservation of those newly created digital objects. 
  7. Smart-Archive the Web. Although capturing web pages and preserving them is far from an adequate (or even accurate) form of digital preservation, it is a useful stop-gap until producers understand that depositing preservable digital objects with trusted repositories is the only way to guarantee preservation of their information. Libraries should use web archiving tools and services such as Archive-It.
  8. Promote Digital Preservation. Libraries should be actively preserving digital government information. The time of 'passive digital preservation' or looking to others to take care of digital preservation is long past. We can work with others, not leave the work to them.

Release of jpylyzer 1.14.1

Release of jpylyzer 1.14.1. National Library of the Netherlands / Open Preservation Foundation. 25 March 2015.
Release of a new version of jpylyzer. The tool validates that a JP2 image really conforms to the format’s specifications. It is also a feature (technical characteristics) extractor for JP2 images. Changes include improved XML output and recursive scanning of directory trees.
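As a tiny illustration of the kind of low-level check that full JP2 validation builds on (jpylyzer itself checks far more than this, down to the codestream level), a JP2 file must begin with a fixed 12-byte signature box:

```python
# Illustrative check of a JP2 file's 12-byte signature box, as
# defined in the JPEG 2000 Part 1 specification. This is one small
# piece of what a full validator like jpylyzer covers.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")

def has_jp2_signature(data: bytes) -> bool:
    """True if the file begins with the JP2 signature box."""
    return data[:12] == JP2_SIGNATURE
```

Passing this check says only that the file claims to be JP2; confirming that it "really conforms to the format's specifications" requires validating every box and the embedded codestream, which is jpylyzer's job.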

Sowing the seed: Incentives and Motivations for Sharing Research Data, a researcher's perspective

Sowing the seed: Incentives and Motivations for Sharing Research Data, a researcher's perspective. Knowledge Exchange. November 2014. PDF.
This study has gathered evidence, examples and opinions on incentives for research data sharing from the researchers’ point of view. Using this study will help provide recommendations on developing policies and best practices for data access, preservation, and re-use. An emerging theme today is making it possible for all researchers to share data and changing the collective attitude towards sharing.

A DCC project investigating researchers’ attitudes and approaches towards data deposit, sharing, reuse, curation and preservation found that data sharing requirements should be defined at a finer-grained level, such as the research group. When researchers talk about ‘data sharing’ there are different modes of data sharing, such as:
  1. private management sharing, 
  2. collaborative sharing, 
  3. peer exchange, 
  4. sharing for transparent governance, 
  5. community sharing and 
  6. public sharing.
Important motivations for researchers to share research data are:
  1. When data sharing is an essential part of the research process; 
  2. Direct career benefits (greater visibility and recognition of one’s work, reciprocal data)
  3. As a normal part of their research circle or discipline;
  4. Existing funder and publisher expectations, policies, infrastructure and data services
Some points on preservation of research information for research institution and research funders:
  • Recognize and value data as part of research assessment and career advancement
  • Set preservation standards for data formats, file formats, and documentation
  • Develop clear policies on data sharing and preservation 
  • Provide training and support for researchers and students to manage and share data so it becomes part of standard research practice.
  • Make all data related to a published manuscript available
Actions of some organizations regarding data management and preservation:
  • The Royal Netherlands Academy of Arts and Sciences requests its researchers to digitally preserve research data, ideally via deposit in recognised repositories, to make them openly accessible as much as possible; and to include a data section in every research plan stating how the data produced or collected during the project will be dealt with.
  • The Alliance of German Science Organisations adopted principles for the handling of research data, supporting long-term preservation and open access to research data for the benefit of science.
  • Research organizations receiving EPSRC funding will from May 2015 be expected to have appropriate policies, processes and infrastructure in place to preserve research data, to publish metadata for their research data holdings, and to provide access to research data securely for 10 years beyond the last data request.
  • The European Commission has called  for coordinated actions to drive forward open access, long-term preservation and capacity building to promote open science for all EC and national research funding.
  • The UK Economic and Social Research Council has mandated the archiving of research data from all funded research projects. This policy goes hand in hand with the funding of supporting data infrastructure and services. The UK Data Service provides the data infrastructure to curate, preserve and disseminate research data, and provides training and support to researchers.

Wednesday, March 25, 2015

I tried to use the Internet to do historical research. It was nearly impossible.

I tried to use the Internet to do historical research. It was nearly impossible. February 17, 2015. 
How do you organize so much information? So far, the Internet Archive has archived more than 430,000,000,000 web pages. It’s a rich and fantastic resource for historians of the near-past. Never before has humanity produced so much data about public and private lives – and never before have we been able to get at it in one place. In the past it was just a theoretical possibility, but now we have the computing power and a deep enough archive to try to use it.

But it’s a lot more difficult to understand than we thought. "The ways in which we attack this archive, then, are not the same as they would be for, say, the Library of Congress. There (and elsewhere), professional archivists have sorted and cataloged the material. We know roughly what the documents are talking about. We also know there are a finite number. And if the archive has chosen to keep them, they’re probably of interest to us. With the internet, we have everything. Nobody has – or can – read through it. And so what is “relevant” is completely in the eye of the beholder."

Historians must take new approaches to the data. No one can read everything, or even know what is in the archive. Better sampling, with samples specifically chosen for their historical importance, can give us a much better understanding. We need to ask better questions about how sites are constructed and what links exist between sites, and run more focused searches. And we need to know what questions to ask.

JHOVE Evaluation & Stabilisation Plan

JHOVE Evaluation & Stabilisation Plan. Open Preservation Foundation. March 2015.
JHOVE is an extensible software framework for performing format identification, validation, and characterization of digital objects. In February the JHOVE format validation tool was transferred to Open Preservation Foundation stewardship. Their initial review of JHOVE has been completed and the Evaluation & Stabilisation Plan is now available on the site. 

The main objective of our work to date has been to establish a firm foundation for future changes based on agile software development best practices. A further technical evaluation will be published in April that will also outline options for possible future development and maintenance tasks.

Tuesday, March 24, 2015

“An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates

“An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates. James A Jacobs. Free Government Information. March 1, 2015.
Post examines several articles concerning the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America.

In a study of page images in the HathiTrust, Paul Conway found that 25% of the 1,000 volumes examined contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” HathiTrust only attests to the integrity of the transferred file, not to the completeness of the original digitization effort. 

The “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today, will be inadequate for future, more sophisticated uses. Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects. Libraries have a special obligation to preserve the historic collections in their charge in an accurate form. 

The post cites several supporting articles.

Monday, March 23, 2015

New policy recommendations on open access to research data

New policy recommendations on open access to research data. Angus Whyte. DCC News. 19 January, 2015.
Some of the recommendations of the RECODE case studies concerning open access to research data:
  • Develop policies for open access to research data
  • Ensure appropriate funding for open access to research data 
  • Develop policies and initiatives for researchers for open access to high quality data
  • Identify key stakeholders and relevant networks 
  • Foster a sustainable ecosystem for open access to research data
  • Plan for the long-term, sustainable curation and preservation of open access data
  • Develop comprehensive and collaborative technical and infrastructure solutions that afford open access to and long-term preservation of high-quality research data
  • Develop technical and scientific quality standards for research data
  • Address legal and ethical issues arising from open access to research data
  • Support the transition to open research data through curriculum-development and training 
Two things needed with open access to research data:
  1. coherent open data ecosystem
  2. attention to research practice, processes and data collections

Saturday, March 21, 2015

Reaching Out and Moving Forward: Revising the Library of Congress’ Recommended Format Specifications

Reaching Out and Moving Forward: Revising the Library of Congress’ Recommended Format Specifications. Ted Westervelt, Butch Lazorchak. The Signal. Library of Congress. March 16, 2015.
The Library has created the Recommended Format Specifications, the result of years of work by experts from across the institution, because such guidance is essential to its mission. The Library is committed to making the collection available to its patrons now and for generations to come, and must be able to determine the physical and technical characteristics needed to fulfill this goal. The Specifications have hierarchies of characteristics, physical and digital, in order to provide guidance and determine the level of effort involved in managing and maintaining content. To continue to manage the materials, the Specifications must be constantly reviewed and updated as materials and formats change. One example is exploring the potential value of the SIARD format, developed by the Swiss Federal Archives, as a means of preserving relational databases.

A Geospatial Approach to Library Resources

A Geospatial Approach to Library Resources. Justin B. Sorensen. D-Lib Magazine. March/April 2015.
Fire insurance maps are a valuable resource. Digital versions of the original printed maps have been created and converted into georeferenced raster datasets using ArcGIS software, aligning each map to its appropriate geospatial location so that all of the historic maps form consistent digital overlays. This allows the information to be displayed, expressed and presented in completely new ways. GIS can be one of the many tools libraries have available to assist them in sharing their resources with others.
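Georeferencing is not detailed in the article, but at its heart is a simple idea: a six-parameter affine transform (as stored in a GeoTIFF or an ArcGIS world file) maps each pixel of the scanned map to a map coordinate. A minimal sketch, with made-up coefficients standing in for the ones ArcGIS would compute from control points:

```python
# Minimal sketch of raster georeferencing: an affine transform maps
# pixel (col, row) to map coordinates (x, y). The coefficients here
# are invented for illustration; in practice they come from a world
# file or GeoTIFF produced by GIS software such as ArcGIS.

def pixel_to_map(col, row, transform):
    """Apply a six-parameter affine transform (a, b, c, d, e, f):
    x = a*col + b*row + c;  y = d*col + e*row + f."""
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)

# Example: 0.5 map units per pixel, origin at (440000, 4510000),
# with y decreasing downward as in most raster images.
transform = (0.5, 0.0, 440000.0, 0.0, -0.5, 4510000.0)
```

Once every historic map shares such a transform into a common coordinate system, the overlays line up with each other and with modern basemaps.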

Friday, March 20, 2015

When checksums don't match...

When checksums don't match... Digital Archiving at the University of York. 2 February 2015.
Post about an example of files with MD5 errors. Various utilities were used to generate the checksums for both MD5 and SHA-1. One program showed a change, while another did not; when SHA-1 was used, it showed that the files had different checksums. Possibly an example of bit rot.
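The post's specific tools are not reproduced here, but the underlying comparison is simple to sketch with Python's standard hashlib: a single flipped bit changes both the MD5 and SHA-1 digests completely, which is how silent corruption becomes visible when stored checksums are re-verified.

```python
import hashlib

# Computing more than one checksum guards against tool quirks and
# makes silent corruption ("bit rot") detectable: one flipped bit
# yields entirely different digests.

def digests(data: bytes):
    """Return (MD5, SHA-1) hex digests of the given bytes."""
    return (hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest())

original = b"archival master file contents"
corrupted = bytearray(original)
corrupted[0] ^= 0x01  # simulate bit rot: flip one bit

md5_a, sha1_a = digests(original)
md5_b, sha1_b = digests(bytes(corrupted))
```

Comparing freshly computed digests against the values recorded at ingest is the routine check that caught the errors described in the post.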

Forecasting the Future of Libraries 2015

Forecasting the Future of Libraries 2015. American Libraries. February 26, 2015. 
While it’s nearly impossible to accurately predict the future, we can identify trends that can be key in understanding what the future might bring. It is important for libraries to spot trends and integrate them into their programs and services in order to remain useful and relevant. An article “Trending Now,” lists 5 trends that are worth looking at:
  1. Anonymity: it may help build community and is an increasingly important part of web interactions.
  2. Collective impact: organizations are adopting common agendas to address issues in the community. Librarians could become highly valued partners in collective-impact responses.
  3. Fast casual: establishments incorporate customized services and products, and also integrate technology with customer loyalty apps, online or mobile ordering, and mobile payments. Fast casual has advanced the growth of living-room-like flexible spaces (multiple and varied seating arrangements, easy-to-find power outlets) that accommodate social and business needs and are technologically savvy.
  4. Resilience: resilience includes preparation for and rapid recovery from physical, social, and economic disasters, including natural disasters, terrorist attacks, or economic collapse.
  5. Robots: libraries have seen robots and robotics as a next wave for technology access and training, even lending robots to help users experience what might soon be a regular part of their futures. [They could also be places to learn more about technology.]
The trend library is designed to provide the library community with a centralized and regularly updated source for trends—including how they are developing; why they matter for libraries; and links to the reports, articles, and resources that can further explain their significance. As a collection, it will grow to include changes and trends across society, technology, education, the environment, politics, the economy, and demographics.  Makerspaces are playing an increasingly important role in libraries.

Another article, “The Future, Today,” addresses similar concepts:
  • Digital downloads, ebooks, personal content, and live programming sit alongside books, periodicals, microfilm, audio, and video in today’s libraries. The library of the future will support and enhance navigation and exchange of these new forms of information. Library services must be delivered in ways that are digitally based or conveniently located in public places to help users with their busy schedules.
  • Collections are being carefully considered so as not to occupy too much square footage, leaving room for tech and social spaces, and a center for multiple activities.
  • Library staff in the future will be organized on the floor to be more effective ‘information guides’ to help patrons.
  • There will be more flexible spaces for evolving services and forms of information offering.
  • Libraries are no longer single-purpose repositories of books dedicated to quiet study. They have become dynamic hubs in various ways for the community of users.

Tools for Discovering and Archiving the Mobile Web

Tools for Discovering and Archiving the Mobile Web. Frank McCown, Monica Yarbrough and Keith Enlow. D-Lib Magazine. March/April 2015.
Many websites have been adapted for access by smartphones and tablets, which has required web archivists to change the way they archive this ephemeral web content. A tool called MobileFinder, built using Heritrix, can automatically detect mobile pages when given a seed URL. It can be used as a web service.
There are three primary techniques used by websites to deliver mobile pages, and the results of a test to determine which technique was used:
  1. Using responsive web design  techniques to deliver the same content to both desktop and mobile devices: 68%
  2. Dynamically serving different HTML and CSS to mobile devices using the same URL: 24%
  3. Using different URLs to send out desktop or mobile pages: 8%
MobileFinder found in a test that 62% of randomly selected URLs had a mobile-specific root page. A web archiving tool needs to be aware of when these methods are being used so it doesn't miss finding mobile content.
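MobileFinder's actual heuristics are not described in this summary. As a crude, hypothetical illustration of the idea, a detector might look for mobile-specific URL patterns (the "different URLs" technique) or a responsive viewport meta tag (the responsive-design technique):

```python
import re

# Crude, hypothetical sketch of mobile-page detection -- not
# MobileFinder's actual logic. Checks mobile-specific URL patterns
# and the viewport meta tag used by responsive designs.

MOBILE_URL_HINTS = ("m.", "mobile.", "/mobile/")

def looks_mobile(url: str, html: str) -> bool:
    """Heuristic: mobile URL pattern, or a viewport meta tag in the HTML."""
    host_path = url.split("://", 1)[-1]
    if any(hint in host_path for hint in MOBILE_URL_HINTS):
        return True
    return bool(re.search(r'<meta[^>]+name=["\']viewport["\']', html, re.I))
```

A crawler aware of such signals can fetch and archive the mobile variant of a page alongside the desktop one instead of silently missing it.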

Thursday, March 19, 2015

Falling Through the Cracks: Digital Preservation and Institutional Failures

Falling Through the Cracks: Digital Preservation and Institutional Failures. Jerome McDonough. CNI.  December 2014. Video.
A video that explores whether libraries, archives and museums are designed in a way to really provide long-term access to cultural heritage materials: why we are doing digital preservation, how we can do it better, and how we can do librarianship better. Looks at OAIS and the complexities of preserving cultural materials. We need to train people to have broader perspectives across different fields, such as librarians, archivists, and curators.

Trustworthiness: Self-assessment of an Institutional Repository against ISO 16363-2012

Trustworthiness: Self-assessment of an Institutional Repository against ISO 16363-2012. Bernadette Houghton. D-Lib Magazine. March/April 2015.
Digital preservation is a relatively young field, but progress has been made in developing tools and standards to better support preservation efforts. There is increased interest in standards for the audit and certification of digital repositories because researchers want to know they can trust digital repositories. Digital preservation is a long-term issue. The Trustworthy Repositories Audit and Certification (TRAC) checklist has been widely used as the basis for such audits. It later became ISO 16363 (based on the OAIS model), which contains 105 criteria in 3 areas:
  1. Organizational infrastructure (governance, structure and viability, staffing, accountability, policies, financial sustainability and legal issues)
  2. Digital object management (acquisition and ingest of content, preservation planning and procedures, information management and access)
  3. Infrastructure and security risk management (technical infrastructure and security issues)
"Undertaking a self-assessment against ISO 16363 is not a trivial task, and is likely to be beyond the ability of smaller repositories to manage." An audit is an arm's-length review of the repository, requiring evidence of compliance and testing to see that the repository is functioning as a Trusted Digital Repository. Most repositories at this time are in an ad hoc, still-evolving situation. That is appropriate for now, but a more mature approach should be taken in the future. The assessment process rates each feature as Full Compliance, Partial Compliance, or Not Compliant. The conclusions in the article include:
  • Self-assessment is time-consuming and resource-heavy, but a beneficial exercise
  • Self-assessment is needed before considering external certification. 
  • Certification is expensive.
  • Get senior management on board. Their support is essential.
  • Consider doing an assessment first against NDSA Levels of Digital Preservation  
  • Repository software may be OAIS-compliant, but that does not mean your repository is as well
  • Not all ISO 16363 criteria have the same importance; assess each criterion accordingly
  • ISO 16363 is based on a conceptual model and may not fit your exact situation
  • Determine in advance how deep the assessment will go.
  • Document the self-assessment from the start on a wiki and record your findings  
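The three areas and the three-level rating scale above lend themselves to a simple tally when documenting a self-assessment. A minimal Python sketch, with invented criterion labels (the real standard defines 105 criteria across the three areas):

```python
# Hypothetical sketch of tallying an ISO 16363-style self-assessment.
# The criterion identifiers and ratings are invented for illustration.

from collections import Counter

RATINGS = ("full", "partial", "non-compliant")

def summarize(assessment):
    """Count how many criteria fall under each compliance rating."""
    counts = Counter(assessment.values())
    return {r: counts.get(r, 0) for r in RATINGS}

assessment = {
    "3.1.1 mission statement": "full",
    "4.2.4 ingest procedures": "partial",
    "5.1.2 security risks": "non-compliant",
    "3.3.2 preservation policies": "full",
}

summary = summarize(assessment)
# e.g. {'full': 2, 'partial': 1, 'non-compliant': 1}
```

Recording the tally per area (organizational, object management, infrastructure) would follow the same pattern.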

Wednesday, March 18, 2015

Storage is a Strategic Issue: Digital Preservation in the Cloud

Storage is a Strategic Issue: Digital Preservation in the Cloud. Gillian Oliver, Steve Knight. D-Lib Magazine. March/April 2015.
Many areas are mandating a 'cloud first' policy for information technology infrastructures. The article is a case study of the decision to outsource and its consequences. Some highlights:
  1. data held in archives must be expected to be both preserved and accessible beyond the commercial lifespan of any current technology or service provider.
  2. an approach to addressing serious risks (such as loss, destruction or corruption of data) that is based purely on financial reasons is not acceptable; it does not take into account the preservation and custodial role of archives;
  3. there must be an explicit provision made for pre-defined exit strategies as well as effective monitoring and audit procedures
Two main challenges:
  1. tensions between the information management and information technology perspectives. From the IT perspective the information managers were perceived as crossing boundaries into areas which were not of their concern.
  2. funding model. This change was a consequence of moving from the purchase of equipment for storage for use in house, to the provision of storage as a service.
"If most organisations lose a document, so long as they get the document back they're pretty happy. But because of digital preservation being what it is, you don't want to lose or corrupt any of the bits, they have to be exactly the way they were before." 
Cultural heritage institutions should investigate using storage as a service offerings, and also look ahead to utilizing other cloud based services. When making such decisions, you must be aware of the short term consequences of cost saving (i.e. increased burden on operating budgets) as set against potential long term benefits.
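The point in the quotation above, that preservation storage must return every bit exactly as deposited, is commonly enforced with fixity checks: a checksum recorded at ingest is compared against one recomputed after retrieval. A minimal sketch (the algorithm choice and workflow here are illustrative, not the article's):

```python
# Minimal fixity-check sketch: detect any change to stored bits by
# comparing a digest recorded at ingest with one recomputed later.

import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest used as the fixity value recorded at ingest."""
    return hashlib.sha256(data).hexdigest()

def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """True if retrieved bytes still match the digest recorded at ingest."""
    return sha256_of(data) == recorded_digest

original = b"master TIFF bytes"
digest_at_ingest = sha256_of(original)

intact = verify_fixity(original, digest_at_ingest)                  # True
corrupted = verify_fixity(b"master TIFF byteZ", digest_at_ingest)   # False
```

Even a single changed byte produces a completely different digest, which is what makes this check suitable for cloud-retrieved content.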

Tuesday, March 17, 2015

The news where you are: digital preservation and the digital dark ages

The news where you are: digital preservation and the digital dark ages. William Kilbride. The Informed. 25 February 2015.
Excellent post about the state of digital preservation and the existing concerns. "It’s undoubtedly true that preserving digital content through technological change is a real and sometimes daunting challenge.  Our generation has invested as never before in digital content and it is frankly horrifying when you consider what rapid changes in technology could do to that investment." We desperately need to raise awareness about the challenge of digital preservation so that solutions can be found and implemented.  Digital preservation is a concern for everyone. Many organizations depend on data but few have a mission to preserve that data. These are social and cultural challenges as well as technical challenges. We need better tools and processes. But the lack of skills is a bigger challenge than obsolescence.  The loss of data can cause major problems for countries, organizations and people.

The truth about contracts

The truth about contracts. Kevin Smith. Scholarly Communications at Duke. February 13, 2015.
This post looks at how to license student work for deposit in an institutional repository and also some basic truths about contracts and licenses.  It is a good statement of what contracts and licenses are. Contracts are business documents, intended to accomplish specific goals shared by the parties; they should clearly express the intentions of the parties involved.

Contracts can supersede copyright law "not because they are so 'big' but because they are small." A contract is a “private law” arrangement by which two or more parties rearrange their relationship. It need not be formal; it is simply the mechanism we use to arrange our relationships in a great many situations, including teaching situations that implicate the copyrights held by students.

A license is “a revocable permission to commit some act that would otherwise be unlawful". Not all licenses are contracts, but most are.

Main features of CERIF

Main features of CERIF. Website. March 2015.
The Common European Research Information Format (CERIF) is a model of research entities and their relationships. Its data-centric model allows for a metadata representation of research entities, their activities, their interconnections, and their results. It includes semantic relationships that enable quality maintenance, archiving, access, and interchange of research information. It is very flexible and can be implemented in different ways. Today CERIF is used:
  • as a model for implementing a standalone Current Research Information System (CRIS)
  • as a model to define the wrapper around a legacy non-CERIF CRIS, and
  • as the definition of a data exchange format to create a common data warehouse from several systems
"Metadata is ... normally understood to mean structured data about resources that can be used to help support a wide range of operations. These might include, for example, resource description and discovery, the management of information resources and their long-term preservation.” 
See the presentation for more information.
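To make the "entities plus semantic relationships" idea concrete, here is a hedged sketch of the CERIF pattern of base entities joined by semantically typed, time-bounded link entities. The entity kinds, roles, and field names are invented for illustration; CERIF itself specifies a much richer formal model:

```python
# Toy illustration of the CERIF pattern: base entities connected by
# link entities that carry a semantic role and a validity period.

from dataclasses import dataclass

@dataclass
class Entity:
    id: str
    kind: str        # e.g. "Person", "Project", "Publication"
    name: str

@dataclass
class Link:
    source: str      # entity id
    target: str      # entity id
    role: str        # semantic relationship, e.g. "author of"
    start: str       # validity period (ISO dates)
    end: str

person = Entity("p1", "Person", "A. Researcher")
paper = Entity("pub1", "Publication", "Results of Project X")
link = Link("p1", "pub1", "author of", "2014-01-01", "2014-12-31")
```

Keeping relationships in their own typed records, rather than as attributes of the entities, is what lets CERIF-based systems exchange and merge research information without losing semantics.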

Saturday, March 14, 2015

Rosetta Metadata-extractor tool

Rosetta Metadata-extractor tool. Chris Erickson. March 10, 2015.
We have been looking at creating a Rosetta plugin to extract metadata from other types of image files, particularly Canon raw images. A tool on GitHub, created by Drew Noakes, looked like a great starting place. It extracts many types of metadata from many types of image files, including Canon and Nikon raw images. The tool, metadata-extractor, is a Java library that works with Exif, IPTC, XMP, ICC, and other metadata that may be present in a single image:

Metadata formats include:
  • Exif
  • IPTC
  • XMP
  • ICC Profiles
  • Photoshop fields
  • PNG properties
  • BMP properties
  • GIF properties
  • PCX properties
Types of files it will process:
  • JPEG
  • TIFF
  • WebP
  • PSD
  • PNG
  • BMP
  • GIF
  • ICO
  • PCX
  • Camera Raw
    Camera-specific "makernote" data is decoded for many types of digital cameras.
    Our plugin can extract the relevant metadata; we are working now to integrate it with Rosetta to process our SIP files.
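metadata-extractor is a Java library, but the underlying idea, that each metadata family lives in its own marker segment inside the file, can be shown with a stdlib-only Python sketch. This is a deliberate simplification of what the real library does: it walks JPEG marker segments and reports which recognised metadata blocks are present:

```python
# Walk JPEG marker segments and report which metadata families appear.
# A simplified illustration of what full extractors parse in depth.

import struct

# (marker id, payload signature) -> metadata family
SIGNATURES = {
    (0xE1, b"Exif\x00\x00"): "Exif",
    (0xE1, b"http://ns.adobe.com/xap/1.0/\x00"): "XMP",
    (0xE2, b"ICC_PROFILE\x00"): "ICC profile",
    (0xED, b"Photoshop 3.0\x00"): "Photoshop/IPTC",
}

def metadata_segments(jpeg: bytes):
    """Yield the names of recognised metadata segments in a JPEG byte stream."""
    if jpeg[:2] != b"\xff\xd8":              # SOI marker: not a JPEG
        return
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker == 0xDA:                   # SOS: compressed image data follows
            break
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        for (m, sig), name in SIGNATURES.items():
            if marker == m and payload.startswith(sig):
                yield name
        pos += 2 + length

def seg(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (the length field covers itself plus payload)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload

# A synthetic, minimal JPEG carrying Exif and Photoshop/IPTC segments.
fake = (b"\xff\xd8"
        + seg(0xE1, b"Exif\x00\x00" + b"\x00" * 8)
        + seg(0xED, b"Photoshop 3.0\x00")
        + b"\xff\xda\x00\x02")
found = list(metadata_segments(fake))        # ['Exif', 'Photoshop/IPTC']
```

Raw formats such as Canon CR2 use a TIFF-based container rather than JPEG markers, which is part of why a mature library is the right starting place for the plugin.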

    Friday, March 13, 2015

    Networked Information's Risky Future: The Promises and Challenges of Digital Preservation

    Networked Information's Risky Future: The Promises and Challenges of Digital Preservation. Amy Kirchhoff, Sheila Morrissey, and Kate Wittenberg. Educause Review. March 2, 2015.
    There has been tremendous growth in the amount of digital content, which has great benefits. But digital objects may be extremely short-lived without proper attention to preservation. "What are the odds that twenty years from now you will be able to find it and read it with whatever device and software you will be using then? What will be the cost to locate and reproduce the original files in a format that is usable in twenty years?" How do we ensure that our content is truly safe? There are a lot of questions to be answered. A few points:
    1. Near-Term Protection: Backup. Imperative for continuity of operations. Multiple copies in multiple locations will provide for near-term access. 
    2. Mid-Term Protection: Byte Replication. Create multiple identical copies of files, preferably stored in other locations. Don't rely on special software for access. These byte replicas will provide content that is authentic and accessible for as long as the file formats remain usable.
    3. Long-Term Protection: Managed Digital Preservation. Establish policies and activities, including those above, to manage content over the very long term. 
    Four goals are key to successful managed digital preservation:
    • Usability: The objects must remain usable with current technology.
    • Discoverability: The objects must have metadata so they can be found by users over time.
    • Authenticity: The provenance of the content must be proven.
    • Accessibility: The content must be available for use by the appropriate community.
    An organization that wants to successfully preserve digital content needs to have, among other things:
    • A preservation mission
    • An infrastructure to support digital preservation
    • An economic model that can support preservation activities over time
    • Clear legal rights to preserve the content
    • An understanding of the needs of stakeholders and users
    • A preservation strategy and policies consistent with best practices
    • A technological infrastructure that supports the selected preservation strategy
    • Transparency of preservation services, strategies, customers, and content
    These three levels of protection are required elements of long-term preservation and are appropriate steps in protecting content. There are best practices available. Institutions starting out should inventory their content, have good backups, and create a long-term plan. There is still much to learn. "Ultimately, it is the responsibility of those who produce and care for valuable content to understand preservation options and take action to ensure that the scholarly record remains secure for future generations."

    Thursday, March 12, 2015

    MediaSCORE and MediaRIVERS preservation prioritization tool

    MediaSCORE and MediaRIVERS preservation prioritization tool.  Mike Casey, Patrick Feaster, Chris Lacinak. Indiana University, AVPreserve. March 12, 2015.
    Indiana University has announced the release of free, open source media preservation prioritization software created in collaboration with AVPreserve.
    1. MediaSCORE (Media Selection: Condition, Obsolescence, and Risk Evaluation). This tool enables a detailed analysis of degradation and obsolescence risk factors for most analog and physical digital audio and video formats.
    2. MediaRIVERS (Media Research and Instructional Value Evaluation and Ranking System). This tool guides a structured assessment of research and instructional value for media holdings. 
    Some additional key features of the software include:
    • Browser-based web application for Windows and Mac operating systems
    • Permissions-based access and views across MediaSCORE and MediaRIVERS
    • Controlled vocabularies and field validation to help ensure consistent data entry
    • An auditing path to help with quality assurance and transparency
    The two applications are bundled together but may be used separately. They can be found along with a detailed user guide on GitHub at https://github.com/IUMDPI/MediaSCORE. Also available is a conceptual document that explores assessment of research and instructional value. The software requires installation and configuration on a server, requiring the appropriate expertise. AVPreserve is also offering MediaSCORE/RIVERS as a hosted application on a monthly subscription basis.
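A prioritization tool of this kind ranks holdings by combining risk factors into a score. The following Python sketch is only in the spirit of MediaSCORE; the factor names, weights, and items are invented, and the real tool applies a detailed, format-specific methodology:

```python
# Toy risk-based prioritisation: weighted sum of 0-10 risk factors,
# where a higher score means the item should be digitized sooner.
# Weights and factors are invented for illustration.

RISK_WEIGHTS = {"degradation": 0.5, "obsolescence": 0.3, "condition": 0.2}

def risk_score(item):
    """Weighted sum of the item's risk factors."""
    return sum(RISK_WEIGHTS[f] * item[f] for f in RISK_WEIGHTS)

holdings = [
    {"id": "U-matic tape 042", "degradation": 8, "obsolescence": 9, "condition": 6},
    {"id": "CD-R 117", "degradation": 3, "obsolescence": 2, "condition": 9},
]

ranked = sorted(holdings, key=risk_score, reverse=True)
# U-matic tape 042 scores 7.9; CD-R 117 scores 3.9
```

MediaRIVERS would contribute a parallel value score, so that high-risk, high-value items rise to the top of the digitization queue.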

    Wednesday, March 11, 2015

    Getting to the Bottom Line: 20 Cost Questions for Digital Preservation

    Getting to the Bottom Line: 20 Cost Questions for Digital Preservation. MetaArchive. March 11, 2015.
    The MetaArchive Cooperative has, for over a decade, worked to respond transparently to many of the questions it has received about the cost of doing digital preservation. These 20 questions will help institutions compare digital preservation solutions. Features and functionality are important, and often easy to identify. But it is more difficult to identify and compare short- and long-term costs, including a variety of sometimes hidden fees.

    The questions are directed at learning about fees, memberships, storage costs, content limits, cost of having the content geographically distributed, hardware and software costs, services and products that the fees pay for, billing plans, and costs for backing up, retrieving or deleting content.

    Tuesday, March 10, 2015

    Investing in Curation. A Shared Path to Sustainability. Final RoadMap.

    Investing in Curation. A Shared Path to Sustainability. Paul Stokes. The 4C project. March 9, 2015.
    Digital curation involves managing, preserving, and adding value to digital assets over their entire life cycle. Actively managing digital assets maximizes their value and reduces the risk of obsolescence. The cost of curation is a concern to stakeholders. The final version of the roadmap is now available; it starts with a focus on the costs of digital curation, but the ultimate goal is to change the way that all organizations manage their digital assets.

    The vision: Cost modeling will be a part of the planning and management activities of all digital repositories.
    • Identify the value of digital assets and make choices
      • Value is an indirect economic determinant of the cost of curating an asset. The perception of value will affect the methods chosen and how much investment is required.
      • Content owners should have clear policies regarding the scope of their collections, the type of assets sought, and the preferred file formats.
      • Establish value criteria for assets as a component of curation, understanding that certain types of assets can be re-generated or re-captured relatively easily, thereby avoiding curation costs
    • Demand and choose more efficient systems
      • Requirements for curation services should be specified according to accepted standards and best practices.
      • More knowledgeable customers demanding better specified and standard functionality means that products can mature more quickly.
    • Develop scalable services and infrastructure
      • Organizations should aim to work smarter and be able to demonstrate the impact of their investments.
    • Design digital curation as a sustainable service
      • Effective digital curation requires active management throughout the whole lifecycle of a digital object.
      • Curation should be undertaken with a stated purpose.
      • Making curation a service further embeds the activity into the organization's normal business function.
    • Make funding dependent on costing digital assets across the whole lifecycle
      • Digital curation activity requires a flow of sufficient resources for the activity to proceed.
      • Some digital assets may need to be preserved in perpetuity but others will have a much more predictable and shorter life-span.
      • All stakeholders involved at any point in the curation lifecycle will need to understand their fiscal responsibilities for managing and curating the asset until such time that the asset is transferred to another steward in the lifecycle chain.
    • Be collaborative and transparent to drive down costs
      • Each organization is looking to realize a return on their investment.
      • If those who provide digital curation services can be descriptive about their products and transparent about their pricing structures, this will enhance possible comparisons, drive competitiveness and lead the market to maturity.

    Ending the Invisible Library | Linked Data

    Ending the Invisible Library | Linked Data. Matt Enis. Library Journal. February 24, 2015.
    The World Wide Web began as a collection of web pages navigated with links. Now, and going forward, the web is increasingly about data and the relationships among data objects. The use of MARC is "becoming an anachronism in an increasingly networked world". The site schema.org is a collection of structured data schemas that help web designers specify entities and the relationships among them, but these tools were not designed with libraries in mind. MARC lacks the ability to encode this information or make it accessible on the web. Libraries need to start formatting their data so it can be accessed by internet search tools.

    The W3C Schema Bib Extend Community Group (librarians, vendors, and organizations) has been working to expand schema.org to better represent library bibliographic information for search engines. The Library of Congress has been working on the BIBFRAME project; “a major focus of the project is to translate the MARC 21 format to a Linked Data model while retaining as much as possible the robust and beneficial aspects of the historical format.” This will structure library records so that search engines can “extract meaningful information" and make it available. Ultimately, LC plans for BIBFRAME to replace MARC; there is a tool to convert MARC records to BIBFRAME.

    The Libhub Initiative is a proof-of-concept project to build a network of libraries using BIBFRAME standards to link data between institutions and show how this can make library resources more visible on the internet.
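As a hedged illustration of the kind of structured data that makes a catalogue record legible to search engines, here is a minimal schema.org description built as JSON-LD in Python. The bibliographic values are invented, and a real BIBFRAME record would carry considerably more structure:

```python
# Build a minimal schema.org JSON-LD description of a book,
# the sort of markup search engines can extract entities from.

import json

record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Example Title",
    "author": {"@type": "Person", "name": "Jane Example"},
    "datePublished": "1998",
}

jsonld = json.dumps(record, indent=2)
```

Embedding such a block in a catalogue page lets a search engine recognise "Jane Example" as a person and the record as a book, rather than as undifferentiated text.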

    Friday, March 06, 2015

    Infokit: Digital file formats

    Infokit: Digital file formats. Matt Faber. JISC. March 6, 2015.
    JISC has released a new infokit resource on digital file formats. The infokit presents an overview of the current state of digital file formats for still images, audio, and moving images, and looks ahead to future formats and to shifts away from previously popular ones.
    Choosing the right file format is important to successfully creating, digitizing, delivering, and preserving the digital media objects:
    • The format helps define the quality of a digital object. 
    • Using poorly supported formats that may restrict or block use will hinder file distribution
    • Selecting a proprietary format with a short shelf life, or a compressed format that irreversibly loses data will hamper digital preservation
    • Selecting the right format for a project should not be taken lightly

    Maintenance of digital media files is an ongoing process. This kit aims to:
    • Provide a comprehensive understanding of what a file format is
    • Explain the considerations in choosing the correct format for your project
    • Provide quick and practical answers to ‘what file format should I use for…?’
    • Help identify uncommon digital file formats
    • Provide in-depth technical information about digital files and file format properties
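Identifying a file format usually starts from its signature ("magic") bytes; registry-driven tools such as DROID work from large databases of these signatures. A toy Python sketch with a handful of well-known image signatures:

```python
# Minimal magic-byte sniffer for a few common image formats.
# Real identification tools consult much larger signature registries.

MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
}

def identify(data: bytes) -> str:
    """Return the format name whose signature the data starts with."""
    for sig, name in MAGIC.items():
        if data.startswith(sig):
            return name
    return "unknown"
```

For example, `identify(open(path, "rb").read(16))` is enough to distinguish these formats, since each signature sits at the start of the file.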

    Wednesday, March 04, 2015

    Building Productive and Collaborative Relationships at the Speed of Trust

    Building Productive and Collaborative Relationships at the Speed of Trust. Todd Kreuger. Educause Review. March 2, 2015.
    To make projects successful, it is important to create trust and collaboration among IT staff and campus groups. To create that trust, the staff must establish highly productive relationships with the school's departments, faculty, and students. Collaboration, design thinking, and innovation go hand in hand. Many projects fall short of customer needs or achieve less than satisfactory results, with plenty of finger-pointing and wasted time, money, and opportunity. Some of the lessons learned:
    • Get on the same page
    • Build and establish trust
    • Provide the tools and expectations for success
    • Focus on both strategic and operational needs
    • Clarify process ownership and the associated responsibilities
    • Recognize the desired performance and celebrate success
    It is critical to have an open dialogue with various customer groups and to attempt to exceed their expectations. Another challenge is to ensure that people recognize the past as the past and not as an indicator of future performance. The best way to begin a change in culture is to identify issues and challenges that you can immediately address. The reservoir of trust is built one action at a time and emptied in a hurry. To steadily build trust, you must say what you are going to do and do what you say. Communication is the heart and soul of trust. It is imperative that you ask appropriate questions and listen to gain understanding. Collaboration should not be a project in and of itself, but the way in which we work.

    The Cycle of Productivity model: processes and tasks must have a defined owner and be documented and published, and change must be managed to ensure that everyone is aware of the new expectations. The basic premise is that training, assessment of effectiveness, and feedback must all occur to ensure the process or task is completed as expected.

    The end result "is one in which a culture of collaboration, coupled with a relentless focus on challenging the status quo, results in our encouraging, pushing, and helping each other innovate, transform, and differentiate."  

    Tuesday, March 03, 2015

    Significance 2.0: a guide to assessing the significance of collections

    Significance 2.0: a guide to assessing the significance of collections. Roslyn Russell, Kylie Winkworth. Collections Council of Australia Ltd. 2009.
    This guide defines an adaptable method for determining significance across all collections in Australia. The intention is that it will improve collection decision-making in areas such as preservation, access, and funding support. Regarding significance:
    • We cannot keep everything forever. It is vital we make the best use of our scarce resources for collecting, conserving, documenting and digitising our collection materials.
    • Significance is not an absolute state; it can change over time.  
    • Collection custodians have a responsibility to consult communities and respect other views in constructing societal memory and identity. 
    • It is vital to understand, respect, and document the context that shapes collection materials.
    ‘Significance’ refers to the values and meanings that items and collections have for people and communities. Significance helps unlock the potential of collections, creating opportunities for communities to understand, access, and enjoy collections. Artistic, scientific, and social or spiritual values are the key criteria that help to express how and why an item or collection is significant. Comparative criteria include provenance, rarity or representativeness, condition or completeness, and interpretive capacity. Significance assessment involves five main steps:
    1. analysing an item or collection
    2. researching its history, provenance and context
    3. comparison with similar items
    4. understanding its values by reference to the criteria
    5. summarising its meanings and values in a statement of significance
    A statement of significance is a concise summary of the values, meaning, and importance of an item or collection. It is an argument about how and why an item or collection is of value, and it should be reviewed as circumstances change. Significance assessment:
    • is a process to help with good management of items and collections;
    • is a collaborative process in which consultation is essential;
    • substantiates assessments objectively rather than subjectively.
    The process is:
    1. Collate information about the history and development of the collection
    2. Research the history, scope and themes of the collection
    3. Consult knowledgeable people
    4. Explore the context of the collection
    5. Analyse and describe the condition of the collection
    6. Compare the collection with similar collections
    7. Identify related places and collections
    8. Assess significance against the criteria
    9. Write a statement of significance
    10. List recommendations and actions
    Significance assessment is only the first part of the significance process. Once an item or collection has been assessed as significant, there will be a range of actions to better manage the collections.

    Oops! Article preserved, references gone

    Oops! Article preserved, references gone. Digital Preservation Seeds. February 16, 2015.
    A blog post concerning the article Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.  References in academic publications justify the argument. Missing references are a significant problem with the scholarly record because arguments and conclusions cannot be verified. In addition, missing or incomplete resources and information will devalue national and academic collections. The Significance method can be used to determine the value of collections. There is currently no robust solution, but a robustify script can direct broken links to Memento. The missing references problem emphasizes that without proper context, preserved information is incomplete.