Wednesday, October 26, 2016

Research data is different

Research data is different. Simon Wilson. Digital Archiving blog. 5 August 2016.
     A blog post about some born digital archives at Hull.  It is not academic research data but instead comes from a variety of sources. By using DROID to look at 270,867 accessioned files they discovered the following:
  • 97.96% of files were identified by DROID 
  • 228 different format types were identified 
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%).  
  •   The top formats they found were:
    Microsoft Word Document (97-2003)                 44.52%
    Microsoft Word for Windows (2007 and later)     5.63%
    Microsoft Excel 97 Workbook                              5.08%
    Graphics Interchange Format                              4.15%
    Acrobat PDF 1.4 - Portable Document Format     3.12%
    JPEG File Interchange Format (1.01)                    2.72%
    Microsoft Word Document (6.0 / 95)                    2.46%
    Acrobat PDF 1.3 - Portable Document Format     2.39%
    JPEG File Interchange Format (1.02)                    1.83%
    Hypertext Markup Language (v4)                         1.67%
 The number and types of formats found in their collections differed from those at other institutions that hold research data.  An important next step is to look at the identified file formats and determine a migration strategy for each. Knowing the number and frequency of the formats in the collections will allow efforts to be prioritized.
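
The kind of frequency ranking described above can be sketched in a few lines of Python. This is only an illustration, not the method used at Hull: the function name `rank_formats` and the sample list are made up, and the input is assumed to be one format identifier (PUID) per file, however it was exported from the identification tool.

```python
from collections import Counter


def rank_formats(puids):
    """Return (puid, count, percent) tuples, most common format first.

    `puids` holds one format identifier per file; None or "" stands for
    a file that the identification tool could not identify.
    """
    puids = list(puids)
    total = len(puids)
    # Only identified files are ranked; percentages are of all files.
    counts = Counter(p for p in puids if p)
    return [(puid, n, round(100 * n / total, 2))
            for puid, n in counts.most_common()]


# Hypothetical sample, not the Hull data.
sample = ["fmt/40"] * 5 + ["fmt/412"] * 3 + ["fmt/61", None]
ranking = rank_formats(sample)
for puid, n, pct in ranking:
    print(f"{puid:10} {n:3d} {pct:6.2f}%")
```

Sorting by frequency puts the formats that would benefit most from a migration strategy at the top of the list.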

Tuesday, October 25, 2016

Checksum 101: A bit of information about Checksums

Checksum 101: A bit of information about Checksums. Ross Spencer. Archives NZ Workshop. 2 October 2016.
    A slide presentation providing very good information on checksums. Why do we use checksums:
  • Policy: Provides Integrity
  • Moving files: Validation after the move
  • Working with files: Uniquely identifying what we’re working with
  • Security:  a by-product of file integrity
An algorithm does the computing bit, and there are a variety of types: MD5, CRC32, SHA, etc. A checksum algorithm is a one-way function that can't be reversed. DROID can handle MD5, SHA1, and SHA256.  Why use multiple checksums? This helps to avoid potential collisions, though the probabilities are low. The presentation shows the different types of checksums and how they are generated.

Checksums will ensure uniqueness. We can automate processes better with file checksums. Some people may have a preference for which checksums to use. Using the checksums will help future-proof the systems and provide greater security.
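
As a minimal sketch of generating several checksums at once (not taken from the slides), Python's standard `hashlib` module can compute MD5, SHA1 and SHA256 in a single pass over a file. The function name `checksums` and the chunk size are illustrative choices.

```python
import hashlib
import io


def checksums(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Read a binary stream once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the same chunk to every algorithm, so the file is read once.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# In practice the stream would be open(path, "rb"); bytes are used here
# so the example runs without any file on disk.
digests = checksums(io.BytesIO(b"abc"))
for name, value in digests.items():
    print(name, value)
```

Recording more than one digest per file costs little extra, since reading the file is the slow part.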

Monday, October 24, 2016

Our Preservation Levels

Our Preservation Levels. Chris Erickson. October 24, 2016.
     After looking at the levels used by various groups, we have decided on 4 levels for our preservation plan. We want to keep it simple so that it is not difficult to determine and that it is meaningful for our workflows. Our Rosetta preservation system is a dark archive that can harvest digital materials from several publicly accessible content management systems. The curator or subject specialist for the collection will determine the level of preservation together with the preservation priority and will indicate that on the Digital Preservation Decision Form.

The Preservation Levels
0.   No preservation. Regular backups only (for example: Shared network drive that is  backed up regularly by IT)
1.   Basic preservation. A copy on M-Disc in Special Collections besides an access copy in our CMS, which is backed up by IT. No other preservation processing
2.   Full preservation. A master copy in Rosetta, with format migration, descriptive and preservation metadata, fixity checks, multiple copies (tape, data center, Granite Mountain Vault)
3.  Extended preservation. Full preservation services plus either DPN or remote/internet storage copy for materials that are appropriate for DPN
The intention is to recognize that some materials do not need full preservation services, nor long term storage in DPN. We will evaluate the levels next year and see if they are working the way we expect.
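
If the level chosen on the decision form were ever recorded in a script or database, one way to model it is a small ordered enumeration. This is a sketch only; the names `PreservationLevel` and `gets_fixity_checks` are hypothetical and are not part of Rosetta or of our form.

```python
from enum import IntEnum


class PreservationLevel(IntEnum):
    NO_PRESERVATION = 0  # regular backups only
    BASIC = 1            # M-Disc copy plus an access copy in the CMS
    FULL = 2             # master copy in Rosetta with full processing
    EXTENDED = 3         # full services plus a DPN or remote storage copy


def gets_fixity_checks(level):
    """Fixity checks are listed for level 2, and level 3 builds on level 2."""
    return level >= PreservationLevel.FULL


print(gets_fixity_checks(PreservationLevel.BASIC))     # False
print(gets_fixity_checks(PreservationLevel.EXTENDED))  # True
```

Using an ordered type means that "level 2 or higher" can be written as a simple comparison.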

Thursday, October 20, 2016

Digital Storage In Space Rises Above The Cloud

Digital Storage In Space Rises Above The Cloud.  Tom Coughlin. Forbes. October 13,  2016.
     A startup company (Cloud Constellation) plans to build an array of earth-orbiting data center satellites that would provide a space-based infrastructure for cloud service providers. The satellites would offer a private network that communicates directly with the ground via tight-beam radio, without any communication over the Internet and hence without public data transmission headers.
The company says that latencies will be lower than those through conventional Internet transmission.

The digital storage in these orbiting data centers will be solid-state drives and the internal temperature inside the satellites will be kept at about 70 degrees Fahrenheit. The budget to build the initial phase of this satellite network is estimated at $400 M, much less than the cost of building an equivalent terrestrial global data center network with an equivalent level of security. Data is encrypted on the way to the satellite chain, inside the satellite storage and when the data is transmitted back to earth. This should provide secure storage and transport of data without interruption or exposure to public networks. It could protect critical and sensitive data for potential clients, including university archives and libraries. The first phase is planned to be operational in 2018 or 2019. Soon many companies and organizations will have an option to store their data securely in outer space.

Wednesday, October 19, 2016

Filling the Digital Preservation Gap. Phase Three report - October 2016.

Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase Three report - October 2016. 19 October 2016. Jenny Mitcham, et al. [PDF]
     This is a report of phase 3 of the Filling the Digital Preservation Gap project.  It is important to consider how we incorporate digital preservation functionality into our Research Data Management workflows.
  • Phase 1: addressed the need for digital preservation as part of the research data management infrastructure
  • Phase 2: practical steps to enhance their preservation system for research data 
  • Phase 3 has the following aims:
    • To establish proof of concept implementations of Archivematica at the Universities of Hull and York, integrated with other research data systems at each institution
    • To investigate the problem of unidentified research data file formats and consider practical steps for increasing the representation of research data formats in PRONOM
    • To continue to disseminate the outcomes of the project both nationally and internationally and to a variety of different audiences

"Preserving digital data isn’t solely reliant on the implementation of a digital preservation system, it is also necessary to think about related challenges that will be encountered and how they may be addressed."  In working with formats it was clear that DROID does not look inside the zip files, and not all files were assigned a file format identification. Of the 3752 files analysed at York, only 1382 (37%) were assigned a file format identification by DROID. At the University of Hull a similar exercise had quite different results, with 89% of files assigned an identification by DROID. At Lancaster University the identification rate was 46%. Of the files, 70% were TIFF images. Of the files that were not automatically identified, files with no extension made up 26% of the total.

"One possible solution to the file format problem as described would be to limit the types of files that would be accepted within the digital repository. This is a tried and tested approach for certain disciplines and data archives" and follows the NDSA level one recommendations, to “... encourage use of a limited set of known open formats ...”. This may be a problem with preserving research data, since researchers use a wide range of specialist hardware and software and it will be "hard for the repository and research support staff to provide appropriate advice on suitable formats. For much of the data there will be no obvious preservation format for that data."

The University of York encourages researchers (in training sessions and webpages) to consider file formats throughout their project, and the longevity and accessibility of the formats they select, but the researcher decides what formats to deposit their data in. The university accepts these formats and will preserve them on a best efforts basis. "Understanding the file format moves us one step closer to preservation and reuse over the longer term." In order to help the research data community their recommendations include:
  • For data curators: 
    • Greater engagement with researchers on the value and necessity of recognising and recording the file formats they will use/generate to inform effective data curation.
  • For researchers:
    • Supply adequate metadata about submitted datasets. Clear and accurate metadata about file formats and hardware/software dependencies will aid file format identification and future preservation work. 
    • Be open to sharing sample files for testing and to aid signature development where appropriate.

Appendix 2 contains A Draft PCDM-based Data Model for Datasets
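
The identification rates quoted in this summary are simple proportions, and the share of unidentified files that have no extension can be tallied the same way. The sketch below is not from the report: the function names and the sample file names are made up, and only the York figures (1382 of 3752) come from the text above.

```python
from pathlib import PurePath


def identification_rate(identified, total):
    """Percentage of files that were assigned a format identification."""
    return round(100 * identified / total, 1)


def no_extension_share(unidentified_names):
    """Percentage of unidentified files whose name has no extension.

    Note that PurePath treats a dotfile such as ".hidden" as having
    no extension.
    """
    names = list(unidentified_names)
    if not names:
        return 0.0
    bare = sum(1 for name in names if PurePath(name).suffix == "")
    return round(100 * bare / len(names), 1)


york_rate = identification_rate(1382, 3752)  # York figures quoted above
print(york_rate)

# Hypothetical file names, only to show the tally.
sample_share = no_extension_share(["run01", "scan.xyz", "notes", "trace.dat"])
print(sample_share)
```

Computed to one decimal place the York rate is 36.8, which the report rounds to 37%.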

Tuesday, October 18, 2016

When Archivists and Digital Asset Managers Collide: Tensions and Ways Forward

When Archivists and Digital Asset Managers Collide: Tensions and Ways Forward. Anthony Cocciolo. The American Archivist. Spring/Summer 2016. [PDF]
     The article looks at tensions in an organization between archivists and digital asset managers. Archivists maintain the inactive records (paper or electronic) of permanent value for an organization. A records manager’s role is to manage active records, and records with permanent value are transferred to the archives when they become inactive. Digital asset managers often see their role as creating repositories of assets that can be easily and efficiently reused by staff. This accompanies the attitude that digital files will never become inactive.

This study is limited because it looks at a single instance that may not apply to other organizations that have both archivists and digital asset managers. It looks at tensions that can exist between the two groups, which mostly come from digital asset managers and archivists not recognizing the different role each plays.

For archives, the unit being managed is a record ("data or information in a fixed form that is created or received in the course of individual or institutional activity and set aside (preserved) as evidence of that activity for future reference"). In digital asset management, the unit being managed is an asset (a kind of record that individuals can readily reuse in future work products). Archivists are interested in the record not only for its content but also for aspects of the record itself, such as its historical and social implications. Digital asset managers are more focused on the content and the legal rights to reuse, and are more like libraries in their approach.

One tension between the two groups is that if a file was deposited and permanently preserved in the DAM, there would be no reason to deposit it in the archives. Other tensions are
  1. Users, Files, and Where They Get Stored
  2. Differing Work Practices
  3. Approaches to Digital Preservation
  4. Communication
  5. Differing Approaches to Planning
The article states that archivists and digital asset managers differ in their views of preservation planning, fixity checking, formats accepted, and how to respond to file formats once they become obsolete. [Not all digital asset managers are as 'short term' as implied. cle]  However, digital asset or content management systems are “not adequate for long-term digital preservation because [they include] no mechanisms for reliably assuring authenticity and intelligibility of digital documents for fifty years or longer.”   Another problem is that many things are called an “archives”, which can be troubling for the archivists, who must contend with staff who believe that they are keeping archives and may view the DAM as yet another archives.

The article recommends that items deemed assets be deposited both in the DAM system and in the digital archives. In the digital archives, the asset will be grouped with other records of the same provenance and metadata will be attached to the file to make it more findable. The archivists will document the activity of the institution for researchers. Since the purposes are not the same and the user groups do not overlap entirely, it is sensible that assets appear in both places. This is not wasteful, because, as digital preservationists point out, multiple copies can increase object safety.  At a minimum, references to the assets in the DAM should be added to the archives intellectually if not physically. Asset management systems should not replace the need to create digital archives that document institutional activity.

It is also essential that digital asset managers and archivists respect the different roles they play and not try to undermine each other. Each should focus on their own missions:
  • digital asset managers: creating a collection of digital assets for effective and efficient reuse by staff members. 
  • archivists: documenting institutional activity through records of permanent value in whatever format they may occur for use by staff and public researchers.

Monday, October 17, 2016

Digital Preservation through Digital Sustainability

Digital Preservation through Digital Sustainability. Matthias Stuermer, Gabriel Abu-Tayeh. iPres 2016.  (Proceedings p. 18 - 22/ PDF p. 10-12).
     The concept of digital sustainability examines how to maximize the benefits of digital resources. They specify nine basic conditions for digital sustainability which also contribute to potential solutions to the challenges of digital preservation:

    Conditions regarding the digital artifact:
1. Elaborateness: For instance, data quality requires characteristics such as accuracy, relevancy, timeliness, completeness and many more. Quality of data plays a significant role within digital preservation
2. Transparent structures: technical openness of content and software is essential for digital sustainability. Open standards and open file formats are particularly important for digital preservation.
3. Semantic data: adding meaningful information about the data to make it more easily comprehensible
4. Distributed location: redundant storage of information in different locations decreases the risk of loss

    Conditions regarding the ecosystem:
5. Open licensing regime: the legal framework plays a crucial role for digital artifacts. Objects are protected by rights, but this protection hinders the use of digital assets and decreases their potential for society as a whole.
6. Shared tacit knowledge: enables individuals and groups to understand and apply technologies and create further knowledge, which all needs to be updated and adapted continuously
7. Participatory culture: an active ecosystem leads to significant contributions from outsiders such as volunteers. The expertise from an international set of contributors can lead to high-quality peer-reviewed processes of knowledge creation.
8. Good governance: While technology companies and innovative business models are considered part of sustainable digital resources, they should remain independent from self-serving commercial interests and control by a few individuals.
9. Diversified funding: this reduces control by a single organization, which increases the independence of the endeavor.

Saturday, October 15, 2016

DPTP: Introduction to Digital Preservation Planning for Research Managers

DPTP: Introduction to Digital Preservation Planning for Research Managers. Ed Pinsent, Steph Taylor. ULCC. 15 October 2016.
     Today I saw this course offered and thought it looked interesting (wish I were in London to attend).  It is a one-day introduction to digital preservation and is designed specifically to look at preservation planning from the perspective of the research data manager. Digital preservation, the management and safeguarding of digital content for the long-term, is becoming more important for research data managers to make sure  content remains accessible and authentic over time.  The learning outcomes are:
  • Understand what digital preservation means and how it can help research managers
  • How to assess content for preservation
  • How to integrate preservation planning into a research data management plan
  • How to plan for preservation interventions
  • How to identify reasons and motivations for preservation for individual projects
  • What storage means, and the storage options that are available
  • How to select appropriate approaches and methods to support the needs of projects
  • How to prepare a business case for digital preservation
The course contains eight modules, which are:
  1. Find out about digital preservation and how and why it is important in RDM.
  2. Assessing research data and understanding how to preserve them for the longer term, and understanding your users.
  3. Learn how a RDM plan can include preservation actions. 
  4. Managing data beyond the life of projects, planning the management of storage and drafting a selection policy.
  5. Understanding individual institutions, stakeholders and requirements and risk assessment.
  6. Understand why preservation storage has extra requirements, considering ‘the Cloud’
  7. The strategy of migrating formats, including databases; risks and benefits, and tools you can use. 
  8. Making a business case (Benefits; Risks; Costs) to persuade your institution why digital preservation is important

Friday, October 14, 2016

Digital Preservation Program: Levels of Digital Preservation Support

Digital Preservation Program. South Dakota State Historical Society. 2015.
     A look at the South Dakota State Archives webpage concerning the levels of digital preservation.  They are committed to collecting, preserving, and providing access to their materials.

Levels of Digital Preservation Support:  The Archives has established three distinct levels of preservation support for digital archival materials that will be applied to digital materials at the time of accession. The levels are:
  • Full Support:  The Archives will take all reasonable actions to maintain usability including migration, emulation, or normalization and will ensure data fixity for all original and transformed files and will provide access to transformed files.
  • Limited Support:  The Archives will take limited steps to maintain usability and undertake strategic monitoring. They may actively transform a file from one format to another to mitigate format obsolescence, and will ensure data fixity for all original and transformed files and will provide access to transformed files.
  • Basic Support: The Archives will provide access to the item in its submission file format only and will work to ensure data fixity of the submitted file. No transformations will be enacted on these files for preservation purposes.
The Archives has also created a chart that outlines the preservation tasks associated with each level of preservation support. The tasks are:
  • Create preservation metadata for accessibility, provenance, and management
  • Perform fixity checks on a regular basis using proven checksum methods
  • Periodically refresh storage media
  • Provide for discovery of objects via online descriptive finding aid  
  • Undertake strategic monitoring of file format
  • Plan and perform file normalization if necessary
  • Plan and perform migration to succeeding format upon obsolescence
  • Offer long-term storage in a trusted preservation-worthy format

Thursday, October 13, 2016

Generating public interest in digital preservation

Born Digital 2016: Generating public interest in digital preservation. Sarah Slade. Poster, iPres 2016.  (Proceedings p. 262 / PDF p. 132).
     This poster describes the development and delivery of a national media and communications campaign by the National and State Libraries of Australasia Digital Preservation Group in order to broaden public awareness of what digital preservation is and why it matters.  The campaign focused on the benefits to the wider community of collecting and preserving digital material, rather than on the concept of loss, which forms the usual arguments about why digital preservation is important.

Their Digital Preservation Group identified best practice and collaborative options for the preservation of born digital and digitised materials.  Earlier they had identified six priority themes and their poster addresses  priority 5 (Collaboration and Partnership).
  1. What is it and why? A statement on digital preservation and set of principles.
  2. How well? A Digital Preservation Environment Maturity Matrix.
  3. Who? A Digital Preservation Organisational Capability and Skills Maturity Matrix. 
  4. Nuts and Bolts: A technical registry of file formats with software/hardware dependencies.
  5. Collaboration and Partnership: Opportunities for promotion and collaboration.
  6. Confronting the Abyss: A business case for research on preserving difficult object types.
While it is true that digital material is being lost to future generations due to inadequate digital collecting practices and the lack of resources and systems, they felt that it was important to reframe the discussion with a more positive focus in order to involve the public and traditional media in this campaign. They decided the most effective way to do this was with a collaborative, coordinated communications strategy, and they chose a theme for each of the five days:  Science and Space; Indigenous Voices; Truth and History; Digital Lifestyles; and Play. These would provide an opportunity for "national and local engagement with audiences through traditional and social media, and for individual libraries to hold events". The themes would target  a broad range of community sectors and ages, as well as a different focus for the public to think about reasons why digital material should be collected and preserved. A high-profile expert speaker was chosen for each of the themes and included scientists, journalists, academics, and gaming and media personalities.

Wednesday, October 12, 2016

Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository?

Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository?  Richard Poynder. Blog: Open and Shut? September 22, 2016.
     In 1999, a meeting was held to discuss scholarly archives and repositories and ways in which to make them interoperable and to avoid needlessly replicating each other’s content. This led to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). One notion was that the individual archives "would be given easy-to-implement mechanisms for making information about what they held in their archives externally available".  Open access advocates saw OAI-PMH as a way of aggregating content hosted in local archives, or institutional repositories. This would "encourage universities to create their own repositories and then instruct their researchers to deposit in them copies of all the papers they published in subscription journals."

The interoperability promised by OAI-PMH has not really materialised, and author self-archiving "has remained a minority sport, with researchers reluctant to take on the task of depositing their papers in their institutional repository". Some believe the "IR now faces an existential threat". The interview and additional information are available in a separate PDF. This file looks at whether the IR will survive, be "captured by commercial publishers" or "the research community will finally come together, agree on the appropriate role and purpose of the IR, and then implement a strategic plan that will see repositories filled with the target content."
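
For readers who have not worked with OAI-PMH, the "easy-to-implement mechanisms" amount to plain HTTP requests in which a `verb` parameter names the operation. The sketch below only builds request URLs. The repository address is a made-up placeholder and the helper name `oai_request_url` is hypothetical, while `Identify`, `ListRecords` and the `oai_dc` metadata prefix come from the protocol itself.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Placeholder address, not a real repository.
BASE = "https://repository.example.edu/oai"

identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

A harvester would fetch these URLs and parse the XML responses, which is the aggregation step that open access advocates hoped would tie local repositories together.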

Tuesday, October 11, 2016

Digital Preservation of Photo Books

Digital Preservation of Photo Books. Mark Mizen. All About Images Blog. September 20, 2016.
     The post is a follow-on to the paper Long-Term Digital Preservation of Photo Books presented at the International Symposium on Technologies for Digital Photo Fulfillment in Manchester, England. The presentation highlights the need to think about photo books as not just a printed book but a combination of the printed book and the related electronic files, which are resources to be preserved.

Photo books give context to photos and provide an unparalleled source of information about life as it is happening. They are today’s scrapbooks and give a glimpse into everyday life. Preserving the digital file that created the photo book is important, but unfortunately, most manufacturers do not provide the digital file when the book is printed and the files are lost as soon as the book is printed.  The Forever company allows the PDF file to be saved and preserved. Always ask the photo book supplier for the files. 

Monday, October 10, 2016

Secure cloud doesn’t always mean your stuff in it is secure too

Secure cloud doesn’t always mean your stuff in it is secure too. Gareth Corfield. The Register. 6 Oct 2016.
     Workflows are moving to the cloud and security technology is helping to build customer confidence. “Picking a secure cloud partner is not as trivial as it may seem. Don't assume that because the cloud is secure, your business within the cloud is secure."  The public cloud can provide  better security monitoring and analysis, management, redundancy and resilience. But you have to choose a secure cloud platform. Microsegmentation can help secure the platform against malware and other security threats. It helps to improve operational efficiency. The cloud provides many services, more than just storage.

Saturday, October 08, 2016

Preserving & Curating ETD Research Data & Complex Digital Objects

Preserving & Curating ETD Research Data & Complex Digital Objects. Katherine Skinner, Sam Meister. ETDplus project, Educopia Institute. October 7, 2016.
     The ETDplus project is funded by IMLS and led by the Educopia Institute, in collaboration with many others.  The project helps ensure the longevity and availability of ETD research data and complex digital objects (e.g., software, multimedia files) that are part of student theses and dissertations. The project has just published a set of six Guidance Briefs to help students understand how to prepare, manage, and store the research files associated with their ETDs.

The Guidance Briefs are short “how-to” oriented briefs "designed to help ETD programs build and nurture supportive relationships with student researchers. These briefs are written for a student audience. They are designed to assist student researchers in understanding how their approaches to data and content management impact credibility, replicable research, and general long-term accessibility: knowledge and skills that will impact the health of their careers for years to come."

The Guidance Briefs can be downloaded at the site, and cover the following topics:
1. Copyright
2. Data Structures
3. File Formats
4. Metadata
5. Storage
6. Version Control

Institutions can use the guides as fits their local audiences. Each Brief includes information about the topic and a “Local Practices” section where an institution can highlight its own activities.

Friday, October 07, 2016

Proceedings of the 13th International Conference on Digital Preservation: iPres 2016

Proceedings of the 13th International Conference on Digital Preservation. iPRES 2016. October 3 – 6, 2016. 169 pp.  PDF  (Link updated)
     The proceedings of the conference, along with other posts of presenters, including slides and images. A wealth of information to read in the weeks ahead.

‘We’re going backward!’

‘We’re going backward!’ Vinton G. Cerf. Communications of the ACM. October 2016.  HTML   PDF
     An update from Vinton Cerf about media longevity.  "Perhaps by now you are noticing a trend in the narrative. As we move toward the present, the media of our expression seems to have decreasing longevity." It is not just digital media, but physical as well. Photographs may not last more than 150–200 years and normal books may not last more than 100 years. He is concerned for the "longevity of digital media and our ability to correctly interpret digital content, absent the software that produced it". He reflects on the ephemeral nature of our artifacts and that the centuries before ours may be better known than ours unless we are persistent about preserving digital content.

"Just as the monks and Muslims of the Middle Ages preserved content by copying into new media, won’t we need to do the same for our modern content? These thoughts immediately raise the question of financial support for such work."  In the past, patrons, religious orders and centers of Islamic science underwrote the preservation costs. Our society must find a way to underwrite the cost of preserving knowledge in media that will have some permanence and the executable software for their rendering. Unless we face this challenge the knowledge we have produced may simply evaporate with time.

Thursday, October 06, 2016

Judging a book through its cover

Judging a book through its cover. Larry Hardesty. MIT News Office. September 9, 2016.
    MIT researchers and colleagues are designing an imaging system that can read closed books, and particularly antique books that are too fragile to touch. The system uses terahertz radiation emitted in short bursts that can gauge the distance to individual pages of the book and can distinguish between ink and blank paper, in a way that X-rays can’t. It is still a new technology but they are working to improve the depth of penetration of a book and also the accuracy.

Wednesday, October 05, 2016

How many copies are needed for preservation?

How many copies are needed for preservation? Chris Erickson. 4 October 2016.
     An important component for preservation is to have multiple copies. The specific questions are: how many copies, how should they be stored, and where should they be located. Many people advocate the 3-2-1 rule for digital storage: three copies, stored on two different media, and one copy located off-site, preferably in areas with different disaster threats. (NARA; Library of Congress) The NDSA levels also incorporate this rule in the storage section.

The copies we have been looking at are:
     Copy 1: Rosetta storage on spinning disk in the campus data center
     Copy 2: Tape copies of our archive in the Granite Mountain Record Vault.
                    Annual Tape archive plus incremental transactional backups
     Copy 3: Internet copy, with DPN or Amazon Glacier
     Copy 4: Access copy within Special Collections on M-Discs or our CMS
What we choose to put in DPN will affect the third copy. We need to determine if these copies are adequate, and if not, then find different storage methods that are cost effective and fit within our workflow.
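
The 3-2-1 rule can be expressed as a simple check over a list of copies. The sketch below is illustrative only: the `Copy` class and `meets_3_2_1` function are hypothetical names, and the media labels and off-site flags assigned to each copy are assumptions made for the example, not a statement about our actual storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    offsite: bool


def meets_3_2_1(copies):
    """True if there are 3+ copies, on 2+ media, with 1+ copy off-site."""
    copies = list(copies)
    return (len(copies) >= 3
            and len({c.medium for c in copies}) >= 2
            and any(c.offsite for c in copies))


# Media labels and off-site flags below are assumptions for illustration.
copies = [
    Copy("Rosetta storage, campus data center", "disk", offsite=False),
    Copy("Granite Mountain Record Vault", "tape", offsite=True),
    Copy("DPN or Amazon Glacier", "cloud", offsite=True),
]

print(meets_3_2_1(copies))      # all three copies
print(meets_3_2_1(copies[:2]))  # without the internet copy
```

With these assumed flags, dropping the third copy fails the rule on the copy count alone, which is why the choice of what goes into DPN matters.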


Monday, October 03, 2016

Digital Preservation Priorities: What to preserve?

Digital Preservation Priorities: What to preserve? Chris Erickson. 3 October 2016.
     Recently we have been reviewing the digital preservation policies that we have been working under. The current policy states that the subject specialists (curators, subject librarians, faculty members) who are responsible for a collection should decide what will be preserved in the Rosetta digital archive. They should know the library collection and the collecting policies, as well as the faculty and the university curriculum, and be able to decide what is worth preserving long term. We provide the Digital Preservation Decision Form to help them in their decisions. Currently the choices are to preserve, not to preserve, and the order in which collections need to be processed.

The amount of content in our digital archive is increasing rapidly. As we plan for the future of the archive, there are questions raised about the number of archival copies, particularly when discussing what content should go into DPN. Those questions in turn raise other questions, including the question of preservation priorities. Are all objects equally important? If not, what are the most important objects or collections to preserve? Should we periodically revisit what is in the archive and deaccession content that is less important? In a world of finite resources we decided that we need to determine our preservation priorities in order to better preserve the important content.

Our goal is to preserve the important digital resources created in, or acquired by, the University Library and Archives. The proposed change is that the content preserved will be addressed according to the following guidelines, in descending order of importance:
1.      Unique University-created content with no physical copy 
2.      Unique University-owned items that are at risk 
3.      Digital content in the library with a physical copy that may be at risk
4.      Digital content that would be difficult or costly to reproduce 
5.      Content digitized for convenience

We will be reviewing our digital collections and deciding if these priorities will help our selection and preservation processes, if they need to be revised, or if we need to go in another direction. We are also looking at implementing levels of preservation along with these priorities.

Saturday, October 01, 2016

What happens when the Internet and digital preservation coincide

What happens when the Internet and digital preservation coincide. Jay Gattuso. jaygattuso's Blog, Open Preservation Foundation. 25th Sep 2016.
     A very thought-provoking post that uses a job recruitment post as the basis for a discussion about the library's ongoing digital preservation program, and what happens when they identify a gap in capability that can’t be ignored. The primary purpose of the Digital Preservation Web Engineer is to "define, implement and support the efficient acquisition, preservation and discovery/delivery of web based digital content subject to the Library’s legislated mandate."  They understand that there is digital, and there is “online”, and sometimes digital is online. It is important to be able to confidently collect online digital content and maintain a sense of content, context and structure, but there is a capability gap that they have been working around for a while. There are still many questions about how to deliver content to a readership that is still establishing its own needs. And there is the challenge of doing this on a large scale.

They want the two processes, digital collecting and digital preservation, to dovetail into a well-considered unified workflow. While they are all about collecting, storing and preserving important things that are precious to New Zealand, the same concepts hold true for others that collect to their own mission. "We don’t believe this point can be understated. We are slowly start to understand the cultural and research impact of web content, and this new post is a direct response to the challenge that sits behind national level collection building and the rapid uptake of Internet based content and information."

The content collected has an extremely important role in their National memory, and they have an obligation to operate with the care and expertise that this content demands. The collections help people understand their sense of place and history as well as informing research and creative outputs alike.

Their post addresses one of the problems facing digital preservation today. Digital Preservation is "an emergent discipline, finding our way through new challenges, and without specifically crafted routes into the work we expect to undertake. We are only just starting to see the edges of what’s possible, and unless we repeatedly open the door to complimentary professions we are going to struggle to address the contemporary challenge of collecting fast moving content, regardless of the ongoing care required when today’s harvests become tomorrow’s Preservation Masters with all the attendant questions of technical sustainability."

Friday, September 30, 2016

Victoria University of Wellington Selects Ex Libris Rosetta for Preserving and Managing Digital Assets

Victoria University of Wellington Selects Ex Libris Rosetta for Preserving and Managing Digital Assets. Press release. August 2016.
     "Victoria University of Wellington has selected Rosetta as its digital preservation and asset management solution. The Victoria University Library serves as the custodian of over 15,000 digitized historical cultural works, part of the New Zealand Electronic Text Collection, and over 11,000 born-digital theses and research projects in institutional repositories, including several other smaller digital collections. Rosetta will be a key element of the Library’s digital assets management and preservation processes and will enable researchers in any location to read or view the digital objects in the Library's extensive collections."

"Adopting Rosetta will enable us to manage, maintain, and preserve these collections in the long term, as well as grow them in the future. As our collections increase, standards of digital preservation and description become more vital to the continuity and discovery of materials for future knowledge creation. It’s not just about the students and researchers of today. It’s about the students and researchers of the future, too."

Wednesday, September 28, 2016

The Document Life Cycle Road to Digital Preservation and Archiving

The Document Life Cycle Road to Digital Preservation and Archiving. Brett Claffee. Document Strategy. Aug. 18 2016.
     What is the difference between documents and records in today’s digital enterprises? Documents do not become records until they are declared a record. "When a document is first created, it is under its author’s control and typically goes into a workflow, put simply—a document life cycle".  When a document is declared a record, it moves from the author’s control to corporate control under the retention schedule, which determines what eventually happens to the record. Typically, a document’s life cycle involves these phases:
  • Creation
  • Management
  • Storage
  • Retrieval
  • Distribution
  • Disposal
When a document is declared a record, it "becomes subject to corporate control and cannot be destroyed until it meets all of its retention obligations, including being released from any legal, financial, or regulatory holds". Record types with long-term retention requirements may be kept permanently.

There are differences between digital preservation and archiving in how they look at document and record life cycles. The records life cycle adds retention and archiving as a phase, which includes document destruction as part of the document life cycle.

Digital archiving and preservation is a multi-layered process that deals with "provenance and authentication practices, to chain of custody and accountability, to format transformations—all designed to keep information legitimate, useful, and, if required for long-term retention, preserved". With the large volume of data in recent years, data archiving has come to the forefront. It is estimated that there are more than 30 billion documents used each year in the United States. Archiving provides five critical advantages:
  1. Ensuring regulatory compliance for data retention, data immutability, and audit trails
  2. Improving performance and productivity of current business applications
  3. Making archived records widely available and easy to retrieve by authorized users
  4. Removing the problems of maintaining obsolete systems just for the data
  5. Reducing IT costs and time for back-up, upgrades, and other needs
For records, organizations can follow a retention schedule, but a retention schedule for documents and data is not as clear. The life span of these is often looked at in terms of use, informatics and analytics. A life cycle approach can assure consistent control.

Tuesday, September 27, 2016

Digital Preservation File Names

Digital Preservation File Names. Chris Erickson. September 27, 2016. 
     While processing some collections, we had difficulty creating the METS XML files because of some characters in the file names. The characters may be valid in some systems, but may cause difficulties in others. From comments on the internet it appears that there are only a few characters that are forbidden, but experiences from a number of people suggest that some systems may not support all the characters in file names. We decided that it was better to use only alphanumeric characters, with underscores as a separator and a full stop (period) before the extension.  When preserving digital files it is important to remember that the files may be used by a variety of computer systems over their lifetime. To have the greatest chance of keeping the files usable in the future it is best to follow some basic standards when naming files.

Here are some suggestions we are considering:
  1. Decide on file naming conventions so that file names have meaning.
  2. File extensions can help determine the type of file it is (such as .txt, .doc, .wav, .jpg).
  3. File name length limits vary between operating systems, so generally stay under 30 characters.
  4. Avoid spaces in file names. Spaces are an acceptable character for most file names, but they can cause difficulty when processing. Underscores may be used as a separator.
  5. Avoid punctuation and special characters. The safest characters to use are numbers and letters. Some operating systems are case sensitive and others are not, so do not rely on case alone to tell file names apart. Some characters to avoid for our preservation system are spaces, ampersands, brackets, and commas.
  6. Keep the filenames to a reasonable length; it is best if they are under 30 characters.
  7. Don’t start or end the filename with a space, special characters, or punctuation marks.
  8. These conventions apply to folders as well as files.
Characters that others have had difficulties with and which should not be used in filenames:

# pound             < left angle bracket     $ dollar sign          + plus sign
% percent           > right angle bracket    ! exclamation point    ` backtick
& ampersand         * asterisk               ‘ single quotes        | pipe
{ left bracket      ? question mark          “ double quotes        = equal sign
} right bracket     / forward slash          : colon
\ back slash        blank spaces             @ at sign
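
The convention we settled on (alphanumeric characters, underscores as separators, one full stop before the extension, under 30 characters or so) is easy to check by script. The sketch below is only an illustration of that idea; the function names and the exact pattern are my own assumptions, not part of Rosetta or any other tool.

```python
import re

MAX_LENGTH = 30  # the suggested limit from the list above

# Letters and digits, optionally joined by single underscores,
# followed by at most one ".extension".
SAFE_NAME = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*(?:\.[A-Za-z0-9]+)?$")


def is_safe_name(name):
    """True if the name follows the convention described above."""
    return len(name) <= MAX_LENGTH and SAFE_NAME.fullmatch(name) is not None


def suggest_name(name):
    """Propose a cleaned-up name; it does not shorten long names."""
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem:  # no extension, or a name starting with "."
        stem, ext = name, ""
    # Replace every run of other characters with one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    ext = re.sub(r"[^A-Za-z0-9]+", "", ext)
    return f"{stem}.{ext}" if ext else stem


print(suggest_name("Annual Report (2016) & notes.pdf"))
```

A suggested name should still be passed through `is_safe_name`, since a long name stays long and a name made up only of special characters has nothing left to keep. Renaming can also produce two identical names in one folder, so any real workflow would need to check for collisions.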


Monday, September 26, 2016

Selection and Appraisal in the OAIS Model

Selection and Appraisal in the OAIS Model. Ed Pinsent. DART Blog. 7 September 2016.
     The post asks whether the OAIS Model accommodates the skills of selection and appraisal, then suggests that it cannot.  The Model presents an over-simplified view in which content arrives in a state that is already ready to preserve, which ignores the beginning processes. There is a need to define the pre-ingest stage in OAIS, and there needs to be a greater recognition of the archivists' Selection and Appraisal skills, which can have tremendous value in digital preservation. Archivists assess the value of the content in a contextual framework, based on other records in the archive and in the context of provenance. It requires an understanding of context, provenance, and record series to help identify the potential value of content. A Series model is the "foundation for all Archival arrangement, and is the cornerstone of our profession". It is difficult to see where the record / archival series is in all this.  "The integrity and contextual meaning of a collection is being overlooked, in favour of this atomised digital-object view.

OAIS, if strictly interpreted, could bypass the Series altogether in favour of an assembly line workflow that simply processes one digital object after another."  The blog post asserts that there is a need to rediscover the value of Appraisal and Selection and its importance in the digital realm. 

Assessing and Quantifying Risk to Digital Media Materials

Assessing and Quantifying Risk to Digital Media Materials. Lance Thomas Stuchell. Bits and Pieces. August 31, 2016.
     A post written by Sarah Breen, Alix Norton, and Alexa Hagen. Archives are increasingly facing challenges in preserving digital media materials; creating digital processing workflows and workstations is one often discussed challenge. This article discusses a framework for assessing risk of loss to digital archival materials and shows that the methodology can highlight materials most susceptible to loss. This will help administrators demonstrate the need for immediate intervention and processing.

The methodology used a formula for calculating risk to physical collections:
"The formula yields a calculation of the magnitude of a given risk (MR) by multiplying the factors of the fraction of the collection that is susceptible (FS), loss of value (LV), probability of risk (P), and extent of the risk (E). By giving each of these factors a value between 0 and 1, we calculated MR values for the overall magnitude of a variety of risks, also between 0 and 1. While this formula is often used to assess risks over a 100 year period, due to the nature of the short lifespan and rapid obsolescence of digital media, we have used this formula to assess risks over a 10 year period".
External risks would affect the collection as a whole, and would include fire, theft, damage, and lack of funding to continue preservation projects. Internal risks are more specific to the physical digital media format, such as obsolescence of format and media degradation. Management, funding, administrative decisions and the storage environment can also be areas of high risk.
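
The formula quoted above is a straight multiplication, MR = FS x LV x P x E, with every factor between 0 and 1. A minimal sketch of the calculation follows; the function name and the sample values are hypothetical and are not taken from the article.

```python
def magnitude_of_risk(fs, lv, p, e):
    """MR = FS * LV * P * E, with each factor between 0 and 1."""
    for label, value in (("FS", fs), ("LV", lv), ("P", p), ("E", e)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0 and 1, got {value}")
    return fs * lv * p * e


# Hypothetical values for a single risk over a 10 year period:
# half the collection susceptible, 80% loss of value,
# 50% probability, 50% extent.
mr = magnitude_of_risk(fs=0.5, lv=0.8, p=0.5, e=0.5)
print(round(mr, 3))
```

Because the factors are multiplied, a risk scores high only when all four factors are high; a single factor near zero pulls the whole result toward zero.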

The highest risks assessed include:
  1. degradation and obsolescence, 
  2. lack of funding, and 
  3. potential loss of management support. 
The article recommends actions be taken to mitigate these risks early by:
  • migrating digital content to a stable content management system, 
  • lowering relative humidity of the storage environment, 
  • securing the lowest cost digital storage option that remains aligned with the library’s policy, and 
  • advocating to library and university administration for the need for preservation.
These recommendations should significantly reduce the highest risks and help ensure the preservation of the digital information.

Saturday, September 24, 2016

Digital Preservation Projects in 2016

Digital Preservation Projects in 2016. Chris Erickson. 6 September 2016.
     For the past several months we have been working on a number of digital preservation projects, which include:
  • Reworking the ContentDM to Rosetta ingest interface. Originally it was just for images or simple objects. It has been expanded to also include compound objects, particularly those with page level metadata, page by page transcriptions, and such.
  • Improving our unstructured data ingest process. It uses a spreadsheet template for metadata related to files to be ingested into Rosetta. The content creator can enter the metadata, or we have a file discovery tool that can traverse a directory structure and enter file and folder metadata into the spreadsheet template. The collection I am just finishing with this tool totaled about 45,000 TIFF images.
  • Restructuring our digital ingest workflows from project based into a digital pipeline. We now have a shared drive between Rosetta and our content creators and more disk storage space, which makes it easier to transfer files at the end of a project, or they can transfer files as they go if it is a long project.
  • Using all this to keep up with all new projects being created and adding them to Rosetta, which allows more time to ingest the backlog of projects waiting for preservation. The rate of ingest now, depending on preparation of the collections, is usually a couple of TBs each week.
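
For anyone curious about the file discovery step mentioned above, here is a rough sketch of the general idea: walk a directory tree and write one row per file to a CSV file that a spreadsheet can open. It is not our actual tool, and the column names are assumptions chosen for illustration rather than the fields of our template.

```python
import csv
import os
from datetime import datetime, timezone

# Assumed column names, for illustration only.
FIELDNAMES = ["folder", "filename", "size_bytes", "modified_utc"]


def collect_file_metadata(root):
    """Walk the tree under root and return one dict per file found."""
    rows = []
    for folder, _subfolders, files in os.walk(root):
        for filename in sorted(files):
            info = os.stat(os.path.join(folder, filename))
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                "folder": os.path.relpath(folder, root),
                "filename": filename,
                "size_bytes": info.st_size,
                "modified_utc": modified.isoformat(),
            })
    return rows


def write_metadata_csv(rows, csv_path):
    """Write the collected rows, with a header line, to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_metadata_csv(collect_file_metadata("some_folder"), "metadata.csv")` would produce the file; both paths here are placeholders.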

Thursday, September 22, 2016

Content Delivery Drives The Move To The Cloud

Content Delivery Drives The Move To The Cloud. Tom Coughlin. Forbes. Sep 13, 2016.
     The growing reliance on the Internet is also increasing cloud-based services for collaborative workflows and content delivery in the Media and Entertainment Industry. This is causing a shift from capital expenses to operating expenses for media and entertainment content storage.  Cloud storage for the media and entertainment industry is projected to grow from $2.5 billion in 2016 to over $20 billion by 2021.  Archiving and preservation is a large part of this, seen in this chart.

Thursday, September 15, 2016

The Secret Libraries of History

The Secret Libraries of History. Fiona Macdonald. 19 August 2016.
     Religious or political pressures have meant that books have been hidden throughout history – whether in secret caches or private collections. This article looks at libraries that have been preserved over time, either to keep them hidden, or because of neglect.
  • Syria’s secret library currently beneath the streets of a suburb of Damascus
  • The Library Cave on the edge of the Gobi Desert in China, sealed for almost 1000 years
  • The Vatican Secret Archives with papal correspondence going back over 1000 years
  • The Cairo Genizah in a wall of the Ben Ezra synagogue containing almost 280,000 Jewish manuscript fragments from the ninth to the nineteenth centuries
  • A Hidden Medieval Archive found in papers used in binding medieval books
[Note: This is a good reminder of what it is we are trying to do, to keep important content for future generations. Chris]

Tuesday, September 06, 2016

The Pathways of Research Software Preservation: An Educational and Planning Resource for Service Development

The Pathways of Research Software Preservation: An Educational and Planning Resource for Service Development. Fernando Rios. D-Lib Magazine. July/August 2016.
     A great deal of effort has gone into preserving digital research data, but not as much into preserving the software and code needed to use the data. The computer programs to view, process, analyze, and create data are an integral part of the research workflow. The article notes that "many issues remain in regards to identifying and capturing metadata, dependencies, support for attribution and citation, infrastructure development, and developing appropriate workflows to enable service provision." The article looks at the development of a visual representation of software preservation at Johns Hopkins University, which addresses:
  1. major approaches to software preservation for research software and data
  2. a need to evaluate our capacity to offer software preservation services
  3. the need for a road-map
One way to look at this is to view the development and use of software in the research process in general phases:
  • developing concepts and theory
  • writing the software 
  • obtaining all objects required for execution 
  • collecting inputs and setting parameters, and 
  • making use of the results
With a visual approach for evaluating the possible Pathways of Software Preservation, a user can better understand the capacity for software preservation activities, as well as gain "a better appreciation of the nuances of research software preservation and sharing".

Monday, September 05, 2016

Preservation Challenges in the Digital Age

Preservation Challenges in the Digital Age. Bernadette Houghton. D-Lib Magazine. July/August 2016.
     The rapidly evolving digital preservation field has many preservation challenges:
  • Digital materials are more at risk than analogue
  • Preserving digital materials also means providing access to the material
  • Ensuring the infrastructure that renders the file is preserved or replicated
  • Focal areas are changing and best practices are still under debate
"The optimal preservation strategy for individual organisations will differ according to their requirements, resources and data type. Each strategy comes with its own set of challenges, many of which are dependent on, or impacted in some way by, other challenges. This article will cover what the author sees as the major challenges for digital preservation at this point in time, covering a range of technical, administrative, logistical and legal aspects."

Other challenges:
  • Data volumes. Digital storage is becoming cheaper, but not every file and every version of it can and should be stored or preserved. Selecting what to preserve and when to take preservative action becomes more complex with a larger volume of data and a wider range of storage media. This increases the risk of failing to preserve materials of historical value. There is also a higher risk of not finding data because of poor metadata.
  • Archivability. One of the most fundamental challenges in archiving is determining what should be preserved and the extent of preservation.
  • Multiplicities. Materials born digital today are likely to have multiple copies in multiple versions stored in multiple locations, possibly under multiple filenames and in multiple file formats.
  • Hardware and storage. Obsolescence, deterioration of media and hardware mechanical failure increase the risk of loss. The cloud is increasingly used for storage, but there are also significant issues with using it.
  • File formats. File formats were considered a big risk in digital preservation, but they have not proven to be the overwhelming danger that they were initially perceived to be. Proprietary file formats continue to pose a challenge.
  • Metadata. Metadata is probably the most important aspect of digital preservation. Materials with poor metadata may be undiscoverable, and their authenticity, verifiability and their context unclear.
  • Legalities. Digital preservation presents some complex legal issues.
  • Privacy. Material chosen for preservation may contain private and confidential information, and its unauthorised release may lead to legal action.
  • Resourcing. Preservation costs involve not just the actual digitisation, but also storage, infrastructure, staff resourcing and training, ongoing maintenance and auditing of the digitised materials. There are also costs associated with providing access.
The challenge is to use the scarce resources to preserve the most important materials, using the most cost-effective and efficient methods. Even choosing not to preserve materials involves costs. Those who will benefit most from current preservation programs are future generations, which makes it difficult to justify expenditure on digital preservation, since there is little current benefit. The "best that the preservation community can do with digital material is to make educated guesses based on a few decades of mostly anecdotal experience".

"The challenges in digital preservation involve dealing with not just the technologies of the past, but also those to come". The digital preservation field is developing rapidly and the people working with digital materials need to keep up with the changes.

Friday, September 02, 2016

TRAC Certified Long-term Digital Preservation: DuraCloud and Chronopolis for Institutional Treasures

TRAC Certified Long-term Digital Preservation: DuraCloud and Chronopolis for Institutional Treasures. Website. 1 September 2016.
     "An institution’s identity is often formed by what it saves for current and future access. Digital collections curated by the academy can include research data, images, texts, reports, artworks, books, and historic documents help define an academic institution’s identity."

DuraSpace and the Chronopolis service at the University of California at San Diego have announced the DuraCloud Enterprise Chronopolis subscription plan for digital preservation. It stores digital content in Amazon and in the Chronopolis network. It provides geographic replication and synchronization of content between three storage locations, and has content integrity monitoring in a dark storage option. Plan options are a combination of Amazon S3, Amazon Glacier, and SDSC.

Pricing and Plan details
Plan                                    Subscription Fee    Storage
DuraCloud Preservation                  $1,175              $700/TB
DuraCloud Preservation Plus             $1,175              $825/TB
DuraCloud Enterprise                    $5,250              $500/TB
DuraCloud Enterprise Plus               $5,250              $625/TB
DuraCloud Enterprise Plus (Option 2)    $5,550              $1,200/TB
DuraCloud Enterprise Chronopolis        $2,750              $500/TB (ingest and retrieval fees extra)
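
To compare the plans for a given collection size, the figures above can be put into a small calculation. This is a sketch under stated assumptions: that the total is simply the subscription fee plus the per-TB rate times the number of terabytes, and that both figures cover the same billing period, which the announcement as summarized here does not spell out. Ingest and retrieval fees are left out.

```python
# (subscription fee, storage price per TB) in US dollars, from the table above.
PLANS = {
    "DuraCloud Preservation": (1175, 700),
    "DuraCloud Preservation Plus": (1175, 825),
    "DuraCloud Enterprise": (5250, 500),
    "DuraCloud Enterprise Plus": (5250, 625),
    "DuraCloud Enterprise Plus (Option 2)": (5550, 1200),
    "DuraCloud Enterprise Chronopolis": (2750, 500),
}


def estimated_cost(plan, terabytes):
    """Assumed model: fee + per-TB rate * TB, with no other charges."""
    fee, per_tb = PLANS[plan]
    return fee + per_tb * terabytes


# Example: estimated cost of each plan for 10 TB of content.
for plan in PLANS:
    print(f"{plan}: ${estimated_cost(plan, 10):,}")
```

Under this simple model, DuraCloud Enterprise only becomes cheaper than DuraCloud Preservation above about 20.4 TB, where the $200/TB saving has made up the $4,075 difference in subscription fees.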

Thursday, September 01, 2016

Digital Preservation: Keep calm and get on with it!

Digital Preservation: Keep calm and get on with it! Matthew Addis. Archives and Records Association 2016. 30 August 2016.
     This is a presentation about simple and practical steps towards digital preservation using open source tools and best practices. The benefits of a digital preservation strategy are increasingly clear, but implementing the strategy can be overwhelming. The presentation lists resources and tools, such as the Digital Preservation Coalition handbook, the COPTR tool website, DROID, and the Data Assessment Framework. Sometimes complex resources can also be overwhelming and make decisions more difficult. "If you think that you’re not able to ‘do enough’ or ‘do it properly’, then this can result in doing nothing because this feels like the next best thing." But doing nothing has serious consequences in the digital world. "It’s almost always better to get on and do something than it is to do nothing." The presentation also refers to ‘parsimonious preservation’ or starting with minimal actions. Understand what you have and try to keep it safe through safe copies. It is important to understand formats and to use the tools to keep the content safe. "File format identification gives the information needed to make decisions." Another important part is to start simple and add functionality as you go. The maturity model from the National Digital Stewardship Alliance is a good guide.
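
In the spirit of starting simple, one minimal way to confirm that a safe copy really matches its original is to compare checksums. The sketch below uses SHA-256 from the Python standard library; it is my own illustration of a common practice and not a step the presentation prescribes, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(original) == sha256_of(copy)
```

Storing the digest alongside each file when it is first received also makes it possible to re-check a copy later without needing the original at hand.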