Saturday, February 28, 2015

OxGarage Conversion

OxGarage Conversion. Website. February 27, 2015.
An interesting web tool from the University of Oxford for converting documents to different formats. OxGarage is a web-based, RESTful service that transforms documents between a variety of formats, using the Text Encoding Initiative (TEI) format as a pivot format. The initial option is to select:
  • Documents
  • Presentations
  • Spreadsheets
There are dozens of source and target formats listed, such as Word, WordPerfect, RSS, PDF, ppt, csv, xls, and so forth. When you upload an XML document that links to images on your computer, there is an option to add the images as well. If you have a document with links to images on the internet, these will be downloaded and included with your document.

Friday, February 27, 2015

Data on the Web Best Practices

Data on the Web Best Practices. W3C First Public Working Draft. 24 February 2015.
This document provides best practices related to the publication and usage of data on the Web. Data should be discoverable and understandable by humans and machines, and the efforts of the data publisher should be recognized. This will help the interaction between publishers and users.

Data on the Web can be represented and accessed in multiple ways, which is a challenge. Some of the other challenges include: metadata, formats, provenance, quality, access, versions, and preservation. The Best Practices proposed should help data publishers and data consumers overcome the different challenges faced during the data life cycle on the web. The draft proposes best practices for each one of the described challenges.

Thursday, February 26, 2015

Library of Congress Recommended Format Specifications. Comments Requested.

Library of Congress Recommended Format Specifications.  Library of Congress website. February 26, 2015.
Comments and feedback requested by March 31, 2015.
Because of the dynamic, ever-changing nature and availability of formats, the Library plans to revisit the specifications annually. Reviewing the specifications annually will permit the Library to keep pace with developments in the creative world, so that changes to the Format Specifications, although made frequently, can be made in small increments. Input and feedback are greatly encouraged and welcomed.

Cloud Storage and Digital Preservation: New guidance from the National Archives

Cloud Storage and Digital Preservation: New guidance from the National Archives. Laura Molloy. Digital Curation Centre. 13 May, 2014.
The use of cloud storage in digital preservation is a rapidly evolving field and this guidance explores how it is developing, emerging options and good practice, together with requirements and standards that archives should consider. Digital preservation is a significant issue for almost all public archives. There is an increasing demand for storage of both born-digital archives and digitised material, and an expectation that public access to this content will continue to expand. The guidance includes five detailed case studies of UK archives that have implemented cloud storage solutions.

Digital preservation can be defined as: “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary, beyond the limits of media failure or technological and organisational change”. The challenges are urgent but can be taken one step at a time; you can address current technology and needs while ensuring that the content can be passed on to the next generation. With cloud storage there are many positives and negatives that must be considered. The article reviews many of these. When establishing your needs: Identify what are the ‘must have’ needs and what are the ‘wants’. Define your requirements and decide on the required capabilities rather than a specific technology, implementation, or product.
  • We should be concerned about the security of data, wherever it is stored, but it would be unrealistic to suggest that most cloud services are inherently less secure than most local data centres.
  • Adoption of a digital preservation strategy utilising cloud computing inevitably brings with it a range of legal questions.
  • Cloud storage services can achieve significant economies of scale.
  • Cloud services are typically considered to be operational rather than capital expenditure.

Why Digital Storage Formats Are So Risky

Why Digital Storage Formats Are So Risky.  Matthew Woollard.  Lifehacker. 25 February 2015.
While it may seem that digital files last forever, the growing digital sphere faces enormous losses. Even Google has been unable to ensure access for its archive of digital content. Technical solutions already exist, but they’re not well known and relatively expensive.

How much are we prepared to pay to ensure that digital content that exists today will be usable in the future? We need to think about the value of the content and decide if it is worth keeping. Determining the value can be difficult. However, "re-use is a significant benefit from preserving data and adds value." Besides economic value, there are also cultural and intellectual reasons for preserving data. An example of data preservation from the Middle Ages is that scribes used wax tablets for temporary records and parchment for permanent records.

The chances of born-digital material being usable in 100 years will be considerably improved by actively taking steps now to ensure the preservation of the items. Effective digital preservation relies on the activities of the creator as well as the archivist. It is important to make decisions about providing context, the types of formats to use, how to organize the material, and resolving rights issues to avoid future problems. 

Tuesday, February 24, 2015

Why we should all think about data preservation

Why we should all think about data preservation. Stephanie Taylor. School of Advanced Study. February 19, 2015.
The SHARD project, which ended in 2012, identified  four basic principles of digital preservation for researchers:
  1. Start early: The sooner you start thinking about what to preserve, how to do it, and when, the greater the chance of avoiding problems. Early planning means involving everyone in a research project in the discussion to help identify additional issues.
  2. Explain it: Context provides meaning and is vital in digital preservation. There is little point in preserving material and data without context.
  3. Store it safely: Backups are not preservation. Preservation needs multiple copies in different locations. Use open source file formats and be careful how you and others handle and access files. Select carefully the files to be preserved.
  4. Share it: Sharing your research material and data is beneficial.  In one way or another, the main reason to carry out preservation at all, on any level, is to be able to share your work with others, now and in the future.
Many things are being lost or threatened because no one saw a good reason to preserve them until it was almost too late.  The Data Preservation Online Training resource guides students through the reasons to preserve and share data and challenges that they might face.

Monday, February 23, 2015

Threeding Uses Artec 3D Scanning Technology to Catalog 3D Models for Bulgaria’s National Museum of Military History

Threeding Uses Artec 3D Scanning Technology to Catalog 3D Models for Bulgaria’s National Museum of Military History. Bridget Butler Millsaps. 3D Printer & 3D Printing News. February 20, 2015.
The National Museum of Military History is collaborating on 3D scanning technology to preserve physical pieces of history by creating 3D digital models. With the scans, the museum can create a virtual museum. It also plans to share the models online and allow the public to use them to 3D print replicas of the artifacts.

Saturday, February 21, 2015

OAI-PMH harvesting from SharePoint

SharePoint 2010 to Primo.  Cillian Joy. Tech Blog. July 2014.
They have a system to manage the submission, storage, approval, and discovery of taught thesis documents, which uses SharePoint 2010 as the document repository and Ex Libris Primo as the discovery tool. The solution uses PHP, XML, XSLT, cURL, and the SharePoint REST API using OData.
It uses the Atom and OAI-PMH standards.
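
For readers who have not seen OAI-PMH, a harvest is just an HTTP GET with a few query parameters. The sketch below only builds such a request URL; it is not the code from the post. The base URL is a made-up placeholder, and the helper name is mine. The `verb`, `metadataPrefix`, and `set` parameters and the `oai_dc` prefix come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def oai_pmh_request_url(base_url, verb="ListRecords",
                        metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH request URL (hypothetical helper, illustration only)."""
    params = {"verb": verb, "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# The repository address below is an assumption, not a real endpoint.
print(oai_pmh_request_url("https://repository.example.edu/oai"))
```

A discovery tool such as Primo would issue requests of this shape against whatever endpoint the repository side exposes.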

SharePoint 2013 .NET Server, CSOM, JSOM, and REST API index

Friday, February 20, 2015

Enjoy your digital films and videos while you can... before they disappear

Enjoy your digital films and videos while you can... before they disappear. David Shapton. RedShark Publications. February 17, 2015.
An article about the fragility of digital objects, with examples of how digital files can fail even when there are multiple copies. Drives can fail, and systems can become obsolete. Some statements:
  • paid-for cloud storage and synchronisation company that seems to be doing OK today but which might not be here at some point in the future.
  • Absolutely the most important thing to remember here is that this can happen right under your nose without you realising it. It's like the way you forget things.
  • We can have backup strategies. But that's clearly not enough. There's no point at all in backing up all your files so that they're stored on accessible error-free media, only to find that you don't have any applications to play them.  
  • Cerf has said "that we have to not only preserve the files, but the means to decode them as well."
  • we also have to preserve a working copy of the operating system that can play back the media files, and because machines go out of date, we have to preserve a working copy of the machine.
  • You don't get a warning when something is about to become obsolete or unreadable. You just get an error message bringing you the bad news, or the device doesn't show up in your file system explorer. 
  • Data doesn't fade away gradually. It just becomes inaccessible. But when you step back and look at a mass of data from afar, the effect is that it gradually goes away.
It may be possible to create a virtual machine with a portable language, which could help resurrect machine/software combinations. Making this work will be laborious and expensive. "It will have to happen because if it doesn't, our films, videos, music tracks, personal memories, and in fact the whole of our recent (and future) history, will simply disappear."

Thursday, February 19, 2015

From Theory to Action: Good Enough Digital Preservation for Under-Resourced Cultural Heritage Institutions

From Theory to Action: Good Enough Digital Preservation for Under-Resourced Cultural Heritage Institutions. Jaime Schumacher, et al. Digital POWRR White Paper for the Institute of Museum and Library Services. 27 August 2014.
The Digital POWRR team comprises archivists, curators, librarians, and a digital humanist from small and mid-sized Illinois institutions who know that digital content is vulnerable but lack significant financial resources and have been unable to come up with programmatic and technical solutions to mitigate the risk. Each institution produced a case study and a gap analysis, with a plan to address the obstacles. Some institutions have created and implemented digital preservation programs; however, medium-sized and smaller organizations with fewer resources, like the POWRR institutions, are in a vulnerable position. Some statements of interest:
  • "Common elements emerged from our gap analyses: a lack of available financial resources; limited or nonexistent dedicated staff time for digital preservation activities; and inadequate levels of appropriate technical expertise. Some of the case studies also mentioned a lack of institutional awareness of the fragility of digital content and a lack of cohesive policies and practices across departments as a contributing factor towards the absence of real progress." 
  • Digital preservation is best thought of as an incremental, ongoing, and ever-shifting set of actions, reactions, workflows, and policies.
  • the notion that it is necessary to research all available tools and services exhaustively before taking any basic steps to secure digital content is yet another misconception that often prevents any progress from occurring.
  • Fortunately, practitioners can get started with simple, freely available triage tools while researching which of the more robust solutions will best suit their needs.
The group developed the POWRR tool for evaluating solutions and tools. The appendix also has recommendations for the Developer Community.

ArchivesDirect hosted service

ArchivesDirect website. February 18, 2015.
ArchivesDirect is a web-based hosted service of Archivematica offered by DuraSpace for creating OAIS-based digital preservation workflows with content packages that are archived with DuraCloud and Amazon Glacier. It includes open source preservation tools, and generates archival packages using microservices, PREMIS, and METS XML files. ArchivesDirect is intended for small to mid-sized institutions. DuraSpace is a partnership with DSpace, Fedora, and VIVO.

Pricing and subscription plans include:
ArchivesDirect Standard (System, training, 1 TB): $11,900
ArchivesDirect Digital Preservation Assessment: $4,500
Additional Storage in Amazon S3 and Glacier: $1,000/TB/year

Wednesday, February 18, 2015

Rosetta and Amazon Storage

Rosetta and Amazon Storage. Chris Erickson. February 2015.
In the search for more file storage, as well as more affordable file storage, we tried Amazon Simple Storage Service (Amazon S3). The plan was to connect the Rosetta Digital Preservation System to the Amazon cloud storage, and evaluate it as a possible storage solution. There is a free trial. The Free Tier includes 5 GB of storage, 20,000 Get Requests, and 2,000 Put Requests.

I tried various configurations, but decided on a single bucket for the files. I set up buckets for the IEs and metadata, but after trying it, decided to only keep the files on Amazon. That would keep the metadata local. I had tried nested folders, but couldn't figure out how to designate that in the storage rules and definition. So I created the folders by time period.

In the Rosetta Admin interface I created a File storage group, using the S3 storage plugin, and then entered the Bucket name, Secret Access Key, and Access Key ID, and left the Maximum waiting time at the default. For the test, I set up a retention code for Amazon, and the storage rule used that code to determine what went to the Amazon storage. In a real storage instance, it would be better to use something that would not change, like the producer.

It took a few tests to get everything in sync. The result was that Rosetta stored the content in Amazon just fine. I also tried adding content with a one day retention period, and the content was removed from Amazon after the day. A fixity check task was also able to work without a problem.
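
For anyone unfamiliar with what a fixity check does, here is a minimal sketch of the general idea: recompute a file's checksum and compare it with the value recorded at ingest. This is not Rosetta's fixity task or plugin code; the function names are mine, and SHA-256 is an assumed choice of algorithm.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected_checksum, algorithm="sha256"):
    """True when the file still matches the checksum recorded earlier."""
    return file_checksum(path, algorithm) == expected_checksum.lower()
```

For cloud storage the file would first have to be retrieved, so a check like this also incurs Get Requests and transfer time.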

This gives us another storage option, though we decided to not use it at present.

Pricing at the time of this comparison was:

                                1 TB        1 TB        50 TBs      50 TBs      50 TBs      50 TBs
                                Annual      20 Year     Yearly      10 Year     20 Year     50 Year
Digital Storage Costs           Cost        Projected   Charge      Projected   Projected   Projected
Cloud Storage
Amazon S3 - Regular             $360        $7,200      $17,706     $177,060    $354,120    $885,300
Amazon S3 - Copy / Glacier      $480        $9,600      $23,706     $237,060    $474,120    $1,185,300
Amazon S3 - Reduced & Glacier   $288        $5,760      $14,165     $141,648    $283,296    $708,240
DuraSpace - Preservation        $1,800      $36,000     $36,100     $361,000    $722,000    $1,805,000
DuraSpace - Dark copy/Glacier   $1,925      $38,500     $42,350     $423,500    $847,000    $2,117,500
DuraSpace - Enterprise Plus     $5,625      $112,500    $64,425     $644,250    $1,288,500  $3,221,250
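
The projected figures are consistent with a flat projection: the yearly charge multiplied by the number of years, with no allowance for price changes, growth in holdings, or retrieval fees. A small sketch of that arithmetic, using the Amazon S3 - Regular row (the function name is mine):

```python
def projected_cost(yearly_charge, years):
    """Flat projection: assumes the yearly charge never changes."""
    return yearly_charge * years


# Amazon S3 - Regular, 50 TBs, yearly charge taken from the pricing above.
s3_regular_50tb = 17_706
for years in (10, 20, 50):
    print(f"{years} years: ${projected_cost(s3_regular_50tb, years):,}")
```

Storage prices have historically changed over time, so a flat projection is best read as a rough comparison between options rather than a forecast.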

More storage options will be considered. 

Save the Voices of Tolkien, Joyce And Tennyson

Save the Voices of Tolkien, Joyce And Tennyson. Laura Clark. Smithsonian.
The British Library issued a public call for help safeguarding the more than 6.5 million recordings in its archives through digital preservation. It will take around £40 million to fully fund the effort, and time is running short. The British Library’s sound archives include audio files from Tolkien, Joyce, Florence Nightingale, Tennyson, WWI soldiers, as well as many nature sounds, oral histories and theater performances. Thousands of others are at risk and will disappear soon if no action is taken.

Tuesday, February 17, 2015

AHRQ Public Access to Federally Funded Research

AHRQ Public Access to Federally Funded Research. Francis D. Chesley. Agency for Healthcare Research and Quality. February, 2015.
The Agency for Healthcare Research and Quality has established a policy for public access to scientific publications and scientific data in digital format resulting from funding through the agency. Preservation is one of the Public Access Policy's primary objectives.

The Public Access Policy includes the following objectives:
  • Ensure that the public can access the final published digital documents.
  • Facilitate easy public search of, analysis of, and access to these publications.
  • Ensure that attribution to authors, journals, and original publishers is maintained.
  • Ensure that publications and metadata are in an archival solution.
  • Ensure that all researchers receiving grants develop data management plans, describing how they will provide for long-term preservation of and access to scientific data in digital format.
Data management plans will include:
  • A plan for protecting confidentiality and personal privacy.
  • A description of how scientific data in digital format will be shared.
  • A plan for long-term preservation of and access to the data.
The data management plans will be evaluated based on the values of long-term preservation and access, the associated cost, and the administrative burden. AHRQ will contract with a commercial repository to ensure long-term preservation and full access to the public.

Digital scientific data is defined as "the digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications, but does not include laboratory notebooks, preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, communications with colleagues, or physical objects, such as laboratory specimens."

Monday, February 16, 2015

Internet future blackout: No way to preserve our data

Internet future blackout: No way to preserve our data. Lisa M. Krieger. San Jose Mercury News. February 12, 2015.

Vint Cerf spoke about the "digital vellum" needed to maintain the data that comprise text, video, software games, scientific data and other digital objects, and how to preserve their meaning.
Aiding preservation might involve "Information Centric Networking," which is based on two simple concepts -- addressing information by its name, rather than location, and adding computation and memory to the network. David Oran of Cisco Systems and Glenn Edens, research director of Palo Alto's Xerox PARC, are working on that new technology.

Google is not directly involved in the digital preservation effort "although we have worked really hard at preserving the digital information of the day. We aren't planning to become the archive of the future -- although I think it would be cool." Cerf envisions libraries and governments investing in the technology needed to carry today's information into the distant future.

The "digital vellum" project has been created at Carnegie Mellon University. This is how it would work: There's a digital snapshot of a document, which is then built into one giant file. Using preserved and transmitted instructions, a virtual computer pretends to be a 2015-era Mac or IBM computer, and can find the document. "If I can substantiate those bits (of a document) in another computer, years in future, then I will have created the same document -- I can reproduce what you were doing. It is a digital copy of the state of the computer you were using when you created new documents."

Rather than the current paradigm -- names, addresses and routes -- it would start with an "object," with a name, as the thing to be stored and moved. "Our current system of 'domain names' is not a stable system." The current routing system could be replaced by the information-centric system, where we keep track of everything not by where it is hosted but by the information itself, by name.
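
To make "addressing information by its name, rather than location" a little more concrete, here is a toy sketch. It is only an illustration of the idea, not the design being worked on at Cisco or PARC; the class, the host names, and the object name are all invented.

```python
class NamedObjectNetwork:
    """Toy model: objects are requested by name, never by host address."""

    def __init__(self):
        # host -> {object name: bytes}; callers never refer to hosts when fetching.
        self._hosts = {}

    def store(self, host, name, data):
        self._hosts.setdefault(host, {})[name] = data

    def remove_host(self, host):
        self._hosts.pop(host, None)

    def fetch(self, name):
        # Any host holding a copy can answer the request.
        for objects in self._hosts.values():
            if name in objects:
                return objects[name]
        raise KeyError(f"no copy of {name!r} is reachable")


network = NamedObjectNetwork()
network.store("host-a", "report-2015", b"annual report")
network.store("host-b", "report-2015", b"annual report")
network.remove_host("host-a")          # one location disappears
print(network.fetch("report-2015"))    # the name still resolves
```

The point of the sketch is that losing one location does not break the reference, which is what a location-based URL cannot promise.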
Just as we need preservation of today's bits and software, our encryption systems will need to be preserved as well -- because once they're lost, so is the material that they're protecting. "We need an encryption system in which the keys never wear out and are never broken, that represents keying for hundreds of thousands of years."

Saturday, February 14, 2015

Google VP leads calls for web content preservation

Google VP leads calls for web content preservation. Caroline Donnelly. IT Pro.  13 Feb, 2015.

Vint Cerf says action is needed to preserve the content of the internet for future generations to enjoy. Historians in the future may view the 21st century as an “information black hole” because the software and services used to access online content could become defunct over time. To protect against this, he wants to see efforts made to create a “digital vellum” that will preserve the hardware and software needed to access online content in the years to come. “If we want to preserve them, we need to make sure that the digital objects we create today can still be rendered far into the future.” All web users are at risk of throwing their data away into a “digital black hole” in the mistaken belief that uploading content to a site or service will preserve it. “We digitise things because we think we will preserve them, but what we don’t understand is that unless we take other steps, those digital versions may not be any better, and may even be worse, than the artefacts that we digitised.”

Friday, February 13, 2015

Crystal clear digital preservation: a management issue

Crystal clear digital preservation: a management issue. Barbara Sierman. Digital Preservation Seeds. February 1, 2015.
The book Digital Preservation for Libraries, Archives and Museums by Edward Corrado and Heather Lea Moulaison does a great job of explaining digital preservation. "In crystal clear language, without beating about the bush and based on extensive up to date (until 2014) literature, digital preservation is explained and almost every aspect of it is touched upon." It explains what digital preservation is not (backup, etc.). The point of the book is expressed by the statement:
“ensuring ongoing access to digital content over time requires careful reflection and planning. In terms of technology, digital preservation is possible today. It might be difficult and require extensive, institution-wide planning, but digital preservation is an achievable goal given the proper resources. In short, digital preservation is in many ways primarily a management issue”.
It uses the Digital Preservation Triad to symbolize the interrelated activities of
  • Management-related activities,
  • Technological activities and
  • Content-centred activities.
The book, which is also available as an eBook, has a practical approach and emphasizes “that digital preservation is important to the overall mission of the organization”, and not just an experimental project.

Thursday, February 12, 2015

Save our Sounds

Save our Sounds. Luke McKernan. British Library.  12 January 2015.
The nation’s sound collections are under threat, both from physical degradation and as playback devices wear out and disappear. "Archival consensus internationally is that we have approximately 15 years in which to save our sound collections by digitising them before they become unreadable and are effectively lost." The British Library collection contains over 6.5 million recordings of speech, music, wildlife and the environment, from the 1880s to the present day. The Save our Sounds program has three major aims:
  1. Preserve as much as possible of the nation's rare and unique sound recordings from collections across the UK
  2. Establish a national radio archive to collect, protect and share with other partners
  3. Invest in new technology to enable us to receive music in digital formats, working with  industry partners, to ensure their long-term preservation
The library is seeking to raise funds to help the preservation efforts, and to implement a national audit to map sound archives in the UK and to find those that are at risk.

Wednesday, February 11, 2015

New Expert Panel Report From Council of Canadian Academies Says Canada’s Memory Institutions “Falling Behind” in Preservation of Digital Materials

New Expert Panel Report From Council of Canadian Academies Says Canada’s Memory Institutions “Falling Behind” in Preservation of Digital Materials. Gary Price. Library Journal. February 4, 2015.
An expert panel report, Leading in the Digital World: Opportunities for Canada’s Memory Institutions (208 pages; PDF), addresses the challenges and opportunities that exist for libraries, archives, museums, and galleries as they adapt to the digital age. Vast amounts of digital information are at risk of being lost because many traditional tools are no longer adequate in the digital age. Memory institutions face the difficult task of preserving digital files in formats that will remain accessible over the long term. Institutions need to collaborate more strategically and develop interactive relationships with users. They must also be leaders within and among their respective organizations. Many of the challenges faced are rooted in technical issues associated with managing digital content, the sheer volume of digital information, and the struggle to remain relevant. Collaboration is essential for adaptation, which enables institutions to access the resources required to deliver the services that users now expect.

Tuesday, February 10, 2015

Reference rot in web-based scholarly communication and link decoration as a path to mitigation

Reference rot in web-based scholarly communication and link decoration as a path to mitigation. Martin Klein, Herbert Van de Sompel. LSE Impact of Social Sciences blog. February 6, 2015.
The failure of a web address to link to the appropriate online source is a significant problem facing scholarly material. The ability to reference sources is a fundamental part of scholarship. "Increasingly, we see references to software, ontologies, project websites, presentations, blogs, videos, tweets, etc. Such resources are usually referenced by means of their HTTP URI as they exist on the web at large. These HTTP URIs allow for immediate access on the web, but also introduce one of the detrimental characteristics of the web to scholarly communication: reference rot." Reference rot is a combination of two problems common for URI references:
  • link rot: A URI ceases to exist; the page is not found
  • content drift: The resource identified by its URI changes over time and is not what was originally referenced
Their study shows that articles published in 2012 suffer from link rot: 13% of arXiv, 22% of Elsevier, and 14% of PubMed Central. For articles published in 2005, the corresponding numbers are higher: 18%, 41%, and 36%.
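
The two kinds of reference rot can be told apart with a simple check, sketched below. This is not the method used in the study; the function name is mine, and the exact byte comparison is a deliberately crude assumption, since real pages change in trivial ways (dates, advertisements) without the referenced content changing.

```python
def classify_reference(http_status, archived_body, current_body):
    """Classify a URI reference as 'link rot', 'content drift', or 'intact'.

    http_status: status code from fetching the URI now, or None if unreachable.
    archived_body: the content as it was when the reference was made.
    current_body: the content returned now.
    """
    if http_status is None or http_status >= 400:
        return "link rot"        # the URI no longer resolves to a page
    if current_body != archived_body:
        return "content drift"   # the page exists but is not what was cited
    return "intact"


print(classify_reference(404, b"original text", None))
print(classify_reference(200, b"original text", b"rewritten text"))
```

Note that content drift can only be detected if a copy of the page from the time of citing exists, which is one reason snapshots matter.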

The typical strategy to address the problem is to link to a snapshot of the web page (instead of the original web page) created at the time and stored in a web archive, such as the Internet Archive.

There are problems with this approach. The linked copy may not remain in place either. The original URI is lost, as is any information about the page or changed page. Link decoration can be used, with the URI of the original, the snapshot, and the datetime of linking. Memento can provide this information, but discussions are needed to decide how best to convey the information.
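
As an illustration of link decoration, the sketch below builds an HTML link that keeps the original URI in `href` and carries the snapshot URI and linking date as extra attributes. The attribute names `data-versionurl` and `data-versiondate` follow the authors' Robust Links proposal as I understand it; treat them, the helper name, and the example URLs as assumptions rather than a settled standard.

```python
from html import escape


def decorate_link(original_url, snapshot_url, version_date, text):
    """Return an <a> element carrying the original URI, a snapshot URI, and the linking date."""
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(snapshot_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


print(decorate_link(
    "http://example.org/page",
    "https://web.archive.org/web/20150206000000/http://example.org/page",
    "2015-02-06",
    "Example page",
))
```

With all three pieces of information in the link, a reader or tool can fall back to the snapshot, or ask another archive for a copy near that date, if the original stops working.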

Monday, February 09, 2015

All in the (Apple ProRes 422 Video Codec) Family

All in the (Apple ProRes 422 Video Codec) Family. The Signal.
The Apple ProRes 422 family of video codecs has been added to the Sustainability of Digital Formats website. These codecs are proprietary, lossy compressed, high quality intermediate codecs for digital video primarily supported by Final Cut Pro.

The Apple ProRes 422 Codec Family comprises four subtypes:
  1. ProRes 422 HQ: the highest data-rate version of the ProRes 422 codecs, applying the least compression for the best quality but the largest files.
  2. ProRes 422: the second-highest data-rate version of the group, often used for multistream, real-time editing, with significant storage savings over uncompressed video.
  3. ProRes 422 LT: the third-highest data-rate version, considered an editing codec with smaller file sizes.
  4. ProRes 422 Proxy: the lowest data-rate version, often used in offline post-production work that requires low data rates but also a full screen picture.

Saturday, February 07, 2015

Digital Preservation Coalition publishes ‘OAIS Introductory Guide (2nd Edition)’ Technology Watch Report

Digital Preservation Coalition publishes ‘OAIS Introductory Guide (2nd Edition)’ Technology Watch Report. Brian Lavoie.  Digital Preservation Coalition. Watch Report. October, 2014. [PDF]

The report describes the OAIS, its core principles and functional elements, as well as the information model which supports long-term preservation, access and understandability of data. The OAIS reference model was approved in 2002 and revised and updated in 2012. Perhaps “the most important achievement of the OAIS is that it has become almost universally accepted as the lingua franca of digital preservation”.

The central concept in the reference model is that of an open archival information system. An OAIS-type archive must meet a set of six minimum responsibilities to do with the ingest, preservation, and dissemination of archived materials. Its functional model consists of six entities: Ingest, Archival Storage, Data Management, Preservation Planning, Access, and Administration. There are also Common Services, which consist of basic computing and networking resources.

An OAIS-type archive references three types of entities: Management, Producer, and Consumer, which includes the Designated Community: consumers expected to independently understand the archived information in the form in which it is preserved and made available by the OAIS. This is a  framework to encourage dialogue and collaboration among participants in standards-building activities, as well as identifying areas most likely to benefit from standards development.

An OAIS-type archive is expected to:
  • Negotiate for and accept appropriate information from information producers;
  • Obtain sufficient control of the information in order to meet long-term preservation objectives;
  • Determine the scope of the archive’s user community;
  • Ensure the preserved information is independently understandable to the user community
  • Follow documented policies and procedures to ensure the information is preserved against all reasonable contingencies
  • Make the preserved information available to the user community, and enable dissemination of authenticated copies of the preserved information.
An OAIS should be committed to making the contents of its archival store available to its intended user community, through access mechanisms and services which support users’ needs and requirements. Such requirements may include preferred medium and access channels; any access restrictions should be clearly documented.

 The OAIS information model is built around the concept of an information package, which includes: the Submission Information Package, the Archival Information Package, and the Dissemination Information Package. Preservation requires metadata to support and document the OAIS’s preservation processes, called Preservation Description Information, which ‘is specifically focused on describing the past and present states of the Content Information, ensuring that it is uniquely identifiable, and ensuring it has not been unknowingly altered’. The information consists of:
  • Reference Information (identifiers)
  • Context Information (describes relationships among information and objects)
  • Provenance Information (history of the content over time)
  • Fixity Information (verifying authenticity)
  • Access Rights Information (conditions or restrictions)
OAIS is a model and not an implementation. It does not address system architectures, storage or processing technologies, database design, computing platforms, or other technical details of setting up a functioning archival system. But it has been used as a foundation or starting point. Efforts, such as TRAC, have been made to put the attributes of a trusted digital archive into a ‘checklist’ that could be used to support a certification process. PREMIS is a preservation metadata initiative that has emerged as the de facto standard. METS, an XML-based document format, has become widely used for encoding OAIS archival information packages.

The ‘OAIS reference model provides a solid theoretical basis for digital preservation efforts, though theory and practice can sometimes have an uneasy fit.’

Digital Tools and Apps

Digital Tools and Apps. Chris Erickson. Presentation for ULA. 2014. [PDF]
This is a presentation I created for ULA to briefly outline a few tools that I find helpful. There are many useful tools, and more are being created all the time. Here are a few that I use.
  • Copy & Transfer Tools: WinSCP; Teracopy
  • Rename Tools: Bulk Rename Utility
  • Integrity & Fixity Tools: MD5Summer; MD5sums 1.2; Quick Hash; Hash Tool
  • File Editing Tools: Babelpad; Notepad++; XML Notepad
    • ExifTool; BWF MetaEdit; BWAV Reader
  • File Format Tools: DROID
  • File Conversion: Calibre; Adobe Portfolio
  • Others: A whole list of other tools that I use or suggest you look at.
    •  PDF/A tools
    • Email tools
 Please let me know what tools you find helpful.
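
The integrity and fixity tools listed above all rest on the same idea: compute a checksum for a file, store it, and recompute it later to confirm that the file has not changed. As a rough illustration only, here is a minimal Python sketch of that check. The function names file_checksum and verify_fixity are my own for this example and do not come from any of the tools above; MD5 is used only because several of the listed tools are MD5-based, and another algorithm such as SHA-256 can be passed instead.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="md5"):
    """Return True if the file's current checksum matches the stored one."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

A typical use would be to record file_checksum(path) when a file is first received and to call verify_fixity(path, stored_value) after every copy or transfer.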

Friday, February 06, 2015

Preserving progress for future generations

Preserving progress for future generations. Rebecca Pool. Research Information. February/March 2015.
Digital preservation remains one of the most critical challenges facing scholarly communities today. From e-journals and e-books to emails, blogs and more, electronic content is proliferating, and organizations worldwide are trying to preserve it before it is lost. These organizations include Portico (which preserves content on behalf of participating publishers; the number of open access journals it includes is rising) and CLOCKSS (still grappling with the cost models of providing a preservation service).

There is a rising demand for the preservation of dynamic content. No one is able to "capture dynamic content and [preserve] a day-to-day, or even, minute-to-minute feed of this content." There are only snapshots. CLOCKSS is developing the ‘how to’ process to preserve these ‘snapshots’ across multiple locations, validating each against the other, and is also exploring the best pricing structures to preserve such content.

Other organizations include LOCKSS, The Digital Preservation Network, HathiTrust, Preservica, Archivematica, and Rosetta, whose recent clients are the State Library of New South Wales and the State Library of Queensland.

Digital preservation is clearly gaining momentum, growing in both size and complexity. "Clearly progress is being made and you can measure that by the maturity of solutions on offer." But for most organizations, the urgency of digital preservation has yet to hit home.

"Trying to sell the idea of digital preservation on the basis of return on investment has been very hard. By its nature, it’s a long-term activity and you’re really hedging your bets against future risks. I think we are still in the very early days of genuinely understanding the value of digital assets... and transferring this understanding over to financial assets doesn’t yet work very well." The European consortium 4C (Collaboration to Clarify the Costs of Curation) has been investigating this problem. Their road map helps organisations appraise digital assets, adopt a strategy to grow preservation assets and develop costing processes. In addition they have developed a model for curation costs. The only way to understand the costs of preservation is through sharing, through openness and collaboration.

Wednesday, February 04, 2015

The Cobweb. Can the Internet be archived?

The Cobweb. Can the Internet be archived? Jill Lepore. The New Yorker. January 26, 2015.

The average life of a Web page is about a hundred days. A page can disappear through “link rot,” or it can be updated, in which case the original has most likely been overwritten. Or the page may have been moved and something else is where it used to be. This is known as “content drift.” It is worse than an error message, since it’s impossible to tell that what you’re seeing isn’t what you went to look for: the overwriting, erasure, or moving of the original is invisible.

Link rot and content drift, collectively known as “reference rot,” have been disastrous for the law and courts. In providing evidence, legal scholars, lawyers, and judges often cite Web pages in their footnotes; they expect that evidence to remain where they found it as their proof. But a 2013 survey of law- and policy-related publications found that after six years, nearly fifty per cent of the URLs cited in those publications no longer worked. A Harvard Law School study in 2014 showed “more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.”

The overwriting, drifting, and rotting of the Web also affects engineers, scientists, and doctors. Recently, researchers at Los Alamos National Laboratory reported the results of a study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot.

The problem of disappearing links has been known since the start of the internet. Tim Berners-Lee proposed HTTP to link web pages, and he had also considered a time axis for the protocol, but “preservation was not a priority.” Other internet pioneers are also concerned. Vint Cerf has talked about the need for long-term storage, a “digital vellum”: “I worry that the twenty-first century will become an informational black hole.” Brewster Kahle started the Internet Archive, which has archived more than four hundred and thirty billion Web pages.

Herbert Van de Sompel has been working on Memento, which allows a user to look at a page as it appeared around a given time, such as when a citation to it was written.
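
For readers curious about the mechanics: the Memento framework (RFC 7089) works through datetime negotiation, where a client asks a server called a TimeGate for the state of a URL near a given date by sending an Accept-Datetime request header. The Python sketch below only builds that header. The function name accept_datetime_header is my own for illustration, no request is sent, and no TimeGate address is assumed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when):
    """Build an Accept-Datetime header for a timezone-aware datetime.

    The header value uses the HTTP date format in GMT,
    for example "Mon, 26 Jan 2015 00:00:00 GMT".
    """
    if when.tzinfo is None:
        # A naive datetime would be silently treated as local time; reject it.
        raise ValueError("when must be timezone-aware")
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


# Example: ask for the page as it was on the date the article was published.
headers = accept_datetime_header(datetime(2015, 1, 26, tzinfo=timezone.utc))
```

These headers would then be sent with an ordinary HTTP GET to a TimeGate, which responds by pointing to the archived copy closest to the requested date.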

Tuesday, February 03, 2015

Open Preservation Foundation to provide sustainable home for JHOVE

Open Preservation Foundation to provide sustainable home for JHOVE. Becky. Open Preservation Foundation Blog. 3 Feb 2015.
The Open Preservation Foundation is taking stewardship of the JHOVE preservation tool and providing a sustainable home. The tool will become part of the OPF software portfolio and follow their Software Maturity Model. Portico is contributing code improvements that they have made to the tool. Other tools in the portfolio include:
  • Jpylyzer: JP2 image validator and properties extractor
  • FIDO: command-line tool to identify the file formats of digital objects.
  • Matchbox: duplicate image detection tool
  • xcorrSound: four tools to improve Digital Audio Recordings

Office Opens up with OOXML

Office Opens up with OOXML. Carl Fleischhauer, Kate Murray. The Signal. February 3, 2015.
Nine new format descriptions have been added to the Library’s Format Sustainability Web site. These closely related formats belong to the Office Open XML (OOXML) family, the formats of the Microsoft family of “Office” desktop applications, including Word, PowerPoint and Excel. Formerly, these applications produced files in proprietary, binary formats with the extensions doc, ppt, and xls. The current versions employ an XML structure for the data and an x has been added to the extensions: docx, pptx, and xlsx.
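
To make the XML structure concrete: an OOXML file is a ZIP package (the Open Packaging Conventions in the list of formats) whose parts are mostly XML documents, so it can be inspected with ordinary ZIP and XML tools. The Python sketch below is only an illustration. The function names are my own, and the package it builds in memory is a tiny stand-in with two parts, not a valid Word document.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def list_package_parts(package):
    """Return the sorted names of the parts stored in a ZIP-based package."""
    with zipfile.ZipFile(package) as zf:
        return sorted(zf.namelist())


def root_tag(package, part_name):
    """Parse one XML part of the package and return the tag of its root element."""
    with zipfile.ZipFile(package) as zf:
        return ET.fromstring(zf.read(part_name)).tag


def make_stand_in_package():
    """Build a small in-memory ZIP with two XML parts (NOT a valid .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document><body/></document>")
    buf.seek(0)
    return buf
```

Calling list_package_parts on the path of a real docx, pptx, or xlsx file should show the same kind of layout, with many more parts.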

"In addition to giving the formats an XML expression, Microsoft also decided to move the formats out of proprietary status and into a standardized form (now focus on the word Open in the name.) Three international organizations cooperated to standardize OOXML."

The list of the nine:
  • OOXML_Family, OOXML Format Family, ISO/IEC 29500 and ECMA 376
  • OPC/OOXML_2012, Open Packaging Conventions (Office Open XML), ISO 29500-2:2008-2012
  • DOCX/OOXML_2012, DOCX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • DOCX/OOXML_Strict_2012, DOCX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • PPTX/OOXML_2012, PPTX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • PPTX/OOXML_Strict_2012, PPTX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • XLSX/OOXML_2012, XLSX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • XLSX/OOXML_Strict_2012, XLSX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • MCE/OOXML_2012, Markup Compatibility and Extensibility (Office Open XML), ISO 29500-3:2008-2012, ECMA-376, Editions 1-4
 "Meanwhile, readers should remember that the Format Sustainability Web site is not limited to formats that we consider desirable. We list as many formats (and subformats) as we can, as objectively as we can, so that others can choose the ones they prefer for a particular body of content and for particular use cases."

Monday, February 02, 2015

Websites Change, Go Away and Get Taken Down

Websites Change, Go Away and Get Taken Down.  Website. January 2015.
This is a beta service that allows users to create citation links that will never break. When a user creates a link, the service archives a copy of the referenced content and generates a link to an unalterable hosted instance of the site. Regardless of what may happen to the original source, if the link is later published by a journal using the service, the archived version will always be available through the link.

When readers click on a link they are directed to a page from which they can either go to the original site (which may have changed since the link was created) or see the archived copy of the site in its original state. It is an online preservation service developed by the Harvard Law School Library in conjunction with university law libraries across the country and other organizations in the “forever” business.