Wednesday, February 01, 2017

Why Aren't We Doing More With Our Web Archives?

Why Aren't We Doing More With Our Web Archives? Kalev Leetaru. Forbes. January 13, 2017.
     The post looks at the many projects that have been launched to archive and preserve the digital world; the best known is the Internet Archive, "which has been crawling and preserving the open web for more than two decades" and has preserved more than 510 billion distinct URLs from over 361 million websites. The author asks: "With such an incredible repository of global society’s web evolution, why don’t we see more applications of this unimaginable resource?"

Some of the reasons that there isn't a more vibrant and active research and software development community around web archives may be:
  • Economics plays a role
  • The complex nature of web archives
  • The Internet Archive's holdings exceed 15 petabytes, which makes them difficult to manipulate
  • There are few tools that can work with the archive, particularly for indexing
The Internet Archive last year announced the first efforts at keyword search capability. These kinds of search tools are needed to make the Archive’s holdings more accessible to researchers and data miners.

"At the end of the day, web archives are our only record capturing the evolution of human society from the physical to the virtual domains. The Internet Archive in particular represents one of the greatest archives ever  created of this immense transition in human existence and with the right tools and a greater focus on non-traditional avenues, perhaps we can launch a whole new world of research into how humans evolved into a digital existence."

Tuesday, January 31, 2017

20 TB Hard Disk Drives, The Future Of HDDs

20 TB Hard Disk Drives, The Future Of HDDs. Tom Coughlin. Forbes. January 28, 2017.
     Interesting article on the status and future of hard drives. It looks at the declining market and the trends for hard disk drives over the next few years. Overall drive shipments dropped about 9.4% in 2016, to 424 million drives. Of the total HDDs shipped in 2016:
  • Western Digital shipped 41% 
  • Seagate shipped 37%  
  • Toshiba shipped 22%.
"The long-term future of HDDs likely rests with high capacity HDDs, particularly in data centers serving cloud storage applications".  Seagate plans to ship 14 and 16 TB drives in the next 18 months, and possibly 20 TB drives in the next three years.

Digital Preservation and Archaeological data

Digital Preservation.  Michael L. Satlow. Then and Now. Jan 26, 2017.
     The post looks at the issue of preservation in relation to modern scholarly and artistic works. "The underlying problem is a simple one: most scholarly and creative work today is done digitally." Archaeological excavations generate reams of data, and like other scientific data, archaeological data are valuable. There is no single way that archaeologists record their findings. "Unlike scientists, many archaeologists and humanists have not thought very hard about the preservation of digital data. Scientists routinely deposit their raw data in institutional repositories and are called upon to articulate their digital data management and preservation plan on many grant applications. The paths open to others are less clear."

Institutional digital repositories provide a simple and inexpensive solution. When a project is complete, the data can be converted to XML and deposited; the conversion itself would be the most involved part. The XML format would allow the data to be easily accessed and used. "It is time to think about digital preservation as a staple of our 'best practices'."
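As a rough illustration of the kind of conversion involved (the record fields here are hypothetical, not from the post), tabular excavation data can be serialized to XML with a few lines of Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical excavation record; a real project would map its own fields.
record = {"site": "Tel Example", "locus": "L-104", "object": "pottery sherd"}

root = ET.Element("find")
for field, value in record.items():
    ET.SubElement(root, field).text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <find><site>Tel Example</site><locus>L-104</locus><object>pottery sherd</object></find>
```

A repository ingest process could then validate such records against a schema before deposit.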


Monday, January 30, 2017

Born-digital news preservation in perspective

Born-digital news preservation in perspective. Clifford Lynch. RJI Online. January 26, 2017. [Video and transcript.]
   The challenge with news and academic journals: how do you preserve this body of information? The journal community has been working on that in a much more systematic way. There is a shared consensus among all players that preserving the record of scholarly journal publication is essential. Nobody wants their scholarship to be ephemeral, so you have to tell people a convincing story about how their work will be preserved.

The primary responsibility for the active archive in most cases is the publisher, but there must be some kind of external fallback system so content will survive the failure of the publisher and the publisher’s archive. These are usually collaborative. Libraries have been the printed news archive, but that is changing. There is also a Keepers Registry so you can see how many keepers are preserving a given journal. The larger journals are well covered, but the smaller ones are really at risk, and a lot of these are small open source journals. "So, we need to be very mindful of those kinds of dynamics as we think about what to do about strategies for really handling the digital news at scale."

With the news, there are a few very large players, and a whole lot of other small news outlets of various kinds. Different strategies are needed for the two groups. We need to be very cautious about news boundaries. "Now in many, many cases, the journalism is built on top of and links to underlying evidence which at least in the short term is readily inspectable by anyone clicking on a link." But the links deteriorate and the material goes away and "preserving that evidence is really important." But it is unclear who is or should be preserving this. There are also questions about the news, the provenance, the motives, the accuracy, and these have to be handled in a more serious way.

"most social media is actually observation and testimony. Very little of it is synthesized news. It’s much more of the character of a set of testimonies or photographs or things like that. And collectively it can serve to give important documentation to an event, but often it is incomplete and otherwise problematic. We need to come to some kind of social consensus about how social media fits into  the cultural record.

We need to devise some systematic approaches to this because the journalistic organizations really need help; "their archives are genuinely at risk" and in many cases the "long term organizational viability is at risk". We need a public consensus. "We need a recognition that responsible journalism implies a lasting public record of that work." The need for a free press is recognized constitutionally. "We cannot, under current law, protect most of this material very effectively without the active collaboration of the content producers." This is too big a job for any single organization, and we don't want a single point of failure.


Tuesday, January 24, 2017

The UNESCO/PERSIST Guidelines for the selection of digital heritage for long-term preservation

The UNESCO/PERSIST Guidelines for the selection of digital heritage for long-term preservation. Sarah CC Choy, et al. UNESCO/PERSIST Content Task Force. March 2016.
     The survival of digital heritage is much less assured than that of its traditional counterparts. “Identification of significant digital heritage and early intervention are essential to ensuring its long-term preservation.” This project was created to help preserve our cultural heritage and to provide a starting point for institutions creating their policies. Preserving and ensuring access to digital information is also a challenge for the private sector. Acquiring and collecting digital heritage requires significant effort and resources. It is vital that organizations accept digital stewardship roles and responsibilities. Some thoughts and quotes from the document:
  • There is a strong risk that the restrictive legal environment will negatively impact the long-term survival of important digital heritage.
  • The challenge of long-term preservation in the digital age requires a rethinking of how heritage institutions identify significance and assess value.
  • New forms of digital expression blur boundaries and lines of responsibility and challenge past approaches to collecting.
  • Libraries, archives, and museums have a common interest in preserving heritage.
  • Heritage institutions must be proactive in identifying digital heritage and information for long-term preservation before it is lost.
  • Selection is essential, as it is economically and technically impossible, and often legally prohibited, to collect all current digital heritage. Selecting for long-term preservation will thus be a critical function of heritage institutions in the digital age.
  • Selecting digital heritage for long-term preservation may focus primarily on evaluating publications already in a collection, originally acquired for short-term use, rather than assessing new publications for acquisition.
  • Rapid obsolescence in digital formats, storage media, and systems is collapsing the window of opportunity for selection, and increases the risk of losing records that have not yet “proved” their significance over time.
Address strategies for collecting digital heritage and develop selection criteria for an institution. Four possible steps to use:
  1. Identify the material to be acquired or evaluated
  2. Determine the legal obligation to preserve the material
  3. Assess the material using three selection criteria: significance, sustainability, and availability
  4. Compile the above information and make a decision based on the results
Management of long-term digital preservation and metadata is important. There are five basic functional requirements for digital metadata:
  1. Identification of each digital object
  2. Location, so that each digital object can be found and retrieved
  3. Description of each digital object, both content and context, for recall and interpretation
  4. Readability and encoding, so the object remains legible over time
  5. Rights management, including conditions of use and restrictions for each digital item
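The five requirements above could be sketched as a minimal metadata record; the field names and values below are illustrative, not taken from the guidelines:

```python
# Illustrative metadata record covering the five functional requirements.
record = {
    "identifier": "urn:example:obj-0001",                         # 1. identification
    "location": "https://repo.example.org/obj-0001",              # 2. location and retrieval
    "description": "Digitised field diary, 1923; Smith expedition",  # 3. description (content and context)
    "format": "application/pdf",                                  # 4. readability and encoding
    "rights": "CC BY-NC; no public access before 2030",           # 5. rights management
}

# A repository could refuse deposits that lack any required field:
required = {"identifier", "location", "description", "format", "rights"}
complete = required <= record.keys()
print(complete)  # True
```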
“The long-term preservation of digital heritage is perhaps the most daunting challenge facing heritage institutions today.”

Wednesday, January 11, 2017

Digital preservation is a mature concept, but we need to pitch it better

Digital preservation is a mature concept, but we need to pitch it better. Dave Gerrard. Digital Preservation at Oxford and Cambridge.  6 December, 2016.
     The OAIS standard can be confusing for newcomers to the field, and one of the potentially confusing areas is Administration. It looks "like a place where much of the hard-to-model, human stuff had been separated from the technical, tool-based parts." The diagram is busier and more information-packed than the other areas, and thus could use more modeling. The standard might be easier to use if there were separate documents focusing on the ‘technical’ and ‘human’ aspects.

Communication, particularly an explanation to funders, about the importance of digital preservation is vital. It will help to have an 'elevator pitch' to explain simply what digital preservation is. The post suggests "Digital Preservation means sourcing computer-based material that is worthy of preservation, getting that material under control, and then maintaining the usefulness of that material, forever." [Some of these words may be easily misunderstood.]

The "OAIS standard is confusing" "but it has reached a level of maturity: it’s clear how much deep thought and expertise underpins it."  The digital preservation community is ready to take their ideas to a wider audience: "we perhaps just need to pitch them a little better".

Saturday, December 31, 2016

Managing the preservation and accessibility of public records from the past into the digital future

Managing the preservation and accessibility of public records from the past into the digital future.  Dean Koh. Open Gov.  30 November 2016.
     A post about the Public Record Office of the State Archives of Victoria. They have many paper records but now also a lot of born-digital records from governments, so it is a hybrid paper and digital archive. For accessibility purposes, paper records are digitised to provide online access. The Public Record Office also sets records management standards for government agencies across Victoria. "In the digital environment, there is not a lot of difference between records and information so that means we set standards in the area of information management as well." Access to records is a major focus, including equity of access in a digitally focused age.

"There’s a lot to access that isn’t necessarily ‘just digitise something’, there’s a lot of work to be done in addition to just digitising them. There’s capturing metadata about the digital images because again, if I just take photographs of a whole lot of things and send you the files, that’s not very accessible, you have to open each one and look at it in order to find the one that you want. So we have to capture metadata about each of the images in order to make them accessible so a lot of thinking and work goes into that."

Another issue around records, particularly born digital records, is the different formats used to create records in government. There are a "whole bunch of different technologies" used to create born digital records and the archives is trying to manage the formats and the records so that they "continue to remain accessible into the far future. So 50 years, a 100 years, 200 years, they still need to be accessible because those records are of enduring value to people of Victoria. So that’s a format issue and a format obsolescence issue."


Friday, December 30, 2016

How Not to Build a Digital Archive: Lessons from the Dark Side of the Force

How Not to Build a Digital Archive: Lessons from the Dark Side of the Force. David Portman. Preservica. December 21, 2016.
     This post is an interesting and humorous look at Star Wars archiving: "Fans of the latest Star Wars saga Rogue One will notice that Digital Archiving forms a prominent part in the new film. This is good news for all of us in the industry, as we can use it as an example of how we are working every day to ensure the durability and security of our content. Perhaps more importantly it makes our jobs sound much more glamorous – when asked 'so what do you do' we can start with 'remember the bit in Rogue One….'"

The Empire’s choice of archiving technology is not perfect and there are flaws in their Digital Preservation policy in many areas, such as security, metadata, redundancy, access controls, off site storage, and format policy. Their approaches are "hardly the stuff of a trusted digital repository!"

Thursday, December 29, 2016

Robots.txt Files and Archiving .gov and .mil Websites

Robots.txt Files and Archiving .gov and .mil Websites. Alexis Rossi. Internet Archive Blogs. December 17, 2016.
     The Internet Archive collects webpages "from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts". Do they ignore robots.txt files? Historically, sometimes yes and sometimes no, but the robots.txt file is less useful than it once was, particularly for web archiving efforts, and is becoming less so over time. Many sites do not actively maintain the files, and sites increasingly block crawlers with other technological measures; the robots.txt file is a relic of a different era. The best way for webmasters to exclude their sites is to contact archive.org and specify the exclusion parameters.
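For context, robots.txt directives look like the sketch below (the paths are hypothetical), and Python's standard library shows how a compliant crawler would evaluate them:

```python
from urllib import robotparser

# Hypothetical robots.txt content a site might serve.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks each URL before fetching it.
blocked = rp.can_fetch("*", "https://example.gov/private/report.html")
allowed = rp.can_fetch("*", "https://example.gov/public/report.html")
print(blocked)  # False
print(allowed)  # True
```

Archiving crawlers that ignore such directives simply skip this check, which is the policy choice the post describes.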

"Our end-of-term crawls of .gov and .mil websites in 2008, 2012, and 2016 have ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies."  The archived sites are available in the beta wayback. They have had little feedback at all on their efforts. "Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future."


Thursday, December 22, 2016

Securing Trustworthy Digital Repositories

Securing Trustworthy Digital Repositories. Devan Ray Donaldson, Raquel Hill, Heidi Dowding, Christian Keitel.  Paper, iPres 2016. (Proceedings p. 95-101 / PDF p. 48-51).
     Security is necessary for a digital repository to be trustworthy. This study looks at digital repository staff members’ perceptions of security for Trusted Digital Repositories (TDR) and explores:
  • Scholarship on security in digital preservation and computer science literature
  • Methodology of the sample, data collection, and analysis techniques
  • Report findings; discussion of implications of the study and recommendations
Security in the paper refers to “the practice of defending information from unauthorized access, use, disclosure, disruption, modification, perusal, inspection, recording or destruction”.  Three security principles mentioned are confidentiality, integrity, and availability.  Recent standards for TDRs show the best practices of the digital preservation community, including security as part of attaining formal “trustworthy” status for digital repositories. However, security can be hard to measure. Part of security is the threat modeling process, where "assets are identified; threats against the assets are enumerated; the likelihood and damage of threats are quantified; and mechanisms for mitigating threats are proposed". Understanding threats should be based on "historical data, not just expert judgment" to avoid unreliable data. The study discusses the Security Perception Survey, which "represents a security metric focused on the perceptions of those responsible for managing and securing computing infrastructures". 

Two standards, DIN 31644 and ISO 16363, draw upon DRAMBORA, an earlier standard, which consisted of six steps for digital repository staff members:
  1. Identify their objectives.
  2. Identify the central activities and assets necessary to achieve those objectives.
  3. Align and document risks to their activities and assets.
  4. Assess, avoid, and treat risks according to each risk’s probability, impact, owner, and remedy.
  5. Determine which threats are most likely to occur and identify the improvements required.
  6. Complete a risk register of all identified risks and the results of their analysis.
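The steps culminate in a risk register. A minimal sketch of one is below; the scoring scheme and example entries are illustrative, not taken from DRAMBORA itself:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    probability: int  # 1 (rare) to 5 (almost certain)
    impact: int       # 1 (negligible) to 5 (catastrophic)
    owner: str
    remedy: str

    @property
    def score(self) -> int:
        # A common heuristic: rank risks by probability x impact.
        return self.probability * self.impact

register = [
    Risk("Storage media failure", 3, 5, "IT", "Replicate to a second site"),
    Risk("Format obsolescence", 4, 3, "Curation", "Migrate to sustainable formats"),
]

# Treat the most severe risks first.
ranked = sorted(register, key=lambda r: r.score, reverse=True)
for risk in ranked:
    print(risk.description, risk.score)
```

Keeping the register as structured data makes it easy to re-rank as probabilities and impacts are reassessed.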
Security is a major issue for digital repositories. "Taken together, standards for TDRs underscore the importance of security and provide relatively similar recommendations to digital repository staff members about how to address security." Participants in this study found the security criteria in the standard that they chose sufficient.

Wednesday, December 21, 2016

We Are Surrounded by Metadata--But It’s Still Not Enough

We Are Surrounded by Metadata--But It’s Still Not Enough. Teresa Soleau. In  Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016.
     Many of their digital collections end up in their Rosetta digital preservation repository. Descriptive and structural information about the resources comes from many sources, including the physical materials themselves as they are being reformatted. "Metadata abounds. Even file names are metadata, full of clues about the content of the files: for reformatted material they may contain the inventory or accession number and the physical location, like box and folder; while for born-digital material, the original file names and the names of folders and subfolders may be the only information we have at the file level."

A major challenge is that the collection descriptions must be at the aggregate level because of the volume of materials, "while the digital files must exist at the item level, or even more granularly if we have multiple files representing a single item, such as the front and back of a photograph". The question is how to provide useful access to all the digital material with so little metadata. This can be overwhelming and inefficient if the context and content are difficult to recognize and understand. And "anything that makes the material easier to use now will contribute to the long-term preservation of the digital files as well; after all, what’s the point of preserving something if you’ve lost the information about what the thing is?"

Technical information about the files themselves serves as a fingerprint that helps verify a file hasn’t changed over time, in addition to tracking what has happened to files after they enter the archive. Software preservation, such as with the Software Preservation Network, is now being recognized as an important effort. Digital preservationists are working out who should be responsible for preserving which software. There are many preservation challenges yet to be solved in the years ahead.


Tuesday, December 20, 2016

File Extensions and Digital Preservation

File Extensions and Digital Preservation. Laura Schroffel. In  Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016
     The post looks at metadata challenges with digital preservation. Most of the born-digital material they work with exists on outdated or quickly obsolescing media, such as floppy disks, compact discs, hard drives, and flash drives that are transferred into their Rosetta digital preservation repository, and accessible through Primo.

"File extensions are a key piece of metadata in born-digital materials that can either elucidate or complicate the digital preservation process". The extensions describe format type, provide clues to file content, and indicate a file that may need preservation work. The extension is an external label that is human readable, often referred to as external signatures. "This is in contrast to internal signatures, a byte sequence modelled by patterns in a byte stream, the values of the bytes themselves, and any positioning relative to a file."

Their born-digital files are processed on a Forensic Recovery of Evidence Device (FRED), which can acquire data from many types of media, such as Blu-ray, CD-ROM, DVD-ROM, Compact Flash, Micro Drives, Smart Media, Memory Stick, Memory Stick Pro, xD Cards, Secure Digital Media and Multimedia Cards. The workstation also has the Forensic Toolkit (FTK) software, which can process a file and often indicate the file format type and software version. There are challenges, since file extensions are not standardized or unique: naming conflicts arise between types of software, and older Macintosh systems did not require file extensions. Also, because FRED and FTK originated in law enforcement, challenges arise when using them with cultural heritage objects.


Monday, December 19, 2016

Metadata Specialists Share Their Challenges, Defeats, and Triumphs

Metadata Specialists Share Their Challenges, Defeats, and Triumphs. Marissa Clifford. The Iris. October 17, 2016.
     "Metadata is a common thread that unites people with resources across the web—and colleagues across the cultural heritage field. When metadata is expertly matched to digital objects, it becomes almost invisible. But of course, metadata is created by people, with great care, time commitment, and sometimes pull-your-hair-out challenge."  At the Getty there are a number of people who work with metadata "to ensure access and sustainability in the (digital) world of cultural heritage—structuring, maintaining, correcting, and authoring it for many types of online resources." Some share their challenges, including:
Some notes from some of the challenges:
  • The metadata process had to be re-thought when they started publishing digitally, because the existing metadata machinery was designed for print books. It proved mostly useless for their online publications, so they started from scratch to find the best ways of sharing book metadata to increase discoverability. 
  • "Despite all of the standards available, metadata remains MESSY. It is subject to changing standards, best practices, and implementations as well as local rules and requirements, catalogers’ judgement, and human error." 
  • Another challenge with access is creating relevancy in the digital image repository 
  • Changes are needed in skills and job roles to make metadata repositories truly useful. 
  • "One of the potential benefits of linked open data is that gradually, institutional databases will be able speak to each other. But the learning curve is quite large, especially when it comes to integrating these new concepts with traditional LIS concepts in the work environment."

Thursday, December 15, 2016

DPN and uploading to DuraCloud Spaces

DPN and uploading to DuraCloud Spaces. Chris Erickson. December 15, 2016.
     For the past while we have been uploading preservation content into DuraCloud as the portal to DPN. DuraCloud can upload files by drag-and-drop, but a better way is with the DuraCloud Sync Tool. (The wiki had helpful information for setting this up.) The sync tool can copy files from any number of local folders to a DuraCloud Space, and can add, update, and delete files. I preferred running the GUI version in one browser window with the DuraCloud account in another.

We have been reviewing all of our long-term collections and assigning Preservation Priorities, Preservation Levels, and the number of Preservation Copies. From all this we decided on three collections to add to DPN, and created a Space (which goes into an Amazon bucket) for each. Each Space will then be processed into DPN:
  1. Our institutional repository, including ETDs which are now digitally born, and research information. From our ScholarsArchive repository
  2. Historic images that have been scanned; the original content is either fragile or not available. Exported from Rosetta Digital Archive.
  3. University audio files; the original content was converted from media that is at risk. Some from hard drives, others exported from Rosetta Digital Archive.
Some of the files were already in our Rosetta preservation archive, and some were in processing folders ready to be added to Rosetta. They all had metadata files with them. The sync tool worked well for uploading these collections: we configured the source location as the Rosetta folders and the target as the corresponding DuraCloud Space. Initially, the uploading was extremely slow, several days to load 200 GB. But DuraCloud support provided a newer, faster version of the sync tool, and we changed to a faster connection. The upload threads changed from 5 to 26, and we uploaded the next TB in about a day.
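For a rough sense of the speed-up, here is the arithmetic on the figures above (assuming "several days" means about three; the numbers are approximations from the post):

```python
SECONDS_PER_DAY = 24 * 3600

# Before: roughly 200 GB over about 3 days with 5 upload threads.
before_mb_s = 200 * 1000 / (3 * SECONDS_PER_DAY)
# After: roughly 1 TB in about a day with 26 upload threads.
after_mb_s = 1000 * 1000 / SECONDS_PER_DAY

print(round(before_mb_s, 1))  # 0.8 MB/s sustained
print(round(after_mb_s, 1))   # 11.6 MB/s sustained
```

That is roughly a fifteen-fold improvement from the newer sync tool, the faster connection, and the extra threads combined.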

We also had a very informative meeting with DPN and the two other universities in Utah that are DPN members, where Mary and Dave told us that the price per TB was now half the original cost. Also, that unused space could be carried over to the next year. This will be helpful in planning additional content to add. Instead of replicating our entire archive in DPN, we currently have a hierarchical approach, based on the number and location of copies, along with the priorities and preservation levels.


Wednesday, December 14, 2016

PDF/A as a preferred, sustainable format for spreadsheets?

PDF/A as a preferred, sustainable format for spreadsheets?  Johan van der Knijff. johan's Blog. 9 Dec 2016.
     The National Archives of the Netherlands published a report on preferred file formats, with an overview of their ‘preferred’ and ‘acceptable’ formats for 9 categories. The blog post concerns the ‘spreadsheet’ category, for which it lists the following ‘preferred’ and ‘acceptable’ formats:
  • Preferred:  ODS, CSV, PDF/A     
  • Acceptable: XLS, XLSX
And the justification / explanation for using PDF:
PDF/A – PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A
There are some problems with the choice of PDF/A and its justification:
  • Displayed precision is not equal to stored precision
  • Loss of precision after exporting to PDF/A
    • Also loss of precision after exporting to CSV
    • Use of cell formatting to display more precise data is possible, but less than ideal
  • Interactive content
  • Reading PDF/A spreadsheets: this may be difficult without knowing the intended users, the target software, the context, or how the user intends to use the spreadsheet
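The display-versus-storage issue is easy to demonstrate: a PDF/A or CSV export typically captures only the displayed digits, not the full stored value (illustrative Python, not spreadsheet-specific):

```python
stored = 1 / 3                 # the full floating-point value held in the cell
displayed = f"{stored:.2f}"    # what a formatted cell, and thus an export, shows

print(displayed)                   # 0.33
print(float(displayed) == stored)  # False: the exported value lost precision
```

Any format decision that flattens cells to their displayed form silently discards the extra digits.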
The justification states that some interactive functionality "will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A." However, deciding what functionality is ‘essential’ depends on the context and intended user base. In addition, the interactive aspect may imply that any spreadsheet that does not involve interaction with a user can be safely converted to PDF/A, but it may be better to make a distinction between ‘static’ and ‘dynamic’ spreadsheets.

There may be situations where PDF/A is a good, or maybe the best, choice, but selecting a preferred format should "take into account the purpose for which a spreadsheet was created, its content, its intended use and the intended (future) user(s)."


Monday, December 12, 2016

Harvesting Government History, One Web Page at a Time

Harvesting Government History, One Web Page at a Time.  Jim Dwyer. New York Times. December 1, 2016.
     With the arrival of any new president, large amounts of information on government websites are at risk of vanishing within days. Digital federal records, reports and research are very fragile. "No law protects much of it, no automated machine records it for history, and the National Archives and Records Administration announced in 2008 that it would not take on the job." Referring to government websites: “Large portions of dot-gov have no mandate to be taken care of. Nobody is really responsible for doing this.” The End of Term Presidential Harvest 2016 project is a volunteer, collaborative effort by a small group of university, government and nonprofit libraries to find and preserve valuable pages that are now on federal websites. The project began before the 2008 elections. Harvested content from previous End of Term Presidential Harvests is available at http://eotarchive.cdlib.org/.

The project has two phases of harvesting:
  1. Comprehensive Crawl: The Internet Archive crawled the .gov domain in September 2016, and will crawl it again after the inauguration in 2017.
  2. Prioritized Crawl: The project team will create a list of related URLs and social media feeds.
The political changes at the end of presidential terms over the past 8 years have made a lot of people worried about the longevity of federal information.

Saturday, December 10, 2016

Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?

Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?  Yvonne Tunnat. Yvonne Tunnat's Blog. 29 Nov 2016.
     Post that describes an examination of the findings of two validation tools, JHOVE (version 1.14.6) and Bad Peggy (version 2.0); the latter scans image files for damage using the Java Image IO library. The goal of the test is to compare the findings from these validation tools and to know what to expect in digital curation work. There were 3,070 images in the test, which included images from Google's publicly available Imagetestsuite. Of these, 1,007 files had problems.

The JHOVE JPEG module can determine 13 different error conditions; Bad Peggy can distinguish at least 30 errors. The results of each are given in tables in the post. The problem images could not be opened or displayed, or had missing parts, mixed-up parts, and colour problems. The conclusion is that Bad Peggy was able to detect all of the visually corrupt images, while the JHOVE JPEG module missed 7 corrupt images out of 18.
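For context, one basic internal check such validators perform is verifying the JPEG start-of-image and end-of-image markers. A crude sketch (real tools parse the full marker structure, which is how they find the subtler errors):

```python
def has_jpeg_markers(data: bytes) -> bool:
    """Crude sanity check: SOI marker (FFD8) at start, EOI marker (FFD9) at end."""
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")

intact = b"\xff\xd8" + b"\x00" * 10 + b"\xff\xd9"
truncated = intact[:-2]  # simulates a file cut off mid-stream

print(has_jpeg_markers(intact))     # True
print(has_jpeg_markers(truncated))  # False
```

A truncated download or a failed copy is exactly the kind of damage this catches, while colour-space and segment errors require the deeper parsing that JHOVE and Bad Peggy do.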

Thursday, December 08, 2016

OAIS: a cage or a guide?

OAIS: a cage or a guide? Barbara Sierman. Digital Preservation Seeds. December 3, 2016.    
     Post about the OAIS standard, asking whether it is a restriction or a guide. It covers the OAIS functional model, the data model, the metrics in OAIS, and related standards like the audit and certification standard. "OAIS is out there for 20 years and we cannot imagine where digital preservation would be, without this standard." It is helpful for discussing preservation by naming the related functions and metadata groups. But it lacks a link to implementation and application for daily activities. OAIS is a lot of common sense put into a standard. The audit and certification standard, ISO 16363, is meant to explain how compliance can be achieved, a more practical approach.

Many organisations are using this standard to answer the question "Am I doing it right?" People working with digital preservation want to know the approach that others are using and the issues that they have solved. The preservation community needs to "evaluate regularly whether the standards they are using are still relevant in the changing environment", and a continuous debate is required to do this. In addition, we need evidence that practical implementations that follow OAIS are the best way to do digital preservation. Proof of what worked and what did not work is needed in order to adapt standards; the DPC OAIS community wiki has been set up to gather thoughts related to the practical implementation of OAIS and to provide practical information about the preservation standards.


Monday, December 05, 2016

Digital Preservation Network - 2016

Digital Preservation Network - 2016. Chris Erickson. December 5, 2016.
     An overview of the reasons for DPN. Academic institutions require that their scholarly histories, heritage and research remain part of the academic record. This record needs to continue beyond the life spans of individuals, technological systems, and organizations. The loss of academic collections that are part of these institutions could be catastrophic. These collections, which include oral history collections, born digital artworks, historic journals, theses, dissertations, media and fragile digitizations of ancient documents and antiquities, are irreplaceable resources.

DPN is structured to preserve the stored content by using diverse geographic, technical, and institutional environments. The preservation process consists of:
  1. Content is deposited into the system through an Ingest Node, which is itself a preservation repository; 
  2. Content is replicated to at least two other Replicating Nodes and stored in different types of repository infrastructures; 
  3. Content is checked by bit auditing and repair services to prevent change or loss; 
  4. Changed or corrupted content is restored by DPN; 
  5. As Nodes enter and leave DPN, preserved content is redistributed to maintain the continuity of preservation services into the far-future.
The Ingest Node that we are using is through DuraCloud.
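Steps 2 through 4 of that process (replication to multiple nodes, bit auditing, and repair) can be sketched as follows. The node directories and repair logic here are simplified stand-ins for DPN's actual node infrastructure, which spans independent institutions:

```python
import hashlib
import pathlib
import shutil

def sha256(path) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def replicate(src, node_dirs):
    """Copy the content to each replicating node; return the expected digest."""
    expected = sha256(src)
    for node in node_dirs:
        shutil.copy(src, pathlib.Path(node) / pathlib.Path(src).name)
    return expected

def audit_and_repair(name, node_dirs, expected):
    """Bit audit: find corrupted copies and restore them from a good copy."""
    copies = [pathlib.Path(n) / name for n in node_dirs]
    good = [c for c in copies if sha256(c) == expected]
    bad = [c for c in copies if sha256(c) != expected]
    if not good:
        raise RuntimeError("all copies corrupted; cannot repair")
    for c in bad:
        shutil.copy(good[0], c)  # restore from an intact replica
    return len(bad)
```

The design point is that repair is only possible while at least one replica still matches the recorded digest, which is why DPN requires diverse geographic and institutional environments.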


Thursday, December 01, 2016

Implementing Automatic Digital Preservation for a Mass Digitization Workflow

Implementing Automatic Digital Preservation for a Mass Digitization Workflow. Henrike Berthold, Andreas Romeyke, Jörg Sachse.  Short paper, iPres 2016.  (Proceedings p. 54-56 / PDF p. 28-29). 
     This short paper describes their preservation workflow for digitized documents and the in-house mass digitization workflow, based on the Kitodo software, and the three major challenges encountered.
  1. validating and checking the target file format and the constraints to it,
  2. handling updates of content already submitted to the preservation system, 
  3. checking the integrity of all archived data in an affordable way
They produce several million scans a year and preserve these digital documents in their Rosetta-based archive, which is complemented by a submission application for pre-ingest processing, an access application that prepares the preserved master data for reuse, and a storage layer that ensures the existence of three redundant copies of the data in the permanent storage and a backup of data in the processing and operational storage. They have customized Rosetta operations with plugins they developed.  In the workflow, the data format of each file is identified and validated, and technical metadata are extracted. AIPs are added to the permanent storage (disk and LTO tapes). The storage layer, which uses hierarchical storage management, creates two more copies and manages them.

To ensure robustness, only single page, uncompressed TIFF files are accepted. They use the open-source tool checkit-tiff to check files against a specified configuration. To deal with AIP updates, files can be submitted multiple times: the first time is an ingest, all transfers after that are updates. Rosetta ingest functions can add, delete, or replace a file. Rosetta can also manage multiple versions of an AIP, so older versions of digital objects remain accessible for users.
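The core of that constraint (single-page, uncompressed TIFF) can be verified by reading the TIFF header and first image file directory (IFD) directly. This is a minimal stdlib sketch of the idea, not a reimplementation of checkit-tiff's full configuration-driven checks:

```python
import struct

def check_tiff(path):
    """Check that a TIFF is single-page and uncompressed
    (Compression tag 259 == 1). Returns (ok, reason)."""
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) < 8:
            return False, "truncated header"
        if header[:2] == b"II":
            endian = "<"  # little-endian byte order
        elif header[:2] == b"MM":
            endian = ">"  # big-endian byte order
        else:
            return False, "not a TIFF (bad byte order mark)"
        magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
        if magic != 42:
            return False, "not a TIFF (bad magic number)"
        f.seek(ifd_offset)
        (n_entries,) = struct.unpack(endian + "H", f.read(2))
        compression = 1  # TIFF default: uncompressed
        for _ in range(n_entries):
            tag, typ, count = struct.unpack(endian + "HHI", f.read(8))
            value = f.read(4)
            if tag == 259:  # Compression tag; SHORT value is left-justified
                (compression,) = struct.unpack(endian + "H", value[:2])
        (next_ifd,) = struct.unpack(endian + "I", f.read(4))
    if next_ifd != 0:
        return False, "multi-page TIFF"
    if compression != 1:
        return False, "compressed TIFF (Compression=%d)" % compression
    return True, "ok"
```

Rejecting files at this structural level keeps the archive's format profile narrow, which is exactly the robustness argument made above.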

They manage three copies of the data, which totals 120 TBs. An integrity check of all digital documents, including the three copies, is not feasible due to the time that is required to read all data from tape storage and check them. So to get reliable results without checking all data in the archive they use two different methods:

  • Sample method: a 1% sample of archival copies is checked yearly 
  • Fixed bit pattern method: a specified fixed bit pattern stored in the archive is checked quarterly.
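The yearly sample method can be sketched as a random draw over the stored inventory. The inventory format here (file path mapped to a known checksum) is a simplification of what a real archive records:

```python
import hashlib
import pathlib
import random

def sample_fixity_check(inventory: dict, fraction: float = 0.01, seed=None):
    """Check a random sample of archived files against their known checksums.

    inventory maps file path -> expected SHA-256 hex digest.
    Returns the list of paths whose current digest no longer matches.
    """
    rng = random.Random(seed)
    paths = sorted(inventory)
    k = max(1, int(len(paths) * fraction))  # at least one file per run
    failures = []
    for path in rng.sample(paths, k):
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if digest != inventory[path]:
            failures.append(path)
    return failures
```

Sampling trades certainty for cost: it will not catch every flipped bit, but over repeated runs it gives statistical confidence without reading all 120 TB from tape.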

Their current challenges include supporting new media types (digital video, audio, photographs, and PDF documents), unified pre-ingest processing, and automation of processes (e.g. performing tests of new software versions).


Wednesday, November 30, 2016

To Act or Not to Act - Handling File Format Identification Issues in Practice

To Act or Not to Act - Handling File Format Identification Issues in Practice. Matthias Töwe, Franziska Geisser, Roland E. Suri. Poster, iPres 2016.  (Proceedings p. 288-89 / PDF p. 145).
     Format identification output needs to be assessed within an institutional context, taking provenance information into account, to determine appropriate actions. This poster presents ways to address file format identification and validation issues that are mostly independent of the specific tools and systems employed; the authors use Rosetta, DROID, PRONOM, and JHOVE. Archives rely on technical file format information and therefore want to derive as much information about digital objects as possible before ingest. But issues occur in everyday practice, such as:
  • how to proceed without compromising preservation options
  • how to make efforts scalable 
  • issues with different types of data
  • issues related to the tool's internal logic
  • metadata extraction which is also format related
 The use cases vary depending on the customers, types of material, and formats, ranging from safeguarding research data for a limited period of time (ten years at minimum) to publishing and preserving data in the long term. Understanding the use cases’ characteristics provides "a better understanding of what actually matters most in each case."

Ideally, format identification should yield reliable and unambiguous information on the format of a given file, however a number of problems make the process more complicated. Handling files on an individual basis does not scale well. This may mean that unsatisfactory decisions need to be taken to keep the volume of data manageable. Some criteria to consider:
  • Usability: can the file be used as expected with standard software?
  • Tool errors: is an error known to be tool-related?
  • Understanding: is the error actually understood?
  • Seriousness: does the error concern the format's significant properties?
  • Correctability: is there a documented solution to the error?
  • Risk of correcting: what risks are associated with correcting the error?
  • Effort: what effort is required to correct the error?
  • Authenticity: is the file’s authenticity more relevant than format identification?
  • Provenance: can the data producer help resolve this and future errors?
  • Intended preservation: what solution is acceptable for lower preservation periods?
There are no simple rules to resolve these, so other considerations are needed to determine what actions to take:
  • Should format identification be handled at ingest or as a pre-ingest activity?
  • How to document measures taken to resolve identified problems?
  • Can unknown formats be admitted to the archive? 
  • Should the format identification be re-checked later? 
  • Do we rely on PRONOM or do we need local registries? 
  • How to preserve formats where no applications exist?
"Format validation can fail when file properties are not in accord with its format’s specification. However, it is not immediately clear if such deviations prevent current usability of a file or compromise the prospects for a file’s long term preservability." If the file is usable today, does that mean it is valid? Digital archives need to "balance the efforts for making files valid vs. making files pass validation in spite of known issues."

The failure to extract significant properties has no immediate consequences, and institutions need to decide if they will correct the issues, especially if the metadata is embedded and the file must be altered, which creates the risk of unknowingly introducing other changes. The authors of this poster act even more cautiously when fixing metadata extraction issues that require working with embedded metadata or file properties.

Tuesday, November 29, 2016

German hbz Consortium Selects Ex Libris Rosetta Digital Asset Management and Preservation Solution

German hbz Consortium Selects Ex Libris  Rosetta Digital Asset Management and Preservation Solution. Press Release. ProQuest. 29 November 2016.
     Hochschulbibliothekszentrum des Landes Nordrhein‑Westfalen has chosen the Ex Libris Rosetta digital asset management and preservation solution. More than 40 member institutions will be able to deposit digital collections in the central Rosetta system. “Our preservation and management plans across the entire North-Rhine Westphalia region include both artifacts and modern research output. With Rosetta, we will be able to preserve a wide range of data and manage digital assets on both the consortium and institutional level. Rosetta meets our current and long-term needs.” 

Tuesday, November 22, 2016

Every little bit helps: File format identification at Lancaster University

Every little bit helps: File format identification at Lancaster University.  Rachel MacGregor. Digital Archiving at the University of York. 21 November 2016.
   The post is about Rachel's work on identifying research data; it follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported previously. The aim was to understand the nature of research data and to inform their approaches to preservation. The summary of the statistics:
Of 24,705 files: 

  • 11,008 (44.5%) were identified by DROID and 13,697 (55.5%) were not.
  • 99.3% were given one file identification and 76 files had multiple identifications. 
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications. 
    • 50 of these were either 8-bit or 7-bit ASCII text files.  
    • The remaining 26 were identified by container as various types of Microsoft files.

Of the 11,008 identified files:

  • 89.34% were identified by signature
  • 9.2% were identified by extension
  • 1.46% identified by container
When adjusted for the 7,000 gzip files, the percentages identified were:
  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
These results differed from York's results, but not dramatically.

Only 38 files were identified as having a file extension mismatch (0.3%), but closer inspection may reveal more. Most of these were Microsoft files with multiple IDs, plus a set of lsm files identified as TIFFs. 

In all, 59 different file formats were identified; GZIP was the most frequently occurring, followed by XML.

Files that weren't identified
  • There were 13,697 files not identified by DROID, of which 4,947 (36%) had file extensions.  
  • The remaining 8,750 (64%) had no file extension. 
  • The most common unidentified file extensions were: dat, data, cell, and param.
Gathering this information helps contribute towards our overall understanding of file format types. "Every little bit helps."
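The signature/extension distinction DROID reports can be illustrated with a toy identifier: magic-byte signatures are tried first, with the filename extension as a weaker fallback. The signature table here covers only a handful of formats; DROID and PRONOM maintain far larger registries:

```python
import os

# A few magic-byte signatures (first bytes of the file).
SIGNATURES = [
    (b"\x1f\x8b", "GZIP"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"<?xml", "XML"),
]

def identify(path):
    """Return (format, method): identification by signature, extension, or neither."""
    with open(path, "rb") as f:
        head = f.read(8)
    for sig, fmt in SIGNATURES:
        if head.startswith(sig):
            return fmt, "signature"
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext:
        return ext, "extension"  # weaker evidence: the name, not the bytes
    return None, "unidentified"
```

This also shows why extension-only matches are counted separately in the statistics above: the bytes themselves were never confirmed.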

Monday, November 21, 2016

The Digital Preservation Gap(s)

The Digital Preservation Gap(s). somaya langley. Digital Preservation at Oxford and Cambridge. 18 November 2016.
     This is a broader comment on the field of digital preservation and the various gaps in the digital preservation field. Some of these are:
  • Silo-ing of different areas of practice and knowledge (developers, archivists etc.)
  • Lack of understanding of front-line staff working with born-digital materials 
  • Archivists, curators and librarians wanting a ‘magic wand’ to deal with ‘all things digital’
  • Tools that are limited or currently do not exist
  • Lack of knowledge to run the few available tools
  • Lack of knowledge of how to approach problem-solving
At iPres "the discussion still began with the notion that digital preservation commences at the point where files are in a stable state, such as in a digital preservation system (or digital asset management system). Appraisal and undertaking data transfers wasn’t considered at all, yet it is essential to capture metadata (including technical metadata) at this very early point. (Metadata captured at this early point may turn into preservation metadata in the long run.)" First-hand experiences of acquiring born-digital collections provide greater understanding of what it takes to do this type of work and will help in developing policies.

It is important to understand common real-world use cases and experiences in acquiring born-digital collections. "Archivists have an incredible sense of how to manage the relationship with a donor who is handing over their life’s work, ensuring the donor entrusts the organisation with the ongoing care of their materials", but preservationists who are traditionally trained archivists, curators, and librarians often lack technical skill sets. On the other hand, technologists lack first-hand experience liaising with donors. Both groups would benefit from each other's expertise. Sharing approaches to problem-solving is definitely important. The term ‘digital stewardship’ may be more helpful in acquiring and managing born-digital materials. 

Saturday, November 19, 2016

Software Sustainability and Preservation: Implications for Long-term Access to Digital Heritage

Software Sustainability and Preservation: Implications for Long-term Access to Digital Heritage. Jessica Meyerson, David Rosenthal, Euan Cochrane. Panel, iPres 2016.  (Proceedings p. 294-5 / PDF p. 148).

     Digital content requires software for interpretation, processing, and use, and sustaining software functionality beyond its normal life span is an issue. It may not be possible, economically or otherwise, for software vendors to maintain software long term. Virtualization and emulation are two techniques that may be viable options for long-term access to objects, and there are current efforts to preserve essential software needed to access or render digital content. These include the earlier KEEP Emulation Framework project and, more recently, the bwFLA Emulation as a Service (EaaS) project, which has demonstrated the ability to provide access to emulated and virtualized environments via a simple web browser and as part of operational archival and library workflows.

Memory institutions and software vendors have valuable digital heritage software collections that need to be maintained. A growing number of digital objects require software in order to be used and viewed. Yale University, the Society of American Archivists and others are working to resolve legal barriers to software preservation practices. The preservation community "continues to evolve their practices and strive for more comprehensive and complete technical registries to support and coordinate software preservation efforts".


Friday, November 18, 2016

Challenges and benefits of a collaboration of the Collaborators

Challenges and benefits of a collaboration of the Collaborators. William Kilbride, et al. Panel, iPres 2016.  (Proceedings p. 296-7 / PDF p. 149).
     The importance of collaboration in digital preservation has been emphasized by many professionals in the field. Because of rapid technological developments, the increase of digital material, and the growing complexity of digital objects, "no one institution can do digital preservation on its own". Digital preservation tasks and responsibilities have led to a network of relationships between various groups, such as the DPC and other institutions, founded as a “collaborative effort to get digital preservation on the agenda of key decision-makers and funders”. These organizations encourage collaboration, helping libraries, archives, museums, and experts work together to ensure the long-term preservation and accessibility of digital sources.

A logical next step is to establish a larger collaborative infrastructure to preserve all relevant digital data from the public sector. This would require storage facilities, but also knowledge and manpower to ensure proper management of the facilities. There must be agreement about which tasks and responsibilities can be performed by the institutions themselves, and which could be carried out in collaboration with others. This seems to be the right time to join forces, to be more effective in our work, and to share our experiences. This can help answer questions about prioritization, solutions and policies for the next steps in international collaboration.


Thursday, November 17, 2016

Preserving Data for the Future : Research Data Management in an Academic Library Consortium

Preserving Data for the Future : Research Data Management in an Academic Library Consortium. Alan Darnell. PASIG 2016.
     A presentation about managing and preserving data. Three major points:
1. Get the Data. It is important to get the data as soon as possible: the availability of research data declines rapidly, and time is the enemy of preservation.

2. Preserve the Data. The goal is to automate the transfer of data to a secure repository, in this case from Dataverse to Archivematica. There are issues that need to be resolved, such as scalability, file size, increasing volume of materials, unrecognized file types, etc., and there are tools that can help. The AIPs in a repository need continued management of formats and checksums.

3. Ensure that the Data is Usable. Reproducibility of results is a key measure of the usability of data. The data management process needs to capture more of the context of the research process that created the data, including software, metadata, and all research materials, including notebooks. A data management plan is important in this process.

Wednesday, November 16, 2016

A Doomsday Scenario: Exporting CONTENTdm Records to XTF

A Doomsday Scenario: Exporting CONTENTdm Records to XTF. Andrew Bullen. D-Lib Magazine. November/December 2016.
     Because of budgetary concerns, the Illinois State Library asked Andrew Bullen to explore how their CONTENTdm collections could be migrated to another platform. (The Illinois Digital Archives repository is based on CONTENTdm.) He chose methods that would allow him to quickly migrate the collections using existing tools, particularly PHP, Perl, and XTF, which they use as the platform for a digital collection of electronic Illinois state documents. The article shows the Perl code written, metadata, and record examples, and walks through the process. He started A Cookbook of Methods for Using CONTENTdm APIs. Each collection presented different challenges and required custom programming. He recommends reviewing the metadata elements of each collection, normalizing like elements as much as possible, and planning which elements can be indexed and how faceted browsing could be implemented. The test was to see if the data could be reasonably converted, so not all parts were implemented. In a real migration, CONTENTdm's APIs could be used as a data transfer medium.

Tuesday, November 15, 2016

Digital Preservation for Libraries, Archives, & Museums, a review

Digital Preservation for Libraries, Archives, & Museums. 2nd edition.  Edward M. Corrado, Heather Moulaison Sandy. Rowman & Littlefield. 2016.
     I don't usually include publisher reviews here, but I got to know Edward Corrado when he worked with the Rosetta system at Binghamton. I received an advance copy of this book and provided a review for it. This is a very thorough book on a very large topic and I thought the review worth including.

This very thorough and well researched book on digital preservation is for libraries, archives and museums of all sizes.  It covers a wide range of digital preservation topics which will prove useful for managers and technical staff alike.  The foreword to the book states that digital preservation is not a problem but an opportunity. The topics covered in this book help the reader understand how to implement these opportunities within their own organization. Digital preservation cannot be done in isolation from the rest of the organization; it needs to be an integral part of the whole. The authors demonstrate that with the proper resources and technical expertise, organizations can preserve "today's digital content long into the future". 

The table of contents of the book shows the range of topics covered:

Parts of the book:
I. Introduction to Digital Preservation,
II. Management Aspects,
III. Technology Aspects, and
IV. Content-Related Aspects.

Sections of the book
1. What is Digital Preservation? What it is not.
2. Getting Started with the Digital Preservation Triad: Management, Technology, Content
3. Management for Digital Preservation
4. The OAIS Reference Model
5. Organizing Digital Content
6. Consortia and Membership Organizations
7. Human Resources and Education
8. Sustainable Digital Preservation, financial factors
9. Digital Repository Software and Digital Preservation Systems
10. The Digital Preservation Repository and Trust
11.  Metadata for Digital Preservation
12. File Formats and Software for Digital Preservation
13. Emulation
14. Selecting Content
15. Preserving Research Data
16. Preserving Humanities Content
17. Digital Preservation of Selected Specialized Formats
Appendix A: Select Resources in Support of Digital Preservation

A few quotes and thoughts from the book that I thought especially useful:
  • three interrelated activities: management-related activities, technological activities, and content-centered activities.
  • technology cannot --- and should not --- be the sole concern of digital preservation. 
  • concerned with the life cycle of the digital object in a robust and all-inclusive way.
  • digital preservation is in many ways a management issue.  It requires interaction with the process and procedures of all parts of an organization.
  • Regardless of the role any particular staff member plays in digital preservation, one of the most important attributes required is passion for digital preservation.
  • Ultimately, digital preservation is an exercise in risk management.
  • Primarily, digital preservation is something that must be accepted on the basis of trust; organizations can help build trust using self-assessments, certification, and audit tools.
  • Digital preservation allows information professionals and those working in cultural heritage institutions to preserve, for the long-term, content that otherwise, if not cared for, would unquestionably be lost.
It helps to answer some basic questions:
  • How can I preserve the digital content available in my institution for the future?
  • What do I need to know to carry out this work?
  • How can I plan for the future in terms of the technology, human resources, and collections?
  • How do I know if I’m on the right track with my digital preservation efforts?


Monday, November 14, 2016

The (information) machine stops

The (information) machine stops. Gary McGath. Mad File Format Science Blog. March 14, 2016.
     The “Digital Dark Age” discussion comes up again.  Instead of asking what could trigger a Digital Dark Age, we ought to ask
  1. what conditions are necessary and sufficient for the really long-term preservation of information,
  2. what will minimize the risk of widespread loss of today’s history, literature, and news?
Our storage ability has increased but the durability of that storage has decreased. We deal with obsolescence and format, file, and device failures. "Anything we put on a disk today will almost certainly be unusable by 2050. The year 3016 just seems unimaginably far. Yet we still have records today from 1016, 16, and even 984 B.C.E. How can our records of today last a thousand years?"

The current practices rely on curation, migration, and hoping that storage providers will be around forever. Or that some institutions will take up the task of preservation and continue it forever. This requires "an unbroken chain of human activity to keep information alive". History shows that information is often neglected or destroyed, and in reality, only a tiny fraction has survived. "Today’s leading forms of digital storage simply can’t survive that degree of neglect." Abby Smith Rumsey writes, "The new paradigm of memory is more like growing a garden. Everything that we entrust to digital code needs regular tending, refreshing, and periodic migration to make sure that it is still alive, whether we intend to use it in a year, a hundred years, or maybe never." It is not a safe assumption that "things will always be the way they are today, maybe with some gradual improvement or decline, but nothing that will seriously disrupt the way we and future generations live."

However, we have to remember that people and information have survived many types of catastrophes. The original question in the post was "If an uninterrupted succession of custodians isn’t the best way to keep history alive, what is? The answer must be something that’s resilient in the face of interruptions." An important part of this is to avoid reliance on fragile protection; the keys are durability and decentralization. "The hard parts are avoiding physical degradation, hardware obsolescence, and format obsolescence. Physical durability isn’t out of reach. Devices like the M-disc have impressive durability."

"The way to address obsolescence is with designs simple enough that they can be reconstructed." We need decentralized archives in many places with different approaches. "The problem is solvable. The mistake is thinking that an indefinite chain of short-term solutions can add up to a long-term solution."

Related posts:

Wednesday, November 09, 2016

Autonomous Preservation Tools in Minimal Effort Ingest

Autonomous Preservation Tools in Minimal Effort Ingest. Asger Askov Blekinge, Bolette Ammitzbøll Jurik, Thorbjørn Ravn Andersen.  Poster, iPres 2016.  (Proceedings p. 259-60 / PDF p. 131).
     This poster presents the concept of Autonomous Preservation Tools developed by the State and University Library, Denmark, an expansion of their idea of Minimal Effort Ingest. In Minimal Effort Ingest, incoming data is secured quickly, even when resources are sparse, and most preservation actions are handled within the repository later, when resources become available, rather than by a static ingest workflow.

From these concepts they created the idea of Autonomous Preservation Tools, which are more like software agents than a static workflow system. The process is more flexible and allows for easy updates or changes to the workflow steps. A fixed workflow is replaced with a decentralised, implicit workflow which defines the set of events that an AIP must go through. Rather than a static workflow that must process AIPs in a fixed way, the Autonomous Preservation Tools "can discover AIPs to process on their own". Because AIPs maintain a record of past events, tools can determine whether an AIP has been processed or whether other tool actions must be performed first. The workflow thus becomes the tools finding and processing items until every item has been processed, an alternative method of processing.
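The event-history mechanism can be sketched as follows; the AIP structure and event names here are invented for illustration, not taken from the library's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AIP:
    aip_id: str
    events: list = field(default_factory=list)  # record of past events

class AutonomousTool:
    """A tool that discovers AIPs to process on its own, using each
    AIP's event history instead of a fixed workflow position."""
    def __init__(self, name, requires, produces):
        self.name = name
        self.requires = set(requires)  # events that must already exist
        self.produces = produces       # event recorded after processing

    def ready(self, aip):
        done = set(aip.events)
        return self.requires <= done and self.produces not in done

    def run(self, aips):
        for aip in aips:
            if self.ready(aip):
                # ... perform the actual preservation action here ...
                aip.events.append(self.produces)
```

Because each tool checks prerequisites itself, tools can run in any order and at any time; the implicit workflow emerges from the event dependencies rather than from a central scheduler.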

Establishing Digital Preservation At the University of Melbourne

Establishing Digital Preservation At the University of Melbourne. Jaye Weatherburn. Poster, iPres 2016.  (Proceedings p. 274-5 / PDF p. 138).
     The University of Melbourne’s Digital Preservation Strategy is to make the "University’s digital product of enduring value available into the future, thus enabling designated communities to access digital assets of cultural, scholarly, and corporate significance over time". The long-term, ten-year vision of their strategy looks at four interrelated areas in phases over the next three years:
  1. Research Outputs
  2. Research Data and Records
  3. University Records
  4. Cultural Collections
The key principles around which action is required: Culture, Policy, Infrastructure, and Organization. The University’s research strategy recognizes the importance of their digital assets by declaring that "the digital research legacy of the University must be showcased, managed, and preserved into the future". The project team members need to start a comprehensive advocacy campaign to illustrate the importance of preservation. Instead of digital preservation being perceived as a bureaucratic and financial burden, it needs to be seen as a useful tool for academic branding and profiling, as well as important for the long-term sustainability of their research.