Tuesday, July 07, 2015

Collection, Curation, Citation at Source: Publication@Source 10 Years On

Collection, Curation, Citation at Source: Publication@Source 10 Years On. Jeremy G. Frey, et al. International Journal of Digital Curation. Vol 10, No 2, 2015.
   The article describes a scholarly knowledge cycle in which the accumulation of knowledge is based on the continuing use and reuse of data and information. Collection, curation, and citation are three processes intrinsic to the workflows of the cycle. The currency of collection, curation, and citation is metadata. "Policies should recognize that small amounts of adequately characterized, focused data are preferable to large amounts of inadequately defined and controlled data stored in a random repository." The increasing size of datasets and the growing risk of loss through catastrophic failure (such as a disk failure) have led researchers to use cloud storage, perhaps too uncritically.

The responsibilities of researchers for meeting the requirements of sound governance and ensuring the quality of their work have become more apparent. The article places the responsibility for curation firmly with the originator of the data. "Researchers should organize their data and preserve it with semantically rich metadata, captured at source, to provide short- and long-term advantages for sharing and collaboration."  Principal Investigators, as custodians, are particularly responsible for clinical data management and security (though curation and preservation activities exist in other research roles). "Curators usually attempt to add links to the original publications or source databases, but in practice, provenance records are often absent, incomplete or ad hoc, often despite curators’ best efforts. Also, manually managed provenance records are at higher risk of human error or falsification." There is a pressing need for training and education to encourage researchers to curate the data as they collect it at source.

"All science is strongly dependent on preserving, maintaining, and adding value to the research record, including the data, both raw and derived, generated during the scientific process. This statement leads naturally to the assertion that all science is strongly dependent on curation."

Monday, July 06, 2015

TIFF/A

TIFF/A. Gary McGath. File Formats Blog.  July 3, 2015.
   The TIFF format has been around for a long time. There have been many changes and additions, such that "TIFF today is the sum of a lot of unwritten rules". A group of academic archivists has been working on a long-term readable version, calling it TIFF/A. A white paper discusses the technical issues. Discussions starting in September aim to produce a version to submit for ISO consideration.

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models. Michael J. Bennett. University of Connecticut. May 21, 2015.
    Photogrammetry allows for the creation of 3D objects from 2D photography, mimicking human stereo vision. There are many steps in the process: images, masks, depth maps, models, and textures. The question is: what should be archived for long-term digital preservation? When models are output into an open standard, there is data loss, since “native 3D CAD file formats cannot be interpreted accurately in any but the original version of the original software product used to create the model.”

General lessons from archiving CAD files are that, when possible, the data should be normalized into open standards, but native formats, which are often proprietary, should also be archived. For photogrammetry data, the author reviews some of the options and recommendations. There are difficulties with archiving the files, and also with organizing them in a container that documents the relationships among them. Digital repositories can play a role in the preservation of these 3D datasets.

Friday, July 03, 2015

Australian electronic books to be preserved at the National Library in Canberra under new laws

Australian electronic books to be preserved at the National Library in Canberra under new laws. Clarissa Thorp. ABC. 3 July 2015.
Starting in January of next year, digital materials including e-books, blogs, prominent websites, and important social media messages will be collected as a snapshot of Australian life. Under existing copyright laws, the National Library of Australia is able to collect all books produced by local publishers through the legal deposit system. Now with new legislation adopted by the Federal Parliament the Library will be able to preserve published items from the internet that could disappear from view in future. "This legislation puts us in a position where we are able to ask publishers to deposit electronic material with the National Library in a comprehensive way." "So we will be able to open that up and collect the whole of the Australian domain, for websites for example it means we are able to collect e-books that are only published in digital form." This new legislation will expand the Library's digital preservation program and ensure that future collections reflect Australian society as a whole.

Thursday, July 02, 2015

Vatican Library digitizes ancient manuscripts, makes them available for free

Vatican Library digitizes ancient manuscripts, makes them available for free. Justin Scuiletti.  PBS NewsHour. October 22, 2014.
The Vatican Apostolic Library is digitizing its archive of ancient manuscripts and making them available to view. It is undertaking an extensive digital preservation project for its 82,000 documents. The entire undertaking is expected to take at least 15 years and cost more than $63 million. “Technology gives us the opportunity to think of the past while looking towards the future, and the world’s culture, thanks to the web, can truly become a common heritage, freely accessible to all, anywhere and any time.” The current list of digitized manuscripts can be viewed through the Vatican Library website and the project website.

Wednesday, July 01, 2015

Over 28 exabytes of storage shipped last quarter

More than 28 billion gigabytes of storage shipped last quarter. Lucas Mearian. Computerworld. June 30, 2015.
Worldwide data storage hardware sales increased 41.4% over the same quarter in 2014; 28.3 exabytes of capacity were shipped this past quarter. Sales of traditional external arrays decreased while demand strongly increased for server-based storage and hyperscale infrastructure (distributed infrastructures that support cloud and big data processing, and can scale to thousands of servers). The largest revenue growth was in the server market (new server sales, not just upgrades to existing server infrastructures). The most popular external storage arrays were all-flash models and hybrid flash arrays that combine NAND flash with hard disk drives.

Tuesday, June 30, 2015

National Archives kicks off 'born-digital' transfer

National Archives kicks off 'born-digital' transfer. Mark Say. UKAuthority. 24 June 2015.
The National Archives is looking at the long term issue of keeping records accessible as the technology in which they are originally created changes.

"To make sure born-digital records can be permanently preserved we’re engaged in what we call parsimonious preservation, in which we’re making sure it can be used by the next trends of technology being developed. We want them to be easily viewed in 10 years’ time, although we cannot plan for 100 years as there’s no way we can know what the technology will look like."

“To ensure records will still be used in the same way we want to see what the technology is going to do in the next 10 years.

“Digital preservation is a major international challenge. Digital technology is changing what it means to be an archive and we are responding to these changes.

“These records demonstrate how we are leading the archive sector in embracing the challenges of storing digital information for future generations. We are ensuring that we are ready to keep the nation’s public records safe and accessible for the future, whatever their format.”

Monday, June 29, 2015

File identification tools, part 5: FITS

File identification tools, part 5: FITS. Gary McGath. File Formats Blog.  June 25, 2015.
The File Information Tool Set (FITS), which aggregates results from several file identification tools, was created by the Harvard University Libraries and is available on GitHub. FITS uses Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard tools. The tool can be used in the ingest process; it processes directories and subdirectories, and produces a single XML output file in various schemas. It can be run as a standalone tool or incorporated with other tools, and can be configured to determine which tools to run and which extensions to examine. Documentation is found on Harvard’s website.
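As a sketch of how the consolidated XML that FITS produces might be consumed downstream, the following parses a simplified, namespace-free identification report. This is an illustration only: real FITS output uses its own XML namespace and a richer structure, so the element and attribute names here are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sketch of a FITS-style identification report.
# Real FITS output is namespaced and more detailed; this shape is illustrative.
sample = """<fits>
  <identification>
    <identity format="Portable Network Graphics" mimetype="image/png">
      <tool toolname="Droid" toolversion="6.1"/>
      <tool toolname="Jhove" toolversion="1.11"/>
    </identity>
  </identification>
</fits>"""

root = ET.fromstring(sample)
for identity in root.iter("identity"):
    # Collect which underlying tools agreed on this identification
    tools = [t.get("toolname") for t in identity.iter("tool")]
    print(identity.get("format"), identity.get("mimetype"), tools)
# → Portable Network Graphics image/png ['Droid', 'Jhove']
```

Aggregating per-tool agreement like this is the main reason to prefer FITS output over any single identifier's report.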

SIRF: Self-contained Information Retention Format

SIRF: Self-contained Information Retention Format. Sam Fineberg, et al. SNIA Tutorial. 2015. [PDF]
Generating and collecting very large data sets that need to be kept for long periods is a necessity for many organizations, including the sciences, archives, and commerce. The presentation describes the challenges of keeping data for the long term, and how Linear Tape File System (LTFS) technology and the Self-contained Information Retention Format (SIRF) address them. The top external factors driving long-term retention requirements are: legal risk, compliance regulations, business risk, and security risk.

What does long-term mean? Retention of 20 years or more is required by 70% of the responses in a poll.
  • 100 years: 38.8%
  • 50-100 years: 18.3%
  • 21-50 years: 31.1%
  • 11-20 years: 15.7%
  • 7-10 years: 12.3%
  • 3-5 years: 1.9%
The need for digital preservation:
  • Regulatory compliance and legal issues
  • Emerging web services and applications
  • Many other fixed-content repositories (Scientific data, libraries, movies, music, etc.)
Data stored should remain accessible, undamaged, and usable for as long as desired and at an affordable cost; what is affordable depends on the "perceived future value of information". There are problems with verifying the correctness and authenticity of semantic information over time. SIRF is the digital equivalent of a self-contained archival box. It contains:
  • set of preservation objects and a catalog (logical or physical)
  • metadata about the contents and individual objects
  • self describing standard catalog information so it can all be maintained
  • a "magic object" that identifies the container and version
The metadata contains basic information that can vary depending on the preservation needs. It allows a deeper description of the objects along with the content meaning and the relationships between the objects.

When preserving objects, we need to keep all the information to make them fully usable in the future. No single technology will be "usable over the time-spans mandated by current digital preservation needs". LTFS technologies are "good for perhaps 10-20 years".
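The idea of a self-contained archival box can be sketched as a directory holding payload objects, a catalog carrying per-object fixity metadata, and a "magic object" identifying the container. Everything below (file names, catalog fields, the container identifier) is an illustrative assumption, not the actual SIRF specification.

```python
import hashlib
import json
import pathlib
import tempfile

def write_container(payloads):
    """Write a self-describing container: payload objects, a catalog listing
    each object with a fixity checksum, and a 'magic object' naming the
    container format and version. Layout and names are illustrative only."""
    root = pathlib.Path(tempfile.mkdtemp())
    catalog = []
    for name, data in payloads.items():
        (root / name).write_bytes(data)
        catalog.append({
            "name": name,
            "sha256": hashlib.sha256(data).hexdigest(),  # fixity for later audit
            "size": len(data),
        })
    # Catalog describes the contents so the box can be maintained on its own
    (root / "catalog.json").write_text(json.dumps(catalog, indent=2))
    # "Magic object" identifies the container type and version
    (root / "MAGIC").write_text("example-container/1.0\n")
    return root

box = write_container({"report.txt": b"raw observations"})
print(sorted(p.name for p in box.iterdir()))
# → ['MAGIC', 'catalog.json', 'report.txt']
```

The point of the sketch is that nothing outside the directory is needed to verify or interpret it, which is what makes such a box candidate for migration across storage generations.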

Saturday, June 27, 2015

Russian Official Proposes International Investigation Into U.S. Moon Landings. Cultural Preservation?


Russian Official Proposes International Investigation Into U.S. Moon Landings. Ingrid Burke. The Moscow Times.  June 16, 2015.
Russia's Investigative Committee spokesman, Vladimir Markin, called for an international investigation to (among other things) solve the mystery of the disappearance of film footage from the original moon landing in 1969. "But all of these scientific - or perhaps cultural - artifacts are part of the legacy of humanity, and their disappearance without a trace is our common loss. An investigation will reveal what happened."

 [Interesting that the political wranglings have now reached the level of historical archiving and cultural preservation.]
 

Friday, June 26, 2015

ARSC Guide to Audio Preservation

ARSC Guide to Audio Preservation. Sam Brylawski, et al. National Recording Preservation Board of the Library of Congress. May 2015. [PDF, 252 pp.]
CLIR, with the Association for Recorded Sound Collections (ARSC) and the National Recording Preservation Board (NRPB) of the Library of Congress, has published CLIR Publication No. 164, an excellent guide to audio preservation.
"Our audio legacy is at serious risk because of media deterioration, technological obsolescence, and, often, lack of accessibility. This legacy is remarkable in its diversity, ranging from wax cylinders of extinct Native American languages to tapes of local radio broadcasts, naturalists’ and ethnographers’ field recordings, small independent record company releases, and much more. These recordings are held not by a few large organizations, but by thousands of large and small institutions, and by individuals. The publishers hope that this guide will support and encourage efforts at all institutions to implement best practices to help meet the urgent challenge of audio preservation."

Chapters include:

  • Preserving Audio (Recorded Sound at Risk, Preservation Efforts, Roles)
  • Audio Formats: Characteristics and Deterioration (Physical, digital)
  • Appraisals and Priorities (Tools; Selection/collection policies, decisions)
  • Care and Maintenance (Handling, assessment) and arrangement
  • Description of Audio Recordings (Metadata, standards, tools)
  • Preservation Reformatting (Conversion to digital files, metadata, funding)
  • Digital Preservation and Access: Process, storage infrastructure
  • Audio Preservation: The Legal Context (Copyright, control, donor agreements)
  • Disaster Prevention, Preparedness, and Response
  • Fair Use and Sound Recordings Lessons
Some notes from reading the publication:
  • the ultimate goals of preservation are sustained discovery and use
  • all these dissimilar recordings together represent an audio DNA of our culture
  • our enjoyment of the recordings has far exceeded our commitment to preserve them
  • history is represented in sound recordings; it entertains and enriches us
  • if compressed files are the only versions available to the public, we have no assurances that anyone is maintaining the higher fidelity originals
  • efforts of large and small institutions and private collectors are needed to make a meaningful dent in the enormous volume of significant recordings not yet digitized for preservation
  • if we are to preserve our audio legacy, all institutions with significant recordings must be part of the effort
  • proactive attention, care, and planning are critical to the future viability and value of both analog and digital recordings
  • institutions often have more items in their care than they have resources for adequate processing, cataloging, and preservation
  • the potential technical obsolescence of the hardware to play a recording should influence priorities and resources allocated for preservation
  • perhaps the most crucial feature of a metadata schema is its degree of interoperability for sharing, searching, harvesting, and transformation or migration
  • the preservation choice is not binary "either we implement intensive preservation immediately and forever; or we do nothing". We should not delay action because the ideal cannot be achieved
  • preservation metadata is the information needed to support the long-term management and usability of an object
  • the Broadcast Wave Format (BWF) is the de facto standard for digital audio archiving
  • monitoring and planning to avoid obsolescence are important aspects of a solid digital preservation strategy
  • audio preservation is an ongoing process that may be challenging and intimidating; setting priorities is central to a successful preservation strategy
  • digital preservation will enable the fulfillment of the goal of long-term use (whether focused on education, scholarship, broadcasting, marketing, or sales)
  • ensure that there is at least one geographically separate copy of all digital content
  • recognize the use of sound recordings as sources of information by students and researchers
  • libraries and memory institutions should provide points of cultural reference for the current generation of creators
Several free, open source software tools are available for:
  • assessing audio collections for the purpose of setting preservation priorities
    • The Field Audio Collection Evaluation Tool (FACET)
    • Audio/Video Survey
    • Audiovisual Self-Assessment Tool (AvSAP)
    • MediaSCORE and MediaRIVERS
  • metadata tools
    • CollectiveAccess
    • Audio-Visual and Image Database (AVID)
    • AudioVisual Collaborative Cataloging (AVCC)
    • PBCore
 "When libraries, archives, and museums exercise their legal rights to preserve and facilitate access to information, even without permission or payment, they are furthering the goals of copyright."

"The professional management of a collection requires the development of criteria for selecting and preserving collections of sound recordings. A selection or collection development policy defines and sets priorities for the types of collections that are most appropriate and suitable for an organization to acquire and to preserve. The basis for these criteria should be the goals and objectives of the individual institution."

Thursday, June 25, 2015

Arizona State University and Northern Arizona University Select Ex Libris Rosetta.

Arizona State University and Northern Arizona University Select Ex Libris Rosetta. Ex Libris Press Release. June 25, 2015.
Arizona State University and Northern Arizona University have adopted the Rosetta digital asset management and preservation solution. Rosetta will enable the libraries to manage and preserve their digital collections, including born-digital objects such as web sites and research data, in perpetuity. With Rosetta, the three institutions will be able to implement the solution together and work off one infrastructure, providing end-to-end digital asset management and preservation for the vast array of assets in all of their libraries.

The two Arizona schools join the University of Arizona, already a Rosetta customer, to provide shared digital asset management and preservation service for all public higher education in the state.

Storing Digital Data for Eternity

Storing Digital Data for Eternity. Betsy Isaacson. Newsweek Tech & Science. June 22, 2015.
“People think by digitizing photographs, maps, we have preserved them forever, but we’ve only preserved them forever if we can continue to read the bits that encode them.” An example of data loss is NASA's Viking probes, where mission data were saved on magnetic tape. After 10 years, no one had the skills or software to read the data, and a portion of the data was permanently lost. The moral of this is to be skeptical of the promises of technology. Cloud technologies may feel safe, but there is no guarantee that the data will continue to exist.

There are some projects underway to build storage for digital data that doesn’t degrade. Some of these use quartz glass (ultra-expensive, requiring lasers that cost over $100,000); DNA (too slow at loading data to be practical, so complex that only specialized labs can manage it, and as volatile as magnetic tape); metal-etched disks that can be read with an optical microscope; and the Long Server, an ever-growing database of file-conversion resources. There is also Vint Cerf's suggestion of creating “digital vellum,” a technique for packing and storing digital files along with all the code needed to decode them.

Wednesday, June 24, 2015

Rosetta - version 4.2 released

Rosetta - version 4.2 released. Ex Libris. June 22, 2015.
The latest release of Rosetta is now available. It contains many system improvements and updates. Some of these include:
  • Enhanced ability for depositing large SIPs containing multiple files
  • Improved security features
  • Improved deposit functionality
  • Publishing of Itemized Sets
  • SIP load management 

Monday, June 22, 2015

Why Libraries Matter More Than Ever in the Age of Google.

Why Libraries Matter More Than Ever in the Age of Google. Amien Essif. AlterNet. May 23, 2015.
This article is in response to the book BiblioTech: Why Libraries Matter More Than Ever in the Age of Google. Of all the public and private institutions we have, the public library is the truest democratic space. The library’s value is obvious.  A Gallup survey found that libraries are not just popular, they are extremely popular. "Over 90% of Americans feel that libraries are a vital part of their communities, compared to 53% for the police, 27% for public schools, and 7% for Congress. This is perhaps the greatest success of the public sector."

Yet, a government report showed that while the nation’s public libraries served 298 million people in 2010 (96% of the U.S. population) funding has been cut drastically. “It seems extraordinary that a public service with such reach should be, in effect, punished despite its success.” Libraries are becoming more important, not less, to our communities and our democracy.

About 90% of all existing data is less than two years old.  Much of the information could be moderated for the public good, and libraries are able to do that. However, tech companies have put themselves into this role; "the risk of a small number of technically savvy, for-profit companies determining the bulk of what we read and how we read it is enormous."

Libraries are at risk because politicians are moving away from the public good, "favoring private enterprise and making conditions ripe for a Google-Apple-Amazon-Facebook oligopoly on information."
"It’s not too much of a stretch to say that the fate of well-informed, open, free republics could hinge on the future of libraries.”

Saturday, June 20, 2015

PREMIS Data Dictionary for Preservation Metadata, Version 3.0

PREMIS Data Dictionary for Preservation Metadata, Version 3.0. Library of Congress. June 10, 2015. [Full PDF]
The PREMIS Data Dictionary and its supporting documentation is a comprehensive, practical resource for implementing preservation metadata in digital archiving systems. The Data Dictionary is built on a data model that defines five entities: Intellectual Entities, Objects, Events, Rights, and Agents. Each semantic unit defined in the Data Dictionary is a property of one of the entities in the data model.

The new publications are:
  • PREMIS Data Dictionary. Version 3.0. This is the full document which includes the PREMIS Introduction, the Data Dictionary, Special Topics, and Glossary.
  • PREMIS Data Dictionary. This document contains only the Data Dictionary and the introductory materials
  • Hierarchical Listing of Semantic Units: PREMIS Data Dictionary, Version 3.0
  • The Version 3.0 PREMIS Schema is not yet available
Version 3 of the Data Dictionary includes some major changes and additions to the Dictionary, which are:
  • Reposition Intellectual Entity as a category of Object to enable additional description within PREMIS and linking to related PREMIS entities.
  • Reposition Environments (i.e. hardware and software needed to use digital objects) so that they can be described and preserved reusing the Object entity. That is to say, they can be described as Intellectual Entities and preserved as Representation, File or Bitstream Objects.
  • Add physical Objects to the scope of PREMIS so that they can be described and related to digital objects.
  • Add a new semantic unit to the Object entity: preservationLevelType (O, NR), to indicate the type of preservation functions expected to be applied to the object for the given preservation level.
  • Add a new semantic unit to the Agent entity to express the version of software Agents: agentVersion (O, NR).
  • Add a new semantic unit to the Event entity: eventDetailInformation (O, R)

There are major additions in the “PREMIS Data Model” and “Environment” sections.

The entities in the PREMIS data model are:
  • Object: a unit subject to digital preservation. This can now be an environment.
  • Environment: technology supporting a digital Object. This can now be described as an Intellectual Entity.
  • Event: an action concerning an Object or Agent associated with the preservation repository.
  • Agent: entity associated with Rights, Events, or an environment Object.
  • Rights Statement: Rights or permissions pertaining to an Object and/or Agent.
With the advent of Intellectual Entities in PREMIS 3.0, environments have been transformed. "Before version 3.0, there was an environment container within an Object that described the environment supporting that Object. If a non-environment Object needs to refer to an environment, it is now recommended that the environment is described as an Object in its own right and the two Objects are linked with a dependency relationship."
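The recommended pattern of describing an environment as its own Object and linking it with a dependency relationship can be sketched schematically. The element names below loosely follow PREMIS vocabulary, but this is an illustrative sketch, not schema-valid PREMIS 3.0 XML; identifiers and the flat structure are assumptions.

```python
import xml.etree.ElementTree as ET

# Schematic only: element names loosely follow the PREMIS vocabulary,
# but this is not schema-valid PREMIS 3.0 XML.
def make_object(obj_id, category):
    obj = ET.Element("object", {"category": category})
    ET.SubElement(obj, "objectIdentifier").text = obj_id
    return obj

premis = ET.Element("premis")
data = make_object("file-001", "file")
# The environment is now an Object in its own right (an Intellectual Entity)
env = make_object("env-001", "intellectualEntity")

# Link the file Object to the environment it depends on
rel = ET.SubElement(data, "relationship")
ET.SubElement(rel, "relationshipType").text = "dependency"
ET.SubElement(rel, "relatedObjectIdentifier").text = "env-001"

premis.extend([data, env])
print(ET.tostring(premis, encoding="unicode"))
```

Compared with the pre-3.0 approach of embedding an environment container inside each Object, linking two first-class Objects lets one environment description be shared by many digital objects.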