
Saturday, March 09, 2019

What to Keep: A Jisc research data study

What to Keep: A Jisc research data study. Neil Beagrie. Jisc. February 2019.  [PDF]
     This study looks at research data management, specifically the appraisal and selection of data, an issue that has become more significant in recent years as volumes of data have grown. "The purpose is to provide new insights that will be useful to institutions, research funders, researchers, publishers, and Jisc on what research data to keep and why, the current position, and suggestions for improvement."

"Not all research data is the same: it is highly varied in terms of data level; data type; and origin. In addition, not all disciplines are in the same place or have identical needs."

"It is essential to consider not only What and Why to keep data, but for How Long to keep it, Where to keep it, and increasingly How to keep it in ways that reflects its potential value, cost, and available funding."

The study lists ten recommendations:
  1. Consider what is transferable between disciplines. Support adoption of effective practice via training, technologies, case studies, and guidance checklists.
  2. Bring communities together with workshops to evolve disciplinary norms 
  3. Harmonise funder requirements for research data where relevant
  4. Investigate the costs and benefits of curation levels, storage, or appraisal for what to keep
  5. Implement the FAIR principles as appropriate for kept data.  
  6. Enhance data discoverability by recording and identifying data generated by research projects in existing research databases.
  7. Require Data Access Statements in all published research articles where data is used as evidence, and encourage adoption of the Transparency and Openness Promotion (TOP) guidelines 
  8. Improve incentives and lower the barriers for data sharing.
  9. Increase publisher and funder collaborations around research data. 
  10. Improve communication on what research data management costs can be funded and by whom
Definition of research: "a process of investigation leading to new insights, effectively shared. It includes work of direct relevance to the needs of commerce, industry, and to the public and voluntary sectors; scholarship ...; the invention and generation of ideas, images, performances, artefacts including design, where these lead to new or substantially improved insights; and the use of existing knowledge in experimental development to produce new or substantially improved materials, devices, products and processes, including design and construction.”

Other notes from the study:
  • Costs of research data management are seen as too high
  • Obsolescence of data formats or software

The volume of research data and the number of new research data services and repositories is increasing.

"The high-level principles for research data management may be established but the everyday practice and procedures for the full-range of research data, what and why to keep, for how long, and where and how to keep it, are still evolving."

“All those engaged with research have a responsibility to ensure the data they gather and generate is properly managed, and made accessible, intelligible, assessable and usable by others unless there are legitimate reasons to the contrary. Access to research data therefore carries implications for cost and there will need to be trade-offs that reflect value for money and use.”

The Core Trustworthy Data Repositories Requirements notes four curation levels that can be performed by trusted repositories:
a. As deposited
b. Basic curation eg, brief checking, addition of basic metadata or documentation
c. Enhanced curation eg, conversion to new formats, enhancement of documentation
d. Data level curation (as in C above, with additional editing of data for accuracy)


Thursday, March 02, 2017

A lifetime in the digital world

A lifetime in the digital world. Helen Hockx. Blog: Things I cannot say in 140 characters.
February 15, 2017.
     A very interesting post about papers donated to the University of Notre Dame in 1996, and how the library has been dealing with the collection. The collection includes a survey that is possibly “the largest, single, data gathering event ever performed with regard to women religious”. The data was stored on “seven reels of 800 bpi tapes, lrecl 120, blocksize 12,000, approximately 810,000 records in all”, extracted from the original EBCDIC tapes and converted to newer formats in 1996, then transferred to CDs and to computer hard disk in 1999. The 1967 survey data has fortunately survived the format migrations. Some other data in the collection had been lost: at least 3 tape reels could not be read during the 1996 migration exercise and at least one file could not be copied in 1999. The survey data had not been used in the 18 years since 1996 – nicely and appropriately described by the author's colleague as “a lifetime in the digital world”.

The dataset has now been reformatted and stored in .dta and .csv formats, and the “codebook” of all the questions and pre-defined responses has been recreated in a single document, putting the dataset in the best possible format for re-use. The post gives examples of digital collection items that require intervention or preservation actions. A few takeaways:
  • Active use seems to be the best way for monitoring and detecting digital obsolescence.
  • Metadata really is essential. Without the notes, finding aid and scanned codebook, we would not be able to make sense of the dataset.
  • Do not wait a lifetime to think about digital preservation. 
  • The longer you wait, the more difficult it gets.

Tuesday, January 31, 2017

Digital Preservation and Archaeological data

Digital Preservation.  Michael L. Satlow. Then and Now. Jan 26, 2017.
     The post looks at the issue of preservation in relation to modern scholarly and artistic works. "The underlying problem is a simple one: most scholarly and creative work today is done digitally." Archaeological excavations generate reams of data, and like other scientific data, archaeological data are valuable.  There is no single way that archaeologists record their findings. "Unlike scientists, many archaeologists and humanists have not thought very hard about the preservation of digital data. Scientists routinely deposit their raw data in institutional repositories and are called upon to articulate their digital data management and preservation plan on many grant applications. The paths open to others are less clear."

Institutional digital repositories provide a simple and inexpensive solution. When the project is complete, the data can be converted to XML and deposited. The data conversion would be the most involved part, but the XML format would allow the data to be easily accessed and used. "It is time to think about digital preservation as a staple of our 'best practices'.”


Wednesday, November 02, 2016

Should We Keep Everything Forever? Determining Long-Term Value of Research Data

Should We Keep Everything Forever? Determining Long-Term Value of Research Data. Bethany Anderson, et al. iPres 2016. (Proceedings p. 284-285 / PDF p. 143). Poster.
     The poster describes efforts to launch an institutional data repository, the Illinois Data Bank. The Research Data Service is committed to preserving and providing access to published research datasets for a minimum of five years after the date of publication in the Data Bank. They developed preservation review processes and guidelines for datasets that will help promote the discoverability and use of open research data, offering a preservation and access solution that is trusted by researchers.

The framework includes guidelines and processes for reviewing published datasets after their five-year commitment ends and deciding whether they should be retained or deaccessioned. This systematic appraisal approach helps them assess the long-term viability of a dataset, its value to research communities, and its preservation viability.

The Preservation review guidelines for the Illinois Data Bank are:

Evaluated by Curators/Librarians/Archivists
  • Cost to Store:  estimated cost of continuing to store
  • Cost to Preserve: estimated cost of continuing or escalating preservation
  • Access: use metrics to determine interest in this dataset
  • Citations:  has the dataset been cited in any publications
  • Restrictions: are there access or re-use restrictions
Evaluated by Domain Experts
  • Possibility of Re-creation
  • Cost of Re-creation
  • Impact of Study: did the study for this dataset significantly impact research
  • Uniqueness of Study
  • Quality of Study
  • Quality of Dataset
  • Current Relevance to contemporary research questions
Evaluated by Curators/Librarians/Archivists and Domain Experts
  • Are other copies available
  • Understandability: is the metadata & documentation for access / reuse sufficient
  • Dependencies: what are the software and environment dependencies
  • Appropriateness of Repository: is there a better repository for the dataset

Saturday, October 15, 2016

DPTP: Introduction to Digital Preservation Planning for Research Managers

DPTP: Introduction to Digital Preservation Planning for Research Managers. Ed Pinsent, Steph Taylor. ULCC. 15 October 2016.
     Today I saw this course offered and thought it looked interesting (wish I were in London to attend).  It is a one-day introduction to digital preservation and is designed specifically to look at preservation planning from the perspective of the research data manager. Digital preservation, the management and safeguarding of digital content for the long-term, is becoming more important for research data managers to make sure  content remains accessible and authentic over time.  The learning outcomes are:
  • Understand what digital preservation means and how it can help research managers
  • How to assess content for preservation
  • How to integrate preservation planning into a research data management plan
  • How to plan for preservation interventions
  • How to identify reasons and motivations for preservation for individual projects
  • What storage means, and the storage options that are available
  • How to select appropriate approaches and methods to support the needs of projects
  • How to prepare a business case for digital preservation
The course contains eight modules, which are:
  1. Find out about digital preservation and how and why it is important in RDM.
  2. Assessing research data and understanding how to preserve them for the longer term, and understanding your users.
  3. Learn how an RDM plan can include preservation actions. 
  4. Managing data beyond the life of projects, planning the management of storage and drafting a selection policy.
  5. Understanding individual institutions, stakeholders and requirements and risk assessment.
  6. Understand why preservation storage has extra requirements, considering ‘the Cloud’
  7. The strategy of migrating formats, including databases; risks and benefits, and tools you can use. 
  8. Making a business case (Benefits; Risks; Costs) to persuade your institution why digital preservation is important

Monday, June 20, 2016

Preserving Transactional Data

Preserving Transactional Data. Sara Day Thomson. DPC Technology Watch Report 16-02. May 2016.
     This report examines the requirements for preserving transactional data and the challenges in re-using these data for analysis or research. “Transactional” will be used to refer to "data that result from single, logical interactions with a database and the ACID properties (Atomicity, Consistency, Isolation, Durability) that support reliable records of interactions."

Transactional data, created through interactions with a database, can come from many sources and different types of information. "Preserving transactional data, whether large or not, is imperative for the future usability of big data, which is often comprised of many sources of transactional data." Such data have potential for future developments in consumer analytics and in academic research and "will only lead to new discoveries and insights if they are effectively curated and preserved to ensure appropriate reproducibility."

The organizations that collect transactional data aim to manage and preserve it for business purposes as part of their records management. There are strategies for database preservation, as well as tools and standards that support data re-use. The strategies for managing and preserving big transactional data must adapt to both SQL and NoSQL environments. Some significant challenges include the large amounts of data, rapidly changing data, and different sources of data creation.

Some notes:
  • understanding the context and how the data were created may be critical in preserving the meaning behind the data
  • data purpose: preservation planning is critical in order to make preservation actions fit for purpose while keeping preservation cost and complexity to a minimum
  • how data are collected or created can have an impact on long-term preservation, particularly when database systems have multiple entry points, leading to inconsistency and variable data quality.
  • Current technical approaches to preserving transactional data primarily focus on the preservation of databases. 
  • Database preservation may not capture the complexities and rapid changes enabled by new technologies and processing methods 
  • As with all preservation planning, the relevance of a specific approach depends on the organization’s objectives.
There are several approaches to preserving databases:
  • Encapsulation
  • Emulation 
  • Migration/Normalization
  • Archival Data Description Markup Language (ADDML)
  • Standard Data Format for Preservation (SDFP) 
  • Software Independent Archiving of Relational Databases (SIARD)
"Practitioners of database preservation typically prefer simple text formats based on open standards. These include flat files, such as Comma Separated Value (CSV), annotated textual documents, such as Extended Markup Language (XML), and the international and open Structured Query Language (SQL)." The end-goal is to keep data in a transparent and vendor-neutral database so they can be  reintegrated into a future database.

Best practices:
  1. choose the best possible format, either preserving the database in its original format or migrating to an alternative format.
  2. after a database is converted, encapsulate it by adding descriptive, technical, and other relevant documentation to understand the preserved data.
  3. submit database to a preservation environment that will curate it over time.
Research is continuing in the collection, curation, and analysis of data; digital preservation standards and best practices will make the difference between just data and "curated collections of rich information".

Monday, February 08, 2016

Keep Your Data Safe

Love Your Data Week: Keep Your Data Safe. Bits and Pieces.  Scott A Martin. February 8, 2016.
     The post reflects on a 2013 survey of 360 respondents:
  • 14.2% indicated that a data loss had forced them to re-collect data for a project.  
  • 17.2% indicated that they had lost a file and could not re-collect the data.
If this is indicative of the total population of academic researchers, then a great deal of research time and money is being lost to data loss. Building a few simple steps into your own research workflow can greatly reduce the chances of catastrophic loss:
  1. Follow the 3-2-1 rule for backing up your data: store at least 3 copies of each file (1 working copy and 2 backups), on at least 2 different storage media, with at least 1 copy offsite
  2. Perform regular backups
  3. Test your backups periodically (a minimal checksum-comparison sketch follows this list)
  4. Consider encrypting your backups. Just make sure that you’ve got a spare copy of your encryption password stored in a secure location!
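
As a minimal illustration of step 3, the sketch below (with hypothetical directory paths) compares SHA-256 checksums of files in a working directory against one backup copy and reports anything missing or changed; running something like this on a schedule against each backup copy is one way to test backups rather than just trusting them.

    import hashlib
    from pathlib import Path

    def sha256(path):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_backup(working_dir, backup_dir):
        """Report files in working_dir that are missing or changed in backup_dir."""
        working = Path(working_dir)
        backup = Path(backup_dir)
        for file in working.rglob("*"):
            if not file.is_file():
                continue
            copy = backup / file.relative_to(working)
            if not copy.exists():
                print(f"MISSING in backup: {file}")
            elif sha256(file) != sha256(copy):
                print(f"CHECKSUM MISMATCH: {file}")

    # Hypothetical locations for the working copy and one of the two backups.
    verify_backup("/home/researcher/project-data", "/mnt/backup/project-data")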

Friday, September 25, 2015

Data Management Practices Across an Institution: Survey and Report

Data Management Practices Across an Institution: Survey and Report. Cunera Buys, Pamela Shaw. Journal of Librarianship and Scholarly Communication. 22 Sep 2015.
     Data management is becoming increasingly important to researchers in all fields. The results of a survey show that both short and long term storage and preservation solutions are needed. When asked, 31% of respondents did not know how much storage they will need, which makes establishing a correctly sized research data storage service difficult. This study presents results from a survey of digital data management practices across all disciplines at a university. In the survey, 65% of faculty said it was important to share data, but less than half of them "reported that they 'always' or 'frequently' shared their data openly, despite their belief in the importance of sharing".

Researchers produce a wide variety of data types and sizes, but most create no metadata or do not use metadata standards, and most were uncertain about how to meet the NSF data management plan requirements (only 45% had a plan). A 2011 study of data storage and management needs across several academic institutions found that many researchers were satisfied with short-term data storage and management practices, but not with long-term data storage options. Researchers in the same study did not believe their institutions provided adequate funds, resources, or instruction on good data management practices. When asked where research data is stored:
  • 66% use computer hard drives
  • 47% use external hard drives
  • 50% use departmental or school servers
  • 38% store data on the instrument that generated the data
  • 31% use cloud-based storage services
    •  Dropbox was the most popular service at 63%
  • 27% use flash drives
  • 6% use external data repositories.

Most researchers expected to store raw and published data “indefinitely”. Many respondents also selected 5-10 years, and very few said they keep data for less than one year. All schools suggest that data are relevant for long periods of time or indefinitely. Specific retention preferences by school were:
  • The college of arts and sciences prefers “indefinitely” for ALL data types
  • Published data: All schools prefer “indefinitely” for published data except
    • The law school prefers 1-5 years for published data
  • Other data:
    • The school of medicine prefers 5-10 years for all other data types
    • The school of engineering prefers 1-5 years for all other data types
    • The college of arts and sciences prefers “indefinitely” for raw data
    • The school of management prefers “indefinitely” for raw data

Keeping raw data / source material was seen as useful because researchers may:
  • use it for future / new studies (77 responses)
  • use it for longitudinal studies (9 responses)
  • share it with colleagues (6 responses)
  • use it to replicate study results (10 responses)
  • use it to respond to challenges of published results
Other reasons were that the data would be difficult or costly to re-create, or simply that it is good scientific practice to retain data (4 responses).

When asked, 66% indicated they would need additional storage; most said 1-500 gigabytes or “don’t know.” Also, when asked what services would be useful in managing research data, the top responses were:
  • long term data access and preservation (63%), 
  • services for data storage and backup during active projects (60%), 
  • information regarding data best practices (58%), 
  • information about developing data management plans or other data policies (52%), 
  • assistance with data sharing/management requirements of funding agencies (48%), and 
  • tools for sharing research (48%).
Since most respondents said they planned to keep their data indefinitely, that means that institutional storage solutions would need to accommodate "many data types and uncertain storage capacity needs over long periods of time". The university studied lacks a long term storage solution for large data, but has short term storage available. Since many researchers store data on personal or laboratory computers, laboratory equipment, and USB drives, there is a greater risk of data loss. There appears to be a need to educate researchers on best practices for data storage and backup.

There appears to be a need to educate researchers on the external data repositories that are available and on funding agencies’ requirements for data retention. The library decided to provide a clear set of funder data retention policies linked from the library’s data management web guide. Long-term storage is a problem for researchers because of the volume of data and the lack of stable storage solutions, and that limits data retention and sharing.

Thursday, September 03, 2015

NMC Horizon Report: 2015 Library Edition

NMC Horizon Report: 2015 Library Edition. The New Media Consortium. September 2015. [PDF]
     This report looks at how key trends, significant challenges, and developments in technology will impact academic and research libraries, and is intended as a technology-planning guide. Among other things, it looks at research data management and access. Some notes and quotes:
  • The move to a more expansive online presence is calling for state-of-the-art data management processes that both make content more discoverable and ensure long-term preservation. Libraries have long played key roles in this area and are continuing to refine their workflows as well as the digital infrastructures that support them. 
  • Experts argue that librarians must recognize how social media is changing the nature of scholarly record elements and develop plans to properly capture and preserve these activities.
  • Formats: Cambridge University Library in the UK offers researchers guidelines for choosing different formats for their data, with an emphasis on long-term sustainability. They assert that the format used for data collection can be different from the one used to archive that data, and recommend that researchers wait until the project is completed to convert their materials to any new formats. Other institutions encourage researchers to look to national agencies for digital file format best practices.
  • As universities generate more data over time, libraries are well poised to be the managers and curators of this information. By digitally archiving the datasets from every publication they contain, tagging them with keywords, and making them searchable, library databases can uncover links and patterns between studies, revealing the full trajectory of an idea as it grows.
  • Metadata: The metadata preserves the meaning of data, ensuring the research materials will be searchable, discoverable, and accessible long-term
  • Enhanced formats and workflows within the realm of electronic publishing have enabled experiments, tests, and simulation data to be represented by audio, video, and other media and visualizations.  The emergence of these formats has led to libraries rethinking their processes for managing data and linking them between various publications.
  • As the types of mediums for research and data expand over time, library leaders must strategize and build sustainable databases that can house enormous amounts of research materials in nearly any format.
  • An example of digital curation: Digital Curation at ETH Zurich. Through the ETH Data Archive, ETH-Bibliothek provides an infrastructure for the medium and long-term storage of digital information such as research data, documents, or images.
  • Some institutions have data management services that were created to assist researchers with organizing, managing, and curating research data to "ensure its long-term preservation and accessibility".  They provide a data management planning tool plus an online repository for storing research materials and associated metadata. 
  • Some libraries help researchers design digital data management and archival plans for the grants that they are applying for. 
  • Digital strategies [which include Preservation/Conservation Technologies] are not so much technologies as they are ways of using devices and software to enrich teaching, learning, research, and information management.

Tuesday, September 01, 2015

NISO Launches New Primer Series with the Publication of Primer on Research Data Management

NISO Launches New Primer Series with the Publication of Primer on Research Data Management. NISO Press Release. 31 Aug 2015.
     The National Information Standards Organization has created a new Primer Series on information management technology issues. The series provides an overview of data management and outlines best practices for collecting, documenting, and preserving research data. There is an increase in data-driven research, and the management of that data is a concern for researchers. The goal of the primer is to educate researchers so that their data is easily reproducible, transparent, and available for others. The first of three primers: Research Data Management, by Carly Strasser.
  1. Planning for data management
    1. Data management plans. Many funders realize that planning before beginning a research data project is critical. Most Data Management Plans have five basic components:
      1. A description of the types of data from the project
      2. The standards that will be used for those data and metadata
      3. A description of the data policies
      4. Plans for archiving and preservation of the data generated
      5. A description of the data management resources needed
    2. Best practices for data management planning
      1. Naming schemes should be descriptive, unique, and reflect the content/sample
      2. Spreadsheets should ensure provenance and documentation of the entire workflow
      3. Keep the raw data on a separate tab
      4. Put only one type of data in any given cell
      5. Create a metadata collection plan
      6. Establish a plan for how the data will be backed up
  2. Documenting Research data 
    1. Metadata. High-quality metadata are as critical to effective data sharing as the data itself; the better the metadata, the more likely a dataset will be reused
    2. Document software and workflows. The complete project should be reproducible. 
  3. Administration. Sharing research data implies that others may examine, download, and/or use that data in the future. Ensuring that data are available for use and reuse requires proper licensing or waivers that enable these activities.
    1. Data storage, backups, and security. At a minimum, there should be three copies of the full dataset, associated code, and workflows: original, near, and far.
      Original: the working dataset and associated files, usually housed on a researcher’s primary computer
      Near: a copy ideally not in the same physical location; updated daily and often on a file server within the researcher’s institution.
      Far: A copy not in the same building, and ideally located in an area with different disaster threats
  4. Preservation
    1. Best practices: 
      1. Formats: Use standard, open source formats rather than proprietary formats
      2. Identifier: The data should have a unique identifier
      3. Metadata: Create high-quality, machine-readable metadata (a minimal record sketch follows this outline)
      2. Repositories. When selecting a repository, researchers should consider:
        Location of similar datasets
        Access and use policies for the repository
        Length of time the data should / will be kept
        Management and costs of the repository
        Existence of policies for replication, fixity, disaster recovery, and continuity?
  5. Use and re-use: For data to be used by others, there must be a way to identify, cite, and link
    datasets
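
To illustrate the "unique identifier" and "machine-readable metadata" points in the outline above, here is a minimal sketch of a dataset record written out as JSON. The field names and the DOI are invented placeholders for illustration, not a particular metadata standard; a real repository would prescribe a schema such as DataCite or Dublin Core.

    import json

    # A hypothetical, minimal machine-readable record for a deposited dataset.
    record = {
        "identifier": "doi:10.xxxx/example-dataset",  # placeholder persistent identifier
        "title": "Example survey dataset",
        "creators": ["Researcher, A.", "Researcher, B."],
        "publication_year": 2015,
        "description": "Raw and processed survey responses with codebook.",
        "formats": ["text/csv"],
        "license": "CC-BY-4.0",
        "related_publication": "doi:10.xxxx/example-article",
    }

    with open("dataset_metadata.json", "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
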
"The guidelines in this primer give insight into how best to plan, document, and preserve datasets
responsibly so that they are easier to use and share, as well as making the opportunities for
collaboration with other researchers less difficult."

Saturday, August 08, 2015

Where Should You Keep Your Data?

Where Should You Keep Your Data? Karen M. Markin. The Chronicle of Higher Education. June 23, 2015.
     Federal funding agencies have made it clear that grant proposals must include plans for sharing research data with other scientists. What has not been clear is how and where researchers should store their data, which can range from sensitive personal medical information to enormous troves of satellite imagery. Although data-sharing requirements have been in place for years, universities have been slow to help principal investigators meet them, and funding can now be withheld from researchers who do not comply with the new policies. Principal investigators are urged to place their data in existing publicly accessible repositories; the NIH has a list of repositories and the NSF directs researchers to specific repositories.
The "DMP Tool," hosted by the University of California, provides a free, interactive form that walks you through the preparation of a data-management plan for more than a dozen organizations.

Many libraries are playing a role in this effort and researchers should check with reference librarians for help on this. Data storage and preparation can get complicated and it’s useful to have someone to guide you through the process. Federal agencies plan to establish standards for these so-called "metadata."


Monday, July 27, 2015

Researchers Open Repository for ‘Dark Data’

Researchers Open Repository for ‘Dark Data’. Mary Ellen McIntire. Chronicle of Higher Education.  July 22, 2015.
     Researchers are working to create a one-stop shop to retain data sets after the papers they were produced for are published. The DataBridge project will attempt to extend the life cycle of so-called dark data by creating an archive for data sets and metadata, grouping them into clusters of information to make relevant data easier to find. The data can then be reused and re-purposed by others to further science. A key aspect of the project will be to allow researchers to make connections and pull in other data of a similar nature.

The researchers also want to include archives of social-media posts by creating algorithms to sort through tweets for researchers studying the role of social media. This could save time for people who might otherwise spend a lot of it cleaning their data or reinventing the wheel. The project could serve as a model for libraries at research institutions that are looking to better track data in line with federal requirements and extend researchers’ “trusted network” of colleagues with whom they share data.


Tuesday, July 14, 2015

Seagate Senior Researcher: Heat Can Kill Data on Stored SSDs

Seagate Senior Researcher: Heat Can Kill Data on Stored SSDs.  Jason Mick. Daily Tech. May 13, 2015.
   A research paper by Alvin Cox, a senior researcher at Seagate, warns that those storing solid state drives should be careful to avoid keeping them in hot locations. Average "shelf life" in a climate-controlled environment is about 2 years but drops to 6 months if the temperature hits 95°F / 35°C. Typically, an enterprise-grade SSD can retain data for around 2 years without being powered on if the drive is stored at 25°C / 77°F, and for every 5°C / 9°F increase the storage time halves. This also applies to storage of solid-state equipped computers and devices. If only a few sectors are bad it may be possible to repair the drive, but if too many sectors are badly corrupted, the only option may be to format the device and start over.
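
The halving rule quoted above lends itself to a quick back-of-the-envelope calculation. The sketch below simply takes the reported figures (roughly 2 years of unpowered retention at 25°C, halving for every 5°C increase) at face value and tabulates the implied shelf life at a few storage temperatures.

    # Rough retention estimate based on the figures reported above:
    # ~2 years unpowered at 25 C, halving for every 5 C increase.
    BASE_YEARS = 2.0
    BASE_TEMP_C = 25.0

    def retention_years(temp_c):
        """Estimated unpowered retention in years at a given storage temperature."""
        return BASE_YEARS / (2 ** ((temp_c - BASE_TEMP_C) / 5.0))

    for temp in (25, 30, 35, 40):
        print(f"{temp} C: ~{retention_years(temp) * 12:.0f} months")
    # Prints roughly 24, 12, 6, and 3 months, matching the 6-month figure at 35 C.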

Friday, July 10, 2015

Track the Impact of Research Data with Metrics; Gauge Archive Capacity

How to Track the Impact of Research Data with Metrics. Alex Ball, Monica Duke.  Digital Curation Centre. 29 June 2015.
   This guide from the DCC provides help on how to track and measure the impact of research data. It provides:
  • impact measurement concepts, services and tools for measuring impact
  • tips on increasing the impact of your data 
  • how institutions can benefit from data usage monitoring  
  • help to gauge capacity requirements in storage, archival and network systems
  • information on setting up promotional activities 
Institutions can benefit from data usage monitoring as they:
  • monitor the success of the infrastructure providing access to the data
  • gauge capacity requirements in storage, archival and network systems
  • create promotional activities around the data, sharing and re-use
  • create special collections around datasets;
  • meet funder requirements to safeguard data for the established lifespan
Tips for raising research data impact
  • deposit data in a trustworthy repository
  • provide appropriate metadata
  • enable open access
  • apply a license to the data about what uses are permitted
  • raise awareness to ensure it is visible (citations, publication, provide the dataset identifier, etc)

Monday, July 06, 2015

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models. Michael J. Bennett. University of Connecticut. May 21, 2015.
    Photogrammetry allows for the creation of 3D objects from 2D photography, mimicking human stereo vision. There are many steps in the process (images, masks, depth maps, models, and textures), and the question is what should be archived for long-term digital preservation. When models are output into an open standard, there is data loss, since “native 3D CAD file formats cannot be interpreted accurately in any but the original version of the original software product used to create the model.”

General lessons from archiving CAD files are that, when possible, the data should be normalized into open standards, but native formats, which are often proprietary, should also be archived. For photogrammetry data, the author reviews some of the options and recommendations. There are difficulties with archiving the files, and also with organizing them in a container that documents the relationships between them. Digital repositories can play a role in the preservation of 3D datasets.

Monday, June 29, 2015

SIRF: Self-contained Information Retention Format

SIRF: Self-contained Information Retention Format. Sam Fineberg, et al. SNIA Tutorial. 2015. [PDF]
Generating and collecting very large data sets that need to be kept for long periods is a necessity for many organizations, including the sciences, archives, and commerce. The presentation describes the challenges of keeping data long-term with Linear Tape File System (LTFS) technology and introduces the Self-contained Information Retention Format (SIRF). The top external factors driving long-term retention requirements are legal risk, compliance regulations, business risk, and security risk.

What does long-term mean? Retention of 20 years or more is required by 70% of the responses in a poll.
  • 100 years: 38.8%
  • 50-100 years: 18.3%
  • 21-50 years: 31.1%
  • 11-20 years: 15.7%
  • 7-10 years: 12.3%
  • 3-5 years: 1.9%
The need for digital preservation:
  • Regulatory compliance and legal issues
  • Emerging web services and applications
  • Many other fixed-content repositories (Scientific data, libraries, movies, music, etc.)
Data stored should remain accessible, undamaged, and usable for as long as desired and at an affordable cost; affordability depends on the "perceived future value of information". There are problems with verifying the correctness and authenticity of semantic information over time. SIRF is the digital equivalent of a self-contained archival box. It contains:
  • set of preservation objects and a catalog (logical or physical)
  • metadata about the contents and individual objects
  • self describing standard catalog information so it can all be maintained
  • a "magic object" that identifies the container and version
The metadata contains basic information that can vary depending on the preservation needs. It allows a deeper description of the objects, along with the content meaning and the relationships between the objects.
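
Purely to illustrate the elements listed above (preservation objects, a catalog, per-object metadata, and a "magic object" identifying the container), here is a hedged sketch of what a self-describing container catalog might look like. This is an invented illustration, not the actual SNIA SIRF serialization.

    import json

    # Illustrative only: an invented, SIRF-like catalog, not the SNIA specification.
    container = {
        "magic_object": {"container_type": "example-retention-container", "version": "1.0"},
        "catalog": [
            {
                "object_id": "obj-0001",
                "name": "survey_responses.csv",
                "checksum_sha256": "<digest of the file>",  # placeholder
                "metadata": {
                    "description": "Raw survey responses, one record per row",
                    "related_objects": ["obj-0002"],  # e.g. the codebook below
                },
            },
            {
                "object_id": "obj-0002",
                "name": "codebook.pdf",
                "checksum_sha256": "<digest of the file>",  # placeholder
                "metadata": {"description": "Codebook documenting obj-0001"},
            },
        ],
    }

    # The catalog is stored alongside the objects themselves so the container
    # stays self-describing and can be maintained without external systems.
    with open("container_catalog.json", "w", encoding="utf-8") as f:
        json.dump(container, f, indent=2)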

When preserving objects, we need to keep all the information to make them fully usable in the future. No single technology will be "usable over the time-spans mandated by current digital preservation needs". LTFS technologies are "good for perhaps 10-20 years".

Thursday, June 11, 2015

Libraries and Research Data Services

Libraries and Research Data Services. Megan Bresnahan, Andrew Johnson. University of Colorado Boulder. 2014. [PDF]
A presentation that looks at the importance of training librarians to become experts in research data services (RDS). “Reassigning existing library staff is the most common tactic for offering RDS. This approach also needs to be supported with professional development for staff so they can gain the required expertise to provide the full range of RDS”

Some feedback from Subject Librarians shows that they know it is part of their duties and that it is becoming more important, but that it is difficult to accomplish in the current environment:
  • “Research data is intimidating!”
  • “How can I take on research data support with so much else already on my plate?!”
  • “I need practical tools to use to help researchers with their data”
  • “Helping faculty and students with their data is an increasingly important part of my liaison duties”
Becoming expert requires data services training with established learning goals. With training, Subject Librarians are able to:
  • Understand the stages
  • Define the role
  • Apply skills
  • Plan for outreach
  • Feel confident
  • Engage with researchers
DataQ is a website that serves as a collaborative, peer-reviewed reference tool for librarians providing research data services

Tuesday, June 02, 2015

Data Archives and Digital Preservation

Data Archives and Digital Preservation. Council of European Social Science Data Archives. June 1, 2015.
Data archives play a central role in research. Data is considered “the new gold”. There is increasing pressure on researchers to manage, archive, and share their data in data archives. It is important to securely store research data, and to allow researchers to reuse data in their own analyses or teaching.

Archives are much more than just a storage facility; they actively curate and preserve research data. They must have suitable strategies, policies, and procedures to maintain the usability, understandability, and authenticity of the data. There are also numerous requirements from users, data producers, and funders. In social science research data preservation and sharing, archives have the added responsibility of protecting the human subjects of the research.

The CESSDA site has many resources. Some of these are:

  • What is digital preservation 
  • OAIS 
  • Data appraisal and ingest 
  • Documentation and metadata 
  • Access and reuse 
  • Trusted digital repositories: audit and certification.


Tuesday, April 28, 2015

Database Preservation Toolkit

Database Preservation Toolkit. Website. April 2015.
The Database Preservation Toolkit uses input and output modules and allows conversion between database formats, including connection to live systems. It allows conversion of live or backed-up databases into preservation formats such as DBML, SIARD, or XML-based formats created for the purpose of database preservation.

This toolkit was part of the RODA project and now has been released as a separate project. The site includes download links and related publications and presentations.

Saturday, April 18, 2015

Digital Curation and Doctoral Research: Current Practice

Digital Curation and Doctoral Research: Current Practice. Daisy Abbott. International Journal of Digital Curation. 10 February 2015. [PDF]
More doctoral students than ever before are engaging in research data creation, processing, use, management, and preservation activities (digital curation). Digital curation is an intrinsic part of the skills that students are expected to acquire.

Training in research skills and techniques is the key element in the development of a research student. The integration of digital curation into expected research skills is essential. Doctoral supervisors "should discuss and review research data management annually, addressing issues of the capture, management, integrity, confidentiality, security, selection, preservation and disposal, commercialization, costs, sharing and publication of research data and the production of descriptive metadata to aid discovery and re-use when relevant." Those supervisors may not necessarily have those skills themselves. And there is a gap in the literature about why and how to manage, curate, and preserve digital data as part of a PhD program.

While both doctoral students and supervisors can benefit from traditional resources on the topic, the majority of guidance on digital curation takes the form of online resources and training programs. In a survey,
  • over 50% of PhD holders consider long-term preservation to be extremely important. 
  • under 40% of students consider long-term preservation to be extremely important.
  • 90% of doctoral students and supervisors consider digital curation to be moderately to extremely important. 
  • Yet 74% of respondents stated that they had limited or no skills in digital curation and only 10% stated that they were “fairly skilled” or “expert”. 
Generally, researchers were not aware of the digital curation support services that are available. The relatively recent emphasis on digital curation in research, and the nature of the processes involved, present problems for supervisors. Developing the appropriate skills and knowledge to create, access, use, manage, store, and preserve data should therefore be considered an important part of any researcher’s development. Efforts should be taken to:
  • Ensure practical digital curation is understood
  • Encourage responsibility for digital curation activities in institutional support structures
  • Increase the discoverability and availability of digital curation support services