The National Information Standards Organization (NISO) has created a new Primer Series on information management technology issues. The series provides an overview of data management and outlines best practices for collecting, documenting, and preserving research data. With the rise of data-driven research, managing that data has become a growing concern for researchers. The goal of the primers is to help researchers ensure that their data are reproducible, transparent, and available to others. The first of the three primers is Research Data Management, by Carly Strasser.
- Planning for data management
- Data management plans. Many funders realize that planning before beginning a research data project is critical. Most Data Management Plans have five basic components:
- A description of the types of data from the project
- The standards that will be used for those data and metadata
- A description of the data policies
- Plans for archiving and preservation of the data generated
- A description of the data management resources needed
- Best practices for data management planning
- Naming schemes should be descriptive, unique, and reflect the content/sample
- Spreadsheets should preserve provenance and document the entire workflow:
- Keep the raw data on a separate tab
- Put only one type of data in any given cell
- Create a metadata collection plan
- Establish a plan for how the data will be backed up
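The naming guidance above can be sketched in code. The scheme below is a hypothetical example, not one prescribed by the primer: it builds a file name that is descriptive (project, sample, and measurement), unique (date-stamped), and reflects the content.

```python
from datetime import date

def make_filename(project: str, sample: str, measure: str, ext: str = "csv") -> str:
    """Build a descriptive, unique file name: project_sample_measure_date.ext"""
    today = date.today().isoformat()  # ISO dates sort chronologically
    return f"{project}_{sample}_{measure}_{today}.{ext}"

# Example usage with made-up project and sample names.
print(make_filename("soilsurvey", "plotA", "ph"))
```

If several files are produced per day, a run number or time stamp can be appended to keep names unique.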
- Documenting Research data
- Metadata. High-quality metadata are as critical to effective data sharing as the data themselves: the better the metadata, the more likely a dataset is to be reused.
- Document software and workflows. The complete project should be reproducible.
- Administration. Sharing research data implies that others may examine, download, and/or use that data in the future. Ensuring that data are available for use and reuse requires proper licensing or waivers that enable these activities.
- Data storage, backups, and security. At a minimum, there should be three copies of the full dataset, associated code, and workflows: original, near, and far.
- Original: the working dataset and associated files, usually housed on a researcher’s primary computer
- Near: a copy ideally not in the same physical location, updated daily and often kept on a file server within the researcher’s institution
- Far: a copy that should not be in the same building, and ideally located in an area with different disaster threats
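The original/near/far scheme can be sketched as a simple mirroring routine. This is a minimal illustration using temporary directories as stand-ins; real backups would target an institutional file server and an off-site location, and would usually run on a schedule.

```python
import shutil
import tempfile
from pathlib import Path

def back_up(src: Path, destinations: list[Path]) -> None:
    """Mirror the working dataset (and associated files) to each backup copy."""
    for dest in destinations:
        shutil.copytree(src, dest, dirs_exist_ok=True)

# Temporary directories stand in for the three real locations.
root = Path(tempfile.mkdtemp())
original = root / "original"        # working copy on the researcher's machine
original.mkdir()
(original / "results.csv").write_text("sample,ph\nplotA,6.8\n")

near = root / "fileserver"          # e.g. a file server within the institution
far = root / "offsite"              # e.g. an off-site copy with different disaster threats
back_up(original, [near, far])
```

In practice the near copy would be refreshed daily, while the far copy might be updated less often but kept geographically separate.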
- Best practices:
- Formats: Use standard, open source formats rather than proprietary formats
- Identifier: The data should have a unique identifier
- Metadata: Create high-quality, machine-readable metadata
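Machine-readable metadata can be as simple as a structured record stored alongside the dataset. The fields below are a hypothetical minimal example loosely modeled on Dublin Core; real projects should follow whatever metadata standard their community or repository requires.

```python
import json

# Hypothetical metadata record; field names and values are illustrative only.
metadata = {
    "title": "Soil pH measurements, plot A",
    "creator": "J. Researcher",
    "date": "2024-05-01",
    "description": "Weekly soil pH readings collected from plot A.",
    "format": "text/csv",
    "license": "CC0-1.0",
    "identifier": "doi:10.xxxx/example",  # placeholder; a real DOI is assigned by the repository
}

# Serializing as JSON keeps the record machine-readable alongside the data files.
record = json.dumps(metadata, indent=2)
print(record)
```

A record like this can be validated, indexed, and harvested automatically, which is what makes a dataset discoverable and reusable.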
- Repositories. When selecting a repository, researchers should consider:
- Location of similar datasets
- Access and use policies for the repository
- Length of time the data should/will be kept
- Management and costs of the repository
- Existence of policies for replication, fixity, disaster recovery, and continuity
- Use and re-use: For data to be used by others, there must be a way to identify, cite, and link to them. Managing data responsibly makes them easier to use and share, and makes collaboration with other researchers less difficult.