Monday, August 13, 2012

The Problem of Data

The Problem of Data. Lori Jahnke, Andrew Asher, Spencer D. C. Keralis. CLIR Report. Council on Library and Information Resources. August 12, 2012.
Excellent report on data storage, use, and curation.  A section contains a snapshot of the current digital data curation education landscape.  Below are some long notes and excerpts from the PDF article:

Key Findings
  • None of the researchers interviewed for this study have received formal training in data management practices, nor do they express satisfaction with their level of expertise.
  • Few researchers, especially among those who are early in their career, think about long-term preservation of their data.
  • The demands of publication output overwhelm long-term considerations of data curation. Metadata and documentation are of interest only if they help a researcher complete his or her work.
  • There is a great need for more effective collaboration tools, as well as online spaces that support the volume of data generated and provide appropriate privacy and access controls.
  • Few researchers are aware of the data services that the library might be able to provide and seem to regard the library as a dispensary of goods (e.g., books, articles) rather than a place for research/professional support.
  • There is unlikely to be a single out-of-the-box solution that can be applied to the problem of data curation. Instead, an approach is needed that emphasizes working with researchers to identify or build appropriate tools.
  • Researchers must have access to adequate networked storage.
  • Universities should revise access policies to support multi-institutional research projects.
  • Programs should begin early in the researcher career path for the greatest long-term benefit.
  • Data curation systems should be integrated with the active research phase (e.g., as a backup).
  • Privacy and data access control tools should be developed to manage confidential data. Policies must be developed that support researchers in using these technologies.
Other notes:
  • Data curation is generally defined as a set of activities that includes preserving, maintaining, archiving, and depositing data to keep it secure, intact, and accessible for reuse.
  • Many researchers expressed concerns surrounding the ethical reuse of research data. Additional work is needed to establish best practices in this area, particularly for qualitative data sets.
  • Most participants reported feeling adrift when establishing protocols for managing their data and added that they lacked the resources to determine best practices, let alone to implement them. Almost none of the scholars reported that data curation training was part of their graduate curriculum.
  • Perhaps one of the more complicated issues for data curation is the complex life cycle of research data and projects. Data collection may occur throughout the project, and the data may change form before the project is completed.
  • Scholars may collect data on a phenomenon unrelated to their current project with no clear idea of the potential usefulness of those data. Such data might be integrated with a later project, given away to an interested colleague, or never used at all.
  • It would be helpful to have a way to collect data into a collection space that could be used throughout the project.
  • The researchers held contradictory views about the value of their data. Some wanted to associate their data with publications or to have it available for use in the classroom.
  • Few of the researchers thought about long-term preservation of their data, especially those who were early in their career.
  • The academic system offers little or no career reward for preserving one’s data.
  • Data preservation strategies must take into account varied, proprietary, and non-standard data formats, and provide a real-time benefit for the scholar in meeting research goals.
  • Given the lack of infrastructure for sharing and storing data, the social sciences may face similar problems of data loss in documenting social phenomena as researchers begin to work within larger collaborative groups and with larger data sets. Data stored on personal media devices are especially vulnerable to this type of loss, as few scholars have the skills necessary to maintain data over time and across hardware and software platforms. Several of the scholars interviewed reported storing data on legacy systems that may become inaccessible.
  • University policies that appropriately address the ethical considerations relating to data sharing and preservation would benefit researchers, administrators, and technologists alike.
  • Researchers hold tremendous amounts of data on personal computers and hard drives, many of which are not backed up adequately. Among the participants, the research data ranged from under 1 GB to multiple terabytes. Data types included various formats of images, video, audio files, data sets, documents, etc.
  • Managing large files presents significant challenges for researchers in that university infrastructures typically do not provide adequate storage space or sufficient bandwidth for data access.  The data may be lost when researchers upgrade their computers or software. Few researchers put more than minimal effort into organizing non-active data or ensuring its continued compatibility with new software or hardware.
  • There is a clear need for libraries to move beyond passively providing technology to embrace the changes in scholarly production that emerging technologies have brought.  
  • The data preservation step must be fully integrated into a scholar’s research workflow. Not only are necessary metadata and other materials much more easily captured while research is in progress, but also there is a real opportunity to streamline research workflows and to provide much needed support. Scholars need help with the technical aspects of managing and preserving data, as well as with basic curation issues (e.g., what to keep and what to delete), and the ethical implications of sharing their data (e.g., what is an appropriate latency period for the data and how does one balance the need to provide meaningful access with the risk of inadvertently exposing confidential participant information).
  • Although some researchers acknowledge that their data could be useful to other researchers, there is little incentive to invest time in archiving or repackaging data sets.
  • Extensive outreach to scholars is necessary to build the relationships that will facilitate data preservation. This is likely to be a slow process initially. Researchers are unlikely to engage with those they do not view as peers.
  • Researchers need additional tools to manage preserved data on their own, and they would benefit from access to professionals who can offer advice on management strategies.
  • Researchers typically align themselves with their disciplines rather than with their institutions; therefore, support models that extend beyond the university are likely to be especially beneficial.
  • Reaching the level of collaboration among universities and the technical interoperability required to capture and preserve a career’s worth of data in the current environment is a challenge.
  • Current data management systems must be fundamentally improved so that they can meet the capacity demand for secure storage and transmission of research data. Integrating the data preservation system with the active research cycle is essential to encourage researcher investment.
  • Researchers are not well positioned to meet the technical and policy challenges without the coordinated support of libraries, information technology units, and professionals who possess both technical and research expertise.
  • One example concerns the PETRA e+e− collider project in Hamburg, Germany. In the more than 25 years since the experiments ended, theoretical insights and computing advancements have made the data valuable once again. However, much of the data has been irrevocably lost to corrupt storage media, lost computer code, and deactivated personal accounts. These early particle physics experiments are unique, as modern colliders operate at higher energy levels and cannot replicate the particle interactions.


Shania Simpsons said...

Well, who actually wants to lose vital data and information because of irresponsible data management? The researchers should handle data with more care and expertise expected of their profession. There are many ways of preserving data. The most common method is to provide and have a backup copy in hand in case of data corruption. Training the researchers how to handle data is a plus as well.

Chris said...

Shania, I think you need to read the article. Backups are good, but they are not preservation, and they are not permanent. While it would be good to have the researchers more aware of data preservation, that is not what they are trying to do, nor is it what their organizations reward them for. They usually are under pressure to do the research, publish, and move on. Other key points in the article are that they need training, tools, and time, which they don't have. There is a better chance of preserving and sharing data if they partner with those who already have the time, training, and tools. Awareness of the problem and collaboration with partners, such as the library, are important.

Ruby said...

One characteristic of a good data management system is that it lets you systematize your records, such as people (customers or employees), projects, and assets, into a well-structured body, preferably in a way that lets you view them in several forms (lists, tables, or visual forms).

Ruby Badcoe