Friday, June 21, 2013
JHOVE 1.10b3
JHOVE 1.10b3. Gary McGath. File Formats Blog. June 5, 2013.
JHOVE 1.10b3 is now available. This is the release candidate, and there won't be any further changes beyond the version number designation unless a serious problem shows up.
Saturday, June 15, 2013
EPUB for archival preservation: an update
EPUB for archival preservation: an update. Johan van der Knijff's blog on Open Planets. 23 May 2013.
In 2012 the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report's findings and conclusions have become outdated, particularly the observations on EPUB 3 and the support of EPUB by characterisation tools. This blog post provides an update to those findings:
- Use of EPUB in scholarly publishing
- Adoption and use of EPUB 3
- EPUB 3 reader support
- Support of EPUB by characterisation tools
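As an illustration of the characterisation point: an EPUB is a ZIP container whose "mimetype" entry declares application/epub+zip (per the EPUB Open Container Format specification), and that is essentially what format-identification tools look for. A minimal sketch in plain Python, not any particular characterisation tool; the file name is hypothetical:

```python
import zipfile

def looks_like_epub(path):
    """Rudimentary EPUB identification: a ZIP archive whose 'mimetype' entry
    contains 'application/epub+zip'. Stricter tools also check that this is
    the first, uncompressed entry in the archive."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        if "mimetype" not in zf.namelist():
            return False
        mimetype = zf.read("mimetype").decode("ascii", errors="replace").strip()
        return mimetype == "application/epub+zip"

if __name__ == "__main__":
    print(looks_like_epub("example.epub"))  # hypothetical file name
```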
Friday, June 14, 2013
EPUB for archival preservation
EPUB for archival preservation. Johan van der Knijff. KB/National Library of the Netherlands. 20 July 2012.
The EPUB format has become increasingly popular in the consumer market. A number of publishers have indicated their wish to use EPUB for supplying their electronic publications to the KB. This document looks at the characteristics and functionality of the format, and whether or not it is suitable for preservation. Conceptually, an EPUB file is just an ordinary ZIP archive which includes one or more XHTML files, in one or more directories. Cascading Style Sheets are used to define layout and formatting. A number of XML files provide metadata.
EPUB has a number of strengths that make it attractive for preservation. It is an open format that is well documented, and there are no known patents or licensing restrictions. The format's specifications are freely available. It is largely based on well-established and widely-used standards, so it scores high marks for transparency and re-usability. For situations where authenticity is crucial (e.g. legal documents), all or parts of a document can be digitally signed. Also, EPUB 2 is a popular format with excellent viewer support, including several open source implementations. There is concern, however, that its role is limited because the current e-book market is dominated by proprietary formats, and EPUB 3 is currently less stable. The report includes a chart of recommendations for using EPUB.
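To make the "ordinary ZIP archive with XHTML, CSS and XML metadata" description concrete, the sketch below opens an EPUB with Python's standard library and locates the package (OPF) document via META-INF/container.xml, then lists the content and metadata entries. The file name is hypothetical; this is an illustration of the container layout, not a validation tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OCF container.xml document.
NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

with zipfile.ZipFile("book.epub") as zf:  # hypothetical file name
    # META-INF/container.xml points to the OPF package document, which in
    # turn lists the XHTML content, CSS stylesheets and metadata.
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find(".//c:rootfile", NS).attrib["full-path"]
    print("Package document:", rootfile)

    # Show the XHTML, CSS and XML entries that make up the publication.
    for name in zf.namelist():
        if name.endswith((".xhtml", ".html", ".css", ".opf", ".xml")):
            print(name)
```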
Strategy for archiving digital records at the Danish National Archives
Strategy for archiving digital records at the Danish National Archives. Statens Arkiver. January 2013.
Their aim is to ensure the preservation of records that are of historical value, or that serve as documentation of significant administrative matters or legal importance for citizens and authorities. The vision is to ensure that digital records are preserved so as to maintain their authenticity, and so that they can be found and reused. Preserving digital information for the long term, in a form that makes it reusable, requires some deliberate choices to be made in terms of methods, technologies and documentation. Digital preservation must also take economic considerations into account.
The basic strategy choice faced by preservation institutions is whether to pursue an emulation strategy or a migration strategy. This will determine how digital preservation in the institution is organised. The Danish National Archives have chosen a migration strategy, which requires the Archives to migrate digital records to a few, well-defined standard formats and, from time to time, to migrate them to new formats and structures.
The Danish National Archives’ strategy must not be dependent on continuous access to the system in which the data was originally created. It must be possible to interpret and re-use data in other systems. The term “original” cannot be applied in the same way to digital records. Whether data is extracted from tables in a database or from digital documents, a representation of the content is preserved in the preservation format. A digital archive primarily preserves data or information. The key aspect is the preservation of authentic information. The implementation of the strategy requires:
- Early identification and approval of systems for submission purposes
- Frequent submission in non-system dependent format
- Ongoing planning of preservation and periodical migration to a new preservation format
The Archives will regularly review the implementation of its strategy so that the vision remains attainable and within reach.
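To illustrate the "few, well-defined standard formats" idea in practice, here is a minimal sketch that flags files whose extensions fall outside an approved preservation-format list and would therefore be migration candidates. The format list and directory name are illustrative assumptions, not the Danish National Archives' actual submission profile.

```python
from pathlib import Path

# Hypothetical preservation-format profile; the Archives' actual approved
# formats are defined in their own submission standards.
APPROVED_SUFFIXES = {".tif", ".tiff", ".xml", ".csv", ".txt", ".wav", ".jp2"}

def migration_candidates(root):
    """Yield files that are not in an approved preservation format
    and would need migration before long-term storage."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() not in APPROVED_SUFFIXES:
            yield path

for candidate in migration_candidates("submission/"):  # illustrative directory
    print("needs migration:", candidate)
```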
Wednesday, June 12, 2013
Web-Archiving
Web-Archiving. Maureen Pennock. DPC Technology Watch Report 13-01. March 2013. Publicly released 24 May 2013.
This report is intended for those wanting to develop a better understanding of the issues and options for archiving web content, and for those intending to set up a web archive. Web archiving technology allows valuable web content to be preserved and managed for future generations.
Web content is lost at an alarming rate, and our digital cultural memory and organizational accountability are at risk. Organizational needs and resources must be considered when choosing web archiving tools and services. Issues with web archiving include selection of content, authenticity and integrity, quality assurance, duplication of content, legal rights, viruses, and the long-term preservation of resources. Web archiving is not a single action but often a suite of applications used in various ways at different stages of the archiving process. Archiving tools may include commercial services, Web Curator Tool, Netarchive Suite, the Heritrix web crawler, Wget, and the Wayback access interface. Archiving a simple website may be straightforward, but archiving large numbers of websites for the long term becomes much more complicated and requires a complex solution. The International Internet Preservation Consortium has played key roles in developing standards, such as the WARC standard, and archiving tools.
There are three main technical approaches:
1. Client-side archiving, using web crawlers such as Heritrix or HTTrack
2. Transactional archiving, which addresses the capture of client-side transactions
3. Server-side archiving, which requires active participation from publishing organizations
Another option being explored is the use of RSS feeds to identify and pull content into a web archive.
In spite of all of the efforts for capture and managing web content, web archives still face significant challenges, such as quality assurance issues, the need for more capable tools, and the need for better legislation. "The technical challenges of web archiving cannot, and should not, be addressed in isolation."
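As a small illustration of the WARC standard mentioned above, the sketch below lists the captured URLs in a WARC file using the open-source warcio library, which is commonly used alongside crawlers such as Heritrix and Wget. The library choice and the file name are assumptions for illustration; the report does not prescribe a particular toolchain.

```python
# Requires the third-party warcio library (pip install warcio) -- an assumption.
from warcio.archiveiterator import ArchiveIterator

def list_captures(warc_path):
    """Print the HTTP status and target URI of each response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                status = record.http_headers.get_statuscode() if record.http_headers else "-"
                print(status, uri)

list_captures("example.warc.gz")  # hypothetical file name
```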
Tuesday, June 04, 2013
Cerf sees a problem: Today's digital data could be gone tomorrow.
Cerf sees a problem: Today's digital data could be gone tomorrow. Patrick Thibodeau. Computerworld. June 4, 2013.
Vinton Cerf is concerned that much of the data that has been created in the past few decades and for years still to come, will be lost to time. Digital materials from today, such as spreadsheets, documents, presentations as well as mountains of scientific data, won't be readable in the years and centuries ahead. Software backward compatibility is very hard to preserve over very long periods of time, and the data objects are only meaningful if the software programs are available to interpret them. "The scientific community collects large amounts of data from simulations and instrument readings. But unless the metadata survives, which will tell under what conditions the data was collected, how the instruments were calibrated, and the correct interpretation of units, the information may be lost. If you don't preserve all the extra metadata, you won't know what the data means. So years from now, when you have a new theory, you won't be able to go back and look at the older data."
What is needed is a "digital vellum," a digital medium that is as durable and long-lasting as the material that has successfully preserved written content for more than 1,000 years. If a company goes out of business and there is no provision for its software to become accessible to others, all the products running that software may become inaccessible. The cloud computing environment may help; it may be able to emulate older hardware on which we can run operating systems and applications. We need to preserve the bits, but also a way of interpreting them.
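One concrete, deliberately simple way to act on the metadata point is to keep a small self-describing sidecar next to each data file, recording what the data represent, the units, how the instrument was calibrated, and a fixity checksum, so the bits and a way of interpreting them travel together. This is a minimal sketch; the field names and values are illustrative, not a published standard.

```python
import hashlib
import json
from pathlib import Path

def write_sidecar(data_file, description, units, calibration_note):
    """Write a JSON sidecar with interpretation metadata and a SHA-256 checksum."""
    data_file = Path(data_file)
    sidecar = {
        "file": data_file.name,
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "description": description,       # what the data represent
        "units": units,                    # e.g. {"temperature": "degC"}
        "calibration": calibration_note,   # how the instrument was calibrated
    }
    data_file.with_suffix(data_file.suffix + ".meta.json").write_text(
        json.dumps(sidecar, indent=2)
    )

# Illustrative call with hypothetical values:
write_sidecar(
    "readings.csv",
    "Hourly sea-surface temperature readings",
    {"temperature": "degC"},
    "Calibrated against reference probe, 2013-05-01",
)
```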
The CODATA Mission: Preserving Scientific Data for the Future
The CODATA Mission: Preserving Scientific Data for the Future. Jeanne Kramer-Smyth. Spellbound Blog. February 2013.
This is a post (and a link to the slides) about a session that was part of The Memory of the World in the Digital Age: Digitization and Preservation conference. The aim was to describe the initiatives of the Data at Risk Task Group (DARTG), which is part of the International Council for Science Committee on Data for Science and Technology (CODATA).
The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf-life. The task group is seeking out sources of such data worldwide, since many are irreplaceable for research into the long-term trends that occur in the natural world. One speaker talked about two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. Only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and become almost forgotten. It is the analog data that are considered to be “at risk” and which are the task group’s immediate concern. Some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot easily be located at all.
How can such “data at risk” be recovered and made usable? An inventory website has been set up where one can report data at risk. The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records. Data mentioned include oceanographic, climate, satellite and other scientific data sets, as well as born-digital maps. With digital preservation initiatives there is a lot of rhetoric but not so much action: there have been many consultations, studies, reports and initiatives, but not very much has translated into action.