This blog contains information related to digital preservation, long term access, digital archiving, digital curation, institutional repositories, and digital or electronic records management. These are my notes on what I have read or been working on. Please note: this does not reflect the views of my employer or anyone else.
Thursday, December 04, 2008
Hard drive sounds
Read more about each product and common problems, or listen to the sounds. They advise that if you hear these sounds and your drive is still working, you should back it up immediately. An interesting site.
PDF/A Competence Center.
Transmitting data from the middle of nowhere.
Interesting look at a marine survey company and the methods it has developed to transmit large amounts of data from remote locations. They have set up servers that reduce the amount of information transmitted by removing duplicate information and then replicating what remains. The data is backed up using IBM's Tivoli software and a StorageTek tape library with LTO-3 and LTO-4 tape drives for archiving data. "So data movement is the number one problem for anyone in the survey or field scientific world." They use Data Domain storage appliances for data backup / disaster recovery; these have a compression algorithm that is supposed to make sure that transmitted data does not already exist on the system at the San Diego data center. The appliances can also stream the data for maximum performance.
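The idea of not sending data that the receiving site already holds can be sketched in a few lines. This is only an illustration of hash-based deduplication, not a description of how the Data Domain appliances work; the fixed chunk size, the function names, and the in-memory "remote store" are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (digest, chunk) pairs for fixed-size chunks of the data."""
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def plan_transfer(data: bytes, remote_index: set, chunk_size: int = CHUNK_SIZE):
    """Return (recipe, to_send).

    recipe is the ordered list of chunk digests that describes the data;
    to_send holds only the chunks the remote side does not already have.
    """
    recipe = []
    to_send = {}
    for digest, chunk in chunk_digests(data, chunk_size):
        recipe.append(digest)
        if digest not in remote_index and digest not in to_send:
            to_send[digest] = chunk
    return recipe, to_send


def apply_transfer(recipe, to_send, remote_store: dict) -> bytes:
    """Store the newly received chunks, then rebuild the data from the recipe."""
    remote_store.update(to_send)
    return b"".join(remote_store[digest] for digest in recipe)
```

In this sketch the field site would call `plan_transfer` with the set of digests known to the data center, send only `to_send` plus the small `recipe`, and the data center would call `apply_transfer` to rebuild the file. A second survey that repeats earlier chunks would transmit only the chunks that are new.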
Friday, November 28, 2008
Digital Preservation Matters - 28 November 2008
The Future of Repositories? Patterns for (Cross-) Repository Architectures. Andreas Aschenbrenner, et al. D-Lib Magazine. November/December 2008.
Repositories have been created mostly by academic institutions to share scholarly works, for the most part using Fedora, DSpace and EPrints. While it is important to look at manageability, cost efficiency, and functionalities, we need to keep our focus on the real end user (the Scholar). The OpenDOAR directory lists over 1200 repositories. The repository adoption curve shows cycles, trends, and developments. “It is the social and political issues that have the most significant effect on the scholarly user and whether or not that user decides to use a repository.” The repository's primary mission is to disseminate the university's primary output. Researchers, not institutions, are the most important users of repositories. The benefits of repositories may not be clear to researchers, and the repository needs to “become a natural part of the user's daily work environment.” To do this we should focus on features such as:
- Preserve the user's intellectual assets in a long-term trusted digital repository
- Allow scientific collaboration through reuse of publications as well as primary data
- Embed repositories into the user's scientific workflows and technology (workbench)
- Customize the repository to the local user needs and technology
- Manage intellectual property rights and security
Individual repositories may not be able to address all these issues. Preservation is one of the main motivators for people to use a repository. “Trust in a stable and secure repository service is established through the repository's policies, status among peers, and added-value services.” Users want someone to take responsibility for the servers and tools. Trust depends on:
- The impact a service has on users' daily lives
- How the service blends into their routine
- Whether the repository's policies and benefits work for the users.
Managing the collective Collection. Richard Ovenden. OCLC. 6 November 2008. [pdf]
A PowerPoint presentation on managing a collection in the future. Looks at Uniformity vs. Uniqueness, and the sameness of e-resources. The collective collection is now an aggregated digital collection rather than a distributed print collection. Access to the core aggregated collection is no longer a factor of time and craft but one of money. With this new sense of uniformity, uniqueness has a new value.
- Local unique: Sensible stewardship of locally-generated assets:
  - University archives
  - Research Data
- Global unique: Selected and curated content that has been actively acquired through competition
  - Personal digital collections
  - Copy-specific printed books
- Personal digital collections: new phenomenon, new problem.
  - Acquisition from older media
  - New management issues
- Preservation of the unique more important than ever.
  - Who will bear the cost of keeping print?
  - New models of collaboration
Expectations of the Screenager Generation. Lynn Silipigni Connaway. OCLC. 6 November 2008. [pdf]
A lot of information here. Some notes: Some attitudes of the newer generation: Information is information; Media formats don’t matter; Visual learners; Different research skills. They meet Information Needs mostly through the Internet or other people. They are attracted to resources based on convenience, immediate answers, and no cost. They prefer to do their own research. They don’t use libraries because they don’t know about them, they are satisfied with other sources, or the library takes too long or is too difficult to use. The image of libraries is Books. They do not think of a library as an information resource. Search engines are trusted about the same as a library. What can we do? Encourage, promote, use creative marketing, build relationships, understand their needs better.
Digital preservation of e-journals in 2008: Urgent Action revisited. Portico. January 2008. [pdf]
The document has been out for a while, but I found it interesting in light of current efforts. It presents the results of a survey concerning eJournals. The survey was designed to:
- Analyze attitudes and priorities that can be used to guide Portico
- Assist library directors in prioritizing and allocating limited resources.
Here are some of the findings:
- 76% said they do not yet participate in an e-journal preservation initiative.
- 71% felt it would be unacceptable to lose access to e-journal materials permanently
- 82% agreed that “libraries need to support community preservation initiatives because it’s the right thing to do.”
- 73% agreed that “our library should ensure that e-journals are preserved somewhere”
- 4% believed preservation could be achieved by publishers holding redundant copies of eJournals
Libraries are unsure about how urgent the issue is and whether they need to take any action in the next two years. This appears to follow the interest of the faculty in the issue. Where the library was interested in eJournal preservation, 74% had been approached by faculty on the issue. When the library was not interested, only 34% had ever been approached by faculty, and less than 10% had ever been approached by faculty more than twice. Many libraries feel the issue is complicated and are not sure who should preserve the eJournals. They are uncertain about the best approach, and there are competing priorities. “Research institutions are far more likely than teaching institutions to have taken action on e-journal preservation.” Most libraries do not have an established digital preservation budget, and the money is borrowed from other areas, such as the collections budget.
Friday, November 21, 2008
Digital Preservation Matters - 21 November 2008
Archives: Challenges and Responses. Jim Michalko. OCLC. 6 November 2008. [pdf]
Interesting view of ‘The Collective Collection’. A framework for representing content.
- Published Content: books, journals, newspapers, scores, maps, etc.
- Special Collections: Rare books, local histories, photos, archives, theses, objects, etc.
- Open Web Content: Web resources, open source software, newspaper archives, images, etc.
- Institutional Content: ePrints, reports, learning objects, courseware, manuals, research, data, etc.
Select>Deliver>Describe>Acquire>Appraise>Survey>Disclose>Discover
Managing the Collective Collection: Shared Print. Constance Malpas. OCLC. 6 November 2008. [pdf]
There is concern that many print holdings will be ‘de-duped’ and that there will not be enough copies left to maintain the title. Some approaches are offsite storage, digitization, and distributed print archives. “Without system-wide frameworks in place, libraries will be unable to make decisions that effectively balance risk and opportunity with regard to de-accessioning of print materials.” The average institutional holdings in WorldCat: for serials=13; for books=9. Up to 40% of book titles have a single institution holding. There is a need for a progressive preservation strategy.
Ancient IBM drive rescues Apollo moon data. Tom Jowitt. Computerworld. November 12, 2008.
Data gathered by the Apollo missions to the moon 40 years ago looks like it may be recovered after all, thanks to a donation of an “ancient” IBM tape drive. The mission data had been recorded onto 173 data tapes, which had then been 'misplaced' before they could be archived. The tapes have been found, but there was no drive available to read the data; one has now been found at the Australian Computer Museum Society. It will require some maintenance to restore it to working condition. "It's going to have to be a custom job to get it working again," which may take several months.
Google to archive 10 million Life magazine photos. Heather Havenstein. Computerworld. November 18, 2008.
Google plans to archive as many as 10 million images from the Life magazine archives, and about 20% are already online. Some of the images date back to the 1750s; many have never been published. The search archive is here.
PREMIS With a Fresh Coat of Paint. Brian F. Lavoie. D-Lib Magazine. May/June 2008.
Highlights from the Revision of the PREMIS Data Dictionary for Preservation Metadata. This looks at PREMIS 2.0 and the changes made:
- Update to the data model clarifying relation between Rights and Agents, and Events and Agents
- Completely revised and expanded Rights entity: a more complete description of rights statements
- A detailed, structured set of semantic units to record information about significant properties
- Added the ability to accommodate metadata from non-PREMIS specifications
- A proposed registry of suggested values for semantic units
Thursday, November 20, 2008
Top five IT spending priorities for hard times
With the current economic times, organizations are busy looking to see what costs they can cut. Analysts agree these areas need to be funded.
- Storage: Disks and management software. For many the largest expenditure is storage. Data doubles yearly.
- Business intelligence: Niche analytics. Information and resources to help accomplish key goals.
- Optimizing resources. Get the most out of what you already have.
- Security. Keeping the resources secure.
- Cloud computing: Business solutions.
PC Magazine will be online only
Ziff Davis Media announced it was ending print publication of its 27-year-old flagship PC Magazine; following the January 2009 issue, it will be online only. "The viability for us to continue to publish in print just isn't there anymore." PC Magazine derives most of its profits from its Web site. More than 80 percent of the profit and about 70 percent of the revenue come from the digital business. This is not too much of an adjustment since all content goes online first, and then the print version has been choosing what it wants to print.
A number of other magazines have ended their print publications. The magazines that have gone to online only have been those that are declining. "Magazines in general are going to be dependent on print advertising for a long time into the future."
Massive EU online library looks to compete with Google
The European Union is launching the Europeana digital library, an online digest of Europe's cultural heritage, consisting of millions of digital objects, including books, film, photographs, paintings, sound files, maps, manuscripts, newspapers, and documents.
The prototype will contain about two million digital items already in the public domain. By 2010, the date when Europeana is due to be fully operational, the aim is to have 10 million works available of the estimated 2.5 billion books in Europe's more common libraries. The project plans to be available in 21 languages, though English, French and German will be most prevalent early on.
Thursday, November 13, 2008
Digital Preservation Matters - 14 November 2008
Library of Congress Digital Preservation Newsletter. Library of Congress. November 2008.
There are three interesting items in the November newsletter:
1. The NDIIPP Preserving Digital Public Television Project is building infrastructure, creating standards and obtaining resources. The project is trying to create a consistent approach to digital curation among those who produce PBS programs. Their metadata schema includes four elements: PBCore (a standard developed by and for public media organizations), METS rights, MODS and PREMIS. The goal is to put the content in the Library’s National Audio-Visual Conservation Center where it will be preserved on servers and data tapes. This will support digital archiving and access for public television and radio programs in the US. Many stations are unsure about what to do with their programs for the long term and the American Archive is seen as a solution.
2. Digitization Guidelines: An audiovisual working group will set standards and guidelines for digitizing audiovisual materials. The guidelines will cover criteria such as evaluating image characteristics and establishing metadata elements. The recommendations will be posted on two Web sites:
www.digitizationguidelines.gov/stillimages/
www.digitizationguidelines.gov/audio-visual/
3. Data Archive Technology Alliance: A meeting was held to establish a network of data archives to help develop shared technologies for the future. They hope to set standards for shared, open-source and community developed technologies for data curation, preservation, and data sharing. It is critical to clearly define the purpose and outcome of the effort. Those involved will develop a shared inventory of their tools, services, and also list new developments to enhance data stewardship.
JHOVE2 project underway. Stephen Abrams. Email. November 6, 2008.
The JHOVE tool has been an important part of digital repository and preservation workflows. It has a number of limitations and a group is starting a two-year project to develop a next-generation JHOVE2 architecture for format-aware characterization. Among the enhancements planned for JHOVE2 are:
- Support for signature-based identification, feature extraction, validation, and rules-based assessment
- A data model supporting complex multi-file objects and arbitrarily-nested container objects
- Streamlined APIs for integrating JHOVE2 in systems, services, and workflows
- Increased performance
- Standardized error handling
- A generic plug-in mechanism supporting stateful multi-module processing
- Availability under the BSD open source license
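To make "signature-based identification" concrete, here is a minimal sketch of the general technique: compare the leading bytes of a file against a table of known format signatures. This is not JHOVE or JHOVE2 code, and it does not reflect their APIs; the table, the function name `identify`, and the choice of formats are assumptions for illustration. The byte signatures themselves are the commonly documented ones for these formats.

```python
# A small, illustrative signature table: (format name, byte offset, magic bytes).
SIGNATURES = [
    ("PDF", 0, b"%PDF-"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("GIF", 0, b"GIF87a"),
    ("GIF", 0, b"GIF89a"),
    ("JPEG", 0, b"\xff\xd8\xff"),
    ("ZIP", 0, b"PK\x03\x04"),
]


def identify(data: bytes) -> str:
    """Return the first format whose signature matches, or "unknown"."""
    for name, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"
```

Identification of this kind only says what a file claims to be. Validation (checking that the file actually conforms to the format) and feature extraction are separate steps, which is why the list above names them individually.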
Planetarium - Planets Newsletter Issue 5. 22 October 2008 [PDF]
The newsletter includes several items about Planets (Preservation and Long-term Access through Networked Services), which is a European project to address digital preservation challenges. Here are a few items from the newsletter: Project Planets will provide the technology component of The British Library digital preservation solution.
The preservation planning tool Plato implements the PLANETS Preservation Planning approach. It guides users through four steps:
- define context and requirements;
- select potential actions and evaluate them on sample content;
- analyze outcomes; and
- define a preservation plan based on this empirical evidence.
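One simple way to picture the "evaluate and analyze" steps is a weighted score per candidate action. The sketch below is my own illustration of that general idea, not Plato's actual method or data model; the criteria, weights, scores, and function names are all hypothetical.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores (higher is better)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def rank_actions(evaluations: dict, weights: dict) -> list:
    """Return (score, action) pairs sorted from best to worst."""
    ranked = [(weighted_score(scores, weights), action)
              for action, scores in evaluations.items()]
    return sorted(ranked, reverse=True)


# Hypothetical requirements (weights) and results measured on sample content (0-5).
weights = {"content preserved": 5, "cost": 2, "open format": 3}
evaluations = {
    "migrate to PDF/A": {"content preserved": 4, "cost": 3, "open format": 5},
    "keep original format": {"content preserved": 5, "cost": 5, "open format": 1},
}
```

Calling `rank_actions(evaluations, weights)` puts the candidate with the highest weighted score first, which gives the plan an empirical basis that can be written down and revisited.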
Digital preservation activities can only succeed if they consider the wider strategy, policy, goals, and constraints of the institution that undertakes them. For digital preservation solutions to succeed it is essential to go beyond the technical properties of the digital objects to be preserved, and to understand the institutional framework in which data, documents and records are preserved. The biggest barriers to preservation are:
- lack of expertise
- funding and
- buy-in at senior level.
Cisco unveils a router for the 'Zettabyte Era'. Matt Hamblen. Computerworld. November 11, 2008.
Cisco introduced the "Zettabyte Era," and announced the Aggregation Services Router (ASR) 9000, the next generation of extreme networking. They believe service providers need to prepare for petabytes or even exabytes of data from video applications, which need faster routing. “Instead of needing switching for petabytes or even exabytes of data, the zettabyte will soon be the preferred term,” a unit equal to 10 to the power of 21 bytes.
In praise of ... preserving digital memories. Editorial. The Guardian. September 30, 2008.
Some people are thinking centuries ahead. The British Library hosted the iPres conference to work out ways to preserve data for future generations. Since most everything is in digital form now, this is a difficult thing to do. By 2011 “it is expected that half of all content created online will fall by the wayside.” There is no Rosetta Stone for digital but progress is being made.
Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs. Alma Swan, Sheridan Brown. JISC. 31 July 2008.
The report of a study that looks at those who work with data.
It identifies four roles, which may overlap:
- Data Creator: Researchers who produce and are experts in handling, manipulating and using data
- Data Scientist: Those who work where the research is carried out and may be involved in creative enquiry and analysis
- Data Manager: Those who take responsibility for computing facilities, storage, continuing access and preservation of data
- Data Librarian: Librarians trained and specializing in the curation, preservation and archiving of data
There is a continuing challenge to make sure people have the skills needed. Three main potential roles for the library:
- Training researchers to be more data-aware
- Adopt a data archiving and preservation role; provide services through institutional repositories
- Training of data librarians
Caring for the data frees data scientists from the task and allows them to focus on other priorities. Data issues are moving so fast that periodic updating is much more effective than an early, intensive training with no follow-up. Some institutions offer training courses and workshops on data-related topics.
Tuesday, November 11, 2008
JHOVE2 project underway
Sent: Thursday, November 06, 2008 3:43 PM
JHOVE2 project underway
The open source JHOVE characterization tool has proven to be an important
component of many digital repository and preservation workflows. However, its
widespread use over the past four years has revealed a number of limitations
imposed by idiosyncrasies of design and implementation. The California Digital
Library (CDL), Portico, and Stanford University have received funding from the
Library of Congress, under its National Digital Information Infrastructure
Preservation Program (NDIIPP) initiative, to collaborate on a two-year project
to develop a next-generation JHOVE2 architecture for format-aware
characterization.
Among the enhancements planned for JHOVE2 are:
* Support for four specific aspects of characterization: signature-based
identification, feature extraction, validation, and rules-based assessment
* A more sophisticated data model supporting complex multi-file objects and
arbitrarily-nested container objects
* Streamlined APIs to facilitate the integration of JHOVE2 technology in
systems, services, and workflows
* Increased performance
* Standardized error handling
* A generic plug-in mechanism supporting stateful multi-module processing;
* Availability under the BSD open source license
To help focus project activities we have recruited a distinguished advisory
board to represent the interests of the larger stakeholder community. The board
includes participants from the following international memory institutions,
projects, and vendors:
* Deutsche Nationalbibliothek (DNB)
* Ex Libris
* Fedora Commons
* Florida Center for Library Automation (FCLA)
* Harvard University / GDFR
* Koninklijke Bibliotheek (KB)
* MIT / DSpace
* National Archives (TNA)
* National Archives and Records Administration (NARA)
* National Library of Australia (NLA)
* National Library of New Zealand (NLNZ)
* Planets project
The project partners are currently engaged in a public needs assessment and
requirements gathering phase. A provisional set of use cases and functional
requirements has already been reviewed by the JHOVE2 advisory board.
The JHOVE2 team welcomes input from the preservation community, and would
appreciate feedback on the functional requirements and any interesting test
data that have emerged from experience with the current JHOVE tool.
The functional requirements, along with other project information, are available
on the JHOVE2 project wiki
<http://confluence.ucop.edu/display/JHOVE2Info/Home>. Feedback on project goals
and deliverables can be submitted through the JHOVE2 public mailing lists.
To subscribe to the JHOVE2-TechTalk-L mailing list, intended for in-depth
discussion of substantive issues, please send an email to <listserv at ucop dot
edu> with an empty subject line and a message stating:
SUB JHOVE2-TECHTALK-L Your Name
Likewise, to subscribe to the JHOVE2-Announce-L mailing list, intended for
announcements of general interest to the JHOVE2 community, please send an email
to <listserv at ucop dot edu> with an empty subject line and a message stating:
SUB JHOVE2-ANNOUNCE-L Your Name
To begin our public outreach, team members recently presented a summary of
project activities at the iPRES 2008 conference in London, entitled "What? So
What? The Next-Generation JHOVE2 Architecture for Format-Aware
Characterization," reflecting our view of characterization as encompassing both
intrinsic properties and extrinsic assessments of digital objects.
Through the sponsorship of the Koninklijke Bibliotheek and the British Library,
we also held an invitational meeting on JHOVE2 following the iPRES conference
as an opportunity for a substantive discussion of the project with European
stakeholders.
A similar event, focused on a North American audience, will be held as a
Birds-of-a-Feather session at the upcoming DLF Fall Forum in Providence, Rhode
Island, on November 13. Participants at this event are asked to review closely
the functional requirements and other relevant materials available on the
project wiki at <http://confluence.ucop.edu/display/JHOVE2Info/Home> prior to
the session.
Future project progress will be documented periodically on the wiki.
Stephen Abrams, CDL
Evan Owens, Portico
Tom Cramer, Stanford University
on behalf of the JHOVE2 project team
Friday, November 07, 2008
Digital Preservation Matters - 07 November 2008
Digital Preservation Policies Study. Neil Beagrie, et al. JISC. 30 October 2008. [pdf]
This study will become part of the foundation documents for digital preservation. It provides a model for digital preservation policies and looks at the role of digital preservation in supporting and delivering strategies for educational institutions. The study also includes 1) a model/framework for digital preservation policies; 2) a series of mappings of digital preservation to other key institutional strategies in universities, libraries, and Records Management. This is intended to help institutions develop appropriate digital preservation policies. Some notes:
Long-term access relies heavily on digital preservation strategies being in place and we should focus on making sure they are in place. Developing a preservation policy will only be worthwhile if it is linked to core institutional strategies: it cannot be effective in isolation. One section clearly outlines the steps that must be taken to implement a digital preservation solution. Policies should outline what is preserved and what is excluded. Digital preservation is a means, not an end in itself. Any digital preservation policy must be seen in terms of the strategies of the institution. An appendix summarizes the strategy aims and objectives for certain institutions and the implications for digital preservation activities within the organization. Definitely worth studying the approximately 120 pages.
Predicting the Longevity of DVDR Media by Periodic Analysis of Parity, Jitter, and ECC Performance Parameters. Daniel Wells. BYU Thesis. July 14, 2008.
The summarizing statement for me was: “there is currently extreme reluctance to use DVD-R’s for future digital archives as well as justifiable concern that existing DVD archives are at risk.” We have certainly found this in our own experience, having very high failure rates with some collections.
The abstract: For the last ten years, DVD-R media have played an important role in the storage of large amounts of digital data throughout the world. During this time it was assumed that the DVD-R was as long-lasting and stable as its predecessor, the CD-R. Several reports have surfaced over the last few years questioning the DVD-R's ability to maintain many of its claims regarding archival quality life spans. These reports have shown a wide range of longevity between the different brands. While some DVD-Rs may last a while, others may result in an early and unexpected failure. Compounding this problem is the lack of information available for consumers to know the quality of the media they own. While the industry works on devising a standard for labeling the quality of future media, it is currently up to the consumer to pay close attention to their own DVD-R archives and work diligently to prevent data loss. This research shows that through accelerated aging and the use of logistic regression analysis on data collected through periodic monitoring of disc read-back errors it is possible to accurately predict unrecoverable failures in the test discs. This study analyzed various measurements of PIE errors, PIE8 Sum errors, POF errors and jitter data from three areas of the disc: the whole disc, the region of the disc where it first failed as well as the last half of the disc. From this data five unique predictive equations were produced, each with the ability to predict disc failure. In conclusion, the relative value of these equations for end-of-life predictions is discussed.
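For readers unfamiliar with logistic regression, the general shape of such a predictive equation can be sketched as follows. The coefficients, the choice of two predictors, and the function names are hypothetical and chosen only to show the form; the thesis derives its own five equations from measured data, and those are not reproduced here.

```python
import math

# Hypothetical coefficients for illustration only (not taken from the thesis).
INTERCEPT = -6.0
COEF_PIE8_MAX = 0.02  # weight on the maximum PIE8 Sum reading
COEF_JITTER = 0.25    # weight on jitter, in percentage points


def failure_probability(pie8_max: float, jitter_pct: float) -> float:
    """Logistic model: map a linear combination of readings to a value in (0, 1)."""
    z = INTERCEPT + COEF_PIE8_MAX * pie8_max + COEF_JITTER * jitter_pct
    return 1.0 / (1.0 + math.exp(-z))


def at_risk(pie8_max: float, jitter_pct: float, threshold: float = 0.5) -> bool:
    """Flag a disc whose estimated failure probability reaches the threshold."""
    return failure_probability(pie8_max, jitter_pct) >= threshold
```

With a fitted model of this form, periodic scans of an archive could flag discs to copy before they become unreadable: under these made-up coefficients, a disc with low error counts scores near zero and one with high counts scores near one.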
DCC Curation Lifecycle Model. Chris Rusbridge. Digital Curation Centre Blog. 8 October 2008.
The model they have put together is available in graphical form. Like all models it is of course a compromise between succinctness and completeness. They plan to use it to structure information on standards and as an entry point to the DCC web site, and it is explained in an article in the International Journal of Digital Curation. The model is a high level overview of the stages required for successful curation, and complements OAIS and other standards. The actions for Digital Objects or Databases are:
- Full Lifecycle Actions: Description and Representation Information; Preservation Planning; Community Watch and Participation; Curate and Preserve
- Sequential Actions: Conceptualise; Create or Receive; Appraise and Select; Ingest; Preservation Action; Store; Access, Use and Reuse; Transform
- Occasional Actions: Dispose; Reappraise; Migrate
The model is part of a larger plan to take a detailed look at processes, costs, governance and implementation.
WVU Libraries Selected for Digital Pilot Project. September 15, 2008.
The West Virginia University Libraries are among 14 institutions picked to participate in a book digitization pilot project led by PALINET. Each institution will submit five to ten books to be digitized during a pilot project. After that, the initial target will be to digitize 60,000 books and put them in the Internet Archive. “Another benefit of the project is preservation.” The Rare Books Curator said a dilemma is allowing access while still providing for the maximum amount of preservation. “These books are old and they’re fragile, and there is always the difficulty of preserving a book that is used a lot. Maintaining that balance is essential. It’s a fine line that we’re always on. Book digitization is a way of providing access and assuring preservation of the original.”
Friday, October 31, 2008
Digital Preservation Matters - 31 October 2008
Google Settles Book-Scan Lawsuit, Everybody Wins. Chris Snyder. Wired. October 28, 2008.
Google settled a lawsuit by agreeing to pay $125 million to authors and publishers. In addition, out of print, copyright protected books will still be scanned and publishers have the option to activate a “Buy Now” button so readers can download a copy of the book. Google will take a 37 percent share of the profits, plus an administrative fee of 10 to 20 percent, and the remainder goes to authors and publishers. This creates a market for out-of-print works that were not likely to get back into "print" any other way, and it establishes a new non-profit Book Rights Registry to manage royalties.
Universities and institutions can buy a subscription service to view the entire collection, and U.S. public libraries will have terminals for students and researchers to view the catalog for free.
Christian Science Monitor Goes All in on the Web. Meghan Keane. Wired. October 28, 2008.
The Christian Science Monitor plans to halt publication of its Monday through Friday newspaper in favor of daily web content. They are also creating a weekly Sunday magazine. This will cut The Monitor's subscription revenue in half, but it will also cut overhead in half as well. "Maybe the reason newspapers could go out of business is because they think they're in the newspaper business instead of the news gathering and dissemination business. To hang on to a two century old technology just because that’s the way we’ve always done it, that’s a recipe for failure."
Transition or Transform? Repositioning the Library for the Petabyte Era. Liz Lyon. UKOLN. ARL / CNI Forum. October 2008. [PowerPoint]
A recent study shows that data is continually re-analysed and new analytic techniques add value to older data. Data-sharing is seen as a form of trade or gift exchange: “give to get” rather than “give away”.
Preservation & sustainability Recommendations:
- Use DRAMBORA for self-assessment of data repositories
- Add PREMIS preservation metadata
- Collect representation information
- Check that the repository conforms to the OAIS Model
- Survey partner preservation policies
Some challenges:
- Understand and manage risks
- Building a consensus in the community
- Appraisal and selection criteria
- Document the data; add metadata; validate
- Data provenance, authenticity
Mourning Old Media’s Decline. David Carr. The New York Times. October 28, 2008.
There have been a number of newspapers having difficulties, not just the Christian Science Monitor. “The paradox of all these announcements is that newspapers and magazines do not have an audience problem … but they do have a consumer problem.” People get their information on the internet more than paper, but why does it matter? “The answer is that paper is not just how news is delivered; it is how it is paid for.” Part of the difficulty is that the move to digital media means that there are fewer people now employed in the industry who provide or report the information. The Google CEO said if the trusted brands of journalism vanish, the internet becomes a “cesspool” of useless information.
Wednesday, October 29, 2008
Christian Science Monitor Goes All in on the Web
Christian Science Monitor Goes All in on the Web. Meghan Keane. Wired. October 28, 2008.
The Christian Science Monitor plans to halt publication of its Monday through Friday newspaper in favor of daily web content. They are also creating a weekly Sunday magazine. This will cut The Monitor's subscription revenue in half, but it will also cut overhead in half as well. "Maybe the reason newspapers could go out of business is because they think they're in the newspaper business instead of the news gathering and dissemination business. To hang on to a two century old technology just because that’s the way we’ve always done it, that’s a recipe for failure."
Google Settles Book-Scan Lawsuit, Everybody Wins
Google Settles Book-Scan Lawsuit, Everybody Wins. Chris Snyder. October 28, 2008.
Google settled a lawsuit by agreeing to pay $125 million to authors and publishers. In addition, out-of-print, copyright-protected books will still be scanned, and publishers have the option to activate a “Buy Now” button so readers can download a copy of the book. Google will take a 37 percent share of the profits, plus an administrative fee of 10 to 20 percent, and the remainder goes to authors and publishers. This creates a market for out-of-print works that were not likely to get back into "print" any other way, and establishes a new non-profit Book Rights Registry to manage royalties.
Universities and institutions can buy a subscription service to view the entire collection, and U.S. public libraries will have terminals for students and researchers to view the catalog for free.
Friday, October 24, 2008
Digital Preservation Matters - 24 October 2008
HathiTrust: A Digital Repository for Libraries, by Libraries. Beth Ashmore. Information Today. October 23, 2008.
HathiTrust is a shared digital repository of two dozen libraries aimed at bringing the vast collections of print books and journals in libraries into the digital world for access, discovery, and preservation. "We have become convinced that there are some approaches to using this content, from an academic standpoint, that Google may not address." One of the areas in which the projects diverge is the importance placed on long-term preservation. "[Long-term preservation] is something we feel libraries need, and I think it has been one of the concerns about Google as a digitization partner. These resources need to be, in the long term, managed by libraries. This is something Google understood from the beginning in their partnership with us." The goals include an open technical framework and a public discovery system, which will hopefully be available in early 2009.
Bringing a Trove of Medieval Manuscripts Online for the Ages. John Tagliabue. The New York Times. October 20, 2008.
One of the oldest and most valuable collections of handwritten medieval books in the world, housed in the library of the abbey in St. Gallen, Switzerland, is going online with the help of a $1 million grant from the Andrew W. Mellon Foundation. The reduced price of computer memory helped make this possible. This will make the library more visible. “On the Internet we now have more visitors than in the real library.”
Government Documents Online. NELA Conference Blog. CT State Library. October 19, 2008.
Julie Schwartz’s presentation: The Connecticut State Library initiated the Connecticut Digital Archive Project because so many state documents and reports are now only available online, and often are posted for only a month or two and then disappear. Search engines don’t provide access to most of these publications even though users expect easy access. The archive project harvests and ingests “born digital” Connecticut state publications, catalogues them in MARC, integrates linked records into the library's OPAC, and makes them available in WorldCat. The WebHarvest program can harvest an entire webpage and sometimes finds other documents on archived pages. Preservation metadata is important for accuracy and management. Standardization is an important principle of digital preservation.
Seagate's 1.5TB Barracuda drive -- quiet, sips power. Rich Ericson. Computerworld. October 23, 2008.
Seagate has a new 1.5 terabyte drive that delivers good performance in a single internal hard drive, despite the large capacity.
Friday, September 26, 2008
Digital Preservation Matters - 26 September 2008
iPres 2008 Web archiving. Digital Curation Blog. 30 September 2008.
Thorsteinn Hallgrimsson: Some web tools include the Heritrix crawler, Web Curator from BL/NZNL, and Netarchive Curator Tool Suite from Denmark, plus access tools including NutchWAX for indexing, and an open source version of the Wayback Machine. There are three main approaches to web archiving: bulk, selective based on criteria, and event-based, such as around an election, disaster, etc.
Colin Webb: Challenges are interconnected: what we want to collect, what we’re allowed to collect, what we’re able to collect, and what we can afford to collect.
Challenges of web archiving: How do you select material? Is it the information or the ‘experience’ of the web page that is important? How can you move web documents between curatorial environments? “Even those who care about information persistence don’t necessarily do a good job of it on their Web sites.” Not everything on the web needs to be kept. The JISC PoWR (Preservation of Web Resources) project has created a blog and workshops to help develop best practices for web archiving. There are legal challenges, and they bring some risks.
Copyright Act change shifts software rights. Ulrika Hedquist. Computerworld. 29 September, 2008.
A change in New Zealand’s copyright law may affect who owns software. An amendment to the Copyright Act was introduced that would repeal the commissioning rule for software developers.
The general rule is that the creator of an artistic work or software holds the copyright to it. The commissioning rule is an exception which means that the commissioner of a work is the default copyright holder. Under the current rule, software developers have no rights to code developed for clients unless there is a contract in place saying otherwise. If enacted, the amendment could make significant changes to the industry.
Friday, September 19, 2008
Digital Preservation Matters - 19 September 2008
When to shred: Purging data saves money, cuts legal risk. Mary Brandel. Computerworld. September 18, 2008.
Many organizations never throw away data unless they run out of storage, and they increase the amount of data by 20% - 50% each year. Not everything can or should be saved; it is important to decide what should be kept and for how long. Many organizations should be saving less data. The volume of data is growing faster than the cost of storage is declining. The cost of storing and backing up data, including multiple copies of data, is increasing, as is the cost of e-discovery for lawsuits, which can range from $1 million to $3 million per terabyte of data. Electronic records management can help value the data and determine the retention period.
Edinburgh Repository Fringe. Website. August 2008.
This is a website of a ‘repository festival’ in Edinburgh which looks at repository issues, ideas, new perspectives, new projects, and interaction about repositories. It includes some documents, slides, and video streams of the discussions. A few items from the sessions:
- Faculty repositories: variety of sources, aware they need to make data available, most stored on department servers or desktops, sharing is often by email, large datasets are a problem. They want them published on the web and find linking very useful.
- They want a secure and user-friendly way to store and share research data, as well as the infrastructure to publish and preserve data.
- We need to gather requirements, look at current and planned services, meet needs.
- Promote favorable information: 87% said items found at the top of search results are seen as more authoritative
Poor E-Mail Archive Habits Plague Businesses. Leo King. Computerworld. August 31, 2008.
Research shows that employees do not properly archive e-mails because they are either too busy or are unsure how. Most employees do not receive guidance on how they should be archiving their email; many organizations do not have a policy.
- 30% said they had lost important documents
- 50% say email archiving is too time consuming
- 30% say it is too complicated
- 41% leave files attached to e-mails forever
- 50% have an enforced limit on their email storage
- Over 25% save the files to the company system
- 28% save them to their hard drive
Thursday, September 18, 2008
IMLS funds TIPR Demonstration Project
The Cornell University Library, New York University Libraries and the Florida Center for Library Automation are happy to announce the receipt of an IMLS National Leadership Grant for the demonstration project:
Towards Interoperable Preservation Repositories (TIPR).
The task of preserving our digital heritage for future generations far exceeds the capacity of any government or institution. Responsibility must be distributed across a number of stewardship organizations running heterogeneous and geographically dispersed digital preservation repositories. For reasons of redundancy, succession planning and software migration, these repositories must be able to exchange copies of archived information packages with each other. Practical repository-to-repository transfer will require a common, standards-based transfer format capable of transporting rich preservation metadata as well as digital objects, and repository systems must be capable of exporting and importing information packages utilizing this format.
The three TIPR partners run three technically heterogeneous, geographically distributed digital preservation repositories. Cornell University Library runs CUL-OAIS based on aDORe, New York University Libraries' Preservation Repository is based on DSpace, and the FCLA's Florida Digital Archive uses DAITSS. The TIPR partners will:
* design a shared transfer format based on METS and PREMIS schemas;
* enhance each of their preservation repository systems to support import and export of this information;
* test the actual transfer of processed and enriched archival information packages between the three repository systems.
The goals of the project are to:
* demonstrate the feasibility of repository-to-repository transfer of rich archival information packages;
* advance the state of the art by identifying and resolving issues that impede such transfers;
* develop a usable, standards-based transfer format, building on prior work;
* disseminate these results to the international preservation community and the relevant standards activities.
This two-year project will begin October 1, 2008.
Friday, September 12, 2008
Digital Preservation Matters - 12 September 2008
It’s Happening Now: This is the Tera Era of Data Storage. Larry Swezey. Computer Technology Review. 16 September 2008.
New visual and audio content is driving storage capacities upward, but the new digital data explosion is very different. More data is being produced and it is becoming a more important part of all aspects of our lives. But the large files we see now are just the beginning. The size of files and the amount of data are increasing dramatically. The sizes are moving into the terabyte range [already there in many cases]. More data is being retained. AV items demand more storage. People expect large amounts of information to be available almost immediately.
Using METS, PREMIS and MODS for Archiving eJournals. Angela Dappert, Markus Enders. D-Lib Magazine. September/October 2008.
Many decisions need to be made on metadata, including the structural and preservation metadata. The British Library is developing a system for ingest, storage, and preservation of digital material, with eJournals as the first content stream, and is developing a common format for the eJournal OAIS Archival Information Package (AIP). eJournals are complex and outside the control of the digital repository, so the repository cannot rely on structured submission packages, format standards and the like. This article shows one approach to defining an eJournal Archival Information Package. The system has a database that provides an interface for resource discovery and delivery. An archival store is a long-term storage component that supports preservation activities. All archival metadata is linked to the content and placed into the archival store. The archival metadata is represented as a hierarchy of METS files with PREMIS and MODS components that reference all content. Each manifestation of an article is stored in a separate METS file. There is no existing metadata schema that has all the descriptive, preservation and structural metadata, but this is how they use a combination of METS, PREMIS and MODS to create an eJournal Archival Information Package.
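To make the idea of one METS file per manifestation more concrete, here is a minimal sketch in Python. It is not the British Library's profile: the function name, the IDs (DMD1, TECH1, FILE1), and the choice of which PREMIS elements to include are my own assumptions, and the output is not validated against the METS, MODS, or PREMIS schemas. It only shows the general shape: MODS wrapped in a dmdSec, a PREMIS object wrapped in an amdSec, and a fileSec and structMap that point at the content.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS, MODS, PREMIS (version 2) and XLink.
METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
PREMIS = "info:lc/xmlns/premis-v2"
XLINK = "http://www.w3.org/1999/xlink"

for prefix, uri in (("mets", METS), ("mods", MODS),
                    ("premis", PREMIS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def q(ns, tag):
    """Return a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return f"{{{ns}}}{tag}"


def wrap_metadata(parent, section_tag, section_id, mdtype):
    """Create a METS section with mdWrap/xmlData and return the xmlData element."""
    section = ET.SubElement(parent, q(METS, section_tag), ID=section_id)
    md_wrap = ET.SubElement(section, q(METS, "mdWrap"), MDTYPE=mdtype)
    return ET.SubElement(md_wrap, q(METS, "xmlData"))


def build_manifestation(title, file_url, format_name):
    """Build one illustrative METS document for a single article manifestation."""
    mets = ET.Element(q(METS, "mets"))

    # Descriptive metadata: a MODS record with just a title.
    dmd = wrap_metadata(mets, "dmdSec", "DMD1", "MODS")
    mods = ET.SubElement(dmd, q(MODS, "mods"))
    title_info = ET.SubElement(mods, q(MODS, "titleInfo"))
    ET.SubElement(title_info, q(MODS, "title")).text = title

    # Preservation metadata: a PREMIS object that records the file format.
    amd = ET.SubElement(mets, q(METS, "amdSec"))
    tech = wrap_metadata(amd, "techMD", "TECH1", "PREMIS")
    obj = ET.SubElement(tech, q(PREMIS, "object"))
    chars = ET.SubElement(obj, q(PREMIS, "objectCharacteristics"))
    fmt = ET.SubElement(chars, q(PREMIS, "format"))
    designation = ET.SubElement(fmt, q(PREMIS, "formatDesignation"))
    ET.SubElement(designation, q(PREMIS, "formatName")).text = format_name

    # File inventory: one file, linked to its PREMIS section through ADMID.
    file_sec = ET.SubElement(mets, q(METS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS, "fileGrp"))
    file_el = ET.SubElement(file_grp, q(METS, "file"), ID="FILE1", ADMID="TECH1")
    ET.SubElement(file_el, q(METS, "FLocat"),
                  {"LOCTYPE": "URL", q(XLINK, "href"): file_url})

    # Structural map: ties the descriptive record to the file.
    struct_map = ET.SubElement(mets, q(METS, "structMap"))
    div = ET.SubElement(struct_map, q(METS, "div"), DMDID="DMD1")
    ET.SubElement(div, q(METS, "fptr"), FILEID="FILE1")

    return ET.tostring(mets, encoding="unicode")


print(build_manifestation("Example article", "file:///archive/article.pdf", "PDF"))
```

In a hierarchy like the one the article describes, a parent METS file for the article would presumably point to several such manifestation files, but how that linking is done is beyond this sketch.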
Introducing djatoka: A Reuse Friendly, Open Source JPEG 2000 Image Server. Ryan Chute, Herbert Van de Sompel. D-Lib Magazine. September/October 2008.
Support for the JPEG 2000 format is emerging in major consumer applications, and many consider it suitable for digital preservation. This article introduces djatoka, an open source JPEG 2000 image server with basic features, and the authors urge others to help develop it. Often the TIFF format is used for the high resolution master and a derivative image is made available on the web. JPEG 2000 has multiple resolutions, region extraction, lossless and lossy compression, and display can start without waiting for the entire file to be loaded. djatoka improves performance, supports many formats, allows manipulation of the image (such as watermarking), and works with OpenURL.
Friday, September 05, 2008
Digital Preservation Matters - 05 September 2008
Preserving Government Web Sites at the End-of-Term. Library of Congress Newsletter. September 3, 2008.
When political offices change, the websites often change dramatically in the transition. “Digital government information is considered at-risk.” The Internet Archive will undertake a comprehensive crawl of the .gov domain. The Library of Congress has been preserving congressional Web sites each month since December 2003 and will focus on developing this collection for the project. Others will focus on in-depth crawls of specific government agencies or will help select or prioritize web sites to be included in the collection, as well as identify the frequency and depth of collecting.
Poor E-Mail Archive Habits Plague Businesses. Leo King. PCWorld. August 31, 2008.
Employees are failing to properly archive e-mails, according to research, because they are often too busy or too unsure of their IT skills. Most employees received no guidance on the requirements and methods for archiving e-mail; one third said their company has no e-mail policy. Also, a third of employees said they had lost important electronic documents and never recovered them. More than half said e-mail archiving is too time-consuming, and thirty percent find it "complicated" or "unreliable." This suggests that the organizations either do not archive emails or that they do not communicate the methods to their employees.
"Digital Preservation" term considered harmful? Chris Rusbridge. Digital Curation Blog. 29 July 2008.
The term ‘digital preservation’ may not be a useful term with decision makers. “The digital preservation community has become very good at talking to itself and convincing ‘paid-up’ members of the value of preserving digital information, but the language used and the way that the discourse is constructed is unlikely to make much impact on either decision-makers or the creators of the digital information (academics, administrators, etc.).” Part of the problem is that digital preservation describes a process, and not an outcome. We value the outcomes not necessarily the processes we use to get the outcomes, and the terminology we use should reflect that, which is more persuasive. Digital preservation has been over-sold as difficult, complex and expensive over the long term, while the term itself contains no notion of its own value. Phrases like "long term accessibility" or "usability over time" are better than the process-oriented phrase "digital preservation".
European Archive. Website. September 2008.
The European Archive is a digital library of cultural artifacts in digital form. They provide free access to researchers, historians, scholars, and the general public. The site contains web archives, videos, and plans to add audio recordings. The Living Web Archives project will carry Web archiving beyond the current approach, characterized by static snapshots, to one that fully accounts for the dynamics and interrelations of Web content.
Friday, August 29, 2008
Digital Preservation Matters - August 2008
PREMIS specifies the information needed to maintain digital objects long term. Many look at METS (Metadata Encoding and Transmission Standard) to implement this. Ambiguities between the two need to be clarified. This shows some of the structures, the ambiguities and redundancies. A working group has been established to develop guidelines for using PREMIS and METS to resolve the differences. The PREMIS in METS guidelines are a work in progress, and as institutions experiment with them there will be further revisions.
A Format for Digital Preservation of Images: A Study on JPEG 2000 File Robustness. Paolo Buonora and Franco Liberati. D-Lib Magazine. July/August 2008.
Many have talked about JPEG 2000 not only as a "better" JPEG delivery format, but also as a new "master" file for high quality images and as a replacement for the TIFF format. The authors look at JPEG 2000 from a technical viewpoint. The JPEG 2000 file structure is not only robust in itself, but there are some enhancements that can make it better to use. One is the utility FixIt! JPEG 2000, which can extract the file header, test and fix corrupted images, and save the header in XML format. They conclude the format is a good solution for digital repositories.
New record keeping standards announced. Judith Tizard. Press Release: New Zealand Government. 27 August 2008.
The New Zealand Archives announced two new recordkeeping standards:
1. Create and Maintain Recordkeeping Standard: identifies the key requirements for successful information management for recordkeeping.
2. Electronic Recordkeeping Metadata Standard: a systematic approach to managing information.
"Information management is an essential and important legacy." These standards ensure that information has meaning; it can be found when needed; it can be relied on to be what it sets out to be; and it can be moved safely from one system to another. Archives need to answer who created a record, for what purpose, and whether or not it has been altered.
Dead Sea Scrolls go from parchment to the Internet. CNN. August 27, 2008.
The Dead Sea Scrolls are going digital as part of an effort to better preserve the ancient texts and let more people view them. The initiative, announced Wednesday, will also reveal text that was not otherwise visible. Over the next two years, the Israel Antiquities Authority will digitally photograph and scan every bit of crumbling parchment and papyrus that makes up the scrolls. The images eventually will be posted on the Internet. Israel has assembled an international team of technical people for the project.
Very Long-Term Backup. Kevin Kelly. Weblog. August 20, 2008.
Paper, while destructible and limited, can be a stable medium over the long term. Digital storage is not stable over long periods. A project has been underway to create a stable medium. This page provides information (and pictures) on the Rosetta project. The project used technology commercialized by Norsam to etch 13,500 pages of information on a titanium disk. The disk is not digital and requires a microscope to read.
OCLC Crosswalk Web Service Demo. OCLC. August 2008.
The purpose of the Crosswalk Web Service is to translate a group of metadata records from one format into another. For this service, a metadata format is defined as:
· The metadata standard of the record (e.g. MARC, DC, etc)
· The structure of the metadata (e.g. XML, RDF, etc)
· The character encoding of the metadata (e.g. MARC8, etc.)
It requires a client software component. As a demo, only a limited number of records can be translated at a time.
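The three-part definition of a metadata format above can be written down as a small data structure. The sketch below is only an illustration in Python and has nothing to do with OCLC's actual service or client: the class, the function, and the four-entry MARC-to-Dublin-Core mapping are all hypothetical and heavily simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataFormat:
    """A metadata format as the three parts listed above."""
    standard: str   # e.g. "MARC", "DC"
    structure: str  # e.g. "XML", "RDF"
    encoding: str   # e.g. "MARC8", "UTF-8"


# Hypothetical, deliberately tiny mapping from MARC tags to Dublin Core elements.
MARC_TO_DC = {
    "100": "creator",
    "245": "title",
    "260": "publisher",
    "650": "subject",
}


def crosswalk(records, source, target):
    """Translate a group of simplified MARC records (tag -> value) to DC fields.

    Only the MARC -> DC direction is sketched; unmapped tags are dropped.
    """
    if (source.standard, target.standard) != ("MARC", "DC"):
        raise ValueError("this sketch only covers MARC to DC")
    return [
        {MARC_TO_DC[tag]: value for tag, value in record.items() if tag in MARC_TO_DC}
        for record in records
    ]


marc_xml = MetadataFormat("MARC", "XML", "MARC8")
dc_xml = MetadataFormat("DC", "XML", "UTF-8")
records = [{"245": "Hard drive sounds", "999": "local note"}]
print(crosswalk(records, marc_xml, dc_xml))
```

A real crosswalk also has to convert the structure and the character encoding, which is why the service treats all three parts as belonging to the format.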
OCLC's new Web Harvester captures Web content to add to digital collections. Press Release. July 29, 2008.
OCLC is now offering Web Harvester, a new, optional product that allows libraries to capture and add Web content to their ContentDM digital collections. It captures content ranging from single, Web-based documents to entire Web sites. Once retrieved, users can review the captured Web content and add it to a collection. Master files of the captured content also can be ingested to the OCLC Digital Archive, the service for long-term storage of originals and master files from libraries' digital collections. The Web Harvester is integrated into library workflows, allowing staff to capture content as part of the cataloging process, which is then sent to the digital collections where it can be managed with other ContentDM content. OCLC is committed to providing solutions for the entire digital life cycle.
Friday, July 25, 2008
ArcMail unveils email archiving appliance with Blu-Ray disks. July 23, 2008.
ArcMail Technology, a provider of email archiving and management technology, announced they will include Blu-Ray disks as part of the product offering. The product can store up to 16 TB. The cost starts at $3000.
Wednesday, July 02, 2008
Digital content management: the search for a content management system. Yan Han. Library Hi Tech. Volume 22 · Number 4 · 2004. [PDF]
Digital content management system: a software system that provides preservation, organization and dissemination services for digital collections. This article analyzes Greenstone, Fedora, and DSpace in the key areas of digital content management. A content management system should also provide tools and support for preservation, control and dissemination of both local documents and external content, and be cost-effective as well. DSpace received the highest marks in the operational analysis, schedule analysis and economic analysis, while Fedora received the highest score in technical analysis. DSpace was ranked first among these systems, then Fedora. The Appendix contains the functional requirements.
Friday, June 20, 2008
Librarians Confer in a Midwinter Meeting of Some Discontents
Andrea Foster. The Chronicle of Higher Education. January 25, 2008.
Part of this article discusses some of the challenges in building an institutional repository. An Ohio university has more than 21,000 articles, including conference papers, teaching materials, photographs, and multimedia works, in the archive.
Faculty members will submit research papers to the repository, often unaware that they have signed away the rights to their work to a journal publisher, Ms. Davis said. "They are stunned that they have not retained the copyrights," she said. "They're vehemently adamant" that they still have rights to the work.
Some people add other scholars' material to the repository, incorrectly assuming that this is allowed by fair use.
Friday, May 16, 2008
Digital Preservation Matters - May 16, 2008
Digital research raises issues relating to access, curation and preservation. Funding institutions are now requiring researchers to submit plans for data management or preservation. The extremely detailed study includes a framework for determining cost variables, a cost model, and case studies. The service requirements for data collections will be more complex than many have thought previously. Accessioning and ingest costs were higher than ongoing long-term preservation and archiving costs:
2. Archival Storage & Preservation: ca. 23%
3. Access: ca. 35%
Ten years of data from the Archaeology Data Service show relatively high costs in the early years after acquisition but costs decline to a minimal level over 20 years. Decline of data storage costs, costs for ongoing actions such as file format migrations, and others, provide economies of scale.
Some significant issues for archives and preservation costs include:
- Timing: Costs vary depending on when actions are taken. The cost of initially creating metadata for 1000 records is about 300 euros. Fixing bad metadata after 10 years may cost 10,000 euros.
- Efficiency: The start-up costs can be substantial. The operational phases are more productive and efficient as procedures become established, refined, and the volume increases.
- Economy of scale: Increased volume has an impact on the unit costs for digital preservation. One example is that a 600% increase in accessions only increases costs by 325%.
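The economy-of-scale figure is easier to appreciate as a unit cost. Here is a small Python sketch of the arithmetic. It assumes that a "600% increase" means the volume becomes 7 times the original and that a "325% increase" means the cost becomes 4.25 times the original; that reading is mine, not something the study states, and the function name is made up for the illustration.

```python
def unit_cost_ratio(volume_increase_pct, cost_increase_pct):
    """Return new unit cost divided by old unit cost.

    Assumes an 'X% increase' means the quantity grows to (1 + X/100) times
    its original size.
    """
    volume_factor = 1 + volume_increase_pct / 100
    cost_factor = 1 + cost_increase_pct / 100
    return cost_factor / volume_factor


ratio = unit_cost_ratio(600, 325)  # 4.25 / 7
print(f"Unit cost falls to {ratio:.1%} of its original level")
print(f"That is a reduction of about {1 - ratio:.1%} per accession")
```

Under that reading, each accession costs roughly 61% of what it did before, so the unit cost drops by about 39% even though total spending more than quadruples.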
“While the costs of maintaining digital preservation capacity are not insignificant, the costs of the alternative are often greater.” They consider three staff essential to establish a repository:
- Archive Manager: co-ordinate activities;
- System Administrator: (half time) to install and manage hardware and software;
- Collections Officer: develop and implement appropriate workflow and standards
Tasks for the digital preservation planner include: Implementing a lifecycle management approach to digital materials, continuously assessing collections, their long-term value and formats, and making recommendations for action needed to ensure long-term usability. Also:
- audit the Library’s digital assets, evaluating their volume, formats, and state of risk.
- research into preservation methodologies.
- ensure that preservation actions are carried out on digital assets at risk of loss by
- formulate and publicize advice to data creators
“A data audit exercise is needed at the outset of scoping a digital archive. This will identify collections and their relative importance to the institution and wider community.”
Also, a library should consider federated structures for local data storage, “comprising data stores at the departmental level and additional storage and services at the institutional level. These should be mixed with external shared services or national provision as required.” The hierarchy should reflect the content, the services required, and the importance of the data.
The real cost of archiving results data drops by roughly 25% as new methods and media become available. The cost of migrations is extremely high. Raw data preservation costs per sample:
1970-1990 Paper records £30.00
1989-1996 Magnetic tapes £21.95
1990-2000 Floppy disks £7.25
1997-2003 Compact Discs £6.00
2000-present Computer disks £2.15
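The per-sample figures above can be compared step by step. The short Python sketch below takes the five rows exactly as listed and computes the percentage drop from each medium to the next. The variable and function names are mine, and this is only a way of reading the table, not a calculation taken from the report.

```python
# (period, medium, cost per sample in pounds), copied from the table above.
COSTS = [
    ("1970-1990", "Paper records", 30.00),
    ("1989-1996", "Magnetic tapes", 21.95),
    ("1990-2000", "Floppy disks", 7.25),
    ("1997-2003", "Compact Discs", 6.00),
    ("2000-present", "Computer disks", 2.15),
]


def successive_drops(costs):
    """Return (from_medium, to_medium, percent_drop) for each adjacent pair."""
    drops = []
    for (_, old_medium, old_cost), (_, new_medium, new_cost) in zip(costs, costs[1:]):
        percent = (old_cost - new_cost) / old_cost * 100
        drops.append((old_medium, new_medium, percent))
    return drops


for old_medium, new_medium, percent in successive_drops(COSTS):
    print(f"{old_medium} -> {new_medium}: {percent:.1f}% lower")

overall = (COSTS[0][2] - COSTS[-1][2]) / COSTS[0][2] * 100
print(f"Paper records -> Computer disks overall: {overall:.1f}% lower")
```

The individual steps are uneven: some changes of medium cut the cost by about a sixth, others by about two thirds, so a single average figure hides a lot of variation.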
“A data preservation strategy is expected to form part of the university’s overall information strategy.” Start-up costs are higher for the early phases, especially for developing the first tools, standards and best practices.
Library of Congress Digital Preservation Newsletter. Library of Congress. May 2008. [PDF]
There are a number of items in the newsletter of interest, including:
- LC creates and supports the development of some key open standards for digital content, such as:
  - Office Open XML. It is estimated that over 400 million people use the different versions of the Microsoft Office programs. This new standard supports all the features of the various versions of Microsoft Office since 1997. Microsoft has released the specifications of its earlier binary formats and asked the Library of Congress to hold copies.
  - PDF/A, which is a subset of the PDF format, suitable for preservation.
- The Data Preservation Alliance for the Social Sciences website is a partnership to identify, acquire and preserve data which is at risk of being lost to social science research.
- The MetaArchive Cooperative is participating in the NDIIPP digital preservation network. They have added an international member to the participants. The site provides documentation and information for private LOCKSS networks and a “Guide to Distributed Digital Preservation.”
The 29 fakes behind a rewriting of history. Paul Lewis. The Guardian. May 5, 2008.
The article emphasizes the importance of archive security and of object authentication and verification. It is not just a problem for digital objects. Several books had been written based on forged documents planted in the UK National Archives. The author of the books used 29 documents in 12 separate files to write books on historical events; he is the only person to have checked out the files. An investigation uncovered the fake documents; the Archives takes a serious view of anything that compromises the integrity of the information and the archive.