
Monday, March 06, 2017

Electric WAILs and Ham

Electric WAILs and Ham. John Berlin. Web Science and Digital Libraries Research Group. February 13, 2017.
     Web Archiving Integration Layer (WAIL) is a one-click configuration and utilization tool that bridges the gap between institutional and individual archiving tools, run from a user's personal computer. Rewriting the tool from a Python application into an Electron application has brought many improvements, especially the ability to update it and to package it for Linux, macOS, and Windows.

WAIL is now collection-centric and provides users with the ability to curate personalized web archive collections, similar to Archive-It, but on their local machines. It also adds the ability to monitor and archive Twitter content automatically. WAIL is now available from the project's release page on GitHub. More information about WAIL is available on the project's wiki.

Wednesday, February 01, 2017

Why Aren't We Doing More With Our Web Archives?

Why Aren't We Doing More With Our Web Archives? Kalev Leetaru. Forbes. January 13, 2017.
     The post looks at the many projects that have been launched to archive and preserve the digital world; the best known is the Internet Archive, "which has been crawling and preserving the open web for more than two decades" and has preserved more than 510 billion distinct URLs from over 361 million websites. The author asks: "With such an incredible repository of global society’s web evolution, why don’t we see more applications of this unimaginable resource?"

Some of the reasons that there isn't a more vibrant and active research and software development community around web archives may be:
  • Economics plays a role
  • The complex nature of web archives
  • The Internet Archive's collection is over 15 petabytes, which is difficult to manipulate
  • There are few tools that can use the archive, particularly for indexing
The Internet Archive last year announced the first efforts at keyword search capability. These kinds of search tools are needed to make the Archive’s holdings more accessible to researchers and data miners.

"At the end of the day, web archives are our only record capturing the evolution of human society from the physical to the virtual domains. The Internet Archive in particular represents one of the greatest archives ever  created of this immense transition in human existence and with the right tools and a greater focus on non-traditional avenues, perhaps we can launch a whole new world of research into how humans evolved into a digital existence."

Thursday, December 29, 2016

Robots.txt Files and Archiving .gov and .mil Websites

Robots.txt Files and Archiving .gov and .mil Websites. Alexis Rossi. Internet Archive Blogs. December 17, 2016.
     The Internet Archive collects webpages "from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts". Do they ignore robots.txt files? Historically, sometimes yes and sometimes no, but the robots.txt file is less useful than it once was and is becoming less so over time, particularly for web archiving efforts. Many sites do not actively maintain their robots.txt files, and others increasingly block crawlers with other technological measures. The robots.txt file is a tool from a different era. The best way for webmasters to exclude their sites from the archive is to contact archive.org and specify the exclusion parameters.
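
For context on what these directives look like to a crawler, here is a minimal sketch using Python's standard urllib.robotparser to check whether a site's robots.txt would disallow a fetch. The host and page URLs are placeholders, and, as the post explains, archival crawlers may deliberately bypass such directives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical example host; any site that publishes a robots.txt would do.
robots_url = "https://www.example.gov/robots.txt"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetches and parses the robots.txt file

# Check whether a generic crawler user agent would be allowed to fetch a page.
page = "https://www.example.gov/reports/2016/annual.pdf"
allowed = parser.can_fetch("archive-crawler", page)
print(f"robots.txt allows fetching {page}: {allowed}")
```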

"Our end-of-term crawls of .gov and .mil websites in 2008, 2012, and 2016 have ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies."  The archived sites are available in the beta wayback. They have had little feedback at all on their efforts. "Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future."


Monday, December 12, 2016

Harvesting Government History, One Web Page at a Time

Harvesting Government History, One Web Page at a Time.  Jim Dwyer. New York Times. December 1, 2016.
     With the arrival of any new president, large amounts of information on government websites are at risk of vanishing within days. Digital federal records, reports and research are very fragile. "No law protects much of it, no automated machine records it for history, and the National Archives and Records Administration announced in 2008 that it would not take on the job."  Referring to government websites: “Large portions of dot-gov have no mandate to be taken care of. Nobody is really responsible for doing this.”  The End of Term Presidential Harvest 2016  project is a volunteer, collaborative effort by a small group of university, government and nonprofit libraries to find and preserve valuable pages that are now on federal websites. The project began before the 2008 elections. Harvested content from previous End of Term Presidential Harvests is available at http://eotarchive.cdlib.org/.

The project has two phases of harvesting:
  1. Comprehensive Crawl: The Internet Archive crawled the .gov domain in September 2016 and will crawl it again after the inauguration in 2017.
  2. Prioritized Crawl: The project team will create a list of related URLs and social media feeds.
The political changes at the end of presidential terms over the past 8 years have made many people worried about the longevity of federal information.

Saturday, October 29, 2016

Beta Wayback Machine – Now with Site Search!

Beta Wayback Machine – Now with Site Search! Vinay Goel. Internet Archive Blogs. October 24, 2016.
     The Wayback Machine has provided access to the Internet Archive's archived websites for 15 years. Previously, the URL was the main means of access. There is a new beta keyword search that returns a list of relevant archived websites with additional information.

Friday, June 17, 2016

The Web’s Past is Not Evenly Distributed

The Web’s Past is Not Evenly Distributed. Ed Summers. Maryland Institute for Technology. May 27, 2016.
     This post discusses ways to structure content "with the grain of the Web so that it can last (a bit) longer." The web was created without a central authority to make sure all the links work, and permission is not needed to link to a site. This results in a web where about 5% of links break per year, according to one estimate.

"The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: Page Not Found. This is known as link rot, and it’s a drag, but it’s better than the alternative. Jill Lepore." If we didn’t have a partially broken Web, where content constantly change and links break, it’s quite possible we wouldn’t have a Web at all.  Some things to take note of:
  • problems with naming things
  • redirects
  • proxies
  • web archives
  • static sites
  • data export
"Being able to export your content from one site to the next is extremely important for the long term access to your data. In many ways the central challenge we face in the preservation of Web content, and digital content generally, is the recognition that each system functions as a relay in a chain of systems that make the content accessible."

"Our knowledge of the past has always been mediated by the collective care of those who care to preserve it, and the Web is no different."


Wednesday, March 23, 2016

New Report on Web Archiving Available

New Report on Web Archiving Available. Andrea Goethals. IIPC. 21 March 2016.
     Harvard Library recently released a report to:
  • explore and document current web archiving programs
  • identify common practices, needs, and expectations in the collection of web archives
  • identify the provision and maintenance of web archiving infrastructure and services
  • identify the use of web archives by researchers
The environmental scan identified 22 opportunities for future research and development, which include:
  • Dedicate full-time staff to web archiving to keep up with the latest developments and best practices and to be part of the web archiving community.
  • Conduct outreach, training and professional development for existing staff who are being asked to collect web archives.
  • Institutional web archiving programs should be transparent about holdings, terms of use, preservation commitments, and the curatorial decisions made for each capture.
  • Develop a collection development tool to show holdings information to researchers and other collecting institutions.
  • Train researchers to be able to analyze big data found in web archives.
  • Establish a standard for describing the curatorial decisions behind collecting web archives.
  • Establish a feedback loop between researchers and the librarians/archivists.
There is also a need to "radically increase communication and collaboration" among all involved in web archiving. Much more communication and collaboration is needed between those collecting web content and researchers who would like to use it.

Monday, March 21, 2016

How many of the EOT2008 PDF files were harvested in EOT2012

How many of the EOT2008 PDF files were harvested in EOT2012.  Mark Phillips. mark e. phillips journal. February 23, 2016.
     A post about the author's look at some of the data from the End of Term 2012 Web Archive snapshot at the UNT Libraries. From the EOT2008 Web archive, 4,489,675 unique (by hash) PDF files were extracted and then compared to see how many of those nearly 4.5 million PDFs were still around in 2012, when the federal Web was crawled again as part of the EOT2012 project. The findings:

After the numbers finished running, the results were as follows:

             PDFs    Percentage
Found       774,375        17%
Missing   3,715,300        83%
Total     4,489,675       100%

So 83% of the PDF files that were present in 2008 are not present in the EOT2012 Archive. It is possible that some of these items were still available in 2012 at entirely different URLs: the original URL may be gone while the content lives on at another location.
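
A minimal sketch of the kind of hash-based comparison described above, assuming the PDFs from each crawl have already been extracted into two local directories. The directory names and the use of SHA-256 are illustrative; the original analysis may have used different hashes and tooling.

```python
import hashlib
from pathlib import Path

def content_hashes(directory):
    """Return the set of content hashes for every PDF under a directory."""
    hashes = set()
    for pdf in Path(directory).rglob("*.pdf"):
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        hashes.add(digest)
    return hashes

# Hypothetical local extracts of the two End of Term crawls.
eot2008 = content_hashes("eot2008_pdfs")
eot2012 = content_hashes("eot2012_pdfs")

found = eot2008 & eot2012    # PDFs from 2008 that reappear in 2012
missing = eot2008 - eot2012  # PDFs from 2008 absent from the 2012 crawl

total = len(eot2008)
print(f"Found:   {len(found):>9,} ({len(found) / total:.0%})")
print(f"Missing: {len(missing):>9,} ({len(missing) / total:.0%})")
```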


Wednesday, March 02, 2016

Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives

Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives. Justin F. Brunelle, et al. D-Lib Magazine. January/February 2016.
     This is a case study about using the open-source, web-scale web archiving tools Heritrix and the Wayback Machine. Public Internet archiving efforts do not have the opportunity to archive Intranet-based resources, such as corporate content. Past research has shown that "web pages' reliance on JavaScript to construct representations leads to a reduction in archivability". The Internet Archive uses Heritrix and the Wayback Machine to archive web resources and replay mementos on the public web.
The article recommends that content authors use robots.txt and noarchive HTTP response headers to keep sensitive information out of the archive (a minimal example of such a header follows the list below); accidentally archiving sensitive information can force the removal of mementos from a WARC file. Recommendations include:
  • Use smaller storage devices to limit the problems if sensitive information is crawled
  • Develop a way to remove a sensitive memento from a WARC file
  • Identify high-risk vs. low-risk archival targets within the Intranet.
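
As promised above, here is a minimal sketch of one way a content author could emit a "noarchive" signal on an intranet page, using Flask and the X-Robots-Tag response header. The article does not prescribe a framework or an exact header syntax, so both are assumptions here, and honoring the header is entirely up to the crawler.

```python
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/internal/project-report")
def project_report():
    # Serve an intranet page but ask well-behaved crawlers not to archive it.
    response = make_response("<html><body>Quarterly project report</body></html>")
    response.headers["X-Robots-Tag"] = "noarchive"  # assumed directive; crawler support varies
    return response

if __name__ == "__main__":
    app.run(port=8080)
```
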
Archiving intranet content needs to fit within a larger documentation plan that identifies the key resources and elements that must be preserved in order to preserve corporate memory. There is value for a corporation in having a web crawling archiving strategy, though it "may make more sense for a corporate archives to preserve information about its corporation's projects that is tracked in a database and served to an Intranet through an export directly from the database rather than crawling the Intranet for the project data".

The case study and the next steps proposed will help archive corporate memory, improve information longevity, and can help corporate archivists implement web archiving strategies.


Tuesday, February 23, 2016

Preserving Social Media

New Technology Watch report: Preserving Social Media. Sara Day Thomson. Digital Preservation Coalition and Charles Beagrie Ltd. 16 Feb 2016. [PDF]
     This report looks at the issues involved in preserving social media. Institutions collecting this type of media need new approaches and methods. The report looks at "preserving social media for long-term access by presenting practical solutions for harvesting and managing the data generated by the interactions of users on web-based networking platforms such as Facebook or Twitter." It does not consider blogs. Helen Hockx-Yu defines social media as "the collective name given to Internet-based or mobile applications which allow users to form online networks or communities".

Web 1.0 media can be harvested by web crawlers such as Heritrix; Web 2.0 content, like social media platforms, is more effectively archived through APIs. This is often an extension of an institution's web archiving. Transparency and openness will be important when archiving content. APIs allow developers to call raw data, content and metadata directly from the platform, all transferred together in formats like JSON or XML.
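
A minimal sketch of the API-based harvesting pattern the report describes, using the requests library against a hypothetical platform endpoint. The URL, parameters, and token are placeholders; real platforms each have their own API, rate limits, and terms of use.

```python
import json
import requests

# Placeholder endpoint and token for a hypothetical social media platform API.
API_URL = "https://api.example-platform.com/v1/posts"
ACCESS_TOKEN = "YOUR_TOKEN_HERE"

params = {"hashtag": "webarchiving", "count": 100}
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

response = requests.get(API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()

# Assumes the endpoint returns a JSON list of post objects (content and metadata together).
posts = response.json()

# Keep the raw JSON alongside any derived data so the archival context is not lost.
with open("harvest_webarchiving.json", "w", encoding="utf-8") as f:
    json.dump(posts, f, ensure_ascii=False, indent=2)

print(f"Harvested {len(posts)} posts")
```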

Maintaining long-term access to social media data faces a number of challenges, such as working with user-generated content, continued access to social media data, privacy issues, copyright infringement issues, and having a way to maintain the linked, interactive nature of most social media platforms. There is also "the challenge of maintaining the meaning of the social media over time, which means ensuring that an archive contains enough metadata to provide meaningful context."  There are also third-party services and self-archiving services available.

Social media is vulnerable to potential loss. The report quotes one study which looked at "the lifespan of resources shared on social media and found that ‘after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day’."

Some other quotes:
  • Overall, the capture and preservation of social media data requires adequate context.
  • Capturing data, metadata, and documentation may not provide enough context to convey user experiences with these platforms and technologies.
  • When considering the big picture, however, the preservation of social media may best be undertaken by a large, centralized provider, or a few large centralized providers, rather than linking smaller datasets or collections from many different institutions.

Thursday, February 11, 2016

To ZIP or not to ZIP, that is the (web archiving) question

To ZIP or not to ZIP, that is the (web archiving) question. Kristinn Sigurðsson. Kris's blog. January 28, 2016.
     This post looks at the question: Do you use uncompressed (W)ARC files? Many files on the Internet are already compressed and there is "little additional benefit gained from compressing these files again (it may even increase the size very slightly)." For other files, such as text, tremendous storage savings can be realized using compression, usually about 60% of the uncompressed size. Compression has an effect on disk or network access and on memory, but "the additional overhead of compressing a file, as it is written to disk, is trivial."

On the access side, the bottleneck is disk access, but "compression can actually help!" It can save time and money, and performance is barely affected. One exception may be HTTP range requests, which, when accessing a WARC record, require decompressing the entire payload until the requested item is found. A hybrid approach may be best: "compress everything except files whose content type indicates an already compressed format." This would also avoid a lot of unneeded compression/decompression.
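
A quick illustration of the point about already-compressed content, using Python's gzip module to compare how much a text-like payload shrinks versus an incompressible one. The sample data is synthetic; real WARC files use record-level gzip members rather than whole-file compression.

```python
import gzip
import os

# Synthetic payloads: highly compressible text vs. incompressible bytes
# (random data stands in for an already-compressed JPEG or ZIP payload).
text_payload = ("<html><body>" + "The quick brown fox. " * 2000 + "</body></html>").encode()
random_payload = os.urandom(len(text_payload))

for name, payload in [("text-like", text_payload), ("already-compressed-like", random_payload)]:
    compressed = gzip.compress(payload)
    ratio = len(compressed) / len(payload)
    print(f"{name}: {len(payload):,} -> {len(compressed):,} bytes ({ratio:.0%} of original)")
```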


Wednesday, January 13, 2016

Now What You Put on the Internet Really Could Last Forever

Now What You Put on the Internet Really Could Last Forever. Ryan Steadman. Observer Culture. January 5, 2016.
     Digital art institution Rhizome has won a two-year grant from the Andrew W. Mellon Foundation to continue development of Webrecorder, a newly developed web archiving tool. The tool, which will be free to the public, provides the ability to capture and play back dynamic web content and thus improve digital social memory. An open source version of Webrecorder is already available at webrecorder.io, where users are invited to build their own archives. However, "further development is needed to make it into the comprehensive archive Rhizome would like to build."


Friday, December 11, 2015

oldweb.today website

oldweb.today.  Ilya Kreymer. Website. December 10, 2015.
     This is an interesting site that provides emulated web browsers for browsing historic websites. The tool, Netcapsule, which can be used on the website oldweb.today, is built with open source tools that communicate with web archives. It allows you to browse "old web pages the old way with virtual browsers"; the user can navigate by URL and by time. When a page is loaded, "the old browser is loaded in an emulator-like setup" that can connect to the archive. Any archive that supports the CDX or Memento protocol interfaces can be a source. Full source code is available on GitHub.
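
As a concrete example of the kind of CDX interface such tools can query, here is a small sketch against the Internet Archive's public CDX search endpoint using requests. The parameters shown are the commonly documented ones; other archives expose similar interfaces, and the exact fields returned may vary.

```python
import requests

# The Internet Archive's CDX search endpoint; other archives expose similar APIs.
CDX_URL = "http://web.archive.org/cdx/search/cdx"

params = {
    "url": "example.com",
    "output": "json",   # first row is the field header, later rows are captures
    "from": "1998",
    "to": "2001",
    "limit": 10,
}

rows = requests.get(CDX_URL, params=params, timeout=30).json()
header, captures = rows[0], rows[1:]  # assumes at least one capture was returned

for capture in captures:
    record = dict(zip(header, capture))
    print(record["timestamp"], record["original"], record["statuscode"])
```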

Tuesday, November 24, 2015

Five Takeaways from AOIR 2015

Five Takeaways from AOIR 2015. Rosalie Lack. Netpreserve blog. 18 November 2015. 
     A blog post on the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. The key takeaways in the article:
  1. Digital Methods Are Where It’s At.  Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better. The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.
  2. Twitter API Is also Very Popular
  3. Social Media over Web Archives. Researchers used social media more than web archived materials.  
  4. Fair Use Needs a PR Movement. There is a lot of misunderstanding or limited understanding of fair use, even for those scholars who had previously attended a fair use workshop. Many admitted that they did not conduct particular studies because of a fear of violating copyright. 
  5. Opportunities for Collaboration.  Many researchers were unaware of tools or services they can use and/or that their librarians/archivists have solutions.
There is a need for librarians/archivists to conduct more outreach to researchers and to talk with them about preservation solutions, good data management practices and copyright.


Monday, November 23, 2015

The Provenance of Web Archives

The Provenance of Web Archives. Andy Jackson; Jason Webber. UK Web Archive blog. 20 November 2015.
     More researchers are taking an interest in web archives. The post's authors say their archive has "tried our best to capture as much of our own crawl context as we can." In addition to the WARC request and response records (see the sketch after the list below), they store other information that can answer how and why a particular resource has been archived:
  • links that the crawler found when it analysed each resource
  • the full crawl log, which records DNS results and other situations
  • the crawler configuration, including seed lists, scope rules, exclusions, etc.
  • the versions of the software they used
  • rendered versions of original seeds and home pages, and associated metadata
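
For a sense of what the WARC request and response records mentioned above look like in practice, here is a small sketch using the warcio library to list record types and target URIs in a local WARC file. The library choice and file name are assumptions on my part; the UK Web Archive's own tooling is not specified in the post.

```python
from warcio.archiveiterator import ArchiveIterator

# Path to any local WARC file (gzipped or not); this name is illustrative.
warc_path = "crawl-sample.warc.gz"

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        # Record types include 'request', 'response', 'metadata', 'warcinfo', etc.
        print(record.rec_type, uri)
```
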
The archive doesn't "document every aspect of our curatorial decisions, e.g. precisely why we choose to pursue permissions to crawl specific sites that are not in the UK domain. Capturing every mistake, decision or rationale simply isn’t possible, and realistically we’re only going to record information when the process of doing so can be largely or completely automated". In the future, there "will be practical ways of summarizing provenance information in order to describe the systematic biases within web archive collections, but it’s going to take a while to work out how to do this, particularly if we want this to be something we can compare across different web archives."

No archive is perfect. They "can only come to be understood through use, and we must open up to and engage with researchers in order to discover what provenance we need and how our crawls and curation can be improved." There are problems that need to be documented, but researchers "can’t expect the archives to already know what they need to know, or to know exactly how these factors will influence your research questions."

Saturday, November 21, 2015

How Much Of The Internet Does The Wayback Machine Really Archive?

How Much Of The Internet Does The Wayback Machine Really Archive? Kalev Leetaru. Forbes.  November 16, 2015.
     "The Internet Archive turns 20 years old next year, having archived nearly two decades and 23 petabytes of the evolution of the World Wide Web. Yet, surprisingly little is known about what exactly is in the Archive’s vaunted Wayback Machine." The article looks at how the Internet Archive archives sites and suggests "that far greater understanding of the Internet Archive’s Wayback Machine is required before it can be used for robust reliable scholarly research on the evolution of the web." It requires a more "systematic assessment of the collection’s holdings." Archive the open web uses enormous technical resources.

Maybe the important lesson is that we have little understanding of what is actually in the data we use, and few researchers really explore questions about that data. The archival landscape of the Wayback Machine is far more complex than originally realized, and it is unclear how the Wayback Machine has been constructed. This insight is critical. "When archiving an infinite web with finite resources, countless decisions must be made as to which narrow slices of the web to preserve." The selection can be either random or prioritized by some criterion. Each approach has distinct benefits and risks.

Libraries have formalized over time how they make collection decisions; web archives must adopt similar processes. The web is "disappearing before our very eyes", which can be seen in the fact that up to 14% of all online news monitored by the GDELT Project is no longer accessible after two months. We must "do a better job of archiving the online world and do it before this material is lost forever."

Monday, November 09, 2015

Web Archiving Questions for the Smithsonian Institution Archives

Five Questions for the Smithsonian Institution Archives’ Lynda Schmitz Fuhrig. Erin Engle. The Signal. October 6, 2015.   
     Article about the Smithsonian Institution Archives and what they are doing; it looks at how the Smithsonian archives its own websites and the process involved. Many of the sites contain significant content of historical and research value that is not found elsewhere. These are considered records of the Institution that evolve over time, and the Archives considers that it would be irresponsible to rely only upon other organizations to archive the websites. They use Archive-It to capture most of these sites and retain copies of the files in their collections. Other tools are used to capture specific tweets, hashtags, or sites that are more challenging due to site construction and the dynamic nature of social media content.

Public-facing websites are usually captured every 12 to 18 months, though it may happen more frequently if a redesign is happening, in which case the archiving will happen before and after the update. An archivist appraises the content on the social media sites to determine if it has been replicated and captured elsewhere.

The network servers at the Smithsonian are backed up, but that is not the same as archiving. Web crawls provide a snapshot in time of the look and feel of a website. "Backups serve the purpose of having duplicate files to rely upon due to disaster or failure" and are only saved for a certain time period, while the website archives are kept permanently. Typically, website captures will not have everything because of excluded content, blocked content, or dynamic content such as Flash elements or calendars generated by databases. Capturing the web is not perfect.

Friday, August 28, 2015

The Internet Is Failing The Website Preservation Test

The Internet Is Failing The Website Preservation Test. Ron Miller. Tech Crunch. August 27, 2015.
     Article about an author finding out that information may not remain on the internet very long. There are issues with content preservation on the internet. "If the internet at its core is a system of record, then it is failing to complete that mission." When websites disappear, all of the content may disappear as though it never existed. That "can have a much bigger impact than you imagine on researchers, scholars" or others. The content "should all be automatically archived, a digital Library of Congress to preserve and protect all of the content on the internet."

Publishers cannot be relied upon to keep an historical record. "When it no longer serves a website owner’s commercial purposes, the content can disappear forever." That will leave large gaps in the online record. “So much of our communication and content consumption now ... is online and in digital form. We rely on publishers (whether entertainment, corporate, scientific, political) that have moved to predominantly, if not exclusively, digital formats. Once gone or removed from online access we incur black holes in our [online] memory”. The lost content "extends to science, law, education, and all sorts of other cultural aspects that depend on referencing a stable and informative past to build our knowledge for the present. The loss is pernicious, because we don’t immediately notice it - it’s only over time we realize what we have lost." The problem of link rot extends to many areas, including the legal profession where it is having an enormous impact on legal research.

Organizations, such as The Internet Archive, can offer partial solutions, but it can be a challenge to find what we are looking for in the vast archive. The access tools are lacking. "Content preservation should not be the sole responsibility of individuals or businesses. We have to find a way to make it an official part of the coding process." We should try to find "automated technological solutions to preserve our content for future generations. At the very least, we are duty bound to try."

Thursday, August 27, 2015

Google is not the answer: How the digital age imperils history

Google is not the answer: How the digital age imperils history. John Palfrey. Salon.  May 30, 2015.
     We are getting better at storing digital content, but we are not good at preserving our digital history. The problem, in brief, is that no one is doing enough to select and preserve the bits that really matter.
"One of the great paradoxes of the digital age is that we are producing vastly more information than ever before, but we are not very good at preserving knowledge in digital form for the long haul." Industry is good at creating storage systems but not very good at choosing and preserving the data that matters, and then being able to make it useful in the future. "We are radically underinvesting in the processes and technologies that will allow us to preserve our cultural, literary and scientific records."  We are continuously making progress in how we store our media, and trapping information in lost formats in the process. Obsolescence of unimportant information may, in fact, be a blessing, but not when the lost knowledge has historical significance.

It is possible to transfer information from one format to another; with enough effort and cost, most data can be transferred to formats that can be read today. But different problems come when we create information at such speed and scale. Most data companies now are for-profit firms that are not in the business of long-term storage. And, unlike universities, libraries and archives, these businesses will probably not be around for hundreds of years. Plus, the amount of important information being created makes it very difficult to create scalable solutions to curate the meaningful content.

"Today, librarians and archivists are not involved enough in selecting and preserving knowledge in born-digital formats, nor in developing the technologies that will be essential to ensuring interoperability over time. Librarians and archivists do not have the support or, in many cases, the skills they need to play the central role in preserving our culture in digital format." The Government Accountability Office even criticized the Library of Congress for its information technology practices:  “Library of Congress: Strong Leadership Needed to Address Serious Information Technology Management Weaknesses.”

"The deeper problem behind the problem of digital preservation is that we undervalue our libraries and archives." We under-invest in them in them in an important time as we move from an analog society to a digital one. "If we fail to support libraries in developing new systems, those who follow us will have ample reason to be angry at our lack of foresight."

"If we don’t address our underinvestment in libraries and archives, we will have too much information we don’t need and too little of the knowledge we do."

Tuesday, August 11, 2015

Digital Preservation Tools on Github.

Digital Preservation Tools on Github. Chris Erickson. Blog. August 2015.
     While looking for a particular tool I came across several others that look interesting. I have not yet tried them, but this is a reminder that I need to check into them. 
  • epubcheck: a tool to validate EPUB files. It can detect many types of errors in EPUB: OCF container structure, OPF and OPS mark-up, and internal reference consistency are checked. EpubCheck can be run as a standalone command-line tool or used as a Java library (see the sketch after this list).
  • preservation-tools: bundles a number of preservation tools for various file types in a modular way. Includes:
    • PdfHeaderChecker (detects the software used to create a PDF)
    • PdfAValidator (checks via PDFBox whether a PDF/A is valid; runs through a folder and picks out only the PDF/A files)
    • iTextRepairPdf (copies the content of a PDF page by page to a new, PDF/A-1-conformant PDF file)
    • PdfToImageConverter (converts PDF files in a given folder to JPEGs page by page)
    • PdfTwinTest (compares two PDFs line by line and reports differences, which is handy for post-migration quality checking)
  • wail: Web Archiving Integration Layer (WAIL). A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
  • db-preservation-toolkit: The Database Preservation Toolkit allows conversion between database formats, including connection to live systems, for the purpose of digitally preserving databases. The toolkit converts live or backed-up databases into preservation formats such as DBML or SIARD, XML-based formats created for database preservation. It can also convert the preservation formats back into live systems to restore the full functionality of the databases. For example, it supports a specialized export into MySQL, optimized for phpMyAdmin, so the database can be fully explored using a web interface.
  • DPFManager. DPF Manager is an open source modular TIFF conformance checker that is extremely easy to use, to integrate with existing and new projects, and to deploy in a multitude of different scenarios. It is designed to help archivists and digital content producers ensure that TIFF files are fit for long term preservation, and is able to automatically suggest improvements and correct preservation issues. The team developing it has decades of experience working with image formats and digital preservation, and has leveraged the support of 60+ memory institutions to draft a new ISO standard proposal (TIFF/A) specifically designed for long term preservation of still-images. An open source community will be created and grown through the project lifetime to ensure its continuous development and success. Additional commercial services will be offered to make DPF Manager self-sustainable and increase its adoption.
  • PreservationSimulation: provides baseline data for librarians and researchers about long-term survival rates of document collections. The project has developed computer simulations to estimate document failure rates over a wide variety of conditions. The data from these simulations should be useful to stewards of such collections in planning and budgeting for the storage and bandwidth needed to protect their collections.
  • flint: facilitates configurable file/format validation. Its underlying architecture is based on the idea that file/format validation almost always has a specific use case with concrete requirements that may differ from validation against the official industry standard of a given format.
  • excel: a tool to help retain formulas and other essential components of spreadsheets, like Excel, published on GitHub by data curator John McGrory (U of Minnesota). In their data repository, the tool is run each time a dataset is submitted, and the resulting files are zipped as the "Archival Version of the Data." Download the software at http://z.umn.edu/exceltool; a description of what the tool does is at http://hdl.handle.net/11299/171966.
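
As a quick example of putting one of these tools to work, here is a small sketch that runs EpubCheck (mentioned at the top of the list) as a command-line validator over a folder of EPUBs via Python's subprocess module. The jar location and folder name are assumptions; EpubCheck can also be used directly as a Java library.

```python
import subprocess
from pathlib import Path

# Assumed location of the EpubCheck jar; adjust to wherever it is installed.
EPUBCHECK_JAR = "epubcheck.jar"

def validate_epub(epub_path):
    """Run EpubCheck on a single EPUB and report whether it validated cleanly."""
    result = subprocess.run(
        ["java", "-jar", EPUBCHECK_JAR, str(epub_path)],
        capture_output=True,
        text=True,
    )
    # EpubCheck exits with a non-zero status when it finds errors.
    return result.returncode == 0, result.stdout + result.stderr

for epub in Path("incoming_epubs").glob("*.epub"):
    ok, report = validate_epub(epub)
    status = "valid" if ok else "errors found"
    print(f"{epub.name}: {status}")
```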