The post examines several articles concerning the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America.
In a study of page images in the HathiTrust, Paul Conway found that 25% of the 1000 volumes he examined contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were judged accurate and complete enough to be considered “reliably intelligible surrogates.” HathiTrust attests only to the integrity of the transferred file, not to the completeness of the original digitization effort.
The “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today will be inadequate for future, more sophisticated uses. Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects. Libraries have a special obligation to preserve the historic collections in their charge in an accurate form.
- Kichuk, Diana. “Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books.”
- Conway, Paul. “Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust,” 2013.
- Jacobs, James A., and James R. Jacobs. “The Digital-Surrogate Seal of Approval: A Consumer-Oriented Standard.” D-Lib Magazine 19, no. 3/4 (March 2013).