Quick Review - How to Make Text Digital

Christine Anderson
Sep 6, 2020

Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web

“How to Make Text Digital: Scanning, OCR, and Typing”

This section is part of a chapter in Digital History. It raises many points to consider when planning and setting up a digitization project. According to Cohen and Rosenzweig, “for many digital projects, scanning will turn out to be one of the easiest tasks that you do.” The true cost of such projects (money, time, and effort), however, is often not fully realized until they are over.

Depending on the scale and scope of the digitization, these projects can be completed in a week on a $100 scanner or take years of work and cost thousands of dollars. Costs climb quickly when the materials to be digitized are specialized. Microfilm scanners cost far more than regular scanners. Rare books require special book cradles and overhead photography. Both also require much more time.

Exactness in Replication

There is also an ongoing issue of accuracy, particularly in books with old handwriting or faded text. Optical Character Recognition (OCR), the software that converts a picture of letters and words into machine-readable text, still has limitations.

“They don’t, for example, do well with non-Latin characters, small print, certain fonts, complex page layouts or tables, mathematical or chemical symbols, or most texts from before the nineteenth century. Forget handwritten manuscripts. And even without these problems, the best OCR programs will still make mistakes” (Cohen & Rosenzweig, 2005). In fact, research has “shown that the time spent correcting a small number of OCR errors can wind up exceeding the cost of typing the document from scratch” (Cohen & Rosenzweig, 2005). That is a pretty hefty cost consideration.

Raw OCR output does offer fairly high accuracy, though. JSTOR claims 97-99.95% accuracy for its journals. Other studies have shown accuracy levels of 95% and higher in character transcription. It is important to note that these figures cover only character replication, not errors in typography or layout.
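To put those percentages in perspective, here is a rough back-of-the-envelope sketch. The page size of 2,000 characters is my own assumption for illustration, not a figure from Cohen and Rosenzweig, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given character accuracy (0 to 1)."""
    return char_count * (1.0 - accuracy)


# Assumption: a page of roughly 2,000 characters (illustrative only).
PAGE_CHARS = 2000

for accuracy in (0.95, 0.97, 0.9995):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy:.2%} accurate -> about {errors:.0f} wrong characters per page")
```

Under that assumption, even 97% accuracy leaves dozens of wrong characters on every page, which helps explain why correction can become so expensive.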

This means that if 100% accuracy is important to the project, hiring a typist may be the best option. Those highly concerned with accuracy – think historical documents or great literary works – tend to have documents rekeyed entirely, without the help of OCR. Some institutions even use double keying, where “two people type the same document; then a third person reviews the discrepancies identified by a computer” (Cohen & Rosenzweig, 2005).
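The computer's part in that process, flagging where the two typed versions disagree, can be sketched with Python's standard `difflib` module. This is only a minimal illustration and not the tooling any particular institution uses. The function name `keying_discrepancies` and the sample lines are made up for the example.

```python
import difflib


def keying_discrepancies(first: str, second: str) -> list[tuple[str, str]]:
    """Return (first_version, second_version) word spans where two typists disagree."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans that differ
    ]


# Hypothetical sample: the same line as typed by two different people.
typist_one = "Four score and seven years ago our fathers brought forth"
typist_two = "Four score and seven years ago our farmers brought forth"

for left, right in keying_discrepancies(typist_one, typist_two):
    print(f"needs review: {left!r} vs {right!r}")
```

The reviewer then only has to look at the flagged spans rather than reread the whole document. Errors that both typists happen to make in the same place would still slip through.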

Considering True Accessibility

When most people think of digitization as making materials more accessible, they are thinking of physical access to the material. The issue is much more complicated, however. For example, to save costs, many projects like JSTOR leave data in its raw, scanned form. Publishers and authors strive for a reputation for quality, and that reputation can be undermined by typos introduced by OCR. Cohen and Rosenzweig say that in order to keep up the perception of quality, JSTOR displays the scanned page image and uses the uncorrected OCR only as an invisible search file. This means that a search will show you the pages where a word or phrase appears, but not the specific spot.

This creates issues for visually impaired and learning-disabled users, since the absence of machine-readable text makes it more difficult for them to use read-aloud devices. Other projects allow the uncorrected OCR to be displayed. This helps those who need reading devices and also makes it possible to copy and paste text, find a specific word quickly, and assess the quality of the OCR (Cohen & Rosenzweig, 2005).
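The page-level search described above can be pictured with a small sketch. This is my own simplified illustration of searching hidden OCR text, not a description of how JSTOR actually implements it. The function names and the sample pages (including the deliberate OCR typo "hist0ry") are hypothetical.

```python
def build_page_index(ocr_pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased word to the page numbers whose OCR text contains it."""
    index: dict[str, set[int]] = {}
    for page_number, text in ocr_pages.items():
        for word in text.lower().split():
            cleaned = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
            index.setdefault(cleaned, set()).add(page_number)
    return index


def pages_containing(index: dict[str, set[int]], word: str) -> list[int]:
    """Return the sorted page numbers where the word was recognized."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical uncorrected OCR text, kept hidden behind the page images.
ocr_pages = {
    1: "The hist0ry of the county begins in 1820.",
    2: "County records were kept by the clerk.",
}
index = build_page_index(ocr_pages)

print(pages_containing(index, "county"))   # page numbers only, no position on the page
print(pages_containing(index, "history"))  # finds nothing: the OCR read "hist0ry"
```

The reader gets a list of pages to look through, and any word the OCR misread simply cannot be found, even though it is plainly visible in the page image.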

What other aspects of accessibility should be considered when planning digitization projects?

Cohen, D. & Rosenzweig, R. (2005). Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web. University of Pennsylvania Press.
