Optical Character Recognition (OCR) is the process of scanning printed pages as images on a flatbed scanner and then using OCR software to recognize the letters and convert them to ASCII text. The OCR software has tools for both acquiring the images from the scanner and recognizing the text. Libraries and government agencies have long used this technology to make lengthy documents quickly available electronically.
OCR works best with originals or very clear copies and with mono-spaced fonts such as Courier. If you have a choice, use source material with the following characteristics:
- 12 point or greater font size.
- Black text on a white background.
- A clean copy, not a fuzzy multi-generation copy from a copy machine.
- Standard typeface (Times New Roman, Courier, etc.). Fancy fonts may not be recognized.
- Single column layout.
Text from a source with a font size smaller than 12 points or from a fuzzy copy will produce more errors. Except for tab stops and paragraph marks, most document formatting is lost during text scanning (bold, italic, and underline are sometimes recognized).
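Recognition errors from small or fuzzy type often show up as digits sitting inside words, for example a zero read in place of the letter O. The following is a minimal Python sketch of one way to flag such words for a proofreader. The function name `flag_suspect_words` and the sample text are hypothetical, and the rule (any word that mixes letters and digits) is only a rough heuristic, not a feature of any particular OCR package.

```python
import re

# A word is "suspect" if it contains at least one letter and at least one digit.
MIXED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")

def flag_suspect_words(text):
    """Return (line_number, word) pairs for words that mix letters and digits."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z\d]+", line):
            if MIXED.match(word):
                suspects.append((line_number, word))
    return suspects

# Hypothetical scanned text containing two misread characters.
sample = "This c0py uses 12 point type.\nThe tota1 is 42."
print(flag_suspect_words(sample))
```

Legitimate terms such as "A4" or "3rd" would also be flagged, so the result is a list of places to look, not a list of confirmed errors.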
The output from a finished text scan is a single-column, editable text file. This file will always require spell checking and proofreading, as well as reformatting to the desired final layout. Scanning printouts of plain text files or spreadsheets usually works, but spreadsheet text must be imported back into a spreadsheet and reformatted to match the original.
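As an illustration of the spreadsheet import step, here is a minimal Python sketch that turns scanned spreadsheet text into CSV, which most spreadsheet programs can open. It assumes the columns in the OCR output are separated by tabs or by runs of two or more spaces. That assumption, the function name `ocr_text_to_csv`, and the sample data are all hypothetical, and real output may need manual cleanup first.

```python
import csv
import io
import re

def ocr_text_to_csv(text):
    """Split each non-empty line into cells and return the result as CSV text.

    Cells are assumed to be separated by a tab or by two or more
    whitespace characters; single spaces are kept inside a cell.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = [cell.strip() for cell in re.split(r"\s{2,}|\t", line.strip())]
        writer.writerow(cells)
    return buffer.getvalue()

# Hypothetical OCR output from a small spreadsheet printout.
scanned = (
    "Item         Qty   Price\n"
    "Paper clips  100   2.50\n"
    "Stapler      1     7.99\n"
)
print(ocr_text_to_csv(scanned))
```

Column widths, number formats, and formulas are not recovered this way and still have to be restored by hand to match the original.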