What is OCR?
- OCR (Optical Character Recognition) is technology that converts images of printed text into machine-readable, searchable text. These historical databases were created from microfilm, using OCR to digitize the content.
How does OCR Work?
- OCR works by recognizing shapes on a white background, and by matching those shapes with known letter shapes that are stored in the computer's memory.
- In some cases, especially with old newspapers, the letters "bleed" into each other, so the computer either cannot recognize the shapes or mistakes them for other letters.
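The shape-matching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3x5 pixel grids, the `TEMPLATES` table, and the `score` and `recognize` functions are hypothetical names invented for this example, not the process GenealogyBank uses, and real OCR engines rely on far richer models.

```python
# Hypothetical 3x5 letter templates ("1" = ink, "0" = white background).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def score(glyph, template):
    """Count the pixels where the scanned glyph agrees with the template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template letter that shares the most pixels with the glyph."""
    return max(TEMPLATES, key=lambda letter: score(glyph, TEMPLATES[letter]))


# A cleanly printed "T".
clean_t = ["111", "010", "010", "010", "010"]

# The same "T" after ink has bled along the bottom of the stem.
bled_t = ["111", "010", "010", "010", "111"]

print(recognize(clean_t))  # T
print(recognize(bled_t))   # I
```

Two stray pixels of ink are enough to make the bled "T" a closer match to the stored "I" shape, which is the kind of misreading described above.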
Are there different qualities of OCR?
- Yes. In GenealogyBank, each page is produced in a manner that provides the highest-quality image possible from rare, fragile newspapers and microfilm. This includes a specialized process that de-skews and crops every page image.
- Keep in mind that the historical content in GenealogyBank is among the most difficult types of content to digitize because of the age of the documents and the wide variety of constantly changing typefaces, font sizes, ink quality, article formats and more.
- We are continually striving to improve image quality as technology evolves.
- GenealogyBank brings some of the oldest content published in the U.S. to a searchable online format because of its value to the genealogical and historical communities.
Does the OCR process cause some false hits?
- Sometimes. This is a universal problem within the industry at this time.
- Because OCR is imperfect, words can end up "closer" together in the search text than they appear in the image.
- Any problems on the pages, such as inkblots, speckles, poor type quality, fading, folds, wrinkles, tears or discoloration of the original paper page, can interfere with the OCR process.
- When the computer cannot recognize, or misinterprets, some of the letter shapes on the page, the result can be false hits and mistakes in keyword highlighting in the search results.
- That is why it is important to keep refining your search when using historical content.
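A small Python sketch shows how a misread letter shape turns into a false hit or a missed hit. Everything here is an assumption made for illustration: the sample sentence, the `find_hits` function, and the simple exact-word matching are hypothetical and do not describe how GenealogyBank's search actually works. The "rn" read as "m" confusion is a commonly cited OCR error.

```python
def find_hits(query, text):
    """Return the positions of words in text that equal query, ignoring case."""
    words = text.lower().split()
    return [
        position
        for position, word in enumerate(words)
        if word.strip('.,;:"') == query.lower()
    ]


# What is printed on the page.
printed = "The corner store was sold to Mr. Burns."

# What a hypothetical OCR pass produced: each "rn" was read as "m".
ocr_text = "The comer store was sold to Mr. Bums."

print(find_hits("corner", printed))   # [1]
print(find_hits("corner", ocr_text))  # []   the real word is missed
print(find_hits("comer", ocr_text))   # [1]  a false hit on a word never printed
print(find_hits("bums", ocr_text))    # [7]  the surname is highlighted under the wrong spelling
```

Trying alternate spellings or additional keywords is one way to refine a search when the first attempt returns results like these.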