Optical Character Recognition Evaluation for Historical Archival Records
Monday, Aug 4: 11:50 AM - 12:05 PM
2098
Contributed Papers
Music City Center
Architectural advancements in Optical Character Recognition (OCR) are enabling the deployment of OCR models without the resource burden of training models from scratch. The National Archives and Records Administration (NARA) now deploys a pre-trained text and image transformer to support the Citizen Archivist mission, an effort that relies on human volunteers to transcribe historical documents. While pre-trained transformers have demonstrated improvements over preceding generations of RNNs and CNNs, historical documents vary in structure, vocabulary, and handwriting style, posing a distinct challenge that will require additional model enhancements.
This paper evaluates NARA's OCR model across diverse record collections to measure its performance and identify its limitations. Specifically, we ask: [1] How does the model perform, measured by character error rate (CER), on each of the document collections? [2] What attributes of these documents present challenges for a general-purpose model? [3] What options are available for improving performance? The findings can inform efforts to improve accuracy on challenging documents.
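For readers unfamiliar with the metric in question [1], the sketch below illustrates one common way to compute CER: the character-level Levenshtein (edit) distance between the model output and a reference transcription, divided by the reference length. The abstract does not state how CER was computed for this study, so the definition, the function names (`levenshtein`, `cer`, `collection_cer`), and the sample strings are all illustrative assumptions rather than the authors' method.

```python
from typing import Iterable, Tuple


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one document (assumed definition:
    edit distance divided by reference length)."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def collection_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled CER over (reference, hypothesis) pairs: total edits
    divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for reference, hypothesis in pairs:
        total_edits += levenshtein(reference, hypothesis)
        total_chars += len(reference)
    if total_chars == 0:
        raise ValueError("collection has no reference characters")
    return total_edits / total_chars


# Hypothetical example: one substituted character in an 8-character word.
print(cer("Archives", "Archlves"))
```

Pooling edits across a collection, as `collection_cer` does, weights long documents more heavily than averaging per-document CER values; which aggregation suits a per-collection comparison is a design choice the abstract leaves open.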
Optical Character Recognition
Library Science
Machine Learning
Artificial Intelligence
Data Science
Main Sponsor
Section on Text Analysis