Optical Character Recognition Evaluation for Historical Archival Records

Madison Hall Co-Author
University of Michigan
 
Conor York Co-Author
University of Michigan
 
Tianyu Hu Co-Author
University of Michigan
 
Cameron Milne Co-Author
Reveal Global Consulting
 
Taylor Wilson Co-Author
Reveal Global Consulting, LLC
 
Madeline Kelsch First Author
University of Michigan
 
Madeline Kelsch Presenting Author
University of Michigan
 
Monday, Aug 4: 11:50 AM - 12:05 PM
2098 
Contributed Papers 
Music City Center 
Architectural advances in Optical Character Recognition (OCR) now make it possible to deploy OCR models without the resource burden of training them from scratch. The National Archives and Records Administration (NARA) now deploys a pre-trained text and image transformer to support the Citizen Archivist mission, an effort that relies on human volunteers to transcribe historical documents. While pre-trained transformers have demonstrated improvements over preceding generations of RNNs and CNNs, historical documents vary in structure, vocabulary, and handwriting style, posing a unique challenge that will require additional model enhancements.

This paper evaluates NARA's OCR model across diverse record collections to measure its accuracy and identify its limitations. Specifically, we ask: [1] How does the model perform, measured by character error rate (CER), on each document collection? [2] What attributes of these documents present challenges for a general-purpose model? [3] What options are available for improving performance? The findings can inform improvements to model performance and accuracy on challenging documents.
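
For readers unfamiliar with the metric, CER is commonly defined as the number of character-level substitutions, deletions, and insertions needed to turn the model output into the reference transcription, divided by the number of characters in the reference. The abstract does not say how CER was computed in this study, so the following is only a minimal sketch of that common definition; the function names (`edit_distance`, `character_error_rate`, `corpus_cer`) are illustrative and not taken from the paper.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing from output
                curr[j - 1] + 1,     # insertion: extra character in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER for one document: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def corpus_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """CER pooled over (reference, hypothesis) pairs, e.g. one record collection."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in pairs:
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("at least one non-empty reference is required")
    return total_edits / total_chars
```

Pooling edits and characters across a collection, as `corpus_cer` does, weights long documents more heavily than averaging per-document rates would; either choice is defensible, and which one the authors used is not stated.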

Keywords

Optical Character Recognition

Library Science

Machine Learning

Artificial Intelligence

Data Science 

Main Sponsor

Section on Text Analysis