Automated and Secure Text Scraping with Generative AI of School Transcripts for Surveys

Dale Holstein Co-Author
RTI International
 
Michael Long Co-Author
RTI International
 
Andy Kawataba Co-Author
RTI International
 
Ethan Ritchie Co-Author
RTI International
 
John Bollenbacher Co-Author
RTI International
 
Michael Wenger Co-Author
RTI International
 
Stuart Allen Co-Author
RTI International
 
Emily Hadley First Author
RTI International
 
Michael Long Presenting Author
RTI International
 
Monday, Aug 4: 11:20 AM - 11:35 AM
1830 
Contributed Papers 
Music City Center 
We present TranscriptGenie, a prototype application developed to address the need for efficient and accurate text extraction from PDF school transcripts for several large federal surveys. Secondary and postsecondary transcript data are crucial for understanding student educational journeys and outcomes. Yet extracting meaningful data from PDF school transcripts has long been a labor-intensive process, complicated by variability in transcript formats, embedded tables, and diverse data structures. In this session, we will provide an overview of TranscriptGenie's development process, highlighting the requirements that drove its design and the novel solutions that underpin its capabilities. These include integrating generative AI technology to handle text variations and applying natural language processing techniques for data annotation. We will discuss how the tool is designed to comply with security standards and how it uses a graph database to manage and query the extracted data efficiently. Finally, we will discuss the next steps needed for deployment and the broader implications for transcript analysis in surveys.

Keywords

Text analysis

Surveys

Generative Artificial Intelligence

Natural Language Processing

Education

Graph database 

Main Sponsor

Section on Text Analysis