Topics in Text Analysis: Topic Modeling, LLMs, and Beyond

Arinjita Bhattacharyya, Chair
Merck & Co., Inc.
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4062 
Contributed Papers 
Music City Center 
Room: CC-103C 
This contributed session covers various topics in text analysis, including applications of generative AI, topic modeling, and more traditional text classification tasks. Application domains span federal surveys, precision medicine, national archives, and even the development of novel statistical methodology.

Main Sponsor

Section on Text Analysis

Presentations

Enhancing Research Discovery with LLMs: A Comparative Study of Traditional Topic Modeling Algorithms

Topic modeling is essential for uncovering latent themes in scientific literature, aiding research discovery. Traditional models like Latent Dirichlet Allocation (LDA) rely on probabilistic word distributions, often producing incoherent topics, while BERTopic improves clustering with transformer embeddings but requires manual post-processing. TopicGPT, a prompt-based framework powered by Large Language Models (LLMs), generates interpretable topics as natural language descriptions rather than ambiguous word clusters.
This study compares TopicGPT, LDA, and BERTopic across cyberbullying research, forensic science literature, and general scientific papers. The models are evaluated on coherence, diversity, redundancy, interpretability, and research discovery efficiency. Results suggest TopicGPT produces more interpretable and distinct topics, improving classification accuracy and reducing redundancy. While BERTopic excels in semantic clustering, it shows higher topic overlap, and LDA struggles with coherence and interpretability.
These findings highlight LLM-driven topic modeling as a promising approach for enhancing literature analysis, research workflows, and knowledge discovery. 
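
To make the comparison concrete, here is a minimal, purely illustrative sketch in the spirit of the classical baseline: it fits scikit-learn's LDA on a toy corpus and computes a crude topic-diversity proxy related to the redundancy axis above. The corpus, settings, and metric are assumptions for illustration, not the study's actual pipeline.

```python
# Illustrative sketch only: a classical LDA baseline on a toy corpus,
# plus a simple topic-diversity proxy. The study's corpora, models
# (including BERTopic and TopicGPT), and metrics are far richer.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cyberbullying victims report abuse on social media",
    "social media harassment and online abuse of teens",
    "forensic handwriting analysis for court evidence",
    "fingerprint evidence and forensic laboratory methods",
    "transformer embeddings cluster scientific abstracts",
    "topic models summarize themes in scientific literature",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# Top words per topic, then topic diversity: the fraction of unique
# words across all topics' top-5 lists (low diversity = high redundancy).
terms = vec.get_feature_names_out()
top_words = [[terms[i] for i in comp.argsort()[-5:]] for comp in lda.components_]
for k, words in enumerate(top_words):
    print(f"topic {k}:", words)
flat = [w for words in top_words for w in words]
print("diversity:", len(set(flat)) / len(flat))
```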

Keywords

AI-Powered Topic Modeling

Latent Dirichlet Allocation (LDA)

BERTopic

Large Language Models (LLMs)

Natural Language Processing (NLP)

Scientific Literature Analysis 

Co-Author

Larry Tang, University of Central Florida

First Author

Amir Alipour Yengejeh, University of Central Florida

Presenting Author

Amir Alipour Yengejeh, University of Central Florida

Measuring Brand Sentiment using Generative AI

Gen AI is a powerful tool with many uses across industries. But can it accurately measure a brand's connection to emotional words like premium, strong, or affordable?

Implicit Association Tests (IATs) are useful to market researchers because they reveal how strongly a brand connects to these emotions on a gut level. However, speeded IATs take time and money to run, so we want to see whether AI can measure brand sentiment accurately.

We use four US Harris Poll client studies run in 2022 and 2023 as our baseline. Blinded results of these studies serve as our source of truth for testing whether large language models can accurately measure brand sentiment.

After establishing this baseline, we ask gen AI to score each company on the -100 to 100 scoring scale and to rank the companies against each other.
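
A minimal sketch of this scoring step, assuming the OpenAI Python client, is below; the study's actual prompts, models, and (blinded) client brands are not public, so the model name, brand names, and prompt wording here are illustrative placeholders.

```python
# Hypothetical sketch of the scoring step using the OpenAI Python client;
# the study's real prompts, models, and brands are not public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BRANDS = ["Brand A", "Brand B", "Brand C"]  # placeholder brand names
ATTRIBUTE = "premium"

prompt = (
    f"For each brand below, give an integer from -100 (no association) "
    f"to 100 (strong association) for the attribute '{ATTRIBUTE}', "
    f"then rank the brands from strongest to weakest association. "
    f"Brands: {', '.join(BRANDS)}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; the study's models may differ
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # near-deterministic scoring for comparability
)
print(resp.choices[0].message.content)
```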

We found that:
• Gen AI does well at ranking more well-known companies
• Gen AI models are decent at ranking companies relative to each other
• AI overpredicts the brands' emotional connections

Attendees should come away with:
• Understanding of IAT/speeded response scores and how gen AI works
• Ability to identify where gen AI can more accurately measure associations 

Keywords

Artificial Intelligence (AI)

Generative AI

Implicit Association Test (IAT)

IAT score

brand sentiment

speeded response 

Co-Author

Coleen Schofield, The Harris Poll

First Author

Tomer Zur, The Harris Poll

Presenting Author

Tomer Zur, The Harris Poll

A Neo4j Knowledge Graph for RAG Guidance Enforcement

This session explores the demands of processing one or a few documents with absolute fidelity, motivated by the following scenario: given a guidance document, ensure that no sensitive data is present in a new dataset. In many NLP applications, processing numerous documents with high precision is desirable but not mandatory; our use case, however, demands nearly perfect precision. To achieve this, we have developed a knowledge graph implementation that checks for compliance with the guidance document. The knowledge graph must precisely mirror the content of the guidance document, so each node retains the original text along with transformer-produced vector embeddings that support Retrieval-Augmented Generation (RAG) interpretations of the database contents. Our RAG-based technique is broadly applicable to any scenario requiring strict data compliance with a guidance document. 
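
As a minimal sketch of this node layout, assuming the Neo4j Python driver and a Neo4j 5.x vector index: the node label, index name, embedding dimension, and clause text below are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch: each node keeps the verbatim guidance text alongside its
# embedding, so retrieval can always be traced back to the original wording.
# Label (Clause), index name, and dimensions are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Store a guidance clause with its original text and an embedding.
driver.execute_query(
    """
    MERGE (c:Clause {clause_id: $cid})
    SET c.text = $text, c.embedding = $embedding
    """,
    cid="4.2.1", text="No dataset may contain SSNs.", embedding=[0.1] * 384,
)

# Vector index over the embedding property (Neo4j 5.11+ syntax).
driver.execute_query(
    """
    CREATE VECTOR INDEX guidance_embeddings IF NOT EXISTS
    FOR (c:Clause) ON (c.embedding)
    OPTIONS {indexConfig: {`vector.dimensions`: 384,
                           `vector.similarity_function`: 'cosine'}}
    """
)

# RAG retrieval step: nearest guidance clauses for a query embedding.
records, _, _ = driver.execute_query(
    """
    CALL db.index.vector.queryNodes('guidance_embeddings', 3, $qvec)
    YIELD node, score
    RETURN node.clause_id AS clause, node.text AS text, score
    """,
    qvec=[0.1] * 384,
)
for r in records:
    print(r["clause"], r["score"], r["text"])
driver.close()
```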

Keywords

Knowledge Graph

RAG 

First Author

Andrew Flinders, Northrop Grumman

Presenting Author

Andrew Flinders, Northrop Grumman

Automated and Secure Text Scraping with Generative AI of School Transcripts for Surveys

We present TranscriptGenie, a prototype application developed to address the need for efficient and accurate text extraction from PDF school transcripts for several large federal surveys. Secondary and postsecondary transcript data are crucial for understanding student educational journeys and outcomes, yet extracting meaningful data from PDF transcripts has long been labor-intensive because of variability in transcript formats, embedded tables, and diverse data structures. In this session, we provide a comprehensive overview of TranscriptGenie's development process, highlighting the requirements that drove its design and the novel solutions that underpin its capabilities, including generative AI to handle text variations and natural language processing techniques for data annotation. We also discuss how the tool is designed to comply with security standards and how a graph database is used to efficiently manage and query the extracted data. Finally, we discuss the next steps needed for deployment and the broader implications for transcript analysis in surveys. 
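
For a sense of the pipeline's shape, here is an illustrative sketch assuming pypdf for PDF text extraction and an LLM prompted for structured JSON; TranscriptGenie's actual architecture, model choices, schema, and security controls are not specified here, so every detail below is an assumption.

```python
# Illustrative sketch only: extract raw text from a transcript PDF, then
# ask an LLM for structured JSON. TranscriptGenie's real design differs.
import json
from pypdf import PdfReader
from openai import OpenAI

raw = "\n".join(page.extract_text() or "" for page in PdfReader("transcript.pdf").pages)

schema_hint = (
    "Return JSON: {\"student\": str, \"terms\": [{\"term\": str, "
    "\"courses\": [{\"code\": str, \"title\": str, \"credits\": float, "
    "\"grade\": str}]}]}"
)

client = OpenAI()
# A production survey deployment would route this through a
# security-compliant model endpoint rather than a public API.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed; the project's model is not stated
    messages=[
        {"role": "system", "content": "Extract transcript data. " + schema_hint},
        {"role": "user", "content": raw},
    ],
    response_format={"type": "json_object"},  # force valid JSON output
)
record = json.loads(resp.choices[0].message.content)
print(record["student"], len(record["terms"]), "terms")
```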

Keywords

Text analysis

Surveys

Generative Artificial Intelligence

Natural Language Processing

Education

Graph database 

Co-Author(s)

Dale Holstein, RTI International
Michael Long, RTI International
Andy Kawataba, RTI International
Ethan Ritchie, RTI International
John Bollenbacher, RTI International
Michael Wenger, RTI International
Stuart Allen, RTI International

First Author

Emily Hadley, RTI International

Presenting Author

Michael Long, RTI International

A Novel Statistical Method for Dynamic Topic Modeling

Topic modeling aims to extract a low-rank semantic structure from a large corpus of text documents. Most existing methods fall into either the Latent Dirichlet Allocation (LDA) framework or the Probabilistic Latent Semantic Indexing (pLSI) framework, but the underlying word co-occurrence pattern is often neglected. Motivated by this limitation, we propose a novel statistical method that incorporates word co-occurrence. Specifically, we use a hypergraph structure to model word interaction and node heterogeneity to model word frequency, then learn a latent low-rank factorization of the hypergraph parameters to recover the topics. Moreover, the proposed method generalizes flexibly to dynamic topic modeling of a sequence of corpora over multiple time windows via a temporal constraint on the hypergraph structure. Overall, the method is easy to implement, and its versatility is supported by numerical studies on semi-synthetic data and a real corpus. 
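
As a simplified analogue of the low-rank factorization idea (not the authors' hypergraph model, which additionally handles node heterogeneity and temporal constraints), the sketch below factorizes a word co-occurrence matrix with plain nonnegative matrix factorization:

```python
# Simplified analogue only: NMF on a word co-occurrence matrix.
# The proposed hypergraph method is strictly more general than this.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "gene expression regulates cell growth",
    "cell growth and gene regulation in tissue",
    "stock market prices respond to interest rates",
    "interest rates drive market volatility",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)           # documents x words counts
C = (X.T @ X).toarray().astype(float) # word x word co-occurrence
np.fill_diagonal(C, 0)                # ignore self co-occurrence

# Rank-2 nonnegative factorization C ~ W H; rows of H act like topics.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(C)
H = model.components_
terms = vec.get_feature_names_out()
for k, row in enumerate(H):
    print(f"topic {k}:", [terms[i] for i in row.argsort()[-4:]])
```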

Keywords

Topic model

Latent representation

Nonnegative matrix factorization

Vertex hunting algorithm

Anchor word

Dynamic textual data 

Co-Author(s)

Qing Nie, University of California, Irvine (Mathematics; Developmental and Cell Biology)
Annie Qu, University of California, Irvine

First Author

Hanjia Gao, University of California, Irvine

Presenting Author

Hanjia Gao, University of California, Irvine

Optical Character Recognition Evaluation for Historical Archival Records

Architectural advancements in Optical Character Recognition (OCR) are enabling the deployment of OCR models without the resource burden of training from scratch. The National Archives and Records Administration (NARA) now deploys a pre-trained text and image transformer to support the Citizen Archivist mission, an effort relying on human volunteers to transcribe historical documents. While pre-trained transformers have demonstrated improvements over preceding generations of RNNs and CNNs, historical documents vary in structure, vocabulary, and handwriting styles, posing a unique challenge that will require additional model enhancements.

This paper evaluates NARA's OCR model across diverse record collections to assess performance and identify limitations. Specifically, we ask: (1) How does the model perform, measured by character error rate (CER), on each document collection? (2) What attributes of these documents present challenges for a general-purpose model? (3) What options are available for improving performance? The findings can strengthen performance on challenging documents and improve accuracy rates. 
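
For reference, CER is the character-level Levenshtein edit distance between the OCR output and the ground-truth transcription, normalized by the transcription's length; a minimal self-contained implementation (separate from NARA's actual evaluation pipeline) looks like this:

```python
# Minimal CER implementation: Levenshtein distance over characters,
# normalized by the length of the ground-truth transcription.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

truth = "Fourth day of July, 1863"
ocr_out = "Fovrth day of Jvly, 1863"
print(f"CER = {cer(truth, ocr_out):.3f}")  # 2 substitutions / 24 chars
```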

Keywords

Optical Character Recognition

Library Science

Machine Learning

Artificial Intelligence

Data Science 

Co-Author(s)

Madison Hall, University of Michigan
Conor York, University of Michigan
Tianyu Hu, University of Michigan
Cameron Milne, Reveal Global Consulting
Taylor Wilson, Reveal Global Consulting, LLC

First Author

Madeline Kelsch, University of Michigan

Presenting Author

Madeline Kelsch, University of Michigan