Monday, Aug 4: 10:30 AM - 12:20 PM
4062
Contributed Papers
Music City Center
Room: CC-103C
This contributed session covers various topics in text analysis, including applications of generative AI, topic modeling, and more traditional text classification tasks. Application domains span federal surveys, precision medicine, and national archives, and the session also features the development of novel statistical methodology.
Main Sponsor
Section on Text Analysis
Presentations
Topic modeling is essential for uncovering latent themes in scientific literature, aiding research discovery. Traditional models like Latent Dirichlet Allocation (LDA) rely on probabilistic word distributions, often producing incoherent topics, while BERTopic improves clustering with transformer embeddings but requires manual post-processing. TopicGPT, a prompt-based framework powered by Large Language Models (LLMs), generates interpretable topics as natural language descriptions rather than ambiguous word clusters.
This study compares TopicGPT, LDA, and BERTopic across cyberbullying research, forensic science literature, and general scientific papers. The models are evaluated on coherence, diversity, redundancy, interpretability, and research discovery efficiency. Results suggest TopicGPT produces more interpretable and distinct topics, improving classification accuracy and reducing redundancy. While BERTopic excels in semantic clustering, it shows higher topic overlap, and LDA struggles with coherence and interpretability.
These findings highlight LLM-driven topic modeling as a promising direction for enhancing literature analysis, research workflows, and knowledge discovery.
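As a minimal sketch of one evaluation axis above, the snippet below computes topic diversity (the complement of redundancy), assuming each model's topics are available as ranked top-word lists; the word lists are illustrative, not drawn from the study's corpora:

```python
# Topic diversity: fraction of unique words among the top-k words
# across all topics (1.0 = no overlap; lower = more redundancy).
def topic_diversity(topics, top_k=10):
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

# Toy example with three topics given as ranked word lists.
example_topics = [
    ["cyberbullying", "social", "media", "online", "harassment"],
    ["social", "media", "network", "online", "platform"],
    ["forensic", "dna", "evidence", "analysis", "crime"],
]
print(f"diversity = {topic_diversity(example_topics, top_k=5):.2f}")
```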
Keywords
AI-Powered Topic Modeling
Latent Dirichlet Allocation (LDA)
BERTopic
Large Language Models (LLMs)
Natural Language Processing (NLP)
Scientific Literature Analysis
Generative AI is a powerful tool with applications across industries. But can it accurately measure a brand's connection to emotionally loaded words like "premium," "strong," or "affordable"?
Implicit Association Tests (IATs) are useful to market researchers because they reveal how strongly a brand connects to these emotions at a gut level. However, speeded IAT studies take time and money to run, so we investigate whether AI can measure brand sentiment accurately.
We use four US Harris client studies run in 2022 and 2023 as our baseline. Blinded results of these studies serve as our source of truth for judging whether large language models can accurately measure brand sentiment.
After establishing the baseline, we ask gen AI to score the companies on the -100 to 100 scoring scale and to rank the companies against each other (a sketch of this comparison appears after the takeaways below).
We found that:
• Gen AI does well at ranking more well-known companies
• Gen AI models are decent at ranking companies relative to each other
• AI overpredicts the brands' emotional connections
Attendees should come away with:
• Understanding of IAT/speeded response scores and how gen AI works
• Ability to identify where gen AI can more accurately measure associations
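A minimal sketch of how the rank-agreement and overprediction findings could be quantified, assuming the gen AI scores have already been collected; the brand names and scores below are placeholders, since the Harris study results are blinded:

```python
# Compare gen-AI brand scores (-100 to 100) against blinded IAT
# baselines by rank agreement (Spearman correlation) and mean gap.
from scipy.stats import spearmanr

brands = ["BrandA", "BrandB", "BrandC", "BrandD"]  # placeholders
iat_scores = [62, 15, -8, 41]   # placeholder speeded-response scores
llm_scores = [85, 40, 10, 70]   # placeholder gen-AI scores

rho, p_value = spearmanr(iat_scores, llm_scores)
print(f"rank agreement (Spearman rho) = {rho:.2f}, p = {p_value:.3f}")

# Overprediction check: AI scoring consistently above the baseline.
bias = sum(l - i for l, i in zip(llm_scores, iat_scores)) / len(brands)
print(f"mean overprediction = {bias:+.1f} points")
```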
Keywords
Artificial Intelligence (AI)
Generative AI
Implicit Association Test (IAT)
IAT score
brand sentiment
speeded response
This session explores the demands of processing one or a few documents with absolute fidelity, motivated by the following scenario: given a guidance document, ensure that no sensitive data is present in a new dataset. In many NLP applications, processing numerous documents with high precision is desirable but not mandatory; our use case, by contrast, demands near-perfect precision. To achieve this, we have developed a knowledge graph implementation that checks compliance with the guidance document. The knowledge graph must precisely mirror the content of the guidance document, so each node retains the original text alongside transformer-produced vector embeddings, enabling Retrieval-Augmented Generation (RAG) interpretations of the database contents. Our technique, leveraging RAG, is broadly applicable to any scenario requiring strict data compliance with a guidance document.
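A minimal sketch of the node design described above, assuming networkx for the graph and a placeholder embed() standing in for a transformer sentence encoder; the rule texts are invented for illustration:

```python
# Knowledge graph whose nodes retain both the original guidance text
# and a vector embedding, queried by cosine similarity for a
# RAG-style compliance check.
import networkx as nx
import numpy as np

def embed(text):
    # Placeholder: hash words into a normalized bag-of-words vector.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

graph = nx.DiGraph()
rules = [  # invented examples of guidance-document rules
    ("rule_1", "Social Security numbers must not appear in any field."),
    ("rule_2", "Dates of birth must be generalized to year only."),
]
for node_id, text in rules:
    graph.add_node(node_id, text=text, embedding=embed(text))
graph.add_edge("rule_1", "rule_2", relation="same_section")

def retrieve(query, k=1):
    # Rank nodes by embedding similarity; return the verbatim text.
    q = embed(query)
    scored = sorted(graph.nodes(data=True),
                    key=lambda n: -float(q @ n[1]["embedding"]))
    return [(node, data["text"]) for node, data in scored[:k]]

print(retrieve("column contains SSN values"))
```

Because each node carries the verbatim rule text alongside its embedding, a RAG step can surface the exact wording of the guidance rather than a lossy paraphrase.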
Keywords
Knowledge Graph
RAG
We present TranscriptGenie, a prototype application developed to address the need for efficient and accurate text extraction from PDF school transcripts for several large federal surveys. Secondary and postsecondary transcript data are crucial for understanding student educational journeys and outcomes. Yet extracting meaningful data from PDF school transcripts has long been a labor-intensive process, fraught with challenges due to variability in transcript formats, embedded tables, and diverse data structures. In this session, we will provide a comprehensive overview of TranscriptGenie's development process by highlighting the requirements that drove its design and the novel solutions that underpin its capabilities. This includes integrating generative AI technology to handle text variations and leveraging natural language processing techniques for data annotation. We will discuss how this tool is designed to comply with security standards and the use of a graph database to efficiently manage and query the extracted data. Finally, we will discuss next steps needed for deployment and broader implications for transcript analysis in surveys.
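A minimal sketch of the extraction stage only, assuming pdfplumber as the PDF backend; TranscriptGenie's actual stack, file names, and downstream generative-AI normalization are not described at this level of detail:

```python
# First stage of a transcript pipeline: pull raw text and table
# cells from a PDF before any generative-AI normalization or
# annotation is applied.
import pdfplumber

def extract_transcript(path):
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append({
                "text": page.extract_text() or "",
                "tables": page.extract_tables(),  # embedded course tables
            })
    return pages

pages = extract_transcript("transcript.pdf")  # hypothetical file
print(pages[0]["text"][:200])
```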
Keywords
Text analysis
Surveys
Generative Artificial Intelligence
Natural Language Processing
Education
Graph database
Topic modeling aims to extract a low-rank semantic structure from a large corpus of text documents. Most existing methods fall into either the Latent Dirichlet Allocation (LDA) framework or the Probabilistic Latent Semantic Indexing (pLSI) framework; however, both often neglect the underlying word co-occurrence pattern. Motivated by this limitation, we propose a novel statistical method that incorporates word co-occurrence. Specifically, we use a hypergraph structure to model word interaction and node heterogeneity to model word frequency. We then learn a latent low-rank factorization of the hypergraph parameters to recover the topics. Moreover, the proposed method generalizes flexibly to dynamic topic modeling of a sequence of corpora over multiple time windows via a temporal constraint on the hypergraph structure. Overall, the method is easy to implement, and its versatility is supported by numerical studies on semi-synthetic data and a real corpus.
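As a simplified stand-in for the proposed estimator, the sketch below recovers topic loadings by plain nonnegative matrix factorization of a semi-synthetic word co-occurrence matrix; the actual method instead fits a hypergraph with node heterogeneity and recovers topics via a vertex hunting algorithm:

```python
# Illustrative only: low-rank nonnegative factorization of a word
# co-occurrence matrix, a simplification of the paper's hypergraph
# estimator.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
vocab_size, n_topics = 50, 3

# Semi-synthetic co-occurrence: sum of k rank-one topic blocks + noise.
loadings = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # (k, V)
cooccur = loadings.T @ loadings + 0.001 * rng.random((vocab_size, vocab_size))

model = NMF(n_components=n_topics, init="nndsvd", max_iter=500)
W = model.fit_transform(cooccur)          # word-by-topic loadings
top_words = np.argsort(-W, axis=0)[:5].T  # top 5 word indices per topic
print(top_words)
```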
Keywords
Topic model
Latent representation
Nonnegative matrix factorization
Vertex hunting algorithm
Anchor word
Dynamic textual data
Co-Author(s)
Qing Nie, University of California, Irvine: Mathematics; Developmental and Cell Biology
Annie Qu, University of California, Irvine
First Author
Hanjia Gao, University of California, Irvine
Presenting Author
Hanjia Gao, University of California, Irvine
Architectural advancements in Optical Character Recognition (OCR) are enabling the deployment of OCR models without the resource burden of training models from scratch. The National Archives and Records Administration (NARA) now deploys a pre-trained text and image transformer to support the Citizen Archivist mission, an effort relying on human volunteers to transcribe historical documents. While pre-trained transformers have demonstrated improvements over preceding generations of RNNs and CNNs, historical documents vary in structure, vocabulary, and handwriting styles, posing a unique challenge that will require additional model enhancements.
This paper evaluates NARA's OCR model across diverse record collections to assess performance and identify limitations. Specifically, we ask: (1) How does the model perform, measured by character error rate (CER), on each document collection? (2) What attributes of these documents present challenges for a general-purpose model? (3) What options are available for improving performance? The findings can guide enhancements for challenging documents and improve accuracy rates.
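For reference, the metric in question (1) can be computed directly: CER is the character-level Levenshtein edit distance between the model transcription and the ground truth, normalized by the ground-truth length. A pure-Python sketch follows; production evaluations would more likely use a library such as jiwer:

```python
# Character error rate (CER): edit distance between the model
# transcription and the ground truth, divided by ground-truth length.
def cer(reference, hypothesis):
    m, n = len(reference), len(hypothesis)
    # dist[i][j] = edits to turn reference[:i] into hypothesis[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / max(m, 1)

print(f"CER = {cer('National Archives', 'Nationa1 Archves'):.3f}")
```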
Keywords
Optical Character Recognition
Library Science
Machine Learning
Artificial Intelligence
Data Science