Applications of Text Analysis

Jerry Timbrook, Chair
 
Monday, Aug 5: 10:30 AM - 12:20 PM
5059 
Contributed Papers 
Oregon Convention Center 
Room: CC-D138 

Main Sponsor

Section on Text Analysis

Co Sponsors

Caucus for Women in Statistics

Presentations

Understanding Mentions of BLS Products Through Topic Modeling of News Articles

The Bureau of Labor Statistics (BLS) measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making. To meet this mission, BLS not only publishes statistics and research on its own website but also seeks to understand when and where its products are mentioned in online news sources. Making sense of this huge volume of news articles is impossible without a means of summarizing and grouping them. Using article data collected by a third-party service, we experimented with several methods to model the topics contained in news articles that mention BLS products. We compared and optimized candidate models with a goal of meeting the needs of internal stakeholders who use the output to help evaluate the impact of their outreach efforts. Ultimately, we selected a model that provided the best balance of evaluation metrics and utility to these users. This presentation will include a summary of the models we explored and the process we developed to compare them. 

Keywords

topic model

machine learning

natural language processing 

First Author

Erin Boon

Presenting Author

Erin Boon

A Retrospective Analysis of the SmartFind COVID-19 Vaccine Chatbot: Statistical Insights

In 2021, the Centers for Disease Control and Prevention (CDC) implemented a cloud-based chatbot called the SmartFind COVID-19 Vaccine Chatbot. The chatbot employed natural language processing (NLP) to automatically address vaccination-related inquiries. It provided high-confidence responses matched to CDC's COVID-19 frequently asked questions and answers (FAQ&As), or a "Sorry, the ChatBot couldn't find a good match" response for low-confidence matches. An analysis of system logs from August 30, 2021 to March 16, 2023 examined 64,884 visitor questions (of which about three-fifths received an NLP-matched response) and 3,925 visitor feedback entries. This project uses NLP methods, including tokenization and feature extraction, to analyze question text and identify topics for which the chatbot could not provide a matched response, including topics excluded by design, such as clinical and disease-related questions. The results can guide improvements to vaccination content, such as adding FAQ&As to CDC's webpages, and inform the development of future chatbots that use more powerful large language models. 
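
As a rough illustration of the tokenization and feature-extraction step mentioned in this abstract, the sketch below counts non-stopword terms across a few questions. It is not the authors' pipeline: the stopword list, the function names `tokenize` and `term_counts`, and the example questions are all hypothetical and are not drawn from the chatbot logs.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "i", "can", "to", "my",
             "for", "of", "get", "where", "do"}

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hyphenated terms such as "covid-19" are kept as a single token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def term_counts(questions):
    """Count non-stopword tokens across a list of question strings."""
    counts = Counter()
    for question in questions:
        counts.update(t for t in tokenize(question) if t not in STOPWORDS)
    return counts

# Made-up example questions (not real visitor data).
unmatched = [
    "Can I get the vaccine while pregnant?",
    "Where do I get a booster?",
    "Is the booster safe for my child?",
]

top_terms = term_counts(unmatched).most_common(3)
```

Term counts like these are one simple feature representation; they could feed a downstream grouping step to surface topics among unmatched questions.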

Keywords

Chatbot

Natural Language Processing

COVID-19 Vaccination 

Co-Author(s)

Suchita A. Patel, CDC
Faisal Reza, CDC
Cynthia Knighton, CDC
Angela Marie Chambliss, CDC

First Author

Yi Mu

Presenting Author

Yi Mu

Integrating LLMs with Existing Text Analysis and Summarization Research Approaches

Existing text analysis and summarization techniques like key term frequency analysis and unsupervised topic modeling are helpful for analyzing large quantities of text but are often insufficient for contextual interpretation. We explore the integration of Large Language Models (LLMs) like GPT with these conventional techniques, highlighting this synergy through two real-world projects from distinct subject areas. This session offers a deep dive into the technicalities of using the GPT API in practice, comparing traditional text analysis methods with LLMs, various technological and methodological challenges, and work done to validate findings. We also discuss feedback and limitations of this approach in two real-world settings with subject matter experts from non-technical backgrounds. We suggest further research opportunities for statisticians and sociologists and emphasize how LLMs can enhance analysis of large text datasets. 

Keywords

Text analysis

Large language models 

Co-Author(s)

Laura Marcial, RTI International
Anthony Berghammer, RTI International
Wes Quattrone, RTI International
Georgiy Bobashev, Research Triangle Institute

First Author

Emily Hadley, RTI International

Presenting Author

Emily Hadley, RTI International

Leveraging Generative AI to identify narrative evolution and target audiences in social media

In an era where information is ubiquitous but increasingly unregulated, malicious actors are leveraging the ambiguity of the information environment to provoke specific responses within target audiences via the use of narratives. In public sector applications, the intent of narrative manipulation is often to affect public debate on issues, electoral processes, and policy decisions. To formulate effective responses, policy makers must understand what narratives are being propagated, who is being targeted, and the potential impacts. However, this type of analysis is often complicated by the volume of content and noise in the information environment. Leveraging large volumes of data from social media, inputs from geopolitical monitoring systems, and a predictive modeling capability combining LLMs with traditional statistical simulation approaches, we seek to (1) identify key features in specific narratives in social media data, (2) identify shifts in narratives over time, (3) identify potential target audiences of specific narratives, and (4) identify the impact on a target audience. 

Keywords

social media

public policy impact

narrative assessment

generative AI

large language models 

Co-Author(s)

Amir Bagherpour, co-author
Heather Patsolic, Johns Hopkins University
Marjorie Willner, co-author
Sieu Tran, co-author

First Author

Richard Takacs

Presenting Author

Richard Takacs

Is your paper going to be cited? A Multinomial Inverse Regression model for predicting citations.

This study draws upon the complete text of a collection of 751 scientific articles that feature the terms 'renewable energies' and 'circular economies' either in their titles or abstracts. A novel application of Multinomial Inverse Regression to predict the number of citations is then investigated. This prediction is based on textual data coupled with a set of related covariates. For the proposed model, goodness-of-fit measures such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are provided. By investigating characteristic words and related covariates that are associated with a higher number of citations, this work aims to provide significant evidence for researchers and practitioners. 
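
For readers unfamiliar with the two error measures named in this abstract, the following minimal sketch computes RMSE and MAE from their standard definitions. The citation counts and predictions are made-up numbers for illustration only; they are not results from the study, and the sketch does not implement Multinomial Inverse Regression itself.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical citation counts and model predictions (illustrative only).
actual_citations = [10, 0, 3, 25]
predicted_citations = [8, 1, 3, 20]

model_rmse = rmse(actual_citations, predicted_citations)
model_mae = mae(actual_citations, predicted_citations)
```

Because RMSE squares each residual before averaging, it penalizes a few large misses more heavily than MAE does, which is one reason the two are often reported together.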

Keywords

Predictive Model

Textual Data

Renewable Energies

Circular Economies

Multinomial Inverse Regression 

Co-Author(s)

Tarifa Almulhim, King Faisal University, Business School, Department of Quantitative Methods

First Author

Igor Barahona, King Fahad University of Petroleum and Minerals

Presenting Author

Igor Barahona, King Fahad University of Petroleum and Minerals

Are the Gospels & Acts Historical? A response to Gregor and Blais’ (2023) statistical analysis

In September 2023, the Journal for the Study of the Historical Jesus published a paper by Gregor and Blais which concluded that the name frequencies in the Gospels & Acts fit the best historical database of 2100+ first-century Palestinian Jewish male names no better than they fit a uniform distribution (i.e., the names were made up). Furthermore, they wrote, "our statistical analysis identifies some, albeit weak, evidence against [the historical] thesis." (Gregor, Kamil and Brian Blais; 2023; Is Name Popularity a Good Test of Historicity? A Statistical Evaluation of Richard Bauckham's Onomastic Argument; JSHJ; Brill; 21:171-202. DOI: 10.1163/17455197-BJA10023) We applaud Gregor and Blais' great contribution with their careful data curation and groundbreaking statistical analysis of ancient name frequency data. Unfortunately, however, there were some methodological flaws, which we will address. More importantly, we will offer an independent statistical analysis of the Gospels & Acts name frequency data and show that it actually fits the historical database quite well. 
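
To make the idea of comparing fit against two candidate distributions concrete, here is a minimal sketch using Pearson's chi-square statistic. This is a generic goodness-of-fit illustration, not the method used in this talk or by Gregor and Blais; the counts and reference proportions are invented toy numbers and the function name `chi_square_stat` is hypothetical.

```python
def chi_square_stat(observed, probs):
    """Pearson chi-square statistic for observed counts vs. hypothesized probabilities."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# Toy data (invented): occurrences of four name categories in a text.
observed_counts = [12, 8, 6, 4]

# Toy reference proportions standing in for a historical name database.
reference_probs = [0.4, 0.3, 0.2, 0.1]

# Uniform alternative: every category equally likely.
uniform_probs = [0.25, 0.25, 0.25, 0.25]

stat_reference = chi_square_stat(observed_counts, reference_probs)
stat_uniform = chi_square_stat(observed_counts, uniform_probs)
```

A smaller statistic indicates that the observed counts sit closer to the hypothesized distribution; with real data one would also account for degrees of freedom, sparse categories, and sampling uncertainty before drawing conclusions.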

Keywords

onomastics

historical name frequency data

Gospels and Acts

goodness-of-fit 

First Author

Jason Wilson, Biola University

Presenting Author

Jason Wilson, Biola University

Mapping Ocean Stories Using Text Analysis

Maine and its coastal and fishing communities are facing unprecedented challenges due to climate change. Oral histories provide a powerful set of insights into how ecosystems and communities have responded to changing conditions and fish abundance over time. The goal of our project, Mapping Ocean Stories, is to study the changes in species distributions and commercial fisheries using data science techniques to offer a bird's eye view of coastal Maine over several generations. By amplifying the voices of Maine's coastal communities, our goal is to bridge the gap between policies crafted by agencies without specific insights into the inshore and offshore local knowledge, and support better adaptive decisions that contribute to the resilience of Maine marine fisheries and aquaculture industries.

In this talk I will present the text analysis and geocoding approaches we are using to identify spatial activities from oral history interviews. I will also discuss the challenges associated with locating activities in the ocean from text descriptions and the ways in which we are capturing uncertainty in the data. 

Keywords

Text Analysis

Geographical Information Systems

Spatial data

Local Knowledge

Oral Histories 

First Author

Laurie Baker

Presenting Author

Laurie Baker