Monday, Aug 5: 10:30 AM - 12:20 PM
5059
Contributed Papers
Oregon Convention Center
Room: CC-D138
Main Sponsor
Section on Text Analysis
Co Sponsors
Caucus for Women in Statistics
Presentations
The Bureau of Labor Statistics (BLS) measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making. To meet this mission, BLS not only publishes statistics and research on its own website but also seeks to understand when and where its products are mentioned in online news sources. Making sense of this huge volume of news articles is impossible without a means of summarizing and grouping them. Using article data collected by a third-party service, we experimented with several methods to model the topics contained in news articles that mention BLS products. We compared and optimized candidate models with a goal of meeting the needs of internal stakeholders who use the output to help evaluate the impact of their outreach efforts. Ultimately, we selected a model that provided the best balance of evaluation metrics and utility to these users. This presentation will include a summary of the models we explored and the process we developed to compare them.
Keywords
topic model
machine learning
natural language processing
In 2021, the Centers for Disease Control and Prevention (CDC) implemented a cloud-based chatbot called SmartFind COVID-19 Vaccine Chatbot. This chatbot employed natural language processing (NLP) to automatically address vaccination-related inquiries. It provided high-confidence responses matched to CDC's COVID-19 frequently asked questions and answers (FAQ&As), or a "Sorry, the ChatBot couldn't find a good match" response for low-confidence matches. An analysis of system logs from August 30, 2021 to March 16, 2023 examined 64,884 visitor questions (of which about three-fifths received an NLP-matched response) and 3,925 visitor feedback entries. The goal of this project is to use NLP statistical methods, including tokenization and feature extraction, to analyze question text and identify topics for which the chatbot could not provide a matched response, including clinical and disease-related questions that it was designed not to answer. The results can guide improvement of vaccination content by adding FAQ&As to CDC's webpages and can inform development of future chatbots that use more powerful large language models.
Keywords
Chatbot
Natural Language Processing
COVID-19 Vaccination
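As a rough illustration of the tokenization and feature extraction steps named in the chatbot abstract above, the following minimal sketch counts terms in questions that received no matched response. The log entries, the stopword list, and the function names `tokenize` and `term_counts` are hypothetical and are not taken from the CDC system or its logs.

```python
import re
from collections import Counter

# Hypothetical log entries: (question text, whether the chatbot found a match).
# These strings are invented for illustration and are not from the CDC logs.
LOG = [
    ("Where can I get a booster shot?", True),
    ("Is chest pain after the vaccine serious?", False),
    ("Can the vaccine cause chest pain?", False),
    ("Are vaccines free?", True),
]

# A tiny, assumed stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "after", "are", "can", "get", "i", "is", "the", "where"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def term_counts(questions):
    """Bag-of-words features: number of questions in which each term appears."""
    counts = Counter()
    for question in questions:
        counts.update(set(tokenize(question)))
    return counts


# Keep only the questions the chatbot could not match, then rank their terms.
unmatched = [question for question, matched in LOG if not matched]
top_terms = term_counts(unmatched).most_common(3)
print(top_terms)
```

Terms that recur across unmatched questions point to candidate topics for new FAQ&A content. Counting each term once per question keeps a single long question from dominating the ranking.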
Existing text analysis and summarization techniques like key term frequency analysis and unsupervised topic modeling are helpful for analyzing large quantities of text but are often insufficient for contextual interpretation. We explore the groundbreaking integration of Large Language Models (LLMs) like GPT with these conventional techniques, highlighting this synergy through two real-world projects from distinct subject areas. This session offers a deep dive into the technicalities of using the GPT API in practice, a comparison of traditional text analysis methods with LLMs, the technological and methodological challenges encountered, and the work done to validate findings. We also discuss feedback on and limitations of this approach in two real-world settings with subject matter experts from non-technical backgrounds. We suggest further research opportunities for statisticians and sociologists and emphasize how LLMs can enhance analysis of large text datasets.
Keywords
Text analysis
Large language models
In an era where information is ubiquitous but increasingly unregulated, malicious actors are leveraging the ambiguity of the information environment to provoke specific responses within target audiences via the use of narratives. In public sector applications, the intent of narrative manipulation is often to affect public debate on issues, electoral processes, and policy decisions. To formulate effective responses, policy makers must understand what narratives are being propagated, who is being targeted, and the potential impacts. However, this type of analysis is often complicated by the volume of content and noise in the information environment. Leveraging large volumes of data from social media, inputs from geopolitical monitoring systems, and a predictive modeling capability combining LLMs with traditional statistical simulation approaches, we seek to (1) identify key features in specific narratives in social media data, (2) identify shifts in narratives over time, (3) identify potential target audiences of specific narratives, and (4) identify the impact on a target audience.
Keywords
social media
public policy impact
narrative assessment
generative AI
large language models
Abstracts
This study draws upon the complete text of a collection of 751 scientific articles that feature the terms 'renewable energies' and 'circular economies' in their titles or abstracts. A novel application of Multinomial Inverse Regression to predict the number of citations is then investigated. This prediction is based on textual data coupled with a set of related covariates. For the proposed model, goodness-of-fit measures such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are provided. By investigating characteristic words and related covariates that are associated with a higher number of citations, this work aims to provide significant evidence for researchers and practitioners.
Keywords
Predictive Model
Textual Data
Renewable Energies
Circular Economies
Multinomial Inverse Regression
Co-Author
Tarifa Almulhim, King Faisal University. Business School. Department of Quantitative Methods. Business
First Author
Igor Barahona, King Fahad University of Petroleum and Minerals
Presenting Author
Igor Barahona, King Fahad University of Petroleum and Minerals
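The two goodness-of-fit measures named in the citation-prediction abstract above can be computed as in the short sketch below. The citation counts are invented for illustration, and the predictions are not the output of a Multinomial Inverse Regression model; only the RMSE and MAE definitions themselves are standard.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Invented citation counts and predictions, for illustration only.
actual_citations = [10, 0, 4, 25]
predicted_citations = [8, 1, 4, 20]

print("RMSE:", rmse(actual_citations, predicted_citations))
print("MAE:", mae(actual_citations, predicted_citations))
```

Because RMSE squares each error before averaging, it penalizes a few large misses (such as a highly cited article) more heavily than MAE does, which is one reason to report both.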
In September 2023, the Journal for the Study of the Historical Jesus published a paper by Gregor and Blais which concluded that the name frequencies in the Gospels & Acts fit the best historical database of 2100+ first-century Palestinian Jewish male names no better than they fit a uniform distribution (i.e., the names were made up). Furthermore, they wrote, "our statistical analysis identifies some, albeit weak, evidence against [the historical] thesis." (Gregor, Kamil and Brian Blais; 2023; Is Name Popularity a Good Test of Historicity? A Statistical Evaluation of Richard Bauckham's Onomastic Argument; JSHJ; Brill; 21:171-202. DOI: 10.1163/17455197-BJA10023) We applaud Gregor and Blais' great contribution with their careful data curation and groundbreaking statistical analysis of ancient name frequency data. Unfortunately, however, there were some methodological flaws, which we will address. More importantly, we will offer an independent statistical analysis of the Gospels & Acts name frequency data and show that it actually fits the historical database well.
Keywords
onomastics
historical name frequency data
Gospels and Acts
goodness-of-fit
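To make the goodness-of-fit comparison in the name-frequency abstract above concrete, the sketch below computes Pearson's chi-square statistic for the same observed counts against two candidate distributions: a reference distribution and a uniform one. This is an assumption for illustration only; the abstract does not state which test either set of authors used, and the counts and proportions here are invented toy values, not data from the Gospels & Acts or any historical database.

```python
def chi_square_stat(observed, expected_probs):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over categories."""
    total = sum(observed)
    return sum(
        (o - total * p) ** 2 / (total * p)
        for o, p in zip(observed, expected_probs)
    )


# Toy values, invented for illustration only.
observed = [9, 5, 4, 2]                # occurrences of four name categories in a text
reference = [0.4, 0.3, 0.2, 0.1]       # hypothetical shares in a reference database
uniform = [0.25, 0.25, 0.25, 0.25]     # every category equally likely

stat_reference = chi_square_stat(observed, reference)
stat_uniform = chi_square_stat(observed, uniform)

print("vs. reference:", stat_reference)
print("vs. uniform:  ", stat_uniform)
```

A smaller statistic indicates a closer fit. In this toy case the counts sit closer to the reference shares than to the uniform distribution. Turning either statistic into a p-value would require the chi-square distribution with 3 degrees of freedom (number of categories minus one), and with real name data the many rare names make small expected counts a practical concern.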
Maine and its coastal and fishing communities are facing unprecedented challenges due to climate change. Oral histories provide a powerful set of insights into how ecosystems and communities have responded to changing conditions and fish abundance over time. The goal of our project, Mapping Ocean Stories, is to study the changes in species distributions and commercial fisheries using data science techniques to offer a bird's eye view of coastal Maine over several generations. By amplifying the voices of Maine's coastal communities, we aim to bridge the gap between policies crafted by agencies and the inshore and offshore local knowledge those agencies often lack, and to support better adaptive decisions that contribute to the resilience of Maine's marine fisheries and aquaculture industries.
In this talk I will present the text analysis and geocoding approaches we are using to identify spatial activities from oral history interviews. I will also discuss the challenges associated with locating activities in the ocean from text descriptions and the ways in which we are capturing uncertainty in the data.
Keywords
Text Analysis
Geographical Information Systems
Spatial data
Local Knowledge
Oral Histories