Leveraging Large Language Models (LLMs) and Research-Based Prompts for Cleaning Unstructured Text

Joshua Lambert, First Author and Presenting Author
University of Cincinnati
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0942 
Contributed Posters 
Music City Center 

Description

Cleaning unstructured text data, particularly at large scale, presents a considerable challenge given the size of the data, the complexity of language, and the diversity of data sources. In this work, we propose a methodology that combines Large Language Models (LLMs) with Research-Based Prompts (RBPs) to streamline and improve text data cleaning for tabular data analysis. To evaluate the methodology's efficacy, we examine three key sources of variability (LLM choice, e.g., ChatGPT vs. Llama; prompt choice; and text data type, e.g., nurse vs. physician text) and their impact on overall performance compared with human labels.

This poster outlines each phase of our methodology, including initial data collection, the use of specialized prompts, iterative LLM refinements, and automated code generation for advanced text processing. We demonstrate the effectiveness of these approaches on two types of datasets: clinical text and Reddit posts. Lastly, we address limitations such as potential biases in LLM outputs and the need for continuous human oversight to ensure accuracy and reliability.
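The workflow above can be illustrated with a minimal sketch: a prompt template applied to raw text, an LLM call, and an agreement score against human labels. The prompt text, the `clean_with_llm` helper, and its fallback cleaner are hypothetical illustrations, not the poster's actual prompts or code; a real run would replace the placeholder with a request to a model such as ChatGPT or Llama.

```python
# Minimal sketch of prompt-based text cleaning evaluated against human
# labels. The prompt wording and function names are illustrative only.

RESEARCH_BASED_PROMPT = (
    "You are cleaning clinical free-text for tabular analysis. "
    "Return the note with abbreviations expanded and typos corrected:\n{text}"
)

def clean_with_llm(text: str, llm=None) -> str:
    """Build the prompt and send it to an LLM. Falls back to a trivial
    rule-based cleaner so the sketch runs without any API access."""
    prompt = RESEARCH_BASED_PROMPT.format(text=text)
    if llm is not None:
        return llm(prompt)                 # hypothetical model call
    return " ".join(text.split()).lower()  # placeholder "cleaning"

def agreement(llm_labels, human_labels):
    """Fraction of records where the LLM output matches the human label."""
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(human_labels)

if __name__ == "__main__":
    notes = ["Pt  c/o chest PAIN", "BP   stable"]  # toy clinical text
    llm_out = [clean_with_llm(n) for n in notes]
    human = ["pt c/o chest pain", "bp stable"]     # toy human labels
    print(agreement(llm_out, human))               # prints 1.0 here
```

Varying the model passed as `llm`, the prompt template, and the input text type, then recomputing the agreement score, mirrors the three sources of variability the poster examines.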

Keywords

Data management

Data cleaning

Large Language Model (LLM)

Unstructured text data

Main Sponsor

Section on Statistical Learning and Data Science