Leveraging Large Language Models (LLMs) and Research-Based Prompts for Cleaning Unstructured Text

Joshua Lambert, First Author and Presenting Author
University of Cincinnati
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0942 
Contributed Posters 
Music City Center 

Description

Cleaning unstructured text data, particularly at large scale, presents a considerable challenge given the size of the data, the complexity of language, and the diversity of data sources. In this work, we propose a methodology that combines Large Language Models (LLMs) with Research-Based Prompts (RBPs) to streamline and improve text data cleaning for tabular data analysis. To evaluate the methodology's efficacy, we examine three key sources of variability (LLM choice, e.g., ChatGPT vs. Llama; prompt choice; and text data type, e.g., nurse vs. physician text) and their impact on overall performance compared with human labels.

This poster outlines each phase of our methodology, including initial data collection, the use of specialized prompts, iterative LLM refinements, and automated code generation for advanced text processing. We demonstrate the effectiveness of these approaches on two types of datasets: clinical text and Reddit posts. Lastly, we address limitations such as potential biases in LLM outputs and the need for continuous human oversight to ensure accuracy and reliability.
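The workflow above can be illustrated with a minimal sketch: a prompt template applied to raw text, an LLM call, and an agreement score against human labels. The prompt text, the `clean_with_llm` helper, and its fallback cleaner are hypothetical illustrations, not the poster's actual prompts or code; a real run would replace the placeholder with a request to a model such as ChatGPT or Llama.

```python
# Minimal sketch of prompt-based text cleaning evaluated against human
# labels. The prompt wording and function names are illustrative only.

RESEARCH_BASED_PROMPT = (
    "You are cleaning clinical free-text for tabular analysis. "
    "Return the note with abbreviations expanded and typos corrected:\n{text}"
)

def clean_with_llm(text: str, llm=None) -> str:
    """Build the prompt and send it to an LLM. Falls back to a trivial
    rule-based cleaner so the sketch runs without any API access."""
    prompt = RESEARCH_BASED_PROMPT.format(text=text)
    if llm is not None:
        return llm(prompt)                 # hypothetical model call
    return " ".join(text.split()).lower()  # placeholder "cleaning"

def agreement(llm_labels, human_labels):
    """Fraction of records where the LLM output matches the human label."""
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(human_labels)

if __name__ == "__main__":
    notes = ["Pt  c/o chest PAIN", "BP   stable"]  # toy clinical text
    llm_out = [clean_with_llm(n) for n in notes]
    human = ["pt c/o chest pain", "bp stable"]     # toy human labels
    print(agreement(llm_out, human))               # prints 1.0 here
```

Varying the model passed as `llm`, the prompt template, and the input text type, then recomputing the agreement score, mirrors the three sources of variability the poster examines.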

Keywords

Data management

Data cleaning

Large Language Model (LLM)

Unstructured text data

Main Sponsor

Section on Statistical Learning and Data Science