52: Leveraging Large Language Models (LLMs) and Research-Based Prompts for Cleaning Unstructured Text
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0942
Contributed Posters
Music City Center
Cleaning unstructured text data, particularly in large-data contexts, is a considerable challenge because of the volume of text, the complexity of language, and the diversity of data sources. In this work, we propose a methodology that combines Large Language Models (LLMs) and Research-Based Prompts (RBP) to streamline and improve text data cleaning for tabular data analysis. To evaluate the methodology's efficacy, we examine three key sources of variability (LLM choice, e.g., ChatGPT or Llama; prompt choice; and text data type, e.g., nurse vs. physician text) and their impact on overall performance relative to human labels.
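As a rough illustration of the evaluation design described above, the sketch below tabulates agreement with a human label for each combination of LLM, prompt, and text type. It is not the authors' code: the record fields, the model and prompt names, the toy labels, and the use of simple exact-match agreement as the performance measure are all assumptions made for the example.

```python
from collections import defaultdict


def agreement_by_condition(records):
    """Fraction of LLM outputs that exactly match the human label,
    grouped by (model, prompt, text_type)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["model"], r["prompt"], r["text_type"])
        totals[key] += 1
        hits[key] += int(r["llm_label"] == r["human_label"])
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical toy records; real data would hold one row per cleaned text
# item, per LLM, per prompt.
records = [
    {"model": "model_a", "prompt": "rbp", "text_type": "nurse",
     "llm_label": "fall", "human_label": "fall"},
    {"model": "model_a", "prompt": "rbp", "text_type": "nurse",
     "llm_label": "no fall", "human_label": "no fall"},
    {"model": "model_a", "prompt": "basic", "text_type": "nurse",
     "llm_label": "fall", "human_label": "fall"},
    {"model": "model_a", "prompt": "basic", "text_type": "nurse",
     "llm_label": "fall", "human_label": "no fall"},
]

scores = agreement_by_condition(records)
for condition, score in sorted(scores.items()):
    print(condition, score)
```

Grouping on all three factors at once keeps each source of variability visible in the output; a chance-corrected agreement statistic could replace exact match if the label distribution is unbalanced.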
This poster outlines each phase of our methodology, including initial data collection, the use of specialized prompts, iterative LLM refinements, and automated code generation for advanced text processing. We demonstrate the effectiveness of these approaches on two types of datasets: clinical text and Reddit posts. Lastly, we address limitations such as potential biases in LLM outputs and the need for continuous human oversight to ensure accuracy and reliability.
Data management
Data cleaning
Large Language Model (LLM)
Unstructured text data
Main Sponsor
Section on Statistical Learning and Data Science