Large language models empower meta-analysis in the big data era

Hojin Moon (Co-Author)
California State University, Long Beach
 
Michelle Cheuk (Co-Author)
California State University, Long Beach
 
Owen Sun (First Author, Presenting Author)
California Academy of Mathematics and Science
 
Wednesday, Aug 6: 9:50 AM - 10:05 AM
1162 
Contributed Papers 
Music City Center 
In the current big data era, large data repositories containing thousands of studies present opportunities for meta-analysis but require labor-intensive, time-consuming screening. To address this, we developed a framework that uses large language models (LLMs) to determine and justify whether a study dataset is suitable for a given meta-analysis based on the dataset description, the dataset itself, the study paper, or some combination of these. We demonstrated this framework on a meta-analysis of adjuvant chemotherapy response in non-small cell lung cancer, screening clinical data from 536 studies in the NCBI Gene Expression Omnibus (GEO) repository with the cost-effective GPT-4o mini LLM in a zero-shot setting. The framework was more sensitive than traditional keyword search in identifying suitable studies while cutting screening time to hours. To streamline the workflow and let scientists efficiently identify relevant studies for meta-analysis, we developed a publicly available app that implements this framework for screening studies in the GEO repository and PubMed, with the goal of accelerating scientific discovery.
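A minimal sketch of the zero-shot screening step described above, assuming the OpenAI Python client and GPT-4o mini. The prompt wording, the JSON verdict format, the inclusion criteria text, and the screen_geo_study helper are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch: zero-shot screening of one GEO study description with GPT-4o mini.
# The prompt, criteria text, and helper name below are illustrative assumptions.
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = (
    "adjuvant chemotherapy response in non-small cell lung cancer; "
    "clinical outcome data must be available"
)

def screen_geo_study(description: str, criteria: str = CRITERIA) -> dict:
    """Ask the LLM whether a study is suitable for the meta-analysis and why."""
    prompt = (
        "You screen studies for a meta-analysis.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Study description:\n{description}\n\n"
        'Reply with JSON: {"suitable": true or false, "justification": "..."}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the verdict as deterministic as possible
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(resp.choices[0].message.content)

# Example call on a placeholder GEO series description (in practice the
# description would be fetched from GEO, e.g., via NCBI E-utilities).
verdict = screen_geo_study(
    "Gene expression profiling of resected NSCLC tumors with adjuvant "
    "chemotherapy treatment arms and survival follow-up."
)
print(verdict["suitable"], "-", verdict["justification"])

In this sketch each study yields a yes/no verdict plus a short justification, which is what allows suitable studies to be ranked and audited far faster than manual keyword-based screening.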

Keywords

natural language processing

open big data

study screening

data-driven research

text mining 

Main Sponsor

Section on Statistical Learning and Data Science