Large language models empower meta-analysis in the big data era
Hojin Moon
Co-Author
California State University - Long Beach
Owen Sun
First Author
California Academy of Mathematics and Science
Owen Sun
Presenting Author
California Academy of Mathematics and Science
Wednesday, Aug 6: 9:50 AM - 10:05 AM
1162
Contributed Papers
Music City Center
In the current big data era, large data repositories containing thousands of studies present opportunities for meta-analysis but require labor-intensive, time-consuming screening. To address this, we developed a framework that uses large language models (LLMs) to determine, and justify, whether a study dataset is suitable for a given meta-analysis based on the dataset description, the dataset itself, the study paper, or a combination of these. We demonstrated this framework for a meta-analysis of adjuvant chemotherapy response in non-small cell lung cancer, screening clinical data from 536 studies in the NCBI Gene Expression Omnibus (GEO) repository using the cost-effective GPT-4o mini LLM in a zero-shot setting. We found that the framework was more sensitive than traditional keyword search in identifying suitable studies while cutting screening time to hours. To streamline the framework and enable scientists to efficiently identify relevant studies for meta-analysis, we developed a publicly available app implementing this framework for screening studies in the GEO repository and PubMed, with the goal of accelerating scientific discovery.
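To make the screening step concrete, the sketch below shows one way such zero-shot LLM screening of GEO records could be wired together; it is a minimal illustration under stated assumptions, not the authors' actual pipeline or app. The NCBI E-utilities endpoint, the OpenAI chat completions call, and the gpt-4o-mini model name are real; the prompt wording, the criteria text, the example GEO DataSets UID, and the function names (fetch_geo_summary, screen_study) are illustrative assumptions.

# Minimal sketch of zero-shot LLM screening of a GEO record (illustrative only).
import json
import requests
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from env

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
client = OpenAI()

# Illustrative meta-analysis criteria, mirroring the use case in the abstract.
CRITERIA = (
    "Meta-analysis of adjuvant chemotherapy response in non-small cell "
    "lung cancer using clinical and gene expression data."
)

def fetch_geo_summary(gds_uid: str) -> dict:
    """Fetch the esummary record (title, summary text) for one GEO DataSets UID."""
    resp = requests.get(
        EUTILS, params={"db": "gds", "id": gds_uid, "retmode": "json"}, timeout=30
    )
    resp.raise_for_status()
    return resp.json()["result"][gds_uid]

def screen_study(record: dict) -> dict:
    """Zero-shot suitability call: returns a verdict plus a one-sentence justification."""
    prompt = (
        f"Meta-analysis criteria: {CRITERIA}\n\n"
        f"Study title: {record.get('title', '')}\n"
        f"Study description: {record.get('summary', '')}\n\n"
        "Is this dataset suitable for the meta-analysis? Respond in JSON with "
        'keys "suitable" (true/false) and "justification" (one sentence).'
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(reply.choices[0].message.content)

if __name__ == "__main__":
    # Example UID only; in practice, UIDs would come from an esearch query over GEO.
    record = fetch_geo_summary("200031210")
    print(screen_study(record))

In a full screening run, the same call would simply be looped over every candidate UID returned by an E-utilities search, with the verdicts and justifications collected for review.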
Keywords
natural language processing
open big data
study screening
data-driven research
text mining
Main Sponsor
Section on Statistical Learning and Data Science