Large Language Models in Biomedical and Statistical Knowledge Discovery

Buxin Su Chair
 
Linjun Zhang Discussant
Rutgers University
 
Bingxin Zhao Organizer
 
Wednesday, Aug 6: 8:30 AM - 10:20 AM
0652 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-103C 

Applied

Yes

Main Sponsor

Biometrics Section

Co Sponsors

ENAR
IMS

Presentations

UKB-KG: Knowledge Graph for Integrating and Enhancing Biomedical Insights from the UK Biobank

The UK Biobank (UKB) is a cornerstone of modern biomedical research, providing
unparalleled data to advance the understanding, prediction, and treatment of
diseases. Its contributions span genetics, genomics, disease prediction, and long-term
follow-up studies, driving transformative advancements in public health and precision
medicine. However, the fragmentation of research outcomes across numerous
publications limits analytic efficiency and cross-study integration. To address this,
we developed UKB Knowledge Graph (UKB-KG), a high-quality medical knowledge
graph constructed using large language models (LLMs) with a precision rate of 85.6%.
Integrating data from approximately 7,000 UKB-related publications, UKB-KG comprises
137,328 triples enriched with contextual attributes such as source information
and demographic details. It reveals intricate relationships among genes, diseases,
environmental variables, and lifestyle factors, while a dynamic scoring mechanism enhances
triple retrieval accuracy. Evaluations highlight UKB-KG's transformative
potential: (i) Embedding UKB-KG into multi-disease prediction models improves
AUROC, AUPRC, and F1 scores by 8.4%, 6.6%, and 3.2%, respectively, for rare diseases;
(ii) A tailored retrieval-augmented generation (RAG) approach boosts LLM
accuracy by 21% on PubMedQA; and (iii) A user-friendly platform enhances accessibility
for researchers. By unifying fragmented research and enabling robust data
exploration, UKB-KG emerges as a powerful tool for advancing biomedical research
and driving innovative healthcare applications.  
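
As a rough illustration of what a triple "enriched with contextual attributes" and a scored retrieval step might look like, here is a minimal sketch. It is not the UKB-KG implementation: the `Triple` fields, the `score` formula, and the placeholder entities (`GENE_A`, `DISEASE_X`, and so on) are all assumptions made for this example, and the abstract does not specify how its dynamic scoring mechanism works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One (head, relation, tail) fact plus hypothetical context fields."""
    head: str
    relation: str
    tail: str
    source: str = ""          # placeholder for a publication identifier
    demographics: str = ""    # placeholder for cohort details
    confidence: float = 1.0   # hypothetical extraction confidence in [0, 1]


def score(triple, query_terms):
    """Hypothetical score: share of query terms matched, times confidence."""
    if not query_terms:
        return 0.0
    text = f"{triple.head} {triple.relation} {triple.tail}".lower()
    hits = sum(1 for term in query_terms if term.lower() in text)
    return triple.confidence * hits / len(query_terms)


def retrieve(triples, query_terms, k=2):
    """Return up to k triples with a positive score, best first."""
    ranked = sorted(triples, key=lambda t: score(t, query_terms), reverse=True)
    return [t for t in ranked[:k] if score(t, query_terms) > 0]


# Made-up placeholder facts, not results from any publication.
triples = [
    Triple("GENE_A", "associated_with", "DISEASE_X", confidence=0.9),
    Triple("SMOKING", "risk_factor_for", "DISEASE_X", confidence=0.8),
    Triple("GENE_B", "associated_with", "DISEASE_Y", confidence=0.95),
]
top = retrieve(triples, ["gene_a", "disease_x"])
for t in top:
    print(t.head, t.relation, t.tail)
```

In a retrieval-augmented setup, the retrieved triples would typically be serialized into the prompt as context; the keyword-overlap score here is only a stand-in for whatever ranking the real system uses.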

Co-Author

Hongtu Zhu

Speaker

Hongtu Zhu

Domain-Knowledge Augmented Multi-Agent Collaborative Reasoning Protein-Disease Mapping

Understanding protein-disease relationships is crucial for uncovering disease mechanisms, identifying biomarkers, and accelerating drug discovery. However, researchers currently spend significant time and effort manually reasoning over fragmented biomedical data to extract meaningful insights. Existing methods often lack efficient integration of diverse biological perspectives, making it challenging to derive comprehensive conclusions. To address this, we propose a multi-agent framework that automates the reasoning process for protein-disease mapping. Given a user-specified disease, the system retrieves top-associated proteins and employs specialized reasoning agents to analyze key aspects such as existing data evidence, protein function, and disease biology. Additional agents explore gene-disease associations, protein-protein interactions, and protein-drug relationships, synthesizing multi-source biomedical data. An aggregation agent ensures coherence, while a natural language generation agent translates findings into human-readable reports. By automating complex reasoning and reducing manual effort, our framework enhances the interpretability of disease mechanisms, facilitates hypothesis generation, and supports precision medicine and drug discovery.

Speaker

Bingxuan Li

Empowering Biomedical Discovery with AI Agents

We envision "AI scientists" as systems capable of skeptical learning and reasoning that empower biomedical research through collaborative agents that integrate AI models and biomedical tools with experimental platforms. Rather than taking humans out of the discovery process, biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks. AI agents are poised to be proficient in various tasks, planning discovery workflows and performing self-assessment to identify and mitigate gaps in their knowledge. These agents use large language models and generative models to feature structured memory for continual learning and use machine learning tools to incorporate scientific knowledge, biological principles, and theories. AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies. 

Speaker

Marinka Zitnik, Department of Biomedical Informatics, Harvard Medical School

Automated Statistical Model Discovery with Large Language Models

Statistical models are powerful tools for helping us understand and explain the world. Building these models is challenging because it requires deep expertise in modeling and the problem domain. Motivated by the capabilities of large language models (LLMs), we introduce a framework for language-model-driven automated statistical model discovery. We cast our automated procedure within Box's Loop: the LM iterates between proposing statistical models represented as probabilistic programs, acting as a modeler, and critiquing those models through simulation-based checks, acting as a domain expert. By integrating LMs, we do not have to define a domain-specific language of models or design a handcrafted search procedure over models. We evaluate our method on a range of controlled settings and real-world tasks. Our method identifies models on par with human-expert-designed models and extends classic models in interpretable ways. Our results highlight the promise of LM-driven model discovery.
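
The propose-then-critique cycle of Box's Loop can be sketched in a few lines. This is an illustrative toy, not the authors' system: the `propose` function is a hard-coded stand-in for the language model, the candidate models are plain Python classes rather than probabilistic programs, and the critique is a simple predictive check on the sample minimum. All names and thresholds are assumptions made for this example.

```python
import math
import random
import statistics


class Normal:
    name = "normal"

    def fit(self, data):
        self.mu = statistics.fmean(data)
        self.sigma = statistics.pstdev(data)

    def simulate(self, n, rng):
        return [rng.gauss(self.mu, self.sigma) for _ in range(n)]


class Exponential:
    name = "exponential"

    def fit(self, data):
        self.rate = 1.0 / statistics.fmean(data)

    def simulate(self, n, rng):
        return [rng.expovariate(self.rate) for _ in range(n)]


def propose(history):
    """Stand-in for the LM modeler: return the next untried candidate."""
    tried = {name for name, _ in history}
    for cls in (Normal, Exponential):
        if cls.name not in tried:
            return cls()
    return None


def critique(model, data, rng, n_sims=200):
    """Simulation-based check: share of simulated minima >= observed minimum."""
    observed = min(data)
    sims = [min(model.simulate(len(data), rng)) for _ in range(n_sims)]
    return sum(s >= observed for s in sims) / n_sims


def box_loop(data, seed=0, max_rounds=5):
    """Alternate proposing and critiquing until a model passes the check."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_rounds):
        model = propose(history)
        if model is None:
            break
        model.fit(data)
        p = critique(model, data, rng)
        history.append((model.name, p))
        if 0.05 < p < 0.95:  # hypothetical acceptance band
            return model, history
    return None, history


# Deterministic, exponential-shaped data (quantiles of a rate-1 exponential).
n_obs = 200
data = [-math.log(1 - (i + 0.5) / n_obs) for i in range(n_obs)]
best, history = box_loop(data)
print(best.name if best else None, [name for name, _ in history])
```

The normal model is rejected because its simulated datasets almost always contain values below the observed minimum, while the exponential model reproduces that feature of the data. In the framework described above, both the proposal and the critique would be driven by the LM rather than by fixed code.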

Speaker

Michael Li, Department of Computer Science, Stanford University