Advances in Foundation Models and LLMs for Biomedical Data Science

Sreya Sarkar Chair
 
Himel Mallick Organizer
Cornell University
 
Monday, Aug 4: 2:00 PM - 3:50 PM
0209 
Invited Paper Session 
Music City Center 
Room: CC-Davidson Ballroom B 

Keywords

Large Language Models

Foundation Models

Machine Learning

Artificial Intelligence

Omics Data Science

Biostatistics 

Applied

Yes

Main Sponsor

International Indian Statistical Association

Co Sponsors

Section on Statistical Learning and Data Science
Section on Statistics in Genomics and Genetics

Presentations

Developing Large language-based models to infer biological insights from omics sequencing data

Sequencing data (e.g., DNA, RNA, mass spectrometry) are invaluable for measuring biomarkers and inferring biological insights. As the volume of these data grows rapidly, they become well-suited for training large language models (LLMs). We treat sequencing data as a language and implement innovative strategies for sequence encoding using LLMs, which can be extended to other sequence types, such as amino acids. Most current methods rely on alignment-based approaches to infer biological insights. Here, I present a two-stage training framework for LLMs, applied across a broad range of health and biological domains. Applications include genomic benchmark datasets, metagenomics quality control, and microbial species taxonomic profiling from short- and long-read sequencing technologies. Our approach is evaluated against existing methods, demonstrating its potential to advance LLMs in the health domain.

Keywords

Large Language Models

Metagenomics

Sequencing Data

Mass Spectrometry

Software

AI 

Speaker

Ali Rahnavard, The George Washington University

WITHDRAWN Generative AI for Protein Design

Proteins are the workhorses of living cells. Understanding the functions of proteins is critical to many applications such as biomedicine and synthetic biology. Thanks to recent biotechnology breakthroughs such as gene sequencing and cryo-EM, large amounts of protein data (such as protein sequences and structures) are being generated, providing a huge opportunity for AI. As the functions of proteins are determined by their structures, in this talk, I will introduce our recent work on protein understanding based on protein 3D structures with geometric deep learning. I will cover three topics: protein representation learning, generative models for protein structure prediction, and generative models for protein design.

Keywords

Generative AI

Protein design

Co-Author

Jian Tang, HEC Montréal

Single-Cell Embedding from Language Models

Various Foundation Models (FMs) have been built based on the pre-training and fine-tuning framework to analyze single-cell data with different degrees of success. In this presentation, we describe a method named scELMo (Single-cell Embedding from Language Models) to analyze single-cell data that utilizes Large Language Models (LLMs) as a generator for both the description of metadata information and the embeddings for such descriptions. We combine the embeddings from LLMs with the raw data under a zero-shot learning framework, and we further extend the method with a fine-tuning framework to handle different tasks. We demonstrate that scELMo is capable of cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, the fine-tuning framework of scELMo can help with more challenging tasks, including in-silico treatment analysis and modeling perturbation. scELMo has a lighter structure and lower resource requirements. In addition, our method is comparable to recent large-scale FMs (such as scGPT and Geneformer) based on our evaluations, suggesting a promising path for developing domain-specific FMs.

Keywords

Single cell

foundation models

large language models

embedding

clustering

annotation 

Speaker

Hongyu Zhao, Yale University

Applications of Commercial LLMs in Genomic Research

Recent advances in large language models (LLMs) are unlocking new possibilities in genomic research. This talk will explore how commercial LLMs, such as GPT-4, Claude, and Gemini, are being applied to tasks like annotating cell types in single-cell sequencing data, classifying biomedical images, answering genomics-related questions, and generating programming code. These innovative uses of LLMs have the potential to lower the technical barriers in genomic research, broadening access and enabling researchers from diverse backgrounds to contribute to the field. 

Keywords

Large language model

Large multimodal model

Cell type annotation

Image classification

Computer programming

Question answering 

Speaker

Zhicheng Ji, Duke University

Digital Pathology Meets Spatial Omics: Emerging Problems in Data Integration, Solutions, and New Opportunities

This talk will introduce parallel advancements in two emerging fields, computational pathology and spatial omics, in the modern era of biomedical sciences. My team leverages computational image analysis tools and best engineering practices to integrate spatial omics datasets with their associated histology images and draw meaningful conclusions. We work to fundamentally understand cell type and cell state compositions and the underlying quantitative morphometric features at various scales, from transcripts to tissue microanatomy. I will highlight our ongoing efforts within the Human BioMolecular Atlas Program (HuBMAP), a consortium spanning 42 sites, focused on creating an atlas of the human body at the cellular level using spatial technologies. I will also discuss the detection and segmentation of multiple cell types and cell states, as well as tissue microanatomy, exclusively from brightfield histology images. Furthermore, I will explore several use cases of these tools, including kidney disease trajectory prediction, relevant to the NIH Kidney Precision Medicine Project (KPMP) consortium, and distinguishing glomeruli with chronic and acute injury. Finally, I will demonstrate our cloud-based, open-source distributed software systems: FUSION (Functional Unit State IdentificatiON in Whole Slide Images), accessible at http://fusion.hubmapconsortium.org/, and CompRePS (Computational Renal Pathology Suite), accessible at https://athena.rc.ufl.edu/. These systems are designed to conduct various computational image analysis tasks related to digital pathology, starting with the analysis of brightfield histology images and extending to the integration of histology with spatial omics data. We will conclude by discussing new opportunities and potential directions for collective contributions in the field of computational pathology.

Keywords

Computational Pathology 

Speaker

Pinaki Sarder, SUNY Buffalo