Monday, Aug 4: 2:00 PM - 3:50 PM
0209
Invited Paper Session
Music City Center
Room: CC-Davidson Ballroom B
Large Language Models
Foundation Models
Machine Learning
Artificial Intelligence
Omics Data Science
Biostatistics
Applied
Yes
Main Sponsor
International Indian Statistical Association
Co Sponsors
Section on Statistical Learning and Data Science
Section on Statistics in Genomics and Genetics
Presentations
Sequencing data (e.g., DNA, RNA, mass spectrometry) are invaluable for measuring biomarkers and inferring biological insights. As the volume of these data grows rapidly, they become well-suited for training large language models (LLMs). We treat sequencing data as a language and implement innovative strategies for sequence encoding using LLMs, which can be extended to other sequence types, such as amino acids. Most current methods rely on alignment-based approaches to infer biological insights. Here, I present a two-stage training framework for LLMs, applied across a broad range of health and biological domains. Applications include genomic benchmark datasets, metagenomics quality control, and microbial species taxonomic profiling from short- and long-read sequencing technologies. Our approach is evaluated against existing methods, demonstrating its potential to advance LLMs in the health domain.
Keywords
Large Language Models
Metagenomics
Sequencing Data
Mass Spectrometry
Software
AI
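To make the idea of treating sequencing data as a language concrete, the following is a minimal sketch of one common encoding choice, overlapping k-mer tokenization. The abstract above does not say which encoding the speaker uses, so k-mers, the function names (`kmer_tokens`, `build_vocab`, `encode`), and the `<unk>` token are all assumptions made for illustration only.

```python
from collections import Counter


def kmer_tokens(read: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide read into overlapping k-mer tokens (hypothetical helper)."""
    read = read.upper()
    # A read shorter than k yields no tokens.
    return [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]


def build_vocab(reads: list[str], k: int = 3) -> dict[str, int]:
    """Map every k-mer seen in the reads to an integer id; id 0 is reserved for unknowns."""
    counts = Counter(tok for r in reads for tok in kmer_tokens(r, k))
    vocab = {"<unk>": 0}
    for tok in sorted(counts):  # sorted so that ids are deterministic
        vocab[tok] = len(vocab)
    return vocab


def encode(read: str, vocab: dict[str, int], k: int = 3) -> list[int]:
    """Turn a read into the sequence of token ids a language model would consume."""
    return [vocab.get(tok, 0) for tok in kmer_tokens(read, k)]


# Toy reads, not real data.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocab(reads)
ids = encode("ACGTTT", vocab)  # k-mers absent from the vocabulary map to id 0
```

The same tokenizer would apply to amino-acid sequences by changing the alphabet of the input strings; the choice of `k` and `stride` trades vocabulary size against sequence length.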
Proteins are the workhorses of living cells. Understanding the functions of proteins is critical to many applications, such as biomedicine and synthetic biology. Thanks to recent biotechnology breakthroughs such as gene sequencing and cryo-EM, large amounts of protein data (such as protein sequences and structures) are being generated, providing a huge opportunity for AI. Because the functions of proteins are determined by their structures, in this talk I will introduce our recent work on protein understanding based on protein 3D structures with geometric deep learning. I will cover three topics: protein representation learning, generative models for protein structure prediction, and generative models for protein design.
Keywords
Generative AI; protein design
Various foundation models (FMs) have been built on the pre-training and fine-tuning framework to analyze single-cell data, with varying degrees of success. In this presentation, we describe a method named scELMo (Single-cell Embedding from Language Models) for analyzing single-cell data. scELMo uses large language models (LLMs) to generate both text descriptions of metadata and the embeddings of those descriptions. Under a zero-shot learning framework, we combine these LLM embeddings with the raw data; a fine-tuning framework then extends the method to handle additional tasks. We demonstrate that scELMo is capable of cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, the fine-tuning framework of scELMo can help with more challenging tasks, including in-silico treatment analysis and perturbation modeling. scELMo has a lighter structure and lower resource requirements, and in our evaluations its performance is comparable to that of recent large-scale FMs (such as scGPT and Geneformer), suggesting a promising path for developing domain-specific FMs.
Keywords
Single cell
foundation models
large language models
embedding
clustering
annotation
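As a rough illustration of how LLM embeddings might be combined with raw expression data in the zero-shot setting described in the scELMo abstract above, the sketch below computes a cell embedding as an expression-weighted average of per-gene embedding vectors. The abstract does not specify the aggregation rule, so the weighted average, the function name `cell_embedding`, and the two-dimensional toy vectors are assumptions; in practice the gene vectors would come from an LLM embedding of each gene's text description.

```python
def cell_embedding(expression: dict[str, float],
                   gene_embeddings: dict[str, list[float]]) -> list[float]:
    """Expression-weighted average of gene embedding vectors (hypothetical helper)."""
    dim = len(next(iter(gene_embeddings.values())))
    # Only genes that have an embedding contribute to the normalizing total.
    total = sum(expression.get(g, 0.0) for g in gene_embeddings)
    if total == 0:
        return [0.0] * dim  # a cell with no expressed genes gets a zero vector
    out = [0.0] * dim
    for gene, vec in gene_embeddings.items():
        weight = expression.get(gene, 0.0) / total
        for j, value in enumerate(vec):
            out[j] += weight * value
    return out


# Toy placeholder vectors standing in for LLM-derived gene embeddings.
gene_embeddings = {"CD3E": [1.0, 0.0], "MS4A1": [0.0, 1.0]}
# Toy expression profile for a single cell (not real data).
cell = {"CD3E": 3.0, "MS4A1": 1.0}
emb = cell_embedding(cell, gene_embeddings)
```

Because no model parameters are trained here, embeddings of this kind could be passed directly to standard clustering or nearest-neighbor annotation tools, which is the sense in which such a pipeline is zero-shot.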
Recent advances in large language models (LLMs) are unlocking new possibilities in genomic research. This talk will explore how commercial LLMs, such as GPT-4, Claude, and Gemini, are being applied to tasks like annotating cell types in single-cell sequencing data, classifying biomedical images, answering genomics-related questions, and generating programming code. These innovative uses of LLMs have the potential to lower the technical barriers in genomic research, broadening access and enabling researchers from diverse backgrounds to contribute to the field.
Keywords
Large language model
Large multimodal model
Cell type annotation
Image classification
Computer programming
Question answering
This talk will introduce parallel advances in two emerging fields, computational pathology and spatial omics, in the modern era of biomedical sciences. My team leverages computational image analysis tools and best engineering practices to integrate spatial omics datasets with their associated histology images and draw meaningful conclusions. We work to fundamentally understand cell type and cell state compositions and the underlying quantitative morphometric features at various scales, from transcripts to tissue microanatomy. I will highlight our ongoing efforts within the Human BioMolecular Atlas Program (HuBMAP), a consortium spanning 42 sites, focused on creating an atlas of the human body at the cellular level using spatial technologies. I will also discuss the detection and segmentation of multiple cell types and cell states, as well as tissue microanatomy, exclusively from brightfield histology images. I will then explore several use-case studies of these tools, including kidney disease trajectory prediction, relevant to the NIH Kidney Precision Medicine Project (KPMP) consortium, and distinguishing glomeruli with chronic and acute injury. In addition, I will demonstrate our cloud-based, open-source distributed software systems: FUSION (Functional Unit State IdentificatiON in Whole Slide Images), accessible at http://fusion.hubmapconsortium.org/, and CompRePS (Computational Renal Pathology Suite), accessible at https://athena.rc.ufl.edu/. These systems are designed to conduct various computational image analysis tasks related to digital pathology, starting with the analysis of brightfield histology images and extending to the integration of histology with spatial omics data. We'll conclude by discussing new opportunities and potential directions for collective contributions in the field of computational pathology.
Keywords
Computational Pathology