Developing Large language-based models to infer biological insights from omics sequencing data

Ali Rahnavard, Speaker
The George Washington University
 
Monday, Aug 4: 2:05 PM - 2:25 PM
Invited Paper Session 
Music City Center 
Sequencing data (e.g., DNA, RNA, mass spectrometry) are invaluable for measuring biomarkers and inferring biological insights. As the volume of these data grows rapidly, they become well suited for training large language models (LLMs). We treat sequencing data as a language and implement innovative strategies for sequence encoding using LLMs, which can be extended to other sequence types, such as amino acids. Most current methods rely on alignment-based approaches to infer biological insights. Here, I present a two-stage training framework for LLMs, applied across a broad range of health and biological domains. Applications include genomic benchmark datasets, metagenomics quality control, and microbial species taxonomic profiling from short- and long-read sequencing technologies. Our approach is evaluated against existing methods, demonstrating its potential to advance LLMs in the health domain.
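
To illustrate what treating sequencing data as a language can look like in practice, here is a minimal Python sketch of one common alignment-free encoding: splitting a read into overlapping k-mers and mapping each to an integer token ID. The abstract does not specify the encoding used in the talk, so k-mer tokenization, the function names (kmer_tokenize, build_vocab, encode), the special tokens, and the default k = 3 are all assumptions chosen for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical special tokens; not taken from the abstract.
PAD, UNK = "[PAD]", "[UNK]"


def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers of length k, stepping by `stride`."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(reads: Iterable[str], k: int = 3) -> Dict[str, int]:
    """Assign an integer ID to every k-mer seen in the training reads."""
    counts = Counter(tok for read in reads for tok in kmer_tokenize(read, k))
    vocab = {PAD: 0, UNK: 1}
    # Most frequent k-mers first; ties broken alphabetically for determinism.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab


def encode(seq: str, vocab: Dict[str, int], k: int = 3) -> List[int]:
    """Map a sequence to token IDs; unseen k-mers fall back to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in kmer_tokenize(seq, k)]


# Toy reads, made up for this example.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocab(reads)
ids = encode("ACGTNN", vocab)  # k-mers containing N were never seen, so they map to UNK
```

The resulting ID lists can be fed to any sequence model in place of word tokens. The same functions apply unchanged to amino acid strings, since nothing in them is specific to the nucleotide alphabet.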

Keywords

Large Language Models

Metagenomics

Sequencing Data

Mass Spectrometry

Software

AI