Developing Large Language-Based Models to Infer Biological Insights from Omics Sequencing Data
Monday, Aug 4: 2:05 PM - 2:25 PM
Invited Paper Session
Music City Center
Sequencing data (e.g., DNA, RNA, mass spectrometry) are invaluable for measuring biomarkers and inferring biological insights. As the volume of these data grows rapidly, they become well suited for training large language models (LLMs). We treat sequencing data as a language and implement innovative strategies for sequence encoding using LLMs, which can be extended to other sequence types, such as amino acid sequences. Most current methods rely on alignment-based approaches to infer biological insights. Here, I present a two-stage training framework for LLMs, applied across a broad range of health and biological domains. Applications include genomic benchmark datasets, metagenomics quality control, and microbial species taxonomic profiling from short- and long-read sequencing technologies. Our approach is evaluated against existing methods, demonstrating its potential to advance LLMs in the health domain.
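To make the idea of treating sequencing data as a language concrete, the sketch below shows one common, alignment-free way to turn a read into tokens: overlapping k-mers mapped to integer ids. The abstract does not specify the encoding actually used, so this is an illustrative assumption and not the presented method; the function names (`build_kmer_vocab`, `tokenize`, `encode`), the choice of k, and the reserved unknown id are all hypothetical.

```python
from itertools import product


def build_kmer_vocab(k, alphabet="ACGT"):
    """Map every k-mer over the alphabet to an integer id.

    Ids start at 1; id 0 is reserved for k-mers outside the vocabulary
    (for example, those containing an ambiguous base such as N).
    """
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: i for i, kmer in enumerate(kmers, start=1)}


def tokenize(read, k, stride=1):
    """Split a read into k-mers, advancing by `stride` bases each step."""
    read = read.upper()
    return [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]


def encode(read, vocab, k, stride=1, unk_id=0):
    """Convert a read into a list of integer token ids."""
    return [vocab.get(token, unk_id) for token in tokenize(read, k, stride)]


if __name__ == "__main__":
    vocab = build_kmer_vocab(3)          # hypothetical choice of k = 3
    print(tokenize("ACGTAC", 3))         # overlapping 3-mers of the read
    print(encode("ACGTAC", vocab, 3))    # the same read as integer ids
```

The same scheme carries over to other sequence types by swapping the alphabet, for instance the 20 standard amino acid letters, although the vocabulary grows as the alphabet size raised to the power k.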
Large Language Models
Metagenomics
Sequencing Data
Mass Spectrometry
Software
AI