Using large language models for rare variant association testing in large-scale biobanks

Christopher Gillies Co-Author
Regeneron Genetics Center
 
Andrey Ziyatdinov Co-Author
Regeneron Genetics Center
 
The Regeneron Genetics Center Co-Author
Regeneron Genetics Center
 
Maya Ghoussaini Co-Author
Regeneron Genetics Center
 
Jonathan Marchini Co-Author
Regeneron Genetics Center
 
Joelle Mbatchou Speaker
 
Wednesday, Aug 6: 9:00 AM - 9:25 AM
Invited Paper Session 
Music City Center 
Whole exome sequencing is well established as a powerful and cost-effective strategy for studying rare genetic variation and discovering novel drug targets. Advances in sequencing technologies have made it increasingly feasible to study rare variants, which are potentially important in the development of complex diseases. Because these variants are rare, single-variant tests face limited power and require large sample sizes; gene-based tests have been developed to address these challenges. These tests aggregate information across many variants and can integrate external functional annotations to improve the power of rare variant analysis. In recent years, large language models have been used to predict the functional impact of genetic mutations, potentially enhancing the power of rare variant association tests and complementing existing functional prediction approaches based on in-silico algorithms. We showcase the integration of functional scores derived from protein language models into large-scale gene-based association testing in the UK Biobank.
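
The abstract does not say which gene-based test is used, so the following is only a minimal sketch of one common form, a weighted burden test, and should not be read as the authors' method. It assumes a quantitative phenotype, no covariates, and per-variant weights in [0, 1] standing in for functional scores from a protein language model. The function names, the toy genotypes, and the weight values are all hypothetical.

```python
import math


def burden_scores(genotypes, weights):
    """Collapse per-variant allele counts into one weighted score per sample.

    genotypes: list of rows, one per sample, each holding allele counts (0/1/2)
               for the rare variants in a single gene.
    weights:   one functional score per variant (hypothetical values here).
    """
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]


def score_test(burden, phenotype):
    """Score test for association between burden and a quantitative phenotype.

    The null model contains only an intercept. Returns the chi-square
    statistic (1 degree of freedom) and its p-value.
    """
    n = len(burden)
    mean_b = sum(burden) / n
    mean_y = sum(phenotype) / n
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(burden, phenotype))
    sxx = sum((b - mean_b) ** 2 for b in burden)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    if sxx == 0 or syy == 0:
        # No variation in burden or phenotype: nothing to test.
        return 0.0, 1.0
    stat = n * sxy ** 2 / (sxx * syy)  # equals n times the squared correlation
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data (made up): 6 samples, 3 rare variants in one gene.
genotypes = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
weights = [0.9, 0.2, 0.6]  # assumed functional scores, higher = more damaging
phenotype = [0.1, 1.2, 0.3, 0.8, -0.2, 1.5]

burden = burden_scores(genotypes, weights)
stat, p_value = score_test(burden, phenotype)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Weighting each variant by a predicted functional impact lets likely damaging variants contribute more to the gene-level score than likely benign ones, which is the sense in which functional annotations can improve power. A biobank-scale analysis would also need covariate adjustment, handling of relatedness and case-control imbalance, and multiple-testing control, none of which this sketch attempts.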