AI/ML-prediction powered statistical genetics and genomics using synthetic data

Xihong Lin, Speaker
Harvard T.H. Chan School of Public Health
 
Monday, Aug 4: 8:35 AM - 9:00 AM
Invited Paper Session 
Music City Center 
In population biobanks and omics datasets, certain outcomes are missing for a substantial fraction of subjects, which limits the power for genetic discovery. AI and ML methods, such as convolutional neural networks (CNNs), deep learning (DL), and transformers, are increasingly used to predict these outcomes. Analyses based on the predicted values, however, can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes genome-wide association studies (GWAS) of AI/ML model-based predicted phenotypes robust to prediction errors by jointly analyzing the partially observed outcomes and the predicted outcomes for all subjects. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires the commonly made missing-at-random assumption but does not require correct specification of the prediction model. We present extensive simulations and ablation analyses to validate SynSurr and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits within the UK Biobank.
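The idea of combining partially observed outcomes with predictions available for all subjects can be illustrated with a small simulation. The sketch below is not the SynSurr estimator itself; it swaps in a simpler control-variate-style adjustment for a single variant, and it assumes outcomes are missing completely at random, which is stronger than the missing-at-random assumption named in the abstract. All function names, sample sizes, effect sizes, and the form of the "prediction" are hypothetical choices made for illustration.

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def residuals(x, y):
    """Residuals from the OLS fit of y on x, with an intercept."""
    return (y - y.mean()) - slope(x, y) * (x - x.mean())

def simulate(rng, n=2000, frac_observed=0.2, beta=0.1):
    """Toy cohort (all settings are illustrative assumptions)."""
    g = rng.binomial(2, 0.3, size=n).astype(float)    # genotype dosage
    y = beta * g + rng.normal(size=n)                 # target phenotype
    # Stand-in for an ML prediction: noisy and deliberately miscalibrated.
    y_hat = 0.3 + 0.7 * y + 0.5 * rng.normal(size=n)
    observed = rng.random(n) < frac_observed          # missing completely at random
    return g, y, y_hat, observed

def complete_case_slope(g, y, observed):
    """Standard analysis restricted to subjects with an observed outcome."""
    return slope(g[observed], y[observed])

def augmented_slope(g, y, y_hat, observed):
    """Complete-case slope corrected with the prediction as a control variate.

    Only y[observed] is used; y_hat is used for every subject.
    """
    b_y_obs = slope(g[observed], y[observed])
    b_s_obs = slope(g[observed], y_hat[observed])
    b_s_all = slope(g, y_hat)
    # Weight estimated from the residual covariance among observed subjects.
    r_y = residuals(g[observed], y[observed])
    r_s = residuals(g[observed], y_hat[observed])
    gamma = float(r_y @ r_s / (r_s @ r_s))
    # (b_s_obs - b_s_all) has mean zero, so the target effect is unchanged.
    return b_y_obs - gamma * (b_s_obs - b_s_all)

rng = np.random.default_rng(0)
TRUE_BETA = 0.1
cc, aug = [], []
for _ in range(200):
    g, y, y_hat, observed = simulate(rng, beta=TRUE_BETA)
    cc.append(complete_case_slope(g, y, observed))
    aug.append(augmented_slope(g, y, y_hat, observed))
cc, aug = np.array(cc), np.array(aug)

print("complete-case: mean", cc.mean(), "variance", cc.var())
print("augmented:     mean", aug.mean(), "variance", aug.var())
```

In this toy setting both estimators target the same genetic effect even though the prediction is shifted and rescaled, while the augmented estimator's variance shrinks as the prediction becomes more strongly correlated with the target. This mirrors, in a much simplified form, the abstract's claims that the estimand is preserved and that power improves with imputation quality.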

Keywords

AI and ML

Transformer

statistical inference

prediction

biobanks