Large Impact of Genetic Data Processing Steps on Reproducibility in Genome-Wide Association Studies

Ryan Sun Co-Author
University of Texas, MD Anderson Cancer Center
 
Naishu Kui First Author, Presenting Author
 
Thursday, Aug 7: 10:50 AM - 11:05 AM
2032 
Contributed Papers 
Music City Center 
Genome-wide association studies (GWAS) play a crucial role in identifying genetic variants linked to complex traits, but reproducibility of results remains a major challenge because data processing pipelines are inconsistent across studies. Set-based analyses, including hypothesis tests such as the Sequence Kernel Association Test (SKAT) and fine-mapping methods such as conditional regression, are widely used but can be highly sensitive to data cleaning steps. For example, choices about genotype coding, filtering criteria, and annotation databases can each greatly affect findings. This work quantifies the impact of cleaning choices on statistical power and effect size estimation using a model misspecification framework. We demonstrate that reasonable differences in such choices can lead to substantial changes in the operating characteristics of SKAT and conditional regression. Simulations and an application to a whole-exome sequencing pancreatic cancer dataset highlight the importance of standardized and transparent genotype processing for improving study replicability. Our methods are publicly available in an R package that can be used for sensitivity analysis at the design stage of GWAS investigations.
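
To make the sensitivity to cleaning choices concrete, the following is a minimal Python sketch, not the authors' R package or their misspecification framework. It computes an unadjusted, linear-kernel SKAT-type score statistic on the same simulated variant set under three processing choices: additive coding, dominant coding, and a minor allele frequency (MAF) filter. The function names (`skat_q`, `maf_filter`, `dominant_coding`), the simulation settings, and the 0.05 MAF cutoff are all hypothetical and chosen only for illustration; a real analysis would also adjust for covariates and convert the statistic to a p-value.

```python
import numpy as np


def skat_q(genotypes, phenotype, weights=None):
    """Linear-kernel SKAT-type statistic: sum_j w_j^2 * (G_j' (y - mean(y)))^2.

    Illustrative only: no covariate adjustment and no p-value calculation.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    if weights is None:
        weights = np.ones(g.shape[1])  # unit weights (an assumption)
    resid = y - y.mean()               # residuals under an intercept-only null
    scores = g.T @ resid               # per-variant score contributions
    return float(np.sum((weights * scores) ** 2))


def maf_filter(genotypes, min_maf):
    """Keep only variants whose sample minor allele frequency is >= min_maf."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0        # allele frequency from 0/1/2 counts
    maf = np.minimum(freq, 1.0 - freq)
    return g[:, maf >= min_maf]


def dominant_coding(genotypes):
    """Recode 0/1/2 allele counts as carrier (1) versus non-carrier (0)."""
    return (np.asarray(genotypes) > 0).astype(float)


if __name__ == "__main__":
    # Hypothetical simulation: 500 subjects, 20 variants, one causal variant.
    rng = np.random.default_rng(0)
    n, p = 500, 20
    true_maf = rng.uniform(0.005, 0.2, size=p)
    G = rng.binomial(2, true_maf, size=(n, p))
    y = 0.5 * G[:, 0] + rng.normal(size=n)

    print("additive coding:", skat_q(G, y))
    print("dominant coding:", skat_q(dominant_coding(G), y))
    print("MAF >= 0.05 filter:", skat_q(maf_filter(G, 0.05), y))
```

Because the three statistics come from the same subjects and phenotype, any differences among them are attributable to the processing choice alone, which is the kind of comparison a design-stage sensitivity analysis would formalize.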

Keywords

Genome-Wide Association Study

Reproducibility

Set-Based Inference

Model Misspecification 

Main Sponsor

Section on Statistics in Genomics and Genetics