Large Impact of Genetic Data Processing Steps on Reproducibility in Genome-Wide Association Studies
Ryan Sun
Co-Author
University of Texas, MD Anderson Cancer Center
Thursday, Aug 7: 10:50 AM - 11:05 AM
2032
Contributed Papers
Music City Center
Genome-wide association studies (GWAS) play a crucial role in identifying genetic variants linked to complex traits, but reproducibility of results remains a major challenge due to inconsistencies in data processing pipelines. Set-based analyses, including hypothesis tests such as the Sequence Kernel Association Test (SKAT) and fine-mapping methods such as conditional regression, are widely used but can be highly sensitive to data cleaning steps. For example, choices about genotype coding, filtering criteria, and annotation databases can each greatly affect findings. This work quantifies the impact of cleaning choices on statistical power and effect size estimation using a model misspecification framework. We demonstrate that reasonable differences in such choices can lead to substantial changes in the operating properties of SKAT and conditional regression. Simulations and an application to a whole-exome sequencing pancreatic cancer dataset highlight the importance of standardized and transparent genotype processing for improving study replicability. Our results are publicly available in an R package that can be used for sensitivity analysis at the design stage of GWAS investigations.
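As a rough illustration of the sensitivity described above, the sketch below shows how one cleaning choice, the minor allele frequency (MAF) filter, changes both the variant set and a set-based statistic. It is not the authors' R package and not the full SKAT procedure: it computes only an unadjusted, unit-weight score statistic with a linear kernel, with no covariates and no p-value. The simulated data, effect sizes, thresholds, and the function names `skat_style_q` and `filter_by_maf` are all assumptions made for illustration.

```python
import numpy as np


def skat_style_q(genotypes, phenotype, weights=None):
    """Unadjusted SKAT-style score statistic with a weighted linear kernel:
    Q = sum_j w_j^2 * (g_j^T (y - mean(y)))^2. Illustrative only."""
    resid = phenotype - phenotype.mean()
    if weights is None:
        weights = np.ones(genotypes.shape[1])
    scores = genotypes.T @ resid  # one score per variant
    return float(np.sum((weights * scores) ** 2))


def filter_by_maf(genotypes, max_maf):
    """Keep variants whose sample MAF lies in (0, max_maf]."""
    allele_freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    keep = (maf > 0) & (maf <= max_maf)
    return genotypes[:, keep], keep


# Hypothetical simulated data: 500 subjects, 20 variants coded 0/1/2.
rng = np.random.default_rng(0)
n_subjects, n_variants = 500, 20
true_maf = rng.uniform(0.005, 0.10, size=n_variants)
G = rng.binomial(2, true_maf, size=(n_subjects, n_variants)).astype(float)
beta = np.zeros(n_variants)
beta[:3] = 0.5  # assumed: the first three variants are causal
y = G @ beta + rng.normal(size=n_subjects)

# Apply two plausible "rare variant" thresholds to the same data.
results = {}
for threshold in (0.01, 0.05):
    G_kept, keep = filter_by_maf(G, threshold)
    results[threshold] = (int(keep.sum()), skat_style_q(G_kept, y))
    print(f"MAF <= {threshold}: {results[threshold][0]} variants, "
          f"Q = {results[threshold][1]:.1f}")
```

Because the two thresholds retain different variant sets, the same data yield different test statistics, which is the kind of pipeline-dependent discrepancy the abstract refers to.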
Genome-Wide Association Study
Reproducibility
Set-Based Inference
Model Misspecification
Main Sponsor
Section on Statistics in Genomics and Genetics