Large Impact of Genetic Data Processing Steps on Reproducibility in Genome-Wide Association Studies

Presented During: Statistical Challenges and New Testing Methods in Genomics and Genetics

Ryan Sun Co-Author
University of Texas, MD Anderson Cancer Center

Naishu Kui First Author

Naishu Kui Presenting Author

Thursday, Aug 7: 10:50 AM - 11:05 AM
2032
Contributed Papers

Music City Center

Genome-wide association studies (GWAS) play a crucial role in identifying genetic variants linked to complex traits, but reproducibility of results remains a major challenge due to inconsistencies in data processing pipelines. Set-based analyses, including hypothesis tests such as the Sequence Kernel Association Test (SKAT) and fine-mapping methods such as conditional regression, are widely used but can be highly sensitive to data cleaning steps. For example, choices about genotype coding, filtering criteria, and annotation databases can all greatly affect findings. This work quantifies the impact of cleaning choices on statistical power and effect size estimation using a model misspecification framework. We demonstrate that reasonable differences in such choices can lead to substantial changes in the operating properties of SKAT and conditional regression. Simulations and an application to a whole-exome sequencing pancreatic cancer dataset highlight the importance of standardized and transparent genotype processing for improving study replicability. Our results are publicly available in an R package that can be used for sensitivity analysis at the design stage of GWAS investigations.
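
As a rough illustration of the sensitivity described in the abstract, the following Python sketch computes an unweighted SKAT-style score statistic on simulated genotypes under three processing choices: additive coding, dominant recoding, and a minor allele frequency (MAF) filter. It is not the authors' R package or their misspecification framework. The function names (`skat_q`, `recode_dominant`, `maf_filter`), the simulation settings, and the 5% MAF threshold are all hypothetical choices made for illustration. Only the statistic is computed; a p-value would additionally require the null distribution of the statistic (a mixture of chi-square variables), which is omitted here.

```python
import numpy as np


def skat_q(y, G, weights=None):
    """SKAT-style score statistic Q = r' G W^2 G' r.

    Assumes a continuous trait and an intercept-only null model, so the
    null residuals r are simply the centered phenotype.
    """
    r = y - y.mean()
    if weights is None:
        weights = np.ones(G.shape[1])  # unit weights (assumption)
    s = (G.T @ r) * weights  # weighted per-variant score
    return float(s @ s)


def recode_dominant(G):
    """Recode additive genotypes (0/1/2) as carrier indicators (0/1)."""
    return (G > 0).astype(float)


def maf_filter(G, threshold):
    """Keep only variants whose sample MAF is at least `threshold`."""
    freq = G.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return G[:, maf >= threshold]


# Simulated data: all settings below are illustrative assumptions.
rng = np.random.default_rng(0)
n, p = 500, 20
true_maf = rng.uniform(0.005, 0.2, size=p)
G = rng.binomial(2, true_maf, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[:3] = 0.5  # three causal variants
y = G @ beta + rng.normal(size=n)

q_additive = skat_q(y, G)
q_dominant = skat_q(y, recode_dominant(G))
q_filtered = skat_q(y, maf_filter(G, 0.05))

print("additive coding:", q_additive)
print("dominant coding:", q_dominant)
print("MAF >= 0.05 filter:", q_filtered)
```

The same phenotype and the same underlying genotypes yield different statistics under each choice, which is the kind of discrepancy a sensitivity analysis at the design stage would aim to quantify.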

Keywords

Genome-Wide Association Study

Reproducibility

Set-Based Inference

Model Misspecification

Main Sponsor

Section on Statistics in Genomics and Genetics