Thursday, Aug 7: 10:30 AM - 12:20 PM
4229
Contributed Papers
Music City Center
Room: CC-104D
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Epigenome-wide association studies often involve large-scale DNA methylation data. Efficient screening of CpG sites associated with binary outcomes is challenging, especially when the events are rare (<5%). Existing screening methods, such as ttScreening, effectively filter out unimportant variables (e.g., CpG sites) in epigenome-wide studies, but they do not work well for binary outcomes with rare events. To address this, we developed a novel screening method that combines resampling with replacement and empirical Bayes adjustment to stabilize estimates and improve sensitivity. In parallel, we implemented a ttScreening approach that embeds logistic regression with Firth's penalty term to mitigate bias in rare-event settings. We evaluate the performance of the proposed approaches and benchmark methods, using FDR and Bonferroni adjustments for multiple testing, through extensive simulations with varying sample sizes and numbers of parameters. The results show that the proposed methods achieve higher sensitivity than the benchmark methods. To facilitate implementation, we have developed an R package, rareScreening, now available on GitHub.
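As a rough illustration of the Firth-penalized screening step described above (a minimal sketch on simulated data with made-up variable names, not the rareScreening implementation itself), one could fit a Firth-penalized logistic regression per CpG site with the logistf package and retain sites that pass an FDR screen:

# Illustrative per-CpG Firth-penalized screen (simulated data; not the rareScreening package)
library(logistf)
set.seed(1)
n <- 500; p <- 200
meth <- matrix(rbeta(n * p, 2, 5), nrow = n,
               dimnames = list(NULL, paste0("cg", seq_len(p))))
y <- rbinom(n, 1, 0.03)  # rare binary outcome (~3% events)
# Firth-penalized logistic regression for each CpG site
pvals <- apply(meth, 2, function(cpg) {
  fit <- logistf(y ~ cpg, data = data.frame(y = y, cpg = cpg))
  fit$prob["cpg"]  # penalized-likelihood p-value for the CpG effect
})
# Retain CpG sites passing a Benjamini-Hochberg FDR screen at 5%
selected <- names(which(p.adjust(pvals, method = "BH") < 0.05))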
Keywords
Screening
rare-event
sensitivity
resampling
Genome-wide association studies (GWAS) play a crucial role in identifying genetic variants linked to complex traits, but reproducibility of results remains a major challenge due to inconsistencies in data processing pipelines. Set-based analyses, including hypothesis tests such as the Sequence Kernel Association Test (SKAT) and fine-mapping methods such as conditional regression, are widely used but can be highly sensitive to data cleaning steps. For example, choices about genotype coding, filtering criteria, and annotation databases can all greatly affect findings. This work quantifies the impact of cleaning choices on statistical power and effect size estimation using a model misspecification framework. We demonstrate that reasonable differences in such choices can lead to significant changes in the operating properties of SKAT and conditional regression. Simulations and an application to a whole-exome sequencing pancreatic cancer dataset highlight the importance of standardized and transparent genotype processing for improving study replicability. Our results are publicly available in an R package that can be used for sensitivity analysis at the design stage of GWAS investigations.
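As a minimal sketch of this kind of sensitivity check (simulated genotypes and arbitrary minor-allele-frequency cutoffs, using the SKAT R package rather than the authors' own software), one can compare SKAT p-values under two different variant filtering choices:

# How a MAF filtering choice can change a SKAT result (illustrative simulation)
library(SKAT)
set.seed(1)
n <- 1000; m <- 50
maf <- runif(m, 0.001, 0.05)
Z <- sapply(maf, function(f) rbinom(n, 2, f))   # genotype matrix, individuals x variants
y <- rbinom(n, 1, 0.1)                          # binary phenotype
obj <- SKAT_Null_Model(y ~ 1, out_type = "D")   # null model for a dichotomous outcome
for (cutoff in c(0.01, 0.05)) {
  keep <- colMeans(Z) / 2 < cutoff              # filter on observed MAF
  p <- SKAT(Z[, keep, drop = FALSE], obj)$p.value
  cat(sprintf("MAF < %.2f: %d variants, SKAT p = %.3g\n", cutoff, sum(keep), p))
}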
Keywords
Genome-Wide Association Study
Reproducibility
Set-Based Inference
Model Misspecification
Rare variant association studies are inherently challenging due to the sparse nature of the variants and the high computational demands of analyzing large-scale datasets. We introduce POET (Poisson Exact Test), a novel statistical framework specifically designed for rare variant analysis in case-control studies. POET simplifies testing by requiring only summary-level carrier counts and allele frequencies as inputs. Extensive simulations show that POET outperforms Fisher's exact test in power while controlling the false discovery rate (FDR). It is also more computationally efficient than regression-based methods. Applied to whole-exome sequencing data from approximately 400K UK Biobank European participants, POET identified five significant genes (BRCA2, CHEK2, PALB2, BRCA1, ATM) for breast cancer and four (CHEK2, ATM, BRCA2, RNF212) for prostate cancer (FDR<0.05). These results are comparable to the genes identified using the SKAT-O method in GENEBASS. By combining minimal data requirements with computational efficiency, POET is a scalable and powerful tool for large biobank studies, particularly in cloud computing scenarios, adding a valuable option to the rare variant analysis toolkit.
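The abstract does not spell out the exact form of the POET statistic; as a rough sketch of a summary-level Poisson test in the same spirit (all counts and frequencies below are made up, and base R's poisson.test stands in for the actual method), one could compare the observed carrier count among cases with the count expected from a reference allele frequency:

# Hypothetical summary-level inputs (illustrative numbers, not UK Biobank results)
n_cases  <- 15000     # number of cases
carriers <- 42        # rare-variant carriers observed among cases
af       <- 5e-4      # reference (e.g., control) allele frequency
# Expected carriers under the null, approximating carrier probability by 2 * AF
expected <- 2 * af * n_cases
# One-sided exact Poisson test for carrier enrichment among cases
poisson.test(carriers, T = expected, alternative = "greater")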
Keywords
Rare Variant Association Studies
Whole Exome Sequencing
UK Biobank
With larger and more diverse studies becoming the standard in genome-wide association studies (GWAS), accurate estimation of ancestral proportions is increasingly important for summary-statistics-based methods, such as those for imputing association summary statistics, adjusting allele frequencies (AFs) for ancestry, and prioritizing disease candidate variants or genes. Existing methods for estimating ancestral proportions in GWAS rely on the availability of study reference AFs, which are often inaccessible in current GWAS due to privacy concerns. In this study, we propose ZMIX (Z-score-based estimation of ethnic MIXing proportions), a novel method for estimating ethnic mixing proportions in GWAS using only association Z-scores, and we compare its performance with that of existing reference AF-based methods in both real-world and simulated GWAS settings. We found that ZMIX offered results comparable to those of the reference AF-based methods in both simulated and real-world studies. When applied to summary-statistics imputation, all three methods produced high-quality imputations with almost identical results.
Keywords
GWAS
ancestry proportions
association Z-scores
summary statistics
Cohort studies involving analyses of high-throughput (HTP) data will become more prevalent as the cost barriers of these technologies fall. Because of differences in their study designs, observational studies, including cohort studies, may face statistical challenges not found in randomized studies. For instance, cohort studies may include multiple groups (i.e., >2), heterogeneity within the groups, or unequal sample sizes, or they may lack appropriate power and sample size to detect differences in HTP measures. We thoroughly review the challenges and issues faced when associating HTP data (e.g., miRNAs, exposomics) with disease status, with an application to the discovery of liver disease biomarkers in a residential cohort exposed to environmental toxins. Among the potential issues, we identified heterogeneity, inappropriate data dimensionality reduction, improper handling of multiple testing, inadequate HTP data pre-processing procedures, unmet model assumptions, and misidentified biological relevance as barriers to a properly executed biomarker discovery analysis in a population cohort. Correctly accounting for these factors should result in more robust, unbiased statistical findings.
Keywords
high throughput studies
biomarkers
heterogeneity
cohort studies
liver disease
environmental exposures
Co-Author(s)
Shesh Rai, University of Louisville
Matthew Cave, University of Louisville
First Author
Christina Pinkston, University of Louisville and Biostats, Health Inform & Data Sci, University of Cincinnati College of Medicine
Presenting Author
Christina Pinkston, University of Louisville and Biostats, Health Inform & Data Sci, University of Cincinnati College of Medicine
We examine the problem of assessing how the association of metabolite levels with individual characteristics (such as sex or treatment) depends on metabolite traits (e.g., pathways). A standard approach involves two steps: testing each metabolite's association, followed by enrichment analysis. We combine both steps using a bilinear model based on the matrix linear model (MLM) framework. Our method estimates relationships among metabolites that share known characteristics, whether categorical (such as lipid type or pathway) or numerical (such as the number of double bonds in triglycerides). We demonstrate the flexibility and interpretability of MLMs across various metabolomic studies. We illustrate how our method can distinguish the contributions of two correlated features of triglycerides, the number of carbon atoms and the number of double bonds, which would be overlooked if lipids were analyzed individually. Our method has been implemented in the open-source Julia package MatrixLM and can be explored using interactive notebooks.
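A bilinear matrix linear model of this kind is commonly written as follows (the notation here is an assumed sketch based on the general MLM framework, not a quotation from the talk):

Y = X B Z' + E,

where Y is the n x m matrix of metabolite levels (samples by metabolites), X is the n x p matrix of individual characteristics (e.g., sex, treatment), Z is the m x q matrix of metabolite traits (e.g., pathway indicators or number of double bonds), B is the p x q matrix of interaction effects to be estimated, and E is the error matrix. Inference on the entries of B then combines the per-metabolite association step and the enrichment step in a single model.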
Keywords
high-throughput data
metabolomics
Julia package
Batch effects, or technical variation across experimental batches, can obscure true biological signals and represent a significant challenge in the analysis and interpretation of 'omics data. This is particularly true of proteomics data generated using Olink Target technology. To mitigate batch effects, Olink recommends using bridge samples; however, it is unclear whether bridge samples are always necessary. Furthermore, if bridge samples are needed, key questions arise regarding the appropriate number of bridge samples as well as the strategy for selecting specific bridge samples. To shed light on these questions, we conducted a systematic evaluation of three batch correction approaches, Olink's bridge sample method, COMBAT, and a case-control confounded approach (Remeasure), across three different study designs: (1) cases and controls processed in separate batches, (2) cases and controls mixed within each batch, and (3) cases distributed across multiple batches. Using simulations that closely reflect real-world datasets, we assessed the impact of the batch correction methods on statistical power, Type I error, and false discovery rate. Our results provide guidance on which correction method performs best under different scenarios and on the optimal number of bridge samples needed for effective correction. We further validated our findings by applying these methods to a real dataset. While our study focuses on batch correction within Olink proteomics data, the methodologies and insights presented here may be applicable to other high-dimensional omics datasets facing similar challenges.
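As a minimal sketch of the ComBat arm of this comparison (using the sva implementation of ComBat on simulated, Olink-style NPX data; the matrix and design below are illustrative, not the study's data):

# Illustrative ComBat correction of a proteins-by-samples NPX matrix (simulated data)
library(sva)
set.seed(1)
n_prot <- 100; n_samp <- 60
npx   <- matrix(rnorm(n_prot * n_samp), nrow = n_prot)    # proteins x samples
batch <- rep(c("plate1", "plate2"), each = n_samp / 2)    # processing batch
group <- rep(c("case", "control"), times = n_samp / 2)    # case-control status
# Protect the biological contrast of interest while removing batch effects
mod <- model.matrix(~ group)
npx_corrected <- ComBat(dat = npx, batch = batch, mod = mod)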
Keywords
Batch Effect
Bridge Sample Selection
OlinkAnalyze
COMBAT
High-Dimensional Data
Olink Proteomics
Co-Author(s)
Rondi A Butler, Brown University School of Public Health, RI, USA
Lucas A Salas, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
Brock C Christensen, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
Karl T Kelsey, Brown University School of Public Health, Providence, RI, USA
Devin C Koestler, University of Kansas Medical Center, Kansas City, KS, USA
First Author
Md Saiful Islam Saif
Presenting Author
Md Saiful Islam Saif