Tuesday, Aug 5: 10:30 AM - 12:20 PM
0809
Topic-Contributed Paper Session
Music City Center
Room: CC-210
Applied: Yes
Main Sponsor
Mental Health Statistics Section
Co-Sponsors
International Chinese Statistical Association
Section on Statistics in Genomics and Genetics
Presentations
The reproducibility of epi-transcriptomic data analysis hinges on effectively mitigating, through data harmonization, the data artifacts that arise from variable experimental handling. While numerous harmonization methods, encompassing normalization and batch-effect correction, have been developed to address these artifacts, statistical investigations into their impact on downstream analyses have focused primarily on differential expression analysis. To promote evidence-based practices in data harmonization, my team has developed robust benchmark datasets, novel statistical methods, and accompanying software tools, with a particular focus on microRNAs. In this talk, I will present findings from a simulation study evaluating the performance of various data harmonization approaches in the contexts of sample clustering and sample classification, each assessed using multiple analytical methods. The best-performing combinations of harmonization and downstream analysis methods were then applied to reanalyze publicly available real-world data.
Speaker
Li-Xuan Qin, Memorial Sloan Kettering Cancer Center
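The abstract does not specify which harmonization methods were compared. As one common representative of the normalization family it mentions, quantile normalization forces every sample to share a single empirical distribution. A minimal NumPy sketch on a toy expression matrix with a simulated handling artifact (all names and values here are illustrative, not from the study):

```python
import numpy as np

def quantile_normalize(X):
    """Force every sample (column) of X to share one empirical distribution:
    rank values within each column, then replace each rank with the mean of
    the values observed at that rank across all columns (no ties assumed)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # per-column ranks
    rank_means = np.sort(X, axis=0).mean(axis=1)       # mean value at each rank
    return rank_means[ranks]

# Toy miRNA-like matrix: 6 features x 3 samples, with the second sample
# shifted to mimic an artifact from variable experimental handling.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=1.0, size=(6, 3))
X[:, 1] += 10.0                      # simulated batch shift
Xn = quantile_normalize(X)
print(Xn.mean(axis=0))               # per-sample means now agree
```

After normalization the three columns contain identical sets of values, so a downstream clustering or classification method no longer sees the batch shift.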
Constructing gene regulatory networks is crucial to understanding the genetic architecture of complex traits. However, constructing directed networks over genome-wide genes remains a challenge due to the high dimensionality. Taking advantage of both transcriptomic and single nucleotide polymorphism data, we propose a two-stage penalized least squares method to build large systems of structural equations for directional network construction. The first stage consistently estimates a set of conditional expectations, and the second stage builds on these estimates to consistently select regulatory effects. The proposed method can simultaneously investigate all the genes across the entire genome, and the computation is fast due to the parallel implementation. Such unbiased network construction will enable the determination of causal relationships between genes and facilitate our understanding of disease mechanisms. We demonstrate the superior performance and effectiveness of the method using simulation studies, and the method has been successfully applied to real data.
Speaker
Min Zhang, University of California, Irvine
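The two-stage idea can be caricatured in a small simulation: stage 1 regresses each gene's expression on the SNPs (here with ridge) to estimate the conditional expectations, and stage 2 runs a sparse regression (here a hand-rolled coordinate-descent lasso) of a target gene on those fitted values to select regulatory effects. This is a hedged toy sketch of the general two-stage strategy, not the authors' estimator; penalties, tuning values, and the simulated model are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=(n, 3))                     # SNP genotypes (instruments)
x1 = Z @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=n)
x2 = Z @ np.array([0.0, 0.0, 1.0]) + rng.normal(scale=0.3, size=n)
y  = 2.0 * x1 + rng.normal(scale=0.3, size=n)   # only gene 1 regulates y
X = np.column_stack([x1, x2])

# Stage 1: ridge-regress each expression on the SNPs; the fitted values
# estimate the conditional expectations E[x | Z].
def ridge_fit(Z, x, lam=0.1):
    p = Z.shape[1]
    return Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ x)

Xhat = np.column_stack([ridge_fit(Z, X[:, j]) for j in range(X.shape[1])])

# Stage 2: lasso (coordinate descent) of y on the stage-1 fitted values,
# which selects which genes have regulatory effects on y.
def lasso_cd(X, y, lam, n_iter=300):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # partial residual
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (X[:, j] @ X[:, j] / n)
    return beta

beta = lasso_cd(Xhat, y, lam=0.1)
print(beta)    # gene 1's effect is retained; gene 2's should be near zero
```

Because each expression appears in stage 2 only through its SNP-predicted part, confounding between expression noise and the outcome is removed, which is what makes the directed (causal) interpretation possible.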
Understanding the mediating role of omics data is crucial for uncovering the biological mechanisms through which an established risk factor influences an outcome of interest. Despite extensive research on high-dimensional mediation analysis, existing methods have often fallen short in accurately quantifying the global contribution of omics mediators, particularly those with weak effects. When investigating proteins as mediators of the pathway from environmental exposures to cardiovascular outcomes in the MESA dataset, we found that many likely have weak mediating effects that could collectively play a substantial mediating role. To address this issue, we propose new variance-based causal measures under the causal mediation analysis framework. We then develop a flexible and computationally efficient estimation procedure based on a mixed-effects working model. With this approach, we were able to accurately quantify the total mediation effect; in both simulation studies and real data analysis, we found that a substantial portion of it is attributable to weak mediators, which existing methods largely mis-estimate. This result offers valuable guidance for future study design and downstream analyses. The proposed approach is general and complements existing methodologies by offering new perspectives on global and weak effects in mediation analysis.
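The phenomenon of many weak mediators adding up to a substantial total effect can be seen in a small simulation. The sketch below uses the classical difference method (total effect minus direct effect) rather than the authors' variance-based measures or mixed-effects working model, which the abstract does not detail; the data-generating values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 4000, 50
x = rng.normal(size=n)                           # exposure
alpha = rng.uniform(0.05, 0.2, size=p)           # weak exposure -> mediator paths
beta  = rng.uniform(0.05, 0.2, size=p)           # weak mediator -> outcome paths
M = np.outer(x, alpha) + rng.normal(size=(n, p)) # 50 weak mediators
y = 0.5 * x + M @ beta + rng.normal(size=n)      # direct effect 0.5

true_mediated = float(alpha @ beta)              # total mediation effect

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Difference method: total effect of x on y, minus the direct effect of x
# after adjusting for all mediators jointly.
c_total  = ols(np.column_stack([np.ones(n), x]), y)[1]
c_direct = ols(np.column_stack([np.ones(n), x, M]), y)[1]
est_mediated = c_total - c_direct
print(true_mediated, est_mediated)
```

Each individual path product alpha_j * beta_j is tiny and would be hard to detect on its own, yet the 50 of them sum to a mediated effect comparable to the direct effect, which is the regime the proposed global measures target.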
Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
Keywords
Sample size determination
Deep generative models
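The power-versus-sample-size relationship described above rests on fitting a learning curve to classification accuracy at a few pilot sample sizes and then inverting it. The authors' curve family is not specified here; a common choice is the inverse power law acc(n) = a - b * n^(-c). The sketch below fixes c = 1 so the fit reduces to ordinary least squares in 1/n, and all pilot numbers are synthetic:

```python
import numpy as np

# Pilot classification accuracies at a few training-set sizes (toy values
# generated from acc(n) = 0.90 - 2/n plus a little noise).
ns  = np.array([10, 20, 40, 80, 160], dtype=float)
rng = np.random.default_rng(3)
acc = 0.90 - 2.0 / ns + rng.normal(scale=0.002, size=ns.size)

# Fit acc(n) = a - b/n by least squares in the predictor 1/n.
A = np.column_stack([np.ones_like(ns), -1.0 / ns])
a_hat, b_hat = np.linalg.lstsq(A, acc, rcond=None)[0]

# Invert the fitted curve: smallest n reaching a target accuracy.
target = 0.85
n_required = b_hat / (a_hat - target)
print(a_hat, b_hat, n_required)   # n_required should land near 40
```

In the proposed approach the pilot accuracies would come from classifiers trained on augmented data at increasing sizes rather than from a known formula, but the curve-fit-then-invert step is the same.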
Pleiotropic variants are often identified by running separate genome-wide association studies (GWAS) on each trait and then combining results, but this marginal-summary-statistics-based approach can lead to spurious findings by inflating each trait's residual variance. We propose a new statistical approach, Debiased-regularized Factor Analysis Regression Model (DrFARM), which employs a joint regression model to analyze high-dimensional genetic variants while accounting for multilevel trait dependencies. This joint modeling strategy permits comprehensive false discovery rate (FDR) control. DrFARM leverages debiasing techniques and the Cauchy combination test, both theoretically justified, to establish a valid post-selection inference on pleiotropic variants. Through extensive simulations, we demonstrate that DrFARM appropriately controls the overall FDR. Applying DrFARM to data on 1,031 metabolites measured in 6,135 men from the Metabolic Syndrome in Men (METSIM) study, we identify 288 new metabolite associations at loci that did not reach significance in prior METSIM metabolite GWAS analyses.
Keywords
High-dimensional inference
debiasing
metabolomics
factor analysis model
post-selection inference
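The Cauchy combination test that DrFARM leverages has a simple closed form: map each p-value to a standard Cauchy variate, average, and map back. A self-contained sketch of the standard (Liu and Xie) statistic, independent of DrFARM's surrounding debiasing machinery:

```python
import math

def cauchy_combination(pvals, weights=None):
    """Combine p-values via the Cauchy combination test: a weighted sum of
    standard Cauchy variates is itself standard Cauchy (weights summing to 1),
    so the combined p-value remains valid under quite general dependence
    among the component p-values."""
    if weights is None:
        weights = [1.0 / len(pvals)] * len(pvals)
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvals))
    return 0.5 - math.atan(t) / math.pi

print(cauchy_combination([0.01, 0.4, 0.9]))
```

The dependence-robustness is what makes the test attractive for aggregating correlated per-trait evidence about a variant into a single pleiotropy p-value.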