Advances in Statistical Methods for Omics Data: Integrative Analysis and Causal Inference

Chair: Chong Wang, Iowa State University

Organizer: Chunlin Li, Iowa State University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0809 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-210 

Applied

Yes

Main Sponsor

Mental Health Statistics Section

Co-Sponsors

International Chinese Statistical Association
Section on Statistics in Genomics and Genetics

Presentations

Evidence-based Practice for Epi-Transcriptomic Data Harmonization

The reproducibility of epi-transcriptomic data analysis hinges on data harmonization that effectively mitigates the artifacts arising from variable experimental handling. While numerous harmonization methods, encompassing normalization and batch-effect correction, have been developed to address these artifacts, statistical investigations into their impact on downstream analyses have focused primarily on differential expression analysis. To promote evidence-based practices in data harmonization, my team has developed robust benchmark datasets, novel statistical methods, and accompanying software tools, with a particular focus on microRNAs. In this talk, I will present findings from a simulation study evaluating the performance of various data harmonization approaches in the contexts of sample clustering and sample classification, each assessed using multiple analytical methods. The best-performing combinations of harmonization and downstream analysis methods were then applied to reanalyze publicly available real-world data.
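For illustration only (a minimal sketch, not the speaker's pipeline or software): the Python snippet below scores sample clustering before and after a simple per-batch median-centering correction, using the adjusted Rand index. The simulated data, the stand-in correction, and all parameter values are hypothetical.

# Minimal illustrative sketch: score sample clustering with and without a
# simple per-batch correction, using the adjusted Rand index (ARI).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_per_group, n_features = 30, 200

# Two biological groups, separated by a mean shift in the first 20 features
group = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_features))
X[group == 1, :20] += 1.0

# A handling/batch artifact that cuts across the biological groups
batch = np.tile([0, 1], n_per_group)
X[batch == 1, :] += 2.0

def cluster_ari(data):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(group, labels)

# Stand-in "harmonization": center each batch at its per-feature median
X_corrected = X.copy()
for b in np.unique(batch):
    X_corrected[batch == b] -= np.median(X[batch == b], axis=0)

print("ARI before correction:", round(cluster_ari(X), 3))
print("ARI after correction: ", round(cluster_ari(X_corrected), 3))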

Speaker

Li-Xuan Qin, Memorial Sloan Kettering Cancer Center

Transcriptome-wide gene regulatory network construction

Constructing gene regulatory networks is crucial for understanding the genetic architecture of complex traits. However, constructing directed networks over genome-wide genes remains challenging due to the high dimensionality. Taking advantage of both transcriptomic and single nucleotide polymorphism data, we propose a two-stage penalized least squares method to build large systems of structural equations for directed network construction. In the first stage, a set of conditional expectations is consistently estimated, allowing a large system of structural equations to be constructed; in the second stage, a consistent selection of regulatory effects is obtained. The proposed method can simultaneously investigate all genes across the entire genome, and computation is fast owing to a parallel implementation. Such unbiased network construction will enable the determination of causal relationships among genes and facilitate our understanding of disease mechanisms. We demonstrate the superior performance and effectiveness of the method through simulation studies, and the method has been successfully applied to real data.
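For illustration only (a minimal sketch of a generic two-stage penalized least-squares scheme, not the presenters' implementation): stage one estimates each gene's conditional expectation given the SNPs, and stage two selects regulatory effects by penalized regression on those fitted values. The toy data, the ridge and lasso choices, and the tuning values are assumptions.

# Minimal illustrative sketch of a generic two-stage penalized least-squares
# idea for directed network construction (not the presenters' implementation).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
n, n_genes, n_snps = 200, 10, 30

Z = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)   # genotypes (toy)
Y = rng.normal(size=(n, n_genes))                          # expression noise
Y[:, 0] += Z[:, :3] @ np.array([0.8, 0.6, 0.5])            # gene 0 has cis-eQTLs
Y[:, 1] += 0.7 * Y[:, 0]                                   # true edge: gene 0 -> gene 1

# Stage 1: estimate E[Y_j | SNPs] for every gene with a penalized regression
Y_hat = np.column_stack([
    Ridge(alpha=1.0).fit(Z, Y[:, j]).predict(Z) for j in range(n_genes)
])

# Stage 2: regress each gene on the fitted values of the other genes;
# nonzero lasso coefficients are candidate directed regulatory effects
adjacency = np.zeros((n_genes, n_genes))
for j in range(n_genes):
    others = [k for k in range(n_genes) if k != j]
    fit = Lasso(alpha=0.05).fit(Y_hat[:, others], Y[:, j])
    adjacency[others, j] = fit.coef_   # column j lists regulators of gene j

print("selected edges (regulator, target):", list(zip(*np.nonzero(adjacency))))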

Speaker

Min Zhang, University of California, Irvine

Measuring weak effects in high dimensional mediation analysis for omics data

Understanding the mediating role of omics data is crucial for uncovering the biological mechanisms through which an established risk factor influences an outcome of interest. Despite extensive research on high-dimensional mediation analysis, existing methods have often fallen short in accurately quantifying the global contribution of omics mediators, particularly those with weak effects. When investigating proteins as mediators between environmental exposures and cardiovascular outcomes in the MESA dataset, we found that many likely have weak mediating effects that could collectively play a substantial mediating role. To address this issue, we propose new variance-based causal measures under the causal mediation analysis framework. We then develop a flexible and computationally efficient estimation procedure based on a mixed-effects working model. In simulation studies and real data analysis, this approach accurately quantified the total mediation effect and revealed that a substantial portion is attributable to weak mediators, a quantity largely mis-estimated by existing methods. This result offers valuable guidance for future study design and downstream analyses. The proposed approach is general and complements existing methodologies by offering new perspectives on global and weak effects in mediation analysis.
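For illustration only (not the proposed estimator): the sketch below computes one simple variance-based, R-squared-type mediation measure in a setting with many weak mediators. The simulated exposure, mediators, effect sizes, and the particular R-squared decomposition are assumptions.

# Minimal illustrative sketch of a variance-based (R-squared-type) mediation
# measure with many weak mediators; not the proposed estimator.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 500, 100                                # samples, candidate mediators

x = rng.normal(size=(n, 1))                    # exposure
alpha = rng.normal(scale=0.15, size=p)         # weak exposure -> mediator effects
beta = rng.normal(scale=0.15, size=p)          # weak mediator -> outcome effects
M = x @ alpha[None, :] + rng.normal(size=(n, p))
y = 0.5 * x[:, 0] + M @ beta + rng.normal(size=n)

def r2(features):
    return LinearRegression().fit(features, y).score(features, y)

# One simple variance decomposition: variance in y shared by x and the mediators
r2_x, r2_m, r2_xm = r2(x), r2(M), r2(np.hstack([x, M]))
r2_mediated = r2_x + r2_m - r2_xm

print(f"R^2 from the exposure alone:         {r2_x:.3f}")
print(f"R^2 mediated through weak mediators: {r2_mediated:.3f}")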

Speaker

Tianzhong Yang

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment. 
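For illustration only (not the released Python/R code): a minimal sketch of the learning-curve step, fitting an inverse power law to classification accuracy observed at hypothetical pilot sample sizes and extrapolating to larger ones; the data-augmentation step of the presented approach is omitted.

# Minimal illustrative sketch of the learning-curve step: fit an inverse power
# law to accuracy observed at pilot sample sizes and extrapolate to larger n.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    # a: asymptotic accuracy; b, c: how quickly the error shrinks with n
    return a - b * np.power(n, -c)

# Hypothetical pilot results: classifier accuracy at small sample sizes
n_obs = np.array([25, 50, 75, 100, 150, 200], dtype=float)
acc_obs = np.array([0.62, 0.68, 0.72, 0.74, 0.77, 0.79])

params, _ = curve_fit(learning_curve, n_obs, acc_obs, p0=[0.9, 1.0, 0.5],
                      maxfev=10000)

for n_target in (400, 800, 1600):
    print(f"predicted accuracy at n = {n_target}: "
          f"{learning_curve(n_target, *params):.3f}")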

Keywords

Sample size determination

Deep generative models 

Speaker

Yunhui Qi

DrFARM: Identification and inference for pleiotropic genes in multi-trait metabolomics GWAS

Pleiotropic variants are often identified by running separate genome-wide association studies (GWAS) on each trait and then combining results, but this marginal-summary-statistics-based approach can lead to spurious findings by inflating each trait's residual variance. We propose a new statistical approach, Debiased-regularized Factor Analysis Regression Model (DrFARM), which employs a joint regression model to analyze high-dimensional genetic variants while accounting for multilevel trait dependencies. This joint modeling strategy permits comprehensive false discovery rate (FDR) control. DrFARM leverages debiasing techniques and the Cauchy combination test, both theoretically justified, to establish a valid post-selection inference on pleiotropic variants. Through extensive simulations, we demonstrate that DrFARM appropriately controls the overall FDR. Applying DrFARM to data on 1,031 metabolites measured in 6,135 men from the Metabolic Syndrome in Men (METSIM) study, we identify 288 new metabolite associations at loci that did not reach significance in prior METSIM metabolite GWAS analyses. 
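For illustration only (not DrFARM itself): a minimal sketch of the Cauchy combination test mentioned above, aggregating per-trait p-values for a single variant into one combined p-value. The example p-values are hypothetical, and the weights are taken to be equal.

# Minimal illustrative sketch of the Cauchy combination test: aggregate the
# per-trait p-values of one variant into a single pleiotropy p-value.
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvals, weights=None):
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    # Map each p-value to a standard Cauchy variate, then take a weighted sum
    stat = np.sum(w * np.tan((0.5 - p) * np.pi))
    return cauchy.sf(stat)   # upper-tail probability of the standard Cauchy

# Hypothetical p-values for one variant tested against several metabolite traits
per_trait_p = [0.30, 0.002, 0.15, 0.0004, 0.45]
print(f"combined p-value: {cauchy_combination(per_trait_p):.2e}")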

Keywords

High-dimensional inference

Debiasing

Metabolomics

Factor analysis model

Post-selection inference

Speaker

Lap Sum Chan