Statistical Models for Omics Data Integration

Haohao Su, Chair
Michigan State University
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
4199 
Contributed Papers 
Music City Center 
Room: CC-101D 

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

Interpretable integration of multiple spatial transcriptomics datasets with INSPIRE

Recent advances in spatial transcriptomics technologies have produced diverse datasets, offering opportunities to explore tissue organization within its spatial context. However, it remains a significant challenge to integrate and interpret these data effectively, since they often originate from different samples, technologies, and developmental stages. To address this challenge, we present INSPIRE, a deep learning method for integrative analysis of multiple spatial transcriptomics datasets. By combining graph neural networks with an adversarial learning mechanism, INSPIRE enables spatially informed and adaptable integration of data from varying sources. By incorporating non-negative matrix factorization, INSPIRE uncovers interpretable spatial factors with corresponding gene programs, revealing tissue architectures, cell type distributions, and biological processes. We showcase INSPIRE's capabilities by applying it to diverse datasets. INSPIRE shows superior performance in identifying detailed biological signals, effectively borrowing information across distinct profiling technologies, and elucidating dynamic changes during embryonic development.

Keywords

Spatial transcriptomics

Data integration

Deep learning

Data interpretation 

Co-Author(s)

Xiangyu Zhang, Yale University
Gefei Wang, Yale University
Yingxin Lin
Tianyu Liu
Rui Chang, Yale University
Hongyu Zhao, Yale University

First Author

Jia Zhao, Yale University

Presenting Author

Jia Zhao, Yale University

MultiGATE: Integrative Analysis and Regulatory Inference in Spatial Multi-Omics Data

New spatial multi-omics technologies, which jointly profile the transcriptome and epigenome or protein markers in the same tissue section, have expanded the frontiers of spatial techniques. Here we introduce MultiGATE, which uses a two-level graph attention auto-encoder to integrate the multi-modality and spatial information in spatial multi-omics data. The key feature of MultiGATE is that it simultaneously embeds the spatial pixels and infers cross-modality regulatory relationships, which allows deeper data integration and provides insights into transcriptional regulation. We evaluated the performance of MultiGATE on spatial multi-omics datasets obtained from different tissues and platforms. By effectively integrating spatial multi-omics data, MultiGATE both enhances the extraction of latent embeddings of the pixels and boosts the inference of transcriptional regulation for cross-modality genomic features.

Keywords

Data Integration

Spatial multi-omics data 

Co-Author(s)

Jishuai MIAO, The Chinese University of Hong Kong
Ying Zhu, Fudan University
Can Yang, The Hong Kong University of Science and Technology
Zhixiang Lin, The Chinese University of Hong Kong

First Author

Jinzhao Li, The Chinese University of Hong Kong

Presenting Author

Jishuai MIAO, The Chinese University of Hong Kong

Nonlinear Embedding and Integration of Omics Data: A Fast and Tuning-Free Approach

The rapid progress of single-cell technology is enabling biologists to unravel the intricacies of cell populations, disease states, and developmental lineages. However, the high-dimensional, noisy, and sparse nature of single-cell omics data poses significant analytical challenges. Here, we introduce DCOL (Dissimilarity based on Conditional Ordered List) correlation, a functional dependency measure for quantifying nonlinear relationships between variables. Based on this measure, we propose DCOL-PCA and DCOL-CCA for dimension reduction and integration of single- and multi-omics data. In simulation studies, our methods outperformed eight other dimension reduction (DR) methods and four joint dimension reduction (jDR) methods, showing stable performance across various settings. They proved highly effective in extracting essential factors even in the most challenging scenarios. We also validated these methods on real datasets, where they detected intricate signals within and between omics data and generated lower-dimensional embeddings that preserve the essential information and latent structures in the data.
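The abstract does not define DCOL correlation, so the following is only a minimal sketch of the idea suggested by the name, under a stated assumption: the dissimilarity of Y given X is taken to be the sum of absolute differences between consecutive Y values after the observations are sorted by X. The function name `dcol_dissimilarity` is hypothetical, and the correlation measure, DCOL-PCA, and DCOL-CCA themselves are not reproduced here.

```python
from typing import Sequence


def dcol_dissimilarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of |y[i+1] - y[i]| after ordering the pairs by x.

    Assumption (not taken from the abstract): this is the basic
    'conditional ordered list' quantity. A small value means y varies
    smoothly along x, whether or not the relationship is linear.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")
    # Indices that sort the observations by x.
    order = sorted(range(len(x)), key=lambda i: x[i])
    y_ordered = [y[i] for i in order]
    # Distances between neighbours in the ordered list.
    return sum(abs(b - a) for a, b in zip(y_ordered, y_ordered[1:]))


if __name__ == "__main__":
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    smooth = [v * v for v in xs]           # nonlinear but functional in x
    scrambled = [4.0, 0.0, 4.0, 1.0, 1.0]  # same values, no relation to x
    print(dcol_dissimilarity(xs, smooth))
    print(dcol_dissimilarity(xs, scrambled))
```

Under this assumption, a quadratic relationship yields a smaller value than the same Y values arranged without regard to X, although their linear correlation with X is zero in the quadratic case.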

Keywords

Nonlinear Dimensionality Reduction

Single-cell Analysis

Multi-Omics Integration 

Co-Author

Tianwei Yu

First Author

Shengjie Liu, The Chinese University of Hong Kong, Shenzhen

Presenting Author

Tianwei Yu

HighDimMixedModels.jl: Robust high-dimensional mixed-effects models across omics data

High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm's behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl. 
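To make the comparison of the two penalties concrete, here is a minimal sketch of the standard LASSO and SCAD penalty functions for a single coefficient. It is written in Python for illustration and is not the HighDimMixedModels.jl API; the function names are hypothetical, and the default `a = 3.7` is the value conventionally used with SCAD rather than a setting stated in the abstract.

```python
def lasso_penalty(beta: float, lam: float) -> float:
    """LASSO penalty: grows linearly in |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty for one coefficient.

    Matches LASSO for |beta| <= lam, bends quadratically for
    lam < |beta| <= a * lam, and is constant beyond a * lam, so large
    coefficients are not shrunk further.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if a <= 2:
        raise ValueError("a must be greater than 2")
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    lam = 1.0
    for beta in (0.5, 2.0, 10.0):
        print(beta, lasso_penalty(beta, lam), scad_penalty(beta, lam))
```

The two penalties agree for small coefficients, while SCAD levels off for large ones, which is the usual explanation for its lower estimation bias on strong signals.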

Keywords

Coordinate descent

Penalized likelihood

Mixed-effects

Omics

Variable selection 

Co-Author(s)

Rosa Aghdam, University of Wisconsin-Madison
Claudia Solis-Lemus, University of Wisconsin-Madison

First Author

Evan Gorstein

Presenting Author

Evan Gorstein

Constructing Multi-Omic Networks Related to Schizophrenia for the dACC Dataset

Multi-omics datasets allow researchers to uncover relationships across different omics layers (e.g., genome, proteome, metabolome). Analyzing multiple layers requires specialized methods to handle heterogeneity and other inherent challenges. A well-known approach is canonical correlation analysis (CCA), with more recent extensions that add sparsity and incorporate phenotype. One such extension is SmCCNet, which builds on sparse CCA and repeated feature subsampling to construct multi-omic networks specific to a phenotype. We analyze a multi-omic schizophrenia (SCZ) dataset of 112 samples with protein, phosphorylation, lipid, and metabolomic features, plus a binary diagnosis of schizophrenia or neurotypical (NT). We use a sparse CCA approach inspired by SmCCNet, incorporating its feature subsampling and similarity matrix construction. However, our method has three key differences: 1) we do not explicitly use the phenotype in the network construction, to prevent signal blurring; 2) to increase our number of samples, we remove the case effect from the case samples; and 3) we use bagging to improve robustness and generalizability. Lastly, we identify networks that are significantly associated with schizophrenia using an enrichment analysis and a covariance matrix permutation test. Of the resulting significant networks, many are also biologically meaningful and contain features consistent with existing literature.

Keywords

Multi-omics

Canonical correlation analysis (CCA)

Sparse canonical correlation analysis (sCCA)

Multi-omic networks

Case-control covariance differences

Feature subsampling 

Co-Author(s)

Bernie Devlin, University of Pittsburgh
Lambertus Klei, University of Pittsburgh
Kathryn Roeder, Carnegie Mellon University

First Author

Maya Shen, Carnegie Mellon University

Presenting Author

Maya Shen, Carnegie Mellon University

Evaluation of Rare Variant Association Methods When Incorporating Data from Multiple Sources

Rare variant genetic associations are crucial to understanding complex traits and diseases. Yet, the large sample sizes needed to observe rare variants can be difficult to ascertain. Incorporating public summary data as external controls, meta-analyzing existing case-control studies, or combining different study types (e.g., case-only, control-only) can boost power by increasing sample sizes. However, using data from multiple sources can cause bias due to differences in sample ascertainment and processing. Here, we compare the performance of rare variant association methods designed to incorporate external controls (iECAT-O and ProxECAT) with a new method (LogProx) that can leverage data from multiple sources. We also use SKAT-O, which was not designed for external data, as a baseline comparison. We find that SKAT-O often has optimal power, even without external controls, but ProxECAT and LogProx are the most powerful given a moderate ratio of cases to internal controls (e.g., ≥4:1). By identifying the scenarios (e.g., study designs, sample sizes) where the use of additional data sources is most beneficial, we hope to aid in the discovery of new genetic associations.

Keywords

statistical genetics

rare variant association methods

public summary data

external controls 

Co-Author

Audrey Hendricks, University of Colorado Denver

First Author

Jessica Murphy

Presenting Author

Jessica Murphy

Optimal Weighting for Integrative Association Tests: Application to Whole-Genome Sequencing Studies

The integrative association test uses a weighting scheme to incorporate prior information and increase statistical power. In whole-genome sequencing (WGS) studies, it facilitates the integration of biological characteristics of single nucleotide variants (SNVs) to improve the detection of novel disease genes. Despite recent advances in applications, how to determine optimal weights that fully leverage the relevant information remains an open question. For a broad family of weighted integrative tests, this paper proposes optimal weights that maximize the tests' asymptotic efficiency, a dominant metric influencing statistical power. The study elucidates how weighting enhances statistical power and designs a practical approach for integrating effective information from SNV allele frequencies, annotations, and linkage disequilibrium. Extensive simulations demonstrate improved statistical power compared to existing methods. An osteoporosis case study further illustrates the method's application and its potential for detecting more novel disease genes.
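As an illustration of what a weighted integrative test can look like, the sketch below implements the weighted Stouffer (weighted Z) combination of one-sided p-values. This is one familiar example chosen for illustration; the abstract does not state which tests belong to its family or what the optimal weights are, so the function name and the weights used here are placeholders only.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence


def weighted_stouffer(p_values: Sequence[float], weights: Sequence[float]) -> float:
    """Combine one-sided p-values with the weighted Z method.

    Each p-value is mapped to a standard normal score, the scores are
    averaged with the given weights, and the result is rescaled so that
    it is standard normal when all individual null hypotheses hold and
    the tests are independent.
    """
    if len(p_values) != len(weights) or not p_values:
        raise ValueError("p_values and weights must be non-empty and equal in length")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    norm_sq = sum(w * w for w in weights)
    if norm_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    std_normal = NormalDist()
    # Small p-values become large positive z-scores.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(norm_sq)
    return 1.0 - std_normal.cdf(combined_z)


if __name__ == "__main__":
    p = [0.01, 0.5]
    # Placeholder weights: favouring the informative variant versus the other one.
    print(weighted_stouffer(p, [3.0, 1.0]))
    print(weighted_stouffer(p, [1.0, 3.0]))
```

Placing more weight on the variant that carries signal gives a smaller combined p-value, which is the intuition behind choosing weights from prior information such as allele frequency or functional annotation.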

Keywords

Data integration

p-value combination

signal detection

weighting

whole genome sequencing study 

Co-Author(s)

Ming Liu, Worcester Polytechnic Institute
Zheyang Wu, Worcester Polytechnic Institute
John Landers, UMass Chan Medical School

First Author

Hong Zhang, Pfizer Inc.

Presenting Author

Hong Zhang, Pfizer Inc.