Wednesday, Aug 6: 2:00 PM - 3:50 PM
4199
Contributed Papers
Music City Center
Room: CC-101D
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Recent advances in spatial transcriptomics technologies have led to diverse datasets, offering opportunities to explore tissue organization within spatial contexts. However, it remains a significant challenge to effectively integrate and interpret these data, which often originate from different samples, technologies, and developmental stages. To address this challenge, we present INSPIRE, a deep learning method for integrative analyses of multiple spatial transcriptomics datasets. By combining graph neural networks with an adversarial learning mechanism, INSPIRE enables spatially informed and adaptable integration of data from varying sources. By incorporating non-negative matrix factorization, INSPIRE uncovers interpretable spatial factors with corresponding gene programs, revealing tissue architectures, cell type distributions, and biological processes. We showcase INSPIRE's capabilities by applying it to diverse datasets. INSPIRE shows superior performance in identifying detailed biological signals, effectively borrowing information across distinct profiling technologies, and elucidating dynamic changes during embryonic development.
Keywords
Spatial transcriptomics
Data integration
Deep learning
Data interpretation
New spatial multi-omics technologies, which jointly profile the transcriptome and epigenome/protein markers for the same tissue section, have expanded the frontiers of spatial techniques. Here we introduce MultiGATE, which utilizes a two-level graph attention auto-encoder to integrate the multi-modality and spatial information in spatial multi-omics data. The key feature of MultiGATE is that it simultaneously embeds the spatial pixels and infers cross-modality regulatory relationships, which allows deeper data integration and provides insights into transcriptional regulation. We evaluated the performance of MultiGATE on spatial multi-omics datasets obtained from different tissues and platforms. By effectively integrating spatial multi-omics data, MultiGATE both enhances the extraction of latent embeddings of the pixels and boosts the inference of transcriptional regulation for cross-modality genomic features.
Keywords
Data Integration
Spatial multi-omics data
Co-Author(s)
Jishuai MIAO, The Chinese University of Hong Kong
Ying Zhu, Fudan University
Can Yang, The Hong Kong University of Science and Technology
Zhixiang Lin, The Chinese University of Hong Kong
First Author
Jinzhao Li, The Chinese University of Hong Kong
Presenting Author
Jishuai MIAO, The Chinese University of Hong Kong
The rapid progress of single-cell technology is enabling biologists to unravel the intricacies of cell populations, disease states, and developmental lineages. However, the high-dimensional, noisy, and sparse nature of single-cell omics data poses significant analytical challenges. Here, we introduce DCOL (Dissimilarity based on Conditional Ordered List) correlation, a functional dependency measure for quantifying nonlinear relationships between variables. Based on this measure, we propose DCOL-PCA and DCOL-CCA for dimension reduction and integration of single- and multi-omics data. In simulation studies, our methods outperformed eight other dimension reduction (DR) methods and four joint dimension reduction (jDR) methods, showing stable performance across various settings. They proved highly effective in extracting essential factors even in the most challenging scenarios. We also validated these methods on real datasets, where they detected intricate signals within and between omics data and generated lower-dimensional embeddings that preserve the essential information and latent structures in the data.
Keywords
Nonlinear Dimensionality Reduction
Single-cell Analysis
Multi-Omics Integration
High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm's behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
Keywords
Coordinate descent
Penalized likelihood
Mixed-effects
Omics
Variable selection
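The two penalties compared in the abstract above can be written down directly. The following is a minimal sketch, not code from HighDimMixedModels.jl: it uses the standard SCAD definition of Fan and Li (2001), and the function names and the default a = 3.7 are assumptions made for illustration.

```python
def lasso_penalty(beta: float, lam: float) -> float:
    """LASSO penalty: grows linearly in |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty (standard Fan and Li form); requires a > 2.

    Hypothetical helper for illustration only. It matches the LASSO
    penalty for |beta| <= lam, bends quadratically up to a * lam,
    and is constant beyond that point.
    """
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    # Compare the two penalties for a small, medium, and large coefficient.
    for beta in (0.5, 2.0, 10.0):
        print(beta, lasso_penalty(beta, 1.0), scad_penalty(beta, 1.0))
```

Because the SCAD penalty levels off for large coefficients, it shrinks strong effects less than the LASSO does, which is one common explanation for differences in estimation accuracy between the two.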
Multi-omics datasets allow researchers to uncover relationships across different omics layers (e.g., genome, proteome, metabolome). Analyzing multiple layers requires specialized methods to handle heterogeneity and other inherent challenges. A well-known approach is canonical correlation analysis (CCA), with more recent extensions that add sparsity and incorporate phenotype. One such extension is SmCCNet, which is built on sparse CCA and repeated feature subsampling to construct multi-omic networks specific to a phenotype. We analyze a multi-omic schizophrenia (SCZ) dataset of 112 samples with protein, phosphorylation, lipid, and metabolomic features, plus a binary schizophrenia or neurotypical (NT) diagnosis. We use a sparse CCA approach inspired by SmCCNet, incorporating the feature subsampling and similarity matrix construction. However, our method has three key differences: 1) we do not explicitly use the phenotype in the network construction, to prevent signal blurring; 2) to increase our number of samples, we remove the case effect from the case samples; and 3) we use bagging to improve robustness and generalizability. Lastly, we identify networks that are significantly associated with schizophrenia using an enrichment analysis and a covariance matrix permutation test. Of the resulting significant networks, many are also biologically meaningful and contain features consistent with existing literature.
Keywords
Multi-omics
Canonical correlation analysis (CCA)
Sparse canonical correlation analysis (sCCA)
Multi-omic networks
Case-control covariance differences
Feature subsampling
Rare variant genetic associations are crucial to understanding complex traits and diseases. Yet, the large sample sizes needed to observe rare variants can be difficult to ascertain. Incorporating public summary data as external controls, meta-analyzing existing case-control studies, or combining different study types (e.g., case-only, control-only) can boost power by increasing sample sizes. However, using data from multiple sources can cause bias due to differences in sample ascertainment and processing. Here, we compare the performance of rare variant association methods designed to incorporate external controls (iECAT-O and ProxECAT) with a new method (LogProx) that can leverage data from multiple sources. We also use SKAT-O, which was not designed for external data, as a baseline comparison. We find that SKAT-O often has optimal power, even without external controls, but ProxECAT and LogProx are the most powerful given a moderate ratio of cases to internal controls (e.g., ≥4:1). By identifying the scenarios (e.g., study designs, sample sizes) where the use of additional data sources is most beneficial, we hope to aid in the discovery of new genetic associations.
Keywords
statistical genetics
rare variant association methods
public summary data
external controls
The integrative association test utilizes a weighting scheme to combine prior information and increase statistical power. In whole-genome sequencing (WGS) studies, it facilitates the integration of biological characteristics of single nucleotide variants (SNVs) to improve the detection of novel disease genes. Despite recent advances in applications, determining optimal weights to fully leverage relevant information remains an open question. For a broad family of weighted integrative tests, this paper proposes optimal weights that maximize the tests' asymptotic efficiency, a dominant metric influencing statistical power. The study elucidates how weighting enhances statistical power and designs a practical approach for integrating effective information from SNV allele frequencies, annotations, and linkage disequilibrium. Extensive simulations demonstrate improved statistical power compared to existing methods. An osteoporosis case study further illustrates the method's application and its potential for detecting additional novel disease genes.
Keywords
Data integration
p-value combination
signal detection
weighting
whole genome sequencing study
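To illustrate how weights enter a p-value combination test of the kind described in the abstract above, here is a minimal sketch of the classical weighted Z (Stouffer) combination. It is a generic textbook method swapped in for illustration, not the optimal weighting proposed in the paper; the function name and the example weights are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z_combination(p_values, weights):
    """Combine one-sided p-values with a weighted Z (Stouffer) statistic.

    Hypothetical helper for illustration. Each p-value must lie strictly
    between 0 and 1; weights are fixed in advance (e.g., from prior
    information) and must not all be zero.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    std_normal = NormalDist()
    # Convert each p-value to a standard normal score.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    if denominator == 0.0:
        raise ValueError("at least one weight must be non-zero")
    # The combined statistic is standard normal under the global null.
    return 1.0 - std_normal.cdf(numerator / denominator)


if __name__ == "__main__":
    p_values = [0.001, 0.8]
    # Up-weighting the variant that carries signal gives a smaller combined p-value.
    print(weighted_z_combination(p_values, [3.0, 1.0]))
    print(weighted_z_combination(p_values, [1.0, 3.0]))
```

The two calls differ only in which p-value receives the larger weight, which shows why the choice of weights matters for power.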