Tuesday, Aug 5: 8:30 AM - 10:20 AM
4094
Contributed Papers
Music City Center
Room: CC-103B
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Single-cell RNA-sequencing (scRNA-seq) experiments are becoming increasingly complicated, with multiple treatment or biological conditions. However, guidelines on experimental design and rigorous statistical methods for comparative scRNA-seq studies with cells collected from multiple conditions are still lacking. In a confounded design, batch effects, cell-type effects, and condition effects cannot be distinguished. Therefore, we mathematically derive the requirements for a valid design for a comparative scRNA-seq study. Moreover, existing methods for identifying differentially expressed genes and differential cell-type abundance between conditions are necessarily multi-stage approaches. Because multi-stage approaches ignore uncertainties in previous stages and may propagate errors from earlier stages to later stages, they can suffer from high error rates. Here, we introduce DIFseq, a hierarchical model that accounts for all uncertainties and hence rigorously quantifies the condition effects on both cellular composition and cell-type-specific gene expression levels. DIFseq substantially outperforms state-of-the-art methods on both simulated and real data.
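The confounding problem described above can be illustrated with a minimal sketch. It assumes a simplified additive linear model with only batch and condition terms, and checks whether the design matrix has full column rank. This is not the set of design requirements derived for DIFseq, which the abstract does not state; the function name `effects_identifiable` and the toy labels are hypothetical.

```python
import numpy as np

def effects_identifiable(batch, condition):
    """Return True if batch and condition effects are separable
    in a simple additive model (illustrative assumption only)."""
    batch = np.asarray(batch)
    condition = np.asarray(condition)
    b_levels = np.unique(batch)
    c_levels = np.unique(condition)
    # Intercept plus treatment-coded dummies (first level is the reference).
    cols = [np.ones(len(batch))]
    cols += [(batch == b).astype(float) for b in b_levels[1:]]
    cols += [(condition == c).astype(float) for c in c_levels[1:]]
    X = np.column_stack(cols)
    # Full column rank means every coefficient is estimable.
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# Confounded: each batch contains a single condition.
confounded = effects_identifiable(
    batch=[1, 1, 2, 2], condition=["ctl", "ctl", "trt", "trt"]
)
# Crossed: both conditions appear in both batches.
crossed = effects_identifiable(
    batch=[1, 1, 2, 2], condition=["ctl", "trt", "ctl", "trt"]
)
print(confounded, crossed)
```

In the confounded layout the batch and condition columns coincide, so no amount of data can separate the two effects; the crossed layout avoids this.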
Keywords
Single-cell RNA-sequencing experiments
Differential gene expression
Differential abundance
Experimental design
Model identifiability
Integrative analysis
Co-Author(s)
Kevin Y. Yip, Sanford Burnham Prebys Medical Discovery Institute
Yingying Wei, The Chinese University of Hong Kong
First Author
Fangda Song, The Chinese University of Hong Kong, Shenzhen
Presenting Author
Fangda Song, The Chinese University of Hong Kong, Shenzhen
In single-cell data analysis, addressing sparsity often involves aggregating the profiles of homogeneous single cells into metacells. However, existing metacell partitioning methods lack checks on the homogeneity assumption and may aggregate heterogeneous single cells, potentially biasing downstream analysis and leading to spurious discoveries. To fill this gap, we introduce mcRigor, a statistical method to detect dubious metacells, which are composed of heterogeneous single cells, and optimize the hyperparameter of a metacell partitioning method. The core of mcRigor is a feature-correlation-based statistic that measures the heterogeneity of a metacell, with its null distribution derived from a double permutation scheme. As an optimizer for existing metacell partitioning methods, mcRigor has been shown to improve the reliability of discoveries in single-cell RNA-seq and multiome (RNA+ATAC) data analyses, such as uncovering differential gene co-expression modules, enhancer-gene associations, and gene temporal expression. Moreover, mcRigor enables benchmarking and selection of the most suitable metacell partitioning method with optimized hyperparameters tailored to specific datasets.
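The idea of testing a metacell for heterogeneity with a correlation-based statistic and a permutation null can be sketched as follows. This is a simplified illustration, not the mcRigor implementation: it uses a single permutation step rather than the double permutation scheme mentioned above, Gaussian toy data rather than counts, and hypothetical function names (`heterogeneity_stat`, `permutation_pvalue`).

```python
import numpy as np

def heterogeneity_stat(X):
    """Mean absolute off-diagonal gene-gene correlation.
    X is a cells x genes matrix; every gene is assumed to vary."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))

def permutation_pvalue(X, n_perm=200, seed=0):
    """One-sided permutation p-value for the heterogeneity statistic."""
    rng = np.random.default_rng(seed)
    observed = heterogeneity_stat(X)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle each gene independently across cells, which breaks
        # gene-gene correlation while keeping each gene's marginal values.
        permuted = rng.permuted(X, axis=0)
        null[i] = heterogeneity_stat(permuted)
    return float((1 + np.sum(null >= observed)) / (1 + n_perm))

rng = np.random.default_rng(1)
# Toy homogeneous metacell: 40 cells, 10 genes, independent noise.
homogeneous = rng.normal(size=(40, 10))
# Toy heterogeneous metacell: two cell groups with shifted expression.
shift = np.repeat([0.0, 3.0], 20)[:, None]
heterogeneous = rng.normal(size=(40, 10)) + shift

p_hom = permutation_pvalue(homogeneous)
p_het = permutation_pvalue(heterogeneous)
print(p_hom, p_het)
```

A mixture of two cell populations induces correlation between genes across cells, so the heterogeneous toy metacell yields a small p-value, while the homogeneous one is not expected to.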
Keywords
Metacell partitioning
Single-cell RNA-seq
Single-cell ATAC-seq
Data sparsity
Permutation
Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits, yet the majority of these variants reside in intergenic regions, making it challenging to link them to functional genes and regulatory mechanisms. Expression quantitative trait loci (eQTL) analysis connects genetic variants with gene expression and reveals cell-type-specific effects. Single-cell RNA sequencing (scRNA-seq) enables investigation of cell-type-specific eQTLs (ct-eQTLs) by capturing gene expression at single-cell resolution. However, existing methods rely on pre-annotated cell-type labels, which may not be accurate. Differential inference for regulatory effects across different cell types will be hampered by inaccurate cell-type annotation, leading to unexpected false positives. Thus, we propose a statistical model that simultaneously performs cell-type annotation and identifies ct-eQTLs. By leveraging allele-specific expression, our method improves the accuracy and interpretability of ct-eQTL detection.
Keywords
Single-cell RNA Sequencing (scRNA-seq)
Expression quantitative trait loci (eQTL)
Integrative Analysis
Mixture Model
Co-Author
Fangda Song, The Chinese University of Hong Kong, Shenzhen
First Author
Jiasheng Li, The Chinese University of Hong Kong, Shenzhen
Presenting Author
Jiasheng Li, The Chinese University of Hong Kong, Shenzhen
The growing availability of single-cell RNA sequencing (scRNA-seq) data highlights the necessity for robust integration methods to uncover both shared and unique cellular features across samples. These datasets often exhibit technical variations and biological differences, complicating integrative analyses. While numerous integration methods have been proposed, many fail to account for individual-level covariates or are limited to discrete variables. To address these limitations, we propose scINSIGHT2, a generalized linear latent variable model that accommodates both continuous covariates, such as age, and discrete factors, such as disease conditions. Through both simulation studies and real-data applications, we demonstrate that scINSIGHT2 accurately harmonizes scRNA-seq datasets, whether from single or multiple sources. These results highlight scINSIGHT2's utility in capturing meaningful biological insights from scRNA-seq data while accounting for individual-level variation.
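For orientation, a generic generalized linear latent variable model can be written as below. This is the textbook form, not the scINSIGHT2 specification, which the abstract does not give. All symbols are assumptions: $y_{ij}$ is the expression of gene $j$ in cell $i$, $g$ is a link function (for example, the log link for counts), $\mathbf{x}_i$ collects covariates such as age or indicator variables for disease condition, $\mathbf{u}_i$ is a $d$-dimensional latent vector, and $\boldsymbol{\lambda}_j$ are gene-specific loadings.

```latex
\[
g\bigl(\mathbb{E}[\,y_{ij} \mid \mathbf{u}_i\,]\bigr)
  = \beta_{0j} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_j
  + \mathbf{u}_i^{\top}\boldsymbol{\lambda}_j,
\qquad
\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)
\]
```

Because $\mathbf{x}_i$ enters linearly, continuous and discrete covariates are handled in the same way, which is the property the abstract emphasizes.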
Keywords
single-cell RNA-seq
integration
generalized linear latent variable model
Single-cell RNA-sequencing (scRNA-seq) technologies provide researchers with unprecedented opportunities to identify cell types and understand cell lineages. With the emergence of scRNA-seq studies that assay a large number of subjects, there is growing interest in aligning and comparing cell lineages between different individuals, especially those with different clinical conditions. However, comparing cell lineages learned from scRNA-seq data collected from multiple individuals is challenging because (a) scRNA-seq data can suffer from severe batch effects and (b) certain cell types may occur in some but not all individuals. In this study, we propose a Bayesian hierarchical model built upon the Dirichlet diffusion tree to learn a phylogenetic forest for scRNA-seq data collected from multiple individuals. Our proposed model can automatically align the topologies of the phylogenetic trees of different individuals. We develop an efficient Markov chain Monte Carlo algorithm for posterior inference. Simulation studies and real data analysis demonstrate that our proposed model outperforms state-of-the-art methods.
Keywords
single-cell RNA-sequencing data
phylogenetic tree
Dirichlet diffusion tree
Bayesian hierarchical model
Background: Transcriptome-wide association studies (TWAS) integrate gene expression with GWAS data to identify disease susceptibility genes. Conventional TWAS methods rely on tissue-specific models, but accounting for cell type variation may enhance discovery.
Methods: We built cell type-specific gene expression models using scRNA-seq data from 982 individuals in the OneK1K cohort, comprising 1.27 million PBMCs across 14 cell types with matched genotypes. To enhance prediction accuracy, we developed a novel approach leveraging correlations across cell types. These models were applied to TWAS on GWAS data for six cancers (>280,000 cases total): breast, prostate, lung, melanoma, ovarian, and endometrial.
Results: TWAS identified 339 novel genes for breast cancer, 92 for prostate, 18 for lung, 51 for melanoma, and 9 for ovarian, most of which were cell type-specific. Notably, 139 significant genes were shared across cancer types and were enriched in cell types such as CD4-NC and CD8-ET. Gene-set analyses validated novel breast and prostate genes in UK Biobank replication datasets.
Conclusion: Cell type-specific models improve cancer gene discovery, revealing distinct genetic landscapes.
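As background on how expression models are combined with GWAS data, the sketch below computes one common summary-statistics form of the TWAS test, z = w'z_GWAS / sqrt(w'Rw), separately for each cell type. The abstract does not state which test statistic the authors used, so this is an assumption. The function names, SNP weights, GWAS z-scores, and LD matrix are all hypothetical toy values; only the cell-type labels come from the abstract.

```python
from math import erfc, sqrt

import numpy as np

def twas_z(weights, gwas_z, ld):
    """Summary-statistics TWAS z-score for one gene in one cell type.
    weights: SNP weights from an expression prediction model
    gwas_z:  GWAS z-scores for the same SNPs
    ld:      SNP-SNP correlation (LD) matrix"""
    w = np.asarray(weights, dtype=float)
    z = np.asarray(gwas_z, dtype=float)
    R = np.asarray(ld, dtype=float)
    return float(w @ z / np.sqrt(w @ R @ w))

def two_sided_p(z):
    """Two-sided p-value under a standard normal null."""
    return erfc(abs(z) / sqrt(2))

# Toy inputs (hypothetical values for three SNPs near one gene).
gwas_z = [3.0, -1.0, 0.5]
ld = [[1.0, 0.2, 0.0],
      [0.2, 1.0, 0.1],
      [0.0, 0.1, 1.0]]
weights_by_cell_type = {
    "CD4-NC": [0.5, -0.2, 0.1],
    "CD8-ET": [0.0, 0.3, 0.4],
}

for cell_type, w in weights_by_cell_type.items():
    z = twas_z(w, gwas_z, ld)
    print(cell_type, round(z, 3), two_sided_p(z))
```

Running the same GWAS summary statistics through different cell-type weights is what allows an association to be detected in one cell type and not another.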
Keywords
Transcriptome-wide association studies
Single-cell RNA sequencing
Cell type
Cancer
Genome-wide association studies
Copy number variants (CNVs), which involve genomic duplications or deletions, play a critical role in various human diseases. Accurate CNV detection is essential but challenging because of high dimensionality, technical biases, and low signal-to-noise ratios, which lead to inconsistent calls and many false positives. Existing deep learning-based methods employ Convolutional Neural Networks (CNNs), which rely on image-based recognition and are prone to domain shift. In addition, accurate supervised learning requires a large, validated variant set to distinguish true CNV predictions from false positives.
Therefore, we developed a novel deep learning model, cn-RNN, for copy number estimation from sequencing data using Recurrent Neural Networks (RNNs). Unlike CNNs, RNNs inherently preserve the sequential structure of genomic data, enabling more accurate and biologically meaningful processing of sequencing data. In addition, we used a publicly available trio dataset to construct a large, high-confidence CNV training set. Compared to CNN-based methods, cn-RNN achieved a 20% higher F1-score with significantly fewer false positives. Our work enables more reliable CNV detection from sequencing data.
Keywords
Copy Number Variants (CNV) Detection
Recurrent Neural Networks (RNN)
Supervised Learning
Statistical Genetics
Deep Learning in Genomics