Single-Cell and Next-generation Sequencing Omics Data Analysis

Asmita Roy, Chair
Johns Hopkins University School of Public Health
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
4094 
Contributed Papers 
Music City Center 
Room: CC-103B 

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

Experimental Design and Differential Inference for Comparative Single-cell RNA-sequencing Studies

Single-cell RNA-sequencing (scRNA-seq) experiments are becoming increasingly complicated with multiple treatment or biological conditions. However, guidelines on experimental designs and rigorous statistical methods for comparative scRNA-seq studies with cells collected from multiple conditions are still lacking. For a confounded design, the batch effects, cell-type effects and condition effects can never be distinguished. Therefore, we mathematically derive the requirements for a valid design for a comparative scRNA-seq study. Moreover, existing methods for identifying differentially expressed genes and differential cell-type abundance between conditions have to be multi-stage approaches. Because multi-stage approaches ignore uncertainties in previous stages and may propagate errors from earlier stages to later stages, they can suffer from high error rates. Here, we introduce DIFseq, a hierarchical model that accounts for all uncertainties and hence rigorously quantifies the condition effects on both cellular composition and cell-type-specific gene expression levels. DIFseq substantially outperforms state-of-the-art methods for both simulated and real data.
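
The confounding problem mentioned in this abstract can be illustrated with a toy check. The sketch below is not the validity conditions derived by the authors and is not part of DIFseq; it assumes a simple additive linear model (intercept + batch + condition) and a hypothetical helper, `design_is_identifiable`, that tests whether the resulting design matrix has full column rank. When every batch contains only one condition, the batch and condition columns coincide and the two effects cannot be separated.

```python
import numpy as np

def design_is_identifiable(batches, conditions):
    """Hypothetical helper: True if batch and condition effects are
    separately estimable in an additive model (intercept + batch + condition)."""
    batches = np.asarray(batches)
    conditions = np.asarray(conditions)
    # Intercept column, then one indicator per non-reference level of each factor.
    cols = [np.ones(len(batches))]
    cols += [(batches == b).astype(float) for b in np.unique(batches)[1:]]
    cols += [(conditions == c).astype(float) for c in np.unique(conditions)[1:]]
    X = np.column_stack(cols)
    # Full column rank means every coefficient is identifiable.
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# Confounded: batch b1 holds only controls, batch b2 only treated samples.
confounded = design_is_identifiable(
    ["b1", "b1", "b2", "b2"], ["ctrl", "ctrl", "trt", "trt"]
)
# Crossed: each batch contains both conditions.
crossed = design_is_identifiable(
    ["b1", "b1", "b2", "b2"], ["ctrl", "trt", "ctrl", "trt"]
)
print(confounded, crossed)  # False True
```

A rank check of this kind covers only batch and condition; the abstract's setting also involves unknown cell types, which a plain design-matrix argument does not address.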

Keywords

Single-cell RNA-sequencing experiments

Differential gene expression

Differential abundance

Experimental design

Model identifiability

Integrative analysis 

Co-Author(s)

Kevin Y. Yip, Sanford Burnham Prebys Medical Discovery Institute
Yingying Wei, The Chinese University of Hong Kong

First Author

Fangda Song, The Chinese University of Hong Kong, Shenzhen

Presenting Author

Fangda Song, The Chinese University of Hong Kong, Shenzhen

McRigor: a statistical method to enhance rigor of metacell partitioning in single-cell data analysis

In single-cell data analysis, addressing sparsity often involves aggregating the profiles of homogeneous single cells into metacells. However, existing metacell partitioning methods lack checks on the homogeneity assumption and may aggregate heterogeneous single cells, potentially biasing downstream analysis and leading to spurious discoveries. To fill this gap, we introduce mcRigor, a statistical method to detect dubious metacells, which are composed of heterogeneous single cells, and optimize the hyperparameter of a metacell partitioning method. The core of mcRigor is a feature-correlation-based statistic that measures the heterogeneity of a metacell, with its null distribution derived from a double permutation scheme. As an optimizer for existing metacell partitioning methods, mcRigor has been shown to improve the reliability of discoveries in single-cell RNA-seq and multiome (RNA+ATAC) data analyses, such as uncovering differential gene co-expression modules, enhancer-gene associations, and gene temporal expression. Moreover, mcRigor enables benchmarking and selection of the most suitable metacell partitioning method with optimized hyperparameters tailored to specific datasets. 
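
The idea of a feature-correlation-based heterogeneity statistic with a permutation null can be sketched as follows. This is not mcRigor's statistic or its double permutation scheme, neither of which is specified in the abstract; it is a simplified stand-in that assumes features are roughly uncorrelated within a homogeneous group of cells, uses the mean absolute feature-feature correlation as the statistic, and builds the null by permuting each feature independently across cells. The function names and the simulated data are hypothetical.

```python
import numpy as np

def heterogeneity_stat(X):
    """Mean absolute off-diagonal feature-feature correlation.
    X is a cells-by-features matrix for one candidate metacell."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))

def permutation_pvalue(X, n_perm=200, seed=0):
    """One-sided permutation p-value: is the observed correlation larger
    than expected when features are shuffled independently across cells?"""
    rng = np.random.default_rng(seed)
    observed = heterogeneity_stat(X)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling each column separately breaks cross-feature dependence
        # while keeping every feature's marginal distribution.
        permuted = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null[i] = heterogeneity_stat(permuted)
    return float((1 + np.sum(null >= observed)) / (1 + n_perm))

rng = np.random.default_rng(1)
# Simulated homogeneous metacell: 60 cells, 10 independent features.
homogeneous = rng.normal(size=(60, 10))
# Simulated dubious metacell: two cell groups with shifted means in all features.
mixed = np.vstack([
    rng.normal(loc=0.0, size=(30, 10)),
    rng.normal(loc=3.0, size=(30, 10)),
])
p_homog = permutation_pvalue(homogeneous)
p_mixed = permutation_pvalue(mixed)
print(p_homog, p_mixed)
```

In this toy setting the mixed metacell yields a small p-value because the group shift induces strong correlation between features, whereas the homogeneous one gives a p-value with no systematic tendency toward zero.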

Keywords

Metacell partitioning

Single-cell RNA-seq

Single-cell ATAC-seq

Data sparsity

Permutation 

Co-Author

Jingyi Jessica Li, UCLA

First Author

Pan Liu

Presenting Author

Pan Liu

Detection of Cell-type-specific eQTL for scRNA-seq Data with Unknown Cell Types

Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits, yet the majority of these variants reside in intergenic regions, making it challenging to link them to functional genes and regulatory mechanisms. Expression quantitative trait loci (eQTL) analysis connects genetic variants with gene expression and reveals cell-type-specific effects. Single-cell RNA sequencing (scRNA-seq) enables investigation of cell-type-specific eQTLs (ct-eQTLs) by capturing gene expression at single-cell resolution. However, existing methods rely on pre-annotated cell-type labels, which may not be accurate. Differential inference for regulatory effects across different cell types will be hampered by inaccurate cell-type annotation, leading to unexpected false positives. Thus, we propose a statistical model that simultaneously performs cell-type annotation and identifies ct-eQTLs. By leveraging allele-specific expression, our method improves the accuracy and interpretability of ct-eQTL detection.  

Keywords

Single-cell RNA Sequencing (scRNA-seq)

Expression quantitative trait loci (eQTL)

Integrative Analysis

Mixture Model 

Co-Author

Fangda Song, The Chinese University of Hong Kong, Shenzhen

First Author

Jiasheng Li, The Chinese University of Hong Kong, Shenzhen

Presenting Author

Jiasheng Li, The Chinese University of Hong Kong, Shenzhen

Harmonizing Heterogeneous Single-cell Gene Expression Data with Individual-level Covariates

The growing availability of single-cell RNA sequencing (scRNA-seq) data highlights the necessity for robust integration methods to uncover both shared and unique cellular features across samples. These datasets often exhibit technical variations and biological differences, complicating integrative analyses. While numerous integration methods have been proposed, many fail to account for individual-level covariates or are limited to discrete variables. To address these limitations, we propose scINSIGHT2, a generalized linear latent variable model that accommodates both continuous covariates, such as age, and discrete factors, such as disease conditions. Through both simulation studies and real-data applications, we demonstrate that scINSIGHT2 accurately harmonizes scRNA-seq datasets, whether from single or multiple sources. These results highlight scINSIGHT2's utility in capturing meaningful biological insights from scRNA-seq data while accounting for individual-level variation.

Keywords

single-cell RNA-seq

integration

generalized linear latent variable model 

Co-Author

Vivian Li, University of California, Riverside

First Author

Yudi Mu

Presenting Author

Yudi Mu

Estimating a phylogenetic forest for single-cell RNA-sequencing data

Single-cell RNA-sequencing (scRNA-seq) technologies provide researchers with unprecedented opportunities to identify cell types and understand cell lineages. With the emergence of scRNA-seq studies that assay a large number of subjects, there is growing interest in aligning and comparing cell lineages between different individuals, especially those with different clinical conditions. However, comparing cell lineages learned from scRNA-seq data collected from multiple individuals is challenging because (a) scRNA-seq data can suffer from severe batch effects and (b) certain cell types may occur in some but not all individuals. In this study, we propose a Bayesian hierarchical model built upon the Dirichlet diffusion tree to learn a phylogenetic forest for scRNA-seq data collected from multiple individuals. Our proposed model can automatically align the topologies of the phylogenetic trees of different individuals. We develop an efficient Markov chain Monte Carlo algorithm for posterior inference. Simulation studies and real data analysis demonstrate that our proposed model outperforms state-of-the-art methods.

Keywords

single-cell RNA-sequencing data

phylogenetic tree

Dirichlet diffusion tree

Bayesian hierarchical model 

Co-Author

Yingying Wei, The Chinese University of Hong Kong

First Author

Shuyi WANG

Presenting Author

Shuyi WANG

Identifying cell-specific cancer susceptibility genes using transcriptome-wide association studies

Background: Transcriptome-wide association studies (TWAS) integrate gene expression with GWAS data to identify disease susceptibility genes. Conventional TWAS methods rely on tissue-specific models, but accounting for cell type variation may enhance discovery.
Methods: We built cell type-specific gene expression models using scRNA-seq data from 982 individuals in the OneK1K cohort, comprising 1.27 million PBMCs across 14 cell types with matched genotypes. To enhance prediction accuracy, we developed a novel approach leveraging correlations across cell types. These models were applied to TWAS on GWAS data for six cancers (>280,000 cases total): breast, prostate, lung, melanoma, ovarian, and endometrial.
Results: TWAS identified 339 novel genes for breast cancer, 92 for prostate, 18 for lung, 51 for melanoma, and 9 for ovarian, most of which were cell type-specific. Notably, 139 significant genes were shared across cancer types, enriched in cell types like CD4-NC and CD8-ET. Gene-set analyses validated novel breast and prostate genes in UK Biobank replication datasets.
Conclusion: Cell type-specific models improve cancer gene discovery, revealing distinct genetic landscapes. 

Keywords

Transcriptome-wide association studies

Single-cell RNA sequencing

Cell type

Cancer

Genome-wide association studies 

Co-Author(s)

Kai Yu
Jianxin Shi

First Author

Fei Qin

Presenting Author

Fei Qin

Cn-RNN: a Supervised Learning Framework for CNV Detection with Sequencing Data

Copy number variants (CNVs), involving genomic duplications/deletions, play a critical role in various human diseases. Accurate CNV detection is essential but challenging due to high dimensionality, technical biases, and low signal-to-noise ratios, leading to inconsistent calls and high false positives. Existing deep learning-based methods employ Convolutional Neural Networks (CNNs), which rely on image-based recognition and are prone to domain shift problems. Also, accurate supervised learning requires a large and validated variant set to differentiate CNV predictions from false positives.
Therefore, we developed a novel deep learning model, cn-RNN, for copy number estimation with sequencing data using Recurrent Neural Networks (RNNs). Unlike CNNs, RNNs inherently preserve the sequential structure of genomic data, enabling more accurate and biologically meaningful processing of sequencing data. In addition, we used a publicly available trio dataset to construct a large high-confidence CNV training set. Compared to CNN-based methods, cn-RNN achieved a 20% higher F1-score with significantly fewer false positives. Our work enables more reliable CNV detection with sequencing data.

Keywords

Copy Number Variants (CNV) Detection

Recurrent Neural Networks (RNN)

Supervised Learning

Statistical Genetics

Deep Learning in Genomics 

Co-Author(s)

Wenhan Bao
Fei Qin
Feifei Xiao, University of Florida

First Author

Dayuan Wang

Presenting Author

Dayuan Wang