Monday, Aug 4: 10:30 AM - 12:20 PM
4055
Contributed Papers
Music City Center
Room: CC-103A
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Single-cell data provide a unique opportunity to dissect complex cellular interactions. However, several existing computational methods lack the building blocks to effectively integrate spatial and temporal dimensions to identify key regulatory genes in sparse single-cell data. To address this gap, we developed pSTNet, a novel framework that integrates network and spatio-temporal models to identify differentially expressed genes induced by perturbation effects. The method incorporates cell spatial organizational structure to identify joint and perturbation-specific driver genes while accounting for single-cell data sparsity, tissue misalignment across multiple samples, and time points. We applied pSTNet to a time-course scRNA-seq dataset obtained from wild-type (WT) and Glis3 knockout (KO) mice to investigate the role of Glis3 in beta-cell development in the context of diabetes mechanisms. The pSTNet framework jointly models the gene activation process through a spatiotemporal logistic regression and further models non-zero gene expression using a spatiotemporal model with a standard parameterized probability distribution conditioned on a network, allowing for the differentiation between true biological signals and technical noise. The framework identifies key regulatory genes and cell states driving normal and deregulated beta-cell development, providing insights into the molecular mechanisms underlying beta-cell differentiation. Notably, the analysis reveals significant spatial and temporal differential expression patterns in genes central to beta-cell development, such as Ins1, Ins2, and Iapp, and identifies diverse regulatory modules associated with KO beta-cell development. These findings highlight the utility of pSTNet in elucidating the complex dynamics of cellular behavior in response to genetic perturbations.
Keywords
Spatial model
Network Model
Spatial Transcriptomics
Cell-Cell Interaction
Differential Expression
High-throughput chromatin conformation capture (Hi-C) data provide insights into the 3D structure of chromosomes, with normalization being a crucial pre-processing step. A common technique for normalization is matrix balancing, which rescales rows and columns of a Hi-C matrix to equalize their sums. Despite its popularity and convenience, matrix balancing lacks statistical justification. In this talk, we introduce a statistical model to analyze matrix balancing methods and propose a kernel-based estimator that leverages spatial structure. Under mild assumptions, we demonstrate that the kernel-based method is consistent, converges faster, and is more robust to data sparsity when compared to existing approaches.
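For readers unfamiliar with matrix balancing, the following is a minimal sketch of the generic idea described in the abstract above: alternately rescaling rows and columns until their sums are equal (the classical Sinkhorn-Knopp iteration). It is not the kernel-based estimator proposed in the talk, and the function name `balance_matrix`, its parameters, and the toy contact counts are all illustrative assumptions.

```python
def balance_matrix(matrix, n_iter=500, tol=1e-9):
    """Alternately rescale rows and columns so that each sums to 1.

    Generic Sinkhorn-Knopp style balancing for a non-negative square
    matrix; illustrative only, not the estimator from the talk.
    """
    m = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iter):
        # Rescale every row to sum to 1 (all-zero rows are left alone).
        for i in range(n_rows):
            s = sum(m[i])
            if s > 0:
                m[i] = [v / s for v in m[i]]
        # Rescale every column to sum to 1.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    m[i][j] /= s
        # Columns now sum to 1 exactly; stop once rows do as well.
        if max(abs(sum(row) - 1.0) for row in m) < tol:
            break
    return m


# Hypothetical toy contact counts (symmetric, strictly positive).
toy_counts = [
    [10, 4, 1],
    [4, 8, 3],
    [1, 3, 6],
]
balanced = balance_matrix(toy_counts)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(row[j] for row in balanced) for j in range(3)]
```

With strictly positive entries the iteration converges, so `row_sums` and `col_sums` are both close to 1. Sparse matrices with empty rows or columns, the situation the abstract highlights, need additional filtering or a different estimator.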
Keywords
Density Estimation
Matrix Balancing
Accurate forecasting of the effective reproductive number (R_t) is crucial for informed public health decision-making. In this study, we develop a forecasting framework that integrates machine learning and probabilistic methods to improve predictive performance. We estimate R_t using the EpiNow2 R software package and further refine these estimates with a spatial (covariate-adjusted) smoothing technique. Forecasts are generated using an ensemble approach incorporating XGBoost, Random Forest (RF), and regression models. A stochastic Poisson framework is employed to predict daily COVID-19 case counts. The ensemble method consistently outperformed EpiNow2, with a median percentage agreement of 94.7% (IQR: 93.9–95.1%) for 7-day ahead forecasts during Wave-2, compared to 87.0% (IQR: 84.4–89.4%) for EpiNow2. Similar improvements were observed in Wave-6, where the ensemble approach achieved a median percentage agreement of 92.5% (IQR: 90.5–93.6%), while EpiNow2 had a lower percentage agreement of 86.8% (IQR: 82.5–89.2%). For daily case forecasting, the ensemble model maintained a higher percentage agreement across all horizons during both Wave-2 and Wave-6.
Keywords
Infectious Disease Modeling
COVID-19
Effective Reproductive Number
Forecasting
Machine Learning
Glaucoma is a major cause of irreversible blindness, affecting millions of people worldwide. In the United States, approximately 2.6% of adults over 40 have glaucoma, with a higher prevalence among Black and Hispanic populations. Traditional clinical measurements often lack sensitivity and specificity, underscoring the need for improved diagnostic tools.
Digital twins based on multiphysics models provide insight into glaucoma diagnosis by simulating the relationship between changes in intraocular pressure (IOP), blood pressure, tissue perfusion, and biomechanical stresses and strains. However, the computational cost of these models limits their applicability in clinical practice. To address this challenge, we develop a spatial statistical emulator for spatial model output that approximates key hemodynamic and biomechanical tissue responses under varying physiological conditions.
We use a first-order singular value decomposition emulator framework that captures the model's primary spatial features using standard clinical input related to ocular physiology. Several nonlinear machine learning methods, including random forests, boosting, multilayer perceptron, and reservoir computing neural models, are explored to construct effective surrogate models. These approaches provide a scalable alternative to direct simulation while preserving predictive accuracy.
The results suggest that nonlinear surrogate models offer an efficient and reliable framework for supporting clinical decision-making in glaucoma diagnosis.
Keywords
glaucoma diagnosis
intraocular pressure (IOP)
statistical emulation
spatial model
machine learning
Unsupervised learning is increasingly being used to mine large datasets to make discoveries in critical domains such as biomedicine and national security. However, there is a lack of standardized methodologies to ensure these results are reliable and interpretable. Here, we present a structured workflow for applying unsupervised learning, illustrated through an in-depth case study. We examine the classification of Milky Way stars in the APOGEE survey, applying unsupervised techniques to distinguish stellar populations and find common origins of chemical formations. Through this example, we provide guidance on data preprocessing, feature engineering, exploratory analysis, dimension reduction, validation, and iterative communication with domain experts to ensure meaningful insights. By integrating best practices in statistical analysis with real-world applications, we demonstrate how a generalizable workflow for unsupervised learning can facilitate robust data-driven discovery.
Keywords
unsupervised learning
workflow
validation
clustering
dimension reduction
statistical learning
Modeling the temporal dynamics of heterogeneous cellular systems poses significant statistical challenges, particularly in the presence of stochastic transitions and structural heterogeneity inherent in single-cell time-course data. We propose MarkovCellNet, a probabilistic modeling framework that employs distance-aware Markov transition matrices coupled with time-informed dimensionality reduction to infer dynamic cellular trajectories. Cell state distributions are formalized as normalized probability vectors over discrete states and evolved through time via transition matrices which are (1) constructed from biologically informed distance metrics including diffusion, Euclidean, and Manhattan distances and (2) perturbed with Gaussian noise to capture inherent biological variability. The resulting predictive distributions are evaluated under a multinomial sampling model using log-likelihood as a scoring criterion, enabling principled comparison of modeling configurations. We assess the statistical performance of MarkovCellNet on synthetic datasets designed to mimic divergent, periodic, and convergent evolutionary regimes. Our results demonstrate that embeddings derived from PHATE or UMAP, when coupled with diffusion-based transition kernels, yield superior recovery of underlying stochastic dynamics. This framework provides a statistically interpretable and computationally tractable approach for analyzing high-dimensional, time-resolved single-cell data.
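The pipeline outlined in the abstract above (distance-based transition matrix, Gaussian perturbation, propagation of a state distribution, multinomial log-likelihood scoring) can be sketched roughly as follows. This is a minimal illustration, not the MarkovCellNet implementation: the exponential kernel, the clipping of perturbed weights, and every function name, parameter, and toy value are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def transition_matrix(centers, dist, bandwidth=1.0, noise_sd=0.0, seed=0):
    """Row-stochastic matrix in which nearby states are more likely targets.

    Assumed form: weight = exp(-distance / bandwidth) plus Gaussian noise,
    clipped to stay positive, then normalized so each row sums to 1.
    """
    rng = random.Random(seed)
    rows = []
    for a in centers:
        weights = []
        for b in centers:
            w = math.exp(-dist(a, b) / bandwidth) + rng.gauss(0.0, noise_sd)
            weights.append(max(w, 1e-12))  # keep weights strictly positive
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def step(p, T):
    """Advance a probability vector by one time step: p_next = p T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]


def multinomial_loglik(counts, probs):
    """Log-probability of observed state counts under predicted probs."""
    n = sum(counts)
    ll = math.lgamma(n + 1)
    for c, q in zip(counts, probs):
        ll += c * math.log(q) - math.lgamma(c + 1)
    return ll


# Hypothetical 2-D embedding coordinates for three discrete cell states.
centers = [(0.0, 0.0), (1.0, 0.0), (3.0, 1.0)]
p0 = [1.0, 0.0, 0.0]   # all cells start in state 0
observed = [6, 3, 1]   # made-up cell counts at the next time point

scores = {}
for name, dist in [("euclidean", euclidean), ("manhattan", manhattan)]:
    T = transition_matrix(centers, dist, bandwidth=1.0, noise_sd=0.01)
    p1 = step(p0, T)
    scores[name] = multinomial_loglik(observed, p1)
```

Comparing the entries of `scores` mirrors the abstract's use of log-likelihood to rank modeling configurations; a higher (less negative) value indicates a predictive distribution closer to the observed counts.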
Keywords
Placenta development
Single-cell RNA sequencing
Markov processes
Cell-cell interactions
Dimensionality reduction
Computational models
Co-Author
Benedict Anchang, NIEHS
First Author
Komlan Atitey, National Institute of Environmental Health Sciences (NIEHS)
Presenting Author
Komlan Atitey, National Institute of Environmental Health Sciences (NIEHS)