Machine Learning for Spatiotemporal Data

Wenhui Sophia Lu, Chair
Stanford University
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4055 
Contributed Papers 
Music City Center 
Room: CC-103A 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

Joint spatiotemporal modeling of perturbation effects in single-cell data

Single-cell data provide a unique opportunity to dissect complex cellular interactions. However, several existing computational methods lack the building blocks to effectively integrate spatial and temporal dimensions to identify key regulatory genes in sparse single-cell data. To address this gap, we developed pSTNet, a novel framework that integrates network and spatiotemporal models to identify differentially expressed genes induced by perturbation effects. The method incorporates the spatial organization of cells to identify joint and perturbation-specific driver genes while accounting for single-cell data sparsity and tissue misalignment across multiple samples and time points. We applied pSTNet to a time-course scRNA-seq dataset obtained from wild-type (WT) and Glis3 knockout (KO) mice to investigate the role of Glis3 in beta-cell development in the context of diabetes mechanisms. The pSTNet framework jointly models the gene activation process through a spatiotemporal logistic regression and further models non-zero gene expression using a spatiotemporal model with a standard parameterized probability distribution conditioned on a network, allowing for the differentiation between true biological signals and technical noise. The framework identifies key regulatory genes and cell states driving normal and deregulated beta-cell development, providing insights into the molecular mechanisms underlying beta-cell differentiation. Notably, the analysis reveals significant spatial and temporal differential expression patterns in genes central to beta-cell development, such as Ins1, Ins2, and Iapp, and identifies diverse regulatory modules associated with KO beta-cell development. These findings highlight the utility of pSTNet in elucidating the complex dynamics of cellular behavior in response to genetic perturbations.
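
Illustrative sketch (not the pSTNet implementation): the two-part structure described in this abstract, a logistic model for whether a gene is active plus a parametric model for the non-zero expression values, resembles a hurdle-type likelihood. The example below assumes a log-normal distribution for non-zero expression and plain linear effects of hypothetical covariates (intercept, x, y, time). The network conditioning and the handling of tissue misalignment are omitted, and all names and simulated values are illustrative.

```python
import numpy as np

def hurdle_loglik(expr, design, beta_on, beta_mu, sigma):
    """Two-part log-likelihood: a logistic on/off part plus a log-normal part
    for non-zero expression (the log-normal choice is an assumption)."""
    expr = np.asarray(expr, dtype=float)
    p_on = 1.0 / (1.0 + np.exp(-(design @ beta_on)))  # P(expression > 0)
    on = expr > 0
    ll_off = np.log1p(-p_on[~on]).sum()               # cells with zero expression
    log_y = np.log(expr[on])
    resid = (log_y - design[on] @ beta_mu) / sigma
    ll_on = (np.log(p_on[on]) - log_y - np.log(sigma)
             - 0.5 * np.log(2.0 * np.pi) - 0.5 * resid ** 2).sum()
    return float(ll_off + ll_on)

# Simulated cells with hypothetical covariates: intercept, x, y, time point.
rng = np.random.default_rng(0)
n_cells = 500
design = np.column_stack([
    np.ones(n_cells),
    rng.uniform(-1, 1, n_cells),
    rng.uniform(-1, 1, n_cells),
    rng.integers(0, 3, n_cells),
])
beta_on_true = np.array([0.5, 1.0, -1.0, 0.8])
beta_mu_true = np.array([1.0, 0.5, 0.0, 0.3])
sigma_true = 0.5

# Draw on/off states, then log-normal expression for the active cells.
is_on = rng.random(n_cells) < 1.0 / (1.0 + np.exp(-(design @ beta_on_true)))
noise = sigma_true * rng.normal(size=n_cells)
expr = np.where(is_on, np.exp(design @ beta_mu_true + noise), 0.0)

ll_true = hurdle_loglik(expr, design, beta_on_true, beta_mu_true, sigma_true)
ll_null = hurdle_loglik(expr, design, np.zeros(4), np.zeros(4), sigma_true)
```

Comparing `ll_true` with `ll_null` shows how such a likelihood separates structured zeros from structured non-zero expression; a fitted model would maximize it over the coefficients.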

Keywords

Spatial model

Network Model

Spatial Transcriptomics

Cell-Cell Interaction

Differential Expression 

Co-Author

Benedict Anchang, NIEHS

First Author

Osafu Augustine Egbon, National Institute of Environmental Health Sciences

Presenting Author

Osafu Augustine Egbon, National Institute of Environmental Health Sciences

Kernel Density Balancing for Hi-C data

High-throughput chromatin conformation capture (Hi-C) data provide insights into the 3D structure of chromosomes, with normalization being a crucial pre-processing step. A common technique for normalization is matrix balancing, which rescales rows and columns of a Hi-C matrix to equalize their sums. Despite its popularity and convenience, matrix balancing lacks statistical justification. In this talk, we introduce a statistical model to analyze matrix balancing methods and propose a kernel-based estimator that leverages spatial structure. Under mild assumptions, we demonstrate that the kernel-based method is consistent, converges faster, and is more robust to data sparsity when compared to existing approaches. 
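
Illustrative sketch (not the kernel-based estimator proposed in this talk): classical matrix balancing can be written as an iterative rescaling of a symmetric contact matrix until every row sums to the same value. The example assumes a strictly positive symmetric matrix and a target row sum of 1. The function name, the simulated contact matrix, and the bias values are hypothetical.

```python
import numpy as np

def balance_matrix(counts, n_iter=500, tol=1e-10):
    """Rescale a symmetric, positive matrix so that all row sums equal 1.

    Returns the balanced matrix and the per-locus bias vector b, where
    balanced[i, j] = counts[i, j] / (b[i] * b[j]).
    Assumes every row has a positive sum.
    """
    a = np.asarray(counts, dtype=float)
    bias = np.ones(a.shape[0])
    for _ in range(n_iter):
        row_sums = (a / np.outer(bias, bias)).sum(axis=1)
        if np.max(np.abs(row_sums - 1.0)) < tol:
            break
        # Damped (square-root) update, which avoids oscillation
        # when the same factor is applied to rows and columns.
        bias *= np.sqrt(row_sums)
    return a / np.outer(bias, bias), bias

# Simulated contacts distorted by multiplicative per-locus biases.
rng = np.random.default_rng(0)
true_contacts = rng.uniform(1.0, 2.0, size=(5, 5))
true_contacts = (true_contacts + true_contacts.T) / 2.0
site_bias = np.array([0.5, 1.0, 2.0, 1.5, 0.8])
observed = true_contacts * np.outer(site_bias, site_bias)

balanced, est_bias = balance_matrix(observed)
```

The balanced matrix stays symmetric because the same factor is applied to row i and column i. Sparse Hi-C matrices with empty rows would need filtering before this update is applied.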

Keywords

Density Estimation

Matrix Balancing 

Co-Author

Ning Hao, University of Arizona

First Author

John Park

Presenting Author

John Park

Machine Learning and Probabilistic Approaches for Forecasting COVID-19 Transmission and Cases

Accurate forecasting of the effective reproductive number (R_t) is crucial for informed public health decision-making. In this study, we develop a forecasting framework that integrates machine learning and probabilistic methods to improve predictive performance. We estimate R_t using the EpiNow2 R software package and further refine these estimates with a spatial (covariate-adjusted) smoothing technique. Forecasts are generated using an ensemble approach incorporating XGBoost, Random Forest (RF), and regression models. A stochastic Poisson framework is employed to predict daily COVID-19 case counts. The ensemble method consistently outperformed EpiNow2, with a median percentage agreement of 94.7% (IQR: 93.9–95.1%) for 7-day-ahead forecasts during Wave-2, compared to 87.0% (IQR: 84.4–89.4%) for EpiNow2. Similar improvements were observed in Wave-6, where the ensemble approach achieved a median percentage agreement of 92.5% (IQR: 90.5–93.6%), while EpiNow2 had a lower percentage agreement of 86.8% (IQR: 82.5–89.2%). For daily case forecasting, the ensemble model maintained a higher percentage agreement across all horizons during both Wave-2 and Wave-6.
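
Illustrative sketch (an assumption, not the authors' method): the abstract does not specify how the stochastic Poisson framework links R_t to case counts. One common choice is a renewal-type model in which expected new cases equal R_t times a generation-interval-weighted sum of recent cases. The example below simulates forecast paths under that assumption. The function name, the generation-interval weights, and all input values are hypothetical.

```python
import numpy as np

def forecast_cases(past_cases, rt_forecast, gen_interval, n_draws=1000, seed=0):
    """Simulate daily case counts from a Poisson renewal model (assumed form).

    gen_interval[k] is the weight given to cases observed k + 1 days earlier.
    Returns an array of shape (n_draws, len(rt_forecast)).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(gen_interval, dtype=float)
    w = w / w.sum()
    draws = np.empty((n_draws, len(rt_forecast)))
    for d in range(n_draws):
        history = list(past_cases)
        for h, rt in enumerate(rt_forecast):
            # Most recent day first, to line up with the weights.
            recent = np.array(history[-len(w):][::-1], dtype=float)
            expected = rt * np.dot(w[:len(recent)], recent)
            new_cases = rng.poisson(expected)
            history.append(new_cases)
            draws[d, h] = new_cases
    return draws

past_cases = [100] * 7                 # hypothetical recent daily counts
rt_forecast = [1.2] * 7                # hypothetical 7-day-ahead R_t forecast
gen_interval = [0.2, 0.5, 0.3]         # hypothetical weights

draws = forecast_cases(past_cases, rt_forecast, gen_interval)
median_path = np.median(draws, axis=0)
interval = np.percentile(draws, [25, 75], axis=0)
```

Summaries such as `median_path` and `interval` are what a probabilistic forecast of this kind would report for each horizon.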

Keywords

Infectious Disease Modeling

COVID-19

Effective Reproductive Number

Forecasting

Machine Learning 

First Author

MD SAKHAWAT HOSSAIN

Presenting Author

MD SAKHAWAT HOSSAIN

Nonlinear Surrogate Models for Emulating Spatial Fields from Multiphysics Models in Support of Glaucoma Diagnosis

Glaucoma is a major cause of irreversible blindness, affecting millions of people worldwide. In the United States, approximately 2.6% of adults over 40 have glaucoma, with a higher prevalence among Black and Hispanic populations. Traditional clinical measurements often lack sensitivity and specificity, underscoring the need for improved diagnostic tools.
Digital twins based on multiphysics models provide insight into glaucoma diagnosis by simulating the relationship between changes in intraocular pressure (IOP), blood pressure, tissue perfusion, and biomechanical stresses and strains. However, the computational cost of these models limits their applicability in clinical practice. To address this challenge, we develop a spatial statistical emulator for spatial model output that approximates key hemodynamic and biomechanical tissue responses under varying physiological conditions.
We use a first-order singular value decomposition emulator framework that captures the model's primary spatial features using standard clinical inputs related to ocular physiology. Several nonlinear machine learning methods, including random forests, boosting, multilayer perceptrons, and reservoir computing neural models, are explored to construct effective surrogate models. These approaches provide a scalable alternative to direct simulation while preserving predictive accuracy.
The results suggest that nonlinear surrogate models offer an efficient and reliable framework for supporting clinical decision-making in glaucoma diagnosis.
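
Illustrative sketch (not the authors' emulator): one common way to build an SVD-based emulator is to project simulated spatial fields onto a few leading singular vectors and then learn a map from model inputs to the resulting coefficients. Here a quadratic least-squares fit stands in for the random forests, boosting, and neural models named in the abstract. The toy simulator, the two inputs, and all function names are hypothetical.

```python
import numpy as np

def quad_features(x):
    """Intercept, linear, and pairwise-product features of the inputs."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    p = x.shape[1]
    cross = [x[:, i] * x[:, j] for i in range(p) for j in range(i, p)]
    return np.hstack([np.ones((x.shape[0], 1)), x, np.stack(cross, axis=1)])

def fit_svd_emulator(inputs, fields, n_modes=2):
    """Fit: centre the fields, keep n_modes spatial modes, regress coefficients."""
    mean_field = fields.mean(axis=0)
    u, s, vt = np.linalg.svd(fields - mean_field, full_matrices=False)
    basis = vt[:n_modes]                       # (n_modes, n_locations)
    coeffs = u[:, :n_modes] * s[:n_modes]      # (n_runs, n_modes)
    weights, *_ = np.linalg.lstsq(quad_features(inputs), coeffs, rcond=None)
    return mean_field, basis, weights

def emulate(model, new_inputs):
    """Predict spatial fields at new inputs from the fitted emulator."""
    mean_field, basis, weights = model
    return mean_field + quad_features(new_inputs) @ weights @ basis

# Toy "simulator": two spatial patterns driven by two hypothetical inputs.
grid = np.linspace(0.0, 1.0, 50)
pattern_a, pattern_b = np.sin(np.pi * grid), np.cos(2.0 * np.pi * grid)

def toy_simulator(x):
    x = np.atleast_2d(x)
    return np.outer(x[:, 0], pattern_a) + np.outer(x[:, 1] ** 2, pattern_b)

rng = np.random.default_rng(0)
train_inputs = rng.uniform(0.5, 2.0, size=(40, 2))
model = fit_svd_emulator(train_inputs, toy_simulator(train_inputs))

test_inputs = rng.uniform(0.5, 2.0, size=(5, 2))
predicted = emulate(model, test_inputs)
```

Once fitted, `emulate` costs a few small matrix products per query, which is the source of the speed-up over rerunning a full simulation. In this toy case two modes and quadratic features reproduce the simulator exactly; a real multiphysics model would not be captured this cleanly.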
 

Keywords

glaucoma diagnosis

intraocular pressure (IOP)

statistical emulation

spatial model

machine learning 

Co-Author(s)

Giovanna Guidoboni, University of Maine
Alon Harris, Icahn School of Medicine at Mount Sinai
Christopher Wikle, University of Missouri

First Author

Mira Isnainy, University of Missouri-Columbia

Presenting Author

Mira Isnainy, University of Missouri-Columbia

Unsupervised machine learning for discovery: workflow and best practices

Unsupervised learning is increasingly being used to mine large datasets to make discoveries in critical domains such as biomedicine and national security. However, there is a lack of standardized methodologies to ensure these results are reliable and interpretable. Here, we present a structured workflow for applying unsupervised learning, illustrated through an in-depth case study. We examine the classification of Milky Way stars in the APOGEE survey, applying unsupervised techniques to distinguish stellar populations and find common origins of chemical formations. Through this example, we provide guidance on data preprocessing, feature engineering, exploratory analysis, dimension reduction, validation, and iterative communication with domain experts to ensure meaningful insights. By integrating best practices in statistical analysis with real-world applications, we demonstrate how a generalizable workflow for unsupervised learning can facilitate robust data-driven discovery. 

Keywords

unsupervised learning

workflow

validation

clustering

dimension reduction

statistical learning 

Co-Author(s)

Tarek Zikry
Tiffany Tang, University of Notre Dame
Genevera Allen

First Author

Andersen Chang

Presenting Author

Tarek Zikry

MarkovCellNet: Statistical Inference of Time-Evolving Cell Populations via Compositional Markov Models

Modeling the temporal dynamics of heterogeneous cellular systems poses significant statistical challenges, particularly in the presence of stochastic transitions and structural heterogeneity inherent in single-cell time-course data. We propose MarkovCellNet, a probabilistic modeling framework that employs distance-aware Markov transition matrices coupled with time-informed dimensionality reduction to infer dynamic cellular trajectories. Cell state distributions are formalized as normalized probability vectors over discrete states and evolved through time via transition matrices that are (1) constructed from biologically informed distance metrics, including diffusion, Euclidean, and Manhattan distances, and (2) perturbed with Gaussian noise to capture inherent biological variability. The resulting predictive distributions are evaluated under a multinomial sampling model using log-likelihood as a scoring criterion, enabling principled comparison of modeling configurations. We assess the statistical performance of MarkovCellNet on synthetic datasets designed to mimic divergent, periodic, and convergent evolutionary regimes. Our results demonstrate that embeddings derived from PHATE or UMAP, when coupled with diffusion-based transition kernels, yield superior recovery of underlying stochastic dynamics. This framework provides a statistically interpretable and computationally tractable approach for analyzing high-dimensional, time-resolved single-cell data.
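
Illustrative sketch (not the MarkovCellNet implementation): the pipeline described in this abstract can be mimicked on a toy scale by building a row-stochastic transition matrix from pairwise distances between states, optionally perturbing it with Gaussian noise, evolving a probability vector through time, and scoring observed counts with a multinomial log-likelihood. The Gaussian kernel on Euclidean distance, the clip-and-renormalize treatment of the noise, and all names and values are assumptions.

```python
import math
import numpy as np

def transition_matrix(coords, bandwidth=1.0, noise_sd=0.0, rng=None):
    """Row-stochastic matrix from a Gaussian kernel on Euclidean distances.

    Optional Gaussian noise is added to the kernel, clipped to stay positive,
    and the rows are renormalized (one possible way to perturb the matrix).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    kernel = np.exp(-(dist / bandwidth) ** 2)
    if noise_sd > 0:
        rng = np.random.default_rng() if rng is None else rng
        kernel = np.clip(kernel + rng.normal(0.0, noise_sd, kernel.shape), 1e-12, None)
    return kernel / kernel.sum(axis=1, keepdims=True)

def evolve(p0, trans, n_steps):
    """Propagate a state distribution: p_{t+1} = p_t @ trans."""
    p = np.asarray(p0, dtype=float)
    path = [p]
    for _ in range(n_steps):
        p = p @ trans
        path.append(p)
    return np.array(path)

def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood of observed state counts under probs."""
    counts = np.asarray(counts, dtype=float)
    log_coef = math.lgamma(counts.sum() + 1) - sum(math.lgamma(c + 1) for c in counts)
    return float(log_coef + np.sum(counts * np.log(probs)))

# Three hypothetical cell states placed along a line in an embedding.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
trans = transition_matrix(coords, bandwidth=1.0, noise_sd=0.01,
                          rng=np.random.default_rng(1))
path = evolve([1.0, 0.0, 0.0], trans, n_steps=3)

observed_counts = np.array([50, 30, 20])   # hypothetical counts at time 1
score = multinomial_loglik(observed_counts, path[1])
```

Repeating the scoring step for different distance metrics or embeddings gives one log-likelihood per configuration, which is the kind of comparison the abstract describes.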

Keywords

Placenta development

Single-cell RNA sequencing

Markov processes

Cell-cell interactions

Dimensionality reduction

Computational models 

Co-Author

Benedict Anchang, NIEHS

First Author

Komlan Atitey, National Institute of Environmental Health Sciences (NIEHS)

Presenting Author

Komlan Atitey, National Institute of Environmental Health Sciences (NIEHS)