Innovative Applications of Statistics

Jing Zhang, Chair
Miami University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
4109 
Contributed Papers 
Music City Center 
Room: CC-208A 

Main Sponsor

IMS

Presentations

Calibrated multi-level quantile forecasting

For probabilistic forecasts to be useful to decision makers, they should be calibrated: given a sequence of 90% quantile forecasts, we want the true value to fall below the forecast 90% of the time. Existing online calibration procedures, such as the quantile tracking algorithm from online conformal prediction (Angelopoulos et al., 2023), can effectively calibrate a single quantile but, when applied to multiple quantiles, can produce invalid probability distributions due to crossings (e.g., the calibrated 50% quantile forecast lies above the calibrated 75% quantile forecast). In this work, we consider the problem of online calibration with order constraints. We propose intuitive ways of combining the quantile tracking algorithm with an order-enforcing method (such as sorting or isotonic regression) that produce a sequence of forecasts with no crossings that is also guaranteed to achieve the correct long-run coverage under mild assumptions. We demonstrate our methods on COVID-19 forecasting data.
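The quantile tracking update and the crossing problem can be illustrated with a minimal sketch. It assumes the simplest possible combination: one independent tracker per quantile level, with the issued forecasts sorted at every step. This is not necessarily the procedure proposed in the talk, and no coverage guarantee is claimed for it. The function name `quantile_tracking_sorted` and the parameters `lr` and `init` are hypothetical, and `levels` is assumed to be in increasing order.

```python
import numpy as np

def quantile_tracking_sorted(y, levels, lr=0.1, init=0.0):
    """Run one quantile tracker per level and sort the issued forecasts.

    Illustrative sketch only: `levels` must be increasing, and sorting is
    just one of several possible order-enforcing steps.
    """
    levels = np.asarray(levels, dtype=float)
    q = np.full(levels.shape, init, dtype=float)  # hidden per-level iterates
    issued = np.empty((len(y), len(levels)))
    for t, y_t in enumerate(y):
        # Order-enforcing step: sorting removes any crossings before issuing.
        issued[t] = np.sort(q)
        # Subgradient step on the pinball loss for each level tau:
        # move up by lr * tau when y_t exceeds the iterate,
        # move down by lr * (1 - tau) otherwise.
        miss = (y_t > q).astype(float)
        q = q + lr * (miss - (1.0 - levels))
    return issued
```

For example, `quantile_tracking_sorted(y, [0.1, 0.5, 0.9], lr=0.05)` returns one row of non-decreasing forecasts per time step.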

Keywords

forecasting

calibration

conformal prediction

online learning 

Co-Author(s)

Isaac Gibbs
Ryan Tibshirani, UC Berkeley

First Author

Tiffany Ding, University of California, Berkeley

Presenting Author

Tiffany Ding, University of California, Berkeley

Physics-Informed multiple quantile regression for complex environmental data

I will present a physics-informed multiple quantile regression model. The method features a regularizing term involving a partial differential equation that encodes the available problem-specific information about the phenomenon under study. The method jointly estimates multiple quantiles while preserving monotonicity. Moreover, it can handle spatial data observed over non-Euclidean domains, such as linear networks, two-dimensional manifolds, and non-convex volumes. The method will be illustrated through an application to the study of nitrogen dioxide over the Lombardy region of Italy.

Keywords

spatial data analysis

smoothing with roughness penalties

quantile regression 

Co-Author(s)

Ilenia Di Battista, Politecnico di Milano
Marco De Sanctis, Politecnico di Milano
Eleonora Arnone, Università degli Studi di Torino
Cristian Castiglione, Bocconi University
Mauro Bernardi, Università degli Studi di Padova
Francesca Ieva, Politecnico di Milano

First Author

Laura Maria Sangalli, MOX - Dipartimento Di Matematica, Politecnico Di Milano

Presenting Author

Laura Maria Sangalli, MOX - Dipartimento Di Matematica, Politecnico Di Milano

Etiological connections between initial COVID-19 and two rare infectious diseases

The origin of COVID-19 remains unclear despite extensive research. Theoretical models can simplify complex epigenetic landscapes by reducing a vast number of methylation sites to manageable sets, revealing fundamental pathogen interactions that, for the first time in the literature and in practice, advance the tracing of the virus's origin. In our study, a max-logistic intelligence classifier analyzed 865,859 Infinium MethylationEPIC sites (CpGs) and identified eight CpGs that achieved 100% accuracy in distinguishing COVID-19 patients from patients with other respiratory diseases and from healthy controls. One CpG, cg07126281, linked to the SAMM50 gene, shares genetic ties with rare infectious diseases such as Sennetsu fever and glanders, suggesting a potential connection between COVID-19 and these diseases, possibly transmitted through contaminated seafood or glanders-infected individuals. Identifying such links among 865,859 CpG sites is challenging, with a random correlation probability of less than one in ten million. However, the likelihood of finding meaningful associations with rare diseases lowers this probability to one in one hundred million, reinforcing the credibility of our findings.

Keywords

Biomarkers

virus tracing

DNA methylations

site-site interaction effects

rare diseases

Sennetsu fever and glanders 

First Author

Zhengjun Zhang, University of Chinese Academy of Sciences

Presenting Author

Zhengjun Zhang, University of Chinese Academy of Sciences

Minimax Rates for Discrete Signal Recovery with Applications to Photonic Imaging

We analyze the statistical problem of recovering a discrete signal, modeled as a k-atomic uniform distribution μ, from a binned Poisson convolution model. This question is motivated by super-resolution microscopy, where precise estimation of μ provides insights into spatial configurations, such as protein colocalization in cellular imaging. Our main result quantifies the minimax risk of estimating μ under the Wasserstein distance for Gaussian and for compactly supported, smooth convolution kernels. Specifically, we show that the global minimax risk scales as t^{-1/(2k)} as t→∞, where t denotes the illumination time of the probe, and that this rate is achieved by the method of moments and the maximum likelihood estimator. To address practical settings where atoms of μ may be partially separated, we also analyze a regime with structured clusters and show faster adaptive rates and local minimax optimality for both estimators. As an application, we use our methods on experimental STED microscopy data to locate single DNA origami. In addition, we complement our findings with numerical experiments that showcase the practical performance of both estimators and their trade-offs.
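The link between moments and atom locations for a k-atomic uniform distribution can be sketched in the noiseless case. The sketch below assumes the first k moments of μ are known exactly and ignores the convolution kernel, the binning, and the Poisson noise, so it is only a building block and not the estimator analyzed in the talk. It uses Newton's identities to turn power sums into a polynomial whose roots are the atoms; the function name `atoms_from_moments` is hypothetical.

```python
import numpy as np

def atoms_from_moments(moments):
    """Recover the atoms of mu = (1/k) * sum_i delta_{x_i} from its moments.

    `moments[j - 1]` is the exact j-th moment (1/k) * sum_i x_i**j for
    j = 1..k. Illustrative, noiseless sketch only.
    """
    k = len(moments)
    p = [k * m for m in moments]  # power sums p_1, ..., p_k
    e = [1.0]                     # elementary symmetric polynomials, e_0 = 1
    for j in range(1, k + 1):
        # Newton's identity: j * e_j = sum_{i=1}^{j} (-1)^(i-1) * e_{j-i} * p_i
        s = sum((-1) ** (i - 1) * e[j - i] * p[i - 1] for i in range(1, j + 1))
        e.append(s / j)
    # prod_i (x - x_i) = sum_j (-1)^j * e_j * x^(k-j), highest degree first
    coeffs = [(-1) ** j * e[j] for j in range(k + 1)]
    return np.sort(np.roots(coeffs).real)
```

Small perturbations of the moments can move the roots considerably when atoms are close together, which is one informal way to see why the achievable rate degrades as k grows.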

Keywords

Gaussian Mixture Models

Method of Moments

Maximum Likelihood Estimation

Microscopy

Polynomial Root Stability

Chebyshev Systems 

Co-Author(s)

Tudor Manole
Danila Litskevich, University of Göttingen
Axel Munk, Goettingen University

First Author

Shayan Hundrieser

Presenting Author

Shayan Hundrieser

Geodesic Causal Inference

Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for regression relationships with scalar responses and also for distributional responses. We introduce here a general framework for causal inference when responses reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the quantification of treatment effects through the concept of the geodesic average treatment effect. Using ideas from Fréchet regression, we obtain a doubly robust estimator of the geodesic average treatment effect, along with results on consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Examples and practical implementations include simulations and data illustrations for compositional responses, as encountered in U.S. statewise energy source data, where we study the effect of coal mining; for network data corresponding to New York taxi trips, where the effect of the COVID-19 pandemic is of interest; and for connectivity networks, where we study the effect of Alzheimer's disease.

Keywords

Doubly robust estimation

Fréchet regression

geodesic average treatment effect

metric statistic

network

random object 

Co-Author(s)

Daisuke Kurisu, The University of Tokyo
Taisuke Otsu, London School of Economics
Hans-Georg Mueller, UC Davis

First Author

Yidong Zhou

Presenting Author

Yidong Zhou

Variance component mixture modelling for longitudinal T-cell receptor clonal dynamics

Studies of T cells and their clonally unique receptors have shown promise in elucidating the association between immune response and human disease. Methods to identify T-cell receptor clones that expand or contract in response to certain therapeutic strategies have so far been limited to longitudinal pairwise comparisons of clone frequency with multiplicity adjustment. Here we develop a more general mixture model approach for arbitrary follow-up and missingness that separates dynamic from static longitudinal clone frequency behavior. While it is common to mix on the location or scale parameter of a family of distributions, the model takes a different approach and mixes on the parameterization itself: the dynamic component allows for a variable, Gamma-distributed Poisson mean parameter over longitudinal follow-up, while the static component mean is time invariant. We leverage Gamma-Poisson conjugacy to evaluate the model with the respective component posterior predictive distributions and develop an EM algorithm to estimate the empirical Bayes hyperparameters and component membership. We demonstrate the model in simulation and in a prostate cancer patient cohort.
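The Gamma-Poisson conjugacy mentioned above can be made concrete with a short sketch. It assumes a single Poisson mean with a Gamma(shape, rate) prior and unit exposure per observation, and it shows only the standard conjugate update and the resulting negative binomial predictive distribution, not the mixture model or EM algorithm from the talk. All function and parameter names are hypothetical.

```python
from math import exp, lgamma, log

def gamma_poisson_log_pmf(y, shape, rate):
    """log P(Y = y) when Y | lam ~ Poisson(lam) and lam ~ Gamma(shape, rate).

    Integrating out lam gives a negative binomial distribution.
    """
    return (lgamma(shape + y) - lgamma(shape) - lgamma(y + 1)
            + shape * log(rate / (rate + 1.0))
            - y * log(rate + 1.0))

def posterior_params(counts, shape, rate):
    """Conjugate update: the posterior of lam is Gamma(shape + sum, rate + n)."""
    return shape + sum(counts), rate + len(counts)

def posterior_predictive_log_pmf(y_new, counts, shape, rate):
    """log P(Y_new = y_new | counts) under the Gamma-Poisson model."""
    post_shape, post_rate = posterior_params(counts, shape, rate)
    return gamma_poisson_log_pmf(y_new, post_shape, post_rate)
```

For instance, `exp(posterior_predictive_log_pmf(4, [3, 0, 5], 2.0, 1.0))` gives the predictive probability of observing a count of 4 after seeing counts 3, 0, and 5 under a Gamma(2, 1) prior.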

Keywords

mixture model

hierarchical model

Bayesian conjugacy

EM algorithm

T-cell receptor 

First Author

David Swanson, University of Texas MD Anderson Cancer Center

Presenting Author

David Swanson, University of Texas MD Anderson Cancer Center

WITHDRAWN: Non-parametric Counterfactual Regression with Applications in Causal Inference with Dependent Data

Series regression estimates the conditional mean of a response variable by regressing it on features derived from basis functions evaluated at covariate values. Ordinary least squares (OLS)-based series estimators achieve minimax rate optimality but impose stringent assumptions on basis functions. To address this, prior work introduced the Forster-Warmuth (FW) learner, which relaxes these conditions using a unified pseudo-outcome framework to minimize bias from nuisance function estimation, achieving minimax rates under mild assumptions. While these results relied on an i.i.d. sample condition, we extend the FW framework to dependent data settings, including time series and spatial structures. Our analysis shows that under specific dependence conditions, the ℓ2 error rate aligns with the i.i.d. case, preserving minimax optimality. This extension broadens the applicability of FW-inspired methods to high-dimensional and structured data. We demonstrate its utility by estimating dose-response curves for continuous treatments under both unconfounded and confounded scenarios. We model air pollution's immediate effects on heart attack rates to identify actionable public health insights. 
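As a point of reference for the series regression described above, here is a minimal sketch of the plain OLS series estimator with a polynomial basis. It is not the Forster-Warmuth learner and does not handle dependent data or nuisance functions; the function name `fit_series_regression` and the choice of a monomial basis are assumptions made for illustration.

```python
import numpy as np

def fit_series_regression(x, y, degree=3):
    """Fit E[y | x] by OLS on the polynomial basis 1, x, ..., x**degree.

    Illustrative sketch of an ordinary series estimator for 1-D covariates.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Basis matrix: column j holds the basis function x**j at each sample.
    basis = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)

    def predict(x_new):
        x_new = np.asarray(x_new, dtype=float)
        return np.vander(x_new, degree + 1, increasing=True) @ coef

    return predict
```

In practice the number of basis functions would grow with the sample size, and other bases (splines, wavelets, Fourier terms) are common alternatives to monomials.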

Keywords

Series regression

Forster-Warmuth (FW) learner

Minimax rate optimality

Dependent data

Dose-response curves

Air pollution and heart attack rates 

Co-Author(s)

Arun Kuchibhotla, University of Pennsylvania
Eric Tchetgen Tchetgen, University of Pennsylvania

First Author

Prabrisha Rakshit, University of Pennsylvania