Advances in Statistical Inference

Lasanthi Pelawa Watagoda Chair
Appalachian State University
 
Monday, Aug 4: 2:00 PM - 3:50 PM
4064 
Contributed Papers 
Music City Center 
Room: CC-201B 

Main Sponsor

IMS

Presentations

Assumption-Lean Post-Integrated Inference with Negative Control Outcomes

Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variations, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that adjusts for latent heterogeneity using negative control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects, which motivates our semiparametric inference method. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. Our proposal is evaluated with random forests through simulations and analysis of single-cell CRISPR perturbed datasets with potential unmeasured confounders. 

Keywords

Batch correction

Confounder adjustment

Data integration

Hypothesis testing

Latent embedding

Model-free inference 

Co-Author(s)

Kathryn Roeder, Carnegie Mellon University
Larry Wasserman, Carnegie Mellon University

First Author

Jin-Hong Du, Carnegie Mellon University

Presenting Author

Jin-Hong Du, Carnegie Mellon University

Direct Probabilistic Inference for Continuity of Piecewise Models at Weighted Mean Threshold

In weighted piecewise regression, the unknown threshold is usually estimated by a weighted mean with a confidence interval. A key question this raises is how to use probability to infer the continuity of two adjacent piecewise models (PMs) at the threshold. This article uses a 2-segment linear model in a 2D space to demonstrate a method for inferring the continuity of the PMs. Assume that the fullwise model (FM) is $y=a+bx$ and that the convex self-weighted mean (Cmean) of its absolute residuals is $\overline{AR}_{c}$. We first take $X$ as the segmented attribute and the FM as the benchmark model. By keeping each pair of PMs (PM1 and PM2) homogeneous with the FM during the iteration for the threshold, we can calculate the Cmean $\overline{ar}_{x,c,i}$ of the combined absolute residuals of the PMs obtained at iteration $i$, which gives the regressive weight $w_{x,i}=(\overline{AR}_{c}-\overline{ar}_{x,c,i})/\overline{AR}_{c}$ and hence the threshold $\bar{X}_{\Delta}=(\sum_{i} x_{i}\times w_{x,i})/(\sum_{i} w_{x,i})$. Let $\hat{Y}_{1}$ and $\hat{Y}_{2}$ be the two predictions at the threshold $\bar{X}_{\Delta}$; then $Y_{cv}=|\hat{Y}_{1}-\hat{Y}_{2}|$. Similarly, taking $Y$ as the segmented attribute gives $X_{cv}=|\hat{X}_{1}-\hat{X}_{2}|$. Thus, $P_{c}=(X_{cv}\times Y_{cv})/(2\times R_{X}\times R_{Y})$, where $R$ denotes the range.
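The arithmetic in this abstract can be traced with a minimal sketch. It rests on several assumptions that are not in the abstract: the Cmean is not defined there, so a plain arithmetic mean of absolute residuals stands in for it; each segment is fitted by ordinary least squares; every candidate threshold is the last point of the first segment; and the requirement that the PMs stay homogeneous with the FM is not modeled. The function names `connection_variation` and `continuity_measure` are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares line y = a + b*x; returns (a, b)."""
    b, a = np.polyfit(x, y, 1)
    return a, b

def abs_residuals(x, y, a, b):
    """Absolute residuals of the line y = a + b*x."""
    return np.abs(y - (a + b * x))

def connection_variation(x, y, min_pts=3):
    """Weighted-mean threshold on x and the gap between the two
    segment predictions at that threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]

    # Fullwise model and its mean absolute residual.
    # ASSUMPTION: arithmetic mean used in place of the undefined Cmean.
    a, b = fit_line(x, y)
    ar_full = abs_residuals(x, y, a, b).mean()

    candidates, weights = [], []
    for i in range(min_pts, len(x) - min_pts + 1):
        a1, b1 = fit_line(x[:i], y[:i])
        a2, b2 = fit_line(x[i:], y[i:])
        combined = np.concatenate([
            abs_residuals(x[:i], y[:i], a1, b1),
            abs_residuals(x[i:], y[i:], a2, b2),
        ])
        candidates.append(x[i - 1])  # ASSUMPTION: candidate threshold
        # Regressive weight: relative reduction in mean absolute residual.
        weights.append((ar_full - combined.mean()) / ar_full)

    candidates = np.array(candidates)
    weights = np.array(weights)
    threshold = float((candidates * weights).sum() / weights.sum())

    # Refit both segments at the weighted-mean threshold and compare
    # their predictions there.
    left = x <= threshold
    a1, b1 = fit_line(x[left], y[left])
    a2, b2 = fit_line(x[~left], y[~left])
    gap = abs((a1 + b1 * threshold) - (a2 + b2 * threshold))
    return threshold, float(gap)

def continuity_measure(x, y):
    """P_c = (X_cv * Y_cv) / (2 * R_X * R_Y), with R the range."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, y_cv = connection_variation(x, y)  # X is the segmented attribute
    _, x_cv = connection_variation(y, x)  # Y is the segmented attribute
    r_x = x.max() - x.min()
    r_y = y.max() - y.min()
    return (x_cv * y_cv) / (2 * r_x * r_y)
```

For example, `continuity_measure(x, y)` can be called on any pair of one-dimensional arrays with at least six distinct points per axis. The sketch fails when the data are exactly linear, because the fullwise mean absolute residual is then zero.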

Keywords

Fullwise-Piecewise model

Convex Self-weighted Mean (Cmean) of Absolute residuals

Regressive weight

weighted mean threshold

Connection variation at mean threshold

Continuity probability 

First Author

Ligong Chen

Presenting Author

Ligong Chen

Efficient Semiparametric Inference for Distributed Data with Blockwise Missingness

We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites. We propose a class of augmented one-step estimators that incorporate information from external sites through "transfer functions." The proposed approach has several main advantages. First, it is communication-efficient, requiring only one-round communication of summary-level statistics. Second, it satisfies a "do-no-harm" property in the sense that the augmented estimator is at least as efficient as the original one based solely on the internal data. Third, it is statistically optimal, achieving the semiparametric efficiency bound when the transfer function is appropriately estimated from data. Finally, it is scalable, remaining asymptotically normal even when the number of external sites grows with the internal sample size. Simulation studies confirm both the statistical efficiency and computational feasibility of our method in distributed settings. 

Keywords

Blockwise missing

Distributed inference

Semi-parametric inference 

Co-Author(s)

Huiyuan Wang, University of Pennsylvania
Yong Chen, University of Pennsylvania, Perelman School of Medicine

First Author

Jingyue Huang

Presenting Author

Jingyue Huang

Generative Calibration for Valid Inference: Bridging Inferential Models and Simulation

Modern simulation-based inference methods face challenges in achieving finite-sample validity, particularly in high-dimensional settings. Inferential models (IMs) offer a prior-free framework for statistically reliable inference, merging Bayesian-like reasoning with frequentist calibration guarantees. However, practical deployment of IMs is hindered by their possibilistic uncertainty quantification, which resists approximation by conventional Monte Carlo tools. We introduce a generative calibration framework that trains a generative model to sample parameters, uses the sampled parameters to generate synthetic datasets, and assesses a discrepancy function over observed and simulated data. By ensuring the discrepancy follows a uniform distribution, we achieve exact frequentist confidence regions. Minimizing deviations from uniformity via a loss function iteratively refines the generative model without requiring priors or asymptotic assumptions. Experiments confirm its effectiveness in high-dimensional regression and real-world applications, delivering nominal coverage and outperforming calibrated bootstrap and Bayesian methods in finite samples.

Keywords

Inferential models

simulation-based inference

uncertainty quantification

generative modeling

frequentist calibration 

Co-Author(s)

Hyeong Jin Hyun
Halin Shin
Xiao Wang, Purdue University

First Author

Haoyun Yin

Presenting Author

Haoyun Yin

WITHDRAWN One-step mean-squared consistency of the EM algorithm for high-dimensional Gaussian mixture models

The EM algorithm has been used extensively in classification problems involving mixture models. There has been a recent surge in the theoretical understanding of the EM algorithm in specialized versions of Gaussian mixture models, primarily in univariate models or models with fixed dimensionality. However, in practice, the use of EM extends to ultra-high-dimensional datasets with surprisingly good performance. This talk will present recent results on the theoretical properties of the EM algorithm for high-dimensional Gaussian mixture models with minimal assumptions that showcase empirically optimal control of the mean squared error in just one iteration of the algorithm. The theory also provides a novel analysis method for iterative algorithms that could be of independent interest for the analysis of other algorithms in high-dimensional regimes.

Keywords

Expectation-Maximization

Gaussian Mixture Models

High-dimensional

iterative algorithms

consistent estimation 

Co-Author(s)

Matias Cattaneo, Princeton University
Jason Klusowski, Princeton University

First Author

Rajita Chandak

Optimally adaptive test for high dimensional hypotheses via minimax deficiency

The detection boundary is a tool for power evaluation of a high-dimensional test, which provides a binary phase transition of power in terms of signal density and strength. However, it cannot separate the $L_{2}$ and higher criticism (HC) tests under dense signals, or the $L_{\infty}$ and HC tests under highly sparse signals, as they share the same detection boundary. This paper proposes minimax relative deficiency and minimax absolute deficiency as sharper measures for power evaluation than the detection boundary, and develops an adaptive testing procedure by combining three basic tests via a power enhancement. The proposed test is robust to the unknown signal density and strength with sharp optimal relative deficiency and nearly optimal absolute deficiency over the whole signal density regime. A full comparison of the proposed test with the existing methods is provided using the minimax deficiency measures. Simulation studies and a real data application to climate change analysis are conducted to evaluate the proposed test and demonstrate its superiority.
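For readers unfamiliar with the tests named in this abstract, the sketch below computes textbook forms of the $L_{2}$, $L_{\infty}$, and HC statistics from a vector of z-scores. It is an illustration only and not the authors' adaptive procedure: the abstract does not say how the tests are standardized or combined, so the function name `base_statistics`, the two-sided p-values, the truncation fraction `alpha0`, and the Donoho-Jin form of HC are all assumptions.

```python
import math
import numpy as np

def base_statistics(z, alpha0=0.5):
    """Return (L2, L-infinity, HC) statistics for z-scores z.

    ASSUMPTION: entries of z are independent N(0, 1) under the null.
    """
    z = np.asarray(z, dtype=float)
    n = z.size

    # L2-type statistic: sum of squares, sensitive to dense weak signals.
    l2 = float(np.sum(z ** 2))

    # L-infinity-type statistic: largest absolute entry, sensitive to
    # very sparse strong signals.
    linf = float(np.max(np.abs(z)))

    # Two-sided normal p-values, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    p_sorted = np.sort(p)
    rank = np.arange(1, n + 1)

    # Higher criticism: largest standardized gap between the empirical
    # and uniform p-value distributions over the smallest alpha0
    # fraction; p-values of exactly 0 or 1 are dropped to avoid
    # division by zero.
    keep = (rank <= alpha0 * n) & (p_sorted > 0) & (p_sorted < 1)
    if not np.any(keep):
        return l2, linf, float("nan")
    gaps = rank[keep] / n - p_sorted[keep]
    scale = np.sqrt(p_sorted[keep] * (1.0 - p_sorted[keep]))
    hc = float(np.max(math.sqrt(n) * gaps / scale))
    return l2, linf, hc
```

Calling `base_statistics` on a vector with one large entry and many zeros yields a large $L_{\infty}$ value relative to the $L_{2}$ value, which is the sparse regime the abstract contrasts with dense signals.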

Keywords

deficiency

detection boundary

high dimensionality

minimax optimality

power enhancement 

Co-Author(s)

Song Xi Chen, Tsinghua University
Yumou Qiu, Peking University

First Author

Jingkun Qiu

Presenting Author

Jingkun Qiu

Towards Efficient Statistical Inference and Optimal Design in Adaptive Experiments

Adaptive experiments play a crucial role in clinical trials and online A/B testing. Unlike static designs, adaptive experiments dynamically adjust treatment randomization probabilities and key elements based on sequentially collected data. This flexibility helps achieve objectives like reducing uncertainty in causal estimates or enhancing participant benefits. However, the adaptive and time-dependent nature of the data collected from such experiments poses challenges for unbiased statistical inference due to non-i.i.d. data. Building upon the Targeted Maximum Likelihood Estimator (TMLE) literature, which has provided valid statistical inference for adaptive experimental settings using inverse weighting strategies, we propose a new TMLE that further improves the efficiency for estimating causal estimands under adaptive designs. Additionally, we present a general framework for implementing optimal adaptive designs tailored to various objectives. We demonstrate the effectiveness of our proposed estimators and adaptive designs through theoretical analysis and extensive simulations.

Keywords

Adaptive Experimental Design

Targeted Maximum Likelihood Estimation

Causal Inference 

Co-Author

Mark Van Der Laan, UC Berkeley

First Author

Wenxin Zhang, UC Berkeley

Presenting Author

Wenxin Zhang, UC Berkeley