Monday, Aug 4: 2:00 PM - 3:50 PM
4064
Contributed Papers
Music City Center
Room: CC-201B
Main Sponsor
IMS
Presentations
Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variation, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to the data-dependent integration process. We introduce a robust post-integrated inference method that adjusts for latent heterogeneity using negative control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects, which motivates our semiparametric inference method. These estimands remain statistically meaningful under model misspecification and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. Our proposal is evaluated with random forests through simulations and an analysis of single-cell CRISPR-perturbed datasets with potential unmeasured confounders.
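The abstract's estimator is semiparametric and doubly robust; as a rough, purely illustrative sketch of the role negative controls can play, the snippet below uses an RUV-style two-stage adjustment: latent factors are estimated from negative-control outcomes and then included as covariates when testing each outcome. The function name, the SVD-plus-OLS construction, and all defaults are assumptions for illustration, not the authors' method.

```python
import numpy as np

def negative_control_adjusted_tests(Y, X, nc_idx, k):
    """Illustrative RUV-style adjustment (not the paper's estimator).

    Y      : (n, p) outcome matrix (e.g., expression)
    X      : (n,)   primary variable (e.g., perturbation indicator)
    nc_idx : indices of negative-control outcomes assumed unaffected by X
    k      : number of latent factors to remove
    """
    n, p = Y.shape
    # Estimate latent factors from the negative controls via SVD:
    # any structure they carry is treated as unwanted variation.
    U, _, _ = np.linalg.svd(Y[:, nc_idx], full_matrices=False)
    W = U[:, :k]                          # (n, k) estimated latent embedding
    # Regress each outcome on [1, X, W] and test the coefficient on X.
    D = np.column_stack([np.ones(n), X, W])
    coef, _, _, _ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - D.shape[1])
    DtD_inv = np.linalg.inv(D.T @ D)
    se = np.sqrt(sigma2 * DtD_inv[1, 1])
    return coef[1] / se                   # per-outcome t-statistics for X
```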
Keywords
Batch correction
Confounder adjustment
Data integration
Hypothesis testing
Latent embedding
Model-free inference
In weighted piecewise regression, an unknown threshold is usually estimated by a weighted mean together with a confidence interval. A key issue is then how to use probability to assess the continuity of two adjacent piecewise models (PMs) at the threshold. This article uses a two-segment linear model in a two-dimensional space to demonstrate a method for inferring the continuity of the PMs. Assume the fullwise model (FM) is $y = a + bx$ and that the convex self-weighted mean (Cmean) of its absolute residuals is $\bar{AR}_c$. We first take $X$ as the segmented attribute and the FM as the benchmark model. Keeping each pair of PMs (PM1 and PM2) homogeneous with the FM during the iteration over candidate thresholds, we compute the Cmean $\bar{ar}_{x,c,i}$ of the combined absolute residuals of the PMs at each iteration $i$, giving the regressive weight $w_{x,i} = (\bar{AR}_c - \bar{ar}_{x,c,i})/\bar{AR}_c$ and hence the threshold $\bar{X}_{\Delta} = \sum_i x_i w_{x,i} / \sum_i w_{x,i}$. The two segment predictions at the threshold $\bar{X}_{\Delta}$ are $\hat{Y}_1$ and $\hat{Y}_2$, so $Y_{cv} = |\hat{Y}_1 - \hat{Y}_2|$. Treating $Y$ as the segmented attribute in the same way gives $X_{cv} = |\hat{X}_1 - \hat{X}_2|$. The continuity probability is then $P_c = (X_{cv} \times Y_{cv}) / (2 R_X R_Y)$, where $R$ denotes range.
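A rough numerical sketch of the procedure as described is given below, with one stated simplification: since the convex self-weighting behind the Cmean is not specified in the abstract, an ordinary mean of absolute residuals is used in its place. Variable names and the minimum segment size are illustrative.

```python
import numpy as np

def continuity_probability(x, y, min_seg=3):
    """Sketch of the FM/PM continuity probability P_c described above,
    using a plain mean of absolute residuals where the abstract's
    convex self-weighted mean (Cmean) would be used."""

    def fit_line(u, v):
        b, a = np.polyfit(u, v, 1)        # slope, intercept
        return a, b

    def connection_variation(u, v):
        # Fullwise model and its mean absolute residual (stand-in for AR_bar_c)
        a, b = fit_line(u, v)
        AR = np.mean(np.abs(v - (a + b * u)))
        order = np.argsort(u)
        u_s, v_s = u[order], v[order]
        xs, ws = [], []
        for i in range(min_seg, len(u_s) - min_seg):
            a1, b1 = fit_line(u_s[:i], v_s[:i])          # PM1
            a2, b2 = fit_line(u_s[i:], v_s[i:])          # PM2
            res = np.concatenate([np.abs(v_s[:i] - (a1 + b1 * u_s[:i])),
                                  np.abs(v_s[i:] - (a2 + b2 * u_s[i:]))])
            w = (AR - np.mean(res)) / AR                 # regressive weight w_x,i
            if w > 0:
                xs.append(u_s[i]); ws.append(w)
        xs, ws = np.array(xs), np.array(ws)
        thr = np.sum(xs * ws) / np.sum(ws)               # weighted mean threshold
        # Predictions of the two PMs refit on either side of the threshold
        left, right = u_s <= thr, u_s > thr
        a1, b1 = fit_line(u_s[left], v_s[left])
        a2, b2 = fit_line(u_s[right], v_s[right])
        return abs((a1 + b1 * thr) - (a2 + b2 * thr))

    Y_cv = connection_variation(x, y)     # X as the segmented attribute
    X_cv = connection_variation(y, x)     # Y as the segmented attribute
    R_X, R_Y = np.ptp(x), np.ptp(y)
    return (X_cv * Y_cv) / (2 * R_X * R_Y)
```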
Keywords
Fullwise-Piecewise model
Convex Self-weighted Mean (Cmean) of Absolute residuals
Regressive weight
weighted mean threshold
Connection variation at mean threshold
Continuity probability
We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites. We propose a class of augmented one-step estimators that incorporate information from external sites through "transfer functions." The proposed approach has several main advantages. First, it is communication-efficient, requiring only one-round communication of summary-level statistics. Second, it satisfies a "do-no-harm" property in the sense that the augmented estimator is at least as efficient as the original one based solely on the internal data. Third, it is statistically optimal, achieving the semiparametric efficiency bound when the transfer function is appropriately estimated from data. Finally, it is scalable, remaining asymptotically normal even when the number of external sites grows with the internal sample size. Simulation studies confirm both the statistical efficiency and computational feasibility of our method in distributed settings.
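The transfer-function construction is not reproduced here; the snippet below only illustrates, with a generic control-variate combination, how augmenting an internal estimate with zero-mean external summary statistics can satisfy a "do-no-harm" property. All argument names are assumptions, and the paper's augmented one-step estimator is more general.

```python
import numpy as np

def augmented_estimate(theta_int, var_int, ext_stats, cov_int_ext, cov_ext):
    """Generic do-no-harm augmentation (a control-variate construction,
    not the paper's transfer-function estimator).

    theta_int   : scalar internal one-step estimate
    var_int     : its estimated variance
    ext_stats   : (K,) zero-mean summary statistics from K external sites
    cov_int_ext : (K,) estimated covariances between theta_int and ext_stats
    cov_ext     : (K, K) estimated covariance matrix of ext_stats
    """
    gamma = np.linalg.solve(cov_ext, cov_int_ext)   # variance-minimizing weights
    theta_aug = theta_int - gamma @ ext_stats       # augmented estimator
    var_aug = var_int - gamma @ cov_int_ext         # never exceeds var_int
    return theta_aug, var_aug
```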
Keywords
Blockwise missing
Distributed inference
Semi-parametric inference
Modern simulation-based inference methods face challenges in achieving finite-sample validity, particularly in high-dimensional settings. Inferential models (IMs) offer a prior-free framework for statistically reliable inference, merging Bayesian-like reasoning with frequentist calibration guarantees. However, practical deployment of IMs is hindered by their possibilistic uncertainty quantification, which resists approximation by conventional Monte Carlo tools. We introduce a generative calibration framework that trains a generative model to sample parameters, generates synthetic datasets from those parameters, and assesses a discrepancy function between observed and simulated data. By ensuring the discrepancy follows a uniform distribution, we achieve exact frequentist confidence regions. Minimizing deviations from uniformity via a loss function iteratively refines the generative model without requiring priors or asymptotic assumptions. Experiments confirm its effectiveness in high-dimensional regression and real-world applications, delivering nominal coverage and outperforming calibrated bootstrap and Bayesian methods in finite samples.
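The training loop itself is not spelled out in the abstract; the snippet below only illustrates the calibration criterion being targeted: at a fixed true parameter, the plausibility computed from data generated under that parameter should be uniformly distributed, and the distance from uniformity serves as a loss to be minimized. `plausibility` and `simulate` are hypothetical placeholders for the learned model and an assumed simulator.

```python
import numpy as np

def uniformity_loss(plausibility, simulate, theta0, n_rep=500, rng=None):
    """KS-type distance from uniformity of the plausibility of theta0
    under repeated data generated at theta0; exact frequentist coverage
    corresponds to this loss being (approximately) zero.

    plausibility(theta, data) : placeholder for the learned calibration map
    simulate(theta, rng)      : placeholder for the assumed simulator
    """
    rng = rng or np.random.default_rng(0)
    p = np.sort([plausibility(theta0, simulate(theta0, rng))
                 for _ in range(n_rep)])
    grid = np.arange(1, n_rep + 1) / n_rep
    return np.max(np.abs(p - grid))
```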
Keywords
Inferential models
simulation-based inference
uncertainty quantification
generative modeling
frequentist calibration
The EM algorithm has been used extensively in classification problems involving mixture models. There has been a recent surge in the theoretical understanding of the EM algorithm for specialized versions of Gaussian mixture models, primarily univariate models or models with fixed dimensionality. In practice, however, EM extends to ultra high-dimensional datasets with surprisingly good performance. This talk will present recent results on the theoretical properties of the EM algorithm for high-dimensional Gaussian mixture models under minimal assumptions, which showcase empirically optimal control of the mean squared error in just one iteration of the algorithm. The theory also provides a novel analysis method for iterative algorithms that could be of independent interest for the analysis of other algorithms in high-dimensional regimes.
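As context for the model class, here is a minimal sketch of a single EM iteration for the symmetric two-component Gaussian mixture commonly analyzed in this literature; the talk's exact model, assumptions, and initialization may differ, and the data-generating values below are illustrative.

```python
import numpy as np

def em_step_symmetric_gmm(X, mu, sigma2=1.0):
    """One EM iteration for the mixture 0.5*N(mu, sigma2*I) + 0.5*N(-mu, sigma2*I).

    X  : (n, d) data matrix; d may greatly exceed n in high-dimensional regimes
    mu : (d,)   current estimate of the component mean
    """
    # E-step: posterior weight of the +mu component enters through tanh
    w = np.tanh(X @ mu / sigma2)          # (n,), values in (-1, 1)
    # M-step: weighted average of the observations
    return (w[:, None] * X).mean(axis=0)  # updated estimate of mu

# Illustrative use: a single iteration from a small random initialization.
rng = np.random.default_rng(1)
n, d = 200, 1000
mu_true = np.zeros(d); mu_true[:10] = 2.0
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu_true + rng.normal(size=(n, d))
mu_hat = em_step_symmetric_gmm(X, 0.1 * rng.normal(size=d))
```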
Keywords
Expectation-Maximization
Gaussian Mixture Models
High-dimensional
iterative algorithms
consistent estimation
The detection boundary is a tool for evaluating the power of a high-dimensional test: it provides a binary phase transition of power in terms of signal density and strength. However, it cannot separate the $L_{2}$ and higher criticism (HC) tests under dense signals, or the $L_{\infty}$ and HC tests under highly sparse signals, as they share the same detection boundary. This paper proposes minimax relative deficiency and minimax absolute deficiency as sharper measures of power than the detection boundary, and develops an adaptive testing procedure that combines three basic tests via power enhancement. The proposed test is robust to the unknown signal density and strength, with sharply optimal relative deficiency and nearly optimal absolute deficiency over the whole signal density regime. A full comparison of the proposed test with existing methods is provided using the minimax deficiency measures. Simulation studies and a real data application to climate change analysis are conducted to evaluate the proposed test and demonstrate its superiority.
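For orientation, the snippet below computes generic textbook forms of the three building-block statistics from a vector of z-scores, plus a simple screening term of the kind used in power enhancement; the paper's standardizations, thresholds, and combination rule are not reproduced here and may differ.

```python
import numpy as np
from scipy.stats import norm

def basic_statistics(z):
    """Generic forms of the three building-block statistics for z-scores."""
    n = len(z)
    l2 = np.sum(z ** 2)                   # L2-type statistic, powerful for dense signals
    linf = np.max(np.abs(z))              # L_infinity statistic, powerful for sparse signals
    k = n // 2
    p = np.clip(np.sort(2 * norm.sf(np.abs(z)))[:k], 1e-12, 1 - 1e-12)
    i = np.arange(1, k + 1)
    hc = np.max(np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p)))  # higher criticism
    return l2, linf, hc

def screening_term(z, delta=None):
    """Illustrative power-enhancement component: essentially zero under the
    null (the threshold exceeds the null maximum) and diverging under
    sufficiently strong sparse signals."""
    n = len(z)
    if delta is None:
        delta = np.sqrt(2 * np.log(n) * np.log(np.log(n)))   # conservative, illustrative
    return np.sqrt(n) * np.sum(z[np.abs(z) > delta] ** 2)
```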
Keywords
deficiency
detection boundary
high dimensionality
minimax optimality
power enhancement
Adaptive experiments play a crucial role in clinical trials and online A/B testing. Unlike static designs, adaptive experiments dynamically adjust treatment randomization probabilities and other key design elements based on sequentially collected data. This flexibility helps achieve objectives such as reducing uncertainty in causal estimates or enhancing participant benefit. However, the adaptive and time-dependent nature of the collected data poses challenges for unbiased statistical inference, as the observations are not i.i.d. Building on the Targeted Maximum Likelihood Estimation (TMLE) literature, which has provided valid statistical inference in adaptive experimental settings using inverse weighting strategies, we propose a new TMLE that further improves the efficiency of estimating causal estimands under adaptive designs. Additionally, we present a general framework for implementing optimal adaptive designs tailored to various objectives. We demonstrate the effectiveness of our proposed estimators and adaptive designs through theoretical analysis and extensive simulations.
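The proposed TMLE is not reproduced here; as a minimal, self-contained illustration of why adaptive designs require special weighting, the toy simulation below updates the randomization probability from the accumulating data and estimates the average treatment effect by inverse weighting with the known time-varying probabilities. All numeric choices (arm means, burn-in, probability bounds) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_adaptive_experiment(n=2000, burn_in=100):
    """Toy adaptive experiment with an inverse-weighted ATE estimate;
    the proposed TMLE adds targeting steps beyond this simple weighting."""
    mu1, mu0 = 1.0, 0.4                        # true arm means (toy values)
    a_list, y_list, p_list = [], [], []
    p = 0.5
    for t in range(n):
        a = rng.binomial(1, p)
        y = rng.normal(mu1 if a else mu0, 1.0)
        a_list.append(a); y_list.append(y); p_list.append(p)
        if t >= burn_in:                        # adapt after a burn-in period
            y_arr, a_arr = np.array(y_list), np.array(a_list)
            m1 = y_arr[a_arr == 1].mean()
            m0 = y_arr[a_arr == 0].mean()
            # Shift assignment toward the currently better arm, but keep the
            # probabilities bounded away from 0 and 1 for valid inference.
            p = np.clip(0.5 + 0.3 * np.tanh(m1 - m0), 0.1, 0.9)
    a, y, p = map(np.array, (a_list, y_list, p_list))
    return np.mean(a * y / p - (1 - a) * y / (1 - p))   # inverse-weighted ATE

print(run_adaptive_experiment())
```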
Keywords
Adaptive Experimental Design
Targeted Maximum Likelihood Estimation
Causal Inference