Nonparametric Statistics Section Student Paper Award Presentations

Anru Zhang, Chair
Duke University

Tatiyana Apanasovich, Organizer
George Washington University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
Session Number: 0732 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-207D 

Main Sponsor

Section on Nonparametric Statistics

Presentations

A Unified Framework for Semiparametrically Efficient Semi-supervised Learning

We consider statistical inference under a semi-supervised setting in which we have access to both a labeled dataset and an unlabeled dataset. We ask: under what circumstances, and by how much, can incorporating the unlabeled dataset improve upon inference using the labeled data alone? To answer this question, we investigate semi-supervised learning through the lens of semiparametric efficiency theory. We characterize the efficiency lower bound under the semi-supervised setting for an arbitrary inferential problem, and show that incorporating unlabeled data can potentially improve efficiency if the parameter is not well-specified. We then propose two types of semi-supervised estimators: a safe estimator that imposes minimal assumptions, is simple to compute, and is guaranteed to be at least as efficient as the initial supervised estimator; and an efficient estimator, which, under stronger assumptions, achieves the semiparametric efficiency bound. Our findings unify existing semiparametric efficiency results for particular special cases and extend them to a much more general class of problems. Moreover, we show that our estimators can flexibly incorporate predicted outcomes arising from "black-box" machine learning models, and thereby achieve the same goal as prediction-powered inference (PPI), but with superior theoretical guarantees. We also provide a complete understanding of the theoretical basis for the existing set of PPI methods. Finally, we apply the developed theoretical framework to derive and analyze efficient semi-supervised estimators in a number of settings, including M-estimation, U-statistics, and average treatment effect estimation, and demonstrate the performance of the proposed estimators in simulations. 
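
The abstract's connection to prediction-powered inference can be made concrete with a toy version of a "safe" semi-supervised estimator. The sketch below is illustrative only, not the authors' method: it estimates a mean by power-tuning a PPI-style correction so that a tuning weight of zero recovers the labeled-only estimator, which is why such estimators cannot do worse asymptotically. The predictor f is a hypothetical stand-in for any pretrained "black-box" model.

```python
# A minimal sketch (not the authors' estimator): a power-tuned, PPI-style
# semi-supervised mean estimator. The predictor f is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 5000                       # labeled / unlabeled sample sizes

# Simulated data: Y depends on X; f is an imperfect pretrained predictor.
X_lab, X_unlab = rng.normal(size=n), rng.normal(size=N)
Y_lab = 2.0 * X_lab + rng.normal(size=n)
f = lambda x: 1.8 * x + 0.1            # assumed-given black-box model

fx_lab, fx_unlab = f(X_lab), f(X_unlab)

# Power-tuned correction: omega = 0 recovers the labeled-only mean, so the
# tuned estimator is (asymptotically) at least as efficient as labeled-only.
omega = (np.cov(Y_lab, fx_lab)[0, 1] / np.var(fx_lab, ddof=1)) * N / (n + N)
theta_safe = Y_lab.mean() + omega * (fx_unlab.mean() - fx_lab.mean())

print(f"labeled-only: {Y_lab.mean():.3f}  safe semi-supervised: {theta_safe:.3f}")
```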

Keywords

Semi-supervised learning

Influence function

Nonparametric regression

Prediction-powered inference

Black-box machine learning model 

Speaker

Zichun Xu, University of Washington, Department of Biostatistics

Deep Fréchet Regression

Advancements in modern science have led to the increasing availability of non-Euclidean data in metric spaces. This paper addresses the challenge of modeling relationships between non-Euclidean responses and multivariate Euclidean predictors. We propose a flexible regression model capable of handling high-dimensional predictors without imposing parametric assumptions. Two primary challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled with deep neural networks, while for the latter we demonstrate the feasibility of mapping the metric space where the responses reside to a low-dimensional Euclidean space via manifold learning. We introduce a reverse mapping approach, employing local Fréchet regression, to map the low-dimensional manifold representations back to objects in the original metric space. We develop a theoretical framework that establishes the convergence rate of deep neural networks under dependent sub-Gaussian noise with bias. The convergence rate of the proposed regression model is then obtained by expanding the scope of local Fréchet regression to accommodate multivariate predictors in the presence of errors in predictors. Simulations and case studies, focusing on the special cases of probability measures and networks, show that the proposed model outperforms existing methods for non-Euclidean responses. 
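
To fix ideas, here is a toy rendering of the three-stage pipeline the abstract describes, with simplified stand-ins throughout: responses are one-dimensional distributions in Wasserstein space encoded as quantile vectors, Isomap stands in for the manifold learning step, an sklearn MLP stands in for the deep network, and Nadaraya-Watson weights stand in for local Fréchet regression in the reverse map. None of these choices are claimed to match the paper's implementation.

```python
# Toy sketch of the manifold-learning / deep-regression / reverse-map pipeline.
import numpy as np
from scipy.stats import norm
from sklearn.manifold import Isomap
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n, p = 300, 10
grid = np.linspace(0.01, 0.99, 50)

# Predictors X and responses N(mu(X), sigma(X)^2), stored as quantile vectors;
# the 2-Wasserstein distance between 1-D laws is the L2 distance of quantiles.
X = rng.normal(size=(n, p))
mu, sigma = X[:, 0], 1.0 + 0.5 * np.tanh(X[:, 1])
Q = mu[:, None] + sigma[:, None] * norm.ppf(grid)    # n x 50 quantile matrix

# Stage 1: manifold learning maps the metric-space responses to R^2.
reps = Isomap(n_components=2).fit_transform(Q)

# Stage 2: a neural network regresses the low-dim representations on X.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
net.fit(X[:200], reps[:200])
u_hat = net.predict(X[200:])

# Stage 3: reverse map -- a kernel-weighted Wasserstein barycenter
# (weighted average of training quantile vectors near the predicted rep).
def reverse_map(u, bw=0.5):
    w = np.exp(-np.sum((reps[:200] - u) ** 2, axis=1) / (2 * bw**2))
    return (w[:, None] * Q[:200]).sum(axis=0) / w.sum()

Q_pred = np.array([reverse_map(u) for u in u_hat])
print("mean squared quantile error (approx. W2^2):",
      np.mean((Q_pred - Q[200:]) ** 2))
```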

Keywords

curse of dimensionality

deep learning

Fréchet regression

non-Euclidean data

Wasserstein space 

Speaker

Su I Iao

Doubly Robust Conditional Independence Testing with Generative Neural Networks

This article addresses the problem of testing the conditional independence of two generic random vectors X and Y given a third random vector Z, which plays an important role in statistical and machine learning applications. We propose a new nonparametric testing procedure that avoids explicitly estimating any conditional distributions, instead requiring only sampling from the two marginal conditional distributions of X given Z and Y given Z. We further propose using a generative neural network (GNN) framework to sample from these approximated marginal conditional distributions, which tends to mitigate the curse of dimensionality owing to its adaptivity to any low-dimensional structures and smoothness underlying the data. Theoretically, our test statistic is shown to enjoy a double robustness property against GNN approximation errors: it retains all desirable properties of the oracle test statistic that utilizes the true marginal conditional distributions, as long as the product of the two approximation errors decays to zero faster than the parametric rate. Asymptotic properties of our statistic and the consistency of a bootstrap procedure are derived under both the null and local alternatives. Extensive numerical experiments and real data analysis illustrate the effectiveness and broad applicability of our proposed test.
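
The core idea, comparing the observed triples with triples rebuilt from the two marginal conditional samplers, admits a compact illustration. The sketch below is a simplified version, not the paper's exact statistic or calibration: known Gaussian conditionals stand in for trained GNN samplers, and a rough permutation scheme stands in for the authors' bootstrap (which, unlike a naive permutation, properly accounts for the shared Z's).

```python
# Simplified sketch: under H0, (X, Y, Z) and (Xt, Yt, Z) with Xt ~ P(X|Z) and
# Yt ~ P(Y|Z) drawn independently given Z have the same law, so a two-sample
# kernel MMD can detect conditional dependence.
import numpy as np

rng = np.random.default_rng(2)
n = 300
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = Z + 0.8 * X + rng.normal(size=n)          # X and Y dependent given Z

# Stand-in conditional samplers (trained GNNs would approximate these):
Xt = Z + rng.normal(size=n)                          # draw from P(X|Z)
Yt = 1.8 * Z + np.sqrt(1.64) * rng.normal(size=n)    # P(Y|Z) = N(1.8Z, 1.64)

def gram(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

def mmd2(S, T):
    return gram(S, S).mean() + gram(T, T).mean() - 2 * gram(S, T).mean()

S = np.column_stack([X, Y, Z])
T = np.column_stack([Xt, Yt, Z])
stat = mmd2(S, T)

# Rough permutation calibration (the paper instead derives a valid bootstrap).
pool = np.vstack([S, T])
null = [mmd2(pool[idx[:n]], pool[idx[n:]])
        for idx in (rng.permutation(2 * n) for _ in range(200))]
print("MMD^2:", round(stat, 4), " p-value:", np.mean(np.array(null) >= stat))
```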

Keywords

Conditional Distribution

Conditional Independence Test

Double Robustness

Generative Models

Kernel Method

Maximum Mean Discrepancy 

Co-Author(s)

Yi Zhang
Linjun Huang, University of Illinois Urbana-Champaign
Yun Yang, University of Illinois Urbana-Champaign
Xiaofeng Shao, Washington University in St. Louis, Dept of Statistics and Data Science

Speaker

Yi Zhang

Interpretable Scalar-on-Image Linear Regression Models via the Generalized Dantzig Selector

The scalar-on-image regression model explores the relationship between a scalar response and a two-dimensional image predictor by estimating a bivariate coefficient function. Traditional methods usually assume smoothness of the coefficient function across the image domain, which helps reduce noise but limits interpretability, especially when sparsity (only certain image regions affect the response) is important. Despite the wide range of applications requiring sparse and smooth coefficient estimation, methods that simultaneously address both constraints remain limited. In this paper, we propose the Generalized Dantzig Selector (GDS) method, which estimates the coefficient function while balancing smoothness and sparsity. Our approach identifies regions of the image that do not influence the response (zero regions of the coefficient function), improving interpretability without sacrificing stability. The proposed GDS method demonstrates superior performance compared to existing techniques in both simulations and real data analyses. Furthermore, we provide theoretical support for the proposed method, including non-asymptotic bounds on the estimation error. 
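
For intuition, here is a generic Dantzig-selector-type estimator with an added first-difference smoothness penalty, written with cvxpy; the paper's actual GDS formulation, penalties, and tuning may differ. All data and parameter choices below are hypothetical.

```python
# Generic sketch of a Dantzig-selector-style estimator for scalar-on-image
# regression: l1 sparsity plus a crude first-difference smoothness penalty,
# subject to the usual correlation constraint ||X'(y - Xb)/n||_inf <= lam.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, d = 150, 12                                  # samples, image side length
imgs = rng.normal(size=(n, d, d))
B_true = np.zeros((d, d))
B_true[3:6, 3:6] = 1.0                          # sparse, smooth signal block
X = imgs.reshape(n, -1)
y = X @ B_true.ravel() + 0.1 * rng.normal(size=n)

b = cp.Variable(d * d)
lam, gam = 0.05, 0.05
D = (np.eye(d * d, k=1) - np.eye(d * d))[:-1]   # 1-D differences, a crude
                                                # surrogate for 2-D smoothness
prob = cp.Problem(
    cp.Minimize(cp.norm1(b) + gam * cp.norm1(D @ b)),
    [cp.norm_inf(X.T @ (y - X @ b) / n) <= lam],
)
prob.solve()
B_hat = b.value.reshape(d, d)
print("estimated nonzero region:\n", (np.abs(B_hat) > 0.1).astype(int))
```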

Keywords

sparse estimation

smoothness regularization

nonparametric regression

non-asymptotic error bound 

Speaker

Sijia Liao, University of Arizona

Stationarity of Manifold Time Series

In modern interdisciplinary research, manifold time series data have been garnering more attention. A critical question in analyzing such data is "stationarity", which reflects the underlying dynamic behavior and is crucial across fields such as cell biology, neuroscience, and empirical finance. Yet a formal definition of stationarity tailored to manifold time series has been absent. This work bridges that gap by proposing the first definitions of first-order and second-order stationarity for manifold time series. Additionally, we develop novel statistical procedures to test the stationarity of manifold time series and study their asymptotic properties. Our methods account for the curved nature of manifolds, leading to a more intricate analysis than that in Euclidean space. The effectiveness of our methods is evaluated through numerical simulations, and their practical merits are demonstrated through the analysis of a cell-type proportion time series dataset from a paper recently published in Cell. The first-order stationarity test result aligns with the biological findings of that paper, while the second-order stationarity test provides numerical support for a critical assumption made therein. 
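
As a concrete toy illustration of the first-order test's main ingredients, the sketch below works on the unit sphere: observations are mapped to the tangent space at a Fréchet-mean surrogate via the Riemannian log map, and a CUSUM statistic is computed from the tangent coordinates. This is not the paper's procedure; in particular, the projected Euclidean mean is a cheap surrogate for the Fréchet mean, and the bootstrap calibration the paper develops is omitted.

```python
# Toy sketch: CUSUM for first-order stationarity of a time series on S^2.
import numpy as np

rng = np.random.default_rng(4)

def log_map(p, x):
    """Riemannian log map on the unit sphere: tangent vector at p toward x."""
    c = np.clip(x @ p, -1.0, 1.0)
    theta = np.arccos(c)
    v = x - c * p
    nv = np.linalg.norm(v)
    return np.zeros_like(p) if nv < 1e-12 else theta * v / nv

# Simulated manifold time series: noisy perturbations of a fixed mean point.
T = 400
base = np.array([0.0, 0.0, 1.0])
obs = np.array([b / np.linalg.norm(b)
                for b in base + 0.1 * rng.normal(size=(T, 3))])

# Projected Euclidean mean as a surrogate for the Frechet mean.
mean_hat = obs.mean(axis=0)
mean_hat /= np.linalg.norm(mean_hat)

V = np.array([log_map(mean_hat, x) for x in obs])   # tangent coordinates
V -= V.mean(axis=0)

# CUSUM statistic: maximal norm of scaled partial sums; in practice its
# null distribution would be calibrated by a suitable bootstrap.
partial = np.cumsum(V, axis=0) / np.sqrt(T)
print("CUSUM statistic:", round(np.linalg.norm(partial, axis=1).max(), 3))
```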

Keywords

CUSUM

bootstrap

curvature 

Speaker

Junhao Zhu, University of Toronto

Structure-Preserving Nonlinear Sufficient Dimension Reduction for Tensor Regression

We present a novel approach to nonlinear sufficient dimension reduction (SDR) for scalar-on-tensor regression and classification problems. Our method defines a tensor product space built from several reproducing kernel Hilbert spaces (RKHSs) and introduces two kinds of dimension-folding subspaces alongside the conventional SDR subspace within this tensor product space. We demonstrate that, under mild conditions, the range of the regression operator in the tensor product space resides within the conventional SDR subspace. Furthermore, we propose the Tucker and CP Tensor Envelope frameworks, designed to preserve the intrinsic multidimensional structure of tensor-valued predictors while achieving effective dimension reduction. This framework bridges the subspaces, enabling us to establish that the tensor envelope of the regression operator is also contained within the dimension-folding subspace. By leveraging population-level and sample-level estimation, and drawing inspiration from these two common tensor decomposition methods, we develop two optimization algorithms for the operator's objective function. We evaluate the performance of our proposed estimators through comprehensive simulations and real-world applications. 
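
The dimension-folding ingredient (compressing each tensor mode separately so that the predictor's multidimensional structure survives the reduction) can be illustrated with a plain Tucker/HOSVD projection, as sketched below. This shows only the structural idea; the paper's RKHS envelope estimators and optimization algorithms are substantially more involved, and all names and ranks here are illustrative.

```python
# Minimal sketch of structure-preserving dimension folding via Tucker/HOSVD:
# mode-wise projections compress each axis of a tensor predictor separately.
import numpy as np

rng = np.random.default_rng(5)
n, dims, ranks = 100, (8, 9, 10), (2, 2, 2)
Xs = rng.normal(size=(n, *dims))              # sample of tensor predictors

def mode_unfold(A, mode):
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

# Mode-k factors: top eigenvectors of the pooled mode-k covariance (HOSVD).
factors = []
for k, r in enumerate(ranks):
    M = sum(mode_unfold(X, k) @ mode_unfold(X, k).T for X in Xs)
    _, U = np.linalg.eigh(M)
    factors.append(U[:, -r:])

def tucker_project(X, Us):
    """Multiply X by each U_k' along mode k, preserving tensor structure."""
    for k, U in enumerate(Us):
        X = np.moveaxis(np.tensordot(U.T, np.moveaxis(X, k, 0), axes=1), 0, k)
    return X

cores = np.array([tucker_project(X, factors) for X in Xs])
print("folded predictors:", cores.shape)      # (100, 2, 2, 2); these low-dim
                                              # cores could feed a kernel SDR
```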

Keywords

Nonlinear Sufficient Dimension Reduction

Dimension Folding

Tensor Decomposition

Reproducing Kernel Hilbert Space

Tensor Envelope and Tensor Product Space

Coordinate Mapping 

Speaker

Dianjun Lin, Pennsylvania State University