Modern Nonparametric Techniques: Beta-Trees, Ranked-Set Sampling, and Function-Valued Parameters

Will Chen, Chair
University of Texas at Arlington
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
4022 
Contributed Papers 
Music City Center 
Room: CC-101A 

Main Sponsor

Section on Nonparametric Statistics

Presentations

Breaking Welch’s F Test

The ANOVA F test famously fails to control the Type I error rate when the variances of the populations differ. Welch's F test is commonly recommended as a robust alternative. However, we find that there is at least one situation where Welch's F test should not be recommended. Specifically, we find through simulation that if the common sample size is fixed at some small number and the number of samples becomes large, then the Type I error rate for Welch's F test tends towards 100% even if all assumptions for the ANOVA F test are met. We explore whether this finding can be confirmed theoretically. We also explore how this finding fits with existing recommendations, whether alternative tests share this deficiency of Welch's F test, and whether this behavior of Welch's F test would have been surprising to Welch and others working on alternatives to the ANOVA F test. 
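
The simulation setting described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard textbook form of Welch's F statistic, and the group counts, sample sizes, number of replications, seed, and standard normal data are all assumptions chosen for illustration. The function names `welch_f_pvalues` and `type1_rate` are hypothetical.

```python
import numpy as np
from scipy import stats


def welch_f_pvalues(data):
    """Welch's F test p-values for simulated data of shape (n_sims, k, n)."""
    n_sims, k, n = data.shape
    means = data.mean(axis=2)                # group means, shape (n_sims, k)
    variances = data.var(axis=2, ddof=1)     # unbiased group variances
    w = n / variances                        # Welch weights w_i = n_i / s_i^2
    w_sum = w.sum(axis=1, keepdims=True)
    grand = (w * means).sum(axis=1, keepdims=True) / w_sum  # weighted grand mean
    numerator = (w * (means - grand) ** 2).sum(axis=1) / (k - 1)
    lam = ((1 - w / w_sum) ** 2 / (n - 1)).sum(axis=1)
    denominator = 1 + 2 * (k - 2) / (k**2 - 1) * lam
    f_stat = numerator / denominator
    df2 = (k**2 - 1) / (3 * lam)             # Welch's approximate denominator df
    return stats.f.sf(f_stat, k - 1, df2)


def type1_rate(k, n, n_sims=300, alpha=0.05, seed=0):
    """Empirical rejection rate when every ANOVA assumption holds (assumed setup)."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_sims, k, n))  # equal means, equal variances
    return float(np.mean(welch_f_pvalues(data) < alpha))


if __name__ == "__main__":
    print("k=3,   n=20:", type1_rate(3, 20))
    print("k=200, n=3: ", type1_rate(200, 3))
```

Comparing a few-groups setting such as `type1_rate(3, 20)` with a many-groups, small-sample setting such as `type1_rate(200, 3)` is one way to see the kind of contrast the abstract reports.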

Keywords

One-way ANOVA

Simulation

Type I error rate 

Co-Author

William Schlegel, Villanova University

First Author

Jesse Frey, Villanova University

Presenting Author

Jesse Frey, Villanova University

A general nonparametric framework for testing hypotheses about a class of function-valued parameters

Performing inference on function-valued parameters, such as the regression function or the conditional average treatment effect (CATE), poses fundamental challenges in nonparametric models. For a class of smooth function-valued parameters that can be expressed as functions of a conditional distribution, we develop a nonparametric test to assess whether the function-valued parameter is constant. The test statistic is based on the norm of the difference between two parameters. We propose a near-estimator of the norm that attains a tractable limiting distribution under the null, when the norm is zero. Our method improves upon many existing approaches to estimating norms, which exhibit poor asymptotic behavior under the null. To illustrate our framework, we present three concrete applications: (1) testing null variable significance in regression; (2) testing constant conditional covariance; and (3) testing constant CATE. Simulation studies demonstrate strong performance, and we further apply the method to identify predictive biomarkers for adjuvant chemotherapy response in HER2-positive breast cancer patients. 

Keywords

Pathwise differentiability

Function-valued parameters

Equality of functionals

Hypothesis testing 

Co-Author(s)

Aaron Hudson, Fred Hutchinson Cancer Center
Ali Shojaie, University of Washington

First Author

Albert Osom, University of Washington

Presenting Author

Albert Osom, University of Washington

Likelihood ratio tests for monotonic functions with regions of flatness

A typical assumption in the asymptotic analysis of estimators of monotonic functions is that they are strictly monotonic, eliminating the possibility for regions of flatness. We characterize the asymptotic behavior of the monotonic regression estimator for a function f at a fixed point x when f is constant in some neighborhood of x. Further, we extend the results on likelihood ratio testing from Banerjee and Wellner (2001) and Groeneboom and Jongbloed (2015) to these regions of flatness. 
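
For readers unfamiliar with the monotonic regression estimator, the sketch below shows the classical pool-adjacent-violators algorithm for a least-squares nondecreasing fit. It assumes equally weighted observations already sorted by the covariate, and the function name `pava` is hypothetical. It illustrates the estimator only, not the asymptotic results or the likelihood ratio tests of the abstract.

```python
def pava(y):
    """Least-squares nondecreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)  # each block is fitted by its mean
    return fit


if __name__ == "__main__":
    print(pava([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

The fit is piecewise constant by construction, so it has flat stretches whether or not the true function does.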

Keywords

Monotonic Regression

Nonparametric Regression

Shape constraints

Asymptotics

Likelihood ratio testing 

Co-Author

Charles Doss, University of Minnesota

First Author

Robert VandenBerg, University of Minnesota

Presenting Author

Robert VandenBerg, University of Minnesota

Assumption-Free Nonparametric MLE of the Distribution Function Using Ranked-Set Sampling

We study nonparametric maximum likelihood estimation of the population distribution function based on ranked-set sampling data. In other words, the probability of seeing the observed data is maximized both over all possible distributions and over all possible ranking schemes. We find that this maximum can be achieved by adopting a ranking strategy driven by existing observed ranks. Obtaining maximum likelihood estimators becomes complicated when there are ties across ranking classes, but we develop a storage-intensive EM algorithm to overcome this difficulty. The maximum likelihood estimator is not unique in general. However, imposing reasonable constraints leads to unique estimators with attractive properties. We compare our proposed maximum likelihood estimator to other estimators from the literature in terms of bias, variance, and consistency. 
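
As background, the sketch below shows how balanced ranked-set sampling data arise and computes the usual empirical distribution function from them. It assumes perfect ranking, and the names `ranked_set_sample`, `rss_edf`, and `draw` are hypothetical. It does not implement the maximum likelihood estimator or the EM algorithm described in the abstract.

```python
import numpy as np


def ranked_set_sample(draw, set_size, cycles, rng):
    """Balanced ranked-set sample under perfect ranking (an assumption).

    Returns an array of shape (cycles, set_size); column r holds the measured
    (r+1)-th smallest unit from its own independent set of `set_size` draws.
    """
    sample = np.empty((cycles, set_size))
    for c in range(cycles):
        for r in range(set_size):
            candidates = np.sort(draw(rng, set_size))  # rank a fresh set
            sample[c, r] = candidates[r]               # measure only rank r+1
    return sample


def rss_edf(sample, t):
    """Empirical distribution function of the pooled ranked-set sample at t."""
    return float(np.mean(sample <= t))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = ranked_set_sample(lambda g, m: g.standard_normal(m), 3, 200, rng)
    print(data.shape, rss_edf(data, 0.0))
```

Each measured value costs `set_size` ranked units, which is the trade-off ranked-set sampling makes when ranking is cheap and measurement is expensive.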

Keywords

EM algorithm

Nonparametric maximum likelihood

Distribution functions

Order statistics

Ranked-set sampling

Ranking 

Co-Author

Yimin Zhang, Villanova University

First Author

Jesse Frey, Villanova University

Presenting Author

Yimin Zhang, Villanova University

Beta-trees: Multivariate histograms with confidence statements

Multivariate histograms are difficult to construct due to the curse of dimensionality. Motivated by k-d trees in computer science, we show how to construct an efficient data-adaptive partition of Euclidean space that possesses the following two properties: with high confidence the distribution from which the data are generated is close to uniform on each rectangle of the partition; and despite the data-dependent construction we can give guaranteed finite sample simultaneous confidence intervals for the probabilities (and hence for the average densities) of each rectangle in the partition. This partition will automatically adapt to the sizes of the regions where the distribution is close to uniform. The methodology produces confidence intervals whose widths depend only on the probability content of the rectangles and not on the dimensionality of the space, thus avoiding the curse of dimensionality. Moreover, the widths essentially match the optimal widths in the univariate setting. The simultaneous validity of the confidence intervals allows this construction, which we call Beta-trees, to be used for various data-analytic purposes, as will be illustrated. 
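
The k-d tree idea that motivates this construction can be sketched as a recursive median split that cycles through the coordinates. This is only the generic partitioning step: the stopping rule `min_leaf`, the function name `kd_partition`, and the use of the data's bounding box are assumptions, and the sketch includes neither the Beta-tree stopping criterion nor its confidence intervals.

```python
import numpy as np


def kd_partition(points, min_leaf=8, depth=0, bounds=None):
    """Split at the median of one coordinate at a time, k-d tree style.

    Returns a list of (lower, upper, count) tuples, one per leaf rectangle.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    if bounds is None:
        bounds = (points.min(axis=0), points.max(axis=0))  # bounding box
    lower, upper = bounds
    if n < 2 * min_leaf:
        return [(lower, upper, n)]
    axis = depth % d                        # cycle through the coordinates
    cut = np.median(points[:, axis])
    left = points[points[:, axis] <= cut]
    right = points[points[:, axis] > cut]
    if len(left) == 0 or len(right) == 0:   # ties can make a split degenerate
        return [(lower, upper, n)]
    left_upper = upper.copy()
    left_upper[axis] = cut
    right_lower = lower.copy()
    right_lower[axis] = cut
    return (kd_partition(left, min_leaf, depth + 1, (lower, left_upper))
            + kd_partition(right, min_leaf, depth + 1, (right_lower, upper)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    leaves = kd_partition(rng.random((200, 3)))
    print(len(leaves), [count for _, _, count in leaves])
```

Because every split is at a sample median, each leaf holds roughly the same number of points, so rectangles shrink where the data are dense and stay large where they are sparse.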

Keywords

curse of dimensionality

simultaneous inference 

Co-Author

Qian Zhao, University of Massachusetts

First Author

Guenther Walther, Stanford University

Presenting Author

Guenther Walther, Stanford University

General Frameworks for Conditional Two-Sample Testing

We address the problem of conditional two-sample testing, which assesses whether two populations share the same distribution after accounting for confounding variables. This problem is critical in applications such as domain adaptation and algorithmic fairness, where valid group comparisons must account for such factors. We establish a theoretical hardness result, showing that significant power against any single alternative is unattainable without appropriate assumptions. To address this, we propose two general frameworks: the first transforms any conditional independence test into a conditional two-sample test while preserving its asymptotic properties, and the second leverages estimated density ratios to compare marginal distributions using existing methods for marginal two-sample testing. We demonstrate these frameworks concretely using classification and kernel-based methods, supported by simulation studies to illustrate their efficacy in finite-sample scenarios. 

Keywords

Conditional independence testing

Covariate shift

Density ratio estimation

Algorithmic fairness

Domain adaptation 

Co-Author

Suman Cha, Yonsei University

First Author

Lee Seongchan, Yonsei University

Presenting Author

Suman Cha, Yonsei University

Consistency of the Shortest Hamiltonian Path Test

The shortest Hamiltonian path test, introduced by Biswas et al. (2014), is a widely used nonparametric, multivariate two-sample test. It is one of only three statistical tests whose test statistic has a tractable null distribution in finite samples. A Hamiltonian path visits each vertex exactly once, and the shortest Hamiltonian path minimizes the total path length. In this test, the shortest Hamiltonian path is constructed from the pooled vertices of two samples, and the test statistic is the number of edges connecting vertices from different samples. The null distribution of this test statistic matches that of the Wald–Wolfowitz runs test, as it counts the number of runs along the shortest Hamiltonian path. The problem of constructing the shortest Hamiltonian path closely resembles the well-known Traveling Salesman Problem. In this work, we develop an approximation algorithm for the shortest Hamiltonian path using concepts from the Traveling Salesman Problem and establish that the test is asymptotically consistent. 
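
The test statistic described above can be illustrated on a tiny pooled sample by brute force. The sketch below enumerates every ordering of the points, so it is feasible only for a handful of observations; it is not the approximation algorithm of the abstract. The function name `shp_cross_edges` and the use of Euclidean distance are assumptions.

```python
from itertools import permutations
import math


def shp_cross_edges(x_sample, y_sample):
    """Count between-sample edges along a brute-force shortest Hamiltonian path."""
    # Pool the two samples, tagging each point with its sample label
    points = [(p, 0) for p in x_sample] + [(p, 1) for p in y_sample]
    best_length, best_order = math.inf, None
    for order in permutations(range(len(points))):
        length = sum(
            math.dist(points[a][0], points[b][0])
            for a, b in zip(order, order[1:])
        )
        if length < best_length:
            best_length, best_order = length, order
    # An edge "crosses" when its endpoints carry different sample labels
    return sum(
        points[a][1] != points[b][1]
        for a, b in zip(best_order, best_order[1:])
    )


if __name__ == "__main__":
    separated = shp_cross_edges([(0.0,), (1.0,), (2.0,)],
                                [(10.0,), (11.0,), (12.0,)])
    interleaved = shp_cross_edges([(0.0,), (2.0,), (4.0,)],
                                  [(1.0,), (3.0,), (5.0,)])
    print(separated, interleaved)  # 1 5
```

The number of runs along the path is the number of cross edges plus one, so well-separated samples give few runs and thoroughly mixed samples give many.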

Keywords

Nonparametric Tests

Graph Based Tests

Multivariate Statistics

Consistency 

First Author

Maxmillian Tjauw

Presenting Author

Maxmillian Tjauw