Statistical Innovations for Heterogeneous and Clustered Data

Kai Cooper Chair
The Wharton School of the University of Pennsylvania
 
Monday, Aug 4: 2:00 PM - 3:50 PM
4077 
Contributed Papers 
Music City Center 
Room: CC-202A 

Main Sponsor

Section on Nonparametric Statistics

Co Sponsors

Section on Nonparametric Statistics

Presentations

Adaptive Block-Based Change-Point Detection for Sparse Spatially Clustered Data with Applications in Remote Sensing Imaging

We present a non-parametric change-point detection approach for detecting potentially sparse changes in a time series of high-dimensional observations or non-Euclidean data objects. We target a change in distribution that occurs in a small (unknown) subset of dimensions, where the dimensions may be correlated. Our work is motivated by a remote sensing application where changes occur in small, spatially clustered regions over time. An adaptive block-based change-point detection framework is proposed that accounts for spatial dependencies across dimensions and leverages these dependencies to boost detection power and estimation accuracy. Through simulation studies, we demonstrate that our approach has superior performance in detecting sparse changes for datasets with spatial or local group structures. An application of the proposed method to detect activity, such as new construction, in remote sensing imagery of the Natanz nuclear facility in Iran is presented to demonstrate the method's efficacy.

Keywords

Change-point

Non-parametric

Spatial Dependence

Graph-based Tests

High-dimensional data

Satellite Imagery 

Co-Author(s)

Lynna Chu, Iowa State University
Zhengyuan Zhu, Iowa State University

First Author

Alan Moore, Iowa State University

Presenting Author

Alan Moore, Iowa State University

Hierarchical Conformal Prediction for Clustered Data with Missing Responses

Existing prediction methods for clustered data often depend on strong model assumptions, making them vulnerable to model misspecification. We propose a hierarchical conformal prediction framework for predicting outcomes of new subjects at specific time points or trajectories in clustered data with missing responses, without requiring the specification of the prediction model or within-subject correlations. The idea is to establish marginal prediction for clustered data while utilizing subsampling techniques to accommodate dependency and appropriate weighting to address distribution shifts caused by missing data.
To address complex error distributions, including skewed and multimodal cases, we construct the prediction region using the highest conditional density set of the target distribution. Additionally, we propose an enhanced approach, termed localized prediction, to more effectively adapt to heterogeneous or atypical subjects. This method achieves not only marginal coverage but also local and asymptotic conditional coverage for a given subject within a subset or specific profile, while converging to optimal interval lengths under consistent estimation conditions. 
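
As background for the marginal coverage property mentioned in this abstract, the following is a minimal sketch of ordinary split conformal prediction for independent data. It is not the hierarchical procedure described above: it ignores clustering, missing responses, subsampling, weighting, and density-based regions. The predictor `predict`, the helper `split_conformal_interval`, and the toy data are all assumptions made for illustration.

```python
import math
import random

def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Plain split conformal interval around predict(x_new) (illustrative only)."""
    # Absolute residuals on a held-out calibration set act as conformity scores.
    scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); infinite interval if it exceeds n.
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else math.inf
    center = predict(x_new)
    return center - q, center + q

# Toy data and a hypothetical, already-fitted predictor (both assumptions).
rng = random.Random(0)
predict = lambda x: 2.0 * x
calib_x = [rng.uniform(0, 1) for _ in range(200)]
calib_y = [2.0 * x + rng.gauss(0, 0.5) for x in calib_x]
lower, upper = split_conformal_interval(predict, calib_x, calib_y, x_new=0.3)
```

The interval is symmetric around the point prediction because the score is an absolute residual; a method built on highest-density sets, as in the abstract, would not share that restriction.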

Keywords

Conformal prediction

Conditional coverage

Distribution shift

Marginal prediction

Missing at random

Repeated subsampling 

Co-Author(s)

Huixia Wang, George Washington University
Yanlin Tang, East China Normal University
Yingying Zhang, East China Normal University

First Author

Menghan Yi

Presenting Author

Menghan Yi

Knockoffs for Variable Selection with Nonparametric and Heterogeneous Data

Knockoff variable selection is a powerful method that creates synthetic variables mirroring the correlation structure of observed features, enabling principled false discovery rate control. Existing methods often assume homogeneous data (all numeric or all categorical) or rely on known distributions, assumptions that fail for heterogeneous data with unknown distributions. Moreover, standard measures of variable importance often rely on well-specified outcome models (e.g., linear), making them unsuitable for nonlinear relationships.

We introduce a generalizable knockoff generation procedure based on conditional residuals, handling heterogeneous data with unknown distributions. We further propose an interpretable importance measure, the Mean Absolute Local Derivatives (MALD), which quantifies variable influence for arbitrary outcome functions and can be implemented with random forests or neural networks. Simulation studies show that our method outperforms existing ones, controlling the false discovery rate with superior power. We apply these methods to DNA methylation data of mouse tissue samples to select CpG sites related to age. We provide software implementations in R and Python.
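
The abstract does not define MALD, so the sketch below shows only one plausible reading of the name: averaging the absolute partial derivative of a fitted outcome function over the observed rows, approximated by central finite differences. The function `mean_abs_local_derivative`, the step size `h`, and the toy outcome function `f` are assumptions, and the authors' definition and software may differ.

```python
def mean_abs_local_derivative(f, rows, j, h=1e-4):
    """Average |df/dx_j| over the rows, using central finite differences."""
    total = 0.0
    for row in rows:
        up = list(row)
        up[j] += h
        down = list(row)
        down[j] -= h
        total += abs(f(up) - f(down)) / (2 * h)
    return total / len(rows)

# Hypothetical fitted outcome function: nonlinear in x0, linear in x1, ignores x2.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
rows = [[-1.0, 0.5, 2.0], [0.0, 1.0, -1.0], [2.0, -0.5, 0.3]]
scores = [mean_abs_local_derivative(f, rows, j) for j in range(3)]
```

In this toy case the unused third variable gets a score of zero, while the nonlinear first variable still receives a positive score even though its derivative changes sign across rows.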

Keywords

Variable Selection

Nonparametric

Machine Learning

Wide Data

Knockoffs 

Co-Author

Zhe Fei, University of California, Riverside

First Author

Evan Mason, UC Riverside

Presenting Author

Evan Mason, UC Riverside

Jackknife empirical likelihood for the correlation coefficient with additive distortion measurement

The correlation coefficient is fundamental in advanced statistical analysis. However, traditional methods of calculating correlation coefficients can be biased by confounding variables, which may act in an additive or multiplicative fashion. For the additive model, previous research has proposed residual-based estimation of correlation coefficients, and the empirical likelihood (EL) has been used to construct confidence intervals for the correlation coefficient. With small samples, however, the coverage probability of EL can fall below 90% at the 95% confidence level. We propose new methods of interval estimation for the correlation coefficient using jackknife empirical likelihood, mean jackknife empirical likelihood and adjusted jackknife empirical likelihood. For better performance with small sample sizes, we also propose mean adjusted jackknife empirical likelihood. The simulation results show that mean adjusted jackknife empirical likelihood performs best when sample sizes are as small as 25. Real data analyses are used to illustrate the proposed approach.
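
To make the jackknife empirical likelihood idea concrete, here is a minimal sketch for the plain Pearson correlation: it forms jackknife pseudo-values and evaluates the empirical likelihood ratio statistic for their mean. It omits the distortion adjustment and the mean and adjusted variants discussed in the abstract; all function names and the simulated data are assumptions.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def jackknife_pseudo_values(x, y):
    """V_i = n * r_full - (n - 1) * r_without_i."""
    n = len(x)
    full = pearson(x, y)
    return [n * full - (n - 1) * pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(n)]

def jel_statistic(pseudo, theta, iters=200):
    """-2 log empirical likelihood ratio for the mean of the pseudo-values."""
    z = [v - theta for v in pseudo]
    if min(z) >= 0 or max(z) <= 0:
        return math.inf  # theta lies outside the convex hull of the pseudo-values
    lo, hi = -1.0 / max(z), -1.0 / min(z)
    for _ in range(iters):
        # Bisection on the Lagrange multiplier; the score is decreasing in it.
        mid = (lo + hi) / 2
        if sum(zi / (1 + mid * zi) for zi in z) > 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return 2 * sum(math.log(1 + lam * zi) for zi in z)

# Simulated sample of size 25 (an assumption, chosen to mirror the small-sample setting).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(25)]
y = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in x]
pseudo = jackknife_pseudo_values(x, y)
estimate = sum(pseudo) / len(pseudo)
stat_at_estimate = jel_statistic(pseudo, estimate)
stat_shifted = jel_statistic(pseudo, estimate - 0.3)
```

A nominal 95% interval would collect the values of `theta` whose statistic stays below the chi-square(1) quantile of about 3.84.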

Keywords

Correlation coefficient

Distortion errors

Adjusted jackknife empirical likelihood

Mean jackknife empirical likelihood

Mean adjusted jackknife empirical likelihood

Jackknife empirical likelihood 

Co-Author(s)

Linlin Dai
Yichuan Zhao, Georgia State University

First Author

Da Chen, Georgia State University

Presenting Author

Yichuan Zhao, Georgia State University

On the choice of parameters for the local block bootstrap in the local stationary setting

When dealing with time-varying linear processes, stationary companion processes come in handy for proving various results. However, especially when considering limit distributions, their lack of observability hampers statistical procedures like hypothesis testing. In this case, the so-called local block bootstrap established by Dowla et al. (2013) provides a sound way out. This bootstrap procedure relies on several bootstrap parameters, each of which has a distinct impact on the simulation results. We illustrate the influence of different parameter choices with an extended simulation study using alpha-stable distributions in combination with empirical characteristic functions. The former is a wide class of distributions ensuring the transferability of our results, whereas the latter opens the way to various procedures including independence testing. Additionally, we present a bootstrap central limit theorem allowing for the formulation of bootstrap confidence intervals by the pivotal method without relying on the normal distribution.
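
The role of the bootstrap parameters can be illustrated with a rough sketch of a local block bootstrap: each block of the resampled series is copied from a source block whose starting index lies within a window around the current position. This follows the general idea only and may differ in detail from the procedure of Dowla et al. (2013); the names `block_len` and `window` and the function itself are assumptions.

```python
import random

def local_block_bootstrap(series, block_len, window, rng):
    """Resample a series block by block, drawing each block from nearby time points."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    out = []
    start = 0
    while start < n:
        # Latest admissible source start so that a full block still fits.
        hi = min(n - block_len, start + window)
        # Earliest admissible source start, kept within the series and below hi.
        lo = min(max(0, start - window), hi)
        src = rng.randint(lo, hi)  # both endpoints are inclusive
        out.extend(series[src:src + block_len])
        start += block_len
    return out[:n]

# Toy series equal to its own time index, so locality is easy to inspect.
rng = random.Random(42)
series = list(range(100))
resample = local_block_bootstrap(series, block_len=10, window=15, rng=rng)
```

A larger `window` moves the scheme toward an ordinary block bootstrap, while a smaller one keeps each resampled value closer to its original time point; `block_len` controls how much short-range dependence each copied block preserves.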

Keywords

Local stationarity

Local block bootstrap

Central limit theorem

Nonparametric statistics 

First Author

Carina Beering, Helmut Schmidt University Hamburg

Presenting Author

Carina Beering, Helmut Schmidt University Hamburg

Nonparametric local estimation of the partial area under the receiver operating characteristic curve

We consider estimation of the receiver operating characteristic curve and the ordinal dominance curve. The nonparametric estimation is based on delta-sequences. We also consider estimation of the partial area under the receiver operating characteristic curve and the ordinal dominance curve. This is obtained by local estimation of the delta-sequences. We characterize feasible statistics induced by central limit theory for the estimation procedure. A numerical simulation corroborates the asymptotic theory. 

Keywords

nonparametric estimation

ROC curve

partial area

ODC curve

delta-sequence

local estimation 

Co-Author

Chang Yuan Li, UCSB

First Author

Yoann Potiron

Presenting Author

Yoann Potiron

Modeling Aging Based on Semiparametric Starshaped Mean Equilibrium Life Model: A Bayesian Approach

This study introduces a novel semiparametric regression model based on the starshaped mean equilibrium life (SMEL) function to describe the mean remaining life of aging systems. The SMEL function, exhibiting a decreasing-then-increasing pattern, provides a flexible framework for modeling non-monotonic aging behaviors. Addressing the challenge of non-identifiability of the survival function, we propose a nonparametric testing procedure to validate the starshaped assumption. An adaptive semiparametric MCMC algorithm is developed to estimate regression parameters and select optimal priors, ensuring robust Bayesian regularization. Validated through simulations and real-world applications, the methodology effectively captures complex aging patterns, offering actionable insights for reliability analysis, survival modeling, and decision-making in healthcare, engineering, and actuarial science. This work bridges semiparametric regression, Bayesian inference, and nonparametric testing, advancing the theoretical and computational foundations of aging modeling. 

Keywords

Semiparametric Regression

Mean Equilibrium Life Function

Bayesian Inference

Nonparametric Testing

Aging Modeling

Starshaped Function 

First Author

Mohammad Sepehrifar

Presenting Author

Mohammad Sepehrifar