Navigating High-Dimensional Landscapes: Innovations in Model Estimation and Predictive Inference

Georgia Smits, Chair
Cornell University
 
Valbona Bejleri, Discussant
United States Department of Agriculture – National Agricultural Statistics Service
 
Luca Sartore, Organizer
National Institute of Statistical Sciences
 
David Matteson, Organizer
Cornell University & National Institute of Statistical Sciences
 
Wednesday, Aug 5: 10:30 AM - 12:20 PM
1669 
Topic-Contributed Paper Session 
Traditional regression approaches are not suitable for analyzing high-dimensional data sets. Recent advances in big-data analytics have enabled the sparse selection of informative variables, enhancing the interpretability and predictive accuracy of models for high-dimensional data. However, several challenges in high-dimensional spaces remain unaddressed in the statistical literature. For example, from a frequentist perspective, model selection and its properties have not been fully studied in capture–recapture contexts or when dealing with data from heterogeneous domains. From a Bayesian perspective, approaches to modeling high-dimensional data sets typically focus on stochastic variable selection, adaptive shrinkage, or model averaging; nevertheless, current state-of-the-art Bayesian methods are not fully equipped to simultaneously handle hierarchical population structures, heteroscedastic designs, various missing data mechanisms, and different levels of missingness. Addressing these challenges requires the development of new methods that improve computational efficiency relative to existing techniques. Such innovations are crucial for advances in fields such as econometrics, healthcare, and the social sciences. Overall, this session presents diverse perspectives to advance high-dimensional analytics, providing reliable and effective alternatives for statistical practitioners.

Habtamu Benecha from the United States Department of Agriculture's National Agricultural Statistics Service will begin the session with an advanced variable selection method designed for the US Census of Agriculture. He will highlight iterative approaches for the initialization and successive optimization of model parameters in high-dimensional settings. Ivy Yuexin Zhang from Stanford University will present a δ-invariant method for feature selection, addressing the challenges of recovering a stable signal across high-dimensional heterogeneous domains. Johannes Bleher from the University of Hohenheim will discuss a probabilistic procedure for variable selection when missing covariate data are handled through multiple imputation. He will evaluate the procedure in a Monte Carlo study under several missing data mechanisms and demonstrate its application using survey data. Aliaksandr Hubin from the University of Oslo will introduce the concept of active paths for accurately identifying true covariates in high-dimensional non-linear systems. He will offer a novel perspective on sparse representations of latent binary Bayesian neural networks for pruning over-parameterized models. Finally, Valbona Bejleri from the United States Department of Agriculture's National Agricultural Statistics Service will conclude the session as the discussant. She will summarize the innovations in high-dimensional methods, highlighting future research directions and opportunities for collaboration among statisticians from various backgrounds.

Applied

Yes

Main Sponsor

Section on Statistical Computing

Co Sponsors

Biometrics Section
Government Statistics Section

Presentations

Variable Selection in Capture-Recapture Models for Adjustment Weight Estimation

USDA's National Agricultural Statistics Service (NASS) conducts the Census of Agriculture every five years. Because the Census Mailing List (CML) is incomplete, NASS uses the June Area Survey (JAS) to assess undercoverage. A capture–recapture framework allows for the estimation of weights that adjust for undercoverage, nonresponse, and misclassification. First, the CML and JAS records are linked; then sigmoidal models are fitted to all records. Standard penalized logistic regression may fail to identify the most important covariates, resulting in higher bias and greater uncertainty in model-based estimates. We introduce a novel penalty structure that enables joint variable selection across multiple models and yields improved adjustment weights and unbiased Census totals. Our approach combines advanced penalties with fractional gradient descent to handle high-dimensional settings in which predictors and interactions exceed ten million elements. Applied to 2022 Census data, it isolates critical predictors, reduces bias, and preserves parsimony, offering a scalable solution for accurate and efficient agricultural statistics.
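A minimal sketch of the joint-selection idea: a group-style penalty ties each covariate's coefficients across the linked logistic models, so a covariate is kept or dropped in both models at once. The simulated data, the plain proximal-gradient solver, and all tuning values below are illustrative stand-ins; the talk's specific penalty structure and fractional gradient descent scheme are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 500, 30, 2   # records, covariates, linked models (e.g., CML and JAS)
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.5, 1.0]                   # a few informative covariates
B_true = np.column_stack([beta, 0.8 * beta])  # shared support across both models
Y = (rng.random((n, K)) < 1 / (1 + np.exp(-X @ B_true))).astype(float)

def prox_group(B, t):
    # group soft-thresholding: each covariate's K coefficients are shrunk
    # jointly, so the covariate enters or leaves all models simultaneously
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * B

B = np.zeros((p, K))
step, lam = 0.5, 0.08                         # illustrative tuning values
for _ in range(1000):
    P = 1 / (1 + np.exp(-X @ B))
    grad = X.T @ (P - Y) / n                  # average logistic-loss gradient
    B = prox_group(B - step * grad, step * lam)

selected = np.flatnonzero(np.linalg.norm(B, axis=1) > 0).tolist()
print(selected)
```

With these settings the jointly penalized fit recovers the shared informative covariates while zeroing out the noise predictors in both models.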

Keywords

Bias reduction

Census of Agriculture

Fractional gradient descent

High-dimensional inference

Model parsimony

Regularization methods 

Speaker

Habtamu Benecha

Co-Author(s)

Justin van Wart, USDA NASS
Valbona Bejleri, United States Department of Agriculture – National Agricultural Statistics Service
Luca Sartore, National Institute of Statistical Sciences

δ-Invariant Feature Selection: Stable Signal Recovery Across Heterogeneous Domains

We aim to identify features that predict an outcome of interest across multiple datasets. In particular, we seek to recover a subset of features whose relationship with the outcome generalizes to an unseen future population. We study this feature selection problem under sparse linear models, allowing for shifts in the conditional distribution Y|X across domains. We propose δ-invariant feature selection, which selects features whose estimated coefficients are sign-consistent across datasets and whose strength exceeds a fixed δ-relevance threshold. Through empirical examples and theoretical analysis, we characterize conditions under which the proposed procedure consistently recovers the population δ-invariant feature set and produces a feature set with small out-of-distribution prediction error.
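The selection rule itself is simple to illustrate. In the sketch below, with simulated data and per-domain ordinary least squares standing in for the sparse fits studied in the paper, a feature is kept only if its estimated coefficient has the same sign in every observed domain and its smallest magnitude across domains exceeds δ.

```python
import numpy as np

rng = np.random.default_rng(1)
p, delta = 8, 0.3
# three observed domains: features 0 and 1 have stable (invariant) effects,
# feature 2 has an effect whose sign flips across domains (a shifting signal)
domains = []
for flip in (1.0, -1.0, 1.0):
    n = 300
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[0], beta[1], beta[2] = 1.0, -0.8, 0.9 * flip
    y = X @ beta + rng.normal(scale=0.5, size=n)
    domains.append((X, y))

# per-domain least-squares fits (a stand-in for the sparse linear fits)
coefs = np.array([np.linalg.lstsq(X, y, rcond=None)[0] for X, y in domains])

sign_consistent = np.all(np.sign(coefs) == np.sign(coefs[0]), axis=0)
strong = np.min(np.abs(coefs), axis=0) > delta   # delta-relevance threshold
selected = np.flatnonzero(sign_consistent & strong)
print(selected.tolist())  # → [0, 1]: only the stable, strong features survive
```

Feature 2 is excluded by the sign-consistency filter despite being strong in every single domain, while the pure-noise features fall below the δ threshold.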

Keywords

Distribution shift

Sparse regression

Invariance

Variable selection

Out-of-distribution prediction 

Speaker

Ivy Zhang

A Bayes-Factor-Guided Approach to Post-Double Selection with Bootstrapped Multiple Imputation

Valid inference on treatment effects after data-driven model selection has been extensively studied in high-dimensional linear models, most notably through the post-double selection approach of Belloni et al. (2014). However, when missing covariate data are handled through multiple imputation and sampling uncertainty is addressed by bootstrapping, researchers face an additional challenge: different sets of controls are typically selected in each bootstrapped and imputed dataset, and aggregating these sets through the usual union rule can lead to overly dense models. This paper proposes a Bayes-factor-guided procedure for variable selection on bootstrapped, multiply imputed data within the post-double selection framework.

We employ a sequential BOOT-MI strategy in which each iteration consists of a non-parametric bootstrap of the incomplete dataset, random forest multiple imputation, and LASSO-based variable selection. Instead of relying on ad hoc aggregation rules, we approximate the Bayes factor using type I and type II error probabilities (or, for the latter, estimates thereof) and use it as a principled stopping and decision criterion for variable inclusion. This connects the Bayes factor to familiar frequentist quantities such as significance levels and power, and provides a probabilistic measure of variable relevance across the iterative selection process.
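For simple hypotheses, the Bayes factor for the event "variable selected" is the ratio of the selection probabilities under the two hypotheses, i.e. (1 − β)/α in terms of the type II error β and significance level α. A minimal illustration; the numeric values and the decision threshold below are ours, not the paper's:

```python
def inclusion_bayes_factor(alpha, beta):
    """Bayes factor for the event 'variable selected': the probability of
    selection when the variable is relevant (power, 1 - beta) over the
    probability of selection under the null (significance level alpha)."""
    return (1.0 - beta) / alpha

# illustrative values: alpha = 0.05 and power = 0.80
bf = inclusion_bayes_factor(alpha=0.05, beta=0.20)
print(bf)  # 16.0: selection is 16x more likely when the variable is relevant

# as a decision rule across BOOT-MI iterations, one could estimate how often
# LASSO selects a variable over bootstrap-imputed datasets and include it
# only when the implied evidence clears a threshold (e.g., BF > 3)
print(bf > 3.0)  # True
```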

The proposed method is evaluated in a Monte Carlo study calibrated to the data-generating processes of Belloni et al. (2014) and Rubin (1987), extended to include hierarchical population structures, homoscedastic and heteroscedastic designs, and several missing data mechanisms (MCAR, MAR, MNAR) and missingness levels. Across 81 simulation scenarios, we assess variable selection performance, treatment effect estimation, and computational efficiency relative to existing BOOT-MI and MI-BOOT approaches. An empirical illustration using survey data demonstrates how the procedure can be applied in practice for treatment effect estimation with incomplete and high-dimensional covariates. 

Keywords

Post-double selection

Multiple imputation

Bootstrap aggregation

Bayes factors

High-dimensional treatment effect estimation

Variable selection under missing data 

Speaker

Johannes Bleher, University of Hohenheim

Co-Author

Claudia Tarantola, University of Milan

Bayesian Model Selection and Averaging with Latent Binary Bayesian Neural Networks

Artificial neural networks (ANNs) yield accurate predictions but are often over-parameterized and hard to interpret. Bayesian neural networks (BNNs) quantify uncertainty by treating weights probabilistically, while latent binary BNNs (LBBNNs) handle structural uncertainty via weight sparsification. We extend LBBNNs by allowing covariates to skip to any subsequent layer or be excluded entirely, yielding an input-skip LBBNN (ISLaB) that learns parsimonious structures, from linear to intercept-only models, when suitable. ISLaB achieves extreme compression (over 99–99.9% weight reduction) with minimal loss in accuracy; on MNIST, it attains 97% accuracy and excellent calibration using only 935 weights. Moreover, it identifies relevant covariates, captures nonlinearity, and introduces theoretically grounded, intrinsic local and global model explanations without post hoc tools. The methods are available in the open-source R package LBBNN on CRAN.
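A toy forward pass can convey the input-skip, gated-weight structure: each weight is paired with a binary inclusion variable, and the raw covariates are concatenated to the activations entering every layer, so any covariate can reach any depth or be gated out entirely. This is an illustrative numpy sketch with fixed random gates, not the variational training implemented in the LBBNN package.

```python
import numpy as np

rng = np.random.default_rng(2)

def forward(x, layers):
    # input-skip forward pass: the raw covariates x are concatenated to the
    # activations entering every layer, so a covariate can bypass any depth
    h = np.empty(0)
    for W, Gamma, b in layers:
        z = np.concatenate([h, x])
        h = np.maximum((W * Gamma) @ z + b, 0.0)  # gated weights, ReLU
    return h

p, width, depth = 4, 3, 2
x = rng.normal(size=p)
layers, in_dim = [], 0
for _ in range(depth):
    W = rng.normal(size=(width, in_dim + p))               # real-valued weights
    Gamma = (rng.random((width, in_dim + p)) < 0.2) * 1.0  # binary gates, ~80% off
    layers.append((W, Gamma, np.zeros(width)))
    in_dim = width

out = forward(x, layers)
print(out.shape)  # (3,)
```

In the actual method the gates are latent Bernoulli variables with learned inclusion probabilities, which is what yields the sparsification and the intrinsic variable-importance readout described in the abstract.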

Keywords

Bayesian model averaging

Interpretable deep learning

Stochastic variational Bayes

Uncertainty in deep learning

Structure learning 

Speaker

Aliaksandr Hubin