A Bayes-Factor-Guided Approach to Post-Double Selection with Bootstrapped Multiple Imputation

Presented During: Navigating High-Dimensional Landscapes: Innovations in Model Estimation and Predictive Inference

Johannes Bleher Speaker
University of Hohenheim

Claudia Tarantola Co-Author
University of Milano

Wednesday, Aug 5: 10:30 AM - 12:20 PM
Topic-Contributed Paper Session

Valid inference on treatment effects after data-driven model selection has been extensively studied in high-dimensional linear models, most notably through the post-double selection approach of Belloni et al. (2014). However, when missing covariate data are handled through multiple imputation and sampling uncertainty is addressed by bootstrapping, researchers face an additional challenge: different sets of controls are typically selected in each bootstrapped and imputed dataset, and aggregating these sets through the usual union rule can lead to overly dense models. This paper proposes a Bayes-factor-guided procedure for variable selection on bootstrapped, multiply imputed data within the post-double selection framework.
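To make the density problem concrete, the following minimal Python sketch aggregates selected control sets with the union rule. The variable names and the four selected sets are hypothetical, chosen purely for illustration; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical control sets selected in four bootstrapped, imputed datasets.
selected_sets = [
    {"x1", "x2", "x7"},
    {"x1", "x2", "x9"},
    {"x1", "x3"},
    {"x1", "x2", "x12"},
]

# Union rule: keep every control that was selected at least once.
union_model = set().union(*selected_sets)

# Share of datasets in which each control was selected.
counts = Counter(v for s in selected_sets for v in s)
frequency = {v: counts[v] / len(selected_sets) for v in sorted(union_model)}

print(len(union_model), max(len(s) for s in selected_sets))
print(frequency)
```

In this toy input, no single dataset selects more than three controls, yet the union contains six, most of which appear in only one of the four datasets.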

We employ a sequential BOOT-MI strategy in which each iteration consists of a non-parametric bootstrap of the incomplete dataset, random forest multiple imputation, and LASSO-based variable selection. Instead of relying on ad hoc aggregation rules, we approximate the Bayes factor using type I and type II error probabilities (or, for the latter, estimates thereof) and use it as a principled stopping and decision criterion for variable inclusion. This connects the Bayes factor to familiar frequentist quantities such as significance levels and power, and it provides a probabilistic measure of variable relevance across the iterative selection process.
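One way such a criterion could work is sketched below in Python. The abstract does not state the exact formula, so everything here is an assumption: the per-iteration evidence is taken to be the likelihood ratio (1 - beta) / alpha when a variable is selected and beta / (1 - alpha) when it is not, iterations are treated as independent, and the function names, default values, and the threshold of 10 are hypothetical.

```python
import math

def log_bf_update(selected, alpha, power):
    """Log likelihood ratio of one selection outcome, relevant vs. irrelevant.

    Assumed form: power / alpha if selected, (1 - power) / (1 - alpha) if not.
    """
    if selected:
        return math.log(power / alpha)
    return math.log((1.0 - power) / (1.0 - alpha))

def sequential_decision(selections, alpha=0.05, power=0.8, threshold=10.0):
    """Accumulate evidence over iterations; stop once a threshold is crossed.

    `selections` holds one boolean per BOOT-MI iteration (selected or not).
    Returns (decision, iterations used, approximate Bayes factor).
    """
    log_bf = 0.0
    log_t = math.log(threshold)
    for i, selected in enumerate(selections, start=1):
        log_bf += log_bf_update(selected, alpha, power)
        if log_bf >= log_t:
            return "include", i, math.exp(log_bf)
        if log_bf <= -log_t:
            return "exclude", i, math.exp(log_bf)
    return "undecided", len(selections), math.exp(log_bf)

print(sequential_decision([True]))
print(sequential_decision([False, False]))
print(sequential_decision([False, True]))
```

Under these assumed defaults, a single selection gives a factor of 0.8 / 0.05 = 16 and already crosses the inclusion threshold, whereas two consecutive non-selections are needed to cross the exclusion threshold. In practice the power would have to be estimated, as the abstract notes for the type II error.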

The proposed method is evaluated in a Monte Carlo study calibrated to the data-generating processes of Belloni et al. (2014) and Rubin (1987), extended to include hierarchical population structures, homoscedastic and heteroscedastic designs, and several missing data mechanisms (MCAR, MAR, MNAR) and missingness levels. Across 81 simulation scenarios, we assess variable selection performance, treatment effect estimation, and computational efficiency relative to existing BOOT-MI and MI-BOOT approaches. An empirical illustration using survey data demonstrates how the procedure can be applied in practice for treatment effect estimation with incomplete and high-dimensional covariates.

Keywords

Post-double selection

Multiple imputation

Bootstrap aggregation

Bayes factors

High-dimensional treatment effect estimation

Variable selection under missing data