Valid and Efficient Inference for Nonparametric Variable Importance in Two-Phase Studies
Sunday, Aug 3: 5:25 PM - 5:45 PM
Topic-Contributed Paper Session
Music City Center
We consider a common nonparametric regression setting where the data consist of a response variable Y, some easily obtainable covariates X, and a set of costly covariates Z. Prior to large-scale data collection for developing a model to predict Y with (X, Z), we wish to conduct preliminary investigations to infer the importance of Z for predicting Y given X. To achieve this goal, we propose a nonparametric variable importance measure for Z, defined as a population parameter that quantifies the contribution of Z to predicting Y through general loss functions. Considering two-phase data that consist of a large number of observations of (Y, X), with Z measured only in a relatively small subsample, we propose a novel semi-parametric method for estimating the proposed importance measure. Our method accommodates the missing Z in the two-phase data by imputing each individual's contribution to the loss function. Our imputation method, inspired by similarities with semi-supervised learning methods, involves challenging two-stage nonparametric estimation. We establish theoretical results and demonstrate the performance of our method through extensive numerical studies.
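The two-phase idea can be illustrated with a small simulation. The sketch below is not the authors' estimator; it rests on several assumptions made purely for illustration: squared-error loss, linear least-squares fits standing in for the nonparametric estimators, a phase-two subsample drawn completely at random, no cross-fitting, and hypothetical helper names (`ols_fit`, `simulate`, `importance_estimates`). Importance is taken to be the reduction in expected loss when Z is added to X, which equals 4 under the simulated model Y = X + 2Z + noise.

```python
import random

def ols_fit(rows, targets):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented system [A'A | A'y].
    m = [[sum(d[i] * d[j] for d in design) for j in range(p)]
         + [sum(d[i] * t for d, t in zip(design, targets))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(p):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    return [m[i][p] / m[i][i] for i in range(p)]

def ols_predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def features(y, x):
    # Quadratic basis in (Y, X) used to impute the loss contribution.
    return (y, x, y * y, x * x, x * y)

def simulate(n, n_sub, seed):
    """Draw (Y, X, Z) for n individuals and a random phase-two index set."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, z = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append((x + 2 * z + rng.gauss(0, 1), x, z))
    return data, sorted(rng.sample(range(n), n_sub))

def importance_estimates(data, phase2):
    """Return (complete-case, imputation-based) estimates of the importance of Z."""
    sub = [data[i] for i in phase2]
    # Stage 1: reduced predictor f(X) from everyone, full predictor g(X, Z)
    # from the phase-two subsample, the only place Z is treated as observed.
    f = ols_fit([(x,) for _, x, _ in data], [y for y, _, _ in data])
    g = ols_fit([(x, z) for _, x, z in sub], [y for y, _, _ in sub])
    # Per-individual loss reduction from adding Z (phase two only).
    d = [(y - ols_predict(f, (x,))) ** 2 - (y - ols_predict(g, (x, z))) ** 2
         for y, x, z in sub]
    complete_case = sum(d) / len(d)
    # Stage 2: regress the loss reduction on (Y, X), then average the
    # imputed contributions over all individuals, using only their (Y, X).
    h = ols_fit([features(y, x) for y, x, _ in sub], d)
    imputed = sum(ols_predict(h, features(y, x)) for y, x, _ in data) / len(data)
    return complete_case, imputed
```

Calling `importance_estimates(*simulate(5000, 500, seed=0))` returns both estimates. In this toy model the imputation-based version borrows strength from the 4,500 individuals without Z, so it would be expected to vary less across seeds than the complete-case average.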