Knockoffs for Variable Selection with Nonparametric and Heterogeneous Data

Zhe Fei Co-Author
University of California, Riverside
 
Evan Mason First Author
UC Riverside
 
Evan Mason Presenting Author
UC Riverside
 
Monday, Aug 4: 2:35 PM - 2:50 PM
1383 
Contributed Papers 
Music City Center 
Knockoff variable selection is a powerful method that creates synthetic variables mirroring the correlation structure of the observed features, enabling principled false discovery rate control. Existing methods often assume homogeneous data (all numeric or all categorical) or rely on known distributions, assumptions that fail for heterogeneous data with unknown distributions. Moreover, standard measures of variable importance often rely on well-specified outcome models (e.g., linear), making them unsuitable for nonlinear relationships.
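To make the false discovery rate control concrete, the selection step of a knockoff procedure can be sketched as follows. This is a minimal illustration of the standard knockoff+ filter, which the abstract does not confirm is the exact filter used here. It assumes each feature j already has a statistic W[j], for example the importance of the original feature minus the importance of its knockoff, so that large positive values favor the original. The function names are hypothetical.

```python
def knockoff_plus_threshold(W, q=0.1):
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    W : list of knockoff statistics (assumed: original minus knockoff importance).
    q : target false discovery rate.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        # Large negative statistics stand in for false positives; the +1
        # is the conservative correction of the knockoff+ variant.
        false_estimate = 1 + sum(1 for w in W if w <= -t)
        n_selected = max(1, sum(1 for w in W if w >= t))
        if false_estimate / n_selected <= q:
            return t
    # No threshold meets the target level: nothing is selected.
    return float("inf")


def knockoff_select(W, q=0.1):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Illustrative statistics only, not data from the study.
W_example = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, 1.5, 1.2]
selected = knockoff_select(W_example, q=0.2)
```

Because the filter only needs the signs and magnitudes of the statistics, it can sit downstream of any knockoff generator and any importance measure.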

We introduce a generalizable knockoff generation procedure based on conditional residuals that handles heterogeneous data with unknown distributions. We further propose an interpretable importance measure, the Mean Absolute Local Derivatives (MALD), which quantifies variable influence for arbitrary outcome functions and can be implemented with random forests or neural networks. Simulation studies show that our method outperforms existing ones, controlling the false discovery rate while achieving higher power. We apply these methods to DNA methylation data from mouse tissue samples to select CpG sites related to age. We provide software implementations in R and Python.
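The abstract names MALD but does not give its formula. One plausible reading, offered only as an assumption, is the average over observations of the absolute partial derivative of a fitted outcome function with respect to each variable. The sketch below approximates those derivatives with central finite differences for any callable `f`; the function name, the step size `h`, and the finite-difference scheme are illustrative choices, not the authors' implementation. How non-smooth fits such as random forests are handled is not stated in the abstract.

```python
def mald(f, X, h=1e-4):
    """Mean absolute local derivative of f for each variable (assumed definition).

    f : callable mapping a list of p feature values to a number
        (stands in for a fitted outcome model).
    X : list of n observations, each a list of p numeric feature values.
    h : half-width of the central finite difference.
    """
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for x in X:
        for j in range(p):
            # Perturb only coordinate j, holding the others at observed values.
            up, down = list(x), list(x)
            up[j] += h
            down[j] -= h
            totals[j] += abs(f(up) - f(down)) / (2 * h)
    # Average the absolute local slopes over the n observations.
    return [total / n for total in totals]


# Toy outcome function: linear in x0, quadratic in x1, ignores x2.
def toy_model(x):
    return 3 * x[0] + x[1] ** 2


X_toy = [[0.0, 1.0, 5.0], [1.0, -1.0, 2.0], [2.0, 2.0, 0.0]]
scores = mald(toy_model, X_toy)
```

In the toy example the unused third variable scores zero, while the quadratic term receives a positive score even though its derivative changes sign across observations. This is the motivation for averaging absolute values rather than signed derivatives.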

Keywords

Variable Selection

Nonparametric

Machine Learning

Wide Data

Knockoffs 

Main Sponsor

Section on Nonparametric Statistics