Knockoffs for Variable Selection with Nonparametric and Heterogeneous Data
Zhe Fei
Co-Author
University of California, Riverside
Monday, Aug 4: 2:35 PM - 2:50 PM
1383
Contributed Papers
Music City Center
Knockoff variable selection is a powerful method that creates synthetic variables mirroring the correlation structure of the observed features, enabling principled false discovery rate control. Existing methods often assume homogeneous data (all numeric or all categorical) or rely on known distributions, assumptions that break down with heterogeneous data and unknown distributions. Moreover, standard measures of variable importance often rely on well-specified outcome models (e.g., linear), making them unsuitable for nonlinear relationships.
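To make the false discovery rate control concrete, the sketch below shows the selection step of the standard knockoff+ filter, which is not specific to this abstract. It assumes a vector `W` of feature statistics is already available, where a large positive `W[j]` means the original feature looks more important than its knockoff. The function names and the target level `q` are illustrative choices, not part of the authors' software.

```python
import numpy as np


def knockoff_plus_threshold(W, q=0.1):
    """Return the knockoff+ threshold for statistics W at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf when no candidate satisfies the bound.
    """
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return t
    return np.inf


def knockoff_select(W, q=0.1):
    """Return the indices of features whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)
```

Calling `knockoff_select(W, q=0.2)` returns the indices of the selected features; an empty result means no threshold met the target level.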
We introduce a generalizable knockoff generation procedure based on conditional residuals that handles heterogeneous data with unknown distributions. We further propose an interpretable importance measure, the Mean Absolute Local Derivatives (MALD), which quantifies variable influence for arbitrary outcome functions and can be implemented with random forests or neural networks. Simulation studies show that our method outperforms existing ones, controlling the false discovery rate while achieving superior power. We apply these methods to DNA methylation data from mouse tissue samples to select CpG sites related to age. We provide software implementations in R and Python.
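The abstract does not give a formula for MALD, so the following is only one plausible reading of the name: for each variable, average the absolute partial derivative of a fitted prediction function over the observed samples. The sketch approximates derivatives with central finite differences, which is an assumption made here for illustration; the function name `mald`, the step size `h`, and the `predict` callable are hypothetical, and the authors' R and Python software may estimate local derivatives differently (a small finite-difference step is poorly suited to piecewise-constant learners such as random forests).

```python
import numpy as np


def mald(predict, X, h=1e-3):
    """Approximate mean absolute local derivatives of `predict` at the rows of X.

    predict: callable mapping an (n, p) array to an (n,) array of predictions.
    X:       (n, p) array of numeric features.
    h:       finite-difference step (illustrative default, not from the source).
    Returns a length-p array with one importance score per column.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    scores = np.empty(n_features)
    for j in range(n_features):
        X_up = X.copy()
        X_down = X.copy()
        X_up[:, j] += h
        X_down[:, j] -= h
        # Central difference approximates the partial derivative at each sample.
        local_derivative = (predict(X_up) - predict(X_down)) / (2 * h)
        scores[j] = np.mean(np.abs(local_derivative))
    return scores
```

For a linear function such as `lambda Z: 3 * Z[:, 0] - 2 * Z[:, 1]`, the scores recover the absolute coefficients, with a score near zero for any column the function ignores.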
Variable Selection
Nonparametric
Machine Learning
Wide Data
Knockoffs
Main Sponsor
Section on Nonparametric Statistics