Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data

Presented During: SLDS Student Paper Awards

Jingyi Duan Speaker
Cornell University

Tuesday, Aug 5: 2:45 PM - 3:05 PM
Topic-Contributed Paper Session

Music City Center

In measurement-constrained problems, large datasets may be available, yet we can afford to observe labels for only a small portion of the data. This raises a critical question: which data points are most beneficial to label under a budget constraint? In this paper, we focus on estimating the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter θ in a linear threshold θᵀZ for a continuous variable X such that the discrepancy between whether X exceeds the threshold θᵀZ and a binary outcome Y is minimized. We propose a novel K-step active subsampling algorithm to estimate θ, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator exhibit a phase transition with respect to β ≥ 1, the smoothness of the conditional density of X given Y and Z. For β > (1 + √3)/2, we show that the two-step algorithm yields an estimator with the parametric convergence rate O_p((s log d / N)^{1/2}) in ℓ2 norm. This rate is strictly faster than the minimax optimal rate with N i.i.d. samples drawn from the population. In the other two scenarios, 1 < β ≤ (1 + √3)/2 and β = 1, the estimator from the two-step algorithm is sub-optimal. The former requires running K > 2 steps to attain the same parametric rate, whereas in the latter only a near-parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply it to the analysis of a large diabetes dataset.
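
As a rough illustration of the quantity being minimized, the sketch below computes the empirical disagreement between the indicator that X exceeds θᵀZ and the binary outcome Y. It assumes Y is coded as 0/1 and uses a plain unweighted 0-1 discrepancy; the function name threshold_discrepancy and the toy data are hypothetical, and the abstract does not specify the loss, the regularizer, or the subsampling rule, so none of those are reproduced here.

```python
def threshold_discrepancy(theta, Z, X, Y):
    """Fraction of rows where the indicator 1{X > theta^T Z} disagrees with Y.

    Assumption (not from the abstract): Y is coded 0/1 and every row
    carries equal weight.
    """
    if not (len(Z) == len(X) == len(Y)):
        raise ValueError("Z, X and Y must have the same number of rows")
    if len(X) == 0:
        raise ValueError("at least one observation is required")

    mismatches = 0
    for z, x, y in zip(Z, X, Y):
        # Individualized linear threshold theta^T z for this observation.
        threshold = sum(t * zj for t, zj in zip(theta, z))
        predicted = 1 if x > threshold else 0
        mismatches += int(predicted != y)
    return mismatches / len(X)


# Toy data: the first column of Z is an intercept term.
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
X = [0.5, 2.5, 2.0, 5.0]
Y = [0, 1, 0, 1]

print(threshold_discrepancy([1.0, 1.0], Z, X, Y))  # thresholds 1, 2, 3, 4
print(threshold_discrepancy([0.0, 0.0], Z, X, Y))  # threshold 0 for every row
```

On this toy data the first θ separates the outcomes exactly, while the zero vector predicts 1 for every row and so disagrees with half of the labels. In the measurement-constrained setting described above, Y would be observed only for the subsampled rows, so such a discrepancy could be evaluated on the labeled subset alone.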