Advances in Exact Subsampling Methods with Linear Regression Models

Nicholas Rios (Co-Author), George Mason University

Jiayi Zheng (First Author, Presenting Author)
Sunday, Aug 3: 2:05 PM - 2:20 PM
2045 
Contributed Papers 
Music City Center 
With the dramatic rise of automatic data collection, vast volumes of data are recorded daily. Despite the potential of big data, it is computationally expensive to fit traditional regression models to datasets with billions of rows. This motivates Optimal-Design-Based (ODB) subsampling, which identifies a subset of the data that maximizes an optimality criterion commonly used in experimental design. Existing methods, such as Information-Based Optimal Subdata Selection (IBOSS), focus on the D-optimality criterion, which minimizes the generalized variance of the parameter estimates. While this is helpful for parameter estimation, little attention has been given to criteria that favor model prediction, such as the I-optimality criterion, which minimizes the average prediction variance over the design space. In this paper, we propose new algorithms that identify I-optimal subsamples from massive datasets. These algorithms lead to computationally efficient and reliable prediction for linear regression models, and they are extended to the case of heteroscedastic errors. Case studies illustrate that the proposed methods achieve smaller prediction error than existing methods.
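To make the two criteria concrete, the sketch below evaluates the D- and I-criteria for a candidate subsample under a linear model with an intercept. It is a minimal illustration under stated assumptions, not the proposed selection algorithm; the function names, the random candidate subsample, and the uniform grid of prediction points are assumptions for demonstration only.

```python
# Sketch only: scoring a candidate subsample under D- and I-criteria
# for a linear regression model. Not the authors' algorithm; names,
# sizes, and the prediction grid are illustrative assumptions.
import numpy as np

def d_criterion(X_sub):
    # log-determinant of the information matrix X'X (larger is better)
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet if sign > 0 else -np.inf

def i_criterion(X_sub, X_pred):
    # average prediction variance x'(X'X)^{-1}x over prediction points
    # (smaller is better)
    M_inv = np.linalg.inv(X_sub.T @ X_sub)
    return np.mean(np.einsum("ij,jk,ik->i", X_pred, M_inv, X_pred))

rng = np.random.default_rng(0)
N, n, p = 100_000, 500, 5
# full dataset with an intercept column (synthetic stand-in for big data)
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
# a random candidate subsample, used here only as a baseline for comparison
idx = rng.choice(N, size=n, replace=False)
# assumed uniform grid of prediction points over which the I-criterion averages
X_pred = np.column_stack([np.ones(1000), rng.uniform(-1, 1, size=(1000, p))])

print("D-criterion (log det):", d_criterion(X_full[idx]))
print("I-criterion (avg prediction variance):", i_criterion(X_full[idx], X_pred))
```

In this toy setup, alternative subsamples can be compared by recomputing the two scores; a D-optimal rule prefers the subsample with the largest log-determinant, while an I-optimal rule prefers the one with the smallest average prediction variance.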

Keywords

Experimental Design

Big Data

Subsampling

I-optimality 

Main Sponsor

Section on Statistical Computing