Advances in Exact Subsampling Methods with Linear Regression Models

Nicholas Rios (Co-Author), George Mason University

Jiayi Zheng (First Author, Presenting Author)
Sunday, Aug 3: 2:05 PM - 2:20 PM
2045 
Contributed Papers 
Music City Center 
With the dramatic rise of automatic data collection, vast volumes of data are recorded daily. Despite the potential of big data, it is computationally expensive to fit traditional regression models to datasets with billions of rows. This motivates Optimal-Design-Based (ODB) subsampling, which identifies a subset of the data that maximizes an optimality criterion commonly used in experimental design. Existing methods, such as Information-Based Optimal Subdata Selection (IBOSS), focus on the D-optimality criterion, which minimizes the generalized variance of the parameter estimates. While this is helpful for parameter estimation, little attention has been given to criteria that favor model prediction, such as the I-optimality criterion, which minimizes the average prediction variance over the design space. In this paper, we propose new algorithms that identify I-optimal subsamples from massive datasets. These algorithms lead to computationally efficient and reliable prediction for linear regression models, and they are extended to the case of heteroscedastic errors. Case studies illustrate that the proposed methods achieve smaller prediction error than existing methods.
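To make the two criteria concrete, the sketch below evaluates the D- and I-criteria for a candidate subsample under a linear model with an intercept. It is a minimal illustration under stated assumptions, not the proposed selection algorithm; the function names, the random candidate subsample, and the uniform grid of prediction points are assumptions for demonstration only.

```python
# Sketch only: scoring a candidate subsample under D- and I-criteria
# for a linear regression model. Not the authors' algorithm; names,
# sizes, and the prediction grid are illustrative assumptions.
import numpy as np

def d_criterion(X_sub):
    # log-determinant of the information matrix X'X (larger is better)
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet if sign > 0 else -np.inf

def i_criterion(X_sub, X_pred):
    # average prediction variance x'(X'X)^{-1}x over prediction points
    # (smaller is better)
    M_inv = np.linalg.inv(X_sub.T @ X_sub)
    return np.mean(np.einsum("ij,jk,ik->i", X_pred, M_inv, X_pred))

rng = np.random.default_rng(0)
N, n, p = 100_000, 500, 5
# full dataset with an intercept column (synthetic stand-in for big data)
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
# a random candidate subsample, used here only as a baseline for comparison
idx = rng.choice(N, size=n, replace=False)
# assumed uniform grid of prediction points over which the I-criterion averages
X_pred = np.column_stack([np.ones(1000), rng.uniform(-1, 1, size=(1000, p))])

print("D-criterion (log det):", d_criterion(X_full[idx]))
print("I-criterion (avg prediction variance):", i_criterion(X_full[idx], X_pred))
```

In this toy setup, alternative subsamples can be compared by recomputing the two scores; a D-optimal rule prefers the subsample with the largest log-determinant, while an I-optimal rule prefers the one with the smallest average prediction variance.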

Keywords

Experimental Design

Big Data

Subsampling

I-optimality 

Main Sponsor

Section on Statistical Computing