Selection bias in big data in official statistics from a practitioner’s point of view

Dan Hedlin Co-Author
Stockholm University
 
Edgar Bueno Co-Author
Stockholm University
 
Martin Hyllienmark First Author
Stockholm University
 
Martin Hyllienmark Presenting Author
Stockholm University
 
Tuesday, Aug 6: 9:35 AM - 9:50 AM
3399 
Contributed Papers 
Oregon Convention Center 
Which is better: a simple random sample (SRS) with nonresponse or a large data set with selection bias?

Selection bias is increasingly problematic in surveys. Inspired by Meng's (2018) paper "Statistical paradises and paradoxes in big data", which highlights the statistical issues that the sheer size of a data set incurs, we simulated sequences of growing populations under two data collection methods: a simple random sample (SRS) with nonresponse and organic data with selection bias ("big data"). The results showed a trade-off between bias and coverage probability driven by the amount of data available. Users of statistics often focus on good point estimates, in which case a large nonprobability data source may be better than an SRS with nonresponse. If, on the other hand, the user wants a reliable confidence interval, a probability sample with missingness may be preferred. Tools for comparing different data sources were investigated and discussed from a practical point of view.
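
The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' simulation design: it uses a single fixed population rather than a sequence of growing ones, and the population distribution, the logistic selection mechanism, the sample sizes, and the function name `simulate` are all assumptions chosen for illustration. Both sources share the same mild selection mechanism, so they have a similar bias, but the naive 95% confidence interval is much narrower for the large source.

```python
import math
import random
import statistics


def simulate(pop_size=10_000, srs_size=400, reps=200, slope=0.1, seed=2024):
    """Compare an SRS with nonresponse to a large self-selected data set.

    All settings are illustrative assumptions, not the authors' design.
    """
    rng = random.Random(seed)
    # Hypothetical finite population of a continuous study variable.
    y = [rng.gauss(50.0, 10.0) for _ in range(pop_size)]
    true_mean = statistics.fmean(y)
    # Assumed mechanism, shared by both sources: the probability of being
    # observed (responding or self-selecting) rises with y on a logistic curve.
    p_obs = [1.0 / (1.0 + math.exp(-slope * (v - 50.0) / 10.0)) for v in y]

    def summarise(draw):
        errors, hits = [], 0
        for _ in range(reps):
            obs = draw()
            est = statistics.fmean(obs)
            # Naive standard error that ignores the selection mechanism.
            se = statistics.stdev(obs) / math.sqrt(len(obs))
            errors.append(est - true_mean)
            hits += abs(est - true_mean) <= 1.96 * se
        return {
            "bias": statistics.fmean(errors),
            "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
            "coverage": hits / reps,
        }

    def srs_with_nonresponse():
        # Draw an SRS of units, then keep only those that respond.
        sampled = rng.sample(range(pop_size), srs_size)
        return [y[i] for i in sampled if rng.random() < p_obs[i]]

    def big_data():
        # Every unit in the population decides for itself whether to appear.
        return [v for v, p in zip(y, p_obs) if rng.random() < p]

    return {"srs": summarise(srs_with_nonresponse), "big": summarise(big_data)}


if __name__ == "__main__":
    for source, summary in simulate().items():
        print(source, {k: round(v, 3) for k, v in summary.items()})
```

Under these assumed settings the large source tends to give the smaller root mean squared error, while its naive interval covers the true mean far less often than the interval from the SRS respondents. Changing `slope`, `srs_size`, or `pop_size` shifts the balance in either direction.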

Keywords

Selection bias

Simulation study

Bias-variance tradeoff

Nonresponse bias

Main Sponsor

Survey Research Methods Section