Selection bias in big data in official statistics from a practitioner’s point of view

Abstract Number:

3399 

Submission Type:

Contributed Abstract 

Contributed Abstract Type:

Paper 

Participants:

Martin Hyllienmark (1), Dan Hedlin (1), Edgar Bueno (1)

Institutions:

(1) Stockholm University, Stockholm, Sweden

Co-Author(s):

Dan Hedlin  
Stockholm University
Edgar Bueno  
Stockholm University

First Author:

Martin Hyllienmark  
Stockholm University

Presenting Author:

Martin Hyllienmark  
N/A

Abstract Text:

Which is better: a simple random sample (SRS) with nonresponse or a large data set with selection bias?

Selection bias is increasingly problematic in surveys. Inspired by Meng's (2018) paper "Statistical paradises and paradoxes in big data", which highlights the statistical issues that the sheer size of data sets incurs, we simulated sequences of growing populations under two data collection methods: a simple random sample (SRS) with nonresponse and organic data with selection bias ("big data"). The results showed a trade-off between bias and coverage probability that is driven by the amount of data available. Users of statistics often focus on good point estimates, in which case a large nonprobability data source may be better than an SRS with nonresponse. On the other hand, if the user wants a reliable confidence interval, a probability sample with missingness may be preferred. Tools for comparing different data sources were investigated and discussed from a practical point of view.
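The kind of comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' simulation design: the fixed population size, the standard normal study variable, the logistic response and selection propensities and their coefficients, and the names `expit` and `simulate` are all assumptions made for this example. It uses a single population rather than a sequence of growing ones, and it computes the naive mean with a naive 95% confidence interval for both data sources.

```python
import numpy as np


def expit(x):
    """Logistic function, used here for the assumed propensities."""
    return 1.0 / (1.0 + np.exp(-x))


def simulate(N=100_000, n=1_000, reps=200, seed=1):
    """Compare an SRS with nonresponse against a large self-selected data set.

    All settings are illustrative assumptions, not values from the abstract.
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(size=N)          # assumed study variable
    mu = y.mean()                   # target: finite-population mean

    # Assumed mechanisms: both depend on y, so both sources are biased.
    p_respond = expit(0.5 + 0.10 * y)   # response propensity within the SRS
    p_select = expit(0.0 + 0.05 * y)    # self-selection into the big data set

    errors = {"srs": [], "big": []}
    covered = {"srs": [], "big": []}
    for _ in range(reps):
        drawn = rng.choice(N, size=n, replace=False)
        srs = drawn[rng.random(n) < p_respond[drawn]]      # SRS respondents
        big = np.flatnonzero(rng.random(N) < p_select)     # big data units
        for name, s in (("srs", srs), ("big", big)):
            est = y[s].mean()
            se = y[s].std(ddof=1) / np.sqrt(s.size)        # naive standard error
            errors[name].append(est - mu)
            covered[name].append(abs(est - mu) <= 1.96 * se)

    return {
        name: {
            "bias": float(np.mean(errors[name])),
            "rmse": float(np.sqrt(np.mean(np.square(errors[name])))),
            "coverage": float(np.mean(covered[name])),
        }
        for name in errors
    }


if __name__ == "__main__":
    for name, stats in simulate().items():
        print(name, stats)
```

Under these assumed propensities the big data source should give the smaller root mean squared error, while its naive interval should cover the true mean far less often than the interval from the SRS respondents, because the interval narrows with the amount of data while the selection bias does not. Other propensity choices can reverse the ranking on point estimates, so the sketch shows the mechanism of the trade-off and not a general result.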

Keywords:

Selection bias|Simulation study|Bias-variance tradeoff|Non-response bias

Sponsors:

Survey Research Methods Section

Tracks:

Non-probability Samples

Can this be considered for alternate subtype?

Yes
