CS031 Improving Population Surveys

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 11:55 AM - 1:25 PM CDT
Refereed 
Room: Grand Ballroom C 

Chair

Joshua Finnell, Colgate University

Tracks

Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2023

Presentations

Using web scraping and machine learning to harness alternative data sources for frame enhancement in the U.S. Census

Alternative data sources can provide a resource for augmenting the Census Frame, including for Transitory Locations. We created a web scraper to extract data on United States campgrounds from Good Sam (www.goodsam.com) using the scrapy Python package. We built custom parsers to extract information from each campground webpage, including the total number of sites, total number of available sites, facility type (e.g., state park or RV resort), geospatial coordinates, address, telephone number, the number of tent-only sites, any externally linked website, and the number of recreational facilities both within and near the campground. Additionally, we trained a machine learning model to predict the total number of sites in a campground from other data elements scraped from the page when the total capacity was unlisted. Data from reputable, publicly available websites allow for resource-efficient enhancement of the Census Transitory Location frame via direct scraping and prediction of additional data elements.
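As an illustration of the prediction step only, the following minimal sketch fills in an unlisted site total from another scraped field. It is not the authors' model: the abstract does not name the algorithm or the features used, so the single-feature least-squares fit, the field names (`facilities`, `total_sites`), and the toy records below are all assumptions made for the example.

```python
# Hypothetical scraped records (illustrative values, not real campground data).
# total_sites is None when the campground page did not list total capacity.
records = [
    {"name": "A", "facilities": 2, "total_sites": 20},
    {"name": "B", "facilities": 4, "total_sites": 40},
    {"name": "C", "facilities": 6, "total_sites": 60},
    {"name": "D", "facilities": 5, "total_sites": None},
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_total_sites(records):
    """Return copies of the records with unlisted totals predicted.

    The model is fit only on campgrounds whose total is listed; each
    output record carries an 'imputed' flag so predicted values stay
    distinguishable from scraped ones.
    """
    listed = [r for r in records if r["total_sites"] is not None]
    slope, intercept = fit_line(
        [r["facilities"] for r in listed],
        [r["total_sites"] for r in listed],
    )
    filled = []
    for r in records:
        row = dict(r)
        row["imputed"] = row["total_sites"] is None
        if row["imputed"]:
            # Site counts are non-negative integers, so clamp and round.
            row["total_sites"] = max(0, round(intercept + slope * r["facilities"]))
        filled.append(row)
    return filled


result = impute_total_sites(records)
```

Keeping an explicit `imputed` flag is one way to let downstream frame users separate scraped totals from predicted ones.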

Presenting Author

Haley Hunter-Zinck, U.S. Census Bureau

First Author

Haley Hunter-Zinck, U.S. Census Bureau

CoAuthor

Louis Avenilla, U.S. Census Bureau

Predicting the Need to Recontact in Household Survey Data: A Machine Learning Approach

The Spanish Survey of Household Finances (EFF) is a large-scale survey and a complex statistical operation. Data editing is a major task in the production of survey data: the revision team manually checks the consistency among questions and draws on interviewer comments and audio records to edit the data if necessary. Household interviews are sometimes filled with data omissions and inconsistencies. When this occurs, households are recontacted and re-asked certain parts of the questionnaire. In essence, the manual revision process entails several costs, namely time and measurement error. In this paper, using structured and unstructured survey-generated data, we examine the use of machine learning techniques that allow us to classify interviews whose questionnaires need careful analysis and whose households may need to be recontacted. We find an algorithm, or score function, that predicts such household interviews with relatively high accuracy. Our contribution to the survey data production literature is twofold. First, we provide a way to shorten revision and data production time. Second, we propose a methodology to reduce the time between first and second contact for recontacted households, potentially also reducing measurement error.
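To make the idea of a score function concrete, here is a minimal sketch of a classifier that assigns each interview a recontact score between 0 and 1. It is an assumption-laden illustration, not the method of the paper: the abstract does not specify the algorithm or the features, so the logistic regression fit by gradient descent, the two features (counts of omitted and inconsistent answers), and the toy training data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def recontact_score(features, weights, bias):
    """Score in (0, 1); higher means the interview more likely needs review."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = recontact_score(xi, weights, bias) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical training data: [omitted answers, inconsistent answers],
# with label 1 if the household had to be recontacted and 0 otherwise.
X = [[0, 0], [1, 0], [0, 1], [4, 2], [5, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(X, y)

clean_interview = recontact_score([0, 0], weights, bias)
problem_interview = recontact_score([5, 3], weights, bias)
```

In a workflow like the one described, interviews could be ranked by score so that the revision team examines the highest-scoring questionnaires first; the cutoff for flagging an interview would be a design choice.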

Presenting Author

Nicolás Forteza, Bank of Spain

First Author

Nicolás Forteza, Bank of Spain

CoAuthor

Sandra García Uribe, Bank of Spain