05/26/2023: 11:55 AM - 1:25 PM CDT
Refereed
Room: Grand Ballroom C
Chair
Joshua Finnell, Colgate University
Tracks
Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2023
Presentations
Alternative data sources can provide a resource for augmenting the Census Frame, including for Transitory Locations. We created a web scraper to extract data on United States campgrounds from Good Sam (www.goodsam.com) using the scrapy Python package. We built custom parsers to extract information from each campground webpage, including the total number of sites, the number of available sites, facility type (e.g., state park or RV resort), geospatial coordinates, address, telephone number, the number of tent-only sites, any externally linked website, and the number of recreational facilities both within and near the campground. Additionally, we trained a machine learning model to predict the total number of sites in a campground when the total capacity was unlisted, using other data elements scraped from the page. Data from reputable, publicly available websites allow for resource-efficient enhancement of the Census Transitory Location frame via direct scraping and prediction of additional data elements.
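As a rough illustration of the kind of per-page parser described above, the sketch below pulls a few of the listed fields out of an HTML fragment. It is only a sketch under stated assumptions: the markup, class names, and field labels are hypothetical (the abstract does not describe the Good Sam page structure), and it uses Python's standard `re` module rather than scrapy so that it runs without extra dependencies.

```python
import re
from typing import Optional

# Hypothetical page fragment; the real campground markup is not given in the abstract.
SAMPLE_HTML = """
<div class="campground">
  <h1 class="name">Example Pines RV Resort</h1>
  <span class="facility-type">RV resort</span>
  <span class="total-sites">Total sites: 120</span>
  <span class="tent-sites">Tent-only sites: 15</span>
  <meta name="geo.position" content="44.9778;-93.2650">
</div>
"""


def first_int(pattern: str, text: str) -> Optional[int]:
    """Return the first captured integer, or None when the field is not listed."""
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def parse_campground(html: str) -> dict:
    """Extract a few fields from one campground page (assumed, illustrative layout)."""
    geo = re.search(r'name="geo\.position" content="([-\d.]+);([-\d.]+)"', html)
    ftype = re.search(r'class="facility-type">([^<]+)<', html)
    return {
        "total_sites": first_int(r"Total sites:\s*(\d+)", html),
        "tent_only_sites": first_int(r"Tent-only sites:\s*(\d+)", html),
        "facility_type": ftype.group(1).strip() if ftype else None,
        "latitude": float(geo.group(1)) if geo else None,
        "longitude": float(geo.group(2)) if geo else None,
    }


record = parse_campground(SAMPLE_HTML)
```

In this sketch a missing field is returned as `None`, which marks the records whose total capacity would have to be predicted from the other scraped elements.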
Presenting Author
Haley Hunter-Zinck, U.S. Census Bureau
First Author
Haley Hunter-Zinck, U.S. Census Bureau
CoAuthor
Louis Avenilla, U.S. Census Bureau
The Spanish Survey of Household Finances (EFF) is a large-scale survey and a complex statistical operation. Data editing is a major task in the production of survey data: the revision team manually checks the consistency among questions and draws on interviewer comments and audio records to edit the data if necessary. Household interviews are sometimes filled with data omissions and inconsistencies. When this occurs, households are recontacted and re-asked certain parts of the questionnaire. In essence, the manual revision process entails several costs, namely time and measurement error. In this paper, using structured and unstructured survey-generated data, we examine the use of machine learning techniques to classify interviews whose questionnaires require careful analysis and whose households may need to be recontacted. We find an algorithm, or score function, that predicts such household interviews with relatively high accuracy. Our contribution to the survey data production literature is twofold. First, we provide a way to shorten revision and data production time. Second, we propose a methodology to reduce the time between first and second contact for recontacted households, potentially also reducing measurement error.
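To make the idea of a score function concrete, the sketch below fits a small logistic regression by gradient descent and uses it to score interviews for review. Everything specific here is an assumption for illustration: the abstract does not name the algorithm or the features, so the two features (count of missing answers, count of inconsistencies), the toy training rows, and the choice of logistic regression are all hypothetical stand-ins.

```python
import math

# Made-up toy rows: (n_missing_answers, n_inconsistencies) -> 1 if the interview needed review.
X = [(0, 0), (1, 0), (0, 1), (5, 2), (4, 3), (6, 1)]
y = [0, 0, 0, 1, 1, 1]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def review_score(x, w, b) -> float:
    """Score in (0, 1); higher means the interview is more likely to need review."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


w, b = fit_logistic(X, y)
clean = review_score((0, 0), w, b)   # no omissions or inconsistencies
messy = review_score((5, 3), w, b)   # many omissions and inconsistencies
```

Interviews could then be ranked by `review_score` so that the revision team examines the highest-scoring ones first; where to set the cutoff is a separate choice that this sketch does not address.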
Presenting Author
Nicolás Forteza, Bank of Spain
First Author
Nicolás Forteza, Bank of Spain
CoAuthor
Sandra García Uribe, Bank of Spain