Using web scraping and machine learning to harness alternative data sources for frame enhancement in the U.S. Census

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 12:00 PM - 12:25 PM CDT
Refereed 

Description

Alternative data sources can be used to augment the Census Frame, including for Transitory Locations. We created a web scraper to extract data on United States campgrounds from Good Sam (www.goodsam.com) using the scrapy Python package. We built custom parsers to extract information from each campground webpage, including the total number of sites, total number of available sites, facility type (e.g., state park or RV resort), geospatial coordinates, address, telephone number, the number of tent-only sites, any externally linked website, and the number of recreational facilities both within and near the campground. We also trained a machine learning model to predict the total number of sites in a campground from other data elements scraped from the page when the total capacity was unlisted. Data from reputable, publicly available websites allow for resource-efficient enhancement of the Census Transitory Location frame through direct scraping and prediction of additional data elements.
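
The two steps described above (parse fields from each page, then predict the site total where it is unlisted) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the HTML fragments and class names are hypothetical and do not reflect the actual Good Sam markup, the standard-library `re` module stands in for the scrapy selectors used in the project, and a one-variable least-squares line stands in for the machine learning model, whose type and features the abstract does not state.

```python
import re

# Hypothetical page fragments; the real markup is not given in the abstract.
PAGES = [
    '<div><span class="total-sites">50</span><span class="rec-facilities">5</span></div>',
    '<div><span class="total-sites">100</span><span class="rec-facilities">10</span></div>',
    '<div><span class="total-sites">150</span><span class="rec-facilities">15</span></div>',
    '<div><span class="rec-facilities">8</span></div>',  # total capacity unlisted
]


def parse_campground(html):
    """Extract two integer fields from one page; a missing field becomes None."""
    def grab(css_class):
        match = re.search(r'class="%s">\s*(\d+)\s*<' % re.escape(css_class), html)
        return int(match.group(1)) if match else None

    return {
        "total_sites": grab("total-sites"),
        "rec_facilities": grab("rec-facilities"),
    }


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_total_sites(records):
    """Fill unlisted totals with a prediction and flag which values were predicted.

    Assumes every record has a rec_facilities value; a real pipeline would
    also need to handle pages where the predictor itself is missing.
    """
    known = [r for r in records if r["total_sites"] is not None]
    slope, intercept = fit_line(
        [r["rec_facilities"] for r in known],
        [r["total_sites"] for r in known],
    )
    filled = []
    for r in records:
        if r["total_sites"] is None:
            estimate = round(slope * r["rec_facilities"] + intercept)
            filled.append({**r, "total_sites": estimate, "predicted": True})
        else:
            filled.append({**r, "predicted": False})
    return filled


records = impute_total_sites([parse_campground(page) for page in PAGES])
```

Keeping a `predicted` flag on each record is one way to let downstream frame users distinguish scraped totals from modeled ones.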

Keywords

web scraping

machine learning

alternative data sources

frame

transitory locations 

Presenting Author

Haley Hunter-Zinck, U.S. Census Bureau

First Author

Haley Hunter-Zinck, U.S. Census Bureau

CoAuthor

Louis Avenilla, U.S. Census Bureau

Target Audience

Mid-Level

Tracks

Practice and Applications