Using web scraping and machine learning to harness alternative data sources for frame enhancement in the U.S. Census
Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 12:00 PM - 12:25 PM CDT
Refereed
Alternative data sources can provide a resource for augmenting the Census Frame, including for Transitory Locations. We created a web scraper to extract data on United States campgrounds from Good Sam (www.goodsam.com) using the scrapy Python package. We built custom parsers to extract information from each campground webpage, including the total number of sites, the number of available sites, facility type (e.g., state park or RV resort), geospatial coordinates, address, telephone number, the number of tent-only sites, any externally linked website, and the number of recreational facilities both within and near the campground. Additionally, for campgrounds where the total capacity was unlisted, we trained a machine learning model to predict the total number of sites using other data elements scraped from the page. Data from reputable, publicly available websites allow for resource-efficient enhancement of the Census Transitory Location frame through direct scraping and prediction of additional data elements.
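As an illustration only, the following Python sketch shows the kind of per-page parser the abstract describes. It uses the standard library's re module rather than scrapy so that it runs on its own. The HTML snippet, class names, field labels, and the names CampgroundRecord and parse_campground are hypothetical; they do not reflect the actual Good Sam markup or the authors' code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical page fragment; the real Good Sam markup is not described
# in the abstract, so every tag, class, and label here is an assumption.
SAMPLE_HTML = """
<div class="campground">
  <h1>Example Pines RV Resort</h1>
  <span class="facility-type">RV resort</span>
  <span class="total-sites">Total sites: 120</span>
  <span class="tent-sites">Tent-only sites: 15</span>
  <span class="geo" data-lat="35.1234" data-lon="-90.5678"></span>
</div>
"""


@dataclass
class CampgroundRecord:
    """A subset of the data elements listed in the abstract."""
    name: Optional[str]
    facility_type: Optional[str]
    total_sites: Optional[int]
    tent_sites: Optional[int]
    latitude: Optional[float]
    longitude: Optional[float]


def _first(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None else None


def _to_float(value: Optional[str]) -> Optional[float]:
    return float(value) if value is not None else None


def parse_campground(html: str) -> CampgroundRecord:
    """Extract fields from one campground page; missing fields become None."""
    return CampgroundRecord(
        name=_first(r"<h1[^>]*>(.*?)</h1>", html),
        facility_type=_first(r'class="facility-type"[^>]*>(.*?)<', html),
        total_sites=_to_int(_first(r"Total sites:\s*(\d+)", html)),
        tent_sites=_to_int(_first(r"Tent-only sites:\s*(\d+)", html)),
        latitude=_to_float(_first(r'data-lat="(-?\d+(?:\.\d+)?)"', html)),
        longitude=_to_float(_first(r'data-lon="(-?\d+(?:\.\d+)?)"', html)),
    )


record = parse_campground(SAMPLE_HTML)
# A page with no listed capacity yields total_sites=None.
unlisted = parse_campground("<h1>No Capacity Listed Campground</h1>")
```

In a scrapy spider, a function like this could be called on response.text inside the parse callback. Records whose total_sites is None correspond to the unlisted-capacity case that the abstract handles with a prediction model.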
web scraping
machine learning
alternative data sources
frame
transitory locations
Presenting Author
Haley Hunter-Zinck, U.S. Census Bureau
First Author
Haley Hunter-Zinck, U.S. Census Bureau
CoAuthor
Louis Avenilla, U.S. Census Bureau
Target Audience
Mid-Level
Tracks
Practice and Applications