43: PRITS framework for investigating and assessing web-scraped datasets for research applications

Tina Lam Co-Author
Monash University
 
Mitchell O'Hara-Wild Co-Author
Monash University
 
Cynthia Huang First Author
 
Cynthia Huang Presenting Author
 
Monday, Aug 4: 2:00 PM - 3:50 PM
2080 
Contributed Posters 
Music City Center 
The PRITS framework addresses the lack of integrated technical and statistical guidance on programmatically collecting data from online sources and on assessing existing web-scraped datasets for specific research uses. The framework covers five stages: Planning, Retrieval, Investigation, Transformation and Summary (PRITS). The 'Planning' stage focuses on problem and context definition, and on sampling design. 'Retrieval' involves the technical execution and automated documentation of web-scraping processes and outputs (i.e. paradata and substantive data). 'Investigation' assesses the content and completeness of the retrieved web response objects. 'Transformation' involves parsing and cleaning the retrieved web data, potentially integrating it with other data, and documenting key decisions such as imputation or harmonisation strategies. Finally, the 'Summary' stage documents any decisions that might materially affect downstream analysis, and describes key properties (i.e. metadata) and limitations of the final web-scraped dataset.
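
As an illustration only, the 'Retrieval' and 'Investigation' stages described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Response`, `fetch_http`, `retrieve` and `completeness`, the choice of paradata fields, and the completeness rule (HTTP 200 with a non-empty body) are all hypothetical and do not come from the PRITS framework itself.

```python
import hashlib
import time
from datetime import datetime, timezone
from typing import Callable, NamedTuple
from urllib.request import urlopen


class Response(NamedTuple):
    """Hypothetical container for a raw web response."""
    status: int
    body: bytes


def fetch_http(url: str) -> Response:
    """Default fetcher using only the standard library."""
    with urlopen(url, timeout=30) as resp:
        return Response(resp.status, resp.read())


def retrieve(url: str, fetch: Callable[[str], Response] = fetch_http) -> dict:
    """Retrieve one page, keeping paradata alongside the substantive data."""
    requested_at = datetime.now(timezone.utc).isoformat()
    start = time.perf_counter()
    try:
        resp = fetch(url)
        status, body, error = resp.status, resp.body, None
    except Exception as exc:  # record the failure instead of dropping it
        status, body, error = None, b"", repr(exc)
    paradata = {  # assumed field set, chosen for illustration
        "url": url,
        "requested_at": requested_at,
        "elapsed_s": time.perf_counter() - start,
        "status": status,
        "n_bytes": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
        "error": error,
    }
    return {"paradata": paradata, "body": body}


def completeness(records: list[dict]) -> float:
    """Share of requests that returned HTTP 200 with a non-empty body."""
    if not records:
        return 0.0
    ok = sum(
        1
        for r in records
        if r["paradata"]["status"] == 200 and r["paradata"]["n_bytes"] > 0
    )
    return ok / len(records)
```

Passing a stand-in `fetch` function lets the documentation logic be exercised without network access; failed requests are kept as records so that the completeness check reflects the full set of attempted retrievals.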

Keywords

internet data

sampling design

web scraping

data quality 


Main Sponsor

Survey Research Methods Section