Evaluating the Disclosure Risk and Analytic Utility of Synthetic Data in a Municipal Health Survey

Wen Qin Deng Co-Author
NYC Department of Health and Mental Hygiene
 
Jingchen Hu Co-Author
Vassar College
 
Tashema Bholanath Co-Author
NYC Department of Health and Mental Hygiene
 
Fangtao He Co-Author
NYC Department of Health and Mental Hygiene
 
Nneka Lundy De La Cruz Co-Author
NYC Department of Health and Mental Hygiene
 
Stephen Immerwahr First Author
NYC Department of Health and Mental Hygiene
 
Stephen Immerwahr Presenting Author
NYC Department of Health and Mental Hygiene
 
Thursday, Aug 8: 11:50 AM - 12:05 PM
3545 
Contributed Papers 
Oregon Convention Center 
Releasing public-use micro-level data files from health surveys holds immense value for science and health policy. However, even after personally identifying information is removed, the privacy of survey respondents may still be compromised. Using a large, population-representative NYC health survey (n=10,271), we identified high-risk observations based on population estimates for combinations of key variables. We compared three solutions for mitigating the risk of re-identification (suppression, synthesis using Classification and Regression Trees, and synthesis via Bayesian models) and assessed their impact on both the disclosure risk and the utility loss of the resulting protected data. While both synthesis methods resulted in slightly higher disclosure risk than suppression, the synthetic datasets preserved a higher level of utility. We will discuss our proposed solutions for avoiding over-protection, which can obscure estimates for underserved and vulnerable groups, and share our experiences with data curators in advancing disclosure risk controls and data sharing in public health.
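
As a rough illustration of the risk-identification step described in the abstract, the sketch below flags records whose combination of key variables corresponds to a small estimated population count. It is a minimal sketch and not the authors' procedure: the function name `flag_high_risk`, the key variables (`age_group`, `borough`), the `weight` field, the threshold of 100, and the toy records are all assumptions made for the example. The abstract does not state which key variables or cutoff were used.

```python
from collections import defaultdict


def flag_high_risk(records, key_vars, weight_var="weight", threshold=100.0):
    """Return indices of records whose key-variable combination has an
    estimated population count below `threshold` (hypothetical cutoff)."""
    # Summing survey weights within each combination of key variables gives
    # an estimate of how many people in the population share that combination.
    pop_est = defaultdict(float)
    for rec in records:
        combo = tuple(rec[v] for v in key_vars)
        pop_est[combo] += rec[weight_var]

    # A record is flagged when its combination is rare in the population.
    return [
        i
        for i, rec in enumerate(records)
        if pop_est[tuple(rec[v] for v in key_vars)] < threshold
    ]


# Toy records (invented for illustration; not survey data).
records = [
    {"age_group": "18-24", "borough": "Bronx", "weight": 800.0},
    {"age_group": "18-24", "borough": "Bronx", "weight": 650.0},
    {"age_group": "65+", "borough": "Staten Island", "weight": 40.0},
    {"age_group": "65+", "borough": "Queens", "weight": 900.0},
]

high_risk = flag_high_risk(records, ["age_group", "borough"], threshold=100.0)
print(high_risk)
```

Under these assumptions, only the flagged records would be candidates for suppression or synthesis, which is one way to limit over-protection of the remaining data.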

Keywords

Health Surveys

Data Privacy Risk

Synthetic Data

Survey Research Methods

Government Statistics 

Main Sponsor

Survey Research Methods Section