Evaluating the Disclosure Risk and Analytic Utility of Synthetic Data in a Municipal Health Survey
Wen Qin Deng
Co-Author
NYC Department of Health and Mental Hygiene
Fangtao He
Co-Author
NYC Department of Health and Mental Hygiene
Thursday, Aug 8: 11:50 AM - 12:05 PM
3545
Contributed Papers
Oregon Convention Center
Releasing public-use micro-level data files from health surveys holds immense value for science and health policy. However, even after personally identifying information is removed, the privacy of survey respondents may still be compromised. Using a large NYC population-representative health survey (n=10,271), we identified high-risk observations using population estimates for combinations of key variables. We compared three approaches to mitigating the risk of re-identification (suppression, synthesis using Classification and Regression Trees, and synthesis via Bayesian models) and assessed their impact on both disclosure risk and the loss of utility in the resulting protected data. While both synthesis methods resulted in slightly higher disclosure risks than the suppression method, the synthetic datasets preserved a higher level of utility. We will discuss our proposed solutions for avoiding over-protection, which can obscure estimates for underserved and vulnerable groups, and share our experiences with data curators in advancing disclosure risk controls and data sharing in public health.
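To illustrate the general idea of flagging high-risk observations from population estimates over combinations of key variables, here is a minimal sketch. It is not the authors' procedure: the abstract does not state the key variables, the risk threshold, or the estimator used. The field names (`age_group`, `borough`, `weight`), the threshold of 100, and the toy records are all hypothetical. The sketch sums survey weights within each key-variable cell to estimate how many people in the population share that combination, then flags records in cells whose estimated size falls below the threshold.

```python
from collections import defaultdict


def flag_high_risk(records, key_vars, weight_var="weight", threshold=100.0):
    """Flag records whose key-variable cell has a small estimated population.

    The weighted cell total is used as a rough population estimate;
    a real analysis might use a model-based estimator instead.
    """
    cell_totals = defaultdict(float)
    for rec in records:
        cell = tuple(rec[v] for v in key_vars)
        cell_totals[cell] += rec[weight_var]

    # A record is high risk if few people in the population share its cell.
    return [
        cell_totals[tuple(rec[v] for v in key_vars)] < threshold
        for rec in records
    ]


# Toy data (hypothetical values, not taken from the survey).
records = [
    {"age_group": "18-24", "borough": "Bronx", "weight": 800.0},
    {"age_group": "18-24", "borough": "Bronx", "weight": 750.0},
    {"age_group": "85+", "borough": "Staten Island", "weight": 40.0},
]

flags = flag_high_risk(records, ["age_group", "borough"], threshold=100.0)
```

In this toy example only the third record is flagged, because its cell has an estimated population of 40. Flagged records would then be candidates for whichever protection method is applied, such as suppression or synthesis.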
Health Surveys
Data Privacy Risk
Synthetic Data
Survey Research Methods
Government Statistics
Main Sponsor
Survey Research Methods Section