K-Anonymity-Aware Sequential Sampling Method for Synthetic Data
Changwon Yoon
Co-Author
Department of Industrial & Systems Engineering, KAIST
Sunday, Aug 2: 2:00 PM - 3:50 PM
2529
Contributed Speed
In sensitive domains such as healthcare, synthetic data can be released in place of microdata, with the aim of matching key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates of original records. We propose a k-anonymity-aware sequential sampling approach that generates each synthetic record by sampling variables one at a time from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables sampled so far) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables that are weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
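The sampling loop described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation, and several details are assumptions: records are tuples of categorical values sampled in a fixed column order; the hypothetical parameter `min_context` defines when a conditioning context counts as "too sparse"; normalized mutual information is computed with geometric-mean normalization (one of several common conventions); and a record is restarted when no candidate value satisfies the minimum-frequency requirement. The function names `nmi` and `sample_record` are likewise hypothetical.

```python
import math
import random
from collections import Counter


def nmi(rows, a, b):
    """Normalized mutual information between columns a and b.

    Assumption: MI is divided by sqrt(H(a) * H(b)); returns 0 if a column is constant.
    """
    n = len(rows)
    ca = Counter(r[a] for r in rows)
    cb = Counter(r[b] for r in rows)
    cab = Counter((r[a], r[b]) for r in rows)
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0


def sample_record(rows, k, min_context, rng, max_restarts=100):
    """Sample one synthetic record whose every partial pattern has >= k original matches."""
    d = len(rows[0])
    for _ in range(max_restarts):
        rec = []
        for j in range(d):
            # Original rows that agree with every value sampled so far.
            full = [r for r in rows if all(r[i] == rec[i] for i in range(j))]
            # Candidate values: the extended pattern must still match at least k rows.
            allowed = {v for v, c in Counter(r[j] for r in full).items() if c >= k}
            if not allowed:
                break  # assumed fallback: dead end, restart this record
            # Conditioning variables ordered from weakest to strongest NMI with column j.
            cond = sorted(range(j), key=lambda i: nmi(rows, i, j))
            ctx = full
            # Relax the conditioning set while the context is too sparse.
            while len(ctx) < min_context and cond:
                cond.pop(0)  # drop the most weakly related variable
                ctx = [r for r in rows if all(r[i] == rec[i] for i in cond)]
            # Empirical conditional histogram, restricted to the allowed values.
            counts = Counter(r[j] for r in ctx if r[j] in allowed)
            values = list(counts)
            rec.append(rng.choices(values, weights=[counts[v] for v in values])[0])
        else:
            return tuple(rec)
    raise RuntimeError("no record satisfying the k constraint was found")


# Toy data (made up for illustration): two common patterns and two rare ones.
rows = (
    [("F", "30s", "flu")] * 5
    + [("M", "40s", "cold")] * 5
    + [("F", "40s", "flu")] * 1
    + [("M", "30s", "asthma")] * 2
)
rng = random.Random(0)
synth = [sample_record(rows, k=3, min_context=8, rng=rng) for _ in range(50)]
```

In this sketch the candidate restriction is always checked against the full partial pattern, so relaxation only changes the sampling weights, never which values are admissible. The NMI ranking is recomputed at every step for brevity; a practical version would likely cache it.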
Synthetic tabular data
Data Privacy
k-anonymity
Sequential sampling
Normalized mutual information
Main Sponsor
Korean International Statistical Society