K-Anonymity-Aware Sequential Sampling Method for Synthetic Data

TaeWook Kim Speaker
 
Jeongyoun Ahn Co-Author
KAIST
 
Changwon Yoon Co-Author
Department of Industrial & Systems Engineering, KAIST
 
Cheolwoo Park Co-Author
KAIST
 
Bonwoo Lee Co-Author
 
Sunday, Aug 2: 2:00 PM - 3:50 PM
2529 
Contributed Speed 
In sensitive domains like healthcare, synthetic data can be released in place of the original microdata, with the aim of matching key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates of original records. We propose a k-anonymity-aware sequential sampling approach that generates each synthetic record by sampling variables one at a time from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables already sampled) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
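
The sampling loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`nmi`, `sample_record`), the fixed left-to-right variable order, the choice of normalizing mutual information by the geometric mean of the entropies, and the rule of dropping one variable at a time are all assumptions made for illustration. In particular, the sketch assumes that after relaxation the minimum-frequency check is applied to the pattern over the retained conditioning variables plus the current variable, which is only one possible reading of the abstract; under that reading a complete record produced after relaxation is not guaranteed to match k original records on all variables.

```python
import math
import random
from collections import Counter


def nmi(rows, a, b):
    """Normalized mutual information between categorical columns a and b.

    Assumption: normalized by sqrt(H(a) * H(b)); returns 0.0 if either
    column is constant.
    """
    n = len(rows)
    ca = Counter(r[a] for r in rows)
    cb = Counter(r[b] for r in rows)
    cab = Counter((r[a], r[b]) for r in rows)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in cab.items())
    denom = math.sqrt(entropy(ca) * entropy(cb))
    return mi / denom if denom > 0 else 0.0


def sample_record(rows, k, rng):
    """Sample one synthetic record, variable by variable (hypothetical sketch)."""
    record = []
    for j in range(len(rows[0])):
        # Start by conditioning on every variable sampled so far, ordered
        # from weakest to strongest dependence on the current variable j.
        cond = sorted(range(j), key=lambda i: nmi(rows, i, j))
        while True:
            # Histogram of variable j among original rows matching the context.
            counts = Counter(r[j] for r in rows
                             if all(r[i] == record[i] for i in cond))
            # Keep only values whose pattern occurs at least k times.
            allowed = [(v, c) for v, c in counts.items() if c >= k]
            if allowed:
                break
            if not cond:
                raise ValueError(f"no value of variable {j} occurs {k} times")
            cond.pop(0)  # relax: drop the most weakly related variable
        values = [v for v, _ in allowed]
        weights = [c for _, c in allowed]
        record.append(rng.choices(values, weights=weights)[0])
    return tuple(record)


# Toy data: the pattern ("a", "y") occurs only once in the original rows.
toy = [("a", "x")] * 5 + [("b", "y")] * 5 + [("a", "y")]
synthetic = [sample_record(toy, k=3, rng=random.Random(0)) for _ in range(10)]
```

With this toy data and k = 3, the rare pattern ("a", "y") is never emitted, because its count in the original rows falls below the threshold when the second variable is sampled.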

Keywords

Synthetic tabular data

Data privacy

k-anonymity

Sequential sampling

Normalized mutual information 

Main Sponsor

Korean International Statistical Society