
K–Anonymity–Aware Sequential Sampling Method for Synthetic Data

Presented During: SPEED 1: Data Challenge, Bayesian Analysis, and Statistical Applications, Part 1

TaeWook Kim Speaker

Jeongyoun Ahn Co-Author
KAIST

Changwon Yoon Co-Author
Department of Industrial & Systems Engineering, KAIST

Cheolwoo Park Co-Author
KAIST

Bonwoo Lee Co-Author

Sunday, Aug 2: 3:05 PM - 3:10 PM
2529
Contributed Speed

Thomas M. Menino Convention & Exhibition Center

In sensitive domains such as healthcare, synthetic data can be released in place of the original microdata, with the aim of matching key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates. We propose a k-anonymity-aware sequential sampling approach that generates each synthetic record by sampling its variables one at a time from histogram-based empirical conditional distributions. At each step, the method restricts the candidate values so that the resulting partial pattern (the projection onto the variables sampled so far) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner: variables weakly related to the variable currently being sampled (e.g., as ranked by normalized mutual information) are dropped, while the minimum-frequency requirement is retained. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, and dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
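The sampling loop described above can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation. All names (`nmi`, `matching`, `sample_record`, `min_support`) and the toy data are hypothetical. The sketch further assumes that the data are already discretized, that a context counts as "too sparse" when fewer than `min_support` original records match it, that mutual information is normalized by the geometric mean of the two entropies, and that a record is abandoned (returned as `None`) when no candidate value satisfies the minimum-frequency rule.

```python
import math
import random
from collections import Counter


def nmi(xs, ys):
    """Empirical normalized mutual information, I(X;Y) / sqrt(H(X) * H(Y))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    hx = -sum(c / n * math.log(c / n) for c in px.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    if hx <= 0 or hy <= 0:
        return 0.0  # a constant variable shares no information
    mi = sum(c / n * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())
    return mi / math.sqrt(hx * hy)


def matching(rows, pattern):
    """Original records that agree with `pattern` on every variable it fixes."""
    return [r for r in rows if all(r[v] == val for v, val in pattern.items())]


def sample_record(rows, order, k, min_support, rng):
    """Sample one synthetic record, one variable at a time, in the given order."""
    record = {}
    for var in order:
        # Minimum-frequency rule: keep only values whose extended partial
        # pattern occurs at least k times in the original data.
        full_counts = Counter(r[var] for r in matching(rows, record))
        allowed = sorted(v for v, c in full_counts.items() if c >= k)
        if not allowed:
            return None  # assumption: abandon the record at a dead end

        # Dependence-guided relaxation: while the context is too sparse
        # (assumed criterion: fewer than min_support matching records),
        # drop the already-sampled variable with the lowest NMI to `var`.
        target = [r[var] for r in rows]
        cond = sorted(record, key=lambda v: nmi([r[v] for r in rows], target),
                      reverse=True)
        context = matching(rows, record)
        while cond and len(context) < min_support:
            cond.pop()  # the weakest-related variable is last
            context = matching(rows, {v: record[v] for v in cond})

        # Sample from the (possibly relaxed) empirical conditional histogram,
        # restricted to the allowed values.
        counts = Counter(r[var] for r in context)
        record[var] = rng.choices(allowed, weights=[counts[v] for v in allowed])[0]
    return record


# Toy, made-up data: the last record is unique and should never be reproduced.
rows = (
    [{"age": "30-39", "sex": "F", "dx": "A"}] * 3
    + [{"age": "30-39", "sex": "M", "dx": "A"}] * 2
    + [{"age": "30-39", "sex": "M", "dx": "B"}] * 2
    + [{"age": "40-49", "sex": "F", "dx": "B"}] * 3
    + [{"age": "40-49", "sex": "M", "dx": "C"}]
)
order = ["age", "sex", "dx"]
rng = random.Random(0)
synthetic = [sample_record(rows, order, k=2, min_support=5, rng=rng)
             for _ in range(200)]
```

In this sketch the set of allowed values is always computed from the full partial pattern, so relaxing the conditioning set changes only the sampling weights and never admits a pattern that occurs fewer than k times. A fuller version would also need a policy for dead ends, such as resampling or backtracking, which the abstract does not specify.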

Keywords

Synthetic tabular data

Data Privacy

k-anonymity

Sequential sampling

Normalized mutual information

Main Sponsor

Korean International Statistical Society