Population Obfuscation for Data Privacy and a Masking Problem Solved by Optimal Transport
Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 10:25 AM - 10:30 AM CDT
Lightning
Data managers are often charged with sharing data samples that have proprietary or sensitive elements--data entries, variables, or individuals' whole records--whose privacy must be maintained. Meeting these conflicting goals of data access and privacy is a challenging sample obfuscation problem that has been broadly studied from a variety of perspectives. Population obfuscation, by contrast, protects information and features of a whole statistical population of data, the population being represented by an algorithm, formula, model, or sampling plan from which unlimited numbers of data records can be produced. We propose a problem in population obfuscation in which two samples are given: a large sample from a population with a subset of variables that must be masked (the marked data) and a small sample of data that has already been masked (the masked data). This situation can arise, for example, when archived data are repurposed for new analyses. It is a data augmentation problem in which a small data set, the masked data, is supported by a large data set, the marked data, from a different but related source. We explore a solution to this problem based on Monge-Kantorovich-formulated optimal transport (OT). OT finds the unique optimal map, or push-forward operator, that transforms one probability distribution into another. Experiments using earth mover distance to quantify learning error are conducted to determine the effectiveness of the OT solution approach as a function of the masked sample size. These experiments involve five factors: the covariance and the shape of the population marked for masking, the number of population variables, the choice of variable(s) to be masked, and the type (linear or non-linear) of masking map. The experiments show 1) that marked data can effectively augment a limited set of masked data, and 2) that the OT solution's masking error decreases linearly on a log-log scale with training data sample size, with a constant slope that is not significantly different from −1/2 in the two-variable case.
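As a rough illustration of the setup described in the abstract, the following Python sketch treats the one-variable case only, where the optimal transport map for a convex cost reduces to quantile matching between the two samples. It is not the authors' implementation: the normal population, the linear masking map `true_mask`, the sample sizes, and the helper `fit_quantile_map` are all assumptions made here for illustration. The learning error is measured with `scipy.stats.wasserstein_distance`, which computes the one-dimensional earth mover distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def fit_quantile_map(marked, masked):
    """Estimate a 1-D monotone transport map from marked to masked samples.

    Hypothetical helper: composes the empirical CDF of the marked sample
    with the empirical quantile function of the masked sample.
    """
    marked_sorted = np.sort(marked)
    masked_sorted = np.sort(masked)
    # Probability levels attached to each order statistic (midpoint rule).
    p_marked = (np.arange(marked_sorted.size) + 0.5) / marked_sorted.size
    p_masked = (np.arange(masked_sorted.size) + 0.5) / masked_sorted.size

    def transport(x):
        # Step 1: approximate CDF level of x within the marked sample.
        u = np.interp(x, marked_sorted, p_marked)
        # Step 2: masked-sample quantile at that level.
        return np.interp(u, p_masked, masked_sorted)

    return transport


def true_mask(x):
    # Assumed linear masking map, unknown to the learner in practice.
    return 2.0 * x + 1.0


rng = np.random.default_rng(0)
marked = rng.normal(size=5000)                 # large marked sample
masked_small = true_mask(rng.normal(size=50))  # small masked sample

transport = fit_quantile_map(marked, masked_small)

# Evaluate on fresh marked data against an independent masked reference.
fresh = rng.normal(size=5000)
reference = true_mask(rng.normal(size=5000))
mapped = transport(fresh)
error = wasserstein_distance(mapped, reference)
print(f"earth mover distance with 50 masked records: {error:.3f}")
```

Repeating the last few lines for a range of masked sample sizes and plotting the error against the sample size on log-log axes would give a one-variable analogue of the scaling experiment the abstract describes. The multivariate case needs a general OT solver rather than quantile matching.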
obfuscation
data privacy
masking
optimal transport
earth mover distance
data augmentation
Presenting Author
Angela Folz, University of Colorado Boulder
First Author
Angela Folz, University of Colorado Boulder
CoAuthor(s)
Michael Frey, National Institute of Standards & Technology
Adam Wunderlich, Communications Technology Laboratory, National Institute of Standards and Technology
Target Audience
Mid-Level
Tracks
Practice and Applications