Population Obfuscation for Data Privacy and a Masking Problem Solved by Optimal Transport

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/26/2023: 10:25 AM - 10:30 AM CDT
Lightning 

Description

Data managers are often charged with sharing data samples that have proprietary or sensitive elements--data entries, variables, or individuals' whole records--whose privacy must be maintained. Meeting these conflicting goals of data access and privacy is a challenging sample obfuscation problem that has been broadly studied from a variety of perspectives. Population obfuscation, by contrast, protects information and features of a whole statistical population of data, the population being represented by an algorithm, formula, model, or sampling plan from which unlimited numbers of data records can be produced. We propose a problem in population obfuscation in which two samples are given: a large sample from a population with a subset of variables that must be masked, and a small sample of masked data. This situation can arise, for example, when archived data are repurposed for new analyses. It is a data augmentation problem in which a small data set--the masked data--is supported by a large data set--the marked data, that is, the large sample marked for masking--from a different but related source. A solution to this problem based on Monge-Kantorovich-formulated optimal transport (OT) is explored. OT finds the unique optimal map, or push-forward operator, that transforms one probability distribution into another. Experiments using earth mover distance to quantify learning error are conducted to determine the effectiveness of the OT solution approach relative to the masked sample size. These experiments involve five factors: covariance and shape of the population marked for masking, number of population variables, choice of variable(s) to be masked, and type (linear or non-linear) of masking map. The experiments show 1) that marked data can effectively augment a limited set of masked data, and 2) that the OT solution's masking error decreases log-log linearly with training data sample size, with a constant log-log slope that is not significantly different from −1/2 in the two-variable case.
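The following is a minimal one-variable sketch of the kind of experiment described above, not the authors' implementation. It assumes a standard normal population, a hypothetical monotone masking map `hidden_mask`, and the one-dimensional special case in which the optimal transport map reduces to quantile matching; all function names are illustrative. It learns a map from a large marked sample and a small masked sample, scores it with the earth mover distance, and fits a log-log slope of error against masked sample size. The slope it prints is specific to this toy setup and is not claimed to reproduce the reported value.

```python
import bisect
import math
import random


def emd_1d(a, b):
    """Earth mover (Wasserstein-1) distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def fit_ot_map_1d(marked, masked):
    """Estimate the monotone 1-D transport map by matching empirical quantiles."""
    src, dst = sorted(marked), sorted(masked)
    n, m = len(src), len(dst)

    def transport(x):
        # Mid-rank empirical CDF of x in the marked sample, clamped to [0, 1].
        u = min(max((bisect.bisect_right(src, x) - 0.5) / n, 0.0), 1.0)
        # Return the masked-sample order statistic at the same quantile level.
        return dst[min(int(u * m), m - 1)]

    return transport


def hidden_mask(x):
    """Hypothetical non-linear masking map (an assumption for illustration)."""
    return math.exp(0.5 * x)


def masking_error(n_masked, n_marked=2000, seed=0):
    """EMD between the OT-masked marked sample and the truly masked marked sample."""
    rng = random.Random(seed)
    marked = [rng.gauss(0.0, 1.0) for _ in range(n_marked)]
    masked = [hidden_mask(rng.gauss(0.0, 1.0)) for _ in range(n_masked)]
    transport = fit_ot_map_1d(marked, masked)
    return emd_1d([transport(x) for x in marked], [hidden_mask(x) for x in marked])


def loglog_slope(ns, errs):
    """Least-squares slope of log(error) against log(sample size)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    sizes = [25, 100, 400, 1600]
    # Average each error over a few seeds to reduce sampling noise.
    errors = [sum(masking_error(n, seed=s) for s in range(10)) / 10 for n in sizes]
    for n, e in zip(sizes, errors):
        print(f"masked sample size {n:5d}: mean EMD error {e:.4f}")
    print(f"fitted log-log slope: {loglog_slope(sizes, errors):.3f}")
```

In more than one variable the transport map no longer reduces to sorting, and a general OT solver would be needed in place of `fit_ot_map_1d`.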

Keywords

obfuscation

data privacy

masking

optimal transport

earth mover distance

data augmentation 

Presenting Author

Angela Folz, University of Colorado Boulder

First Author

Angela Folz, University of Colorado Boulder

CoAuthor(s)

Michael Frey, National Institute of Standards and Technology
Adam Wunderlich, Communications Technology Laboratory, National Institute of Standards and Technology

Target Audience

Mid-Level

Tracks

Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2023