Innovations in Privacy and Confidentiality: Synthetic Data, Differential Privacy, and Statistical Disclosure Control

Ozge Surer, Chair
Miami University
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
4091 
Contributed Papers 
Music City Center 
Room: CC-Davidson Ballroom A2 

Main Sponsor

Privacy and Confidentiality Interest Group

Presentations

Ethical Considerations in Big Data Research Involving Human Subjects

Technological advancements have enabled the collection of vast amounts of data on human subjects and their interactions, particularly through social media platforms such as Facebook, Twitter, WhatsApp, and TikTok. These platforms facilitate global communication and the sharing of personal information, but their use in research introduces complex ethical challenges related to privacy, consent, and data security. While Big Data and modern technology significantly enhance the efficiency and accuracy of statistical modeling and inference, they also demand careful consideration of these ethical issues. This conference paper examines the ethical complexities inherent in Big Data research involving human subjects, identifies key concerns, and discusses potential solutions, including the adoption of dynamic informed consent and the establishment of regulatory frameworks to ensure responsible and transparent use of social media data. 

Keywords

Big Data

Informed Consent

Human Subjects Research

Ethical Challenges

Social Media Data 

First Author

Nirajan Bam, Miami University, Ohio, USA

Presenting Author

Nirajan Bam, Miami University, Ohio, USA

WITHDRAWN: Sub-Sampling as Data Protection: A Case Study of Pew Research Center's Asian American Survey

An anonymization challenge faced by many surveys is that, prior to or even after anonymization, organizations may release publications containing tables built from the raw data. These publications can undo the protections offered by Statistical Disclosure Limitation techniques, such as local suppression, because the tables can be used in subtraction attacks. We present a case study using Pew Research Center's Asian American Survey. Prior to releasing a Public Use File (PUF), Pew created many publications using raw Asian American Survey data. To create a PUF, we used further sub-sampling as our primary form of disclosure protection, since it protects the PUF from subtraction attacks: an attacker would not expect a table drawn from a sub-sampled PUF to match the exact counts of the original data. We devised an experiment wherein we pulled 70 subsamples from the original responding sample, experimenting with different sample sizes and different sampling strategies. We then tested the samples for both disclosure risk and data utility to find the sample with the best risk-utility profile.
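
The experiment lends itself to a simple loop: draw repeated subsamples, score each on risk and utility, and keep the best profile. Below is a minimal Python sketch of that loop; the risk and utility measures (share of sample-unique key combinations, and agreement of marginals with the full data) are illustrative stand-ins, not the metrics actually used in the study.

```python
# Minimal sketch of the sub-sampling experiment: draw repeated subsamples and
# score each on a toy risk/utility profile. The measures below are illustrative
# stand-ins, not Pew/NORC's actual risk and utility metrics.
import numpy as np
import pandas as pd

def evaluate_subsamples(df, frac, n_draws=70, key_vars=None, seed=0):
    rng = np.random.default_rng(seed)
    key_vars = key_vars or list(df.columns)
    results = []
    for i in range(n_draws):
        sub = df.sample(frac=frac, random_state=int(rng.integers(2**32)))
        # Toy risk proxy: fraction of key combinations unique in the subsample.
        risk = (sub.groupby(key_vars).size() == 1).mean()
        # Toy utility proxy: 1 - total variation distance of each marginal.
        utility = np.mean([
            1 - (sub[v].value_counts(normalize=True)
                 .sub(df[v].value_counts(normalize=True), fill_value=0)
                 .abs().sum() / 2)
            for v in key_vars
        ])
        results.append({"draw": i, "risk": risk, "utility": utility})
    return pd.DataFrame(results)  # pick the draw with the best risk-utility mix
```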

Keywords

Statistical Disclosure Limitation

Data Privacy

Sampling 

First Author

Jennifer Taub, NORC at The University of Chicago

Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees

In the social sciences, where small- to medium-scale datasets are common, canonical tasks such as linear regression are ubiquitous. In privacy-aware settings, substantial work has been done on differentially private (DP) linear regression. However, most existing methods focus primarily on point estimation, with limited consideration of uncertainty quantification. At the same time, synthetic data generation (SDG) is gaining importance as a tool to allow replication studies in privacy-aware settings. Yet current DP linear regression approaches do not readily support SDG. Furthermore, mainstream SDG methods, usually based on machine learning and deep learning models, often require large datasets to train effectively. This limits their applicability to the smaller data regimes typical of social science research.
To address these challenges, we propose a novel Gaussian DP linear regression method that enables statistically valid inference by accounting for the noise introduced by the privacy mechanism. We derive a DP bias-corrected regression estimator and its asymptotic confidence interval. We also introduce a synthetic data generation procedure, where running linear regression on the synthetic data is equivalent to the proposed DP linear regression. Our approach is built upon a binning-aggregation strategy, leveraging existing DP binning techniques. It is designed to operate effectively in smaller $d$-dimensional regimes. Experimental results demonstrate that our method achieves statistical accuracy comparable to or better than existing DP linear regression techniques, with particularly notable improvements over those capable of statistical inference. 
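
For readers unfamiliar with DP regression, the sketch below shows the standard Gaussian-mechanism approach of perturbing clipped sufficient statistics and solving the noisy normal equations. It conveys the flavor of the problem only; it is not the binning-aggregation, bias-corrected estimator with valid confidence intervals proposed in this talk.

```python
# Generic Gaussian-mechanism sketch for DP linear regression: perturb clipped
# sufficient statistics (X'X, X'y) and solve the noisy normal equations.
# NOT the authors' method; shown only to illustrate the setting.
import numpy as np

def dp_ols(X, y, clip=1.0, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Rescale rows (and clip the response) so each record's contribution is
    # bounded, which bounds the L2 sensitivity of the sufficient statistics.
    norms = np.maximum(np.linalg.norm(X, axis=1) / clip, 1.0)
    Xc = X / norms[:, None]
    yc = np.clip(y, -clip, clip)
    d = X.shape[1]
    # Symmetric Gaussian noise for X'X, plain Gaussian noise for X'y.
    N = rng.normal(0.0, sigma, (d, d))
    XtX = Xc.T @ Xc + (N + N.T) / np.sqrt(2)
    Xty = Xc.T @ yc + rng.normal(0.0, sigma, d)
    return np.linalg.solve(XtX, Xty)  # noisy OLS estimate (no bias correction)
```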

Keywords

Differential Privacy

Linear Regression

Synthetic Data

Gaussian Mechanism

Perturbed Histogram 

Co-Author

Aleksandra Slavkovic, Pennsylvania State University

First Author

Shurong Lin, Pennsylvania State University

Presenting Author

Shurong Lin, Pennsylvania State University

A Deep Learning Framework for Statistical Disclosure Control

Statistical disclosure control (SDC) seeks to prevent data intended for legitimate analyses from being used to obtain sensitive information about individuals. We introduce a new approach, neural network SDC (NN-SDC), that uses deep learning to preserve privacy. We will focus on microdata (records corresponding to individuals), but the techniques presented may also apply to aggregate data and information retrieval. Existing SDC methods, which are primarily intended for numeric or categorical data, include adding noise, data swapping, and microaggregation. However, machine learning and AI often involve attributes such as text and images, to which existing methods may not apply. Moreover, the release of data typically involves multiple goals, including providing useful data and protecting privacy.

NN-SDC first trains a model, then uses that model to produce anonymized data. The training process can account for multiple goals, including privacy protection and data utility. NN-SDC can incorporate existing methods while having the potential to preserve confidentiality in novel ways. We argue that NN-SDC generalizes existing approaches and is at least as effective.
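
As a rough illustration of the idea (not the authors' architecture, which the abstract does not specify), one can imagine training a network whose loss combines a utility term with a privacy penalty, then releasing its outputs as the anonymized microdata:

```python
# Hypothetical sketch of the NN-SDC idea: train a network whose loss trades
# utility against a privacy penalty, then release its outputs as anonymized
# microdata. The architecture and loss terms are guesses for illustration.
import torch
import torch.nn as nn

class Anonymizer(nn.Module):
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def train(model, X, lam=0.1, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        out = model(X)
        # Utility term: preserve aggregate structure (here, column means).
        utility = ((out.mean(0) - X.mean(0)) ** 2).sum()
        # Privacy term: penalize outputs that sit too close to their source row.
        privacy = torch.relu(1.0 - (out - X).norm(dim=1)).mean()
        loss = utility + lam * privacy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(X).detach()  # anonymized microdata

# Usage on toy data:
# X = torch.randn(200, 5); release = train(Anonymizer(5), X)
```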

Keywords

Statistical disclosure control

Deep learning

Microdata

Machine learning

AI

Differential privacy 

First Author

Patrick Tendick, Federal Reserve

Presenting Author

Patrick Tendick, Federal Reserve

Synthetic Data with Heterogeneous Differential Privacy

Differential privacy (DP) offers rigorous privacy guarantees but often applies a uniform privacy level across an entire dataset, neglecting user preferences and varying attribute sensitivity. We propose a framework incorporating these granularities to enhance the privacy-utility trade-off in DP synthetic data. We introduce multi-dimensional heterogeneous DP (HDP), combining user-dependent and attribute-dependent HDP guarantees, along with a privacy budget allocation policy. We propose and compare synthetic data generation frameworks that combine user groups with diverse privacy needs across attributes with different levels of sensitivity. Additionally, we develop a SoftMax weighting technique that downweights the contribution of highly perturbed privacy groups at small sample sizes, borrowing information from less perturbed groups to improve the utility of the final synthetic data. We run extensive simulation studies and apply our proposed framework to a real-world dataset. The results demonstrate improved utility with heterogeneous DP over uniform DP for synthetic data generation.
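
A toy sketch of the SoftMax-weighting idea: per-group noise scales are mapped through a softmax so that heavily perturbed groups contribute less to pooled estimates. The scoring function and temperature below are assumptions for illustration, not the authors' exact scheme.

```python
# Toy sketch of SoftMax weighting across privacy groups: groups perturbed with
# more noise receive less weight when group-level estimates are pooled.
import numpy as np

def softmax_weights(noise_scales, temperature=1.0):
    scores = -np.asarray(noise_scales, dtype=float) / temperature
    w = np.exp(scores - scores.max())  # numerically stable softmax
    return w / w.sum()

# Three user groups with increasingly strict privacy (hence more noise):
sigmas = [0.5, 1.0, 4.0]
group_means = np.array([2.1, 1.9, 3.5])   # noisy per-group estimates
w = softmax_weights(sigmas)
pooled = float(w @ group_means)           # heavily perturbed group counts less
```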

Keywords

Differential privacy

synthetic data

Bayesian

personalized DP

attribute DP

heterogeneous DP

privacy-utility trade-off

Co-Author

Fang Liu, University of Notre Dame

First Author

Gina Mannino, University of Notre Dame

Presenting Author

Gina Mannino, University of Notre Dame