Tuesday, Aug 5: 8:30 AM - 10:20 AM
4091
Contributed Papers
Music City Center
Room: CC-Davidson Ballroom A2
Main Sponsor
Privacy and Confidentiality Interest Group
Presentations
Technological advancements have enabled the collection of vast amounts of data on human subjects and their interactions, particularly through social media platforms such as Facebook, Twitter, WhatsApp, and TikTok. These platforms facilitate global communication and the sharing of personal information, but their use in research introduces complex ethical challenges related to privacy, consent, and data security. While Big Data and modern technology significantly enhance the efficiency and accuracy of statistical modeling and inference, they also demand careful consideration of these ethical issues. This conference paper examines the ethical complexities inherent in Big Data research involving human subjects, identifies key concerns, and discusses potential solutions, including the adoption of dynamic informed consent and the establishment of regulatory frameworks to ensure responsible and transparent use of social media data.
Keywords
Big Data
Informed Consent
Human Subjects Research
Ethical Challenges
Social Media Data
First Author
Nirajan Bam, Miami University, Ohio, USA
Presenting Author
Nirajan Bam, Miami University, Ohio, USA
An anonymization challenge faced by many surveys is that, prior to or even after the anonymization process, organizations may release publications containing tables built from the raw data. These publications can undo the protections offered by Statistical Disclosure Limitation techniques, such as local suppression, since the tables can be used in subtraction attacks. We present a case study using Pew Research Center's Asian American Survey. Prior to releasing a Public Use File (PUF), Pew created many publications using raw Asian American Survey data. To create a PUF, we used further sub-sampling as our primary form of disclosure protection, since it would protect the PUF from subtraction attacks: a data attacker would not expect a table coming from a sub-sampled PUF to have exactly the same counts as the original data. We devised an experiment in which we drew 70 subsamples from the original responding sample, varying the sample size and the sampling strategy. We then tested the subsamples for both disclosure risk and data utility to find the one with the best risk-utility profile.
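A minimal sketch of the kind of subsampling experiment the abstract describes, assuming the survey responses sit in a pandas DataFrame; the risk proxy (sample uniqueness on quasi-identifiers), the utility proxy (deviation of category proportions from the full sample), and all function names are illustrative, not Pew's actual measures.

```python
# Illustrative sketch (not Pew's actual procedure): draw repeated subsamples
# from the responding sample and score each one for a simple disclosure-risk
# proxy and a simple utility-loss proxy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2025)

def risk_proxy(df, quasi_ids):
    # Fraction of records that are unique on the quasi-identifier combination.
    sizes = df.groupby(quasi_ids, observed=True).size()
    return float((sizes == 1).sum() / len(df))

def utility_proxy(sub, full, col):
    # Largest absolute difference in category proportions vs. the full sample.
    p_sub = sub[col].value_counts(normalize=True)
    p_full = full[col].value_counts(normalize=True)
    return float((p_sub.reindex(p_full.index, fill_value=0) - p_full).abs().max())

def evaluate_subsamples(full, quasi_ids, outcome, rates=(0.5, 0.7, 0.9), n_reps=70):
    # One row per (sampling rate, replicate); pick the subsample with the
    # best risk-utility profile from the resulting table.
    results = []
    for rate in rates:
        for rep in range(n_reps):
            sub = full.sample(frac=rate, random_state=int(rng.integers(1 << 31)))
            results.append({
                "rate": rate,
                "rep": rep,
                "risk": risk_proxy(sub, quasi_ids),
                "utility_loss": utility_proxy(sub, full, outcome),
            })
    return pd.DataFrame(results)
```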
Keywords
Statistical Disclosure Limitation
Data Privacy
Sampling
In social sciences, where small- to medium-scale datasets are common, canonical tasks such as linear regression are ubiquitous. In privacy-aware settings, substantial work has been done on differentially private (DP) linear regression. However, most existing methods focus primarily on point estimation, with limited consideration of uncertainty quantification. At the same time, synthetic data generation (SDG) is gaining importance as a tool to allow replication studies in privacy-aware settings. Yet, current DP linear regression approaches do not readily support SDG. Furthermore, mainstream SDG methods, usually based on machine learning and deep learning models, often require large datasets to train effectively. This limits their applicability to smaller data regimes typical of social science research.
To address these challenges, we propose a novel Gaussian DP linear regression method that enables statistically valid inference by accounting for the noise introduced by the privacy mechanism. We derive a DP bias-corrected regression estimator and its asymptotic confidence interval. We also introduce a synthetic data generation procedure, where running linear regression on the synthetic data is equivalent to the proposed DP linear regression. Our approach is built upon a binning-aggregation strategy, leveraging existing DP binning techniques. It is designed to operate effectively in smaller $d$-dimensional regimes. Experimental results demonstrate that our method achieves statistical accuracy comparable to or better than existing DP linear regression techniques, with particularly notable improvements over those capable of statistical inference.
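A hedged sketch of what a binning-aggregation style DP regression can look like for a single bounded covariate. This is not the authors' bias-corrected estimator or their confidence-interval construction; the clipping bounds, bin count, and Gaussian noise calibration below are illustrative assumptions.

```python
# Sketch of a binning-aggregation DP regression: bin a bounded covariate,
# release Gaussian-noised bin counts and within-bin sums of y, then fit
# weighted least squares on the bin midpoints.
import numpy as np

def dp_binned_regression(x, y, n_bins=20, epsilon=1.0, delta=1e-5,
                         x_range=(0.0, 1.0), y_bound=1.0):
    x = np.clip(x, *x_range)
    y = np.clip(y, -y_bound, y_bound)

    edges = np.linspace(*x_range, n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

    counts = np.bincount(idx, minlength=n_bins).astype(float)
    y_sums = np.bincount(idx, weights=y, minlength=n_bins)

    # Gaussian mechanism: one record changes one count by 1 and one sum by
    # at most y_bound, so the joint L2 sensitivity is sqrt(1 + y_bound**2).
    sens = np.sqrt(1.0 + y_bound ** 2)
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    noisy_counts = counts + np.random.normal(0.0, sigma, n_bins)
    noisy_sums = y_sums + np.random.normal(0.0, sigma, n_bins)

    # Weighted least squares of the per-bin mean of y on the bin midpoint.
    keep = noisy_counts > 1.0
    mids = 0.5 * (edges[:-1] + edges[1:])[keep]
    ybar = noisy_sums[keep] / noisy_counts[keep]
    w = noisy_counts[keep]
    X = np.column_stack([np.ones(mids.size), mids])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * ybar))
    return beta  # (intercept, slope)
```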
Keywords
Differential Privacy
Linear Regression
Synthetic Data
Gaussian Mechanism
Perturbed Histogram
Statistical disclosure control (SDC) seeks to prevent data intended for legitimate analyses from being used to obtain sensitive information about individuals. We introduce a new approach, neural network SDC (NN-SDC), that uses deep learning to preserve privacy. We will focus on microdata (records corresponding to individuals), but the techniques presented may also apply to aggregate data and information retrieval. Existing SDC methods, which are primarily intended for numeric or categorical data, include adding noise, data swapping, and microaggregation. But machine learning and AI applications often involve attributes such as text and images, to which existing methods may not apply. Also, the release of data typically involves multiple goals, including a desire to provide useful data and a need to protect privacy.
NN-SDC first trains a model, then uses that model to produce anonymized data. The training process can account for multiple goals, including privacy protection and utility of the data. NN-SDC can incorporate existing methods while having the potential to preserve confidentiality in novel ways. We argue that NN-SDC generalizes existing approaches and is at least as effective.
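The abstract does not specify an architecture, so the following is only one hypothetical instantiation of the train-then-anonymize idea: a small denoising autoencoder (here in PyTorch) fit to standardized numeric microdata, with the perturbed reconstructions released in place of the original records. The noise scale and layer sizes are illustrative, not calibrated privacy parameters.

```python
# Hypothetical NN-SDC sketch: train a denoising autoencoder on numeric
# microdata, then release reconstructed (perturbed) records.
import torch
import torch.nn as nn

def train_nn_sdc(X, hidden=8, noise_std=0.5, epochs=200, lr=1e-2):
    X = torch.as_tensor(X, dtype=torch.float32)
    d = X.shape[1]
    model = nn.Sequential(
        nn.Linear(d, hidden), nn.ReLU(),
        nn.Linear(hidden, d),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        noisy = X + noise_std * torch.randn_like(X)   # corrupt the inputs
        opt.zero_grad()
        loss = loss_fn(model(noisy), X)               # reconstruct clean records
        loss.backward()
        opt.step()
    return model

def anonymize(model, X, noise_std=0.5):
    # Released data: reconstructions of noise-corrupted originals.
    X = torch.as_tensor(X, dtype=torch.float32)
    with torch.no_grad():
        return model(X + noise_std * torch.randn_like(X)).numpy()
```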
Keywords
Statistical disclosure control
Deep learning
Microdata
Machine learning
AI
Differential privacy
Differential privacy (DP) offers rigorous privacy guarantees but often applies a uniform privacy level across entire datasets, neglecting user preferences and varying attribute sensitivity. We propose a framework incorporating these granularities to enhance the privacy-utility trade-off in DP synthetic data. We introduce multi-dimensional heterogeneous DP (HDP), combining user-dependent and attribute-dependent HDP guarantees, along with a privacy budget allocation policy. We propose and compare a synthetic data generation framework that combines user groups with diverse privacy needs and accommodates attributes with different levels of sensitivity. Additionally, we develop a SoftMax weighting technique that downweights the contribution of highly perturbed privacy groups at small sample sizes by borrowing information from less perturbed groups, improving the utility of the final synthetic data. We run extensive simulation studies and apply the proposed framework to a real-world dataset. The results demonstrate improved utility with heterogeneous DP over uniform DP for synthetic data generation.
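An illustrative sketch, not the authors' framework: user groups carry their own privacy budgets, each group releases a Gaussian-noised mean of a bounded attribute, and a softmax over the (negative) release variances downweights heavily perturbed groups when pooling the released statistics. The budget values, noise calibration, and temperature are assumptions for illustration.

```python
# Sketch of heterogeneous per-group DP releases pooled with softmax weights
# that favor less-perturbed (larger-epsilon) groups.
import numpy as np

def heterogeneous_dp_pooled_mean(groups, epsilons, delta=1e-5, bound=1.0,
                                 temperature=1.0):
    # groups: list of 1-D arrays of a bounded attribute, one array per group
    # epsilons: per-group privacy budgets (smaller epsilon -> more noise)
    released, variances = [], []
    for vals, eps in zip(groups, epsilons):
        vals = np.clip(vals, -bound, bound)
        sens = 2.0 * bound / len(vals)        # L2 sensitivity of the group mean
        sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        released.append(vals.mean() + np.random.normal(0.0, sigma))
        variances.append(sigma ** 2)
    released, variances = np.array(released), np.array(variances)

    # Softmax-style weights: groups whose releases are noisier get less weight.
    logits = -variances / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights @ released), weights
```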
Keywords
Differential privacy
synthetic data
Bayesian
personalized DP
attribute DP
heterogeneous DP
privacy-utility trade-off