Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees

Aleksandra Slavkovic Co-Author
Pennsylvania State University
 
Shurong Lin First Author
Pennsylvania State University
 
Shurong Lin Presenting Author
Pennsylvania State University
 
Tuesday, Aug 5: 9:05 AM - 9:20 AM
1423 
Contributed Papers 
Music City Center 
In the social sciences, where small- to medium-scale datasets are common, canonical tasks such as linear regression are ubiquitous. In privacy-aware settings, substantial work has been done on differentially private (DP) linear regression. However, most existing methods focus on point estimation and give limited consideration to uncertainty quantification. At the same time, synthetic data generation (SDG) is gaining importance as a tool for enabling replication studies in privacy-aware settings, yet current DP linear regression approaches do not readily support SDG. Furthermore, mainstream SDG methods, usually based on machine learning and deep learning models, often require large datasets to train effectively, which limits their applicability to the smaller data regimes typical of social science research.
To address these challenges, we propose a novel Gaussian DP linear regression method that enables statistically valid inference by accounting for the noise introduced by the privacy mechanism. We derive a DP bias-corrected regression estimator and its asymptotic confidence interval. We also introduce a synthetic data generation procedure such that running linear regression on the synthetic data is equivalent to the proposed DP linear regression. Our approach is built on a binning-aggregation strategy that leverages existing DP binning techniques, and it is designed to operate effectively in low-dimensional settings (small $d$). Experimental results demonstrate that our method achieves statistical accuracy comparable to or better than that of existing DP linear regression techniques, with particularly notable improvements over those that support statistical inference.
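
As a rough illustration of the binning-aggregation idea, and not the authors' algorithm, the sketch below perturbs a two-dimensional histogram with Gaussian noise, expands the noisy counts into synthetic records at the bin midpoints, and checks that ordinary least squares on the synthetic records matches count-weighted least squares on the midpoints. It makes several assumptions that the abstract does not state: "Gaussian DP" is read as the mu-GDP framework, neighboring datasets differ by adding or removing one record (so the histogram has L2 sensitivity 1 and noise with standard deviation 1/mu gives mu-GDP), the data lie in a publicly known range [0, 1], and there is a single covariate. All function names are hypothetical, and the bias correction and confidence interval described in the abstract are omitted.

```python
import numpy as np


def noisy_histogram(x, y, bins, mu, rng):
    """Gaussian-perturbed 2D histogram of (x, y) on the assumed public range [0, 1]^2."""
    counts, x_edges, y_edges = np.histogram2d(
        x, y, bins=bins, range=[[0.0, 1.0], [0.0, 1.0]]
    )
    # Assumption: add/remove-one neighbors, so L2 sensitivity is 1 and
    # Gaussian noise with standard deviation 1 / mu yields mu-GDP.
    sigma = 1.0 / mu
    noisy = counts + rng.normal(0.0, sigma, size=counts.shape)
    # Rounding and clipping are post-processing and do not affect privacy.
    noisy = np.clip(np.rint(noisy), 0, None).astype(int)
    x_mid = 0.5 * (x_edges[:-1] + x_edges[1:])
    y_mid = 0.5 * (y_edges[:-1] + y_edges[1:])
    return noisy, x_mid, y_mid


def synthesize(noisy, x_mid, y_mid):
    """Repeat each bin midpoint as many times as its noisy count."""
    gx, gy = np.meshgrid(x_mid, y_mid, indexing="ij")
    reps = noisy.ravel()
    return np.repeat(gx.ravel(), reps), np.repeat(gy.ravel(), reps)


def least_squares(x, y, w=None):
    """(Weighted) least squares for y = b0 + b1 * x; returns [b0, b1]."""
    X = np.column_stack([np.ones_like(x), x])
    if w is None:
        w = np.ones_like(x)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)


# Toy data, clipped so that it lies in the assumed public range.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)
y = np.clip(0.2 + 0.5 * x + rng.normal(0.0, 0.1, n), 0.0, 1.0)

noisy, x_mid, y_mid = noisy_histogram(x, y, bins=10, mu=1.0, rng=rng)
x_syn, y_syn = synthesize(noisy, x_mid, y_mid)

# Regression on the synthetic records ...
beta_syn = least_squares(x_syn, y_syn)

# ... agrees with count-weighted regression on the bin midpoints.
gx, gy = np.meshgrid(x_mid, y_mid, indexing="ij")
beta_wls = least_squares(gx.ravel(), gy.ravel(), w=noisy.ravel().astype(float))
```

The agreement between `beta_syn` and `beta_wls` is the simple sense in which fitting a regression to released synthetic data can coincide with a private estimator computed from noisy bin counts. Both estimates in this sketch still carry bias from binning and from the added noise, which is the issue a bias-corrected estimator is meant to address.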

Keywords

Differential Privacy

Linear Regression

Synthetic Data

Gaussian Mechanism

Perturbed Histogram 

Main Sponsor

Privacy and Confidentiality Interest Group