Methods and Considerations for Using Differential Privacy in Government Statistics

Chair

Zachary Terner, The MITRE Corporation

Organizers

Mikaela Meyer
Zachary Terner, The MITRE Corporation

Monday, Aug 4: 2:00 PM - 3:50 PM
Session 0633, Invited Paper Session
Music City Center, Room: CC-101D


Main Sponsor

Government Statistics Section

Co-Sponsors

Survey Research Methods Section

Presentations

Slowly Scaling Per-Record Differential Privacy

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data.
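As a rough numerical illustration of this accounting (a toy sketch, not the authors' mechanism; all numbers are invented), consider releasing a sum with Laplace noise of scale b: record i's per-record privacy loss is bounded by |x_i|/b, so the guarantee degrades linearly in the record's influence.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sum(x, b):
    """Release a sum protected with Laplace noise of scale b."""
    return x.sum() + rng.laplace(scale=b)

# Incomes with one large outlier; a record's influence on the sum is |x_i|.
incomes = np.array([30_000, 45_000, 52_000, 60_000, 2_000_000])
b = 100_000  # Laplace noise scale

print("noisy sum:", round(noisy_sum(incomes, b)))

# For a sum released with Laplace(b) noise, record i's per-record privacy
# loss is eps_i = |x_i| / b -- i.e., linear in the record's influence.
for x in incomes.tolist():
    print(f"income {x:>9,}: per-record epsilon = {x / b:.2f}")
```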

We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility. 
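A minimal sketch of the contrast between the two regimes, using an invented logarithmic guarantee shape purely for comparison (the paper's actual mechanisms and constants differ):

```python
import numpy as np

influence = np.logspace(2, 7, 6)  # record influence from 1e2 to 1e7
b = 1e4                           # reference noise scale

eps_linear = influence / b         # linear degradation, as with Laplace on a raw sum
eps_log = np.log1p(influence / b)  # a hypothetical guarantee growing logarithmically

for infl, e_lin, e_log in zip(influence, eps_linear, eps_log):
    print(f"influence {infl:>12,.0f}: linear eps = {e_lin:>8.2f}, log eps = {e_log:.2f}")
```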

Co-Author(s)

Brian Finley, U.S. Census Bureau
Anthony Caruso, U.S. Census Bureau
Justin Doty, U.S. Census Bureau
Ashwin Machanavajjhala, Tumult Labs
David Pujol, Tumult Labs
William Sexton, Tumult Labs
Zachary Terner, The MITRE Corporation

Speaker

Mikaela Meyer

Best Linear Unbiased Estimate from Privatized Histograms

In differential privacy (DP) mechanisms, it can be beneficial to release "redundant" outputs, in the sense that a quantity can be estimated by combining different combinations of privatized values. Indeed, this structure is present in the DP 2020 Decennial Census products published by the U.S. Census Bureau. With this structure, the DP output can be improved by enforcing self-consistency (i.e., estimators obtained by combining different values result in the same estimate), and we show that the minimum-variance processing is a linear projection. However, standard projection algorithms are too computationally expensive in terms of both memory and execution time for applications such as the Decennial Census. We propose the Scalable Efficient Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two-step process of aggregation and differencing that 1) enforces self-consistency through a linear and unbiased procedure, 2) is computationally and memory efficient, 3) achieves the minimum-variance solution under certain structural assumptions, and 4) is empirically shown to be robust to violations of these structural assumptions. We propose three methods of calculating confidence intervals from our estimates, under various assumptions. We apply SEA BLUE to two 2010 Census demonstration products, illustrating its scalability and validity.
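To make the "redundant outputs" idea concrete, here is a minimal sketch with toy data (Gaussian noise is used only for simplicity; the Census products use different noise, and SEA BLUE avoids the explicit projection shown here): two histogram cells and their total are privatized separately, and the BLUE is the generalized-least-squares projection onto the self-consistent subspace.

```python
import numpy as np

rng = np.random.default_rng(1)

# True histogram cells and their total (the "redundant" quantity).
truth = np.array([120.0, 80.0])
y_true = np.concatenate([truth, [truth.sum()]])

# Privatize each release with independent noise of known scale.
sigma = np.array([10.0, 10.0, 5.0])  # std of [cell 1, cell 2, total]
y = y_true + rng.normal(scale=sigma)

# Observations satisfy y ~ A @ theta for theta = (cell 1, cell 2).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# BLUE = generalized least squares with weights Sigma^{-1};
# the resulting estimates are self-consistent by construction.
W = np.diag(1.0 / sigma**2)
theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

print("noisy releases:", y.round(1))
print("BLUE cells:", theta.round(1), "| implied total:", theta.sum().round(1))
```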

Co-Author(s)

Adam Edwards, The MITRE Corporation
Paul Bartholomew, The MITRE Corporation
Andrew Sillers, The MITRE Corporation

Speaker

Jordan Awan, Purdue University

WITHDRAWN Exploring Statistical Disclosure Limitation Techniques for Agricultural Data Using Robust Metrics for Risk and Utility

USDA's National Agricultural Statistics Service (NASS) uses the Census of Agriculture (CoA), surveys, and information from other sources to produce official statistics. To minimize disclosure risks and maintain the analytical validity of the published data, NASS uses a cell suppression approach for its disclosure program. Research to improve the statistical disclosure limitation (SDL) program is ongoing at NASS. It is crucial to develop robust evaluation metrics for disclosure risk and information loss to identify and select an optimal SDL method that protects agricultural data while preserving their statistical validity. Although measures of utility are easier to define, the literature on assessing disclosure risk is sparse. The contribution of this work is the introduction of robust metrics developed for both risk and utility. Results obtained from different privacy protection approaches applied to the 2017 CoA are compared based on both their utility and risk. In addition, a Pareto front is developed based on these measures to identify an SDL technique for the NASS disclosure program. Results from these analyses and some final remarks are discussed.

Key Words: Privacy evaluation, Disclosure limitation, Pareto front, Utility-risk tradeoff, Robustness
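A minimal sketch of the Pareto-front selection idea, with invented risk and utility scores (the metrics developed in this work are far richer):

```python
# Candidate SDL methods with invented (risk, utility) scores in [0, 1].
candidates = {
    "cell suppression": {"risk": 0.10, "utility": 0.60},
    "rounding":         {"risk": 0.30, "utility": 0.75},
    "noise addition":   {"risk": 0.20, "utility": 0.80},
    "synthetic data":   {"risk": 0.15, "utility": 0.70},
}

def on_pareto_front(name, cands):
    """A method is Pareto-optimal if no other method is at least as good
    on both axes (lower risk, higher utility) and strictly better on one."""
    r, u = cands[name]["risk"], cands[name]["utility"]
    return not any(
        c["risk"] <= r and c["utility"] >= u and (c["risk"] < r or c["utility"] > u)
        for other, c in cands.items()
        if other != name
    )

front = [name for name in candidates if on_pareto_front(name, candidates)]
print("Pareto-optimal SDL methods:", front)  # dominated methods are excluded
```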
 

Co-Author(s)

Luca Sartore, National Institute of Statistical Sciences
Valbona Bejleri, United States Department of Agriculture – National Agricultural Statistics Service

Differential Privacy and the Survey Data Pipeline

The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing the approach in practice is challenging, especially when it comes to survey data. In this talk, I will focus on the fact that the production of survey data is a complex multistage process and discuss its implications for DP. Specifically, I will illustrate that data custodians willing to adopt DP for their surveys need to address two important questions: first, at what point in the pipeline should the DP mechanism start? And second, which of the earlier stages of the data pipeline should be considered invariant, i.e., treated as fixed, by DP? I will highlight the implications of these decisions and offer guidance on which settings statistical agencies should adopt when implementing DP for their surveys.
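To illustrate the "invariant" question (a toy sketch, not the speaker's framework): if a total from an earlier pipeline stage is treated as fixed, noisy cell counts can be post-processed to respect it exactly, at the cost of releasing the total itself without protection.

```python
import numpy as np

rng = np.random.default_rng(2)

# Survey counts by category; suppose the overall total is declared
# invariant (published exactly) while the cells receive DP noise.
cells = np.array([500.0, 300.0, 200.0])
invariant_total = cells.sum()

noisy = cells + rng.laplace(scale=20.0, size=cells.size)

# Post-process: shift the noisy cells equally so they sum to the
# invariant total (post-processing does not weaken the DP guarantee).
adjusted = noisy + (invariant_total - noisy.sum()) / noisy.size

print("noisy cells:   ", noisy.round(1), " sum:", round(noisy.sum(), 1))
print("adjusted cells:", adjusted.round(1), " sum:", round(adjusted.sum(), 1))
```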

Co-Author

James Bailie, Harvard University

Speaker

Joerg Drechsler, Institute for Employment Research

Implementing PETs for Government Data: Context and Considerations

As privacy tools move from research to deployment in government contexts, they will unavoidably interact with the complicated use cases and implementation challenges associated with the government's data policy, governance, and regulatory environment.

In this talk we will review public resources for discovering, untangling, and navigating the constraints of the government data ecosystem: understanding the law relevant to data access, leveraging standards to help streamline deployment, unlocking the language of public interest, and connecting projects back to decision-maker use cases for data-informed policy.

We relate this broader context to key considerations for differential privacy solutions in particular, providing illustrative examples from the NIST CRC synthetic data benchmarking project. 

Co-Author(s)

Meghan Stuessy, US Congressional Research Service
Jake Pasner, Georgetown University
Christine Task, Knexus Research Corporation

Speaker

Meghan Stuessy, US Congressional Research Service