Monday, Aug 4: 2:00 PM - 3:50 PM
0633
Invited Paper Session
Music City Center
Room: CC-101D
Applied
Yes
Main Sponsor
Government Statistics Section
Co-Sponsors
Survey Research Methods Section
Presentations
We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data.
We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
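To make the per-record accounting concrete (an editorial sketch, not part of the abstract): under a plain Laplace mechanism on an unbounded sum, a record's per-record privacy loss equals its magnitude divided by the noise scale, i.e., it grows linearly with influence, whereas the mechanisms described above target a profile that grows only logarithmically. The noise scale and the constants in the logarithmic profile below are illustrative assumptions, not the paper's.

```python
import numpy as np

# Per-record privacy loss for a Laplace mechanism releasing an unbounded
# sum: adding or deleting a record of magnitude m shifts the output
# distribution by m, so that record's per-record epsilon is m / b.
def laplace_per_record_eps(magnitude, scale_b):
    return magnitude / scale_b

# Hypothetical target profile: a guarantee degrading only logarithmically
# with influence (eps0 and c are illustrative constants).
def logarithmic_eps(magnitude, eps0=0.5, c=0.5):
    return eps0 + c * np.log1p(magnitude)

for m in [1e2, 1e4, 1e6, 1e8]:  # payroll-like magnitudes
    print(f"influence {m:>10.0f}:  Laplace eps = {laplace_per_record_eps(m, 1e4):>10.2f}"
          f"   log-profile eps = {logarithmic_eps(m):6.2f}")
```

The contrast is the point of the abstract: with linear degradation, a very large establishment's privacy loss becomes meaningless, while a logarithmic profile keeps it bounded at a usable level.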
In differential privacy (DP) mechanisms, it can be beneficial to release "redundant" outputs, in the sense that the same quantity can be estimated from multiple combinations of privatized values. Indeed, this structure is present in the DP 2020 Decennial Census products published by the U.S. Census Bureau. With this structure, the DP output can be improved by enforcing self-consistency (i.e., estimators obtained by combining different values result in the same estimate), and we show that the minimum-variance processing is a linear projection. However, standard projection algorithms are too computationally expensive in terms of both memory and execution time for applications such as the Decennial Census. We propose the Scalable Efficient Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two-step process of aggregation and differencing that 1) enforces self-consistency through a linear and unbiased procedure, 2) is computationally and memory efficient, 3) achieves the minimum-variance solution under certain structural assumptions, and 4) is empirically shown to be robust to violations of these structural assumptions. We propose three methods of calculating confidence intervals from our estimates, under various assumptions. We apply SEA BLUE to two 2010 Census demonstration products, illustrating its scalability and validity.
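As a minimal illustration of the self-consistency idea (toy numbers, not the SEA BLUE algorithm itself): given independent noisy releases of two cells and of their total, generalized least squares gives the minimum-variance linear unbiased estimates, and the fitted values are self-consistent by construction.

```python
import numpy as np

# Toy self-consistency post-processing: noisy estimates of two cells and
# of their total are combined by generalized least squares (a linear
# projection), yielding best linear unbiased, self-consistent estimates.
# SEA BLUE avoids forming these matrices explicitly at Census scale.
A = np.array([[1.0, 0.0],   # cell 1
              [0.0, 1.0],   # cell 2
              [1.0, 1.0]])  # published total = cell 1 + cell 2
y = np.array([10.3, 20.9, 30.0])               # noisy DP releases
W = np.diag(1.0 / np.array([1.0, 1.0, 0.25]))  # inverse noise variances

theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # GLS solution
fitted = A @ theta                                  # self-consistent outputs
print(fitted, fitted[0] + fitted[1] - fitted[2])    # residual is 0 by construction
```

The redundancy (a total released alongside its parts) is exactly what lets the projection reduce variance relative to using either source alone.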
USDA's National Agricultural Statistics Service (NASS) uses the Census of Agriculture (CoA), surveys, and information from other sources to produce official statistics. To minimize disclosure risks and to maintain the analytical validity of the published data, NASS uses a cell suppression approach for its disclosure program. Research to improve the statistical disclosure limitation (SDL) program is ongoing at NASS. It is crucial to develop robust evaluation metrics for disclosure risk and information loss in order to identify and select an optimal SDL method that protects agricultural data while preserving their statistical validity. Although measures of utility are easier to define, the literature on assessing disclosure risk is sparse. The contribution of this work is the introduction of robust metrics developed for both risk and utility. Results obtained from different privacy protection approaches applied to the 2017 CoA are compared based on both their utility and risk. In addition, a Pareto front is developed based on these measures to identify an SDL technique for the NASS disclosure program. Results from these analyses and some final remarks are discussed.
Key Words: Privacy evaluation, Disclosure limitation, Pareto front, Utility-risk tradeoff, Robustness
Co-Author(s)
Luca Sartore, National Institute of Statistical Sciences
Valbona Bejleri, United States Department of Agriculture – National Agricultural Statistics Service
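The Pareto-front selection described in the abstract above can be computed directly from paired risk and information-loss scores. The sketch below uses made-up scores for hypothetical SDL methods (lower is better on both axes); the values are illustrative only, not NASS results.

```python
# Hypothetical (disclosure risk, information loss) scores for candidate
# SDL methods; all numbers are invented for illustration.
methods = {"cell suppression": (0.20, 0.35),
           "noise addition":   (0.15, 0.40),
           "synthetic data":   (0.10, 0.55),
           "rounding":         (0.30, 0.40),
           "no protection":    (0.90, 0.00)}

def pareto_front(points):
    """Return the non-dominated methods: no other method is at least as
    good on both axes and strictly better on at least one."""
    front = []
    for name, (r, u) in points.items():
        dominated = any(r2 <= r and u2 <= u and (r2 < r or u2 < u)
                        for n2, (r2, u2) in points.items() if n2 != name)
        if not dominated:
            front.append(name)
    return front

print(pareto_front(methods))  # "rounding" is dominated by "cell suppression"
```

Methods off the front can be discarded outright; choosing among the remaining methods is then an explicit policy trade-off between risk and utility.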
The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing the approach in practice is challenging, especially when it comes to survey data. In this talk, I will focus on the fact that the production of survey data is a complex multistage process and discuss its implications for DP. Specifically, I will illustrate that data custodians willing to adopt DP for their surveys need to address two important questions: First, at what point in the pipeline should the DP mechanism start? And second, which of the earlier stages of the data pipeline should be considered invariant – i.e., treated as fixed – by DP? I will highlight the implications of these decisions and offer guidance on which settings statistical agencies should adopt when implementing DP for their surveys.
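To illustrate the first question (an editorial sketch, not the speaker's recommendation): the mechanism can start at the respondent level, as in randomized response, or only at the published aggregate, as in central Laplace noise. Both sketches below satisfy epsilon-DP for a binary survey item; all parameters are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=1000)  # binary survey answers
eps = 1.0

# Option A: mechanism starts at the respondent level (early in the
# pipeline): each record is randomized, then records are aggregated.
p = np.exp(eps) / (np.exp(eps) + 1)        # prob. of reporting truthfully
flip = rng.random(responses.size) > p
noisy_records = np.where(flip, 1 - responses, responses)
est_local = (noisy_records.mean() - (1 - p)) / (2 * p - 1)  # debiased mean

# Option B: mechanism starts at the aggregate (late in the pipeline):
# Laplace noise on the total; the sensitivity of a count is 1.
est_central = (responses.sum() + rng.laplace(0, 1 / eps)) / responses.size

print(f"true {responses.mean():.3f}  local {est_local:.3f}  central {est_central:.3f}")
```

Starting earlier protects more of the pipeline (weighting, editing, imputation all operate on already-protected records) but at a substantial accuracy cost, which is one reason the choice of starting point and of invariants matters so much for agencies.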
As privacy tools move from research to deployment in government contexts, they will unavoidably interact with the complicated use cases and implementation challenges associated with the government's data policy, governance, and regulatory environment.
In this talk, we will review public resources for discovering, untangling, and navigating the constraints of the government data ecosystem: understanding the law relevant to data access, leveraging standards to help streamline deployment, unlocking the language of public interest, and connecting projects back to decision-maker use cases for data-informed policy.
We relate this broader context to key considerations for differential privacy solutions in particular, providing illustrative examples from the NIST CRC synthetic data benchmarking project.