Synthetic Data in Preserving Privacy: Connections Across Federal Statistics and Health Data

Rebecca Hubbard Chair
Brown University
 
Bradley Malin Panelist
Vanderbilt University
 
Harrison Quick Panelist
 
Aleksandra Slavkovic Panelist
Pennsylvania State University
 
Roee Gutman Panelist
Brown University
 
Brittany Segundo Organizer
 
Elizabeth Stuart Organizer
Johns Hopkins University, Bloomberg School of Public Health
 
Lance Waller Organizer
Emory University
 
Monday, Aug 5: 2:00 PM - 3:50 PM
1115 
Invited Panel Session 
Oregon Convention Center 
Room: CC-F150 
Open data sharing supports scientific advancement and research reproducibility. Data sharing is seen as a priority by researchers and funders, but sharing of sensitive health data is rightly hampered by privacy and confidentiality concerns, as well as regulatory restrictions such as those imposed by IRBs and HIPAA. This panel will convene experts in health and administrative data sharing (two areas with recent advances in synthetic data methods) to discuss a range of approaches for facilitating data sharing and protecting privacy. By bringing these experts together, the session seeks to explore the current challenges and opportunities in creating synthetic data for sharing health and administrative data while maintaining privacy.

Besides traditional methods such as differential privacy, an emerging approach to privacy-preserving open data sharing is the use of generative machine learning models and artificial intelligence techniques, such as generative adversarial networks (GANs) and variational autoencoders, to create synthetic datasets. In theory, such synthetic data retain key features of the source data while greatly reducing the risk to individual privacy. The objective of creating synthetic records is to support research using these resources while protecting privacy and confidentiality. Performance of proposed approaches is typically evaluated in terms of similarity between the synthetic data and the source data from which they were derived, performance of prediction models developed using the synthetic data, and vulnerability of the synthetic data to privacy attacks. Similarity between synthetic and source data is generally assessed in terms of distributional characteristics of the synthetic data, as well as the ability of domain experts and predictive models to distinguish real from synthetic data. While many GANs can produce synthetic data that experts cannot distinguish from real data, the data often fail to reflect higher-order distributional characteristics of the source data. Success in supporting development of prediction models that generalize to real electronic health record (EHR) data has been mixed.
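
As a toy illustration of the similarity criterion described above, the following sketch shows how a synthetic dataset can match the source data on marginal summaries while failing to preserve joint structure. Everything in it is an assumption made for illustration: the "source" data are simulated, the "synthetic" data come from shuffling each column independently (a deliberately naive stand-in, not a GAN or any method the panel endorses), and the names `pearson`, `marginal_gap`, `corr_source`, and `corr_synth` are hypothetical.

```python
import random
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Simulated "source" data: two variables with true correlation 0.8.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x]

# Naive "synthetic" data (assumption for illustration): shuffle each
# column independently. Each marginal distribution is preserved
# exactly, but the dependence between the columns is destroyed.
syn_x = x[:]
syn_y = y[:]
rng.shuffle(syn_x)
rng.shuffle(syn_y)

# Marginal check: the column means agree between source and synthetic.
marginal_gap = abs(mean(x) - mean(syn_x)) + abs(mean(y) - mean(syn_y))

# Joint check: the pairwise correlation does not survive.
corr_source = pearson(x, y)
corr_synth = pearson(syn_x, syn_y)

print("gap in marginal means:", marginal_gap)
print("source correlation:", round(corr_source, 3))
print("synthetic correlation:", round(corr_synth, 3))
```

A utility evaluation that only compared one-variable summaries would rate this synthetic dataset as faithful, which is why assessments commonly also compare joint characteristics and test whether a predictive model can tell real from synthetic records.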

In addition to navigating the opportunities and challenges of these approaches, panelists will discuss the connections between privacy for health data and privacy for administrative data. They will also probe the domain-specific nuances that merit consideration when handling different data types. Panelists include individuals with deep expertise regarding the methods themselves and the nuances of their application in a range of datasets, including health data and federal data such as those from the US Census Bureau. Bringing together experts across these areas to share lessons learned in each can highlight areas for additional methods development.

Given the current research landscape, the time is right for a panel to explore state-of-the-art approaches for generating synthetic data, frameworks for evaluating the utility of such data, and methods to assess vulnerability to privacy attacks while ensuring privacy guarantees.

Applied

Yes

Main Sponsor

Committee on Applied and Theoretical Statistics; NAS

Co Sponsors

Health Policy Statistics Section
Scientific and Public Affairs Advisory Committee
Section on Statistics in Epidemiology