
Synthetic data in preserving privacy: Connections across federal statistics and health data

Abstract Number:

1115

Submission Type:

Invited Panel Session

Participants:

Brittany Segundo (1), Elizabeth Stuart (3), Lance Waller (4), Bradley Malin (5), Harrison Quick (6), Aleksandra Slavkovic (7), Roee Gutman (8), Rebecca Hubbard (2)

Institutions:

(1) The National Academies of Sciences, Engineering, and Medicine, Washington, D.C., (2) University of Pennsylvania, N/A, (3) Johns Hopkins University, Bloomberg School of Public Health, N/A, (4) Emory University, N/A, (5) Vanderbilt University, N/A, (6) N/A, N/A, (7) Pennsylvania State University, N/A, (8) Brown University, N/A

Chair:

Rebecca Hubbard
University of Pennsylvania

Co-Organizer(s):

Elizabeth Stuart
Johns Hopkins University, Bloomberg School of Public Health

Lance Waller
Emory University

Panelist(s):

Bradley Malin
Vanderbilt University

Harrison Quick
N/A

Aleksandra Slavkovic
Pennsylvania State University

Roee Gutman
Brown University

Session Organizer:

Brittany Segundo
The National Academies of Sciences, Engineering, and Medicine

Session Description:

Open data sharing supports scientific advancement and research reproducibility. Researchers and funders see data sharing as a priority, but sharing of sensitive health data is rightly constrained by privacy and confidentiality concerns, as well as by regulatory restrictions such as those imposed by IRBs and HIPAA. This panel will convene experts in health and administrative data sharing, two areas with recent advances in synthetic data methods, to discuss a range of approaches for facilitating data sharing and protecting privacy. By bringing these experts together, the session seeks to explore the current challenges and opportunities in creating synthetic data for sharing health and administrative data while maintaining privacy.

Besides traditional methods such as differential privacy, an emerging approach to privacy-preserving open data sharing is the use of generative machine learning models, such as generative adversarial networks (GANs) and variational autoencoders, to create synthetic datasets. In theory, such synthetic data retain key features of the source data while greatly reducing the risk to individual privacy. The objective of creating synthetic records is to support research using these resources while protecting privacy and confidentiality. Proposed approaches are typically evaluated on three criteria: similarity between the synthetic data and the source data from which they were derived, performance of prediction models developed using the synthetic data, and vulnerability of the synthetic data to privacy attacks. Similarity between synthetic and source data is generally assessed in terms of distributional characteristics of the synthetic data, as well as the ability of domain experts and predictive models to distinguish real from synthetic data. While many GANs can produce synthetic data that experts cannot distinguish from real data, the data often fail to reflect higher-order distributional characteristics of the source data. Success in supporting development of prediction models that generalize to real electronic health record (EHR) data has been mixed.
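The gap between marginal similarity and higher-order structure can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-column source data are simulated, and `naive_synthesize` is a deliberately simple toy synthesizer (independent column permutation), not a GAN or any method the panelists use. It shows how a synthetic dataset can match every marginal distribution exactly while losing the dependence between variables.

```python
import random


def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_source(n, rng):
    """Hypothetical source data: two correlated numeric columns."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def naive_synthesize(xs, ys, rng):
    """Toy synthesizer (an assumption, not a GAN): permute each column
    independently. Every marginal is kept exactly, but the joint
    structure between the columns is discarded."""
    sx, sy = xs[:], ys[:]
    rng.shuffle(sx)
    rng.shuffle(sy)
    return sx, sy


rng = random.Random(0)
src_x, src_y = make_source(2000, rng)
syn_x, syn_y = naive_synthesize(src_x, src_y, rng)

# Marginal check: each synthetic column holds exactly the source values.
marginals_match = (sorted(src_x) == sorted(syn_x)
                   and sorted(src_y) == sorted(syn_y))

# Joint check: the correlation between columns is strong in the source
# and close to zero in the synthetic data.
corr_source = pearson(src_x, src_y)
corr_synth = pearson(syn_x, syn_y)
print(marginals_match, round(corr_source, 2), round(corr_synth, 2))
```

A marginal-only comparison would rate this synthetic dataset as perfect, which is why evaluations of similarity usually also examine joint or higher-order characteristics.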

In addition to navigating the opportunities and challenges of these approaches, panelists will discuss the connections between privacy for health data and privacy for administrative data. They will also probe the domain-specific nuances that merit consideration when handling different data types. Panelists include individuals with deep expertise regarding the methods themselves and the nuances of their application in a range of datasets, including health data and federal data such as those from the US Census Bureau. Bringing together experts across these areas to share lessons learned in each can highlight areas for additional methods development.

Given the current research landscape, the time is right for a panel to explore state-of-the-art approaches for generating synthetic data, frameworks for evaluating the utility of such data, and methods to assess vulnerability to privacy attacks while ensuring privacy guarantees.
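As a point of reference for what a formal privacy guarantee looks like, the following is a minimal sketch of the standard Laplace mechanism for differential privacy, applied to a counting query. The records, the age threshold, and the function names `laplace_noise` and `private_count` are hypothetical and chosen only for illustration; this is not a method attributed to the panelists.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count under epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 58, 62, 67, 70, 74, 81, 90]  # hypothetical records
true_count = sum(1 for a in ages if a >= 65)

# Repeated releases are shown only to illustrate that the noise is
# centered on the true count; in practice each release consumes
# privacy budget, so a query would not be repeated like this.
releases = [private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng)
            for _ in range(5000)]
mean_release = sum(releases) / len(releases)
print(true_count, round(mean_release, 1))
```

Smaller values of `epsilon` give a stronger guarantee and noisier releases, which is the privacy and utility trade-off that evaluation frameworks need to quantify.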

Sponsors:

Committee on Applied and Theoretical Statistics; NAS ¹

Health Policy Statistics Section ²

Section on Statistics in Epidemiology ³

Theme: Statistics and Data Science: Informing Policy and Countering Misinformation

Yes

Applied

Yes

Estimated Audience Size

Medium (80-150)

I have read and understand that JSM participants must abide by the Participant Guidelines.

Yes

I understand and have communicated to my proposed speakers that JSM participants must register and pay the appropriate registration fee by June 1, 2024. The registration fee is nonrefundable.

I understand