Creating, evaluating, and sharing synthetic data for multinational HIV cohorts
Sunday, Aug 3: 2:35 PM - 3:05 PM
Invited Paper Session
Music City Center
Open science often clashes with data privacy laws and regulations, especially with respect to sharing health data. Synthetic data offer a viable middle ground, enabling the sharing of data that resemble the original data while mitigating privacy concerns. We present our experience generating synthetic datasets for the Caribbean, Central and South America network for HIV epidemiology (CCASAnet), a large (n ≈ 70,000) observational cohort of people living with HIV throughout Latin America. We describe various methods for fitting and generating data, including generative adversarial network (GAN) and diffusion probabilistic model techniques. We discuss challenges encountered, including handling missing data and rare events. We evaluate the utility of our synthetic data by assessing their extrinsic performance, i.e., their ability to yield results similar to those from the original data under analyses that are independent of the data generation process.
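As an illustration of what an extrinsic utility check might look like, the following is a minimal sketch, not the authors' method. It runs the same simple analysis (a mean with a normal-approximation confidence interval) on an "original" and a "synthetic" sample and reports how much the two intervals overlap. The function names `mean_with_ci` and `ci_overlap`, the choice of analysis, and the simulated toy values are all assumptions made for illustration; none of the numbers come from CCASAnet.

```python
import random
import statistics


def mean_with_ci(values, z=1.96):
    """Return (estimate, lower, upper): the sample mean with a
    normal-approximation confidence interval."""
    n = len(values)
    est = statistics.fmean(values)
    se = statistics.stdev(values) / n ** 0.5
    return est, est - z * se, est + z * se


def ci_overlap(a, b):
    """Average fraction of each interval covered by their intersection.

    a and b are (estimate, lower, upper) tuples.
    Returns 0.0 for disjoint intervals and 1.0 for identical ones.
    """
    (_, lo_a, hi_a), (_, lo_b, hi_b) = a, b
    inter = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    return 0.5 * (inter / (hi_a - lo_a) + inter / (hi_b - lo_b))


# Hypothetical toy data: simulated values standing in for a numeric
# outcome in the original and synthetic datasets (not real cohort data).
rng = random.Random(0)
original = [rng.gauss(350, 120) for _ in range(5000)]
synthetic = [rng.gauss(352, 118) for _ in range(5000)]

res_orig = mean_with_ci(original)
res_syn = mean_with_ci(synthetic)

print("original :", res_orig)
print("synthetic:", res_syn)
print("CI overlap:", ci_overlap(res_orig, res_syn))
```

In practice the analysis would more likely be a regression or survival model chosen independently of the generator, with the same comparison applied to each coefficient of interest.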