R-Squared From Synthetic Data: When Can It Be Trusted?
Monday, Aug 4: 11:50 AM - 12:15 PM
Invited Paper Session
Music City Center
Synthetic data are gaining in popularity for a variety of reasons -- ranging from protecting privacy to reducing computation costs -- making it more urgent to address the question of how reliable synthetic data are for assessing the real-world efficacy of a prediction algorithm. This article outlines a comparative framework that takes into account (1) the relationship between the synthetic data D and the (potentially counterfactual) benchmark data D*, which is perceived as a reasonable representation of reality; and (2) the relationship between how the algorithm interacts with D and how it interacts with D*. We propose measures of target syntheticity (or, more broadly, proximity) and residual syntheticity/proximity, and provide a simple decomposition of the benchmark R-squared into the synthetic R-squared and a syntheticity-impact score, which quantifies the difference between the residual and target syntheticities relative to the residual syntheticity alone. We show that the synthetic R-squared is typically asymptotically conservative whenever the synthetic data are created by injecting additive noise into the target variable, as in differential privacy, and we provide a computable adjustment for safely correcting the conservativeness of the synthetic R-squared in such cases. Additionally, we establish a necessary and sufficient condition for the residual syntheticity to exceed one, which implies a conservative synthetic R-squared when the target variable is not synthesized. We apply these theoretical insights to a proxy study, investigating the prediction of ground-level features from Earth observations in cases where the locations of these features have been synthetically perturbed to protect the data providers' privacy. (This is joint work with James Bailie, Mohammad Kakooei, and Adel Daoud of the AI and Global Development Lab at Chalmers University of Technology in Sweden.)
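The conservativeness claim for additive target noise can be illustrated with a small simulation. The sketch below is not the authors' framework or their adjustment; it is a minimal illustration under assumptions introduced here: a fixed predictor (not retrained on the synthetic data), independent Gaussian noise with a known variance added to the target, and hypothetical helper names `r_squared` and `noise_adjusted_r_squared`. Under these assumptions the synthetic mean squared error and the synthetic target variance are both inflated by the noise variance, so subtracting it from each recovers the benchmark R-squared approximately.

```python
import random


def r_squared(y, pred):
    """R^2 = 1 - SSE / SST, with SST measured around the mean of y."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, pred))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def noise_adjusted_r_squared(y_syn, pred, noise_var):
    """Illustrative correction (an assumption of this sketch, not the paper's):
    remove the known noise variance from both the MSE and the target variance."""
    n = len(y_syn)
    mean_y = sum(y_syn) / n
    mse = sum((a - b) ** 2 for a, b in zip(y_syn, pred)) / n
    var = sum((a - mean_y) ** 2 for a in y_syn) / n
    return 1.0 - (mse - noise_var) / (var - noise_var)


rng = random.Random(0)
n = 20000
sigma = 1.0  # assumed known standard deviation of the injected noise

x = [rng.gauss(0, 1) for _ in range(n)]
y_star = [2.0 * xi + rng.gauss(0, 1) for xi in x]    # benchmark target (D*)
y_syn = [yi + rng.gauss(0, sigma) for yi in y_star]  # synthetic target (D)
pred = [2.0 * xi for xi in x]                        # fixed predictor

r2_benchmark = r_squared(y_star, pred)
r2_synthetic = r_squared(y_syn, pred)
r2_adjusted = noise_adjusted_r_squared(y_syn, pred, sigma ** 2)

print(f"benchmark R^2: {r2_benchmark:.3f}")
print(f"synthetic R^2: {r2_synthetic:.3f}")
print(f"adjusted  R^2: {r2_adjusted:.3f}")
```

In this toy setup the population benchmark R-squared is 1 - 1/5 = 0.8 and the population synthetic R-squared is 1 - 2/6 = 2/3, so the synthetic value understates the benchmark one; the printed sample values should be close to these. The correction relies on knowing the noise variance, which is often the case when the noise mechanism is published, as is common in differential privacy.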