False Discovery Rate in Large-Scale Data Error Localization

Paul Smith Co-Author
University of Maryland (retired)
 
Chin-Fang Weng Co-Author
Census Bureau (retired)
 
Eric Slud Co-Author
U.S. Census Bureau
 
Chin-Fang Weng First Author
US Census Bureau
 
Chin-Fang Weng Presenting Author
US Census Bureau
 
Tuesday, Aug 6: 9:20 AM - 9:35 AM
2395 
Contributed Papers 
Oregon Convention Center 

Description

Statistical data editing is the process of identifying potential response errors in data. It is subject to two types of error: labeling a correct observation as erroneous, and failing to identify an incorrect value. There is no statistical criterion for deciding how many observations should be edited. Over-editing can increase data errors, degrade data quality, change the data structure, and increase costs. Error localization consists of a separate test on each observation, where the null hypothesis states that the observation is error-free and the alternative states that it is erroneous. The False Discovery Rate (FDR) is the expected fraction of false positives among the observations deemed erroneous. Because FDR control is tied to the number of edited observations, imposing an FDR requirement determines the number of outliers to be edited, thereby controlling over-editing. In this presentation we apply FDR theory to error localization and verify the theory on simulated data.
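To make the link between an FDR requirement and the number of edited observations concrete, here is a minimal sketch. The abstract does not say which FDR procedure the authors use; the Benjamini-Hochberg step-up rule is assumed here as one common choice. The function names, the standard normal null for each observation's score, and the simulation settings (950 correct records, 50 erroneous records shifted by 4) are all illustrative assumptions, not the authors' setup.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, q=0.05):
    """Indices flagged by the Benjamini-Hochberg step-up rule at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold q * rank / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])


def simulate(n_correct=950, n_error=50, shift=4.0, q=0.05, seed=0):
    """Return (number flagged, false discovery proportion) for one simulated file."""
    rng = random.Random(seed)
    # Hypothetical setup: correct records score N(0, 1), erroneous ones N(shift, 1).
    z = [rng.gauss(0.0, 1.0) for _ in range(n_correct)]
    z += [rng.gauss(shift, 1.0) for _ in range(n_error)]
    flagged = benjamini_hochberg([two_sided_p(v) for v in z], q)
    # The first n_correct indices are the truly correct records.
    false_pos = sum(1 for i in flagged if i < n_correct)
    fdp = false_pos / len(flagged) if flagged else 0.0
    return len(flagged), fdp


if __name__ == "__main__":
    n_flagged, fdp = simulate()
    print(f"flagged for editing: {n_flagged}, false discovery proportion: {fdp:.3f}")
```

In this sketch the number of records sent to editing is an output of the rule rather than a quantity chosen in advance: lowering q flags fewer records, which is the sense in which an FDR requirement limits over-editing. The false discovery proportion of any single run can exceed q; the Benjamini-Hochberg guarantee concerns its expectation, and holds for independent tests.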

Keywords

data editing

response errors

over-editing

multiple hypothesis tests

periodic surveys 

Main Sponsor

Survey Research Methods Section