WITHDRAWN An AI/ML pipeline for automated coding of race and ethnicity write-in responses for federal surveys
Wednesday, Aug 6: 9:05 AM - 9:20 AM
0963
Contributed Papers
Music City Center
Accurate coding of federal survey write-in responses to standardized concept lists is essential for incorporating these responses into downstream statistical products. However, coding by a trained specialist is resource-intensive. We examine automated coding of write-in responses to race and ethnicity questions on the United States decennial census to over 1,600 standardized concept codes using artificial intelligence and machine learning (AI/ML) techniques. Because any subset of codes may be assigned to a response, we frame the task as a multilabel classification problem. We benchmark fuzzy lookups, classical machine learning, and transformer-based classifiers for the coder model and evaluate them at both the response level and the code level. To facilitate automation, we train a second ML model to estimate the probability that the predicted codes exactly match the codes that would have been assigned by a residual coder. Performance is evaluated with both intrinsic (e.g., F1 score) and extrinsic (e.g., simulation) metrics. Overall, AI/ML methods show potential for automated coding of race and ethnicity write-in responses in federal surveys.
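As a minimal illustration of the two evaluation levels mentioned in the abstract, the sketch below computes a response-level exact-match rate and a code-level micro-averaged F1 score over sets of codes. It is not the authors' pipeline: the function names, the code labels (e.g., "A01"), and the toy data are hypothetical, and micro-averaging is only one of several ways a code-level F1 could be defined.

```python
def exact_match_rate(gold, pred):
    """Response level: share of responses whose predicted code set
    is identical to the reference code set."""
    matches = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return matches / len(gold)


def micro_f1(gold, pred):
    """Code level: F1 computed from true/false positives and false
    negatives pooled over every (response, code) pair."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes predicted and present in the reference
        fp += len(p - g)   # codes predicted but absent from the reference
        fn += len(g - p)   # reference codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: each response may carry any subset of codes.
gold = [{"A01"}, {"B02", "C03"}, {"D04"}]
pred = [{"A01"}, {"B02"}, {"D04", "E05"}]

print(exact_match_rate(gold, pred))
print(micro_f1(gold, pred))
```

The exact-match rate is the stricter of the two, since a response with one missing or extra code counts as a miss, while the code-level score gives partial credit for the codes that were predicted correctly.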
Artificial intelligence
Machine learning
Automated coding
Federal surveys
Multilabel learning
Main Sponsor
Government Statistics Section