
WITHDRAWN An AI/ML pipeline for automated coding of race and ethnicity write-in responses for federal surveys

Presented During: A New Era in Official Statistics: AI/ML and Automation

Haley Hunter-Zinck First Author
U.S. Census Bureau

Wednesday, Aug 6: 9:05 AM - 9:20 AM
0963
Contributed Papers

Music City Center

Accurate coding of federal survey write-in responses to standardized concept lists is essential for incorporating these responses into downstream statistical products. However, coding by a trained specialist is resource-intensive. We examine automated coding of write-in responses to race and ethnicity questions on the United States decennial census to over 1,600 standardized concept codes using artificial intelligence and machine learning (AI/ML) techniques. Since any subset of codes may be assigned to a response, we frame the task as a multilabel classification problem. We benchmark fuzzy lookups, classical machine learning, and transformer-based classifiers for the coder model and evaluate them at both the response and code levels. To facilitate automation, we train a second ML model to generate a probability that the predicted codes are an exact match to the codes that would have been assigned by a residual coder. Performance is evaluated with both intrinsic (e.g., F1 score) and extrinsic (e.g., simulation) metrics. Overall, AI/ML methods show potential for automated coding of race and ethnicity write-in responses in federal surveys.
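To make the multilabel framing concrete, the following is a minimal Python sketch of a fuzzy-lookup baseline together with a response-level (exact match) and a code-level (micro-averaged F1) metric. It is an illustration only, not the authors' pipeline: the lookup table, the codes `C001` through `C004`, the 0.8 similarity threshold, and all function names are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever string-matching method the study actually used.

```python
from difflib import SequenceMatcher

# Hypothetical lookup table mapping a term to a concept code.
# Both the terms and the codes are invented for illustration.
LOOKUP = {
    "irish": "C001",
    "german": "C002",
    "mexican": "C003",
    "navajo": "C004",
}


def fuzzy_code(response, lookup=LOOKUP, threshold=0.8):
    """Return the set of codes whose lookup term fuzzily matches any token.

    A response may receive zero, one, or several codes, which is what
    makes the task multilabel. The threshold is an assumed value.
    """
    codes = set()
    tokens = response.lower().replace(",", " ").replace("/", " ").split()
    for token in tokens:
        for term, code in lookup.items():
            if SequenceMatcher(None, token, term).ratio() >= threshold:
                codes.add(code)
    return codes


def exact_match_rate(predicted, gold):
    """Response-level metric: share of responses whose code set is exactly right."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def micro_f1(predicted, gold):
    """Code-level metric: F1 pooled over every (response, code) assignment."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    responses = ["Irish and Germn", "Navajo"]
    gold = [{"C001", "C002"}, {"C004"}]
    predicted = [fuzzy_code(r) for r in responses]
    print(predicted)
    print(exact_match_rate(predicted, gold), micro_f1(predicted, gold))
```

The two metrics can disagree: a prediction that recovers one of two correct codes counts as a complete miss at the response level but as a partial success at the code level, which is one reason to report both.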

Keywords

Artificial intelligence

Machine learning

Automated coding

Federal surveys

Multilabel learning

Main Sponsor

Government Statistics Section