Cleaning Products for Your Data: Four Studies in Editing and Imputation

Chair: Luca Sartore, National Institute of Statistical Sciences
Discussant: Megan Lipke, USDA/NASS
Organizer: Darcy Miller, USDA/NASS
Organizer: Luca Sartore, National Institute of Statistical Sciences
 
Wednesday, Aug 7: 8:30 AM - 10:20 AM
Session 1780: Topic-Contributed Paper Session
Oregon Convention Center, Room CC-253

Applied: Yes

Main Sponsor

Survey Research Methods Section

Co-Sponsors

Government Statistics Section
Section on Statistical Learning and Data Science

Presentations

Applying Non-Survey Data and Machine Learning Techniques to Address Nonresponse in an Agricultural Area Frame Survey

The June Area Survey (JAS) is an annual survey conducted by the United States (U.S.) Department of Agriculture's National Agricultural Statistics Service (NASS) to estimate crop acreages and to measure the coverage of the NASS list frame. The JAS is based on an area frame that offers complete coverage of the contiguous U.S. The design of the survey requires complete reports for all sampled tracts. Thus, the inevitable nonresponse in the survey must be addressed through observation of sampled areas or imputation. Time spent on these efforts is costly, and the resulting data are less reliable than data obtained from full responses. Researchers at NASS have developed a new approach for integrating administrative data, geospatial data, and machine learning forecasting techniques to begin addressing nonresponse in the JAS with an automated imputation process. In this paper, the new process for automated imputation is described, and its predicted impact on survey data quality is explored. Study results indicate that the automated imputation process produces estimates that are comparable to those produced using traditional methods.
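As an illustrative sketch only (not NASS's production process), model-based imputation in the spirit described above — fit a model on responding records and predict values for nonrespondents — can be outlined as follows; the function name, the ordinary-least-squares model, and the single covariate are assumptions made for this sketch:

```python
import numpy as np

def model_impute(y, X, missing):
    """Sketch of model-based imputation: fit ordinary least squares on
    respondents, then predict the outcome for nonrespondent records.
    The NASS process instead combines administrative data, geospatial
    data, and machine learning forecasts; this is a minimal stand-in."""
    Xd = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd[~missing], y[~missing], rcond=None)
    out = y.copy()
    out[missing] = Xd[missing] @ beta            # predictions fill the gaps
    return out
```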

Co-Author(s)

Tara Murphy, USDA National Agricultural Statistics Service
Luca Sartore, National Institute of Statistical Sciences
Jonathon Abernethy, USDA/NASS
Robert Emmet
Linda Young, USDA/NASS
Arthur Rosales

Speaker

Darcy Miller, USDA/NASS

Developing a Hot Deck Imputation Procedure for the Annual Economic Integrated Survey

Beginning in 2024, the economic directorate of the U.S. Census Bureau will launch the Annual Integrated Economic Survey (AIES), an economy-wide survey that replaces a suite of seven independently designed ongoing surveys. The AIES requirements are informed by the user community's longstanding data needs (e.g., national and subnational tabulations), as well as by extensive respondent research on collection. This presentation provides a detailed overview of the nearest-neighbor imputation methodology used for the establishment-level collection of the survey. Throughout, I will highlight specific challenges of developing a viable imputation procedure for a new multi-purpose business survey whose collection covers a wide range of economic sectors.
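As a hedged illustration of the general technique (not the AIES methodology itself), nearest-neighbor hot deck imputation fills a missing item with the reported value of the closest donor record under some distance on auxiliary variables; the function below and its Euclidean matching distance are assumptions for the sketch:

```python
import numpy as np

def nn_hot_deck(values, covariates, missing):
    """Fill each missing value with the value of its nearest donor.

    values     : 1-D array with np.nan where the item is missing
    covariates : 2-D array of auxiliary variables used to match donors
    missing    : boolean mask of recipient records
    Illustrative only; production hot decks add donor pools, donor
    reuse limits, and matching rules tailored to the survey.
    """
    donors = ~missing
    filled = values.copy()
    for i in np.flatnonzero(missing):
        # Euclidean distance from recipient i to every donor record
        d = np.linalg.norm(covariates[donors] - covariates[i], axis=1)
        filled[i] = values[np.flatnonzero(donors)[np.argmin(d)]]
    return filled
```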

Speaker

Katherine Thompson, US Census Bureau

Ensuring Data Quality in Student Lists Submitted for Sampling for the 2022 National Assessment of Educational Progress

The National Assessment of Educational Progress (NAEP) is a congressionally mandated series of surveys measuring the proficiency of American students in a variety of academic subjects. Cooperating schools electronically submit lists of students in the target grade, from which sampled students are drawn. Schools store their data in different ways, so the incoming student lists must be standardized for use in NAEP. Each list submitter maps the columns in their file to specific NAEP fields and the values in each column to specific NAEP values. To ensure the quality of the student lists and the students' demographic data, data checks are run on each student list after the mapping work is done. The fields subject to these checks are student name, gender, student disability status, English learner status, race/ethnicity, school lunch eligibility status, grade, and month and year of birth. Some checks are straightforward, but others are more complex or involve statistical tests. This paper describes the types of data checks performed on the 11,500 student lists submitted for NAEP 2022 and presents results including the number of data check failures and false-positive rates by check type.
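As an illustrative example of the simpler kind of check (the function and its thresholds are hypothetical, not NAEP's actual rules), a birth-year plausibility check against the target grade might look like:

```python
def check_birth_year(grade, birth_year, assessment_year=2022, slack=2):
    """Flag a record whose birth year is implausible for its grade.

    A typical student in grade g during the assessment year was born
    about assessment_year - (g + 5) years earlier; `slack` widens the
    acceptable window. Thresholds are illustrative, not NAEP's rules.
    """
    expected = assessment_year - (grade + 5)
    return abs(birth_year - expected) > slack
```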

Speaker

Leslie Wallace, Westat

Rule-Based Data Validation and Reconciliation of Survey Responses

Each year the US Department of Agriculture's National Agricultural Statistics Service (NASS) conducts more than a hundred surveys to understand and enumerate agriculture in the United States. The quality of survey responses varies with survey and respondent. Ensuring that survey responses are valid, reliable, and internally consistent is vital to publishing accurate official statistics. NASS is undertaking modernization efforts to detect and edit survey responses through rule validation. These innovations include (1) a review and reconciliation of documented (e.g., written in business rules) and undocumented (e.g., only appearing in programming code) validation specifications, (2) distinguishing validation rules whose errors might be correctable with programming code or numeric methods, (3) using numeric methods, such as the Fellegi-Holt algorithm, and R software packages to automate response-level validation checks and error corrections, and (4) flagging instances of automated correction or validation errors for NASS analysts. This paper will describe the processes and procedures used for each step and highlight challenges and solutions to issues commonly encountered.  
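As a minimal sketch of rule-based validation (the rule names and predicates below are invented for illustration; NASS's business rules and Fellegi-Holt error localization are far richer), a record can be confronted with a set of edit rules and the violated rules reported:

```python
def validate_record(record, rules):
    """Return the names of the rules that the record violates.

    rules: mapping of rule name -> predicate over the record.
    Illustrative edit rules only, not NASS specifications.
    """
    return [name for name, rule in rules.items() if not rule(record)]

# Two toy edit rules for a hypothetical acreage report
rules = {
    "nonneg_acres": lambda r: r["planted_acres"] >= 0,
    "harvest_le_planted": lambda r: r["harvested_acres"] <= r["planted_acres"],
}
```

A record reporting more harvested than planted acres would fail the second rule, and the resulting flag could be routed to automated correction or to an analyst, as in step (4) above.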

Co-Author

Albert Lee, Summit Consulting, LLC

Speaker

Gunnar Ingle, Summit Consulting, LLC