Progress in the Use of Traditional and Generative Artificial Intelligence in the Federal Government

Jennifer Parker, Chair
University of Maryland, College Park
 
Linda Young, Organizer
Young Statistical Consulting LLC
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0702 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-201A 
Federal statistical agencies continually evolve their processes in their efforts to provide the highest quality official statistics. List-frame coverage and survey response rates are declining in the U.S. and other countries, while administrative, business, remotely sensed, weather, and other non-survey data are increasingly available. A major effort by federal statistical agencies therefore focuses on integrating survey and non-survey data to produce improved official statistics.
Traditional Artificial Intelligence (AI) methods, including machine learning, deep learning, computer vision, and natural language processing, have become important tools in this effort, which focuses on automation. When ChatGPT was released in November 2022, excitement about the potential of generative AI for numerous applications exploded.
Generative pre-trained transformers (GPTs), such as ChatGPT, Copilot, and Gemini, are trained on massive amounts of data. Whereas traditional AI in this context focuses on automation, GPTs focus on amalgamation: they can be used to generate content, create new data, and enable new capabilities. As federal statistical agencies identify possibilities for using generative AI to further advance the production of official statistics, the concern for ensuring the safety of the data underpinning those statistics has come to the forefront. The question of how to ensure the safety of data while taking advantage of the new technology has slowed the adoption of GPTs within government.
This session will focus on the integration of traditional AI methods to improve processes in the Federal Statistical System (FSS) and the progress that is being made in adopting generative AI. Representatives from four government agencies and a consulting company working closely with federal statistical agencies will provide an overview of the present and a look into the future of AI within the FSS.

Applied

Yes

Main Sponsor

Government Statistics Section

Co-Sponsors

History of Statistics Interest Group
Survey Research Methods Section

Presentations

Overview of Traditional and Generative AI Applications at the USDA's National Agricultural Statistics Service

The USDA's National Agricultural Statistics Service (NASS) has integrated traditional AI into a number of its production processes. Privacy Preserving Record Linkage (PPRL) using Natural Language Processing (NLP) provides the foundation for integrating survey and non-survey data. Response propensity models have been used to inform sampling and data collection. Traditional AI methods are also used in the development of geospatial products. For example, the Cropland Data Layer (CDL) displays where each of 114 crops is grown across the contiguous U.S. each year, and it forms the foundation for identifying the impacts of natural disasters on agriculture. Other models, based on high-order Markov chains and neural networks, predict which crops will be planted in an upcoming growing season, with accompanying uncertainty maps based on normalized Shannon entropy. Some machine learning models provide insights for imputation; others provide the foundation for producing official statistics. NASS has hundreds of programs, many written in languages that are no longer supported, not recommended for use, or too expensive. Although the agency does not have the resources to convert this code to a more modern language manually, generative AI offers a feasible solution. In this presentation, the progress that NASS has made in adopting traditional and generative AI methods and the future of generative AI within the agency will be discussed. 
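The abstract mentions crop-prediction uncertainty maps based on normalized Shannon entropy. A minimal sketch of how such a per-pixel score could be computed from predicted crop-class probabilities is given below; the array shapes, class counts, and probability values are illustrative assumptions, not NASS production code.

import numpy as np

def normalized_shannon_entropy(probs, axis=-1, eps=1e-12):
    """Normalized Shannon entropy in [0, 1] over the class axis.

    Returns 0 when one class has probability 1 (a certain prediction) and 1 for a
    uniform distribution over the K classes (a maximally uncertain prediction).
    """
    p = np.clip(probs, eps, 1.0)
    p = p / p.sum(axis=axis, keepdims=True)      # renormalize defensively
    k = p.shape[axis]
    entropy = -(p * np.log(p)).sum(axis=axis)    # Shannon entropy in nats
    return entropy / np.log(k)                   # divide by the maximum, log(K)

# Hypothetical 2x2 "map" with three candidate crops per pixel
pixel_probs = np.array([
    [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]],
    [[0.70, 0.20, 0.10], [0.50, 0.50, 0.00]],
])
print(normalized_shannon_entropy(pixel_probs))   # near 0 where confident, near 1 where not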

Speaker

Linda Young, Young Statistical Consulting LLC

Using Natural Language Processing (NLP) in Data Linkage

The National Center for Health Statistics (NCHS) has a data linkage program that combines national survey data with key sources of health outcomes and health care utilization. The overall accuracy and quality of a data linkage depends on the quality of the data fields. This applies across a variety of data linkage methods, including clear-text linkage and PPRL. Data pre-processing and cleaning are essential to address data quality issues in most linkage tasks. Automating pre-processing tasks can reduce time-consuming manual reviews, particularly when linkages involve a large number of records. For some data fields, cleaning and pre-processing are relatively straightforward. For example, dates typically have a limited number of plausible values, which makes checking and cleaning relatively easy. Unique identifiers (e.g., social security number) often conform to a set format or have restrictions on the values that would be expected. Other data fields, such as first name and last name, present greater challenges for automating the cleaning process. The use of NLP to identify valid names and to automate the identification and removal of non-name text in name fields will be discussed. In addition, the results of an evaluation of an artificial intelligence-based large language model (LLM) and a simple rule-based algorithm for identifying non-name text in name fields will be presented. 
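As context for the comparison described above, the following is a minimal, hypothetical example of the kind of simple rule-based check that can flag non-name text in a name field; the token list and patterns are illustrative assumptions, not the NCHS algorithm.

import re

# Hypothetical tokens that sometimes appear in name fields but are not names
NON_NAME_TOKENS = {"unknown", "unk", "test", "refused", "none", "baby", "na", "n/a"}

def looks_like_non_name(value: str) -> bool:
    """Return True if a name-field value appears to be non-name text."""
    text = value.strip().lower()
    if not text:
        return True                      # empty field
    if re.search(r"\d", text):
        return True                      # digits rarely belong in real names
    tokens = re.split(r"[\s,.\-]+", text)
    return any(tok in NON_NAME_TOKENS for tok in tokens)

for value in ["Smith", "O'Neil", "Baby Boy", "UNKNOWN", "Doe 2"]:
    print(value, "->", looks_like_non_name(value))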

Speaker

Frances McCarty

Using Data for Evidence Building: Incorporating AI Tools and Techniques

Linking data from disparate sources supports using data for evidence building. In March 2023, the White House Office of Science and Technology Policy released the National Strategy to Advance Privacy Preserving Data Sharing and Analytics, noting several key strategic priorities, including "cultivating an ecosystem that promotes a timely translation of theoretical results into real-world implementation and deployment." As part of the National Secure Data Service (NSDS) Demonstration Project, NCSES and its partners are deploying and evaluating PPRL tools to inform efforts for developing a shared services ecosystem. The results of this project will inform ways to streamline and innovate data sharing and linking across sources, since PPRL technology supports linking individual data records without exposing personal information. The NSDS is currently evaluating two PPRL tools: HealthVerity, a commercial tool, and Anonlink, an open-source Python application. These evaluation projects will result in linked data files that provide the opportunity to implement machine learning models that could help minimize bias in linkage results. Background information on the NSDS Demonstration Project and an overview of the PPRL process will be provided. Techniques that use machine learning to optimize the utility of the linked data will be described, followed by a summary of the initial findings and a discussion of the next steps. 
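PPRL of the kind described above typically rests on encoding quasi-identifiers into Bloom filters so records can be compared without exchanging clear-text values. The sketch below illustrates that general idea only; it is not the Anonlink/clkhash API, and the fields, key, and parameters are hypothetical.

import hashlib
import hmac

def bloom_encode(value: str, secret: bytes, size: int = 256, num_hashes: int = 10) -> set:
    """Map a string's character bigrams to Bloom-filter bit positions using keyed hashes."""
    bits = set()
    text = f"_{value.strip().lower()}_"
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    for gram in bigrams:
        for seed in range(num_hashes):
            digest = hmac.new(secret, f"{seed}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits

def dice_similarity(a: set, b: set) -> float:
    """Dice coefficient of two encodings; values near 1 suggest the same individual."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

key = b"shared-linkage-secret"   # hypothetical key agreed upon by the linking parties
rec_a = bloom_encode("jonathan smith 1980-05-01", key)
rec_b = bloom_encode("jonathon smith 1980-05-01", key)
print(round(dice_similarity(rec_a, rec_b), 3))   # high despite the spelling difference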

Speaker

Lisa Mirel, National Science Foundation, National Center for Science and Engineering Statistics

Leveraging Generative AI to Improve the Survey Process: Use Cases and Challenges

Artificial intelligence (AI), both traditional and generative, has already been adopted across the federal system. Agencies at the federal and local levels have started using AI-driven methods for evidence-based policy, to process and extract insights from text documents (such as application forms), and to develop predictive machine learning models. AI-driven technologies help automate and streamline processes, reduce administrative burdens, and support decisions that are more accurate, consistent, and timely. AI use cases range from questionnaire design and translation to the coding and analysis of open-ended responses. Within the federal statistical system, agencies have started using AI to enhance and improve surveys. AI methods, specifically machine learning and NLP, can be applied broadly throughout the survey process, including data collection, processing, imputation, dissemination, and analysis, as well as to data confidentiality and disclosure avoidance. In this presentation, how AI can enhance the survey process by improving data quality and generating efficiencies will be summarized. The limitations and challenges associated with these methods, and strategies to mitigate these concerns, will also be addressed. 
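One concrete survey-process use case mentioned above is imputation. A small, hypothetical sketch of model-based imputation with scikit-learn is shown below; the variables and values are invented, and production imputation would also involve survey weights, edit rules, and variance estimation that this omits.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical item-level survey data: age, household size, income (in $1,000s)
responses = np.array([
    [34.0, 2.0, 55.0],
    [51.0, 4.0, np.nan],   # item nonresponse on income
    [29.0, np.nan, 41.0],  # item nonresponse on household size
    [45.0, 3.0, 72.0],
    [62.0, 1.0, 38.0],
])

# Each incomplete variable is iteratively regressed on the other variables
imputer = IterativeImputer(max_iter=10, random_state=0)
completed = imputer.fit_transform(responses)
print(np.round(completed, 1))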

Co-Author

Gizem Korkmaz, Westat

Speaker

Elizabeth Mannshardt, Westat

AI Adoption and Usage at the CDC: A Case Study in Machine Learning, NLP, Computer Vision, and Generative AI for Identifying Novel Insights in Public Health

The AI implementation team within CDC, a cross-cutting team of staff from CDC Centers, Institutes, and Offices, works to ensure responsible adoption of artificial intelligence technologies to enhance public health decision-making capabilities and organizational efficiency. Advancements in AI capabilities over the past decade have enabled the integration of non-traditional data sources, including satellite imagery, X-ray images, and unstructured documents, creating opportunities to fill critical information gaps in public health surveillance and response. In this presentation, recent AI projects and implementation efforts at the CDC will be outlined. These include guardrails adopted to ensure the responsible use of generative AI by staff, results of pilot projects testing the enhanced capabilities of these systems, examples of reusable AI pipelines aimed at providing high-impact solutions to current public health problems, and remaining challenges to successful AI adoption within public health. 

Speaker

Benjamin Rogers, NCHS