CS005 NISS-FCSM: AI in Federal Government (Pt 1)

Conference: Symposium on Data Science and Statistics (SDSS) 2024
06/05/2024: 1:15 PM - 2:45 PM EDT
Special Event 
Room: James River Salon CD 

Chair

Luca Sartore, National Institute of Statistical Sciences

Presentations

Artificial Intelligence and Official Statistics: Responsibly Leveraging Large Language Models in Support of Open Data

One of the fundamental responsibilities of a statistical agency is to produce and publicly disseminate relevant, accurate, and credible statistical information. The scale and complexity of some of these data products (file size, number of variables, technical documentation), however, can hinder their direct use by non-technical audiences. Consequently, third parties will often repackage and share that information in myriad ways to make it more accessible and interpretable to the average person. The repackaging of statistical information by non-authoritative sources, however, may impact the integrity of the underlying statistics, calling their accuracy or credibility into question. Emerging technologies like mass-market Large Language Models (LLMs) and other generative artificial intelligence (AI) applications may provide an opportunity for statistical agencies to enhance their ability to disseminate statistics more directly to the average web user, but only if AI can properly and efficiently ingest and interpret the official statistics. The U.S. Department of Commerce, one of the world's largest producers of public data, has assembled a working group to help realize the benefits and mitigate the risks of AI models for finding, linking, and interpreting the Department's data. The goal is to advance dissemination standards for data and statistics from being machine-readable to being machine-understandable, capturing and conveying the information's context, structure, and meaning. This working group is currently drafting technical guidelines for publishing AI-ready open data. The Department of Commerce is interested in engagement from industry, academia, and other partners across the public data ecosystem. We will share the progress of the working group and elicit your feedback. 

Presenting Author

Sallie Keller, University of Virginia

First Author

Sallie Keller, University of Virginia

CoAuthor(s)

Michael Hawes, U.S. Census Bureau
Kenneth Haase, U.S. Census Bureau

Predictive Cropland Data Layer and Uncertainty Measures

The National Agricultural Statistics Service (NASS) of the United States Department of Agriculture (USDA) uses High-Order Markov Chains (HOMC) to analyze crop rotation patterns over time and project future crop-specific planting. However, HOMCs often face issues with sparsity and identifiability due to the representation of categorical data as indicator variables. As the number of HOMCs needed for analysis increases, the parametric space's dimension grows exponentially. Parsimonious representations reduce the number of parameters but often produce less accurate predictions. To better represent the complexity of the data, a deep neural network model is suggested. To measure the degree of uncertainty surrounding categorical predictions, two uncertainty measures are also offered.
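
As a rough illustration of the parameter growth described in this abstract, the sketch below counts the free transition probabilities of a fully parameterized Markov chain, assuming the growth is driven by the chain order. It also shows two generic ways to score the uncertainty of a categorical prediction. The abstract does not name its two uncertainty measures, so normalized entropy and the top-two margin are stand-ins chosen for illustration, not the authors' measures, and all function names are hypothetical.

```python
import math


def homc_free_parameters(num_states: int, order: int) -> int:
    """Free transition probabilities in a full Markov chain of the given order.

    Each of the num_states**order possible histories has its own categorical
    distribution over num_states outcomes, which has num_states - 1 free values.
    """
    return num_states ** order * (num_states - 1)


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] (illustrative stand-in measure).

    0 means the prediction is certain; 1 means all categories are equally likely.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def top_two_margin(probs):
    """Gap between the two most likely categories (illustrative stand-in measure).

    A small gap means the predicted category is barely ahead of the runner-up.
    """
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


if __name__ == "__main__":
    # Hypothetical setting: 10 crop categories, chain orders 1 through 4.
    for order in range(1, 5):
        print(order, homc_free_parameters(10, order))

    # Hypothetical predicted probabilities for one location.
    predicted = [0.5, 0.3, 0.2]
    print(normalized_entropy(predicted), top_two_margin(predicted))
```

With only 3 crop categories, a second-order chain already has 18 free parameters, and every additional order multiplies that count by the number of categories.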

Presenting Author

Claire Boryan, USDA/NASS

First Author

Claire Boryan, USDA/NASS

CoAuthor

Luca Sartore, National Institute of Statistical Sciences

AI Guidelines, Best Practice, and Use-Cases at the National Center for Health Statistics/CDC

The National Center for Health Statistics/Centers for Disease Control and Prevention (NCHS/CDC) is developing guidelines and best practices for the use of AI. AI, including generative AI, offers many potential benefits for NCHS/CDC: efficiency and resource savings through increased automation, assisted code generation, synthesis and summarization of written material, and communication. However, the risks of AI, and most recently of generative AI, are regularly documented. These risks can harm the agency through fabrication and hallucination, poor model performance, bias and discrimination, privacy and data security failures, and other legal and ethical problems that threaten the agency's credibility and integrity. Use cases illustrate the opportunities and challenges of AI for two data processing tasks: identifying nonresponse in survey text responses and distinguishing the absence or presence of conditions and risk factors in clinical notes. This presentation will describe the processes for developing guidelines and best practices for AI use, with examples drawn from these use cases.

Presenting Author

Jennifer Parker, National Center for Health Statistics

First Author

Jennifer Parker, National Center for Health Statistics