Generative AI and Foundation Models in Defense Applications

Michael McKibben, Chair
Johns Hopkins University Applied Physics Laboratory
 
Amanda French, Organizer
Johns Hopkins University Applied Physics Laboratory
 
Monday, Aug 4: 10:30 AM - 12:20 PM
0520 
Invited Paper Session 
Music City Center 
Room: CC-201B 

Applied: Yes

Main Sponsor

Section on Statistics in Defense and National Security

Presentations

Automating Data Insights: A Large Language Model-Driven Data Analysis Tool

Large language models (LLMs) are known for making confident assertions, particularly in mathematical contexts, which can occasionally lead to incorrect conclusions. To address this challenge and improve the reliability of quantitative answers, we present an LLM agent that answers analytical questions by interacting with diverse datasets. The tool integrates an LLM with a code interpreter in a secure, sandboxed environment: the LLM generates code to answer each analytical question, and the code is then executed to produce accurate, reliable results.
To build confidence in the outputs, the tool returns the generated code, allowing users to independently verify the correctness of the calculations. Users can also generate accompanying visualizations to support findings and verify data insights. By combining LLMs with code-execution capabilities, our LLM agent empowers users to quickly and reliably derive meaningful insights from their datasets.
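A minimal sketch of the generate-then-execute loop the abstract describes, with a hypothetical llm_generate placeholder for the model call and a subprocess standing in for the sandboxed interpreter (the tool's actual APIs and sandbox are not specified in the abstract):

```python
import subprocess
import sys
import tempfile

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to an LLM that returns Python code (hypothetical)."""
    raise NotImplementedError

def answer_with_code(question: str, csv_path: str) -> dict:
    # Ask the LLM for code that loads the dataset and prints the answer.
    prompt = (
        f"Write Python that loads '{csv_path}' with pandas and prints "
        f"the answer to: {question}"
    )
    code = llm_generate(prompt)

    # Run the generated code in a separate interpreter process; a real
    # deployment would use a locked-down sandbox rather than subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script = f.name
    result = subprocess.run(
        [sys.executable, script], capture_output=True, text=True, timeout=60
    )

    # Return the result together with the code so users can verify the calculation.
    return {"answer": result.stdout.strip(), "code": code, "stderr": result.stderr}
```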
 

Keywords

Large language model (LLM)

AI for Data Analysis 

Speaker

Catherine Appleby

Building a "Model in a Month" for Science and Defense Applications

While artificial intelligence (AI) has been a prominent modeling technique for decades, a paradigm shift has emerged more recently with the focus on training foundation models. Unlike predecessor models defined as narrow AI, i.e., algorithms designed for a single specific task or application, foundation models are capable of a variety of tasks and, although sometimes suboptimal on a specific desired task, can often be retrained or fine-tuned quickly to increase performance. In this talk, we will review the development of multiple unimodal and multimodal large language models (LLMs) for scientific and defense applications. We will discuss strategies for training with limited compute, the challenges of alignment (both across data sources and with human intent), how to incorporate statistics into an LLM pipeline, and how to make the results accessible and trustworthy for human interaction, all with a focus on accelerating the deployment of new models.
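To illustrate the fine-tuning step mentioned above, a minimal sketch using the Hugging Face transformers Trainer API; the checkpoint, dataset, and hyperparameters here are generic placeholders, not the models or data discussed in the talk:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint and dataset for illustration only.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# A short fine-tuning run: adapt a general-purpose model to a narrow task
# using a small subset of data and limited compute.
args = TrainingArguments(output_dir="ft-demo", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=0).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()
```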

Keywords

Artificial Intelligence

Large Language Model

Foundation Model 

Speaker

Karl Pazdernik, Pacific Northwest National Laboratory

SysChat: A Human-Expert Guided Retrieval Augmented Generation (RAG) Chatbot for Complex System Question Answering

SysChat is a Retrieval Augmented Generation (RAG) tool that combines information retrieval, black-box large language models (LLMs), and expert feedback to answer user questions on mechanical systems. In retrieval-augmented generation, an embedding model first encodes all documents, and the embeddings are stored in a vector database—in our case, covering tens of thousands of pages of complex systems documentation. At query time, these embeddings are used to identify relevant information for each question, guiding black-box LLM responses with improved factual accuracy and traceable information sources. Experts were then given access to this tool, and their feedback was used to train auxiliary methods that steer LLM outputs towards expert-preferred responses. This talk will discuss SysChat's architecture, highlighting classical and modern RAG techniques, LLM enhancements that improve reasoning capabilities, and the integration of expert feedback to guide black-box LLM generation.
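A schematic of the retrieve-then-generate flow described above, using sentence-transformers embeddings and a FAISS index as generic stand-ins for SysChat's embedding model and vector database (the abstract does not name the actual components, and the documents below are illustrative):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Generic stand-ins; the real corpus is tens of thousands of pages.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Pump P-101 requires inspection every 500 operating hours.",
        "Valve V-7 operating pressure must not exceed 150 psi."]

# Embed the documents once and store the vectors in an index.
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most relevant to the query."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [docs[i] for i in idx[0]]

def build_prompt(query: str) -> str:
    # Retrieved passages ground the black-box LLM and give traceable sources.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```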

Keywords

LLM

NLP

Information Retrieval

Generative AI 

Speaker

Robert Molloy, Johns Hopkins University Applied Physics Laboratory

Testing and Evaluating Foundation Models in High-Consequence Scenarios

Foundation models have had a profound impact on society, through models such as OpenAI's ChatGPT and Anthropic's Claude series, as well as on science, through models such as AlphaFold, ClimaX, and Aurora. While these models can produce impressive output, far less attention has been paid to model evaluation than to model building. In this talk I will discuss some of the challenges that make testing and evaluating large models difficult, along with efforts to evaluate them more systematically, such as uncertainty quantification on metrics and predictions, holistic metrics that go beyond leaderboardism (where models are ranked and compared with a single value), and deterministic evaluation of an LLM's output probability distribution.
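As one concrete instance of uncertainty quantification on metrics, a minimal bootstrap sketch that attaches a confidence interval to a benchmark accuracy instead of reporting a single leaderboard number (the per-item scores are synthetic placeholders, not results from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-item correctness (1 = correct) on a benchmark; placeholder values.
scores = rng.integers(0, 2, size=500)

# Nonparametric bootstrap over benchmark items: resample with replacement
# and recompute accuracy to get a distribution rather than a point estimate.
boot = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                 for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy = {scores.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```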

Keywords

Testing and evaluation

uncertainty quantification

AI models 

Speaker

Emily Casleton