Sunday, Aug 3: 2:00 PM - 3:50 PM
0479
Invited Paper Session
Music City Center
Room: CC-101A
Large Language Models
Applied
Yes
Main Sponsor
Section on Text Analysis
Co-Sponsors
Section on Nonparametric Statistics
Section on Statistical Learning and Data Science
Presentations
Multi-modal Large Language Models (MLLMs) promise a paradigm shift in the way domain experts interact with rich data such as images, graphs, multi-media, and structured data. However, the impressive performance of MLLMs, presented in the form of high accuracy scores on visual multiple-choice question answering (MCQA) tasks, only begins to measure their readiness for sensitive domains such as medicine, scientific research, and multi-modal analytics. In this work, we conduct two analyses that probe the robustness and calibration underlying the strong performance of MLLMs on image QA benchmarks. In doing so, we apply uncertainty quantification (UQ) methods for text generation to MLLMs for the first time. First, we find that MLLM accuracy relies on the multiple-choice format of the question, with performance affected even by the number of multiple-choice options. Additionally, estimates of model calibration shift drastically when comparing UQ metrics in the classification setting versus the open-ended, generative setting increasingly employed by MLLMs. Second, we benchmark calibration for several MLLMs on an array of multi-modal tasks. We evaluate open-ended generation for image and video question answering and image captioning, and compare calibration on general datasets against domain-specific, scientific datasets. Based on our analysis, we suggest that the path to robust, deployable MLLMs requires not only achieving high accuracy on benchmarks, but also improving performance and calibration on challenging, open-ended tasks across the multi-modal spectrum.
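The abstract does not specify which UQ metrics are used; as an illustration only, the sketch below computes one standard calibration measure, expected calibration error (ECE), and applies the same metric to two hypothetical confidence sources: a softmax probability over MCQA options and a normalized sequence likelihood from open-ended generation. All data in the example are simulated placeholders, not results from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: per-question confidence scores in [0, 1] (e.g., the softmax
                 probability of the chosen MCQA option, or a length-normalized
                 sequence likelihood for an open-ended answer).
    correct:     boolean array, True where the model's answer was correct.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Illustrative comparison on simulated placeholder data.
rng = np.random.default_rng(0)
mcqa_conf = rng.uniform(0.5, 1.0, size=500)   # softmax prob. of the chosen option
mcqa_correct = rng.random(500) < 0.8          # placeholder accuracy
open_conf = rng.uniform(0.2, 1.0, size=500)   # e.g., normalized sequence likelihood
open_correct = rng.random(500) < 0.6          # placeholder accuracy
print("MCQA ECE:      ", expected_calibration_error(mcqa_conf, mcqa_correct))
print("Open-ended ECE:", expected_calibration_error(open_conf, open_correct))
```

Because the same binning is applied to both settings, any shift in the reported ECE reflects the change in how confidence is extracted, which is the kind of classification-versus-generation comparison the abstract describes.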
Keywords
Multimodal Large Language Models
Benchmarks
Uncertainty Quantification
In-context learning (ICL) has emerged as a powerful capability for large language models (LLMs) to adapt to downstream tasks by leveraging a few (demonstration) examples. Despite its effectiveness, the mechanism behind ICL remains underexplored.
To better understand how ICL integrates the examples with the knowledge learned by the LLM during pre-training (i.e., pre-training knowledge) and how the examples impact ICL, this paper conducts a theoretical study in binary classification tasks.
In particular, we introduce a probabilistic model that extends the Gaussian mixture model to exactly quantify the impact of pre-training knowledge, label frequency, and label noise on prediction accuracy. Our analysis shows that when the pre-training knowledge contradicts the knowledge in the examples, whether the ICL prediction relies more on the pre-training knowledge or on the examples depends on the number of examples. In addition, the label frequency and label noise of the examples both affect the accuracy of the ICL prediction: the minority class has lower accuracy, and how the label noise impacts the accuracy is determined by the specific noise levels of the two classes. Extensive simulations verify the correctness of the theoretical results, and real-data experiments also align with the theoretical insights. Our work reveals the role of pre-training knowledge and examples in ICL, offering a deeper understanding of LLMs' behavior in classification tasks.
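The paper's model is not reproduced here; as a purely illustrative sketch under assumed parameters, the simulation below casts ICL prediction as Bayesian updating in a two-component Gaussian mixture: a prior over the class centers stands in for pre-training knowledge, noisy demonstration examples pull the posterior toward the task, and accuracy is tracked as the number of examples grows. The prior strength, noise rates, and class frequency are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# True task: label y in {0, 1}, feature x | y ~ N(true_mu[y], 1).
true_mu = np.array([+1.0, -1.0])    # contradicts the "pre-training" belief below
prior_mu = np.array([-1.0, +1.0])   # pre-training knowledge (wrong for this task)
prior_strength = 5.0                # pseudo-counts: trust placed in pre-training
flip_prob = np.array([0.1, 0.3])    # per-class label noise in the demonstrations
p_class1 = 0.3                      # label frequency: class 1 is the minority class

def simulate_accuracy(n_examples, n_queries=2000):
    # Demonstration examples with class-dependent label noise.
    y = (rng.random(n_examples) < p_class1).astype(int)
    x = rng.normal(true_mu[y], 1.0)
    noisy_y = np.where(rng.random(n_examples) < flip_prob[y], 1 - y, y)

    # Posterior means of the class centers: the prior (pre-training) is pulled
    # toward the demonstration data, with weight growing in the sample size.
    post_mu = prior_mu.copy()
    for c in (0, 1):
        xc = x[noisy_y == c]
        post_mu[c] = (prior_strength * prior_mu[c] + xc.sum()) / (prior_strength + len(xc))

    # Queries classified by the nearest posterior class center.
    yq = (rng.random(n_queries) < p_class1).astype(int)
    xq = rng.normal(true_mu[yq], 1.0)
    pred = (np.abs(xq - post_mu[1]) < np.abs(xq - post_mu[0])).astype(int)
    return (pred == yq).mean()

for n in (0, 2, 8, 32, 128):
    print(f"n = {n:3d} examples -> accuracy {simulate_accuracy(n):.3f}")
```

With few examples the contradictory prior dominates and accuracy is poor; with many examples the demonstrations dominate and accuracy recovers, mirroring the trade-off the abstract describes. The minority class also contributes fewer demonstrations, so its center stays closer to the (wrong) prior for longer.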
Keywords
in-context learning
Bayesian inference
This paper presents a case study on the development and implementation of a custom GPT-based tool designed to enhance the learning experience of Executive MBA students at Wharton as they review and study a core Business Statistics course. The project aimed to provide students with an interactive, AI-driven resource tailored to their specific learning needs that helps them review complex statistical concepts and improve comprehension.
We share insights from the multidisciplinary team, highlighting the integration of diverse expertise in designing and deploying the system. Key components of the project architecture, including the Retrieval-Augmented Generation (RAG) framework, feedback mechanisms, system prompt design, and integration of course materials, are discussed to provide a comprehensive guide for those interested in replicating or scaling similar AI tools across different academic settings.
The paper explores how the tool aligns with student expectations, including strategies for fostering trust and engagement with AI-generated outputs and the importance of linking relevant course materials and lecture recordings to specific concepts for deeper learning. It also emphasizes the role of the feedback loop in continuous improvement and trust building.
In addition, the paper discusses the limitations of the current platform, outlines usage statistics that demonstrate the tool's impact on student engagement, and highlights future enhancements to further refine it. The case study also addresses critical considerations such as scaling across other courses, ethical and data-privacy protocols, and strategies for balancing AI-driven insights with human-led instruction, offering valuable lessons for educators and technologists seeking to leverage large language models in higher education.
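The abstract does not include implementation details; the sketch below is a generic illustration of the Retrieval-Augmented Generation pattern it refers to, assuming TF-IDF retrieval in place of a production embedding model and a hand-written system prompt. None of the chunk text, function names, or prompt wording comes from the Wharton tool.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "course material" chunks standing in for lecture notes and transcripts.
chunks = [
    "A 95% confidence interval for a mean is x-bar plus or minus about two standard errors.",
    "The p-value is the probability, under the null hypothesis, of data at least as extreme as observed.",
    "In simple linear regression, R-squared measures the share of variance explained by the predictor.",
]

# Index the chunks; TF-IDF stands in for the embedding model a real deployment would use.
vectorizer = TfidfVectorizer().fit(chunks)
chunk_vectors = vectorizer.transform(chunks)

def retrieve(question, k=2):
    """Return the k course-material chunks most similar to the student's question."""
    sims = cosine_similarity(vectorizer.transform([question]), chunk_vectors)[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question):
    """Assemble a system prompt that grounds the answer in the retrieved material."""
    context = "\n".join(f"- {c}" for c in retrieve(question))
    return (
        "You are a review assistant for a core Business Statistics course.\n"
        "Answer using only the course material below and point the student to it.\n"
        f"Course material:\n{context}\n\nStudent question: {question}"
    )

print(build_prompt("How do I interpret a p-value?"))
```

The design choice the pattern illustrates is that the model answers from retrieved course content rather than from its general knowledge, which is also the basis for linking answers back to specific materials and lecture recordings.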
Keywords
Custom GPT-based tool
Core Statistics Class
Retrieval-Augmented Generation (RAG) framework
Feedback mechanisms
Trust development
Student engagement metrics