Sunday, Aug 3: 2:00 PM - 3:50 PM
0479
Invited Paper Session
Music City Center
Room: CC-101A
Large Language Models
Applied
Yes
Main Sponsor
Section on Text Analysis
Co-Sponsors
Section on Nonparametric Statistics
Section on Statistical Learning and Data Science
Presentations
Multi-modal Large Language Models (MLLMs) promise a paradigm shift in the way domain experts interact with rich data such as images, graphs, multi-media, and structured data. However, the impressive performance of MLLMs, presented in the form of high accuracy scores on visual multiple-choice question answering (MCQA) tasks, only begins to measure their readiness for sensitive domains such as medicine, scientific research, and multi-modal analytics. In this work, we conduct two analyses that probe the robustness and calibration underlying the strong performance of MLLMs on image QA benchmarks. In doing so, we apply uncertainty quantification (UQ) methods for text generation to MLLMs for the first time. First, we find that MLLM accuracy relies on the multiple-choice format of the question, with performance affected even by the number of multiple-choice options. Additionally, estimates of model calibration shift drastically when comparing UQ metrics in the classification setting versus the open-ended, generative setting increasingly employed by MLLMs. Second, we benchmark calibration for several MLLMs on an array of multi-modal tasks. We evaluate open-ended generation for image and video question answering and image captioning, and compare calibration on general datasets against domain-specific, scientific datasets. Based on our analysis, we suggest that the path to robust, deployable MLLMs requires not only achieving high accuracy on benchmarks, but also improving performance and calibration on challenging, open-ended tasks across the multi-modal spectrum.
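The abstract does not specify which UQ metrics are used; as an illustration only, the sketch below computes one standard calibration measure, expected calibration error (ECE), and applies the same metric to two hypothetical confidence sources: a softmax probability over MCQA options and a normalized sequence likelihood from open-ended generation. All data in the example are simulated placeholders, not results from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: per-question confidence scores in [0, 1] (e.g., the softmax
                 probability of the chosen MCQA option, or a length-normalized
                 sequence likelihood for an open-ended answer).
    correct:     boolean array, True where the model's answer was correct.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Illustrative comparison on simulated placeholder data.
rng = np.random.default_rng(0)
mcqa_conf = rng.uniform(0.5, 1.0, size=500)   # softmax prob. of the chosen option
mcqa_correct = rng.random(500) < 0.8          # placeholder accuracy
open_conf = rng.uniform(0.2, 1.0, size=500)   # e.g., normalized sequence likelihood
open_correct = rng.random(500) < 0.6          # placeholder accuracy
print("MCQA ECE:      ", expected_calibration_error(mcqa_conf, mcqa_correct))
print("Open-ended ECE:", expected_calibration_error(open_conf, open_correct))
```

Because the same binning is applied to both settings, any shift in the reported ECE reflects the change in how confidence is extracted, which is the kind of classification-versus-generation comparison the abstract describes.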
Keywords
Multimodal Large Language Models
Benchmarks
Uncertainty Quantification
In-context learning (ICL) has emerged as a powerful capability for large language models (LLMs) to adapt to downstream tasks by leveraging a few (demonstration) examples. Despite its effectiveness, the mechanism behind ICL remains underexplored.
To better understand how ICL integrates the examples with the knowledge learned by the LLM during pre-training (i.e., pre-training knowledge) and how the examples impact ICL, this paper conducts a theoretical study in binary classification tasks.
In particular, we introduce a probabilistic model that extends the Gaussian mixture model to exactly quantify the impact of pre-training knowledge, label frequency, and label noise on prediction accuracy. Our analysis shows that when the pre-training knowledge contradicts the knowledge in the examples, whether the ICL prediction relies more on the pre-training knowledge or on the examples depends on the number of examples. In addition, the label frequency and label noise of the examples both affect the accuracy of the ICL prediction: the minority class has lower accuracy, and how the label noise impacts the accuracy is determined by the specific noise levels of the two classes. Extensive simulations verify the correctness of the theoretical results, and real-data experiments also align with the theoretical insights. Our work reveals the role of pre-training knowledge and examples in ICL, offering a deeper understanding of LLMs' behavior in classification tasks.
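The paper's model is not reproduced here; as a purely illustrative sketch under assumed parameters, the simulation below casts ICL prediction as Bayesian updating in a two-component Gaussian mixture: a prior over the class centers stands in for pre-training knowledge, noisy demonstration examples pull the posterior toward the task, and accuracy is tracked as the number of examples grows. The prior strength, noise rates, and class frequency are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# True task: label y in {0, 1}, feature x | y ~ N(true_mu[y], 1).
true_mu = np.array([+1.0, -1.0])    # contradicts the "pre-training" belief below
prior_mu = np.array([-1.0, +1.0])   # pre-training knowledge (wrong for this task)
prior_strength = 5.0                # pseudo-counts: trust placed in pre-training
flip_prob = np.array([0.1, 0.3])    # per-class label noise in the demonstrations
p_class1 = 0.3                      # label frequency: class 1 is the minority class

def simulate_accuracy(n_examples, n_queries=2000):
    # Demonstration examples with class-dependent label noise.
    y = (rng.random(n_examples) < p_class1).astype(int)
    x = rng.normal(true_mu[y], 1.0)
    noisy_y = np.where(rng.random(n_examples) < flip_prob[y], 1 - y, y)

    # Posterior means of the class centers: the prior (pre-training) is pulled
    # toward the demonstration data, with weight growing in the sample size.
    post_mu = prior_mu.copy()
    for c in (0, 1):
        xc = x[noisy_y == c]
        post_mu[c] = (prior_strength * prior_mu[c] + xc.sum()) / (prior_strength + len(xc))

    # Queries classified by the nearest posterior class center.
    yq = (rng.random(n_queries) < p_class1).astype(int)
    xq = rng.normal(true_mu[yq], 1.0)
    pred = (np.abs(xq - post_mu[1]) < np.abs(xq - post_mu[0])).astype(int)
    return (pred == yq).mean()

for n in (0, 2, 8, 32, 128):
    print(f"n = {n:3d} examples -> accuracy {simulate_accuracy(n):.3f}")
```

With few examples the contradictory prior dominates and accuracy is poor; with many examples the demonstrations dominate and accuracy recovers, mirroring the trade-off the abstract describes. The minority class also contributes fewer demonstrations, so its center stays closer to the (wrong) prior for longer.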
Keywords
in-context learning
Bayesian inference
This paper presents a case study on the development and implementation of a custom GPT-based tool designed to enhance the learning experience of Executive MBA students at Wharton as they review and study a core Business Statistics course. The project aimed to provide students with an interactive, AI-driven resource tailored to their specific learning needs that helps them review complex statistical concepts and improve comprehension.
We share insights from the multidisciplinary team, highlighting the integration of diverse expertise in designing and deploying the system. Key components of the project architecture, including the Retrieval-Augmented Generation (RAG) framework, feedback mechanisms, system prompt design, and integration of course materials, are discussed to provide a comprehensive guide for those interested in replicating or scaling similar AI tools across different academic settings.
The paper explores how the tool aligns with student expectations, including strategies for fostering trust and engagement with AI-generated outputs and the importance of linking relevant course materials and lecture recordings to specific concepts for deeper learning. It also emphasizes the role of the feedback loop in continuous improvement and trust building.
In addition, the paper discusses the limitations of the current platform, outlines usage statistics that demonstrate the tool's impact on student engagement, and highlights future enhancements to further refine it. The case study also addresses critical considerations such as scaling across other courses, ethical and data-privacy protocols, and strategies for balancing AI-driven insights with human-led instruction, offering valuable lessons for educators and technologists seeking to leverage large language models in higher education.
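The abstract does not include implementation details; the sketch below is a generic illustration of the Retrieval-Augmented Generation pattern it refers to, assuming TF-IDF retrieval in place of a production embedding model and a hand-written system prompt. None of the chunk text, function names, or prompt wording comes from the Wharton tool.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "course material" chunks standing in for lecture notes and transcripts.
chunks = [
    "A 95% confidence interval for a mean is x-bar plus or minus about two standard errors.",
    "The p-value is the probability, under the null hypothesis, of data at least as extreme as observed.",
    "In simple linear regression, R-squared measures the share of variance explained by the predictor.",
]

# Index the chunks; TF-IDF stands in for the embedding model a real deployment would use.
vectorizer = TfidfVectorizer().fit(chunks)
chunk_vectors = vectorizer.transform(chunks)

def retrieve(question, k=2):
    """Return the k course-material chunks most similar to the student's question."""
    sims = cosine_similarity(vectorizer.transform([question]), chunk_vectors)[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question):
    """Assemble a system prompt that grounds the answer in the retrieved material."""
    context = "\n".join(f"- {c}" for c in retrieve(question))
    return (
        "You are a review assistant for a core Business Statistics course.\n"
        "Answer using only the course material below and point the student to it.\n"
        f"Course material:\n{context}\n\nStudent question: {question}"
    )

print(build_prompt("How do I interpret a p-value?"))
```

The design choice the pattern illustrates is that the model answers from retrieved course content rather than from its general knowledge, which is also the basis for linking answers back to specific materials and lecture recordings.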
Keywords
Custom GPT-based tool
Core Statistics Class
Retrieval-Augmented Generation (RAG) framework
Feedback mechanisms
Trust development
Student engagement metrics