Auditing the Performance and Calibration of Multi-Modal Large Language Models

Brendan Kennedy, Speaker
Pacific Northwest National Laboratory

Sunday, Aug 3: 2:05 PM - 2:25 PM
Invited Paper Session
Music City Center
Multi-modal Large Language Models (MLLMs) promise a paradigm shift in the way domain experts interact with rich data such as images, graphs, multimedia, and structured data. However, the impressive performance of MLLMs, typically reported as high accuracy scores on visual multiple-choice question answering (MCQA) tasks, only begins to measure their readiness for sensitive domains such as medicine, scientific research, and multi-modal analytics. In this work, we conduct two analyses that probe the robustness and calibration underlying the strong performance of MLLMs on image QA benchmarks. In doing so, we apply uncertainty quantification (UQ) methods for text generation to MLLMs for the first time. First, we find that MLLM accuracy depends on the multiple-choice format of the question, with performance affected even by the number of answer options. Additionally, estimates of model calibration shift drastically when UQ metrics are compared between the classification setting and the open-ended, generative setting increasingly employed by MLLMs. Second, we benchmark the calibration of several MLLMs on an array of multi-modal tasks: we evaluate open-ended generation for image and video question answering and image captioning, and compare calibration on general-purpose datasets against domain-specific, scientific datasets. Based on our analysis, we suggest that the path to robust, deployable MLLMs requires not only high accuracy on benchmarks but also improved performance and calibration on challenging, open-ended tasks across the multi-modal spectrum.
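
To make the classification-versus-generation calibration comparison concrete, the sketch below contrasts two confidence definitions, assuming Expected Calibration Error as the calibration metric and length-normalized sequence likelihood as the generative confidence proxy; these metric choices and the toy numbers are illustrative assumptions, not the specific UQ methods or data reported in the talk.

```python
# Illustrative sketch only (not the authors' code): contrasting how a confidence
# score is defined in the multiple-choice setting vs. an open-ended generative
# setting, and measuring calibration with Expected Calibration Error (ECE).
# ECE and length-normalized sequence likelihood are assumed choices of metric.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-size-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

def mcqa_confidence(option_logits):
    """Classification view: softmax probability assigned to the chosen option."""
    logits = np.asarray(option_logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs.max())

def generative_confidence(token_logprobs):
    """Generative view: length-normalized likelihood of the generated answer."""
    return float(np.exp(np.mean(token_logprobs)))

# Toy example (hypothetical numbers): the same two answers scored both ways.
mc_confs = [mcqa_confidence([2.0, 0.1, -1.0, -1.5]), mcqa_confidence([0.3, 0.2, 0.1, 0.0])]
gen_confs = [generative_confidence([-0.2, -0.1, -0.3]), generative_confidence([-1.4, -0.9, -1.6])]
correct = [1, 0]  # first answer right, second wrong
print("ECE, multiple-choice confidences:", round(expected_calibration_error(mc_confs, correct), 3))
print("ECE, generative confidences:     ", round(expected_calibration_error(gen_confs, correct), 3))
```

The point of the sketch is that the definition of confidence itself changes between the two settings, so the same model can appear well calibrated when scored as a classifier over answer options and poorly calibrated when scored as an open-ended generator, or vice versa.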

Keywords

Multimodal Large Language Models

Benchmarks

Uncertainty Quantification