Tuesday, Aug 6: 10:30 AM - 12:20 PM
1057
Invited Paper Session
Oregon Convention Center
Room: CC-255
Applied: No
Main Sponsor
Section on Statistics in Defense and National Security
Co-Sponsors
Journal on Uncertainty Quantification
Section on Statistical Learning and Data Science
Presentations
In an ideal world, state-of-the-art machine learning techniques, such as deep neural networks, would provide accurate measures of uncertainty in addition to assurance that the many modeling choices leading to a final trained model have not led to suboptimal or misleading results. Bayesian neural networks offer built-in uncertainty quantification, essential for high-stakes decision-making and for building trust in the algorithm, but model choice is still an unsolved problem. While in practice much attention is paid to the choice of neural network architecture, the integrity of probabilistic predictions from Bayesian neural networks also rests on another key element: the prior distribution selected over the parameters. Recent work suggests that widely used default prior choices can lead to poor quantification of uncertainty, yet guidance for selecting priors, and the potential impact of different prior choices, remains severely understudied. We develop and implement Bayesian model selection methods for quantitatively assessing prior-model choices in Bayesian neural networks, with the ultimate goal of providing more reliable inference.
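As a rough, self-contained illustration of the kind of prior comparison at stake (not the authors' method), the sketch below ranks candidate prior scales for a toy one-hidden-layer Bayesian neural network by a naive Monte Carlo estimate of the marginal likelihood; the synthetic data, network size, and Gaussian prior family are all illustrative assumptions.

```python
# Minimal sketch (not the authors' method): comparing prior scales for a tiny
# Bayesian neural network by naive Monte Carlo estimation of the marginal
# likelihood p(D | prior). All names and settings here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data
x = np.linspace(-3, 3, 40)[:, None]
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(40)

def forward(x, w1, b1, w2, b2):
    """One-hidden-layer tanh network."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def log_marginal_likelihood(prior_scale, n_samples=5000, hidden=10, noise=0.1):
    """log p(D) under a N(0, prior_scale^2) prior, via a log-mean-exp of
    the Gaussian likelihood over prior draws (high variance, but simple)."""
    log_liks = np.empty(n_samples)
    for s in range(n_samples):
        w1 = prior_scale * rng.standard_normal((1, hidden))
        b1 = prior_scale * rng.standard_normal(hidden)
        w2 = prior_scale * rng.standard_normal((hidden, 1))
        b2 = prior_scale * rng.standard_normal(1)
        resid = y - forward(x, w1, b1, w2, b2).ravel()
        log_liks[s] = -0.5 * np.sum(resid**2) / noise**2 \
                      - len(y) * np.log(noise * np.sqrt(2 * np.pi))
    m = log_liks.max()
    return m + np.log(np.mean(np.exp(log_liks - m)))

# Rank candidate prior scales by estimated evidence
for scale in [0.1, 1.0, 10.0]:
    print(f"prior scale {scale:>5}: log evidence ~ {log_marginal_likelihood(scale):.1f}")
```

This naive estimator degrades quickly as the parameter dimension grows; it is only meant to make the idea of comparing priors by their evidence concrete.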
Multi-model ensemble analysis combines output from multiple climate models into a single projection. Recent work has shown that Gaussian processes (GPs) are effective tools for multi-model analysis but suffer rapidly increasing error rates as the test distribution diverges from the training distribution. To combat this, we propose a deep kernel learning model that combines a deep convolutional neural network (CNN) with a neural network Gaussian process (NNGP) to produce accurate, high-resolution projections even when the training and test distributions are highly dissimilar. To quantify the projection uncertainty, we develop a conformal prediction method, based on data depth, that generates prediction ensembles with exact coverage. We evaluate our method on monthly surface temperature data and show that it outperforms GP approaches in terms of spatial prediction accuracy and uncertainty quantification, without a commensurate increase in computational cost. Moreover, we show that the prediction and UQ errors grow much more slowly than those of GP approaches, leading to reduced uncertainty in far-future projections.
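To make the uncertainty quantification step concrete, the sketch below shows plain split conformal prediction intervals around a generic regression model; the depth-based ensemble construction described in the abstract is not reproduced here, and the data and model are illustrative stand-ins.

```python
# Minimal sketch (illustrative only): split conformal prediction intervals
# around a regression model. The abstract's method uses data depth to produce
# prediction ensembles; this shows only the basic split-conformal idea.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + 0.2 * rng.standard_normal(300)

# Split into proper training and calibration sets
x_tr, y_tr = x[:200], y[:200]
x_cal, y_cal = x[200:], y[200:]

model = GaussianProcessRegressor().fit(x_tr, y_tr)

# Conformity scores: absolute residuals on the calibration set
scores = np.abs(y_cal - model.predict(x_cal))

# (1 - alpha) quantile with the finite-sample correction
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction interval for a new point: marginal coverage >= 1 - alpha
x_new = np.array([[0.5]])
pred = model.predict(x_new)
print(f"interval: [{pred[0] - q:.3f}, {pred[0] + q:.3f}]")
```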
The gold standard for evaluating the quality of prediction intervals produced by data-driven classification models is to compute frequentist validity and efficiency statistics relative to a ground-truth oracle distribution. The oracle distribution is required to determine whether a predicted class-probability interval contains the true class probability, enabling accurate validity assessments. For real-world data there is no true oracle distribution, which leads us to ask whether a surrogate of the oracle model (SOM) can be used in its place to compute the same metrics. We investigate the feasibility of using SOMs when the underlying data distribution is unavailable. Specifically, we use generative methods to learn a distribution over the data and ask whether a SOM exists, how good the learned distribution is, and whether a SOM can be used to rank UQ-enabled models in lieu of the oracle model. Our experiments show that such a SOM does exist and that it provides coverage and validity estimates with small error relative to those computed from the true oracle model, effectively enabling real-world model selection based on UQ quality.
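The sketch below illustrates the basic comparison in a deliberately simple setting: a known two-Gaussian "oracle" gives exact class probabilities, a generative surrogate (here a Gaussian naive Bayes model, an illustrative stand-in) is fit to the same data, and coverage of hypothetical class-probability intervals is computed against both. All names and settings are assumptions, not the authors' experimental setup.

```python
# Minimal sketch (illustrative assumptions throughout): comparing coverage
# estimates computed against a true oracle vs. a learned surrogate oracle
# model (SOM) for class-probability intervals.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)

# "Oracle": two equal-prior Gaussian classes with known parameters
n = 2000
y = rng.integers(0, 2, n)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)[:, None]

def oracle_p1(x):
    """True P(y=1 | x) for equal-prior, unit-variance Gaussians at +/-1."""
    return 1.0 / (1.0 + np.exp(-2.0 * x.ravel()))

# Surrogate oracle: a generative model fit to the same data
som = GaussianNB().fit(x, y)
som_p1 = som.predict_proba(x)[:, 1]

# Hypothetical UQ-model output: probability intervals around a noisy estimate
center = np.clip(oracle_p1(x) + 0.05 * rng.standard_normal(n), 0, 1)
lo, hi = np.clip(center - 0.1, 0, 1), np.clip(center + 0.1, 0, 1)

cov_oracle = np.mean((lo <= oracle_p1(x)) & (oracle_p1(x) <= hi))
cov_som = np.mean((lo <= som_p1) & (som_p1 <= hi))
print(f"coverage vs oracle:    {cov_oracle:.3f}")
print(f"coverage vs surrogate: {cov_som:.3f}")
```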
In an era dominated by large-scale machine learning models, poor calibration severely limits the trustworthiness of their results. As we increasingly rely on complex systems, recalibration becomes essential: the objective is to find a mapping that adjusts the model's original probabilistic prediction to a new, more reliable one. We explore a broad class of recalibration functions based on learning the optimal step function with respect to a proper scoring rule. Using the continuous ranked probability score (CRPS) and applying predicted-mean binning, our approach outperforms the widely used quantile recalibration method in terms of both calibration and sharpness, while maintaining its simplicity. We apply our method to a case study of the Pinatubo eruption climate dataset using a convolutional neural network model with dropout.
Speaker
Feng Liang, University of Illinois at Urbana-Champaign
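For reference, the sketch below implements the widely used quantile recalibration baseline mentioned in the recalibration abstract above: a monotone map from predicted CDF levels to empirical frequencies is learned on held-out data. The CRPS-based step-function method of the talk is not reproduced, and all data and settings are illustrative.

```python
# Minimal sketch (illustrative): the quantile recalibration baseline --
# learn a monotone map from predicted CDF levels to empirical frequencies
# on a held-out set, then compose it with the predicted CDF.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from scipy.stats import norm

rng = np.random.default_rng(3)

# Miscalibrated probabilistic forecasts: predicted sigma is too small
n = 1000
mu, sigma_pred, sigma_true = rng.normal(size=n), 0.5, 1.0
y = mu + sigma_true * rng.standard_normal(n)

# Probability integral transform (PIT) values under the predicted CDFs
pit = norm.cdf(y, loc=mu, scale=sigma_pred)

# Recalibration map: predicted level p -> empirical frequency of PIT <= p
levels = np.linspace(0.01, 0.99, 99)
empirical = np.array([(pit <= p).mean() for p in levels])
recal = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
recal.fit(levels, empirical)

# The learned map approximates P(PIT <= p); composing it with the predicted
# CDF yields a better-calibrated (here, appropriately wider) forecast.
print("R(p) at p = 0.05, 0.5, 0.95:", recal.predict([0.05, 0.5, 0.95]))
```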
With the popularity of ChatGPT and the ever-improving performance of generative pretrained transformers (GPTs), large language models (LLMs) are being implemented in almost every information retrieval tool. Whether extracting specific entities from text (named entity recognition) or describing characteristics found within the text, LLMs are being tested on a wide variety of tasks. However, these models can "hallucinate" facts, presenting false results with great confidence. While retrieval-augmented generation (RAG) has been shown to reduce errors in the generated response, without an explicit quantification of uncertainty, human users of these systems are left to trust the output blindly. To this end, we review existing forms of uncertainty quantification for language models and highlight ways of calibrating a language model using methods such as Bayesian belief matching and conformal prediction. We end with a discussion of the challenges of moving toward multimodal foundation models.
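As one concrete example of the calibration tools mentioned, the sketch below builds split conformal prediction sets for a generic classifier; the synthetic softmax scores stand in for an LLM's class or answer probabilities and are purely illustrative.

```python
# Minimal sketch (illustrative): split conformal prediction sets for a
# generic classifier. The softmax scores below are synthetic stand-ins for a
# language model's class or answer probabilities; nothing here is tied to a
# specific model or API.
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic calibration data: logits plus the true label for each example
n_cal, n_classes = 500, 5
labels = rng.integers(0, n_classes, n_cal)
logits = rng.normal(size=(n_cal, n_classes))
logits[np.arange(n_cal), labels] += 2.0   # make the true class likelier
probs = softmax(logits)

# Conformity score: 1 - probability assigned to the true label
scores = 1.0 - probs[np.arange(n_cal), labels]

# Threshold at the corrected (1 - alpha) quantile of calibration scores
alpha = 0.1
qhat = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal,
                   method="higher")

# Prediction set for a new example: all labels with 1 - p <= qhat,
# giving marginal coverage >= 1 - alpha
new_probs = softmax(rng.normal(size=(1, n_classes)))
prediction_set = np.where(1.0 - new_probs[0] <= qhat)[0]
print("prediction set:", prediction_set)
```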