Promoting Interpretable, Trustworthy, and Ethical Machine Learning with Statistics

Yuchen Zhou Chair
University of Illinois Urbana-Champaign
 
Lili Zheng Organizer
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
0381 
Invited Paper Session 
Music City Center 
Room: CC-Davidson Ballroom B 

Keywords

Interpretable machine learning, fair machine learning, uncertainty quantification 

Applied

Yes

Main Sponsor

Section on Statistical Learning and Data Science

Co Sponsors

Section on Nonparametric Statistics

Presentations

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (OversamPling with Artificial LLM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data and shows the capacity of transformers to generate high-quality synthetic data for both labels and covariates. We further conduct extensive numerical experiments to demonstrate the efficacy of our proposed approach compared to several representative alternative solutions.
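As a rough illustration of the general oversampling recipe the abstract describes (append synthetic minority examples before fitting), here is a minimal Python sketch. The `generate_synthetic_rows` helper is a hypothetical stand-in for an LLM-based generator such as OPAL's; the sketch uses simple jittered resampling so it runs end to end and is not the method proposed in the talk.

```python
# Minimal sketch of synthetic oversampling for an imbalanced binary task.
# `generate_synthetic_rows` is a hypothetical placeholder for an LLM-based
# generator; here it just resamples minority rows with small Gaussian noise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def generate_synthetic_rows(X_minority, n_new, noise=0.1):
    # Placeholder generator: jittered resampling of observed minority rows.
    idx = rng.integers(0, len(X_minority), size=n_new)
    return X_minority[idx] + noise * rng.standard_normal((n_new, X_minority.shape[1]))

# Imbalanced toy data: 950 majority (y=0) vs. 50 minority (y=1) examples.
X_maj = rng.normal(0.0, 1.0, size=(950, 5))
X_min = rng.normal(1.0, 1.0, size=(50, 5))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(950), np.ones(50)])

# Oversample the minority class until the two classes are balanced.
n_new = int((y == 0).sum() - (y == 1).sum())
X_syn = generate_synthetic_rows(X[y == 1], n_new)
X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, np.ones(n_new)])

clf = LogisticRegression().fit(X_aug, y_aug)
print("balanced training size:", len(y_aug))
```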
 

Keywords

Large Language Models

Algorithmic Fairness 

Speaker

Linjun Zhang, Rutgers University

Targeted Learning Inference for Variable Importance and Fairness in Machine Learning

This talk develops inference methods for the consequences of machine learning (ML) models. While ML models were originally developed as purely predictive tools, there has been increasing interest in inspecting them as a means of gaining understanding about the relationships they uncover and about the consequences of deploying them in the real world. These questions have been addressed through the development of feature attribution methods and fairness assessments for specific models. However, neither provides uncertainty quantification about the corresponding aspects of the data-generating process.

In this talk we show that tools from targeted Machine Learning Estimation (tMLE) are naturally adaptable to these problems, and that doing so reveals the regularity of the proposed target. The development of these tools also illuminates the sources of uncertainty for these targets, allowing a discussion of which sources need to be accounted for in any given application.
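The targeted-learning construction itself is beyond the scope of a program abstract, but a minimal sketch of the kind of target it addresses may help: a LOCO-style variable-importance parameter (the expected increase in squared error when a feature is dropped) estimated with a simple held-out split and a normal-approximation interval. This is a naive baseline for illustration, not the tMLE estimator discussed in the talk; the data and variable names are invented for the example.

```python
# Naive baseline for inference on a variable-importance target: expected
# increase in squared error when feature j is dropped, estimated on a
# held-out split with a normal-approximation confidence interval.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p, j = 2000, 5, 0
X = rng.standard_normal((n, p))
y = 2.0 * X[:, j] + X[:, 1] + rng.standard_normal(n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

full = RandomForestRegressor(random_state=1).fit(X_tr, y_tr)
drop = RandomForestRegressor(random_state=1).fit(np.delete(X_tr, j, axis=1), y_tr)

# Per-observation loss differences on the held-out half.
d = (y_te - drop.predict(np.delete(X_te, j, axis=1))) ** 2 \
    - (y_te - full.predict(X_te)) ** 2
est, se = d.mean(), d.std(ddof=1) / np.sqrt(len(d))
print(f"importance of feature {j}: {est:.3f} +/- {1.96 * se:.3f}")
```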

Keywords

Interpretable Machine Learning

targeted learning

feature attribution

fairness

xAI 

Speaker

Giles Hooker, University of Pennsylvania

A model-agnostic ensemble framework with built-in LOCO feature importance inference

Interpretability and reliability are crucial desiderata when machine learning is applied in critical applications. However, generating interpretations and uncertainty quantification for black-box ML models often costs significant extra computation and held-out data. In this talk, I will introduce a novel ensemble framework in which one can simultaneously train a predictive model and obtain uncertainty quantification for its interpretation, in the form of leave-one-covariate-out (LOCO) feature importance. The framework is almost model-agnostic: it can be applied with any base model, for regression or classification tasks. Most notably, it avoids model refitting and data splitting, so uncertainty quantification incurs no extra computational or statistical cost. To ensure inference validity without data splitting, we address a number of challenges by leveraging the stability of the ensemble training process. I will discuss broad connections of this work to selective inference and to other model-agnostic feature importance inference methods, and demonstrate the framework on several real benchmark datasets.
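For intuition only, here is one possible ensemble-flavored route to LOCO importance. This is an assumption on my part, not necessarily the construction in the talk: each base learner is trained on a random subsample with one randomly chosen covariate dropped, and feature j's importance is read off by comparing out-of-bag errors of learners that dropped j with those that kept it, so no model is ever refit for a specific feature.

```python
# Sketch of an ensemble-style LOCO estimate (illustrative, not the talk's
# method): each base learner drops one random covariate; feature j's
# importance is the gap in out-of-bag error between learners that dropped j
# and learners that kept it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n, p, B = 1500, 4, 200
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.standard_normal(n)

oob_err_drop = {j: [] for j in range(p)}  # errors of learners that dropped j
oob_err_keep = {j: [] for j in range(p)}  # errors of learners that kept j

for b in range(B):
    drop_j = rng.integers(p)                         # covariate dropped by this learner
    idx = rng.choice(n, size=n // 2, replace=False)  # training subsample
    oob = np.setdiff1d(np.arange(n), idx)            # out-of-bag points
    cols = [c for c in range(p) if c != drop_j]
    tree = DecisionTreeRegressor(max_depth=4, random_state=b)
    tree.fit(X[np.ix_(idx, cols)], y[idx])
    err = np.mean((y[oob] - tree.predict(X[np.ix_(oob, cols)])) ** 2)
    oob_err_drop[drop_j].append(err)
    for j in cols:
        oob_err_keep[j].append(err)

for j in range(p):
    loco = np.mean(oob_err_drop[j]) - np.mean(oob_err_keep[j])
    print(f"feature {j}: ensemble LOCO estimate {loco:.3f}")
```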
 

Keywords

Feature interaction, random ensembles, uncertainty quantification 

Speaker

Lili Zheng

Powerful and valid interactive subgroup selection via machine learning

In regression and causal inference, controlled subgroup discovery aims to identify, with inferential guarantees, a subgroup (defined as a subset of the covariate space) on which the average response or treatment effect is above a given threshold. For example, in a clinical trial it may be of interest to find a subgroup with a positive average treatment effect. However, existing methods either lack inferential guarantees, heavily restrict the search for the subgroup, or sacrifice efficiency through naive data splitting. We propose a novel framework that allows the analyst to interactively refine and test a candidate subgroup by iteratively shrinking it. The sole restriction is that the shrinkage direction may depend only on the points outside the current subgroup; otherwise the analyst may leverage any prior information or machine learning algorithm. Despite this flexibility, our method controls the probability that the discovered subgroup is null (e.g., has a non-positive average treatment effect) under minimal assumptions: for example, in randomized experiments, our method controls the error rate under only bounded moment conditions. Empirically, our method identifies substantially better subgroups than existing methods with inferential guarantees. This is joint work with Nathan Cheng and Asher Spector.
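A heavily simplified sketch of the interactive shrinking loop follows. A candidate subgroup starts as the whole sample; each round fits a model on the points currently outside the subgroup and peels off the part where that model predicts the smallest effect. The naive one-sided t-test at the end is included only to make the example self-contained and does not carry the validity guarantee developed in this work.

```python
# Simplified illustration of interactive subgroup shrinking: the shrinkage
# rule looks only at points outside the current subgroup S, then a one-sided
# test (naive here) asks whether the mean response inside S exceeds a threshold.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, threshold = 2000, 0.0
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(X[:, 0] > 0, 1.0, -0.5) + rng.standard_normal(n)  # effect only when X0 > 0

in_S = np.ones(n, dtype=bool)
# Round 0: peel a random 20% so that some points sit outside S.
in_S[rng.choice(n, size=n // 5, replace=False)] = False

for _ in range(5):
    # Fit on outside points only, score the current subgroup, peel the worst 20%.
    model = LinearRegression().fit(X[~in_S], y[~in_S])
    scores = model.predict(X)
    inside_idx = np.flatnonzero(in_S)
    worst = inside_idx[np.argsort(scores[inside_idx])[: len(inside_idx) // 5]]
    in_S[worst] = False

res = stats.ttest_1samp(y[in_S], threshold, alternative="greater")
print(f"final subgroup size {in_S.sum()}, one-sided p-value {res.pvalue:.4f}")
```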

Keywords

Subgroup discovery

Causal inference

Interactive procedures 

Speaker

Lucas Janson, Harvard University