IMS Medallion Award & Lecture III

Elizaveta Levina, Chair
University of Michigan
 
Stefan Wager, Organizer
Stanford University
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
0253 
Invited Paper Session 
Music City Center 
Room: CC-Davidson Ballroom A1 

Applied


Main Sponsor

IMS

Presentations

Softmax mixture ensembles for interpretable latent discovery

Extracting latent information from complex data sets plays a central role in statistics, machine learning, and AI. This lecture will explore solutions to this problem for data that can be well modeled by an ensemble of discrete mixtures with shared mixture components. The first model of this type, the topic model, was originally proposed, four decades ago, for extracting semantically meaningful latent topics from a text corpus. The model is now used as an exploratory tool in virtually any scientific area where a model for a collection of multinomial samples with latent structure is of interest. The basic topic model formulation starts by associating with each observed multinomial sample its generating p-dimensional probability vector. The topic model assumption is that each of these vectors is a mixture of K latent, p-dimensional probability vectors that are common to the ensemble, with sample-specific mixture weights. In the original jargon, adopted in this lecture, a sample is a document, viewed as the vector of relative frequencies of its constituent words over a given vocabulary; the mixture components are the topics covered by the entire corpus; and the mixture weights are the proportions in which a document (sample) covers each topic. Estimation of this collection of non-parameterized, discrete mixture models enables the evaluation of a corpus (collection of samples) in terms of its latent topical content.
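
As a concrete illustration (not part of the lecture abstract), the generative formulation above can be simulated in a few lines. The sketch below assumes the common notation in which the K topics form the columns of a p x K matrix A and the document-specific weights form the columns of a K x n matrix W; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
p, K, n, N = 50, 3, 100, 200     # vocabulary size, topics, documents, words per document

# Columns of A are p-dimensional probability vectors: the K shared topics.
A = rng.dirichlet(np.ones(p), size=K).T          # shape (p, K)

# Each document i has its own mixture weights w_i on the K topics.
W = rng.dirichlet(np.ones(K), size=n).T          # shape (K, n)

# Document-level word distributions: each column of Pi = A W lies in the simplex.
Pi = A @ W                                       # shape (p, n)

# Observed data: each document is a multinomial sample of N words, recorded as
# relative word frequencies over the vocabulary of size p.
X = np.column_stack([rng.multinomial(N, Pi[:, i]) / N for i in range(n)])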


Part 1 of this lecture will briefly review methods with statistical guarantees for the basic topic model. Computationally efficient estimation of the model parameters (mixture weights and mixture components) reduces to performing non-negative matrix factorization under constraints. While many such constraints can be considered, I will present those that simultaneously allow for model identifiability, computational tractability, and minimax-rate optimal, interpretable topic estimation. Moreover, the cross-entropy mixture weight estimates, in any identifiable topic model, have a notable property: they automatically adapt to the unknown sparsity of the true mixture weights, without extra regularization. Furthermore, a one-step correction of these potentially sparse estimates allows for rigorous inference on the mixture weights. Together, these theoretically justified properties enable nuanced analyses not only at the corpus level, but also at the document level. However, both theory and practice suggest that the quality of estimation may deteriorate when p is very large. Moreover, by definition, the basic topic model cannot incorporate information on the support points of the p-dimensional probability vectors that are modeled as mixtures.
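
For orientation only, the sketch below shows what a cross-entropy (multinomial maximum likelihood) fit of a single document's mixture weights can look like, given an already-estimated topic matrix. It uses a generic simplex-constrained optimizer and is not the specific estimator or implementation discussed in the lecture; the names fit_weights, A_hat, and x are assumptions.

import numpy as np
from scipy.optimize import minimize

def fit_weights(x, A_hat):
    """Cross-entropy (multinomial MLE) fit of one document's mixture weights."""
    K = A_hat.shape[1]

    def neg_log_lik(w):
        # Cross-entropy between observed word frequencies x and model probabilities A_hat @ w.
        return -np.sum(x * np.log(A_hat @ w + 1e-12))

    w0 = np.full(K, 1.0 / K)
    res = minimize(
        neg_log_lik,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x

# Example usage with the simulated X and A from the sketch above:
# w_hat = fit_weights(X[:, 0], A)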


Part 2 of this lecture will offer solutions to these issues and cover contemporary extensions of this model, inspired by LLM technology, in which each mixture component (topic) has a softmax parameterization relative to a large collection of p feature vectors from an L-dimensional space, with L < p. Since a topic is in one-to-one correspondence with its L-dimensional softmax parameter, it is directly interpretable in the embedding space. I will discuss very recent theoretical and practical results on an EM algorithm developed for ensembles of softmax mixtures. In identifiable models, and under specific initialization schemes, the EM algorithm provably yields rate-optimal parameter estimates; the mixture weight estimates continue to enjoy the adaptation to sparsity established for the non-parameterized topic model. These theoretical results have immediate practical relevance, as estimation of softmax mixture ensembles can be used as a building block in the estimation of appropriate mixture-of-experts models developed for contextually embedded corpora.
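
A minimal sketch of the softmax parameterization described above, under assumed notation: V holds the p feature (embedding) vectors in R^L, and each topic k is identified with an L-dimensional parameter beta_k (a column of B). This is illustrative only and does not reproduce the lecture's EM algorithm.

import numpy as np

rng = np.random.default_rng(1)
p, L, K = 50, 8, 3

V = rng.normal(size=(p, L))      # feature/embedding vectors of the p vocabulary items
B = rng.normal(size=(L, K))      # column beta_k is the L-dimensional parameter of topic k

def softmax(scores, axis=0):
    scores = scores - scores.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

# Each topic is the p-dimensional probability vector softmax(V @ beta_k),
# so it is determined by, and interpretable through, its L-dimensional parameter.
A_soft = softmax(V @ B, axis=0)                               # shape (p, K)

# A document with mixture weights w again has word distribution A_soft @ w,
# so the ensemble remains a collection of discrete mixtures with shared components.
w = rng.dirichlet(np.ones(K))
doc_distribution = A_soft @ w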

Just as the softmax parameterization has come to occupy a prominent role in LLM-generated output, ensembles of softmax mixtures can become important vehicles for learning and interpreting the topical richness explored by LLM algorithms in generating such output.

Speaker

Florentina Bunea, Cornell University