Learning Identifiable and Interpretable Latent Representations with Applications to Social and Biomedical Sciences

Zhiyu Xu, Chair
Columbia University
 
Yuqi Gu, Organizer
Columbia University
 
Thursday, Aug 7: 10:30 AM - 12:20 PM
0858 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-207D 

Applied

Yes

Main Sponsor

Section on Statistical Learning and Data Science

Co Sponsors

Biometrics Section
Social Statistics Section

Presentations

Deep Discrete Encoders: Identifiable Deep Generative Models for Rich Data with Discrete Latent Layers

In the era of generative AI, deep generative models (DGMs) with latent representations have gained tremendous popularity. Despite their impressive empirical performance, the statistical properties of these models remain underexplored. DGMs are often overparametrized, non-identifiable, and uninterpretable black boxes, raising serious concerns when deploying them in high-stakes applications. Motivated by this, we propose interpretable deep generative models for rich data types with discrete latent layers, called Deep Discrete Encoders (DDEs). A DDE is a directed graphical model with multiple binary latent layers. Theoretically, we propose transparent identifiability conditions for DDEs, which imply progressively smaller sizes of the latent layers as they go deeper. Identifiability ensures consistent parameter estimation and inspires an interpretable design of the deep architecture. Computationally, we propose a scalable estimation pipeline: a layerwise nonlinear spectral initialization followed by a penalized stochastic approximation EM algorithm. This procedure can efficiently estimate models with exponentially many latent components. Extensive simulation studies for high-dimensional data and deep architectures validate our theoretical results and demonstrate the excellent performance of our algorithms. We apply DDEs to three diverse real datasets with different data types to perform hierarchical topic modeling, image representation learning, and response time modeling in educational testing.
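To make the deep discrete structure concrete, the minimal sketch below simulates from a two-latent-layer model of the kind described in the abstract, with binary layers that shrink in size as they go deeper. The logistic link, layer sizes, and weight matrices are illustrative assumptions, not the authors' specification or estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes: deeper latent layers are progressively smaller,
# mirroring the identifiability requirement described in the abstract.
K2, K1, p, n = 3, 8, 100, 500   # deepest layer, middle layer, observed dim, sample size

# Hypothetical connection weights between layers (not the authors' values).
W2 = rng.normal(size=(K2, K1))  # deepest latent layer -> middle latent layer
W1 = rng.normal(size=(K1, p))   # middle latent layer -> observed layer

# Top-down sampling through binary latent layers with a logistic link.
Z2 = rng.binomial(1, 0.5, size=(n, K2))   # deepest binary latent layer
Z1 = rng.binomial(1, sigmoid(Z2 @ W2))    # middle binary latent layer
X = rng.binomial(1, sigmoid(Z1 @ W1))     # observed binary responses

print(X.shape, Z1.mean(), Z2.mean())
```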

Co-Author

Seunghyun Lee

Speaker

Yuqi Gu, Columbia University

Entrywise Error Analysis and Uncertainty Quantification in Heterogeneous Preference Learning from Revealed Choice Behavior

This paper studies human preference learning based on partially revealed choice behavior. We formulate the problem as a generalized Bradley–Terry–Luce (BTL) ranking model that accounts for heterogeneous preferences. Specifically, we assume that each user is associated with a nonparametric preference function, and each item is characterized by a low-dimensional latent feature vector — their interaction defines the underlying score matrix.

In this formulation, we propose an indirect regularization method for collaboratively learning the score matrix, which ensures entrywise error control, a novel contribution to the heterogeneous preference learning literature. This technique is based on sieve approximation and can be extended to a broader class of binary response models where a smooth link function is adopted. In addition, by applying a single step of the Newton–Raphson method, we debias the regularized estimator and establish uncertainty quantification for item scores and rankings, for both aggregated and individual preferences. Extensive numerical results on synthetic and real datasets corroborate our theoretical findings.
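As a concrete illustration of the model setup, the sketch below builds a score matrix from a low-dimensional interaction between user-specific weights and item features, and converts score differences into BTL-type pairwise choice probabilities. The parametric form of the user preference map, the latent dimension, and all sizes are illustrative assumptions standing in for the nonparametric formulation in the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_users, n_items, r = 50, 30, 3  # illustrative sizes and latent dimension

# Hypothetical low-dimensional item features and user-specific preference maps.
item_features = rng.normal(size=(n_items, r))
user_weights = rng.normal(size=(n_users, r))  # stand-in for nonparametric preference functions

# Score matrix: interaction of user preferences and item features.
S = user_weights @ item_features.T            # shape (n_users, n_items)

def choice_prob(S, u, i, j):
    """BTL-type probability that user u chooses item i over item j."""
    return sigmoid(S[u, i] - S[u, j])

# Observed pairwise choices are Bernoulli draws from these probabilities.
p = choice_prob(S, u=0, i=2, j=7)
y = rng.binomial(1, p)
print(f"P(user 0 prefers item 2 over item 7) = {p:.3f}, observed choice = {y}")
```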

Speaker

Hyukjun Kwon, Princeton University

Identifiability and Inference for Generalized Latent Factor Models

Generalized latent factor analysis not only provides a useful latent embedding approach in statistics and machine learning, but also serves as a widely used tool across various scientific fields, such as psychometrics, econometrics, and social sciences. Ensuring the identifiability of latent factors and the loading matrix is essential for the model's estimability and interpretability, and various identifiability conditions have been employed by practitioners. However, fundamental statistical inference issues for latent factors and factor loadings under commonly used identifiability conditions remain largely unaddressed, especially for correlated factors and/or non-orthogonal loading matrices. In this work, we focus on maximum likelihood estimation for generalized factor models and establish statistical inference properties under commonly used identifiability conditions. The developed theory is further illustrated through numerical simulations and an application to a personality assessment dataset.
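For concreteness, the sketch below writes down the joint negative log-likelihood of a logistic-link generalized factor model of the kind considered here. The link, dimensions, and the omission of an explicit identifiability constraint (for example, requiring the loading matrix to have orthogonal columns or a fixed triangular block) are illustrative simplifications, not the authors' exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n, p, K = 1000, 20, 2            # respondents, items, latent factors (illustrative)

theta = rng.normal(size=(n, K))  # latent factor scores
A = rng.normal(size=(p, K))      # factor loading matrix
d = rng.normal(size=p)           # item intercepts

# Binary responses from a logistic-link generalized factor model.
Y = rng.binomial(1, sigmoid(theta @ A.T + d))

def neg_loglik(theta, A, d, Y):
    """Joint negative log-likelihood under the logistic link.
    In practice this would be minimized subject to an identifiability constraint."""
    M = theta @ A.T + d
    return -np.sum(Y * M - np.log1p(np.exp(M)))

print(neg_loglik(theta, A, d, Y))
```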
 

Co-Author

Chengyu Cui

Speaker

Gongjun Xu, University of Michigan

Identifiable Nonlinear Group Factor Analysis

Given data from multiple groups or environments, one goal is to understand which underlying factors of variation are common to all groups, and which factors are group-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In this data, factors correspond to clusters of genes which are expressed together; we may expect some clusters (or biological pathways) to be active in all diseases, while some are only active in a specific disease. To learn these factors, we consider a nonlinear multi-group factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-group sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the shared factors are identified, and demonstrate our method recovers meaningful factors in the platelet gene expression data. 
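A minimal generative sketch of a multi-group sparse factor model appears below: each observed feature loads on only a few latent factors through a fixed sparsity mask, and group-specific factors are active only within their own group. The tanh decoder, mask probability, and all dimensions are illustrative stand-ins for the learned multi-group sparse variational autoencoder described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(3)

n_per_group, p = 200, 30                 # illustrative sample size per group and feature count
K_shared, K_spec, n_groups = 2, 1, 3     # shared and group-specific factor counts
K = K_shared + K_spec * n_groups

# Sparsity mask: each observed feature depends on only a few latent factors.
mask = rng.binomial(1, 0.2, size=(p, K))

# Hypothetical decoder weights; a sparse VAE would learn these jointly with the mask.
W = rng.normal(size=(p, K)) * mask

def decode(z):
    """Simple nonlinear decoder: masked linear map followed by a tanh nonlinearity."""
    return np.tanh(z @ W.T)

X = []
for g in range(n_groups):
    z = np.zeros((n_per_group, K))
    z[:, :K_shared] = rng.normal(size=(n_per_group, K_shared))           # factors shared by all groups
    start = K_shared + g * K_spec
    z[:, start:start + K_spec] = rng.normal(size=(n_per_group, K_spec))  # factors active only in group g
    X.append(decode(z) + 0.1 * rng.normal(size=(n_per_group, p)))        # observed features with noise

X = np.vstack(X)
print(X.shape)
```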

Speaker

Gemma Moran, Rutgers University

Response-on-Process Regression: A Statistical Model Describing Temporal Effects of Actions in Response Processes

Response process data from computer-based problem-solving items capture respondents' problem-solving processes as timestamped sequences of actions. These data provide a valuable source for understanding the dynamics of problem-solving behaviors and their relationships with other variables. Due to the nonstandard format of response processes, analyzing such relationships typically involves a two-step approach: first, behavioral features are extracted from response processes using experts' knowledge or data-driven methods; then statistical or machine learning tools, such as linear or logistic regression, are applied to describe the relationship between these features and the variable of interest. In this work, we propose the response-on-process regression model, which directly links the timing of taking an action in the response process to the response variable. Unlike traditional two-step approaches, this model offers a coherent framework to characterize the temporal effects of individual actions on the response variable, providing a more nuanced understanding of problem-solving behaviors. This model also facilitates rigorous statistical inference. We demonstrate the performance of the proposed model through simulation studies and empirical analysis of PISA process data. 
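One plausible reading of "directly links the timing of taking an action to the response variable" is that each action contributes to a logistic model for the response through an action-specific, time-varying effect curve. The sketch below is purely illustrative under that assumption: the action names, effect curves, and intercept are made up for exposition and do not reproduce the proposed model's actual specification.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def time_effect(action, t):
    """Illustrative time-varying effect of an action taken at time t (minutes).
    In practice such curves would be estimated, e.g., with a spline basis."""
    curves = {
        "check_help": lambda t: 0.8 * np.exp(-t / 2.0),  # hypothetical: helpful mainly when done early
        "reset":      lambda t: -0.3 + 0.1 * t,          # hypothetical: increasingly harmful when done late
    }
    return curves[action](t)

def response_prob(process):
    """Probability of a correct response given a timestamped action sequence."""
    eta = -0.5 + sum(time_effect(a, t) for a, t in process)
    return sigmoid(eta)

# A hypothetical response process: (action, timestamp in minutes) pairs.
process = [("check_help", 0.5), ("reset", 3.0)]
p = response_prob(process)
print(f"P(correct) = {p:.3f}, simulated response = {rng.binomial(1, p)}")
```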

Speaker

Xueying Tang, University of Arizona