Thursday, Aug 7: 10:30 AM - 12:20 PM
0858
Topic-Contributed Paper Session
Music City Center
Room: CC-207D
Applied: Yes
Main Sponsor: Section on Statistical Learning and Data Science
Co-Sponsors: Biometrics Section, Social Statistics Section
Presentations
In the era of generative AI, deep generative models (DGMs) with latent representations have gained tremendous popularity. Despite their impressive empirical performance, the statistical properties of these models remain underexplored. DGMs are often overparametrized, non-identifiable, and uninterpretable black boxes, which raises serious concerns about deploying them in high-stakes applications. Motivated by these concerns, we propose interpretable deep generative models for rich data types with discrete latent layers, called Deep Discrete Encoders (DDEs). A DDE is a directed graphical model with multiple binary latent layers. Theoretically, we propose transparent identifiability conditions for DDEs, which imply that the latent layers become progressively smaller as they go deeper. Identifiability ensures consistent parameter estimation and inspires an interpretable design of the deep architecture. Computationally, we propose a scalable estimation pipeline consisting of a layerwise nonlinear spectral initialization followed by a penalized stochastic approximation EM algorithm. This procedure can efficiently estimate models with exponentially many latent components. Extensive simulation studies for high-dimensional data and deep architectures validate our theoretical results and demonstrate the excellent performance of our algorithms. We apply DDEs to three real datasets with different data types to perform hierarchical topic modeling, image representation learning, and response time modeling in educational testing.
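To make the layered structure concrete, here is a minimal sketch of drawing one sample from a toy directed model with two binary latent layers that shrink with depth (sizes 2 and 4) above 8 observed variables. The logistic conditionals, the binary observed layer, the layer sizes, and the names `sample_layer`, `sample_dde`, and `random_layer` are all illustrative assumptions; they are not taken from the abstract, which does not specify the conditional distributions or the estimation code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_layer(parents, weights, biases, rng):
    """Sample binary children given binary parents (assumed logistic conditionals)."""
    children = []
    for w_row, b in zip(weights, biases):
        p = sigmoid(b + sum(w * a for w, a in zip(w_row, parents)))
        children.append(1 if rng.random() < p else 0)
    return children


def sample_dde(top_probs, layers, rng):
    """Draw the deepest latent layer, then sample each layer below it in turn.

    `layers` is a list of (weights, biases) pairs ordered from deepest to
    shallowest; the last pair generates the observed layer.
    Returns every layer of the draw, deepest first and observed last.
    """
    current = [1 if rng.random() < p else 0 for p in top_probs]
    path = [current]
    for weights, biases in layers:
        current = sample_layer(current, weights, biases, rng)
        path.append(current)
    return path


def random_layer(n_children, n_parents, rng):
    """Hypothetical random parameters, used only to make the sketch runnable."""
    weights = [[rng.gauss(0, 2) for _ in range(n_parents)] for _ in range(n_children)]
    biases = [rng.gauss(0, 1) for _ in range(n_children)]
    return weights, biases


rng = random.Random(0)
sizes = [2, 4, 8]  # deepest latent layer, shallower latent layer, observed layer
layers = [random_layer(sizes[i + 1], sizes[i], rng) for i in range(len(sizes) - 1)]
draw = sample_dde([0.5] * sizes[0], layers, rng)
```

Each entry of `draw` is one layer of the sampled path. Only the last entry would be observed in practice, and the latent layers above it are what the estimation procedure described in the abstract would have to recover.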
This paper studies human preference learning based on partially revealed choice behavior. We formulate the problem as a generalized Bradley–Terry–Luce (BTL) ranking model that accounts for heterogeneous preferences. Specifically, we assume that each user is associated with a nonparametric preference function and each item is characterized by a low-dimensional latent feature vector; their interaction defines the underlying score matrix.
In this formulation, we propose an indirect regularization method for collaboratively learning the score matrix. The method ensures entrywise error control, a novel contribution to the heterogeneous preference learning literature. The technique is based on sieve approximation and can be extended to a broader class of binary response models with a smooth link function. In addition, by applying a single step of the Newton–Raphson method, we debias the regularized estimator and establish uncertainty quantification for item scores and rankings, for both aggregated and individual preferences. Extensive numerical results on synthetic and real datasets corroborate our theoretical findings.
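The model setup described above can be illustrated with a small sketch: each user applies their own preference function to the item features, which fills in a user-by-item score matrix, and a BTL-type comparison probability is the logistic function of a score difference. The one-dimensional item features, the three toy preference functions, and the names `preference_prob` and `rank_items` are assumptions made for illustration. The sketch shows only the model, not the regularized estimator or the debiasing step.

```python
import math


def preference_prob(score_i, score_j):
    """BTL-type probability that the item with score_i is preferred to the one with score_j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


# Hypothetical one-dimensional latent features for four items.
item_features = [-1.0, 0.0, 0.5, 2.0]

# Hypothetical user-specific preference functions (heterogeneous preferences).
user_functions = [
    lambda z: 2.0 * z,
    lambda z: -z * z,
    lambda z: math.sin(3.0 * z),
]

# Score matrix: rows are users, columns are items.
score_matrix = [[f(z) for z in item_features] for f in user_functions]


def rank_items(scores):
    """Return item indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


rankings = [rank_items(row) for row in score_matrix]

# Probability that user 0 prefers item 3 over item 0 under this toy model.
p_user0 = preference_prob(score_matrix[0][3], score_matrix[0][0])
```

Because the three users apply different functions to the same item features, their rankings differ, which is the heterogeneity the generalized model is meant to capture.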
Generalized latent factor analysis not only provides a useful latent embedding approach in statistics and machine learning, but also serves as a widely used tool across various scientific fields, such as psychometrics, econometrics, and the social sciences. Ensuring the identifiability of latent factors and the loading matrix is essential for the model's estimability and interpretability, and various identifiability conditions have been employed by practitioners. However, fundamental statistical inference issues for latent factors and factor loadings under commonly used identifiability conditions remain largely unaddressed, especially for correlated factors and/or a non-orthogonal loading matrix. In this work, we focus on maximum likelihood estimation for generalized factor models and establish statistical inference properties under commonly used identifiability conditions. The developed theory is further illustrated through numerical simulations and an application to a personality assessment dataset.
Given data from multiple groups or environments, one goal is to understand which underlying factors of variation are common to all groups and which are group-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In these data, factors correspond to clusters of genes that are expressed together; we may expect some clusters (or biological pathways) to be active in all diseases, while others are active only in a specific disease. To learn these factors, we consider a nonlinear multi-group factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-group sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the shared factors are identified, and demonstrate that our method recovers meaningful factors in the platelet gene expression data.
Response process data from computer-based problem-solving items capture respondents' problem-solving processes as timestamped sequences of actions. These data provide a valuable source for understanding the dynamics of problem-solving behaviors and their relationships with other variables. Due to the nonstandard format of response processes, analyzing such relationships typically involves a two-step approach: first, behavioral features are extracted from response processes using expert knowledge or data-driven methods; then, statistical or machine learning tools, such as linear or logistic regression, are applied to describe the relationship between these features and the variable of interest. In this work, we propose the response-on-process regression model, which directly links the timing of each action in the response process to the response variable. Unlike traditional two-step approaches, this model offers a coherent framework to characterize the temporal effects of individual actions on the response variable, providing a more nuanced understanding of problem-solving behaviors. The model also facilitates rigorous statistical inference. We demonstrate the performance of the proposed model through simulation studies and an empirical analysis of PISA process data.