Sunday, Aug 3: 4:00 PM - 5:50 PM
4029
Contributed Papers
Music City Center
Room: CC-102B
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Modeling the complex relationships between multiple categorical response variables as a function of predictors is a fundamental task in the analysis of categorical data. However, existing methods can be difficult to interpret and may lack flexibility. To address these challenges, we introduce a penalized likelihood method for multivariate categorical response regression that relies on a novel subspace decomposition to parameterize interpretable association structures. Our approach models the relationships between categorical responses by identifying mutual, joint, and conditionally independent associations, which yields a linear problem within a tensor product space. We establish theoretical guarantees for our estimator, including error bounds in high-dimensional settings, and demonstrate the method's interpretability and prediction accuracy through comprehensive simulation studies.
Keywords
multinomial logistic regression
categorical data analysis
log-linear models
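As a point of reference, the following is a minimal numpy sketch of penalized multinomial logistic regression fit by proximal gradient descent with a lasso penalty. It is only a generic building block for penalized likelihood estimation with categorical responses; the subspace decomposition into mutual, joint, and conditionally independent associations described in the abstract is not implemented here, and all dimensions and tuning values are illustrative.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_penalized_multinomial(X, Y, lam=0.05, step=0.1, n_iter=1000):
    """X: (n, p) predictors; Y: (n, K) one-hot responses; lam: l1 penalty weight."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))                 # coefficient matrix
    for _ in range(n_iter):
        P = softmax(X @ B)                        # fitted category probabilities
        grad = X.T @ (P - Y) / n                  # gradient of the average NLL
        B -= step * grad                          # gradient step
        B = np.sign(B) * np.maximum(np.abs(B) - step * lam, 0.0)   # soft-threshold
    return B

# toy usage with a sparse true coefficient matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_B = np.zeros((5, 3)); true_B[0, 0] = 2.0; true_B[1, 2] = -2.0
P_true = softmax(X @ true_B)
Y = np.eye(3)[np.array([rng.choice(3, p=row) for row in P_true])]
print(fit_penalized_multinomial(X, Y).round(2))
```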
We introduce a unified, flexible, and easy-to-implement framework for sufficient dimension reduction that accommodates both linear and nonlinear dimension reduction, with either the conditional distribution or the conditional mean as the target of estimation. This unified framework is achieved by a specially structured neural network, the Belted and Ensembled Neural Network (BENN), consisting of a narrow latent layer, which we call the belt, and a family of transformations of the response, which we call the ensemble. By strategically placing the belt at different layers of the network, we can achieve linear or nonlinear sufficient dimension reduction, and by choosing the appropriate transformation family, we can target either the conditional distribution or the conditional mean. Moreover, because it is built on a neural network, the method is fast to compute, overcoming a computational bottleneck of traditional sufficient dimension reduction estimators, which require inverting a matrix of dimension p or n. We develop the algorithm, establish its convergence rate, and compare the method with existing SDR methods.
Keywords
Autoencoder
Convergence rate
Covering numbers
Deep learning
Probability characterizing family
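Below is a small numpy sketch, under one reading of the abstract, of the "belt plus ensemble" idea: a narrow linear layer placed right after the input yields linear sufficient dimension reduction, and the network is trained to predict a family of transformations of the response stacked column-wise. The layer sizes, the Gaussian-bump ensemble, and the plain full-batch gradient descent loop are illustrative choices, not the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d, h = 500, 10, 2, 32               # samples, predictors, belt width, hidden width
X = rng.normal(size=(n, p))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# ensemble: transformations of the response (here a few Gaussian bumps)
centers = np.quantile(Y, [0.2, 0.4, 0.6, 0.8])
F = np.exp(-(Y[:, None] - centers[None, :]) ** 2)      # (n, m) targets

W0 = 0.1 * rng.normal(size=(p, d))        # the "belt": p -> d linear reduction
W1 = 0.1 * rng.normal(size=(d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.normal(size=(h, F.shape[1])); b2 = np.zeros(F.shape[1])

lr = 0.05
for _ in range(2000):                     # full-batch gradient descent
    Z = X @ W0                            # belt activations = reduced predictors
    H = np.tanh(Z @ W1 + b1)
    P = H @ W2 + b2                       # one output per ensemble member
    G = 2 * (P - F) / n                   # gradient of the mean squared error
    dW2, db2 = H.T @ G, G.sum(axis=0)
    dA = (G @ W2.T) * (1 - H ** 2)        # backpropagate through tanh
    dW1, db1 = Z.T @ dA, dA.sum(axis=0)
    dW0 = X.T @ (dA @ W1.T)
    W0 -= lr * dW0; W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# the column space of W0 estimates the reduction subspace
# (here the truth is spanned by the first two coordinates of X)
print(W0.round(2))
```

Placing the belt after one or more nonlinear layers instead would give a nonlinear reduction, and replacing the bump ensemble with the response itself would target the conditional mean rather than the conditional distribution.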
The Grade of Membership (GoM) model is a popular individual-level mixture model for multivariate categorical data such as survey responses. In modern data collection, numerous covariates are often gathered alongside the target response data, many of which share a similar latent structure. To leverage this covariate information for improved estimation of the latent structure of the target data, we introduce Covariate-assisted Grade of Membership (CoGoM) models and develop an efficient estimation algorithm based on spectral methods. For model identifiability, we establish a sufficient condition that is weaker than in the covariate-free case. As a theoretical guarantee, we establish consistency in high-dimensional settings, demonstrating how incorporating covariates aids estimation of the latent structure. In simulation studies, the proposed method outperforms traditional approaches in both computational efficiency and estimation accuracy. Finally, we demonstrate the method by applying it to a dataset from the Trends in International Mathematics and Science Study (TIMSS).
Keywords
Grade of Membership Model
Identifiability
Sequential Projection Algorithm
Covariate Assistance
Spectral Method
Co-Author
Yuqi Gu, Columbia University
First Author
Zhiyu Xu, Columbia University
Presenting Author
Zhiyu Xu, Columbia University
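For readers unfamiliar with the spectral ingredients mentioned above, here is a compact numpy sketch of a generic spectral pipeline for GoM-type mixed-membership estimation: a truncated SVD of the response matrix, a successive projection step to locate near pure-type individuals, and a barycentric-coordinate step for the memberships. It does not use covariates and is not the CoGoM algorithm itself; the function names and toy data are illustrative.

```python
import numpy as np

def successive_projection(V, K):
    """Pick K rows of V that act as simplex vertices (near pure-type individuals)."""
    R, idx = V.copy(), []
    for _ in range(K):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))   # farthest remaining row
        idx.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R = R - np.outer(R @ u, u)                      # project that direction out
    return np.array(idx)

def spectral_gom(Y, K):
    """Y: (n, d) binary response matrix; K: number of extreme profiles."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    V = U[:, :K] * s[:K]                                # rank-K row embeddings
    vert = successive_projection(V, K)
    # memberships: barycentric coordinates of each row w.r.t. the chosen vertices
    Pi = np.linalg.lstsq(V[vert].T, V.T, rcond=None)[0].T
    Pi = np.clip(Pi, 0, None)
    return Pi / (Pi.sum(axis=1, keepdims=True) + 1e-12), vert

# toy usage: 300 individuals mixing K = 2 extreme response-probability profiles
rng = np.random.default_rng(0)
Pi0 = rng.dirichlet([0.3, 0.3], size=300)
Theta = np.array([[0.9] * 10 + [0.1] * 10, [0.1] * 10 + [0.9] * 10])
Y = rng.binomial(1, Pi0 @ Theta)
Pi_hat, _ = spectral_gom(Y, K=2)
```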
Federated learning enables the analysis of multi-site real-world data (RWD) while preserving data privacy, yet challenges persist due to heterogeneous modality availability and distribution shifts across sites. In this work, we develop a novel federated multimodal learning framework to improve causal inference in distributed research networks (DRNs), integrating electronic health records (EHRs) and genetic biomarkers. Traditional methods often fail to account for structural missingness and site-specific heterogeneity, leading to biased estimates of treatment effects.
To address this, we propose a new statistical framework that accounts for population distribution shifts across sites while improving efficiency and correcting bias by leveraging information from all available modalities across sites. In addition, we employ multiple negative control outcomes to calibrate estimates and mitigate residual systematic biases, including bias from unmeasured confounding.
Keywords
Causal inference
Negative control outcomes
Average treatment effect
Bias Correction
Multi-Modality
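As a rough illustration of how negative control outcomes can calibrate a treatment-effect estimate, the numpy sketch below computes an inverse-probability-weighted ATE for the primary outcome and for outcomes whose true effect is known to be zero, then subtracts the average spurious effect. This simple additive calibration only conveys the general idea; the federated, multimodal, and distribution-shift components of the proposed framework are not represented, and the toy data and deliberately misspecified propensity are contrived.

```python
import numpy as np

def ipw_ate(y, a, ps):
    """Inverse-probability-weighted ATE for outcome y, binary treatment a, propensity ps."""
    return np.mean(a * y / ps) - np.mean((1 - a) * y / (1 - ps))

def nco_calibrated_ate(y, a, ps, negative_controls):
    """negative_controls: outcomes with a known null treatment effect."""
    bias = np.mean([ipw_ate(nc, a, ps) for nc in negative_controls])
    return ipw_ate(y, a, ps) - bias        # crude additive calibration

# toy usage: a confounder u is ignored by the (misspecified) propensity model
rng = np.random.default_rng(0)
u = rng.normal(size=5000)
a = rng.binomial(1, 1 / (1 + np.exp(-u)))
y = 1.0 * a + u + rng.normal(size=5000)          # true ATE = 1
nco = u + rng.normal(size=5000)                  # affected by u, not by treatment
ps_bad = np.full(5000, a.mean())                 # constant propensity -> confounding bias
print(ipw_ate(y, a, ps_bad), nco_calibrated_ate(y, a, ps_bad, [nco]))
```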
Multi-modal data fusion has become increasingly critical for enhancing the predictive power of machine learning methods across diverse fields, from autonomous driving to medical diagnosis. Traditional fusion methods—early fusion, intermediate fusion, and late fusion—approach data integration differently, each with distinct advantages and limitations. In this paper, we introduce Meta-Fusion, a flexible and principled framework that unifies these existing approaches as special cases. Drawing inspiration from mutual deep learning and ensemble learning, Meta-Fusion constructs a cohort of models based on various combinations of latent representations across modalities, and further enhances predictive performance through soft information sharing within the cohort. Our approach is model-agnostic in learning the latent representations, allowing it to flexibly adapt to the unique characteristics of each modality. Theoretically, our soft information sharing mechanism effectively reduces the generalization error. Empirically, Meta-Fusion consistently outperforms conventional fusion strategies in extensive synthetic experiments. We further validate our approach on real-world applications, including Alzheimer's disease detection and brain activity analysis.
Keywords
multi-modality fusion
deep mutual learning
ensemble learning
soft information sharing
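To make the fusion taxonomy concrete, here is a toy numpy sketch contrasting early fusion, late fusion, and a cohort built over all combinations of modalities, which is the spirit in which Meta-Fusion unifies them. Ridge regression stands in for the modality-specific learners, raw features stand in for learned latent representations, and the soft information-sharing step of Meta-Fusion is not reproduced; everything here is an illustrative simplification.

```python
import numpy as np
from itertools import combinations

def ridge_fit_predict(Xtr, ytr, Xte, lam=1.0):
    """Closed-form ridge regression used as a stand-in learner."""
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)
    return Xte @ w

def early_fusion(mods_tr, y, mods_te):
    """Concatenate all modalities, then fit a single model."""
    return ridge_fit_predict(np.hstack(mods_tr), y, np.hstack(mods_te))

def late_fusion(mods_tr, y, mods_te):
    """Fit one model per modality, then average the predictions."""
    return np.mean([ridge_fit_predict(Xtr, y, Xte)
                    for Xtr, Xte in zip(mods_tr, mods_te)], axis=0)

def cohort_fusion(mods_tr, y, mods_te):
    """Average over models built on every nonempty combination of modalities."""
    preds = []
    for r in range(1, len(mods_tr) + 1):
        for idx in combinations(range(len(mods_tr)), r):
            preds.append(ridge_fit_predict(np.hstack([mods_tr[i] for i in idx]), y,
                                           np.hstack([mods_te[i] for i in idx])))
    return np.mean(preds, axis=0)

# toy usage with three synthetic modalities
rng = np.random.default_rng(0)
mods = [rng.normal(size=(200, 5)) for _ in range(3)]
y = mods[0][:, 0] + mods[1][:, 1] + 0.1 * rng.normal(size=200)
tr, te = slice(0, 150), slice(150, 200)
print(cohort_fusion([m[tr] for m in mods], y[tr], [m[te] for m in mods])[:5].round(2))
```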
Compositional data, also referred to as simplicial data, naturally arise in many scientific domains such as geochemistry, microbiology, and economics. In such domains, obtaining sensible lower-dimensional representations and modes of variation plays an important role. A typical approach to the problem is to apply a log-ratio transformation followed by principal component analysis (PCA). However, this approach has several notable weaknesses: it amplifies variation in minor variables and obscures variation in major ones, is not directly applicable to data sets containing zeros, and has limited ability to capture linear patterns. We propose novel methods that produce nested sequences of simplices of decreasing dimension using the backwards principal component analysis framework. These nested sequences offer both interpretable lower-dimensional representations and linear modes of variation. In addition, our methods are applicable to data sets containing zeros without any modification. We demonstrate our methods on simulated data and on relative abundances of diatom species during the late Pliocene.
Keywords
Modes of variation
Backwards approach
Nested relations
Compositional data
Paleoceanography
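For comparison, the following is a short numpy sketch of the baseline the abstract critiques: a centered log-ratio (clr) transform followed by ordinary PCA, with an ad hoc pseudo-count added so that zeros do not break the logarithm. The proposed backwards-PCA methods are not implemented here; the pseudo-count value and component count are arbitrary choices.

```python
import numpy as np

def clr_pca(P, n_components=2, pseudo=1e-6):
    """P: (n, d) compositions with rows summing to one."""
    P = (P + pseudo) / (P + pseudo).sum(axis=1, keepdims=True)   # zero replacement
    L = np.log(P)
    C = L - L.mean(axis=1, keepdims=True)        # centered log-ratio transform
    C = C - C.mean(axis=0, keepdims=True)        # center across samples for PCA
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    return scores, Vt[:n_components]             # PC scores and loading directions

# toy usage: compositions of 5 parts, including exact zeros
rng = np.random.default_rng(0)
P = rng.dirichlet([2.0, 1.0, 0.5, 0.5, 0.2], size=100)
P[rng.random(P.shape) < 0.05] = 0.0
P = P / P.sum(axis=1, keepdims=True)
scores, directions = clr_pca(P)
```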
We introduce a statistical and computational framework for tensor Canonical Polyadic (CP) decomposition, with a focus on statistical theory, convergence, and algorithmic improvements. First, we show that the Alternating Least Squares (ALS) algorithm achieves the desired error rate within three iterations when $R = 1$. Second, for the more general case where $R > 1$, we derive statistical bounds for ALS, showing that the estimation error exhibits an initial phase of quadratic convergence followed by linear convergence until reaching the desired accuracy. Third, we propose a novel warm-start procedure for ALS in the $R > 1$ setting, which integrates tensor Tucker decomposition with simultaneous diagonalization (Jennrich's algorithm) to significantly enhance performance over existing benchmark methods. Numerical experiments support our theoretical findings, demonstrating the practical advantages of our approach.
Keywords
tensor
CP decomposition
alternating least squares
statistical bound
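As a reference point for the discussion above, here is a compact numpy implementation of plain ALS for a rank-R CP decomposition of a 3-way tensor, with random initialization. The warm-start procedure combining Tucker compression with Jennrich's simultaneous diagonalization is not included; this sketch is only the baseline iteration, and the toy tensor dimensions are arbitrary.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I, R) and V (J, R), giving (I*J, R)."""
    return (U[:, None, :] * V[None, :, :]).reshape(U.shape[0] * V.shape[0], -1)

def unfold(T, mode):
    """Mode-n matricization, consistent with the khatri_rao ordering above."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, R, n_iter=100, seed=0):
    """Alternating least squares: each step solves a linear problem in one factor."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(size=(s, R)) for s in T.shape)
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C), unfold(T, 0).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), unfold(T, 1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), unfold(T, 2).T, rcond=None)[0].T
    return A, B, C

# toy check: recover a random rank-2 tensor observed with a little noise
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(s, 2)) for s in (15, 12, 10))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0) + 0.01 * rng.normal(size=(15, 12, 10))
A, B, C = cp_als(T, R=2)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(T))
```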