Thursday, Aug 7: 8:30 AM - 10:20 AM
0371
Invited Paper Session
Music City Center
Room: CC-104A
Applied: Yes
Main Sponsor
Classification Society of North America
Co Sponsors
Biometrics Section
Section on Statistical Learning and Data Science
Presentations
Multivariate count data are commonly encountered in bioinformatics. Although the Poisson distribution seems a natural fit for these count data, its multivariate extension is computationally expensive. Recently, mixtures of multivariate Poisson lognormal (MPLN) models have been used to analyze these multivariate count measurements. In the MPLN model, the counts, conditional on the latent variable, are modelled using a Poisson distribution, and the latent variable comes from a multivariate Gaussian distribution. Due to this hierarchical structure, the MPLN model can account for over-dispersion, unlike the traditional Poisson distribution, and allows for correlation between the variables. Here, we extend the mixture of MPLN distributions for clustering high-dimensional RNA-seq data by incorporating a factor analyzer structure in the latent space. A family of parsimonious mixtures of MPLN distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. Applications to simulated data sets and a real data set are presented.
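As a rough illustration of the hierarchical structure described above, the following Python sketch simulates counts from a single MPLN component: a multivariate Gaussian latent variable, followed by Poisson counts whose log-means are the latent values. The function name `simulate_mpln` and the parameter values are hypothetical, chosen only for illustration. The sketch omits the mixture, the factor analyzer structure, and any normalization offset, so it is not the authors' implementation.

```python
import numpy as np

def simulate_mpln(n, mu, sigma, seed=0):
    """Draw n observations from one MPLN component (illustrative sketch).

    Latent layer:   theta_i ~ N_d(mu, sigma)
    Observed layer: y_ij | theta_ij ~ Poisson(exp(theta_ij))
    """
    rng = np.random.default_rng(seed)
    theta = rng.multivariate_normal(mu, sigma, size=n)  # latent Gaussian draws
    counts = rng.poisson(np.exp(theta))                 # conditionally independent counts
    return counts, theta

# Hypothetical parameters for a two-variable example.
mu = np.array([1.0, 2.0])
sigma = np.array([[0.5, 0.3],
                  [0.3, 0.5]])
counts, theta = simulate_mpln(5000, mu, sigma)

# Over-dispersion: the sample variance exceeds the sample mean in each variable.
print(counts.mean(axis=0), counts.var(axis=0))
# Correlation between variables is induced by the off-diagonal of sigma.
print(np.corrcoef(counts, rowvar=False)[0, 1])
```

Marginally, each count has variance larger than its mean, and the off-diagonal entries of the latent covariance carry over as correlation between the counts. A plain Poisson model offers neither property.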
Keywords
Model-based clustering
RNA-seq data
Mixture models
multivariate Poisson lognormal distribution
Microbiome taxa count data, derived from next-generation sequencing, are inherently high-dimensional, over-dispersed, and reveal only relative abundance, making them compositional and constrained to a simplex. To model such data, the logistic normal multinomial (LNM) approach transforms relative abundances from a simplex to real Euclidean space using the additive log-ratio transformation. We have developed mixtures of LNM models for clustering microbiome data, adopting an efficient framework for parameter estimation using variational approximations to reduce the computational overhead. In this talk, we will illustrate that the LNM mixture models provide a flexible framework, which can be easily adapted by assuming different data structures and distributions at the latent layer. Specifically, we present a matrix-LNM model that introduces a matrix-variate normal distribution at the latent layer, designed for time-course microbiome data. This approach captures both temporal dependencies and inter-sample correlations, offering a structured approach to longitudinal microbiome analysis. In addition, a family of models is proposed by incorporating the modified Cholesky decomposition and imposing constraints on the components of the covariance matrix. Through simulation studies and real data analysis, we demonstrate the model's effectiveness in identifying dynamic patterns and clustering temporal microbiome profiles.
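The additive log-ratio transformation mentioned above can be sketched in a few lines of Python. The function names `alr` and `alr_inverse` are hypothetical, and using the last taxon as the reference is an assumption made here for illustration. The sketch also assumes strictly positive proportions.

```python
import numpy as np

def alr(p):
    """Additive log-ratio: map compositions on the simplex (K parts)
    to R^(K-1), using the last part as the reference."""
    p = np.asarray(p, dtype=float)
    return np.log(p[..., :-1] / p[..., -1:])

def alr_inverse(z):
    """Map points in R^(K-1) back to the simplex (K parts)."""
    z = np.asarray(z, dtype=float)
    # Append a zero for the reference part, then normalize the exponentials.
    padded = np.concatenate([z, np.zeros(z.shape[:-1] + (1,))], axis=-1)
    expz = np.exp(padded)
    return expz / expz.sum(axis=-1, keepdims=True)

# Hypothetical relative abundances for three taxa.
p = np.array([0.2, 0.3, 0.5])
z = alr(p)             # unconstrained coordinates in R^2
p_back = alr_inverse(z)  # recovers the original composition
```

In an LNM-type model, the Gaussian assumption is placed on the transformed coordinates `z`, and the inverse map yields the multinomial probabilities.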
Keywords
Microbiome data
Model-based clustering
Matrix-variate normal
The analysis of repeated measures of dietary intake data is often met with statistical challenges due to its high dimensionality and heterogeneity. These issues are further amplified when data are obtained from large population cohort studies. Bayesian nonparametric model-based clustering offers the computational flexibility to handle multiple exposures jointly, as well as the mechanics to identify subgroup differences and similarities through the borrowing of information across subgroups. This talk will discuss approaches that can accommodate a wide set of dietary exposure variables that interrelate and change over time, as well as the scalability of these approaches to large, heterogeneous populations. Using dietary intake data from over 58,000 women collected from the Black Women's Health Study, we will apply these approaches to better understand dietary consumption patterns among Black women in the US from 1995 to 2021.
Keywords
diet patterns
Bayesian nonparametric
model-based clustering
Our new framework for model-based clustering on data with continuous and discrete variables extends the cluster variance structure framework for Gaussian mixture models set forth by Fraley and Raftery (1999). In modeling how each variable contributes to cluster determination, we allow for relationships within and between the continuous and discrete variables. This avoids both the creation of latent continuous variables for unordered categories and the simplifying assumption that categorical variables are completely independent of all other clustering variables. Simulation study results showed desirable properties of our method when applied to data with variables of mixed-distributional forms. Applying our clustering methods to prostate cancer data shows subgroups with different responses to treatment. Applying our methods to nutritional intake data shows similar clusters for mothers and children based on independent data collection.
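For context, the cluster variance structure framework for Gaussian mixtures is commonly written as an eigen-decomposition of each cluster covariance into volume, shape, and orientation; constraining these terms across clusters yields a family of models. The Python sketch below builds a 2x2 covariance from those three ingredients. It covers only the continuous (Gaussian) part, not the mixed-data extension described in the abstract, and the function name `cluster_covariance` and the parameter values are hypothetical.

```python
import numpy as np

def cluster_covariance(volume, shape, angle):
    """Build a 2x2 covariance Sigma = volume * D @ A @ D.T (illustrative sketch).

    volume: positive scalar controlling the cluster's size
    shape:  two positive values, rescaled so that det(A) = 1
    angle:  rotation angle (radians) defining the orientation matrix D
    """
    shape = np.asarray(shape, dtype=float)
    a = np.diag(shape / np.sqrt(np.prod(shape)))  # normalized shape matrix, det = 1
    c, s = np.cos(angle), np.sin(angle)
    d = np.array([[c, -s],
                  [s,  c]])                       # orthogonal orientation matrix
    return volume * d @ a @ d.T

# Hypothetical cluster: volume 2, elongated 4:1, rotated by 30 degrees.
sigma = cluster_covariance(2.0, [4.0, 1.0], np.pi / 6)
```

Holding the volume, shape, or orientation equal across clusters, or letting each vary, gives the different constrained covariance structures within such a family.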
Keywords
mixture models
categorical data