Better clustering leads to better understanding: genes, microbes, or foods show risk subgroups

Jiachen Chen Chair
Boston University School of Public Health

Tanzy Love Organizer
University of Rochester

Thursday, Aug 7: 8:30 AM - 10:20 AM
0371
Invited Paper Session

Music City Center

Room: CC-104A

Applied

Yes

Main Sponsor

Classification Society of North America

Co Sponsors

Biometrics Section

Section on Statistical Learning and Data Science

Presentations

Clustering high dimensional RNA-seq data

Multivariate count data are commonly encountered in bioinformatics. Although the Poisson distribution seems a natural fit for these count data, its multivariate extension is computationally expensive. Recently, mixtures of multivariate Poisson lognormal (MPLN) models have been used to analyze these multivariate count measurements. In the MPLN model, the counts, conditional on the latent variable, are modelled using a Poisson distribution, and the latent variable comes from a multivariate Gaussian distribution. Because of this hierarchical structure, the MPLN model can account for over-dispersion, unlike the traditional Poisson distribution, and allows for correlation between the variables. Here, we extend the mixture of MPLN distributions for clustering high-dimensional RNA-seq data by incorporating a factor analyzer structure in the latent space. A family of parsimonious mixtures of MPLN distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. Applications to simulated data sets as well as a real data set are presented.
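The hierarchical structure described above can be illustrated with a minimal simulation sketch. This is not the authors' implementation: the function name `simulate_mpln`, the parameter values, and the use of a single component (no mixture, no factor analyzer structure) are assumptions chosen only to show how a Gaussian latent layer under a Poisson observation layer produces over-dispersed, correlated counts.

```python
import numpy as np

def simulate_mpln(n, mu, sigma, seed=0):
    """Draw n observations from a single multivariate Poisson lognormal component.

    Hypothetical helper for illustration only:
      theta ~ N(mu, sigma)            (latent Gaussian layer)
      y_j | theta ~ Poisson(exp(theta_j))   (observed counts)
    """
    rng = np.random.default_rng(seed)
    theta = rng.multivariate_normal(mu, sigma, size=n)  # shape (n, d)
    return rng.poisson(np.exp(theta))                   # shape (n, d)

# Assumed illustrative parameters, not taken from the talk.
mu = np.array([1.0, 1.5])
sigma = np.array([[0.5, 0.3],
                  [0.3, 0.5]])

counts = simulate_mpln(20000, mu, sigma)
means = counts.mean(axis=0)
variances = counts.var(axis=0)
corr = np.corrcoef(counts, rowvar=False)[0, 1]

# A plain Poisson would have variance equal to the mean and, if the
# variables were independent, zero correlation between them.
print("means:", means)
print("variances:", variances)
print("correlation:", corr)
```

In this sketch the sample variances exceed the sample means, and the positive off-diagonal entry of `sigma` carries through to a positive correlation between the count variables.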

Keywords

Model-based clustering

RNA-seq data

Mixture models

multivariate Poisson lognormal distribution

Co-Author(s)

Andrea Payne, Carleton University
Anjali Silva, University of Toronto
Steven Rothstein, University of Guelph
Paul McNicholas, McMaster University

Speaker

Sanjeena Dang, Carleton University

Efficient clustering of microbiome compositional data using mixtures of logistic normal multinomial models and their extensions

Microbiome taxa count data, derived from next-generation sequencing, are inherently high-dimensional, over-dispersed, and reveal only relative abundance, making them compositional and constrained to a simplex. To model such data, the logistic normal multinomial (LNM) approach transforms relative abundances from a simplex to real Euclidean space using the additive log-ratio transformation. We have developed mixtures of LNM models for clustering microbiome data, adopting an efficient framework for parameter estimation using variational approximations to reduce the computational overhead. In this talk, we will illustrate that the LNM mixture models provide a flexible framework, which can be easily adapted by assuming different data structures and distributions at the latent layer. Specifically, we present a matrix-LNM model that introduces a matrix variate normal distribution at the latent layer, designed for time-course microbiome data. This model captures both temporal dependencies and inter-sample correlations, offering a structured approach to longitudinal microbiome analysis. In addition, a family of models is proposed by incorporating the modified Cholesky decomposition and imposing constraints on the components of the covariance matrix. Through simulation studies and real data analysis, we demonstrate the model's effectiveness in identifying dynamic patterns and clustering temporal microbiome profiles.
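The additive log-ratio transformation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's code: the function names `alr` and `alr_inverse` are hypothetical, the last taxon is arbitrarily taken as the reference, and all proportions are assumed strictly positive (as they are for the latent multinomial probabilities in an LNM model, though not necessarily for observed counts).

```python
import numpy as np

def alr(composition):
    """Additive log-ratio: map a K-part composition to R^(K-1).

    Assumes strictly positive parts; the last part is the reference.
    """
    composition = np.asarray(composition, dtype=float)
    return np.log(composition[..., :-1] / composition[..., -1:])

def alr_inverse(z):
    """Map a point in R^(K-1) back to the K-part simplex."""
    z = np.asarray(z, dtype=float)
    # Append a zero for the reference part, then apply a stable softmax.
    expanded = np.concatenate([z, np.zeros(z.shape[:-1] + (1,))], axis=-1)
    expanded = expanded - expanded.max(axis=-1, keepdims=True)
    weights = np.exp(expanded)
    return weights / weights.sum(axis=-1, keepdims=True)

# Assumed example: relative abundances of three taxa.
p = np.array([0.2, 0.3, 0.5])
z = alr(p)               # two unconstrained real coordinates
p_back = alr_inverse(z)  # recovers the original composition

print("ALR coordinates:", z)
print("recovered composition:", p_back)
```

Because the transformed coordinates are unconstrained, a Gaussian (or matrix variate Gaussian) distribution can be placed on them, which is the role the latent layer plays in the models described in the abstract.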

Keywords

Microbiome data

Model-based clustering

Matrix-variate normal

Speaker

Yuan Fang, Binghamton University

Overcoming statistical challenges in the analysis of dietary intake changes over time

The analysis of repeated measures of dietary intake data is often met with statistical challenges due to its high dimensionality and heterogeneity. These issues are further amplified when data are obtained from large population cohort studies. Bayesian nonparametric model-based clustering offers the computational flexibility to handle multiple exposures jointly, as well as the mechanics to identify subgroup differences and similarities through the borrowing of information across subgroups. This talk will discuss approaches that can accommodate a wide set of dietary exposure variables that interrelate and change over time, as well as the scalability of these data for large, heterogeneous populations. Using dietary intake data from over 58,000 women collected from the Black Women's Health Study, we will apply these approaches to better understand dietary consumption patterns among Black women in the US from 1995 to 2021.

Keywords

diet patterns

Bayesian nonparametric

model-based clustering

Co-Author(s)

Briana Stephenson, Harvard T.H. Chan School of Public Health
Daniel Schwartz, Harvard T.H. Chan School of Public Health

Speaker

Xuzhi Wang, Harvard University

Clinical Variables of Different Types: An Approach to Model-Based Clustering of Mixed-Type Data Illustrated in Diet Diaries

Our new framework for model-based clustering on data with continuous and discrete variables extends the cluster variance structure framework for Gaussian mixture models set forth by Fraley and Raftery (1999). In modeling how each variable contributes to cluster determination, we allow for relationships within and between the continuous and discrete variables. This avoids both the creation of latent continuous variables for unordered categories and the simplifying assumption that categorical variables are completely independent of all other clustering variables. Simulation study results showed desirable properties of our method when applied to data with variables of mixed-distributional forms. Applying our clustering methods to prostate cancer data shows subgroups with different responses to treatment. Applying our methods to nutritional intake data shows similar clusters for mothers and children based on independent data collection.

Keywords

mixture models

categorical data

Speaker

Tanzy Love, University of Rochester