Clustering high-dimensional RNA-seq data

Andrea Payne Co-Author
Carleton University
 
Anjali Silva Co-Author
University of Toronto
 
Steven Rothstein Co-Author
University of Guelph
 
Paul McNicholas Co-Author
McMaster University
 
Sanjeena Dang Speaker
Carleton University
 
Thursday, Aug 7: 8:35 AM - 9:00 AM
Invited Paper Session 
Music City Center 
Multivariate count data are commonly encountered in bioinformatics. Although the Poisson distribution seems a natural fit for such count data, its multivariate extension is computationally expensive. Recently, mixtures of multivariate Poisson lognormal (MPLN) models have been used to analyze these multivariate count measurements. In the MPLN model, the counts, conditional on a latent variable, are modelled using a Poisson distribution, and the latent variable follows a multivariate Gaussian distribution. Because of this hierarchical structure, the MPLN model can account for over-dispersion, unlike the traditional Poisson distribution, and allows for correlation between the variables. Here, we extend the mixture of MPLN distributions for clustering high-dimensional RNA-seq data by incorporating a factor analyzer structure in the latent space. A family of parsimonious mixtures of MPLN distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. Applications to simulated data sets and a real data set are presented.
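
The hierarchical structure described above can be illustrated with a small simulation. The sketch below is not the authors' implementation. It draws counts from a single MPLN component, assuming a log link (counts are Poisson with rate exp(theta)) and assuming the factor analyzer structure takes the standard factor-analysis form Sigma = Lambda Lambda^T + Psi with diagonal Psi. The function name `simulate_mpln_factor`, the dimensions, and all parameter values are hypothetical and chosen only for illustration. Library-size normalization, which RNA-seq analyses typically require, is omitted.

```python
import numpy as np


def simulate_mpln_factor(n, mu, loadings, psi_diag, rng):
    """Draw n observations from one MPLN component whose latent covariance
    is assumed to be loadings @ loadings.T + diag(psi_diag)."""
    d, q = loadings.shape
    # Latent factors u ~ N(0, I_q) and idiosyncratic noise e ~ N(0, diag(psi_diag))
    u = rng.standard_normal((n, q))
    e = rng.standard_normal((n, d)) * np.sqrt(psi_diag)
    # Latent Gaussian layer: theta ~ N(mu, Lambda Lambda^T + Psi)
    theta = mu + u @ loadings.T + e
    # Observed layer: Y_j | theta_j ~ Poisson(exp(theta_j)), independently over j
    counts = rng.poisson(np.exp(theta))
    return counts, theta


rng = np.random.default_rng(0)
d, q, n = 6, 2, 5000                      # hypothetical sizes: 6 variables, 2 factors
mu = np.full(d, 2.0)                      # hypothetical latent mean
loadings = rng.normal(scale=0.5, size=(d, q))
psi_diag = np.full(d, 0.1)                # hypothetical diagonal noise variances

counts, theta = simulate_mpln_factor(n, mu, loadings, psi_diag, rng)

# Implied latent covariance: d*q + d free parameters instead of d*(d+1)/2
sigma = loadings @ loadings.T + np.diag(psi_diag)

# Over-dispersion check: for a plain Poisson the ratio would be close to 1
dispersion_ratio = counts.var(axis=0) / counts.mean(axis=0)
print(dispersion_ratio)
```

In this sketch the latent Gaussian layer inflates each marginal variance above its mean and induces correlation between variables through the shared factors. A parsimonious family of the kind the abstract mentions could then be obtained, under the same assumptions, by constraining `loadings` or `psi_diag` (for example, sharing them across mixture components or forcing `psi_diag` to be constant).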

Keywords

Model-based clustering

RNA-seq data

Mixture models

Multivariate Poisson lognormal distribution