Advances in Bayesian Factorization Methods in Genomics and Medicine

Alessandro Zito Chair
Harvard University
 
Peter Carbonetto Discussant
University of Chicago
 
Alessandro Zito Organizer
Harvard University
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
0766 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-209B 
The amount of genomic data, such as whole-genome sequencing, single-cell RNA, and repeated gene expression measurements, has grown at an unprecedented rate over the last decade thanks to technological advances and lower storage and acquisition costs. This wealth of information has improved therapeutic decisions for many diseases at a large scale. However, such data bring additional computational and modeling hurdles due to their high dimension, which often arises from time- and spatially-dependent longitudinal measurements. Effective dimensionality-reduction tools that find simple and interpretable patterns in such complex data are therefore of paramount importance for unveiling the common pathways through which diseases progress over time and/or affect subgroups of patients. In turn, the low-dimensional structures inferred from such models improve precision medicine. The purpose of this session is to explore emerging approaches in Bayesian factor analysis and related factorization methods applied to genomic and health data. These include generalizations of non-negative matrix factorization for count and categorical data, and Gaussian process modeling for recovering lower-dimensional spatial and temporal trajectories in continuous and binary data. The speakers are all applied scientists from diverse backgrounds with extensive experience in the field. We anticipate the session will attract a wide audience interested in parametric and nonparametric Bayesian methods and beyond, including spatial statistics, temporal and dynamic modeling, statistical testing in high dimensions, and related interdisciplinary research areas beyond genomics that apply factor models, such as ecology, epidemiology, and bioinformatics.

Keywords

Bayesian inference

Factor models

Medicine

Genomics

Spatial statistics 

Applied

Yes

Main Sponsor

Section on Bayesian Statistical Science

Co Sponsors

Section on Nonparametric Statistics
Section on Statistics in Genomics and Genetics

Presentations

A Bayesian Boolean Matrix Factorization for Analyzing Copy Number Abnormalities for Multiple Myeloma Disease

Chromosomal alterations in multiple myeloma are pivotal in understanding the disease's pathogenesis, progression, and therapeutic response. Multiple myeloma, a cancer of plasma cells, is characterized by various genomic abnormalities, including chromosomal translocations, deletions, duplications, and aneuploidy. Studying the latent factors behind these deletion and insertion events is very helpful in understanding the disease's prognosis and evolution. One possible approach is to use Boolean matrix factorization algorithms to unravel the complexities of these events. This study aimed to develop a novel algorithm, Bayesian Boolean Matrix Factorization (BBooMF), for decomposing binary (0, 1) datasets into two binary factor matrices. We propose a simple novel decomposition method for categorical data based on logical conditions, yielding easily interpretable factors. We combine the Bayesian approach with Boolean algebra to carry out probabilistic inference that addresses uncertainty and noise in the data, improving the accuracy and interpretability of matrix factorization. We iteratively optimize the factorization using a Gibbs sampler, providing valuable insights into the underlying patterns and structures of complex discrete datasets. The proposed algorithm is compared with existing classical methods such as Asso, GRENCOND, GRENCOND+, and topFiberM. Our algorithm performs as well as or better than these classical methods. The developed algorithm has potential applications in other fields whose data sets are naturally represented by binary structures.
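As a point of reference for readers unfamiliar with Boolean factorization, the sketch below shows only the Boolean matrix product that such decompositions are built on, together with a simple reconstruction error. It is not the BBooMF algorithm or its Gibbs sampler; the function names `boolean_product` and `hamming_error` and the toy matrices are hypothetical, chosen for illustration.

```python
def boolean_product(U, V):
    """Boolean matrix product: X[i][j] = OR over k of (U[i][k] AND V[k][j]).

    U is n x K and V is K x m, both lists of lists with 0/1 entries.
    """
    n, K, m = len(U), len(V), len(V[0])
    return [
        [int(any(U[i][k] and V[k][j] for k in range(K))) for j in range(m)]
        for i in range(n)
    ]


def hamming_error(X, X_hat):
    """Number of entries where two binary matrices of equal shape disagree."""
    return sum(
        a != b
        for row, row_hat in zip(X, X_hat)
        for a, b in zip(row, row_hat)
    )


# Toy example (made-up values): 3 samples, 4 features, K = 2 latent factors.
U = [[1, 0],
     [0, 1],
     [1, 1]]          # which factors are active in each sample
V = [[1, 1, 0, 0],
     [0, 0, 1, 1]]    # which features each factor switches on

X = boolean_product(U, V)

# A noisy observation of X with one flipped entry.
X_noisy = [row[:] for row in X]
X_noisy[0][3] = 1 - X_noisy[0][3]
error = hamming_error(X_noisy, X)
```

Because the product uses OR rather than addition, the third sample, which carries both factors, has entries equal to 1 rather than 2, so the reconstructed matrix stays binary.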
 

Keywords

Boolean factorization 

Co-Author

Giovanni Parmigiani, Dana-Farber Cancer Institute

Speaker

Adolphus Wagala, Dana-Farber Cancer Institute

Structurally aware robust rank selection for probabilistic matrix factorization

Matrix factorization models (e.g., factor analysis, PCA, and nonnegative matrix factorization) are widely used to find latent structure in data by assuming the data-generating model parameters can be expressed as the product of two low-rank matrices. Typically, the rank K of the matrices is interpreted as the number of "processes" or "activities" that generated the observed data. Thus, in practice, determining K is a critical inferential step for scientific understanding. However, because the assumed observation model is only an approximation to the true data-generating process, increasing the number of observations does not yield better inferences. Instead, the data are explained by adding spurious new "activities" that compensate for the shortcomings of the observation model. Nonetheless, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent activities cause the observed signal but not vice versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain inferences about the rank K that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a consistency result under intuitive assumptions. Numerical experiments demonstrate that our model selection criterion consistently finds an appropriate number of latent activities in two applications: mutational signature discovery and hyperspectral unmixing.

Keywords

Model selection

Mutational signature discovery

Nonnegative matrix factorization

Probabilistic matrix factorization

Misspecified model 

Co-Author

Jonathan Huggins, Boston University

Speaker

Jonathan Huggins, Boston University

Aladynoulli: Hierarchical Bayesian Modeling of time-varying trajectories across 358 distinct diseases

This proposal introduces a novel Bayesian framework for modeling disease progression that accounts for the complex temporal relationships between cardiovascular disease and its comorbidities. The model leverages latent disease signatures that evolve over time, allowing diseases to manifest through different pathways and with varying progression rates. By incorporating time-varying signatures through Gaussian processes, the framework captures how disease patterns emerge and evolve across the lifespan, while a discrete-time survival likelihood enables prediction of disease trajectories. The model integrates genetic effects through individual-specific signature loadings, allowing for personalized progression rates. The framework addresses key limitations of existing approaches: it moves beyond assumptions of disease independence, handles the streaming nature of electronic health records through Bayesian updating, and enables joint modeling of heterogeneous disease types across lifespans. Validation on UK Biobank data (N=407,878) across 358 diseases demonstrates identification of 20 biologically meaningful disease signatures. This approach advances statistical methodology for analyzing longitudinal health data while enabling precision medicine applications through improved disease trajectory prediction. 
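To make the "discrete-time survival likelihood" mentioned above concrete, the following is a minimal, generic sketch of that likelihood for one individual and one disease. It is not the Aladynoulli model: the per-interval hazards are passed in as plain numbers, whereas in the abstract they would come from time-varying signatures and individual loadings. The function name `discrete_time_loglik` and the hazard values are hypothetical.

```python
import math


def discrete_time_loglik(hazards, event_time=None):
    """Log-likelihood of one individual under a discrete-time survival model.

    hazards[t] is P(event in interval t | no event before interval t).
    event_time is the index of the interval in which the event occurred,
    or None if the individual was event-free (censored) through all intervals.
    """
    if event_time is None:
        # Survived every observed interval.
        return sum(math.log1p(-h) for h in hazards)
    # Survived intervals 0..event_time-1, then had the event in event_time.
    survived = sum(math.log1p(-h) for h in hazards[:event_time])
    return survived + math.log(hazards[event_time])


# Made-up hazards for three consecutive intervals.
hazards = [0.1, 0.2, 0.5]
ll_event = discrete_time_loglik(hazards, event_time=1)   # log(0.9 * 0.2)
ll_censored = discrete_time_loglik(hazards)              # log(0.9 * 0.8 * 0.5)
```

Each interval contributes either the probability of surviving it or the probability of the event occurring in it, which is why a censored record is a product of survival terms only.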

Keywords

Dynamic factor analysis 

Co-Author

Sarah Urbut, Broad Institute

Speaker

Sarah Urbut, Broad Institute

Spectral decomposition-assisted multi-study factor analysis

This work focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions and infers the factor loadings via surrogate regression tasks, avoiding identifiability and computational issues of existing alternatives. Reliably inferring shared vs study-specific components requires novel developments that are of independent interest. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. Conditionally on the factors, loadings and residual error variances are inferred via conjugate normal-inverse gamma priors. The conditional posterior distribution of factor loadings has a simple product form across outcomes, facilitating parallelization. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells. 
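For orientation, the three-part variance decomposition described above is commonly written as below. The notation is an assumption made for illustration and does not come from the abstract: $\Sigma_s$ denotes the covariance of study $s$, $\Lambda$ the shared loadings, $\Gamma_s$ the study-specific loadings, and $\Psi_s$ a diagonal matrix of residual variances (taken here to be study-specific, which is also an assumption).

```latex
\Sigma_s
  \;=\;
  \underbrace{\Lambda \Lambda^{\top}}_{\text{(i) shared low-rank}}
  \;+\;
  \underbrace{\Gamma_s \Gamma_s^{\top}}_{\text{(ii) study-specific low-rank}}
  \;+\;
  \underbrace{\Psi_s}_{\text{(iii) diagonal, idiosyncratic}},
  \qquad s = 1, \dots, S .
```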

Keywords

latent variable models

RNA sequencing

contrastive models

case-control data 

Speaker

Niccolo Anceschi, Duke University