Tuesday, Aug 5: 8:30 AM - 10:20 AM
0766
Topic-Contributed Paper Session
Music City Center
Room: CC-209B
The amount of genomic data, such as whole-genome sequencing, single-cell RNA, and repeated gene expression measurements, has seen unparalleled growth in the last decade thanks to technological advancements and lower storage and acquisition costs. This wealth of information has enhanced the effectiveness of therapeutic decisions for many diseases at a large scale. However, such data come with increased computational and modeling hurdles due to their large dimension, often arising from time- and spatially-dependent longitudinal measurements. Hence, effective dimensionality-reduction tools aimed at finding simple and interpretable patterns among such complexities are of paramount importance in unveiling the common pathways through which diseases progress over time and/or impact subgroups of patients. In turn, the low-dimensional structures inferred from such models can improve precision medicine. The purpose of this session is to explore some emerging approaches in Bayesian factor analysis and related factorization methods applied to genomic and health data. These include generalizations of non-negative matrix factorization methods applied to count and categorical data, and Gaussian process modeling for retrieving lower-dimensional spatial and time trajectories in continuous and binary data. The speakers are all applied scientists from diverse backgrounds who have extensive experience in the field. We anticipate the session will attract a wide audience interested in parametric and nonparametric methods in the Bayesian field and beyond, including spatial statistics, temporal and dynamic modeling, statistical testing in high dimensions, and related interdisciplinary research areas beyond genomics that apply factor models, such as ecology, epidemiology, and bioinformatics.
Bayesian inference
Factor models
Medicine
Genomics
Spatial statistics
Applied
Yes
Main Sponsor
Section on Bayesian Statistical Science
Co Sponsors
Section on Nonparametric Statistics
Section on Statistics in Genomics and Genetics
Presentations
Chromosomal alterations in multiple myeloma are pivotal in understanding the disease's pathogenesis, progression, and therapeutic response. Multiple myeloma, a cancer of plasma cells, is characterized by various genomic abnormalities, including chromosomal translocations, deletions, duplications, and aneuploidy. Studying the latent factors behind these deletion and insertion events is very helpful in understanding the disease's prognosis and evolution. One possible approach is to use Boolean matrix factorization algorithms to unravel the complexities of these events. This study aimed to develop a novel algorithm, Bayesian Boolean Matrix Factorization (BBooMF), for decomposing binary (0, 1) datasets into two binary factor matrices. We propose a simple novel decomposition method for categorical data based on logical conditions, yielding easily interpretable factors. We combine the Bayesian approach with Boolean algebra to carry out probabilistic inference that addresses uncertainty and noise in the data, improving the accuracy and interpretability of matrix factorization. We iteratively optimize the factorization using the Gibbs sampler, providing valuable insights into the underlying patterns and structures of complex discrete datasets. The proposed algorithm is compared with existing classical methods such as Asso, GRENCOND, GRENCOND+, and topFiberM. Our algorithm performs as well as or better than these classical methods. The developed algorithm has potential applications in other fields whose data sets are naturally represented using binary structures.
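As a rough illustration of what decomposing a binary dataset into two binary factor matrices means, the sketch below computes a Boolean matrix product and the number of entries in which two binary matrices disagree. It is not the BBooMF algorithm and contains no Bayesian inference or Gibbs sampling; the function names and the toy matrices are hypothetical and chosen only for this example.

```python
def boolean_product(A, B):
    """Boolean matrix product: X[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    n, K, m = len(A), len(B), len(B[0])
    return [
        [int(any(A[i][k] and B[k][j] for k in range(K))) for j in range(m)]
        for i in range(n)
    ]


def hamming_error(X, Y):
    """Count the entries in which two equally sized binary matrices differ."""
    return sum(x != y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))


# Toy factors (hypothetical): 3 samples, 2 latent factors, 4 binary features.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0, 0],
     [0, 1, 1, 0]]

X = boolean_product(A, B)

# A noisy observation of X with one flipped entry.
X_noisy = [row[:] for row in X]
X_noisy[0][3] = 1 - X_noisy[0][3]

print(X)
print(hamming_error(X, X_noisy))
```

In this toy setting, a factorization method would search for binary A and B whose Boolean product is close to the observed matrix; a probabilistic treatment would additionally quantify uncertainty about the entries of A and B.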
Keywords
Boolean factorization
Matrix factorization models (e.g., factor analysis, PCA, and nonnegative matrix factorization) are widely used to find latent structure in data by assuming the data-generating model parameters can be expressed as the product of two low-rank matrices. Typically, the rank K of the matrices is interpreted as the number of "processes" or "activities" that generated the observed data. Thus, in practice, determining K is a critical inferential step for scientific understanding. However, because the assumed observation model is only an approximation to the true data-generating process, increasing the number of observations does not yield better inferences. Instead, the opposite occurs: the data are explained by adding spurious new "activities" that compensate for the shortcomings of the observation model. Nonetheless, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent activities cause the observed signal but not vice versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain inferences about the rank K that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a consistency result under intuitive assumptions. Numerical experiments demonstrate that our model selection criterion consistently finds an appropriate number of latent activities in two applications: mutational signature discovery and hyperspectral unmixing.
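For readers less familiar with the setup, the low-rank assumption described in this abstract can be written generically as follows. The symbols (Theta, W, H, n, p) are notation assumed here for illustration and are not taken from the abstract; only the rank K is named in the original.

```latex
\[
  \Theta = W H, \qquad
  W \in \mathbb{R}^{n \times K}, \quad
  H \in \mathbb{R}^{K \times p}, \quad
  K \ll \min(n, p),
\]
% \Theta: the n-by-p matrix of observation-model parameters (assumed notation).
% In nonnegative matrix factorization, W and H are additionally constrained
% to have nonnegative entries.
```

Under this notation, choosing K amounts to choosing the number of columns of W (equivalently, rows of H), which is the quantity the proposed criterion targets.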
Keywords
Model selection
Mutational signature discovery
Nonnegative matrix factorization
Probabilistic matrix factorization
Misspecified model
This proposal introduces a novel Bayesian framework for modeling disease progression that accounts for the complex temporal relationships between cardiovascular disease and its comorbidities. The model leverages latent disease signatures that evolve over time, allowing diseases to manifest through different pathways and with varying progression rates. By incorporating time-varying signatures through Gaussian processes, the framework captures how disease patterns emerge and evolve across the lifespan, while a discrete-time survival likelihood enables prediction of disease trajectories. The model integrates genetic effects through individual-specific signature loadings, allowing for personalized progression rates. The framework addresses key limitations of existing approaches: it moves beyond assumptions of disease independence, handles the streaming nature of electronic health records through Bayesian updating, and enables joint modeling of heterogeneous disease types across lifespans. Validation on UK Biobank data (N=407,878) across 358 diseases demonstrates identification of 20 biologically meaningful disease signatures. This approach advances statistical methodology for analyzing longitudinal health data while enabling precision medicine applications through improved disease trajectory prediction.
Keywords
Dynamic factor analysis
This work focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions and infers the factor loadings via surrogate regression tasks, avoiding identifiability and computational issues of existing alternatives. Reliably inferring shared vs study-specific components requires novel developments that are of independent interest. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. Conditionally on the factors, loadings and residual error variances are inferred via conjugate normal-inverse gamma priors. The conditional posterior distribution of factor loadings has a simple product form across outcomes, facilitating parallelization. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.
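The three-part variance decomposition described in this abstract can be sketched as below. All symbols are assumed notation for illustration, and whether the diagonal term varies by study is also an assumption, since the abstract does not say.

```latex
\[
  \Sigma_s
  = \underbrace{\Lambda \Lambda^{\top}}_{\text{(i) shared low-rank}}
  + \underbrace{\Gamma_s \Gamma_s^{\top}}_{\text{(ii) study-specific low-rank}}
  + \underbrace{\Psi_s}_{\text{(iii) diagonal}},
  \qquad s = 1, \dots, S,
\]
% \Sigma_s: p-by-p covariance of the outcomes in study s (assumed notation).
% \Lambda:  p-by-k loadings shared across studies.
% \Gamma_s: p-by-q_s loadings specific to study s.
% \Psi_s:   diagonal matrix of idiosyncratic residual variances.
```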
Keywords
latent variable models
RNA sequencing
contrastive models
case-control data