Missing data imputation via truncated Gaussian factor analysis with application to metabolomics data

Lorraine Brennan Co-Author
University College Dublin
 
Roberta De Vito Co-Author
Brown University
 
Massimiliano Russo Co-Author
The Ohio State University
 
Isobel Claire Gormley Co-Author
University College Dublin
 
Kate Finucane First Author
University College Dublin
 
Kate Finucane Presenting Author
University College Dublin
 
Monday, Aug 4: 11:45 AM - 11:50 AM
2248 
Contributed Speed 
Music City Center 
In metabolomics, the study of small molecules in biological samples, data are often acquired via mass spectrometry, yielding high-dimensional, highly correlated datasets with frequent missing values. Two types of missingness are common: missing at random (MAR), due to acquisition or processing errors, and missing not at random (MNAR), often caused by values falling below detection thresholds. Imputation is thus a critical step before downstream analysis. We propose a novel Truncated Gaussian Infinite Factor Analysis (TGIFA) model to address these challenges. By incorporating truncated Gaussian assumptions, TGIFA respects the physical constraints of the data, while its infinite latent factor framework removes the need to pre-specify the number of factors. Our Bayesian inference approach jointly models the MAR and MNAR mechanisms and, via a computationally efficient exchange algorithm, provides posterior uncertainty quantification for both the imputed values and the missingness types. We evaluate TGIFA through extensive simulation studies and apply it to a urinary metabolomics dataset, where it yields sensible and interpretable imputations with associated uncertainty estimates.
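
To make the two missingness mechanisms concrete, the following is a minimal sketch and not the TGIFA model itself. It simulates nonnegative intensities from a one-factor model, masks values below a detection limit (MNAR) and a random subset of the rest (MAR), then fills each gap with a draw from a truncated Gaussian. Every name and setting here (`draw_truncated`, `lod`, `mar_rate`, the loadings, the seed) is a hypothetical choice for illustration. It also assumes the missingness type of each entry is known and imputes each variable independently, whereas the abstract describes inferring the missingness type jointly with the imputed values.

```python
import random
from statistics import NormalDist, mean, stdev

def draw_truncated(mu, sigma, lower, upper, rng):
    """Inverse-CDF draw from N(mu, sigma^2) restricted to [lower, upper]."""
    dist = NormalDist(mu, sigma)
    u = rng.uniform(dist.cdf(lower), dist.cdf(upper))
    u = min(max(u, 1e-12), 1 - 1e-12)  # inv_cdf requires 0 < u < 1
    # Final clamp guards against rounding in the extreme tails.
    return min(max(dist.inv_cdf(u), lower), upper)

rng = random.Random(2025)
n, p, lod, mar_rate = 60, 5, 1.0, 0.05  # hypothetical settings
loadings = [rng.uniform(0.2, 0.8) for _ in range(p)]

# Simulate nonnegative intensities driven by a single latent factor.
complete = []
for _ in range(n):
    f = rng.gauss(0, 1)
    complete.append([draw_truncated(2.0 + l * f, 0.5, 0.0, float("inf"), rng)
                     for l in loadings])

# Label entries: MNAR if below the detection limit, else MAR with small
# probability, else observed (None).
labels = [["MNAR" if x < lod else ("MAR" if rng.random() < mar_rate else None)
           for x in row] for row in complete]

# Impute per variable, using the observed mean and standard deviation.
imputed = [row[:] for row in complete]
for j in range(p):
    obs = [complete[i][j] for i in range(n) if labels[i][j] is None]
    mu, sd = mean(obs), stdev(obs)
    for i in range(n):
        if labels[i][j] == "MNAR":
            # Below-detection values must lie in [0, lod].
            imputed[i][j] = draw_truncated(mu, sd, 0.0, lod, rng)
        elif labels[i][j] == "MAR":
            # MAR values are only constrained to be nonnegative.
            imputed[i][j] = draw_truncated(mu, sd, 0.0, float("inf"), rng)
```

The truncation bounds carry the constraint in this sketch: a value labeled MNAR can only be imputed below the detection limit, and no imputed value can be negative. Repeating the draws would give a crude spread for each imputed entry, though this is not a substitute for the posterior uncertainty a full Bayesian model provides.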

Keywords

Missing data

Metabolomics

Imputation

Infinite factor model

Mass spectrometry data 

Main Sponsor

Section on Physical and Engineering Sciences