Statistical challenges in the analysis of proteomics data with small sample size: a simulation study

Claudia Pedroza Co-Author
University of Texas Medical School
 
Shervin Assassi Co-Author
University of Texas Health Science Center at Houston
 
Chandra Mohan Co-Author
University of Houston
 
Kyung Hyun Lee First Author
UTHealth
 
Kyung Hyun Lee Presenting Author
UTHealth
 
Monday, Aug 4: 9:50 AM - 10:05 AM
1074 
Contributed Papers 
Music City Center 
Integrating proteomics and clinical data shows great promise for early, precise disease prediction and diagnosis. However, the high dimensionality and small sample sizes of proteomics data pose challenges for machine learning in identifying relevant features. While various statistical methods and pipelines are available, their efficiency, reproducibility, and clinical relevance remain unclear.
This study evaluated nine analysis pipelines using machine learning and dimensionality reduction methods on simulated data of 1317 proteins from 26 subjects (13 controls, 13 cases). With extremely small sample sizes (n < 30), all pipelines showed high performance metrics, indicating potential overfitting. Although performance metrics were similar, the proteins identified as discriminatory varied across methods. Despite this heterogeneity, the biological pathways and genetic disorders associated with the identified proteins overlapped. Sensitivity analysis showed that larger sample sizes improved biomarker stability.
While most pipelines performed similarly in distinguishing cohort groups and identifying shared pathways, meticulous model selection is needed to ensure reliable protein identification for downstream studies.
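
The link between sample size and biomarker stability can be illustrated with a minimal simulation sketch. It is not the authors' simulation design or any of the nine pipelines. It assumes independent standard-normal protein levels, a hypothetical set of 20 truly shifted proteins with a 1 SD mean difference, a simple Welch t-statistic ranking as the selection step, and mean pairwise Jaccard overlap of the top-20 lists across repeated datasets as the stability measure. Only the dimensions (1317 proteins, 13 subjects per group) come from the abstract; the function names and remaining parameters are illustrative.

```python
import numpy as np


def simulate(n_per_group, n_proteins=1317, n_true=20, effect=1.0, rng=None):
    """Simulate controls and cases; the first n_true proteins are shifted in cases.

    n_true and effect are assumptions for illustration, not values from the study.
    """
    rng = np.random.default_rng() if rng is None else rng
    controls = rng.normal(size=(n_per_group, n_proteins))
    cases = rng.normal(size=(n_per_group, n_proteins))
    cases[:, :n_true] += effect
    return controls, cases


def top_k_proteins(controls, cases, k=20):
    """Return the indices of the k proteins with the largest absolute Welch t-statistic."""
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    se = np.sqrt(
        cases.var(axis=0, ddof=1) / cases.shape[0]
        + controls.var(axis=0, ddof=1) / controls.shape[0]
    )
    t_abs = np.abs(diff / se)
    return set(np.argsort(t_abs)[-k:].tolist())


def selection_stability(n_per_group, n_reps=20, seed=0):
    """Mean pairwise Jaccard overlap of the top-k sets across simulated datasets."""
    rng = np.random.default_rng(seed)
    selected = [
        top_k_proteins(*simulate(n_per_group, rng=rng)) for _ in range(n_reps)
    ]
    overlaps = [
        len(a & b) / len(a | b)
        for i, a in enumerate(selected)
        for b in selected[i + 1:]
    ]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    # 13 per group mirrors the 26-subject setting; larger sizes are hypothetical.
    for n in (13, 50, 100):
        print(f"n per group = {n:3d}  stability = {selection_stability(n):.2f}")
```

Under these assumptions, the selected protein lists should agree more closely across replicate datasets as the per-group sample size grows, which is the pattern the sensitivity analysis describes. A fuller sketch would add correlated proteins and cross-validated classifiers, both of which are omitted here.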

Keywords

Machine learning

Proteomics data

Performance metrics

Small sample sizes 

Main Sponsor

Korean International Statistical Society