Statistical challenges in the analysis of proteomics data with small sample size: a simulation study
Shervin Assassi
Co-Author
University of Texas Health Science Center at Houston
Monday, Aug 4: 9:50 AM - 10:05 AM
1074
Contributed Papers
Music City Center
Integrating proteomics and clinical data shows great promise for early, precise disease prediction and diagnosis. However, the high dimensionality and small sample sizes of proteomics data pose challenges for machine learning in identifying relevant features. While various statistical methods and pipelines are available, their efficiency, reproducibility, and clinical relevance remain unclear.
This study evaluated nine analysis pipelines using machine learning and dimensionality reduction methods on simulated data comprising 1317 proteins measured in 26 subjects (13 controls, 13 cases). With extremely small sample sizes (n < 30), all pipelines showed high performance metrics, indicating potential overfitting. Although performance metrics were similar, the proteins identified as discriminatory varied across methods. Despite this heterogeneity, the biological pathways and genetic disorders associated with the selected proteins overlapped. Sensitivity analysis showed that larger sample sizes improved biomarker stability.
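One well-known way that high performance metrics can arise when proteins far outnumber subjects is feature selection performed outside the validation loop. The sketch below illustrates that mechanism on pure-noise data with the dimensions given above (1317 proteins, 13 controls, 13 cases). It is not a reconstruction of the nine pipelines, which the abstract does not name: the mean-difference ranking, the nearest-centroid classifier, the choice of 10 selected proteins, and all function names are assumptions made for illustration.

```python
import random


def simulate_noise(n_per_group=13, n_proteins=1317, seed=0):
    """Pure-noise data: no protein truly differs between controls and cases."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
         for _ in range(2 * n_per_group)]
    y = [0] * n_per_group + [1] * n_per_group
    return X, y


def top_features(X, y, k):
    """Rank proteins by absolute difference in group means (assumed filter)."""
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    scores = []
    for j in range(len(X[0])):
        m0 = sum(X[i][j] for i in idx0) / len(idx0)
        m1 = sum(X[i][j] for i in idx1) / len(idx1)
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X_train, y_train, x_test, features):
    """Assign x_test to the class whose centroid is closest on `features`."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in features]
        dist = sum((x_test[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k=10, select_inside=True):
    """Leave-one-out accuracy, selecting proteins inside or outside the loop."""
    # Selecting on all subjects lets the held-out subject influence the features.
    fixed = None if select_inside else top_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_tr, y_tr, k) if select_inside else fixed
        if nearest_centroid_predict(X_tr, y_tr, X[i], feats) == y[i]:
            correct += 1
    return correct / len(X)


X, y = simulate_noise()
leaky_acc = loocv_accuracy(X, y, select_inside=False)
honest_acc = loocv_accuracy(X, y, select_inside=True)
print(f"selection outside the loop: {leaky_acc:.2f}")
print(f"selection inside the loop:  {honest_acc:.2f}")
```

Because the simulated groups do not differ, any accuracy well above chance in the first figure reflects the selection step rather than real signal. Repeating the selection within each fold removes that optimism.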
While most pipelines performed similarly in distinguishing the cohort groups and identifying shared pathways, careful model selection is needed to ensure reliable protein identification for downstream studies.
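The reliability of protein identification can be probed by asking how often the same proteins are selected from two independent samples of the same size. The following is a minimal sketch of such a check, not the sensitivity analysis used in the study. The number of truly discriminatory proteins (10), the effect size (1 standard deviation), the mean-difference ranking, the Jaccard index as the stability measure, and the larger group size of 100 are all assumptions chosen for illustration.

```python
import random


def simulate(n_per_group, n_proteins, n_signal, effect, rng):
    """Cases are shifted by `effect` SDs in the first n_signal proteins only."""
    controls = [[rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
                for _ in range(n_per_group)]
    cases = [[rng.gauss(effect if j < n_signal else 0.0, 1.0)
              for j in range(n_proteins)]
             for _ in range(n_per_group)]
    return controls, cases


def select_top(controls, cases, k):
    """Return the set of k proteins with the largest absolute mean difference."""
    diffs = []
    for j in range(len(controls[0])):
        m0 = sum(row[j] for row in controls) / len(controls)
        m1 = sum(row[j] for row in cases) / len(cases)
        diffs.append((abs(m1 - m0), j))
    diffs.sort(reverse=True)
    return {j for _, j in diffs[:k]}


def jaccard(a, b):
    """Overlap of two selected sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b)


def selection_stability(n_per_group, repeats=3, seed=1, n_proteins=1317,
                        n_signal=10, effect=1.0, k=10):
    """Mean Jaccard overlap between selections from pairs of independent samples."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(repeats):
        first = select_top(
            *simulate(n_per_group, n_proteins, n_signal, effect, rng), k)
        second = select_top(
            *simulate(n_per_group, n_proteins, n_signal, effect, rng), k)
        overlaps.append(jaccard(first, second))
    return sum(overlaps) / len(overlaps)


stability_small = selection_stability(n_per_group=13)
stability_large = selection_stability(n_per_group=100)
print(f"13 per group:  mean Jaccard overlap = {stability_small:.2f}")
print(f"100 per group: mean Jaccard overlap = {stability_large:.2f}")
```

Under these assumed settings, the overlap between repeated selections is expected to rise with the group size, which is the kind of pattern the sensitivity analysis describes.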
Machine learning
Proteomics data
Performance metrics
Small sample sizes
Main Sponsor
Korean International Statistical Society