A robust imputation method for missing data in high throughput observations

Sarmistha Das Co-Author
 
Anand Seth Co-Author
Research Mentor
 
Shesh N. Rai Co-Author
Biostats, Health Inform & Data Sci | College of Medicine
 
Bipulkumar Das First Author
University of Cincinnati
 
Bipulkumar Das Presenting Author
University of Cincinnati
 
Thursday, Aug 7: 8:35 AM - 8:50 AM
2568 
Contributed Papers 
Music City Center 
Missing data are highly prevalent in High Throughput Studies (HTS), and in such studies data are rarely missing at random. We describe varying percentages of missingness and quantify the amount of missingness in a clinical study. Acute Myeloid Leukemia (AML) is a cancer of the myeloid line of blood cells in the bone marrow and blood, and it is one of the most lethal cancer types. We have gene expression data on AML for thousands of genes, and we plan to compare three subtypes of AML (Normal, CK type, and CBF type). Gene expression data usually contain many genes with zero counts. Missing value imputation methods are versatile techniques for handling missingness: they facilitate analysis by retaining most of the dataset for further analysis. Our goal is to compare a robust imputation technique (a maximum likelihood estimation (MLE)-based approach) with conventional imputation techniques, namely mean imputation, k-nearest neighbors (KNN) imputation, and the expectation-maximization (EM) algorithm, and to choose the best one for our AML study.
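
As a rough illustration of the kind of comparison described above, the following minimal Python sketch contrasts two of the conventional methods, mean imputation and KNN imputation, on a small made-up matrix with values masked at known positions. Everything in it is an assumption for illustration only: the toy values (not AML data), the function names, the distance measure, the choice of k = 2, and the use of root mean squared error (RMSE) as the comparison criterion. The EM algorithm and the MLE-based approach are not shown, because the abstract does not describe them in enough detail to sketch.

```python
import math

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.
    Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def knn_impute(rows, k=2):
    """Replace each None with the average of that column over the k nearest
    rows that have the column observed. Distance is the root mean squared
    difference over the columns observed in both rows."""
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    fallback = mean_impute(rows)  # used only if no comparable donor exists
    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is not None:
                continue
            donors = sorted(
                (dist(r, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] != math.inf][:k]
            if donors:
                out[i][j] = sum(val for _, val in donors) / len(donors)
            else:
                out[i][j] = fallback[i][j]
    return out

def rmse(imputed, truth, mask):
    """Root mean squared error over the masked (row, column) positions."""
    errs = [(imputed[i][j] - truth[i][j]) ** 2 for i, j in mask]
    return math.sqrt(sum(errs) / len(errs))

# Hypothetical toy matrix: rows are samples, columns are genes.
# The two groups of rows stand in for two subtypes; the values are made up.
truth = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 1.9, 3.1, 3.8],
    [8.0, 9.0, 7.0, 6.0],
    [8.2, 9.1, 7.1, 6.1],
    [7.8, 8.9, 6.9, 5.9],
]

# Positions hidden on purpose so that imputed values can be scored.
mask = [(0, 1), (2, 3), (3, 0), (5, 2)]
observed = [[None if (i, j) in mask else v for j, v in enumerate(row)]
            for i, row in enumerate(truth)]

mean_rmse = rmse(mean_impute(observed), truth, mask)
knn_rmse = rmse(knn_impute(observed, k=2), truth, mask)
print(f"mean imputation RMSE: {mean_rmse:.3f}")
print(f"KNN imputation RMSE:  {knn_rmse:.3f}")
```

On this toy matrix the masked entries are hidden without regard to their values, which is the easiest case for any method. When data are not missing at random, as the abstract notes is typical in HTS, a masking scheme that mimics the real missingness mechanism would give a more realistic comparison.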

Keywords

Acute Myeloid Leukemia

High Throughput Data

Imputation

Maximum Likelihood approach

Mean, KNN, EM algorithm 

Main Sponsor

Survey Research Methods Section