Outliers in Survival Analysis: A Clustering Framework and Error Bounds for Conditional Kaplan-Meier Estimators
Sunday, Aug 3: 5:25 PM - 5:45 PM
Topic-Contributed Paper Session
Music City Center
In this paper, we propose a simple clustering-based model for outliers in survival analysis. Specifically, we model feature vectors as sampled from a mixture model, where each mixture component is associated with its own survival and censoring time distributions. We define an outlier to be a point sampled from one cluster but whose feature vector is closer to another cluster's center than to its own. Under this setup, we derive error upper bounds for $k$-nearest neighbor and kernel Kaplan-Meier estimators. We first show that in a special case where outliers do not arise (when feature vector noise is bounded and the clusters are very well-separated), $k$-nearest neighbor and kernel Kaplan-Meier estimators converge at a rate much faster than the rates previously established in the literature (which did not assume a clustering structure). However, in the general case where outliers may appear, our error bounds no longer go to 0 as the amount of training data increases. We complement these bounds with an error lower bound on how well an oracle estimator can estimate a test point's survival function. We show that a commonly assumed condition used to establish the statistical consistency of many survival estimators does not allow for the outliers we consider in this paper (namely, in our setting, survival and censoring times are not conditionally independent given a feature vector). We supplement our theoretical analysis with numerical experiments on recently developed deep kernel Kaplan-Meier estimators, showing that these estimators naturally learn embedding representations of clustered data that tend to keep the clusters well-separated and limit the presence of outliers.
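Below is a minimal, self-contained sketch (not the authors' code) of the setup the abstract describes: feature vectors drawn from a two-component Gaussian mixture in which each component has its own exponential survival and censoring time distributions, the abstract's notion of an outlier (a point sampled from one cluster whose feature vector lands closer to the other cluster's center), and a $k$-nearest neighbor Kaplan-Meier estimate at a test point. All numerical choices (cluster centers, rates, noise level, $k$) and helper names are illustrative assumptions.

```python
# Illustrative sketch only: centers, rates, noise level, and k are assumptions,
# not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Generative model: two-component Gaussian mixture over feature vectors,
# each component with its own exponential survival and censoring rates.
centers = np.array([[0.0, 0.0], [4.0, 0.0]])  # cluster centers (assumed)
surv_rates = np.array([1.0, 0.2])             # per-cluster survival rates
cens_rates = np.array([0.5, 0.5])             # per-cluster censoring rates

def sample(n, noise=1.5):
    z = rng.integers(0, 2, size=n)                        # latent cluster labels
    x = centers[z] + noise * rng.standard_normal((n, 2))  # noisy feature vectors
    t = rng.exponential(1.0 / surv_rates[z])              # survival times
    c = rng.exponential(1.0 / cens_rates[z])              # censoring times
    y = np.minimum(t, c)                                  # observed time
    d = (t <= c).astype(int)                              # event indicator
    return x, y, d, z

x, y, d, z = sample(500)

# An "outlier" in the abstract's sense: sampled from one cluster but with a
# feature vector closer to the other cluster's center.
dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
outlier = dists.argmin(axis=1) != z
print(f"outlier fraction: {outlier.mean():.3f}")

# k-NN Kaplan-Meier estimate of S(t | x_test): apply the standard
# Kaplan-Meier product-limit formula to the k nearest training points
# (ties in observed times are handled one at a time for simplicity).
def knn_kaplan_meier(x_test, k=50):
    nn = np.argsort(np.linalg.norm(x - x_test, axis=1))[:k]
    order = np.argsort(y[nn])
    times, events = y[nn][order], d[nn][order]
    surv, curve = 1.0, []
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        at_risk = k - i              # neighbors with observed time >= t_i
        if e_i:                      # an event (not a censoring) at t_i
            surv *= 1.0 - 1.0 / at_risk
        curve.append((t_i, surv))
    return curve

curve = knn_kaplan_meier(np.array([0.5, 0.0]))
print("estimated S(t | x) at the 5 largest observed times:", curve[-5:])
```

Under this sketch's assumptions, shrinking the noise level (so that it is bounded well below half the distance between centers) drives the printed outlier fraction to zero, mirroring the well-separated special case in which the fast convergence rate applies.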