Mastering Rare Events Analysis: Optimal Subsampling and Subsample Size Determination in Cox and Logistic Regression

Tal Agassi Co-Author
Tel Aviv University
 
Nir Keret Co-Author
University of Washington
 
Malka Gorfine Speaker
Tel Aviv University
 
Thursday, Aug 7: 9:25 AM - 9:50 AM
Invited Paper Session 
Music City Center 
Massive survival datasets are becoming increasingly prevalent as the healthcare industry develops, and they pose computational challenges unprecedented in traditional survival analysis. A popular way of coping with massive datasets is to downsample them to a size that the researcher's computational resources can handle. This talk addresses right-censored and possibly left-truncated data with rare events, where the observed failure times constitute only a small portion of the overall sample. We propose subsampling-based Cox regression estimators that approximate their full-data partial-likelihood-based counterparts by including all observed failures in the analysis and assigning optimal sampling probabilities to censored observations. Asymptotic properties of the proposed estimators are established under suitable regularity conditions. Additionally, we present a novel optimal subsampling procedure tailored to logistic regression with imbalanced data. While many existing works offer optimal subsampling methods that minimize efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, we introduce tools for choosing the subsample size in three settings: the Cox regression model for survival data with rare events, and logistic regression for balanced and for imbalanced datasets. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and analyses of two sizable datasets: a survival analysis of UK Biobank colorectal cancer data with about 350 million rows, and a logistic regression of linked birth and infant death data with about 28 million observations. Joint work with Nir Keret and Tal Agassi.
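
To make the general idea of rare-event subsampling concrete, the following is a minimal Python sketch for the logistic regression setting. It keeps every event, draws a subsample of non-events, and fits an inverse-probability-weighted model. It is an illustration only and not the procedure presented in the talk: the non-events are sampled with uniform probabilities rather than the optimal probabilities the abstract refers to, and the function names, simulated data, and subsample size `q` are all assumptions introduced here.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Weighted logistic regression via Newton-Raphson.

    X must already contain an intercept column; w holds observation weights.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def subsample_rare_events(y, q, rng):
    """Keep all events; sample q non-events uniformly without replacement.

    Uniform sampling is a stand-in (assumption) for optimal probabilities.
    Events get weight 1; each sampled non-event gets weight n0 / q.
    """
    events = np.flatnonzero(y == 1)
    non_events = np.flatnonzero(y == 0)
    picked = rng.choice(non_events, size=q, replace=False)
    idx = np.concatenate([events, picked])
    w = np.concatenate([np.ones(events.size),
                        np.full(q, non_events.size / q)])
    return idx, w

# Simulated imbalanced data (hypothetical coefficients, roughly 1% events).
rng = np.random.default_rng(0)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-5.0, 0.5, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

idx, w = subsample_rare_events(y, q=10_000, rng=rng)
beta_sub = fit_weighted_logistic(X[idx], y[idx], w)
beta_full = fit_weighted_logistic(X, y, np.ones(n))

print("events kept:", int(y[idx].sum()), "of", int(y.sum()))
print("subsample estimate:", beta_sub)
print("full-data estimate:", beta_full)
```

In this sketch the subsample uses all events plus 10,000 non-events instead of all 200,000 rows, and the weighted estimate can be compared directly with the full-data fit. How large `q` must be for an acceptable efficiency loss is exactly the subsample size question that the talk's tools are designed to address; the value used here is arbitrary.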