Thursday, Aug 7: 8:30 AM - 10:20 AM
0526
Invited Paper Session
Music City Center
Room: CC-207C
Applied
Yes
Main Sponsor
ENAR
Co Sponsors
New England Statistical Society
WNAR
Presentations
It is common to split a dataset into a training set and a testing set when building statistical and machine learning models. In this talk, we will discuss deterministic methods for optimally splitting the dataset. SPlit and Twinning are two such methods, both of which aim to produce subsets whose distributional characteristics match those of the full data. We will propose a new method for creating a testing set that not only maintains the distribution but is also difficult to predict.
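As a loose illustration of the contrast between a random split and a deterministic, distribution-preserving one (this is not the SPlit or Twinning algorithm itself), the Python sketch below builds a systematic testing set along the leading principal direction and compares its marginal distributions with those of a random split of the same size; the simulated data, the 20% ratio, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic correlated data standing in for a real dataset
X = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 1.0]])

# Deterministic split: sort on the leading principal component
# and take every 5th row (~20%) as the testing set
Xc = X - X.mean(axis=0)
pc1 = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]
order = np.argsort(pc1)
test_det = order[::5]
train_det = np.setdiff1d(np.arange(len(X)), test_det)

# Random split of the same size, for comparison
test_rnd = rng.choice(len(X), size=len(test_det), replace=False)

# Marginal Kolmogorov-Smirnov distances between each testing set and the full data
for j in range(X.shape[1]):
    d_det = ks_2samp(X[test_det, j], X[:, j]).statistic
    d_rnd = ks_2samp(X[test_rnd, j], X[:, j]).statistic
    print(f"variable {j}: KS(deterministic)={d_det:.3f}  KS(random)={d_rnd:.3f}")
```

Smaller KS distances for the deterministic split indicate a testing set that tracks the full-data distribution more closely than a random split of the same size.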
Keywords
training set
testing set
validation
experimental design
The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from its diversity, complexity, and the rapid pace at which it must be processed or delivered. A compelling approach is to analyze a sample of the data while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, the majority of it relies on classical statistical models, such as linear models and generalized linear models. These models serve as powerful tools when the relationships between input and output variables are uniform. However, they may not be suitable for complex datasets, as they tend to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models and, more specifically, Mixture-of-Experts (ME) models, along with optimal designs.
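As a generic point of reference rather than the Mixture-of-Experts methodology of the talk, the sketch below subsamples rows of a simulated linear model with probabilities proportional to their statistical leverage, a quantity tied to the information matrix X'X; the sample sizes and data-generating setup are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100_000, 5, 2_000              # full-data size, dimension, subsample size
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

# Leverage scores h_i = x_i' (X'X)^{-1} x_i, computed stably via QR
Q, _ = np.linalg.qr(X)
h = np.sum(Q ** 2, axis=1)

# Information-driven subsample vs. a uniform subsample of the same size
idx_lev = rng.choice(n, size=m, replace=False, p=h / h.sum())
idx_unif = rng.choice(n, size=m, replace=False)

beta_lev, *_ = np.linalg.lstsq(X[idx_lev], y[idx_lev], rcond=None)
beta_unif, *_ = np.linalg.lstsq(X[idx_unif], y[idx_unif], rcond=None)
print("leverage subsample error:", np.linalg.norm(beta_lev - beta))
print("uniform  subsample error:", np.linalg.norm(beta_unif - beta))
```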
Keywords
Mixture-of-experts
Subdata
Information matrix
EM algorithm
Co-Author
Min Yang, University of Illinois at Chicago
Speaker
Min Yang, University of Illinois at Chicago
Massive survival datasets are becoming increasingly prevalent with the growth of the healthcare industry, and they pose computational challenges unprecedented in traditional survival analysis. A popular way of coping with massive datasets is to downsample them so that the analysis fits within the researcher's computational resources. This talk addresses right-censored and possibly left-truncated data with rare events, where the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts by assigning optimal sampling probabilities to censored observations and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions. Additionally, we present a novel optimal subsampling procedure tailored to logistic regression with imbalanced data. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimal efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, we introduce tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows, and logistic regression of linked birth and infant death data with about 28 million observations. Joint work with Nir Keret and Tal Agassi.
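A minimal sketch of the keep-all-failures idea, assuming the lifelines package and using uniform probabilities for the censored rows as a placeholder for the optimal probabilities derived in the talk (the simulated data, subsample size, and column names are illustrative):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
t_event = rng.exponential(scale=np.exp(-0.5 * x) * 50)   # relatively rare events
t_cens = rng.exponential(scale=5.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)
df = pd.DataFrame({"time": time, "event": event, "x": x})

# Keep every observed failure; subsample the censored rows
cens = df[df.event == 0]
m = 5_000                                     # censored subsample size
keep = cens.sample(n=m, random_state=1)       # uniform placeholder probabilities
keep = keep.assign(w=len(cens) / m)           # inverse-probability weight
sub = pd.concat([df[df.event == 1].assign(w=1.0), keep])

# Weighted Cox partial-likelihood fit on the subsample
cph = CoxPHFitter()
cph.fit(sub, duration_col="time", event_col="event", weights_col="w", robust=True)
print(cph.summary[["coef", "se(coef)"]])
```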
Subsampling is an effective approach for addressing the challenges of applying statistical methods to large datasets. Gaussian process models, whose training is notoriously difficult at scale, particularly benefit from subsampling in big data contexts. In this study, we introduce a subsampling methodology designed to enhance the predictive accuracy of Gaussian process models in unexplored input regions. The proposed method, named Generalization Error Minimization in SubSampling (GEMSS), not only identifies informative subsets of data but also removes redundant data points that lead to numerical instability. We establish an equivalence between linear models and Gaussian process models, which facilitates the development of GEMSS. Additionally, we highlight a relevant study by Chang [J. Comput. Graph. Statist. 32 (2023) 613-630] as a specific case within our broader framework. The proposed method is justified by theoretical results and validated through numerical examples across various scenarios.
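As a rough companion to this abstract (not the GEMSS criterion itself), the sketch below greedily grows a Gaussian process subsample by repeatedly adding the candidate point where the current predictive standard deviation is largest, using scikit-learn; the kernel, pool size, and subsample size are arbitrary assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_pool = rng.uniform(0.0, 1.0, size=(2_000, 2))                 # large candidate pool
y_pool = (np.sin(6 * X_pool[:, 0]) * np.cos(4 * X_pool[:, 1])
          + 0.05 * rng.normal(size=len(X_pool)))

selected = [0]                                                  # seed with an arbitrary point
for _ in range(29):                                             # grow a 30-point subsample
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
    gp.fit(X_pool[selected], y_pool[selected])
    _, sd = gp.predict(X_pool, return_std=True)
    sd[selected] = -np.inf                                      # never reselect a point
    selected.append(int(np.argmax(sd)))                         # largest predictive uncertainty

print("subsample indices:", selected)
```

Because each new point is taken where the fitted surrogate is least certain, the subsample spreads into unexplored input regions rather than clustering where data are already dense.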
Keywords
Gaussian process
Generalization error