Enhancing AI with Advanced Sampling and Data Splitting Techniques for Big Data

Weng Kee Wong Chair
University of California-Los Angeles
 
Lin Wang Organizer
Purdue University
 
HaiYing Wang Organizer
University of Connecticut
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
0526 
Invited Paper Session 
Music City Center 
Room: CC-207C 

Applied: Yes

Main Sponsor

ENAR

Co-Sponsors

New England Statistical Society
WNAR

Presentations

Optimal Data Splitting

It is common to split a dataset into a training set and a testing set for building statistical and machine learning models. In this talk, we will discuss deterministic methods for optimally splitting the dataset. SPlit and Twinning are two such methods, which aim to split the dataset into subsets with similar distributional characteristics. We will propose a new method for creating a testing set that not only maintains the distribution but is also difficult to predict.
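
As a rough illustration of the idea behind deterministic, distribution-preserving splitting (this is not the SPlit or Twinning algorithm; the function name and the clustering heuristic are assumptions for illustration only), the following Python sketch selects a test set by clustering the data and taking the observation nearest each cluster center:

# Hypothetical sketch of a deterministic, distribution-preserving train/test split.
# NOT the SPlit or Twinning algorithm; it only illustrates choosing a test set
# that spreads over the same regions of the input space as the full data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def representative_split(X, test_fraction=0.2, random_state=0):
    """Pick the data point nearest each k-means centroid as the test set."""
    n_test = max(1, int(round(test_fraction * len(X))))
    km = KMeans(n_clusters=n_test, n_init=10, random_state=random_state).fit(X)
    # index of the observation closest to each centroid -> test set
    test_idx = np.unique(pairwise_distances_argmin(km.cluster_centers_, X))
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx

# Example usage on synthetic data
X = np.random.default_rng(1).normal(size=(500, 3))
train_idx, test_idx = representative_split(X, test_fraction=0.2)

In practice one would use the dedicated SPlit or Twinning implementations; the sketch only conveys why a deterministically chosen, space-filling test set can mirror the full-data distribution more faithfully than a purely random split.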

Keywords

training set

testing set

validation

experimental design 

Co-Author

Roshan Joseph, School of ISYE, Georgia Tech

Speaker

Youngseo Cho

Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs

The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from its diversity, complexity, and the rapid pace at which it must be processed or delivered. A compelling approach is to analyze a sample of the data while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, most of it relies on classical statistical models such as linear models and generalized linear models. These models are powerful tools when the relationships between input and output variables are uniform, but they may be unsuitable for complex datasets, tending to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models, and more specifically Mixture-of-Experts (ME) models, along with optimal designs.
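
For orientation, the sketch below illustrates a Mixture-of-Experts-style model in its simplest form: an EM fit of a two-component mixture of linear regressions with constant gating weights. It is a generic, hedged illustration rather than the methodology of the talk, and the function name and settings are hypothetical.

# Minimal sketch (not the speakers' method): EM for a two-component mixture of
# linear regressions, i.e. a simplified Mixture-of-Experts with constant gating.
import numpy as np

def em_mixture_of_regressions(X, y, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])      # add intercept
    betas = rng.normal(size=(2, p + 1))        # expert coefficients
    sigma2 = np.array([y.var(), y.var()])      # expert noise variances
    pi = np.array([0.5, 0.5])                  # gating (mixing) weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each expert for each observation
        dens = np.stack([
            pi[k] * np.exp(-(y - Xd @ betas[k]) ** 2 / (2 * sigma2[k]))
                  / np.sqrt(2 * np.pi * sigma2[k])
            for k in range(2)
        ], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted least squares and variance update per expert
        for k in range(2):
            W = resp[:, k]
            XtW = Xd.T * W
            betas[k] = np.linalg.solve(XtW @ Xd, XtW @ y)
            sigma2[k] = np.sum(W * (y - Xd @ betas[k]) ** 2) / W.sum()
        pi = resp.mean(axis=0)
    return betas, sigma2, pi

# Example usage: two regimes with different slopes
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
y = np.where(X[:, 0] > 0, 2.0 * X[:, 0], -1.5 * X[:, 0]) + 0.1 * rng.normal(size=300)
betas, sigma2, pi = em_mixture_of_regressions(X, y)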

Keywords

Mixture-of-experts

Subdata

Information matrix

EM algorithm 

Co-Author

Min Yang, University of Illinois at Chicago

Speaker

Min Yang, University of Illinois at Chicago

Mastering Rare Events Analysis: Optimal Subsampling and Subsample Size Determination in Cox and Logistic Regression

Massive survival datasets are becoming increasingly prevalent with the growth of the healthcare industry, and they pose computational challenges unprecedented in traditional survival analysis. A popular way of coping with massive datasets is to downsample them so that the analysis fits within the researcher's computational resources. This talk addresses right-censored and possibly left-truncated data with rare events, where the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts by assigning optimal sampling probabilities to censored observations and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions. Additionally, we present a novel optimal subsampling procedure tailored to logistic regression with imbalanced data. While a multitude of existing works offer optimal subsampling methods that minimize efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, we introduce tools for choosing the subsample size, focusing on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows, and logistic regression of linked birth and infant death data with about 28 million observations. Joint work with Nir Keret and Tal Agassi.
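
To make the rare-events subsampling idea concrete, here is a hedged Python sketch that keeps all events, subsamples the abundant non-events with a uniform probability q, and refits logistic regression with inverse-probability weights. The procedure presented in the talk instead derives optimal, non-uniform sampling probabilities and a principled subsample size; the sketch is only a simplified stand-in, and its function name and settings are hypothetical.

# Illustrative sketch, not the authors' estimator: retain every rare event,
# subsample non-events, and weight the subsample so the fit approximates the
# full-data logistic regression estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def subsampled_logistic_fit(X, y, q=0.05, seed=0):
    rng = np.random.default_rng(seed)
    events = np.flatnonzero(y == 1)          # keep every event
    nonevents = np.flatnonzero(y == 0)
    # Uniform subsampling of non-events for illustration; the talk's method
    # derives optimal, non-uniform probabilities instead.
    keep = nonevents[rng.random(nonevents.size) < q]
    idx = np.concatenate([events, keep])
    # inverse-probability weights: 1 for events, 1/q for sampled non-events
    w = np.where(y[idx] == 1, 1.0, 1.0 / q)
    model = LogisticRegression(max_iter=1000)
    model.fit(X[idx], y[idx], sample_weight=w)
    return model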

Co-Author(s)

Tal Agassi, Tel Aviv University
Nir Keret, University of Washington

Speaker

Malka Gorfine, Tel Aviv University

GEMSS-Driven Subsampling for Information Extraction and Redundancy Elimination

Subsampling is an effective approach for addressing the challenges associated with applying statistical methods to large datasets. The training of Gaussian process models, which is notoriously difficult with large-scale data, particularly benefits from subsampling techniques in big data contexts. In this study, we introduce a subsampling methodology designed to enhance the predictive accuracy of Gaussian process models in unexplored input regions. The proposed method, named Generalization Error Minimization in SubSampling (GEMSS), not only identifies informative subsets of data but also removes redundant data points that lead to numerical instability. We establish an equivalence between linear models and Gaussian process models, which facilitates the development of GEMSS. Additionally, we highlight a relevant study by Chang [J. Comput. Graph. Statist. 32 (2023) 613-630] as a specific case within our broader framework. The proposed method is justified by theoretical results and validated through numerical examples across various scenarios. 
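
As a generic illustration of informative, redundancy-avoiding subsampling for Gaussian processes (this is not GEMSS itself; the kernel, length-scale, and greedy variance criterion are assumptions for illustration), the sketch below repeatedly adds the point with the largest posterior predictive variance given the points already selected:

# Hedged sketch, not GEMSS: greedy Gaussian-process subset selection that picks
# the point with the largest posterior predictive variance at each step, a common
# surrogate for choosing informative, non-redundant observations.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def greedy_gp_subsample(X, m, lengthscale=1.0, nugget=1e-6):
    n = X.shape[0]
    K = rbf_kernel(X, X, lengthscale) + nugget * np.eye(n)
    chosen = [int(np.argmax(np.diag(K)))]    # start from the largest prior variance
    for _ in range(m - 1):
        S = chosen
        K_SS = K[np.ix_(S, S)]
        K_xS = K[:, S]
        # posterior variance of every point given the chosen subset
        sol = np.linalg.solve(K_SS, K_xS.T)
        post_var = np.diag(K) - np.einsum('ij,ji->i', K_xS, sol)
        post_var[S] = -np.inf                # exclude already-chosen points
        chosen.append(int(np.argmax(post_var)))
    return np.array(chosen)

# Example usage: select 20 informative points out of 200
idx = greedy_gp_subsample(np.random.default_rng(2).uniform(size=(200, 2)), m=20)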

Keywords

Gaussian process

Generalization error 

Co-Author

Ming-Chung Chang

Speaker

Ming-Chung Chang