Enhancing AI with Advanced Sampling and Data Splitting Techniques for Big Data

Abstract Number:

526 

Submission Type:

Invited Paper Session 

Participants:

Lin Wang (1), HaiYing Wang (3), Weng Kee Wong (2), Roshan Joseph (4), Min Yang (5), Malka Gorfine (6), Ming-Chung Chang (7)

Institutions:

(1) Purdue University, N/A, (2) University of California-Los Angeles, N/A, (3) University of Connecticut, N/A, (4) School of ISYE, Georgia Tech, N/A, (5) University of Illinois at Chicago, N/A, (6) Tel Aviv University, N/A, (7) Academia Sinica, N/A

Chair:

Weng Kee Wong  
University of California-Los Angeles

Co-Organizer:

HaiYing Wang  
University of Connecticut

Session Organizer:

Lin Wang  
Purdue University

Speaker(s):

Roshan Joseph  
School of ISYE, Georgia Tech
Min Yang  
University of Illinois at Chicago
Malka Gorfine  
Tel Aviv University
Ming-Chung Chang  
Academia Sinica

Session Description:

The rapid growth of AI and machine learning has been fueled by access to vast amounts of data. However, managing and extracting meaningful insights from big and complex datasets present significant challenges, including high dimensionality, noisy and imbalanced data, computational constraints, and the risk of overfitting. These issues can lead to misleading models, poor generalization, and suboptimal AI performance, especially when models are trained on biased or unrepresentative samples.

This session will delve into advanced sampling, subsampling, and data splitting techniques tailored for AI and data science applications, particularly focusing on their role in promoting robust and accurate learning from large-scale data. Attendees will explore innovative methods for handling these challenges through:
1. Sampling and Subsampling: Techniques for reducing dataset size while preserving statistical properties, improving computational efficiency without compromising model performance.
2. Data Splitting: Best practices for training, validation, and testing splits that minimize data leakage, prevent overfitting, and enhance model generalizability.
3. Case Studies: Real-world applications demonstrating how these techniques can improve predictive accuracy and robustness in AI models, particularly in complex domains like healthcare, finance, and autonomous systems.
4. Optimizing Data Use: Strategies for efficiently utilizing data through sampling and subsampling methods that maintain data integrity and enhance computational efficiency.
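As a concrete illustration of topics 1 and 2 above, the following is a minimal sketch in plain Python of stratified subsampling (reducing an imbalanced dataset while preserving class proportions) and a three-way train/validation/test split. The function names and parameters are illustrative, not drawn from any speaker's method:

```python
import random
from collections import defaultdict

def stratified_subsample(data, labels, frac, seed=0):
    """Draw a subsample that preserves each class's share of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(data, labels):
        by_class[y].append(x)
    sample = []
    for y, xs in by_class.items():
        k = max(1, round(len(xs) * frac))  # keep at least one point per class
        sample.extend((x, y) for x in rng.sample(xs, k))
    rng.shuffle(sample)
    return sample

def three_way_split(pairs, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve out test and validation sets before training,
    so no example leaks across the splits."""
    rng = random.Random(seed)
    pairs = pairs[:]
    rng.shuffle(pairs)
    n = len(pairs)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test

# Example: an imbalanced dataset with 10% positives.
data = list(range(1000))
labels = [0] * 900 + [1] * 100
sub = stratified_subsample(data, labels, frac=0.2)   # 200 points, still ~10% positive
train, val, test = three_way_split(sub)              # 70/15/15 split
```

Stratifying before splitting ensures that the minority class is represented in every partition, which is precisely the failure mode that naive uniform subsampling risks on rare-event data.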

The session will feature four expert speakers who are leaders in the fields of sampling, subsampling, and data splitting. They will provide insights into the latest research developments, share practical implementations, and discuss challenges and solutions in applying these techniques to real-world problems. The tentative titles for the presentations are:
1. Roshan Joseph: Optimal methods for data splitting
2. Min Yang: Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs
3. Malka Gorfine: Mastering Rare Events Analysis: Optimal Subsampling and Subsample Size Determination
4. Ming-Chung Chang: Supervised Stratified Subsampling for Predictive Analytics

By focusing on practical applications and cutting-edge research, this session aims to equip data scientists, AI practitioners, and researchers with the tools and knowledge to handle big data effectively, leading to more reliable and interpretable AI outcomes. Participants will leave with a deeper understanding of how thoughtful data handling and sampling techniques can drive superior learning performance and facilitate the development of cutting-edge AI solutions.

Sponsors:

ENAR 1
WNAR 2
New England Statistical Society 3

Theme: Statistics, Data Science, and AI Enriching Society

Yes

Applied

Yes

Estimated Audience Size

Small (<80)

I have read and understand that JSM participants must abide by the Participant Guidelines.

Yes

I understand and have communicated to my proposed speakers that JSM participants must register and pay the appropriate registration fee by June 3, 2025. The registration fee is nonrefundable.

I understand