Tuesday, Aug 5: 10:30 AM - 12:20 PM
0513
Invited Paper Session
Music City Center
Room: CC-101D
Gen AI, Healthcare, Transfer Learning, LLM
Applied
Yes
Main Sponsor
International Indian Statistical Association
Co Sponsors
Biopharmaceutical Section
Presentations
Generative AI (GenAI) is a powerful tool for image and video generation and has been popularized by filters on social media. However, its potential in healthcare has yet to be fully explored. In this study, we use GenAI (generative models such as GANs or diffusion models) to create synthetic facial vitiligo images that can be used to train traditional computer vision models (such as the UNet). We evaluate the fidelity of the synthetic vitiligo images by using them to train a UNet model and then validating the trained model on real vitiligo images. Next, we compare the accuracy of the model trained on synthetic images with that of a model trained on real vitiligo images, using the same validation set of real vitiligo images. Finally, we use the trained UNet to generate clinically meaningful measurements of vitiligo lesions. This framework can be generalized to any disease that can be diagnosed through images: a small set of real disease images can serve as the foundation for generating a much larger set of synthetic disease images that researchers can use to train and improve the accuracy of their computer vision AI for disease quantification.
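As a companion to the evaluation described above, the sketch below shows how the synthetic-versus-real comparison could be scored: two UNet models, one trained on synthetic and one on real vitiligo images, are evaluated with the Dice coefficient on the same held-out set of real images. This is a minimal illustration assuming PyTorch; the names unet_synth, unet_real, and real_val_loader are hypothetical placeholders, not the study's actual code.

# Minimal sketch: score two trained segmentation models on the same real validation set.
# Assumes PyTorch; `unet_synth`, `unet_real`, and `real_val_loader` are hypothetical
# placeholders for models trained on synthetic vs. real images and a loader of
# batched (image, mask) tensor pairs.
import torch

def dice_score(pred_mask: torch.Tensor, true_mask: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice coefficient between binarized lesion masks."""
    pred = (pred_mask > 0.5).float()
    true = (true_mask > 0.5).float()
    inter = (pred * true).sum()
    return float((2 * inter + eps) / (pred.sum() + true.sum() + eps))

@torch.no_grad()
def mean_dice(model: torch.nn.Module, loader) -> float:
    """Average Dice of a trained segmentation model over real vitiligo images."""
    model.eval()
    scores = [dice_score(torch.sigmoid(model(img)), mask) for img, mask in loader]
    return sum(scores) / len(scores)

# Fidelity check: both models are scored on the SAME held-out real images.
# print("synthetic-trained:", mean_dice(unet_synth, real_val_loader))
# print("real-trained:     ", mean_dice(unet_real, real_val_loader))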
Keywords
Generative AI
Transfer Learning
Multimodal integration has made significant strides in recent years, evolving from early to late fusion approaches and achieving notable performance gains over single-view methods. Substantial questions remain, however, particularly at the intersection of dependence-aware multimodal integration and uncertainty-aware multiview feature selection, both of which challenge current integration paradigms. To bridge these longstanding gaps, we propose a scalable Bayesian cooperative learning method, BayesCOOP, which combines jittered group spike-and-slab L1 regularization with intermediate fusion. For uncertainty quantification, BayesCOOP employs the Bayesian bootstrap to generate approximate posterior samples via maximum a posteriori (MAP) estimation on jittered, resampled datasets. This approach inherits strong theoretical guarantees, including posterior contraction at near-optimal rates in sparse, high-dimensional regimes, while enabling scalable pseudo-posterior inference. As one of the first uncertainty-aware multimodal approaches in the field, BayesCOOP significantly outperforms state-of-the-art approaches, including early, late, and intermediate fusion. Analyzing two published multimodal datasets with BayesCOOP, we show that it can be up to 20 times more powerful than existing methods and reveals multimodal discoveries that existing approaches cannot. Our open-source software is publicly available.
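The core uncertainty-quantification step described above, approximate posterior draws from MAP fits on jittered, Dirichlet-reweighted data, can be sketched as follows. This is an illustrative outline only: the weighted lasso stands in for BayesCOOP's actual jittered group spike-and-slab L1 MAP solver with intermediate fusion, and all tuning values are assumptions.

# Minimal sketch of Bayesian-bootstrap pseudo-posterior sampling.
# Assumes NumPy and scikit-learn; the weighted Lasso below is only a stand-in
# for the jittered group spike-and-slab L1 MAP step used by BayesCOOP.
import numpy as np
from sklearn.linear_model import Lasso

def bayes_bootstrap_map(X: np.ndarray, y: np.ndarray,
                        n_draws: int = 200, lam: float = 0.1,
                        jitter_sd: float = 0.01, seed: int = 0) -> np.ndarray:
    """Approximate posterior draws: MAP fits on Dirichlet-reweighted, jittered data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    draws = np.empty((n_draws, p))
    for b in range(n_draws):
        w = rng.dirichlet(np.ones(n)) * n          # Bayesian-bootstrap weights
        y_jit = y + rng.normal(0.0, jitter_sd, n)  # jitter the resampled data
        fit = Lasso(alpha=lam).fit(X, y_jit, sample_weight=w)
        draws[b] = fit.coef_                       # one pseudo-posterior sample
    return draws

# Coefficient-wise credible intervals (quantiles of the draws) then provide the
# uncertainty-aware multiview feature selection discussed above.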
Keywords
Transfer Learning
Foundation Model
Microbiome
Machine Learning
Multi-omics
Metabolomics
This project aims to develop an AI-driven automated database to facilitate decision-making in oncology clinical trials. The downstream decision-making framework requires extensive historical data, such as tumor indication, biomarker information, Objective Response Rate (ORR), Overall Survival (OS), and Progression-free Survival (PFS). Currently, such historical data are collected in-house through manual data collection, which is inefficient and laborious.
To automate this process, we developed a variable extraction tool that retrieves essential information from various sources, including external websites and internal documents. The tool leverages recent developments in large language models to transform unstructured data into structured data, incorporating key steps such as data pre-processing, context compression, multiple extraction phases, and extraction validation.
This approach ensures high-quality data extraction comparable to manual human review. The presentation will focus on the pipeline for automated data collection.
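The pipeline steps named above (pre-processing, context compression, extraction, validation) could be organized roughly as in the sketch below. The field list, prompt, validation rule, and the call_llm wrapper are illustrative assumptions, not the tool's actual interface.

# Minimal sketch of the extraction pipeline; `call_llm` is a hypothetical wrapper
# around whichever LLM endpoint the tool actually uses.
import json
import re

FIELDS = ["indication", "biomarker", "ORR", "OS_months", "PFS_months"]  # assumed fields

def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper; replace with the actual model endpoint."""
    raise NotImplementedError

def preprocess(raw_text: str) -> str:
    """Pre-processing: strip markup and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", raw_text)).strip()

def compress_context(text: str,
                     keywords=("ORR", "overall survival", "progression-free")) -> str:
    """Context compression: keep only sentences mentioning trial endpoints."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if any(k.lower() in s.lower() for k in keywords))

def extract(text: str) -> dict:
    """Extraction phase: request a strict JSON record of the target fields."""
    prompt = ("Return a JSON object with keys " + ", ".join(FIELDS) +
              " (use null if a value is not reported):\n\n" + text)
    return json.loads(call_llm(prompt))

def validate(record: dict) -> dict:
    """Validation phase: simple type/range checks before loading into the database."""
    if record.get("ORR") is not None and not 0 <= float(record["ORR"]) <= 100:
        raise ValueError("ORR must be a percentage between 0 and 100")
    return {k: record.get(k) for k in FIELDS}

def run_pipeline(raw_text: str) -> dict:
    return validate(extract(compress_context(preprocess(raw_text))))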
Keywords
LLM
genAI
clinical trials
structured database
information retrieval
variable extraction
The integration of big data and artificial intelligence (AI) is transforming clinical drug development, driving improvements in efficiency, speed, and cost-effectiveness. AI enhances clinical trials by optimizing patient recruitment, streamlining timelines, and enabling better resource allocation. Natural language processing (NLP), in particular, facilitates the extraction of critical insights from unstructured data sources, such as electronic health records, medical literature, and patient narratives. Additionally, AI supports real-time safety monitoring, allowing for proactive adverse event detection to protect participant well-being.
This presentation explores the application of NLP to automate outcome adjudication traditionally performed by physician-led clinical events committees (CECs). The manual review process requires substantial time, resources, and expertise, but NLP-driven adjudication offers a scalable, cost-effective alternative. Our goal is to develop a model that mimics the decision-making behavior of human experts, fully automating the adjudication process while supporting CECs to save time and effort. Using clinical trial data, we demonstrate how this approach can enhance the efficiency of clinical trials, observational studies, and quality improvement initiatives, while addressing current limitations in automated adjudication.
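To make the adjudication step concrete, the sketch below classifies a long clinical narrative with a Longformer sequence classifier (Longformer appears in the keyword list). The checkpoint, label set, and the assumption that the model has been fine-tuned on CEC-labeled narratives are illustrative, not a description of the study's actual model.

# Minimal sketch: Longformer-based event classification for long clinical narratives.
# Assumes HuggingFace transformers and PyTorch; labels and fine-tuning are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["no event", "adjudicated cardiovascular event"]        # assumed label set

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=len(LABELS))      # to be fine-tuned on CEC-labeled narratives

def adjudicate(narrative: str) -> str:
    """Label one clinical narrative the way a CEC reviewer would."""
    inputs = tokenizer(narrative, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]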
Keywords
LLM
CEC
Adjudication
Cardiovascular
Longformer
Min-norm interpolators naturally emerge as implicit regularized limits of modern machine learning algorithms. Recently, their out-of-distribution risk was studied when test samples are unavailable during training. However, in many applications, a limited amount of test data is typically available during training. Properties of min-norm interpolation in this setting are not well understood. In this talk, I will present a characterization of the bias and variance of pooled min-L2-norm interpolation under covariate and model shifts. I will show that the pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications. For example, under model shift, adding data always hurts prediction when the signal-to-noise ratio is low. However, for higher signal-to-noise ratios, transfer learning helps as long as the shift-to-signal ratio lies below a threshold that I will define. I will further present data-driven methods to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes generalization error. Our results also show that under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk. Time permitting, I will introduce a novel anisotropic local law that helps achieve some of these characterizations and may be of independent interest in random matrix theory.
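For readers unfamiliar with the object being analyzed, the pooled min-L2-norm interpolator can be written in the following standard form; the source/target notation (X_s, y_s) and (X_t, y_t) is illustrative rather than taken from the talk.

\hat{\beta}_{\mathrm{pool}} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}} \|\beta\|_{2}
\quad \text{subject to} \quad X_s \beta = y_s, \;\; X_t \beta = y_t,

equivalently \hat{\beta}_{\mathrm{pool}} = X^{+} y with the stacked design X = [X_s^\top \; X_t^\top]^\top and response y = (y_s^\top, y_t^\top)^\top in the overparameterized regime p > n_s + n_t, which is the pooling of source and target samples referred to above.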
Keywords
Transfer Learning