Tuesday, Aug 5: 10:30 AM - 12:20 PM
4096
Contributed Papers
Music City Center
Room: CC-103A
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Health communicators can use generative AI tools to create images for stakeholder-facing materials. This study examines the differences between two image-generation tools (DALL-E and Stable Diffusion) to understand how each tool portrays individuals with cancer.
Images (n = 303) generated by each tool using the prompts "cancer patient", "breast cancer patient", "lung cancer patient", "prostate cancer patient", "cancer survivor", and "person with cancer" were coded for photorealism and for rendering errors, such as extra hands or misspelled words. Most of these images were coded as photorealistic (79.5%, n = 241) and free of significant rendering errors (84.2%, n = 255). Stable Diffusion was more likely to produce a photorealistic result (66.4%, n = 160), while DALL-E more often produced images without errors (53.3%, n = 136). Compared with images generated by DALL-E, images produced with Stable Diffusion more often depicted the person lying in bed, wearing a hospital gown, and appearing sick.
Understanding how generative AI tools portray individuals with cancer is an important step in using these tools in communications.
Keywords
AI-generated images
cancer patients
representation
ChatGPT
Stable Diffusion
visual content analysis
Synthetic data generation plays a critical role across scientific disciplines, from systematic model evaluation to augmenting limited datasets. While Wasserstein Generative Adversarial Networks have shown promise in this area, they are susceptible to mode collapse. This limitation results in generated samples that neglect critical aspects of the true data distribution, particularly its tails and minor modes, thus undermining downstream analyses and jeopardizing reliable decision-making. To address these challenges, we introduce the Penalized Optimal Transport Network (POTNet), a novel deep generative model that provably mitigates mode collapse. POTNet leverages a robust and interpretable Marginally-Penalized Wasserstein loss to steer the alignment of joint distributions. Moreover, our primal-based framework eliminates the need for a critic network, thereby circumventing the instabilities of adversarial training and obviating extensive hyperparameter tuning. Through both theoretical analysis and comprehensive empirical evaluation, we demonstrate that POTNet effectively attenuates mode collapse and substantially outperforms existing methods in accurately recovering complex underlying data structures.
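The abstract does not give the exact form of the Marginally-Penalized Wasserstein loss. As a rough numpy sketch of the general idea, one can add a penalty that matches every one-dimensional marginal on top of a joint distance (here a sliced Wasserstein estimate); the names `w1_1d` and `marginally_penalized_loss` and the weight `lam` are hypothetical, not the paper's API:

```python
import numpy as np

def w1_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    mean absolute difference of the sorted samples (the quantile coupling)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def marginally_penalized_loss(gen, real, lam=1.0, n_proj=64, rng=None):
    """Sliced estimate of the joint W1 distance, plus a penalty matching each
    coordinate-wise marginal, which discourages dropped modes and thin tails."""
    rng = np.random.default_rng(rng)
    d = real.shape[1]
    # sliced (random-projection) estimate of the joint distance
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    joint = np.mean([w1_1d(gen @ u, real @ u) for u in dirs])
    # marginal penalty: 1-D W1 on each coordinate
    marg = sum(w1_1d(gen[:, j], real[:, j]) for j in range(d))
    return joint + lam * marg
```

Because the loss is computed directly between samples (a primal formulation), no critic network is needed, which is the stability advantage the abstract highlights.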
Keywords
mode collapse
synthetic data generation
marginal penalization
marginal regularization
generative density estimation
Wasserstein distance
Modeling multidimensional longitudinal RCT data is inherently complex due to temporal dependencies, missing values, and dynamic variations in behavioral responses and outcomes over time. Traditional analysis methods often fall short in capturing the intricate temporal and group-specific patterns present in such datasets. To overcome these limitations, we introduce MITransformer, a generative pretrained transformer framework enhanced with multiple imputation for robust contextual representation learning from longitudinal biomarker data, incorporating diet quality measurements. MITransformer reconstructs input features across time points, effectively capturing temporal patterns and inter-variable relationships, while addressing missing data through multiple imputation. By applying dynamically scaled positional embeddings within the attention mechanism, the model preserves temporal relationships without distorting continuous data distributions. A gated integration mechanism selectively emphasizes input subsets, allowing the model to differentiate the importance of various input types. The contextual embeddings generated by MITransformer improve representation quality across time, facilitating better clustering and regression/classification outcomes. Our results demonstrate that these embeddings preserve biological and behavioral variation, enabling the model to distinguish between demographic subgroups such as gender without explicit labels. This approach enhances interpretability and analytical performance, laying the foundation for advanced applications such as digital twins, individualized health monitoring, and diet-related outcome prediction, and thereby extending conventional biomarker-based disease diagnosis and prognosis.
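The abstract does not specify the form of the gated integration mechanism. A minimal sketch of one common gated-fusion design, assuming two input streams (biomarkers and diet features) and hypothetical names `gated_integration`, `W`, and `b`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_integration(h_biomarker, h_diet, W, b):
    """Gated fusion of two input streams: a learned sigmoid gate decides,
    per feature, how much weight each stream receives in the combined
    representation."""
    g = sigmoid(np.concatenate([h_biomarker, h_diet], axis=-1) @ W + b)
    return g * h_biomarker + (1.0 - g) * h_diet
```

With untrained (zero) parameters the gate is 0.5 everywhere, so the fusion reduces to a simple average; training moves the gate toward whichever input type is more informative.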
Keywords
Biomarker
Diet Quality Index
Contextual Representation
Longitudinal Data Modeling
Generative Pretrained Transformer
Multiple Imputation
Generative adversarial networks (GANs) have become a cornerstone of generative AI for their ability to model complex data-generating processes. However, GAN training is notoriously unstable, often suffering from mode collapse. This work analyzes training instability through the variance of gradients, linking it to multimodality in the target distribution. To address these issues, we propose a novel GAN training framework that uses tempered distributions via convex interpolation. With a new GAN objective, the generator learns all tempered distributions simultaneously, akin to parallel tempering in statistics. Simulations demonstrate the superiority of our method over existing strategies in synthesizing image and tabular data. We theoretically show that this improvement stems from reduced gradient variance under tempered distributions. Additionally, we develop a variant of our framework to generate fair synthetic data, addressing a growing concern in trustworthy AI.
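The abstract does not detail the tempering construction. One plausible reading of "tempered distributions via convex interpolation", sketched in numpy with the hypothetical helper `tempered_samples`: interpolate each data point with base Gaussian noise, so that small temperatures yield a smoothed, near-unimodal distribution and t = 1 recovers the data exactly:

```python
import numpy as np

def tempered_samples(data, temps, rng=None):
    """Convex interpolation between base Gaussian noise and the data.
    At t = 1 the data distribution is recovered; at small t a smoothed,
    near-unimodal version is obtained, easing the multimodality that
    inflates gradient variance in GAN training."""
    rng = np.random.default_rng(rng)
    out = {}
    for t in temps:
        z = rng.normal(size=data.shape)
        out[t] = (1.0 - t) * z + t * data
    return out
```

A generator trained against all temperatures at once sees the same family of targets as a parallel-tempering sampler sees chains, which is the analogy the abstract draws.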
Keywords
Generative Adversarial Network
Parallel Tempering
Fair Data Generation
Variance Reduction of Gradients
Transporting samples from a source to a target distribution, given only finite samples from both, is a fundamental problem in machine learning, with applications in generative modeling and variational inference. We address this problem by approximating a discretized gradient flow of the MMD-regularized $\chi^2$-divergence between the evolving source and the fixed target distribution. We provide non-asymptotic error bounds for (i) optimization error (measuring convergence to the target distribution), (ii) sampling error (from finite to infinite sample size), and (iii) approximation error (due to regularization), with particular attention to their dependence on dimensionality. Our minimization scheme admits closed-form updates and employs a data-adaptive annealed regularization strategy to maximize descent. Experiments on tabular and vision datasets demonstrate the effectiveness of our approach.
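The paper's scheme discretizes the gradient flow of the MMD-regularized chi-squared divergence; as a simpler self-contained illustration of a particle update with closed form, here is one explicit Euler step of the plain (unregularized) MMD gradient flow with a Gaussian kernel. The names `gaussian_kernel_grad` and `mmd_flow_step` are hypothetical, and this sketch omits the chi-squared term and the annealed regularization:

```python
import numpy as np

def gaussian_kernel_grad(x, y, sigma):
    """Gradient w.r.t. x of k(x, y) = exp(-||x - y||^2 / (2 sigma^2)),
    broadcast over rows of y."""
    diff = x - y
    k = np.exp(-np.sum(diff**2, axis=-1, keepdims=True) / (2 * sigma**2))
    return -k * diff / sigma**2

def mmd_flow_step(particles, target, step=0.5, sigma=1.0):
    """One explicit Euler step: each particle moves down the gradient of the
    squared MMD between the particle cloud and the fixed target sample
    (self-interaction term minus attraction-to-target term)."""
    n, m = len(particles), len(target)
    new = particles.copy()
    for i in range(n):
        g_self = gaussian_kernel_grad(particles[i], particles, sigma).sum(0) / n
        g_attr = gaussian_kernel_grad(particles[i], target, sigma).sum(0) / m
        new[i] = particles[i] - step * (g_self - g_attr)
    return new
```

Iterating this step transports the source particles toward the target sample; the paper's regularization and annealing are aimed at accelerating exactly this kind of descent.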
Keywords
gradient flows
convex analysis
$\chi^2$-divergence
generative modeling
Wasserstein space
Generative models, like large language models or text-to-image diffusion models, can generate a random output or response after being given a query from a user. Representing them with vectors in a finite-dimensional Euclidean space based on their responses to a set of queries facilitates statistical decision-making tasks on black-box generative models using conventional tools. We establish sufficient conditions for consistent estimation of population-level vector representations of a set of generative models based on their sample responses to a set of queries.
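The keywords point to a raw-stress (metric MDS) embedding; as a simpler stand-in for the embedding step, here is classical multidimensional scaling, which turns a matrix of pairwise dissimilarities between models (computed from their query responses) into Euclidean coordinates. The function name `classical_mds` is illustrative, not the paper's:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed items with pairwise dissimilarity matrix D into R^dim via
    classical MDS: double-center the squared dissimilarities to get a Gram
    matrix, then take its top eigenvectors scaled by sqrt-eigenvalues."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D**2) @ J            # Gram matrix of the centered configuration
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]      # keep the largest eigenvalues
    w, V = np.clip(w[idx], 0.0, None), V[:, idx]
    return V * np.sqrt(w)
```

When the dissimilarities are exactly Euclidean, the embedding reproduces them; with noisy, sample-based dissimilarities, consistency of the estimated coordinates is the kind of question the abstract addresses.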
Keywords
generative models
multidimensional scaling
raw stress embedding
We consider post-selection inference for regression trees when the response is multivariate. In particular, we study how to appropriately test hypotheses suggested by the fitted tree. We find, as is known when the response is univariate, that to control the Type I error rate one must condition on the recursive data splits leading to the hypothesis in question. One may wish, e.g., to test whether the populations represented by two sibling nodes have the same mean. With a univariate response, proper conditioning on the splits results in a truncation of the null distribution of the test statistic such that p-values must be computed with respect to truncated normal distributions. With a multivariate response, we find that the p-values must be computed with respect to truncated multivariate normal distributions, where the truncation set is defined by a list of quadratic constraints. We show that accept-reject Monte Carlo simulation can give reliable post-selection p-values with a bivariate response and a fairly small number of predictors. To accommodate more predictors, we must consider more efficient ways to obtain probabilities from truncated multivariate normal distributions.
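A minimal sketch of the accept-reject step described above, assuming the truncation set is given as quadratic constraints q(z) = z'Az + b'z + c <= 0; the function name `truncated_mvn_pvalue` and its signature are illustrative, not the authors' implementation:

```python
import numpy as np

def truncated_mvn_pvalue(observed, stat_fn, mean, cov, constraints,
                         n=100_000, rng=None):
    """Accept-reject Monte Carlo p-value under a truncated multivariate
    normal null: draw from N(mean, cov), keep draws satisfying every
    quadratic constraint (A, b, c) with z'Az + b'z + c <= 0, and report the
    fraction of accepted draws whose statistic exceeds the observed value."""
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(mean, cov, size=n)
    keep = np.ones(n, dtype=bool)
    for A, b, c in constraints:
        q = np.einsum('ij,jk,ik->i', z, A, z) + z @ b + c
        keep &= q <= 0
    accepted = z[keep]
    if accepted.size == 0:
        raise ValueError("no draws satisfied the constraints; increase n")
    return np.mean(stat_fn(accepted) >= observed)
```

The inefficiency the abstract notes is visible here: as the number of predictors (and hence constraints) grows, the acceptance rate collapses and naive rejection sampling becomes impractical.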
Keywords
post-selection inference
regression tree
MCMC