Monday, Aug 4: 8:30 AM - 10:20 AM
0202
Invited Paper Session
Music City Center
Room: CC-105B
Applied: No
Main Sponsor
Section on Text Analysis
Co Sponsors
Section on Statistical Learning and Data Science
Section on Statistics in Defense and National Security
Presentations
Within population biobanks and omic data, certain outcomes are subject to substantial missing data, which limits the power for genetic discovery. AI and ML methods, such as convolutional neural networks, deep learning models, and transformers, are increasingly used to predict these outcomes. However, GWAS performed on predicted values can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes GWAS of AI/ML-predicted phenotypes robust to prediction errors by jointly analyzing the partially observed outcomes and the predicted outcomes for all subjects. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires the commonly made missing-at-random assumption but does not require correct specification of the prediction model. We present extensive simulations and ablation analyses to validate SynSurr, and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits within the UK Biobank.
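As a rough illustration of the joint-analysis idea, the sketch below simulates a partially observed phenotype and an imperfect ML surrogate, then estimates the genetic effect by full-information maximum likelihood under a bivariate normal working model. The simulation settings, parameterization, and optimizer choice are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch of joint analysis of a partially observed phenotype y and a
# fully observed ML surrogate s, regressed together on genotype g under a
# bivariate normal working model. All settings below are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, maf, beta_g = 5000, 0.25, 0.10
g = rng.binomial(2, maf, size=n).astype(float)   # genotype coded 0/1/2
y = beta_g * g + rng.normal(0.0, 1.0, n)         # true phenotype
s = 0.8 * y + rng.normal(0.0, 0.6, n)            # imperfect ML surrogate
obs = rng.random(n) < 0.3                        # y observed for ~30% of subjects (MAR)

def negloglik(theta):
    a_y, b_y, a_s, b_s, log_sd_y, log_sd_s, z = theta
    sd_y, sd_s = np.exp(log_sd_y), np.exp(log_sd_s)
    rho = np.tanh(z)                             # correlation constrained to (-1, 1)
    mu_y, mu_s = a_y + b_y * g, a_s + b_s * g
    # Everyone contributes the marginal density of the surrogate, f(s);
    # subjects with observed y additionally contribute the conditional f(y | s).
    ll = norm.logpdf(s, mu_s, sd_s).sum()
    cond_mean = mu_y[obs] + rho * sd_y / sd_s * (s[obs] - mu_s[obs])
    cond_sd = sd_y * np.sqrt(1.0 - rho**2)
    ll += norm.logpdf(y[obs], cond_mean, cond_sd).sum()
    return -ll

fit = minimize(negloglik, x0=np.zeros(7), method="BFGS")
print("estimated genetic effect on y:", fit.x[1])  # targets the same beta_g as standard GWAS
```

In this working model, subjects with missing y still contribute through f(s), so an informative surrogate adds power, while a useless surrogate (correlation near zero) simply recovers the complete-case analysis.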
Keywords
AI and ML
Transformer
statistical inference
prediction
biobanks
Speaker
Xihong Lin, Harvard T.H. Chan School of Public Health
Large language models (LLMs) have demonstrated success in generating human-like text. However, sentences generated by LLMs (e.g., ChatGPT) tend to be generic and lack personalized characteristics. Recent developments in diffusion models have shown potential for diversified generation and iterative refinement; however, limitations remain, especially when the generated text is complex. In this work, we propose a syntax-guided diffusion model that achieves both well-written and personalized text generation. A hierarchical pipeline is designed to first generate a syntactic structure and then generate the corresponding text, and an encoder is introduced to extract personalized characteristics. By incorporating syntactic information into the generation process, we capture both general and personalized patterns of sentence construction. A novel loss function guides the cross-attention maps to align with the desired syntactic and personalized features. We further extend our framework to encourage more sophisticated and diversified paragraph generation in a hierarchical manner. We validate the effectiveness of the proposed method through comprehensive experiments and analyses, demonstrating its capability to generate high-quality text with personalized characteristics.
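To make the cross-attention guidance concrete, here is a minimal sketch of one plausible form of such a loss: it rewards attention mass that falls inside a desired syntactic alignment mask. The tensor shapes, the mask, and the log-mass formulation are assumptions for illustration, not the paper's actual loss.

```python
# Illustrative sketch (not the paper's exact formulation): a loss that nudges
# cross-attention maps toward a desired syntactic alignment. `attn` holds
# attention weights from text-token queries to syntax-template keys; `target`
# is an assumed binary mask marking which syntax slots each token should use.
import torch

def attention_alignment_loss(attn: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, n_text, n_syntax), rows summing to 1 over n_syntax.
    target: (batch, n_text, n_syntax) binary alignment mask."""
    attn = attn.mean(dim=1)                        # average the heads
    mass_on_target = (attn * target).sum(dim=-1)   # attention mass on desired slots
    return -torch.log(mass_on_target.clamp_min(1e-8)).mean()

# Toy usage with random maps; in training, a term like this would be added to
# the diffusion model's denoising objective so gradients pull the attention
# maps toward the planned syntactic structure.
attn = torch.softmax(torch.randn(2, 8, 10, 6), dim=-1)
target = (torch.rand(2, 10, 6) > 0.5).float()
print(attention_alignment_loss(attn, target))
```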
Keywords
Large language model
personalization
diffusion model
Speaker
Annie Qu, University of California, Irvine
Large but finite internet data, which has powered AI scaling over the past decade, is rapidly being depleted, motivating the search for new frontiers such as test-time scaling. This talk introduces two recent projects that probe this frontier. First, we present synthetic continued pretraining, a technique that converts excess compute into statistical signal, improving the model's data efficiency. Next, we introduce the s1-32B model, an open-source test-time scaling approach that uses minimal resources and provides a transparent view of scaling. Overall, we see exciting opportunities around (continued) pretraining data scaling, and a cautiously optimistic path toward test-time scaling.
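For intuition about test-time scaling in general (this is a generic majority-voting illustration, not s1's specific mechanism), the toy sketch below shows accuracy on a stub task rising as more inference-time samples are spent per question. The noisy answer function standing in for a language model is entirely an assumption of this sketch.

```python
# Generic test-time scaling illustration: spend more samples per question at
# inference and aggregate by majority vote. The stub `answer` function plays
# the role of an LLM that is right 55% of the time.
import random
from collections import Counter

random.seed(0)

def answer(question: int) -> int:
    """Stub 'model': returns the true answer with probability 0.55."""
    return question if random.random() < 0.55 else random.randint(0, 9)

def solve(question: int, n_samples: int) -> int:
    votes = Counter(answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]            # majority vote

for n in (1, 5, 25, 125):                        # scale test-time compute
    acc = sum(solve(q, n) == q for q in range(200)) / 200
    print(f"samples per question = {n:3d}  accuracy = {acc:.2f}")
```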
Keywords
Synthetic data
Language model