The Mathematics of Large Language Models

Chair: Carey Priebe, Johns Hopkins University

Discussant: Patrick Shafto

Organizer: Carey Priebe, Johns Hopkins University
 
Monday, Aug 4: 8:30 AM - 10:20 AM
0202 
Invited Paper Session 
Music City Center 
Room: CC-105B 

Applied: No

Main Sponsor

Section on Text Analysis

Co-Sponsors

Section on Statistical Learning and Data Science
Section on Statistics in Defense and National Security

Presentations

AI/ML-Prediction-Powered Statistical Genetics and Genomics Using Synthetic Data

In population biobanks and omics data, certain outcomes are subject to substantial missingness, which limits the power for genetic discovery. AI and ML methods, such as convolutional neural networks, deep learning, and transformers, are increasingly used to predict these outcomes. However, genome-wide association studies (GWAS) based on predicted values can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes GWAS of AI/ML-predicted phenotypes robust to prediction error by jointly analyzing the partially observed outcomes and the predicted outcomes for all subjects. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires a commonly made missing-at-random assumption but does not require correct specification of the prediction model. We present extensive simulations and ablation analyses to validate SynSurr and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits in the UK Biobank.
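To make the joint-analysis idea concrete, the toy sketch below is our own illustration, not the speaker's software: all simulated data, variable names, and the bivariate working model are hypothetical simplifications of the approach described in the abstract. Subjects with an observed phenotype contribute the joint likelihood of (phenotype, ML surrogate), subjects with a missing phenotype contribute only the surrogate marginal, and the resulting estimate of the genetic effect is compared with a complete-case analysis.

```python
# Illustrative sketch only: a toy joint analysis of a partially observed outcome
# and an ML-predicted surrogate, in the spirit of (but not identical to) the
# SynSurr idea described in the abstract. Everything here is simulated.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, maf = 5000, 0.3
g = rng.binomial(2, maf, n).astype(float)        # genotype at one variant
beta_true = 0.1
y = beta_true * g + rng.normal(size=n)           # true phenotype
yhat = 0.6 * y + rng.normal(scale=0.8, size=n)   # noisy ML surrogate, available for everyone
obs = rng.random(n) < 0.3                        # only 30% of true phenotypes observed

def negloglik(theta):
    # Bivariate normal working model for (y, yhat) given g; subjects with a
    # missing y contribute only the yhat marginal.
    a, b, c, d, log_sy, log_ss, rho_z = theta
    sy, ss = np.exp(log_sy), np.exp(log_ss)
    rho = 0.999 * np.tanh(rho_z)                 # keep correlation away from +/- 1
    mu_y, mu_s = a + b * g, c + d * g
    r_s = (yhat - mu_s) / ss
    ll_miss = -0.5 * r_s[~obs] ** 2 - np.log(ss)
    r_y = (y - mu_y) / sy
    q = (r_y[obs] ** 2 - 2 * rho * r_y[obs] * r_s[obs] + r_s[obs] ** 2) / (1 - rho ** 2)
    ll_obs = -0.5 * q - np.log(sy * ss * np.sqrt(1 - rho ** 2))
    return -(ll_obs.sum() + ll_miss.sum())

fit = minimize(negloglik, x0=np.zeros(7), method="L-BFGS-B")
print("joint estimate of genetic effect on y:", round(fit.x[1], 3))

# Baseline: standard GWAS-style regression using only subjects with observed y.
b_cc = np.polyfit(g[obs], y[obs], 1)[0]
print("complete-case estimate:", round(b_cc, 3))
```

In this simplified setup the joint fit targets the same genetic effect as the complete-case regression while borrowing information from the surrogate observed on all subjects, which is one way to picture the abstract's claim that power improves with imputation quality without changing the estimand.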

Keywords

AI and ML

Transformer

statistical inference

prediction

biobanks 

Speaker

Xihong Lin, Harvard T.H. Chan School of Public Health

Syntax-Guided Diffusion Large Language Model for Personalized Text Generation

Large language models (LLMs) have demonstrated success in generating human-like text. However, sentences generated by LLMs (e.g., ChatGPT) tend to be generic and lack personalized characteristics. Recent developments in diffusion models have shown potential for diversified generation and iterative refinement; however, limitations remain, especially when the generated text is complex. In this work, we propose a syntax-guided diffusion model to achieve both well-written and personalized text generation. A hierarchical pipeline first generates a syntactic structure and then generates the corresponding text, and an encoder is introduced to extract personalized characteristics. By incorporating syntactic information into the generation process, we can capture both general and personalized patterns of sentence construction. A novel loss function guides the cross-attention maps to align with the desired syntactic and personalized features. We further extend our framework to encourage more sophisticated and diversified paragraph generation in a hierarchical manner. We validate the effectiveness of the proposed method through comprehensive experiments and analysis, showing its capability to generate high-quality text with personalized characteristics.
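The abstract mentions a loss that guides cross-attention maps toward desired syntactic and personalized features. As a rough, hypothetical illustration of what such an auxiliary term could look like (the actual model and loss in the talk may differ substantially; `syntax_alignment_loss`, the target mask construction, and the random tensors are our own stand-ins), here is a minimal PyTorch-style sketch:

```python
# Hypothetical illustration only: a toy auxiliary loss nudging cross-attention
# maps toward a syntax-derived target mask. Not the speaker's implementation.
import torch
import torch.nn.functional as F

def syntax_alignment_loss(attn, target_mask):
    """attn: (batch, heads, query_len, key_len) cross-attention weights (rows sum to 1).
    target_mask: (batch, query_len, key_len) binary mask marking which syntactic
    slots each generated token should attend to (hypothetical construction)."""
    # Normalize the mask into a target attention distribution over keys.
    target = target_mask / target_mask.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # Average attention over heads and penalize divergence from the target.
    attn_mean = attn.mean(dim=1)
    return F.kl_div(attn_mean.clamp(min=1e-8).log(), target, reduction="batchmean")

# Toy usage with random tensors standing in for a denoiser's cross-attention.
attn = torch.softmax(torch.randn(2, 8, 16, 10), dim=-1)
mask = (torch.rand(2, 16, 10) > 0.5).float()
mask[..., 0] = 1.0  # ensure every query has at least one allowed key
print(syntax_alignment_loss(attn, mask).item())
```

In practice a term like this would simply be added, with a tuning weight, to the diffusion training objective; the abstract does not specify how the syntactic and personalized targets are actually encoded.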

Keywords

Large language model

personalization

diffusion model 

Speaker

Annie Qu, University of California, Irvine

Two Frontiers in AI Scaling

The large but finite pool of internet data that has powered AI scaling over the past decade is rapidly being depleted, motivating the search for new frontiers such as test-time scaling. This talk will introduce two recent projects that probe this new frontier. First, we will present synthetic continued pretraining, a technique that converts excess compute into statistical signal, scaling models toward better data efficiency. Next, we will introduce the s1-32B model, an open-source test-time scaling mechanism that uses minimal resources and offers a transparent view of how scaling behaves. Overall, we will see that there are exciting opportunities around (continued) pretraining data scaling and a cautiously optimistic path toward test-time scaling.
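The abstract does not spell out the s1-32B mechanism, so rather than guess at it, the sketch below shows test-time scaling in its most generic form, majority voting over repeated samples, purely to make "spending more inference compute" concrete. This is not the method presented in the talk, and `sample_answer` is a hypothetical placeholder for a real model call.

```python
# Generic illustration of test-time scaling via majority voting over sampled answers.
# NOT the s1-32B mechanism from the talk; just a simple baseline that shows how
# accuracy can improve as the inference-time sample budget grows.
from collections import Counter
import random

def sample_answer(question: str) -> str:
    """Hypothetical placeholder: in practice this would call an LLM and extract
    its final answer; here it simulates a noisy solver."""
    return random.choices(["42", "41", "43"], weights=[0.5, 0.25, 0.25])[0]

def majority_vote(question: str, n_samples: int) -> str:
    """More samples -> more inference compute -> usually a more reliable answer."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

for n in (1, 8, 64):  # scaling the test-time budget
    print(n, majority_vote("toy question", n))
```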

Keywords

Synthetic data

Language model 

Co-Author

Zitong Yang

Speaker

Shuangping Li, Stanford University