Large Language Models and their Applications

Chair

Oluwole Oyebamiji, University of Birmingham
 
Wednesday, Aug 6: 8:30 AM - 10:20 AM
4146 
Contributed Papers 
Music City Center 
Room: CC-106A 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

WITHDRAWN: AI Revolution in Government Statistics: Boosting Efficiency and Data Quality

The Manpower Research and Statistics Department of the Ministry of Manpower (Singapore) has harnessed Artificial Intelligence (AI) and Machine Learning (ML) to enhance our statistical production processes. As part of our ongoing efforts to modernize government statistics functions, we have implemented AI solutions that are transforming our data management and operational efficiency. At the heart of this transformation is our AI-powered occupation auto-coder. Built on the BERT model, the system efficiently processes large volumes of open-text data, assigning standard occupation codes with high accuracy; it has delivered a 30% reduction in manual labelling requirements together with a notable improvement in data quality. We have also developed a suite of AI bots to streamline our operations. These bots, built on Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) technologies, include an intelligent chatbot that has transformed how our survey officers access information, cutting response times for complex inquiries in half and significantly enhancing our customer service capabilities.
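
For readers unfamiliar with this kind of pipeline, a minimal sketch in Python of BERT-based occupation coding is shown below. The checkpoint, label set, and the code_occupation helper are illustrative assumptions, not the Ministry of Manpower's production system, and the classification head would need fine-tuning on labelled survey text before the predictions mean anything.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Hypothetical occupation codes; a real system would use a national standard
# occupational classification with many more classes.
OCCUPATION_CODES = ["software developer", "sales representative", "waiter"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(OCCUPATION_CODES)
)  # the classification head is randomly initialised until fine-tuned

def code_occupation(free_text: str) -> str:
    """Assign the most likely occupation code to a free-text job description."""
    inputs = tokenizer(free_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return OCCUPATION_CODES[int(logits.argmax(dim=-1))]

# Demonstration only: without fine-tuning the output is essentially arbitrary.
print(code_occupation("develops backend services for a logistics platform"))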

Keywords

intelligent AI Bot

occupation auto-coder and classification 

Co-Author

Boon Heng Ang

First Author

Daniel Tze Wei Sim, Ministry of Manpower

AI-Powered Critical Reflection: Measuring Transformative Learning Through Simulated Decision-Making

Critical reflection is essential for leadership development, career adaptability, and effective decision-making, yet measuring its depth and impact remains a challenge. This study introduces a generative AI-driven simulation framework to model and analyze reflective learning in real-world decision-making scenarios.
AI agents dynamically generate context-specific dilemmas faced by professionals in healthcare, education, and business leadership, guiding learners through AI-facilitated reflective dialogues. Responses are analyzed using Natural Language Processing (NLP) and topic modeling to track reasoning patterns and cognitive shifts.
To quantify transformative learning, we apply Latent Semantic Analysis (LSA) and BERT embeddings to measure changes in cognitive complexity, while longitudinal mixed-effects models assess behavioral adaptation over time.
Preliminary results show that AI-driven reflective coaching enhances critical thinking, adaptability, and problem-solving efficiency, particularly for individuals in career transitions. This study advances quantitative methods for assessing reflection, offering a scalable, AI-powered framework for professional education and coaching.
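
As a rough illustration of the kind of measurement described above, the Python sketch below scores the semantic shift between earlier and later reflections using TF-IDF plus truncated SVD (an LSA variant). The example responses, the latent dimensionality, and the use of cosine distance as a "shift" proxy are assumptions for illustration, not the study's actual pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Synthetic reflections from one learner at four points in the simulation.
reflections = [
    "I followed the protocol because the checklist told me to.",
    "I followed the protocol but wondered whether it fit this patient.",
    "I weighed the protocol against the patient's stated values.",
    "I explained why I departed from the protocol and what trade-off I accepted.",
]

# LSA: TF-IDF followed by truncated SVD into a small latent space.
tfidf = TfidfVectorizer().fit_transform(reflections)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# One crude proxy for cognitive shift: cosine distance between the first and
# last reflections in the latent space.
shift = 1 - cosine_similarity(lsa[:1], lsa[-1:])[0, 0]
print(f"semantic shift (first vs last reflection): {shift:.3f}")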

Keywords

Generative AI

Transformative Learning

NLP

Reflective Coaching

Professional Development

Cognitive Complexity 

Co-Author

Junyi Yu

First Author

Hanxia Li

Presenting Author

Hanxia Li

An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications

Ambient digital scribing (ADS) tools are transforming healthcare by reducing clinicians' documentation burden, potentially mitigating burnout and turnover. As AI-driven tools integrate into clinical workflows, robust governance frameworks are essential to ensure ethical, secure, and effective deployment. We propose and test a comprehensive ADS evaluation framework combining human qualitative assessments, automated metrics, and large language models (LLMs) as evaluators. The framework evaluates transcription, diarization, and medical note generation for accuracy, fluency, coherence, completeness, and factuality, alongside simulation-based testing of bias, fairness, and adversarial resilience. Using 40 clinical audio recordings from a smoking cessation study among pregnant patients, our internally developed GPT-4o-based ADS tool demonstrated satisfactory performance. LLM-based evaluations showed strong agreement with human assessments (>57%), reducing manual review effort. Benchmarking against LLaMA-based versions confirmed the framework's utility for cross-tool comparisons. This work establishes a baseline for ADS evaluation and emphasizes the need for strong governance of ADS tools.
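
The LLM-as-evaluator step can be pictured with the hedged Python sketch below. The prompt wording, rubric format, judge model name, and the judge_note helper are illustrative assumptions rather than the authors' implementation, and the example text is synthetic, not study data.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = ["accuracy", "fluency", "coherence", "completeness", "factuality"]

def judge_note(transcript: str, note: str) -> str:
    """Ask a judge model to score a generated clinical note against its transcript."""
    prompt = (
        "You are evaluating an AI-generated clinical note against the visit transcript.\n\n"
        f"Transcript:\n{transcript}\n\nGenerated note:\n{note}\n\n"
        f"Score the note from 1 to 5 on each of: {', '.join(RUBRIC)}. "
        "Return one line per dimension formatted as 'dimension: score - brief reason'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example call with synthetic text (not study data):
# print(judge_note("Patient reports smoking five cigarettes a day...",
#                  "Pt smokes 5 cigarettes/day; cessation counseling provided."))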

Keywords

Evaluation Framework

AI governance

Ambient Digital Scribing

AI in Healthcare

Large Language Models

Health Informatics 

First Author

Haoyuan Wang, Duke

Presenting Author

Haoyuan Wang, Duke

CoT Information: A Theory of Statistical Learning under Chain-of-Thought Supervision

Learning complex functions that involve multi-step reasoning poses a significant challenge for standard supervised learning from input-output examples. Chain-of-thought (CoT) supervision, which provides intermediate reasoning steps together with the final output, has emerged as a powerful empirical technique, underpinning much of the recent progress in the reasoning capabilities of large language models. This paper develops a statistical theory of learning under CoT supervision. A key characteristic of the CoT setting, in contrast to standard supervision, is the mismatch between the training objective (CoT risk) and the test objective (end-to-end risk). A central part of our analysis, distinguished from prior work, is explicitly linking these two types of risk to achieve sharper sample complexity bounds. This is achieved via the CoT information measure CoTInfo(ε), which quantifies the additional discriminative power gained from observing the reasoning process. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard end-to-end (E2E) supervision. Specifically, it is shown that the sample complexity required to achieve a target E2E error ε scales as d/CoTInfo(ε), where d is a measure of hypothesis class complexity; this can be much smaller than the standard d/ε rate. Information-theoretic lower bounds in terms of the CoT information are also obtained. Together, these results suggest that CoT information is a fundamental measure of statistical complexity for learning under chain-of-thought supervision.
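
In LaTeX, a rough restatement of the claimed comparison (with notation paraphrased from the abstract, not necessarily the paper's own definitions) is:

% Paraphrase of the abstract's sample-complexity comparison; the notation is
% ours and may differ from the paper's exact statements.
\[
  n_{\mathrm{CoT}}(\varepsilon) \;\lesssim\; \frac{d}{\mathrm{CoTInfo}(\varepsilon)}
  \qquad \text{versus} \qquad
  n_{\mathrm{E2E}}(\varepsilon) \;\lesssim\; \frac{d}{\varepsilon},
\]
% so CoT supervision is advantageous precisely when
% $\mathrm{CoTInfo}(\varepsilon) \gg \varepsilon$, i.e., when observing the
% intermediate reasoning steps is much more discriminative than observing the
% final answer alone.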

Keywords

Statistical Learning Theory

Autoregressive Learning

Out-of-Distribution (OOD) Generalization

Length Generalization

Language Models

Machine Learning 

Co-Author(s)

Omar Montasser, Yale University
John Lafferty, Yale University

First Author

Awni Altabaa

Presenting Author

John Lafferty, Yale University

WITHDRAWN: Dive Inside a Digital Assistant Brain

In this talk, I will introduce language models in conversational digital assistants. I will dive into the language model and present the data science life cycle of a virtual assistant chatbot, from designing skills and populating utterances through the full process of building, training, and testing the model on endpoints. I will introduce the audience to virtual assistant terminology, including utterances, skills, intents, entities, synonyms, features (or phrase lists), and patterns, and explain how these enhance the accuracy of the chatbot. The language model provides a single service for intent recognition, routing, and entity extraction. I will also cover the staging, development, and production environments used for live deployment. The pigeonhole classification model follows a train-test-publish process, and topics such as active learning and real-user versus synthetic utterance sets will be discussed as ways to improve the language model's performance. Continuous monitoring in the production and maintenance phases completes the life cycle. Finally, I will briefly discuss applications of virtual assistants in patient safety and healthcare.
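
As a toy illustration of the train-test stage of such a pigeonhole (intent) classifier, the Python sketch below maps utterances to intents. The utterances, intent labels, and model choice are made up for illustration and are not tied to any particular digital-assistant platform.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy utterance set labelled with intents (a real assistant uses far more data,
# plus entities, synonyms, and phrase-list features).
utterances = [
    "book me a flight to Chicago",
    "I need a plane ticket for Friday",
    "what's the weather tomorrow",
    "will it rain this weekend",
    "cancel my reservation",
    "please drop my booking",
]
intents = ["BookFlight", "BookFlight", "GetWeather", "GetWeather", "Cancel", "Cancel"]

# Train the intent classifier (the "train" step of train-test-publish).
intent_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_model.fit(utterances, intents)

# Test on an unseen utterance before publishing to a staging endpoint.
print(intent_model.predict(["will there be rain on Sunday"])[0])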

Keywords

natural language processing

digital (virtual) assistant

language model

utterance

intent recognition and entity extraction

classification 

First Author

Walid Sharabati, Purdue University

Large language models empower meta-analysis in the big data era

In the current big data era, large data repositories containing thousands of studies present opportunities for meta-analysis but require labor-intensive, time-consuming screening. To address this, we developed a framework using large language models (LLMs) to determine and justify whether a study dataset is suitable for any given meta-analysis based on the dataset description, the dataset itself, the study paper, or some combination. We demonstrated this framework for a meta-analysis on adjuvant chemotherapy response in non-small cell lung cancer, screening clinical data from 536 studies in the NCBI Gene Expression Omnibus (GEO) repository using the cost-effective GPT-4o mini LLM in a zero-shot setting. We found that the framework was more sensitive than traditional keyword search in identifying suitable studies while cutting screening time to hours. To streamline the framework and enable scientists to efficiently identify relevant studies for meta-analysis, we developed a publicly available app implementing this framework for screening studies in the GEO repository and PubMed, with the goal of accelerating scientific discovery.
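
The zero-shot screening step can be sketched in Python as below. The prompt wording, the screen_study helper, and the answer-parsing convention are assumptions for illustration and simplify what the actual framework and app do.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

META_ANALYSIS_TOPIC = "adjuvant chemotherapy response in non-small cell lung cancer"

def screen_study(dataset_description: str) -> str:
    """Zero-shot suitability call for one repository study description."""
    prompt = (
        f"Meta-analysis topic: {META_ANALYSIS_TOPIC}\n\n"
        f"Dataset description:\n{dataset_description}\n\n"
        "Is this dataset suitable for the meta-analysis? "
        "Answer 'Yes' or 'No' on the first line, then give a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# In the full framework this call would run over hundreds of GEO records.
# print(screen_study("Gene expression profiling of resected NSCLC tumors with "
#                    "follow-up on adjuvant chemotherapy and survival."))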

Keywords

natural language processing

open big data

study screening

data-driven research

text mining 

Co-Author(s)

Hojin Moon, California State University - Long Beach
Michelle Cheuk, California State University, Long Beach

First Author

Owen Sun, California Academy of Mathematics and Science

Presenting Author

Owen Sun, California Academy of Mathematics and Science

Representing a Collection of Large Language Models as a Gaussian Mixture

Motivated by the prevalence of black-box large language models (LLMs), we aim to understand the statistical properties of LLMs via response embeddings. We consider prompt augmentation, with each augmentation corresponding to one augmented LLM. Statistical consistency of the response embeddings is established. We define a measure of dissimilarity between response embeddings for a collection of augmented LLMs, which leads to a matrix of comparative dissimilarities. We consider, via multidimensional scaling (MDS), representing this dissimilarity matrix in low-dimensional Euclidean space. Under regularity conditions, we prove a row-wise central limit theorem for the MDS representation associated with the collection of augmented LLMs. That is, for a given query set, the MDS embedding of each augmented LLM asymptotically follows a Gaussian mixture model distribution when the augmentations are drawn from a mixture distribution. 
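
The MDS step described above can be illustrated with the Python sketch below. The dissimilarity values are synthetic and the embedding dimension is an assumption, so this shows only the mechanics of the low-dimensional representation, not the paper's actual dissimilarity measure or asymptotic results.

import numpy as np
from sklearn.manifold import MDS

# Synthetic symmetric dissimilarity matrix for four augmented LLMs
# (a stand-in for the paper's embedding-based dissimilarity measure).
D = np.array([
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.3],
    [0.7, 0.6, 0.3, 0.0],
])

# Metric MDS on the precomputed dissimilarities; each row of the output is
# the 2-D Euclidean representation of one augmented LLM.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)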

Keywords

Gaussian Mixture Model

Multidimensional Scaling

Central Limit Theorem

Prompt Augmentation

Large Language Model 

Co-Author(s)

Carey E. Priebe, Johns Hopkins University
Runbing Zheng, Johns Hopkins University

First Author

Zekun Wang, Johns Hopkins University

Presenting Author

Zekun Wang, Johns Hopkins University