An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications
Wednesday, Aug 6: 9:05 AM - 9:20 AM
1458
Contributed Papers
Music City Center
Ambient digital scribing (ADS) tools are transforming healthcare by reducing clinicians' documentation burden, potentially mitigating burnout and turnover. As AI-driven tools integrate into clinical workflows, robust governance frameworks are essential to ensure ethical, secure, and effective deployment. We propose and test a comprehensive ADS evaluation framework that combines human qualitative assessments, automated metrics, and large language models (LLMs) as evaluators. The framework evaluates transcription, diarization, and medical note generation for accuracy, fluency, coherence, completeness, and factuality, alongside simulation-based testing of bias, fairness, and adversarial resilience. Using 40 clinical audio recordings from a smoking cessation study among pregnant patients, our internally developed GPT-4o-based ADS tool demonstrated satisfactory performance. LLM-based evaluations showed strong agreement with human assessments (>57%), reducing manual review effort. Benchmarking against LLaMA-based versions confirmed the framework's utility for cross-tool comparisons. This work establishes a baseline for ADS evaluation and underscores the need for strong governance of ADS tools.
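The agreement statistic reported above can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's actual rubric or data: it assumes per-note binary quality labels from a human reviewer and an LLM evaluator, and computes simple percent agreement between them.

```python
# Hypothetical sketch of LLM-evaluator vs. human-reviewer agreement.
# Labels and data below are illustrative, not from the study.

def percent_agreement(human_labels, llm_labels):
    """Fraction of items on which the LLM evaluator matches the human label."""
    assert len(human_labels) == len(llm_labels), "label lists must align"
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return matches / len(human_labels)

# Illustrative labels for 8 clinical notes (1 = satisfactory, 0 = not).
human = [1, 1, 0, 1, 1, 0, 1, 1]
llm   = [1, 1, 0, 1, 0, 0, 1, 1]

print(f"Agreement: {percent_agreement(human, llm):.0%}")
```

In practice, chance-corrected statistics such as Cohen's kappa are often reported alongside raw agreement, since raw percent agreement can be inflated when one label dominates.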
Evaluation Framework
AI governance
Ambient Digital Scribing
AI in Healthcare
Large Language Models
Health Informatics
Main Sponsor
Section on Statistical Learning and Data Science