An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications

Haoyuan Wang (First Author, Presenting Author)
Duke
Wednesday, Aug 6: 9:05 AM - 9:20 AM
1458 
Contributed Papers 
Music City Center 
Ambient digital scribing (ADS) tools are transforming healthcare by reducing clinicians' documentation burden, potentially mitigating burnout and turnover. As AI-driven tools integrate into clinical workflows, robust governance frameworks are essential to ensure ethical, secure, and effective deployment. We propose and test a comprehensive ADS evaluation framework that combines human qualitative assessments, automated metrics, and large language models (LLMs) as evaluators. The framework evaluates transcription, diarization, and medical note generation for accuracy, fluency, coherence, completeness, and factuality, alongside simulation-based testing of bias, fairness, and adversarial resilience. Using 40 clinical audio recordings from a smoking cessation study among pregnant patients, our internally developed GPT-4o-based ADS tool demonstrated satisfactory performance. LLM-based evaluations agreed with human assessments in more than 57% of cases, reducing manual review effort. Benchmarking against LLaMA-based versions confirmed the framework's utility for cross-tool comparisons. This work establishes a baseline for ADS evaluation and underscores the need for strong governance of ADS tools.

Keywords

Evaluation Framework

AI Governance

Ambient Digital Scribing

AI in Healthcare

Large Language Models

Health Informatics 

Main Sponsor

Section on Statistical Learning and Data Science