An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications
Wednesday, Aug 6: 9:05 AM - 9:20 AM
1458
Contributed Papers
Music City Center
Ambient digital scribing (ADS) tools are transforming healthcare by reducing clinicians' documentation burden, potentially mitigating burnout and turnover. As AI-driven tools integrate into clinical workflows, robust governance frameworks are essential to ensure ethical, secure, and effective deployment. We propose and test a comprehensive ADS evaluation framework that combines human qualitative assessments, automated metrics, and large language models (LLMs) as evaluators. The framework evaluates transcription, diarization, and medical note generation for accuracy, fluency, coherence, completeness, and factuality, alongside simulation-based testing of bias, fairness, and adversarial resilience. Using 40 clinical audio recordings from a smoking cessation study among pregnant patients, our internally developed GPT-4o-based ADS tool demonstrated satisfactory performance. LLM-based evaluations showed strong agreement with human assessments (>57%), reducing manual review effort. Benchmarking against LLaMA-based versions confirmed the framework's utility for cross-tool comparisons. This work establishes a baseline for ADS evaluation and underscores the need for strong governance of ADS tools.
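The agreement statistic reported above can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's actual rubric or data: it assumes per-note binary quality labels from a human reviewer and an LLM evaluator, and computes simple percent agreement between them.

```python
# Hypothetical sketch of LLM-evaluator vs. human-reviewer agreement.
# Labels and data below are illustrative, not from the study.

def percent_agreement(human_labels, llm_labels):
    """Fraction of items on which the LLM evaluator matches the human label."""
    assert len(human_labels) == len(llm_labels), "label lists must align"
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return matches / len(human_labels)

# Illustrative labels for 8 clinical notes (1 = satisfactory, 0 = not).
human = [1, 1, 0, 1, 1, 0, 1, 1]
llm   = [1, 1, 0, 1, 0, 0, 1, 1]

print(f"Agreement: {percent_agreement(human, llm):.0%}")
```

In practice, chance-corrected statistics such as Cohen's kappa are often reported alongside raw agreement, since raw percent agreement can be inflated when one label dominates.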
Evaluation Framework
AI governance
Ambient Digital Scribing
AI in Healthcare
Large Language Models
Health Informatics
Main Sponsor
Section on Statistical Learning and Data Science