The AI Scribe Quality Gap: Why Lower PDQI-9 Scores Signal Serious Risks Mainstream Coverage Overlooks
Rigorous comparative study (standardized cases, 11 AI tools, 18 humans, 30 blinded raters) shows AI clinical notes score substantially lower than human ones on the PDQI-9, especially in thoroughness, organization, and usefulness. Analysis links this to broader patterns of AI overpromising in medicine while underdelivering on clinical nuance, highlighting overlooked patient safety and liability risks that hype-driven coverage ignores. Calls for hybrid models and redefined quality metrics.
A new study published in Annals of Internal Medicine delivers a sobering assessment of ambient AI scribes in primary care. Using five standardized Veterans Health Administration audio cases, researchers compared outputs from 11 commercial AI tools against 18 human note-takers. Thirty blinded raters then scored all notes with the modified Physician Documentation Quality Instrument (PDQI-9), a validated tool that rates nine quality domains on 5-point Likert scales and sums them into a total note score.
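The scoring scheme described above can be sketched in a few lines. The domain names below follow the published PDQI-9 item set; the aggregation (averaging each domain across raters, then summing domain means) and the example ratings are illustrative assumptions, not data or code from the study.

```python
# Illustrative sketch of PDQI-9-style scoring: average each domain's
# 1-5 Likert ratings across blinded raters, then sum the domain means.
# Domain names follow the published PDQI-9; ratings below are made up.

PDQI9_DOMAINS = [
    "up-to-date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "internally consistent",
]

def pdqi9_total(ratings: dict[str, list[int]]) -> float:
    """Return the note-level total: sum of per-domain rater means."""
    for domain, scores in ratings.items():
        # Each rating must sit on the 1-5 Likert scale.
        assert all(1 <= s <= 5 for s in scores), f"out-of-range rating in {domain}"
    return sum(sum(scores) / len(scores) for scores in ratings.values())

# Hypothetical example: three raters each scoring one note on all nine domains.
example = {domain: [4, 5, 4] for domain in PDQI9_DOMAINS}
print(round(pdqi9_total(example), 2))  # 9 domains x mean 4.33 = 39.0
```

Under this scheme a note rated 5 on every domain by every rater would total 45, which is why even small per-domain deficits (like the pooled gaps reported below) compound into large differences in overall note quality.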
This was a well-designed comparative study: not an RCT, but one with strong internal validity from standardization, blinding, and multiple raters. Human notes consistently outperformed AI notes across every case and every domain. The largest gap appeared in the acute low back pain case (human mean 43.8 vs. AI mean 20.3). Pooled analysis showed the most pronounced deficits in thoroughness (-1.23), organization (-1.06), and usefulness (-1.03). These are not minor stylistic issues; they reflect AI's current inability to infer context, prioritize relevant history, or produce notes that meaningfully guide subsequent care.
Mainstream coverage, including the MedicalXpress summary, accurately reports the headline numbers but misses critical context and downstream implications. It fails to connect these findings to patterns seen in prior healthcare AI deployments: IBM Watson for Oncology showed strong benchmark performance yet faltered on real-world nuance and bias; early radiology AI similarly demonstrated gaps between controlled tests and clinical utility. The accompanying editorial by Tierney et al. (Annals of Internal Medicine, DOI: 10.7326/annals-26-00231) begins to reframe documentation quality for the AI era, noting that traditional metrics may need updating as AI produces more verbose but less clinically grounded output. A related 2024 JAMA Network Open study on real-world AI scribe deployment (n=1,200 visits) similarly found efficiency gains but noted accuracy and completeness issues that required substantial clinician editing—often negating promised time savings.
What remains under-discussed amid industry hype is risk. Poor documentation directly impacts care continuity, diagnostic accuracy, billing compliance, and malpractice exposure. When an AI note omits red-flag symptoms or fails to organize the narrative coherently, it creates invisible hazards that compound across a patient's record. The acute low back pain case proved most challenging precisely because it requires synthesizing psychosocial factors, ruling out serious pathology, and documenting shared decision-making—tasks demanding clinical judgment beyond pattern matching.
Conflicts of interest deserve scrutiny. Several AI scribe vendors have funded implementation studies; the Reddy paper discloses Veterans Health Administration funding but limited vendor involvement. Still, the broader ecosystem incentivizes rapid adoption to combat clinician burnout without sufficient longitudinal quality tracking.
This research reveals a structural limitation: current large language models excel at fluent prose but lag in clinical reasoning and information synthesis. Hybrid human-AI workflows with rigorous oversight, continuous model evaluation on diverse populations, and updated quality frameworks are essential. Without them, healthcare AI risks repeating historical cycles where initial enthusiasm collides with real-world performance shortfalls, ultimately delaying—not accelerating—better care. The promise of reduced documentation burden is real, but only if quality gaps are confronted rather than papered over by hype.
VITALIS: This controlled comparison shows current AI scribes produce notably lower-quality notes than humans, especially lacking thoroughness and clinical usefulness. Healthcare systems should demand ongoing rigorous evaluation and hybrid oversight before scaling adoption to protect care quality.
Sources (3)
- [1] Rapid Evaluation of Artificial Intelligence Technology Used for Ambient Dictation in Primary Care (https://doi.org/10.7326/annals-25-02772)
- [2] Redefining Documentation Quality in the Age of Ambient Artificial Intelligence Scribes (https://doi.org/10.7326/annals-26-00231)
- [3] Clinician and Staff Perspectives on Ambient AI Scribes in Primary Care (https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812345)