Abstract
Background: The clinical adoption of automated speech recognition (ASR) systems in speech pathology for people with aphasia (PwA) is primarily limited by performance and stability issues. Detailed evaluation of such systems is critical, as clinical assessment of patients and their treatments depends heavily on transcription accuracy and precision.

Aims: This study addresses the limitations of ASR evaluation metrics, such as Word Error Rate (WER), by introducing a granular evaluation approach. The goal is to develop and apply a framework for analyzing ASR system performance, specifically on speech phenomena relevant to clinical aphasia assessment, including disfluencies, grammatical errors, and filler words. English and French data are used to test ASR transcription performance in both languages.

Methods & Procedures: We present a novel evaluation framework and accompanying software tool to assess ASR models on aphasic speech. Using annotated transcripts from the AphasiaBank database, the framework measures performance across multiple linguistic phenomena by mapping ASR-generated transcriptions against reference CHAT-encoded transcripts. Key metrics such as Character Error Rate (CER) are computed per phenomenon using Levenshtein distance, enabling a fine-grained analysis of transcription accuracy. To demonstrate the variability in performance, we applied the framework to three different ASR models, highlighting fluctuations across speech phenomena.

Outcomes & Results: Evaluation of the Whisper Large-V3 model on the English AphasiaBank dataset revealed significant variability in transcription accuracy across speech phenomena. CER ranged from 25.28% on unannotated words to 87.82% on filled pauses, a difference of over 62 percentage points. Prompting reduced the CER for filled pauses from 87.82% to 44.04%, demonstrating that task-specific tuning can yield substantial gains. Compared to control participants, PwA transcripts obtained CERs that were up to 20 percentage points higher, depending on the speech phenomenon. The largest gaps appeared in phonological and grammatical errors. In a multi-language evaluation, transcription performance on French aphasic speech was notably worse, with CERs averaging 32% higher than their English counterparts. Morphological errors in French reached up to 43.33%, compared to 29.60% in English, and semantic errors showed differences exceeding 20 percentage points.

Conclusions: These indicators help clarify the underlying behaviours of ASR models, offering better insight into their reliability. With this framework, ASR systems can be evaluated to detect weaknesses and highlight areas for improvement, helping ensure their suitability for clinical use. By offering a granular analysis of these factors, the tool empowers clinicians and researchers to make informed decisions about integrating ASR systems into speech pathology workflows.
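For readers unfamiliar with the metric, the sketch below illustrates how a per-phenomenon CER could be computed with character-level Levenshtein distance. It is a minimal illustration only: the function names, the `aligned_spans` structure, and the example spans are hypothetical and do not reflect the published tool's API or data format.

```python
# Minimal sketch of per-phenomenon CER (illustration, not the paper's tool).
# Assumes reference spans have already been extracted from CHAT-encoded
# transcripts and aligned with the corresponding ASR output spans.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def per_phenomenon_cer(aligned_spans):
    """Average CER per phenomenon over (label, reference, hypothesis) triples."""
    totals, counts = {}, {}
    for label, ref, hyp in aligned_spans:
        totals[label] = totals.get(label, 0.0) + cer(ref, hyp)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical example spans tagged by phenomenon.
spans = [
    ("filled_pause", "um", "and"),
    ("unannotated", "the dog ran", "the dog ran"),
]
print(per_phenomenon_cer(spans))
```

Reporting the error rate per phenomenon label, rather than a single corpus-level WER, is what allows the kind of contrast described in the abstract (e.g. filled pauses versus unannotated words).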
| Original language | English |
|---|---|
| Journal | Aphasiology |
| DOIs | |
| Publication status | In press - 2026 |
Keywords
- Aphasia
- automatic speech recognition
- evaluation framework