Abstract
Background: The clinical adoption of automated speech recognition (ASR) systems in speech pathology for people with aphasia (PwA) is primarily limited by performance and stability issues. Detailed evaluation of such systems is critical, as clinical assessment of patients and their treatments depends heavily on transcription accuracy and precision.

Aims: This study addresses the limitations of ASR evaluation metrics, such as Word Error Rate (WER), by introducing a granular evaluation approach. The goal is to develop and apply a framework for analyzing ASR system performance, specifically on speech phenomena relevant to clinical aphasia assessment, including disfluencies, grammatical errors, and filler words. English and French data are used to test ASR transcription performance in both languages.

Methods & Procedures: We present a novel evaluation framework and accompanying software tool to assess ASR models on aphasic speech. Using annotated transcripts from the AphasiaBank database, the framework measures performance across multiple linguistic phenomena by mapping ASR-generated transcriptions against reference CHAT-encoded transcripts. Key metrics such as Character Error Rate (CER) are computed per phenomenon using Levenshtein distance, enabling a fine-grained analysis of transcription accuracy. To demonstrate the variability in performance, we applied the framework to three different ASR models, highlighting fluctuations across speech phenomena.

Outcomes & Results: Evaluation of the Whisper Large-V3 model on the English AphasiaBank dataset revealed significant variability in transcription accuracy across speech phenomena. CER ranged from 25.28% on unannotated words to 87.82% on filled pauses, a difference of over 62 percentage points. Prompting reduced the CER for filled pauses from 87.82% to 44.04%, demonstrating that task-specific tuning can yield substantial gains. Compared to control participants, PwA transcripts obtained CERs that were up to 20 percentage points higher, depending on the speech phenomenon. The largest gaps appeared in phonological and grammatical errors. In a multi-language evaluation, transcription performance on French aphasic speech was notably worse, with CERs averaging 32% higher than their English counterparts. Morphological errors in French reached up to 43.33%, compared to 29.60% in English, and semantic errors showed differences exceeding 20 percentage points.

Conclusions: These indicators help clarify the underlying behaviours of ASR models, offering better insight into their reliability. With this framework, ASR systems can be evaluated to detect weaknesses and highlight areas for improvement, helping ensure their suitability for clinical use. By offering a granular analysis of these factors, the tool empowers clinicians and researchers to make informed decisions about integrating ASR systems into speech pathology workflows.
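For readers unfamiliar with the metric, the sketch below illustrates how a per-phenomenon CER could be computed with character-level Levenshtein distance. It is a minimal illustration only: the function names, the `aligned_spans` structure, and the example spans are hypothetical and do not reflect the published tool's API or data format.

```python
# Minimal sketch of per-phenomenon CER (illustration, not the paper's tool).
# Assumes reference spans have already been extracted from CHAT-encoded
# transcripts and aligned with the corresponding ASR output spans.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def per_phenomenon_cer(aligned_spans):
    """Average CER per phenomenon over (label, reference, hypothesis) triples."""
    totals, counts = {}, {}
    for label, ref, hyp in aligned_spans:
        totals[label] = totals.get(label, 0.0) + cer(ref, hyp)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical example spans tagged by phenomenon.
spans = [
    ("filled_pause", "um", "and"),
    ("unannotated", "the dog ran", "the dog ran"),
]
print(per_phenomenon_cer(spans))
```

Reporting the error rate per phenomenon label, rather than a single corpus-level WER, is what allows the kind of contrast described in the abstract (e.g. filled pauses versus unannotated words).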
| Original language | English |
|---|---|
| Journal | Aphasiology |
| DOIs | |
| Publication status | In press - 2026 |
Keywords
- Aphasia
- automatic speech recognition
- evaluation framework