Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques

Sari Masri; Ahmad Hasasneh; Mohammad Tami; Chakib Tadj

doi:10.3390/info15120751

Arab American University

Research output: Contribution to journal › Article published in a peer-reviewed journal › Peer-reviewed

4 Citations (Scopus)

Abstract

An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33%. For other features (spectrogram and mel-spectrogram), the performance was very similar, with an accuracy of 93.17% for the spectrogram and 94.83% accuracy for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand the why of model predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions.

Original language	English
Article number	751
Journal	Information (Switzerland)
Volume	15
Issue number	12
DOIs	https://doi.org/10.3390/info15120751
Status	Published - Dec 2024

Cite this

@article{a1290ae8f3494ac8ac970b90e8211255,

title = "Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques",

abstract = "An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33\%. For other features (spectrogram and mel-spectrogram), the performance was very similar, with an accuracy of 93.17\% for the spectrogram and 94.83\% accuracy for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand the why of model predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions.",

keywords = "audio feature, audio signals, explainable AI (XAI), gammatone frequency cepstral coefficients (GFCCs), healthcare AI, image-based representations, infant cry classification, layer-wise relevance propagation (LRP), local interpretable model-agnostic explanations (LIME), medical diagnostics, mel-spectrogram, spectrogram, vision transformers (ViTs)",

author = "Sari Masri and Ahmad Hasasneh and Mohammad Tami and Chakib Tadj",

note = "Publisher Copyright: {\textcopyright} 2024 by the authors.",

year = "2024",

month = dec,

doi = "10.3390/info15120751",

language = "English",

volume = "15",

journal = "Information (Switzerland)",

issn = "2078-2489",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "12",

}

Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques. / Masri, Sari; Hasasneh, Ahmad; Tami, Mohammad et al.
In: Information (Switzerland), Vol. 15, No. 12, 751, 12.2024.

TY - JOUR

T1 - Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques

AU - Masri, Sari

AU - Hasasneh, Ahmad

AU - Tami, Mohammad

AU - Tadj, Chakib

PY - 2024/12

Y1 - 2024/12

N2 - An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33%. For other features (spectrogram and mel-spectrogram), the performance was very similar, with an accuracy of 93.17% for the spectrogram and 94.83% accuracy for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand the why of model predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions.

AB - An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33%. For other features (spectrogram and mel-spectrogram), the performance was very similar, with an accuracy of 93.17% for the spectrogram and 94.83% accuracy for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand the why of model predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions.

KW - audio feature

KW - audio signals

KW - explainable AI (XAI)

KW - gammatone frequency cepstral coefficients (GFCCs)

KW - healthcare AI

KW - image-based representations

KW - infant cry classification

KW - layer-wise relevance propagation (LRP)

KW - local interpretable model-agnostic explanations (LIME)

KW - medical diagnostics

KW - mel-spectrogram

KW - spectrogram

KW - vision transformers (ViTs)

UR - https://www.scopus.com/pages/publications/85213083497

U2 - 10.3390/info15120751

DO - 10.3390/info15120751

M3 - Journal Article

AN - SCOPUS:85213083497

SN - 2078-2489

VL - 15

JO - Information (Switzerland)

JF - Information (Switzerland)

IS - 12

M1 - 751

ER -