A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Résultats de recherche: Chapitre dans un livre, rapport, actes de conférenceParticipation à un ouvrage collectif lié à un colloque ou une conférenceRevue par des pairs

83 Citations (Scopus)

Résumé

Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the- art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing for an accurate prediction of valence and arousal. In particular, this model computes cross-attention weights based on the correlation between joint feature representations and individual modalities. By deploying a joint A-V feature representation into the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results1 on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It has achieved a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This represents a significant improvement over the baseline for the third challenge of Affective Behavior Analysis in-the-Wild 2022 (ABAW3) competition, with a CCC of 0.180 (0.310) and 0.170 (0.170).

langue originaleAnglais
titreProceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
EditeurIEEE Computer Society
Pages2485-2494
Nombre de pages10
ISBN (Electronique)9781665487399
Les DOIs
étatPublié - 2022
Evénement2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022 - New Orleans, Etats-Unis
Durée: 19 juin 202220 juin 2022

Série de publications

NomIEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Volume2022-June
ISSN (imprimé)2160-7508
ISSN (Electronique)2160-7516

Conférence

Conférence2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Pays/TerritoireEtats-Unis
La villeNew Orleans
période19/06/2220/06/22

Empreinte digitale

Voici les principaux termes ou expressions associés à « A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition ». Ces libellés thématiques sont générés à partir du titre et du résumé de la publication. Ensemble, ils forment une empreinte digitale unique.

Contient cette citation