A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Research output: Contribution to Book/Report types › Contribution to conference proceedings › peer-review

83 Citations (Scopus)

Abstract

Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing for an accurate prediction of valence and arousal. In particular, this model computes cross-attention weights based on the correlation between joint feature representations and individual modalities. By deploying a joint A-V feature representation into the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It has achieved a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This represents a significant improvement over the baseline for the third challenge of the Affective Behavior Analysis in-the-Wild 2022 (ABAW3) competition, which has a CCC of 0.180 (0.310) and 0.170 (0.170).
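The abstract reports results as a concordance correlation coefficient (CCC). As a reading aid, the following is a minimal sketch of the standard CCC definition (Lin's coefficient), not the authors' evaluation code. The function name `concordance_cc` and the sample values are hypothetical, and population (biased) variances are assumed.

```python
import numpy as np


def concordance_cc(pred, target):
    """Standard CCC between two 1-D sequences (illustrative helper, not from the paper).

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean(pred) - mean(target))**2)
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    mean_p, mean_t = pred.mean(), target.mean()
    # Population variances (ddof=0), an assumption made for this sketch.
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()

    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal: the ratio is undefined.
        return float("nan")
    return float(2.0 * cov / denom)


# Hypothetical valence predictions and labels, for illustration only.
perfect = concordance_cc([0.1, 0.5, -0.3], [0.1, 0.5, -0.3])
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels: in the `shifted` example the two sequences are perfectly correlated, yet the CCC is 4/7 because their means differ by 1.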

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Publisher: IEEE Computer Society
Pages: 2485-2494
Number of pages: 10
ISBN (Electronic): 9781665487399
DOIs
Publication status: Published - 2022
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022 - New Orleans, United States
Duration: 19 Jun 2022 to 20 Jun 2022

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Volume: 2022-June
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Country/Territory: United States
City: New Orleans
Period: 19/06/22 to 20/06/22
