TY - GEN
T1 - A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
AU - Praveen, R. Gnana
AU - De Melo, Wheidima Carneiro
AU - Ullah, Nasib
AU - Aslam, Haseeb
AU - Zeeshan, Osama
AU - Denorme, Theo
AU - Pedersoli, Marco
AU - Koerich, Alessandro L.
AU - Bacon, Simon
AU - Cardinal, Patrick
AU - Granger, Eric
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing for an accurate prediction of valence and arousal. In particular, the model computes cross-attention weights based on the correlation between the joint feature representation and the individual modalities. By deploying a joint A-V feature representation in the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of the proposed A-V fusion model. It achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-Wild (ABAW3) 2022 challenge, which has a CCC of 0.180 (0.310) and 0.170 (0.170).
UR - https://www.scopus.com/pages/publications/85137817382
U2 - 10.1109/CVPRW56347.2022.00278
DO - 10.1109/CVPRW56347.2022.00278
M3 - Contribution to conference proceedings
AN - SCOPUS:85137817382
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 2485
EP - 2494
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Y2 - 19 June 2022 through 20 June 2022
ER -
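
Note: The abstract above describes cross-attention weights computed from the correlation between a joint audio-visual representation and each individual modality, with results scored by the concordance correlation coefficient (CCC). Below is a minimal PyTorch sketch of one plausible reading of that fusion step, plus the standard CCC formula. The layer names, dimensions, and exact attention arithmetic are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch only: correlation-based joint cross-attention fusion
# and the standard CCC metric. Layer sizes and the attention layout are
# assumptions inferred from the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    def __init__(self, d_audio: int, d_visual: int):
        super().__init__()
        d_joint = d_audio + d_visual
        # Learnable maps relating the joint A-V representation to each modality.
        self.w_ja = nn.Linear(d_joint, d_audio, bias=False)
        self.w_jv = nn.Linear(d_joint, d_visual, bias=False)
        # Re-projection of the attended features back into each modality's space.
        self.w_a = nn.Linear(d_audio, d_audio, bias=False)
        self.w_v = nn.Linear(d_visual, d_visual, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, seq_len, d_audio); x_v: (batch, seq_len, d_visual)
        joint = torch.cat([x_a, x_v], dim=-1)          # (B, L, d_audio + d_visual)
        scale = joint.shape[-1] ** 0.5
        # Correlation of the joint representation with each individual modality.
        corr_a = torch.tanh(torch.bmm(self.w_ja(joint), x_a.transpose(1, 2)) / scale)  # (B, L, L)
        corr_v = torch.tanh(torch.bmm(self.w_jv(joint), x_v.transpose(1, 2)) / scale)  # (B, L, L)
        # Attention over time derived from those correlations, with a residual connection.
        att_a = torch.softmax(corr_a, dim=-1)
        att_v = torch.softmax(corr_v, dim=-1)
        x_a_att = x_a + self.w_a(torch.bmm(att_a, x_a))
        x_v_att = x_v + self.w_v(torch.bmm(att_v, x_v))
        # Fused representation that a valence/arousal regressor (not shown) would consume.
        return torch.cat([x_a_att, x_v_att], dim=-1)


def concordance_correlation_coefficient(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2 * covariance / (pred_var + target_var + (pred_mean - target_mean) ** 2)


# Example usage with hypothetical feature sizes (128-d audio, 512-d visual, 16 frames):
#   fusion = JointCrossAttentionFusion(128, 512)
#   fused = fusion(torch.randn(2, 16, 128), torch.randn(2, 16, 512))  # (2, 16, 640)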