
arxiv

Paper 2503.10530

Lightweight Models for Emotional Analysis in Video

Authors: Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh

Published: 2025-03-13

Abstract:

In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.

Paper Content:
arXiv:2503.10530v1 [cs.CV] 13 Mar 2025

Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh*
Dept. of ITS, FPT University, Ho Chi Minh City, 71216, Vietnam
{tiennq27,hainh51,thonghv4}@fe.edu.vn
*Corresponding author

1. Introduction

Human emotion recognition is a subfield of human behavior analysis that has attracted significant interest from researchers in artificial intelligence, psychology, and human-computer interaction. By utilizing various types of data, such as audio, images, and text, researchers can analyze and predict human emotions as well as continuous actions. Advancements in deep learning have significantly improved the accuracy and efficiency of emotion recognition systems. These technologies can detect emotions in real time, enabling applications across a wide range of fields. Understanding human emotions and behavior enables applications in areas such as healthcare, autonomous vehicles, robotics, and more [4–6, 16]. For example, emotion recognition is increasingly being used in mental health monitoring, where it helps identify signs of depression or stress based on behavioral patterns. In marketing, consumer emotion recognition enables brands to tailor their products and advertising strategies to evoke desired emotional responses. Moreover, in human-computer interaction, emotion-aware systems are enhancing user experiences by adapting to users' emotional state.

Emotion recognition has many benefits, but it still faces various challenges in real-world applications. Human emotions are complex and influenced by factors such as gender, age, and context. Therefore, the 8th Affective & Behavior Analysis in-the-Wild (ABAW8) workshop [10] specifically addresses the complex challenges inherent in analyzing human affective states and behavioral patterns in unconstrained, real-world scenarios. Unlike controlled laboratory environments, in-the-wild settings present numerous variables, including diverse lighting conditions, varying head poses, occlusions, and spontaneous expressions, that significantly complicate accurate analysis.

The challenge includes six tasks: Valence-Arousal (VA) Estimation, Expression (EXPR) Classification, Action Unit (AU) Detection, Compound Expression (CE) Recognition, Emotional Mimicry Intensity (EMI) Estimation, and Ambivalence/Hesitancy (AH) Recognition. These challenges use the Aff-Wild2 [8, 11], C-EXPR-DB [7], HUME-Vidmimic2, and BAH datasets, providing a comprehensive benchmark for evaluating affective behavior analysis models.
In this paper, we address four of these tasks: Valence-Arousal (VA) Estimation, Action Unit (AU) Detection, Emotional Mimicry Intensity (EMI) Estimation, and Ambivalence/Hesitancy (AH) Recognition. The VA task predicts valence and arousal in video sequences, while the AU task detects the presence of 12 action units (AUs) in each video frame. The EMI task estimates the intensity of six emotional dimensions (Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy). Lastly, the AH task identifies the presence or absence of ambivalence or hesitancy in each frame.

2. Method

2.1. Visual feature extraction

Visual features are extracted by a small variant of MobileNetV4 [14], pretrained on the AffectNet dataset [13]. MobileNetV4 [14] introduces a universally efficient architecture optimized for mobile devices, integrating the Universal Inverted Bottleneck (UIB) and Mobile Multi-Query Attention (Mobile MQA) blocks to enhance feature extraction efficiency. The UIB block unifies key architectural components, including Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and an Extra Depthwise (ExtraDW) variant, providing flexibility in spatial and channel mixing while improving computational efficiency. Mobile MQA accelerates attention by over 39% on mobile accelerators, further optimizing feature representation. Additionally, an advanced neural architecture search (NAS) strategy refines MobileNetV4's design, ensuring mostly Pareto-optimal performance across CPUs, DSPs, GPUs, and dedicated accelerators like Google's EdgeTPU.

In this study, we feed a sequence of 224×224×3 images into MobileNetV4 to extract multi-scale feature maps from the input frames. The backbone is configured to output hierarchical feature representations at different spatial resolutions, allowing for rich semantic extraction.
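To make the multi-scale output concrete, the sketch below computes the spatial size of the feature maps produced from a 224×224 input. The strides 8, 16, and 32 are an assumption, since the text does not state them; they are chosen because they yield the 28×28, 14×14, and 7×7 maps that the temporal aggregation module in Section 2.2 operates on. The function names are illustrative only.

```python
# Assumed downsampling strides of the three backbone stages.
# These are NOT given in the paper; they are picked to reproduce
# the 28x28, 14x14 and 7x7 maps mentioned in Section 2.2.
ASSUMED_STRIDES = (8, 16, 32)


def feature_map_sizes(input_size=224, strides=ASSUMED_STRIDES):
    """Return the side length of each square feature map for a square input."""
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(
                f"input size {input_size} is not divisible by stride {stride}"
            )
        sizes.append(input_size // stride)
    return sizes


def positions_per_frame(sizes):
    """Number of spatial positions (H * W) per frame at each scale."""
    return [side * side for side in sizes]


sizes = feature_map_sizes()
print(sizes)                       # [28, 14, 7]
print(positions_per_frame(sizes))  # [784, 196, 49]
```

Under these assumed strides, each step down the hierarchy reduces the number of spatial positions per frame by a factor of four, which is one reason the coarser scales are cheaper to mix over time.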
To maintain efficiency while preserving important spatial details, all but the final layers of the backbone are frozen during training. The extracted feature maps serve as input to the subsequent temporal modeling module.

2.2. Temporal aggregation module

To model temporal dependencies in the extracted features, we build a temporal aggregation module (TAM) based on a multi-scale 3D MLP-Mixer [17]. This module processes the sequential feature maps at multiple levels of spatial granularity. TAM consists of three mixer layers that operate on different feature resolutions: (1) a high-resolution mixer for 28×28 feature maps, (2) a mid-resolution mixer with an input size of 14×14, and (3) a low-resolution mixer for highly detailed feature maps of size 7×7. Each mixer employs 3D MLP-based transformations to capture temporal relationships while preserving spatial structure. Finally, a fully connected layer maps the aggregated feature representations to the target output space, ensuring effective sequence-level prediction.

3. Experiments and Results

3.1. Dataset

This subsection briefly summarizes the datasets used in each challenge: Action Unit (AU) Detection, Valence-Arousal (VA) Estimation, and Emotional Mimicry Intensity (EMI) Estimation.

Table 1. Distribution of AU Annotations in Aff-Wild2

AU      Action                  Total Number of Activated AUs
AU 1    inner brow raiser       301,102
AU 2    outer brow raiser       139,936
AU 4    brow lowerer            386,689
AU 6    cheek raiser            619,775
AU 7    lid tightener           964,312
AU 10   upper lip raiser        854,519
AU 12   lip corner puller       602,835
AU 15   lip corner depressor    63,230
AU 23   lip tightener           78,649
AU 24   lip pressor             61,500
AU 25   lips part               1,596,055
AU 26   jaw drop                206,535

Action Unit (AU) Detection. This challenge uses a dataset that includes 542 videos with annotations for 12 Action Units (AUs), which represent facial muscle movements, including brow raisers, cheek raisers, lip tighteners, and jaw drops.
The dataset consists of 2,627,632 frames captured from 438 unique subjects. Annotations were produced through a semi-automatic procedure that combines manual and computational techniques. The dataset is divided into three subsets: a training set (295 videos), a validation set (105 videos), and a testing set (142 videos). Table 1 provides the distribution of the AU annotations in the dataset.

Valence-Arousal (VA) Estimation. This challenge uses an expanded version of the Aff-Wild2 database, including 594 videos annotated for valence and arousal. The dataset consists of 2,993,081 frames captured from 584 subjects. Notably, sixteen videos contain two subjects, both of whom were annotated independently. Four expert annotators labeled the dataset following the methodology detailed in [3], providing continuous valence and arousal values within the range [-1, 1].

To maintain subject independence across experimental protocols, the dataset is partitioned into three disjoint subsets: a training set (356 videos), a validation set (76 videos), and a testing set (162 videos). This division ensures that each subject appears in exactly one subset.

Emotional Mimicry Intensity (EMI) Estimation. The EMI Challenge considers emotional mimicry through the HUME-Vidmimic2 dataset, which includes more than 30 hours of audiovisual recordings from 557 participants. This dataset was collected in naturalistic environments where participants used their webcams to mimic facial and vocal expressions presented in seed videos, subsequently self-evaluating their mimicry performance on a scale of 0-100. The dataset is partitioned according to the distribution detailed in Table 2, which presents statistics for each subset.

Table 2. HUME-Vidmimic2 partition statistics.

Partition     Duration (HH:MM:SS)   #Samples
Train         15:07:03              8072
Validation    9:12:02               4588
Test          9:04:05               4586
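As a quick consistency check on the partition figures reported above, the following sketch sums the video counts of the AU and VA splits and the durations and sample counts of the HUME-Vidmimic2 partitions. All numbers are copied from the text and Table 2; the helper names are illustrative and not part of any challenge toolkit.

```python
def to_seconds(hhmmss):
    """Convert an 'H:MM:SS' or 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hhmmss(total_seconds):
    """Format a number of seconds as 'H:MM:SS'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Video counts per split (train, validation, test), as reported in the text.
AU_SPLITS = (295, 105, 142)
VA_SPLITS = (356, 76, 162)

# HUME-Vidmimic2 partitions from Table 2: (duration, number of samples).
EMI_PARTITIONS = {
    "Train": ("15:07:03", 8072),
    "Validation": ("9:12:02", 4588),
    "Test": ("9:04:05", 4586),
}

total_seconds = sum(to_seconds(d) for d, _ in EMI_PARTITIONS.values())
total_samples = sum(n for _, n in EMI_PARTITIONS.values())

print(sum(AU_SPLITS))            # 542
print(sum(VA_SPLITS))            # 594
print(to_hhmmss(total_seconds))  # 33:23:10
print(total_samples)             # 17246
```

The split sums match the stated totals of 542 and 594 videos, and the combined duration of roughly 33 hours is consistent with the "more than 30 hours" description of HUME-Vidmimic2.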
To facilitate analysis, participants are provided with facial detections extracted from the videos using MTCNN [24], sampled at 6 frames per second. Furthermore, to enable the development of end-to-end approaches [18–22], participants receive pre-extracted features derived from the raw audiovisual data: facial features extracted with a Vision Transformer (ViT) [2] and audio features extracted with Wav2Vec 2.0 [1].

Ambivalence/Hesitancy Recognition

Action Unit (AU) Detection. The F1 score combines precision and recall, which ensures a robust assessment of classification performance. It is defined as

\[ F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{1} \]

The F1 score ranges from 0 to 1, where 1 represents perfect performance and 0 the worst. The average F1 score over the 12 AUs is used for evaluation:

\[ F_{AU} = \frac{\sum_{au} F_1^{au}}{12} \tag{2} \]

Valence-Arousal (VA) Estimation. In this task, the Concordance Correlation Coefficient (CCC) [12], denoted $P$, is used to evaluate the performance of the model. The CCC measures the agreement between two continuous variables. It ranges over [-1, 1], where -1 indicates perfect negative agreement, 0 no agreement, and 1 perfect agreement. The overall score is the mean of the valence and arousal CCCs:

\[ P_{VA} = \frac{P_V + P_A}{2} \tag{3} \]

where $P_V$ and $P_A$ are the CCC of valence and arousal, respectively. The CCC is defined as

\[ P = \frac{2 \rho \sigma_{\hat{Y}} \sigma_Y}{\sigma_{\hat{Y}}^2 + \sigma_Y^2 + (\mu_{\hat{Y}} - \mu_Y)^2} \tag{4} \]

where $\mu_Y$ is the mean of the label $Y$, $\mu_{\hat{Y}}$ is the mean of the prediction $\hat{Y}$, $\sigma_{\hat{Y}}$ and $\sigma_Y$ are the corresponding standard deviations, and $\rho$ is the Pearson correlation coefficient between $\hat{Y}$ and $Y$.

Emotional Mimicry Intensity (EMI) Estimation. The average Pearson's correlation coefficient ($\rho$) over the six emotion dimensions is used as the metric:

\[ P_{EMI} = \frac{\sum_{i=1}^{6} \rho_i}{6} \tag{5} \]

Table 3 shows the results of our approach compared to previous studies on the Aff-Wild2 dataset [9].

Table 3. Action unit validation set.

Methods                    F1 score
Regnet-ViT [23]            0.5280
EmotiEffNet [15]           0.537
Ours with 3 level mixer    0.5369
Ours with 2 level mixer    0.5338
Ours - single mixer        0.5441

References

[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[3] Roddy Cowie, Ellen Douglas-Cowie, Susie Savvidou, Edelle McMahon, Martin Sawey, and Marc Schröder. Feeltrace: An instrument for recording perceived emotion in real time. In Proceedings of the ISCA Workshop on Speech and Emotion, 2000.
[4] Albert Haque, Michelle Guo, Adam S Miner, and Li Fei-Fei. Measuring depression symptom severity from spoken language and 3d facial expressions. arXiv preprint arXiv:1811.08592, 2018.
[5] Varada Kolhatkar, Hanhan Wu, Luca Cavasso, Emilie Francis, Kavan Shukla, and Maite Taboada. The sfu opinion and comments corpus: A corpus for the analysis of online news comments. Corpus pragmatics, 4:155–190, 2020.
[6] Dimitrios Kollias. Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2328–2336, 2022.
[7] Dimitrios Kollias. Multi-label compound expression recognition: C-expr database & network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5589–5598, 2023.
[8] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855, 2019.
[9] Dimitrios Kollias and Stefanos Zafeiriou. Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792, 2021.
[10] Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Irene Kotsia, UK Cogitat, Eric Granger, Marco Pedersoli, Simon Bacon, Alice Baird, Chunchang Shao, et al. Advancements in affective and behavior analysis: The 8th abaw workshop and competition.
[11] Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Alice Baird, Chris Gagne, Chunchang Shao, and Guanyu Hu. The 6th affective behavior analysis in-the-wild (abaw) competition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4587–4598, 2024.
[12] I Lawrence and Kuei Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, pages 255–268, 1989.
[13] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017.
[14] Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, et al. Mobilenetv4: universal models for the mobile ecosystem. In European Conference on Computer Vision, pages 78–96. Springer, 2024.
[15] Andrey V Savchenko. Hsemotion team at the 6th abaw competition: Facial expressions, valence-arousal and emotion intensity prediction. arXiv preprint arXiv:2403.11590, 2024.
[16] Gulbadan Sikander and Shahzad Anwar. Driver fatigue detection systems: A review. IEEE Transactions on Intelligent Transportation Systems, 20(6):2339–2352, 2018.
[17] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.
[18] Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou. End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of selected topics in signal processing, 11(8):1301–1309, 2017.
[19] Panagiotis Tzirakis, Stefanos Zafeiriou, and Bjorn W Schuller. End2you–the imperial toolkit for multimodal profiling by end-to-end learning. arXiv preprint arXiv:1802.01115, 2018.
[20] Panagiotis Tzirakis, Jiehao Zhang, and Bjorn W Schuller. End-to-end speech emotion recognition using deep neural networks. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5089–5093. IEEE, 2018.
[21] Panagiotis Tzirakis, Jiaxin Chen, Stefanos Zafeiriou, and Björn Schuller. End-to-end multimodal affect recognition in real-world environments. Information Fusion, 68:46–53, 2021.
[22] Panagiotis Tzirakis, Anh Nguyen, Stefanos Zafeiriou, and Björn W Schuller. Speech emotion recognition using semantic information. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6279–6283. IEEE, 2021.
[23] Ngoc Tu Vu, Van Thong Huynh, Trong Nghia Nguyen, and Soo-Hyung Kim. Ensemble spatial and temporal vision transformer for action units detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5770–5776, 2023.
[24] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE signal processing letters, 23(10):1499–1503, 2016.
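
As an illustration of the evaluation metrics defined in Section 3 (the F1 score, the Concordance Correlation Coefficient, and Pearson's correlation coefficient), the following is a minimal pure-Python sketch. It is not the official ABAW evaluation code: the function names and the toy data are hypothetical, population (biased) variances are assumed, and inputs are assumed to be non-constant so that the standard deviations are non-zero.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = mean([(t - mt) * (p - mp) for t, p in zip(y_true, y_pred)])
    std_t = sqrt(mean([(t - mt) ** 2 for t in y_true]))
    std_p = sqrt(mean([(p - mp) ** 2 for p in y_pred]))
    return cov / (std_t * std_p)


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient with population variances."""
    mt, mp = mean(y_true), mean(y_pred)
    var_t = mean([(t - mt) ** 2 for t in y_true])
    var_p = mean([(p - mp) ** 2 for p in y_pred])
    rho = pearson(y_true, y_pred)
    return 2 * rho * sqrt(var_t) * sqrt(var_p) / (var_t + var_p + (mp - mt) ** 2)


def average_f1(per_au_true, per_au_pred):
    """Mean F1 over action units (12 in the AU task)."""
    return mean([f1_score(t, p) for t, p in zip(per_au_true, per_au_pred)])


def va_score(valence_true, valence_pred, arousal_true, arousal_pred):
    """Mean of the valence and arousal CCCs."""
    return (ccc(valence_true, valence_pred) + ccc(arousal_true, arousal_pred)) / 2


def emi_score(per_dim_true, per_dim_pred):
    """Mean Pearson correlation over emotion dimensions (6 in the EMI task)."""
    return mean([pearson(t, p) for t, p in zip(per_dim_true, per_dim_pred)])


# Toy example: one true positive, one false positive, one false negative.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

Note that, unlike Pearson's correlation, the CCC penalizes a constant offset between predictions and labels through the squared difference of the means in its denominator, so a perfectly correlated but shifted prediction scores below 1.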

---