IEEE Transactions on Affective Computing
Journal Prestige (SJR): 1.13 | Citation Impact (CiteScore): 6 | Number of Followers: 23 | ISSN (Print): 1949-3045 | Published by IEEE
- E-Key: An EEG-Based Biometric Authentication and Driving Fatigue Detection System
Authors: Tao Xu;Hongtao Wang;Guanyong Lu;Feng Wan;Mengqi Deng;Peng Qi;Anastasios Bezerianos;Cuntai Guan;Yu Sun;
Pages: 864 - 877
Abstract: Due to the increasing number of fatal traffic accidents, there is a strong desire for more effective and convenient techniques for driving fatigue detection. Here, we propose a unified framework, E-Key, to simultaneously perform personal identification (PI) and driving fatigue detection using a convolutional neural network and attention (CNN-Attention) structure. The performance was assessed using EEG data collected through a wearable dry-sensor system from 31 healthy subjects undergoing a 90-min simulated driving task. In comparison with three widely-used competitive models (including CNN, CNN-LSTM, and Attention), the proposed scheme achieved the best (p $
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
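The entry above describes a CNN-Attention structure that shares one EEG feature extractor between personal identification and fatigue detection. As a rough illustration only (not the authors' E-Key code), the sketch below shows a small 1D CNN with attention pooling and two output heads; the channel count, window length, and layer sizes are assumed for the example.

```python
# Illustrative sketch of a CNN + attention-pooling net with two heads
# (identification and fatigue state); sizes are assumptions, not E-Key's.
import torch
import torch.nn as nn

class CNNAttention(nn.Module):
    def __init__(self, n_channels=24, n_subjects=31, n_fatigue_states=2):
        super().__init__()
        self.conv = nn.Sequential(                      # temporal feature extractor
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64), nn.ReLU(),
        )
        self.attn = nn.Linear(64, 1)                    # attention score per time step
        self.id_head = nn.Linear(64, n_subjects)        # personal identification
        self.fatigue_head = nn.Linear(64, n_fatigue_states)  # alert vs. fatigued

    def forward(self, x):                               # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)                # (batch, time, 64)
        w = torch.softmax(self.attn(h), dim=1)          # attention weights over time
        pooled = (w * h).sum(dim=1)                     # weighted temporal pooling
        return self.id_head(pooled), self.fatigue_head(pooled)

eeg = torch.randn(8, 24, 512)                           # fake batch of EEG windows
id_logits, fatigue_logits = CNNAttention()(eeg)
print(id_logits.shape, fatigue_logits.shape)            # (8, 31) (8, 2)
```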
- Facial Expression Animation by Landmark Guided Residual Module
Authors: Xueping Wang;Yunhong Wang;Weixin Li;Zhengyin Du;Di Huang;
Pages: 878 - 894
Abstract: We study the problem of facial expression animation from a still image according to a driving video. This is a challenging task as expression motions are non-rigid and very subtle and difficult to capture. Existing methods mostly fail to model these subtle expression motions, leading to the lack of details in their animation results. In this paper, we propose a novel facial expression animation method based on generative adversarial learning. To capture the subtle expression motions, a Landmark guided Residual Module (LRM) is proposed to model detailed facial expression features. Specifically, residual learning is conducted at both coarse and fine levels conditioned on facial landmark heatmaps and landmark points respectively. Furthermore, we employ a consistency discriminator to ensure the temporal consistency of the generated video sequence. In addition, a novel metric named Emotion Consistency Metric is proposed to evaluate the consistency of facial expressions in the generated sequences with those in the driving videos. Experiments on MUG-Face, Oulu-CASIA and CAER datasets show that the proposed method can generate arbitrary expression motions on the source still image effectively, which are more photo-realistic and consistent with the driving video compared with results of state-of-the-art methods.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Long Short-Term Memory Network Based Unobtrusive Workload Monitoring With Consumer Grade Smartwatches
Authors: Deniz Ekiz;Yekta Said Can;Cem Ersoy;
Pages: 895 - 905
Abstract: Continuous high perceived workload has a negative impact on the individual's well-being. Prior works focused on detecting the workload with medical-grade wearable systems in restricted settings, and the effect of applying deep learning techniques for perceived workload detection in in-the-wild settings has not been investigated. We present an unobtrusive, comfortable, pervasive, and affordable Long Short-Term Memory Network based continuous workload monitoring system, built on a smartwatch application that monitors the perceived workload of individuals in the wild. We have recorded physiological data from daily life with perceived workload questionnaires from subjects in their real-life environments over a month. The model was trained and evaluated with the daily-life physiological data coming from different days, which makes it robust to daily changes in the heart rate variability that we use with accelerometer features to assess low and high workload. Our system has the capability of detecting perceived workload by using traditional and deep classifiers. We discuss the problems related to 'in the wild' applications with consumer-grade smartwatches. We show that Long Short-Term Memory Network with feature extraction outperforms traditional classifiers and Convolutional Neural Networks on discrimination of low and high perceived workload with smartwatches in the wild.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models
Authors: Esma Mansouri Benssassi;Juan Ye;
Pages: 906 - 918
Abstract: Emotion understanding represents a core aspect of human communication. Our social behaviours are closely linked to expressing our emotions and understanding others’ emotional and mental states through social signals. The majority of the existing work proceeds by extracting meaningful features from each modality and applying fusion techniques either at a feature level or decision level. However, these techniques are incapable of translating the constant talk and feedback between different modalities. Such constant talk is particularly important in continuous emotion recognition, where one modality can predict, enhance and complement the other. This article proposes three multisensory integration models, based on different pathways of multisensory integration in the brain; that is, integration by convergence, early cross-modal enhancement, and integration through neural synchrony. The proposed models are designed and implemented using third-generation neural networks, Spiking Neural Networks (SNN). The models are evaluated using widely adopted, third-party datasets and compared to state-of-the-art multimodal fusion techniques, such as early, late and deep learning fusion. Evaluation results show that the three proposed models have achieved comparable results to the state-of-the-art supervised learning techniques. More importantly, this article demonstrates plausible ways to translate constant talk between modalities during the training phase, which also brings advantages in generalisation and robustness to noise.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Exploring Complexity of Facial Dynamics in Autism Spectrum Disorder
Authors: Pradeep Raj Krishnappa Babu;J. Matias Di Martino;Zhuoqing Chang;Sam Perochon;Kimberly L. H. Carpenter;Scott Compton;Steven Espinosa;Geraldine Dawson;Guillermo Sapiro;
Pages: 919 - 930
Abstract: Atypical facial expression is one of the early symptoms of autism spectrum disorder (ASD) characterized by reduced regularity and lack of coordination of facial movements. Automatic quantification of these behaviors can offer novel biomarkers for screening, diagnosis, and treatment monitoring of ASD. In this work, 40 toddlers with ASD and 396 typically developing toddlers were shown developmentally-appropriate and engaging movies presented on a smart tablet during a well-child pediatric visit. The movies consisted of social and non-social dynamic scenes designed to evoke certain behavioral and affective responses. The front-facing camera of the tablet was used to capture the toddlers’ face. Facial landmarks’ dynamics were then automatically computed using computer vision algorithms. Subsequently, the complexity of the landmarks’ dynamics was estimated for the eyebrows and mouth regions using multiscale entropy. Compared to typically developing toddlers, toddlers with ASD showed higher complexity (i.e., less predictability) in these landmarks’ dynamics. This complexity in facial dynamics contained novel information not captured by traditional facial affect analyses. These results suggest that computer vision analysis of facial landmark movements is a promising approach for detecting and quantifying early behavioral symptoms associated with ASD.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
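The analysis above estimates multiscale entropy of facial landmark dynamics. The sketch below is a generic sample-entropy and multiscale-entropy implementation applied to a synthetic landmark trajectory; it is not the authors' pipeline, and the embedding dimension, tolerance, and scales are conventional default assumptions.

```python
# Generic multiscale entropy (MSE) on a synthetic 1-D series; parameters are
# standard defaults (m=2, r=0.2*std), not those used in the study above.
import numpy as np

def sample_entropy(x, m=2, r=None):
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)
    def count_matches(dim):
        emb = np.array([x[i:i + dim] for i in range(len(x) - dim)])   # template vectors
        d = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2) # Chebyshev distances
        return np.sum(d <= r) - len(emb)                              # drop self-matches
    b, a = count_matches(m), count_matches(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

def multiscale_entropy(x, max_scale=5, m=2):
    r = 0.2 * np.std(x)                     # tolerance fixed from the original series
    out = []
    for s in range(1, max_scale + 1):
        n = (len(x) // s) * s
        coarse = np.asarray(x[:n]).reshape(-1, s).mean(axis=1)        # coarse-graining
        out.append(sample_entropy(coarse, m=m, r=r))
    return out

signal = np.cumsum(np.random.randn(600))    # synthetic landmark trajectory
print(multiscale_entropy(signal, max_scale=5))
```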
- A Media-Guided Attentive Graphical Network for Personality Recognition Using Physiology
Authors: Hao-Chun Yang;Chi-Chun Lee;
Pages: 931 - 943
Abstract: Physiological automatic personality recognition has been largely developed to model an individual's personality trait from a variety of signals. However, few studies have tackled the problem of integrating multiple observations into a single personality prediction. In this study, we focus on finding a novel learning architecture to model the personality trait under a Many-to-One scenario. We propose to integrate not only information about the user but also the effect of the affective multimedia stimulus. Specifically, we present a novel Acoustic-Visual Guided Attentive Graph Convolutional Network for enhanced personality recognition. The emotional multimedia content guides the formation of the physiological responses into a graph-like structure to integrate latent inter-correlation among all responses toward affective multimedia. These graphs are then further processed by a Graph Convolutional Network (GCN) to jointly model instances and inter-correlation levels of the subject's responses. We show that our model outperforms the current state of the art on two large public corpora for personality recognition. Further analysis reveals that there indeed exists a multimedia preference for inferring personality from physiology, and several frequency-domain descriptors in ECG and the tonic component in EDA are shown to be robust for automatic personality recognition.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- A Deep Multimodal Learning Approach to Perceive Basic Needs of Humans From Instagram Profile
Authors: Mohammad Mahdi Dehshibi;Bita Baiani;Gerard Pons;David Masip;
Pages: 944 - 956
Abstract: Nowadays, a significant part of our time is spent sharing multimodal data on social media sites such as Instagram, Facebook and Twitter. The particular way through which users present themselves to social media can provide useful insights into their behaviours, personalities, perspectives, motives and needs. This article proposes to use multimodal data collected from Instagram accounts to predict the five basic prototypical needs described in Glasser's choice theory (i.e., Survival, Power, Freedom, Belonging, and Fun). We automate the identification of the unconsciously perceived needs from Instagram profiles by using both visual and textual contents. The proposed approach aggregates the visual and textual features extracted using deep learning and constructs a homogeneous representation for each profile through the proposed Bag-of-Content. Finally, we perform multi-label classification on the fusion of both modalities. We validate our proposal on a large database, consensually annotated by two expert psychologists, with more than 30,000 images, captions and comments. Experiments show promising accuracy and complementary information between visual and textual cues.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- EEG-Based Emotion Recognition via Neural Architecture Search
Authors: Chang Li;Zhongzhen Zhang;Rencheng Song;Juan Cheng;Yu Liu;Xun Chen;
Pages: 957 - 968
Abstract: With the flourishing development of deep learning (DL) and convolutional neural networks (CNNs), electroencephalogram-based (EEG) emotion recognition is playing an increasingly crucial part in the field of brain-computer interface (BCI). However, currently employed architectures have mostly been designed manually by human experts, which is a time-consuming and labor-intensive process. In this paper, we propose a novel neural architecture search (NAS) framework based on reinforcement learning (RL) for EEG-based emotion recognition, which can automatically design network architectures. The proposed NAS mainly contains three parts: search strategy, search space, and evaluation strategy. During the search process, a recurrent neural network (RNN) controller is used to select the optimal network structure in the search space. We trained the controller with RL to maximize the expected reward of the generated models on a validation set and force parameter sharing among the models. We evaluated the performance of NAS on the DEAP and DREAMER datasets. On the DEAP dataset, the average accuracies reached 97.94%, 97.74%, and 97.82% on arousal, valence, and dominance respectively. On the DREAMER dataset, average accuracies reached 96.62%, 96.29% and 96.61% on arousal, valence, and dominance, respectively. The experimental results demonstrated that the proposed NAS outperforms the state-of-the-art CNN-based methods.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Multimodal Hierarchical Attention Neural Network: Looking for Candidates Behaviour Which Impact Recruiter's Decision
Authors: Léo Hemamou;Arthur Guillon;Jean-Claude Martin;Chloé Clavel;
Pages: 969 - 985
Abstract: Automatic analysis of job interviews has gained interest in academic and industrial research. The particular case of asynchronous video interviews makes it possible to collect vast corpora of videos in which candidates answer standardized questions in monologue videos, enabling the use of deep learning algorithms. On the other hand, state-of-the-art approaches still face some obstacles, among which the fusion of information from multiple modalities and the interpretability of the predictions. We study the task of predicting candidates' performance in asynchronous video interviews using three modalities (verbal content, prosody and facial expressions) independently or simultaneously, using data from real interviews that take place in real conditions. We propose a sequential and multimodal deep neural network model, called Multimodal HireNet. We compare this model to state-of-the-art approaches and show a clear improvement of the performance. Moreover, the architecture we propose is based on an attention mechanism, which provides interpretability about which questions, moments and modalities contribute the most to the output of the network. While other deep learning systems use attention mechanisms to offer a visualization of moments with attention values, the proposed methodology enables an in-depth interpretation of the predictions by an overall analysis of the features of social signals contained in these moments.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Distilling Region-Wise and Channel-Wise Deep Structural Facial Relationships for FAU (DSR-FAU) Intensity Estimation
Authors: Yingruo Fan;Jacqueline C.K. Lam;Victor O.K. Li;
Pages: 986 - 997
Abstract: Facial emotions are expressed through a combination of facial muscle movements, namely, the Facial Action Units (FAUs). FAU intensity estimation aims to estimate the intensity of a set of structurally dependent FAUs. Contrary to the existing works that focus on improving FAU intensity estimation performance, this study investigates how knowledge distillation (KD) incorporated into a training model can improve FAU intensity estimation efficiency while achieving a comparable level of performance. Given the intrinsic structural characteristics of FAUs, it is desirable to distill deep structural relationships, namely, DSR-FAU, using heatmap regression. Our methodology is as follows: First, a feature map-level distillation loss is applied to ensure that the student network and the teacher network share similar feature distributions. Second, the region-wise and channel-wise relationship distillation loss functions are introduced to penalize the difference in structural relationships. Specifically, the region-wise relationship can be represented by the structural correlations across the facial features, whereas the channel-wise relationship is represented by the implicit FAU co-occurrence dependencies. Third, we compare the model performance of DSR-FAU with the state-of-the-art models, based on two benchmark datasets. It is shown that our model achieves comparable performance, with fewer model parameters and lower computational complexity.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
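The entry above combines a feature-map distillation loss with region-wise and channel-wise relationship losses. The following is a hedged sketch of that general idea using Gram-style correlation matrices; the exact DSR-FAU loss definitions may differ, and the shapes and weights are illustrative.

```python
# Sketch of feature-map + relational distillation terms; weights/shapes are
# illustrative assumptions, not the DSR-FAU losses themselves.
import torch
import torch.nn.functional as F

def channel_relation(feat):                     # feat: (B, C, H, W)
    b, c, h, w = feat.shape
    f = F.normalize(feat.reshape(b, c, h * w), dim=2)
    return f @ f.transpose(1, 2)                # (B, C, C) channel co-activation

def region_relation(feat):
    b, c, h, w = feat.shape
    f = F.normalize(feat.reshape(b, c, h * w).transpose(1, 2), dim=2)
    return f @ f.transpose(1, 2)                # (B, HW, HW) spatial structure

def distillation_loss(student_feat, teacher_feat, w_feat=1.0, w_ch=1.0, w_rg=1.0):
    l_feat = F.mse_loss(student_feat, teacher_feat)
    l_ch = F.mse_loss(channel_relation(student_feat), channel_relation(teacher_feat))
    l_rg = F.mse_loss(region_relation(student_feat), region_relation(teacher_feat))
    return w_feat * l_feat + w_ch * l_ch + w_rg * l_rg

s = torch.randn(4, 32, 16, 16, requires_grad=True)   # student feature map
t = torch.randn(4, 32, 16, 16)                        # teacher feature map (frozen)
print(distillation_loss(s, t).item())
```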
- Neurofeedback Training With an Electroencephalogram-Based Brain-Computer Interface Enhances Emotion Regulation
Authors: Weichen Huang;Wei Wu;Molly V. Lucas;Haiyun Huang;Zhenfu Wen;Yuanqing Li;
Pages: 998 - 1011
Abstract: Emotion regulation plays a vital role in human beings' daily lives by helping them deal with social problems and protecting mental and physical health. However, objective evaluation of the efficacy of emotion regulation and assessment of the improvement in emotion regulation ability at the individual level remain challenging. In this study, we leveraged neurofeedback training to design a real-time EEG-based brain-computer interface (BCI) system for users to effectively regulate their emotions. Twenty healthy subjects performed 10 BCI-based neurofeedback training sessions to regulate their emotion towards a specific emotional state (positive, negative, or neutral), while their EEG signals were analyzed in real time via machine learning to predict their emotional states. The prediction results were presented as feedback on the screen to inform the subjects of their immediate emotional state, based on which the subjects could update their strategies for emotion regulation. The experimental results indicated that the subjects improved their ability to regulate these emotions through our BCI neurofeedback training. Further EEG-based spectrum analysis revealed how each emotional state was related to specific EEG patterns, which were progressively enhanced through long-term training. These results together suggested that long-term EEG-based neurofeedback training could be a promising tool for helping people with emotional or mental disorders.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Multimodal Engagement Analysis From Facial Videos in the Classroom
Authors: Ömer Sümer;Patricia Goldberg;Sidney D’Mello;Peter Gerjets;Ulrich Trautwein;Enkelejda Kasneci;
Pages: 1012 - 1027
Abstract: Student engagement is a key component of learning and teaching, resulting in a plethora of automated methods to measure it. Whereas most of the literature explores student engagement analysis using computer-based learning often in the lab, we focus on using classroom instruction in authentic learning environments. We collected audiovisual recordings of secondary school classes over a one and a half month period, acquired continuous engagement labeling per student (N=15) in repeated sessions, and explored computer vision methods to classify engagement from facial videos. We learned deep embeddings for attentional and affective features by training Attention-Net for head pose estimation and Affect-Net for facial expression recognition using previously-collected large-scale datasets. We used these representations to train engagement classifiers on our data, in individual and multiple channel settings, considering temporal dependencies. The best performing engagement classifiers achieved student-independent AUCs of .620 and .720 for grades 8 and 12, respectively, with attention-based features outperforming affective features. Score-level fusion either improved the engagement classifiers or was on par with the best performing modality. We also investigated the effect of personalization and found that only 60 seconds of person-specific data, selected by margin uncertainty of the base classifier, yielded an average AUC improvement of .084.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
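The study above fuses attention-based and affect-based engagement scores at the score level and reports AUC. A minimal sketch of that evaluation pattern on synthetic scores follows; real use would substitute per-student probabilities from the trained channel-specific classifiers.

```python
# Score-level fusion of two synthetic score channels, evaluated with AUC.
# The data are fake stand-ins; only the fusion/evaluation pattern is shown.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                          # engaged (1) vs. not (0)
attention_scores = y * 0.4 + rng.normal(0.3, 0.25, 200)   # stronger channel
affect_scores = y * 0.2 + rng.normal(0.4, 0.30, 200)      # weaker channel

fused = 0.5 * attention_scores + 0.5 * affect_scores      # simple score-level fusion
for name, s in [("attention", attention_scores),
                ("affect", affect_scores),
                ("fused", fused)]:
    print(name, round(roc_auc_score(y, s), 3))
```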
- DBATES: Dataset for Discerning Benefits of Audio, Textual, and Facial Expression Features in Competitive Debate Speeches
Authors: Taylan K. Sen;Gazi Naven;Luke Gerstner;Daryl Bagley;Raiyan Abdul Baten;Wasifur Rahman;Md Kamrul Hasan;Kurtis Haut;Abdullah Al Mamun;Samiha Samrose;Anne Solbu;R. Eric Barnes;Mark G. Frank;Ehsan Hoque;
Pages: 1028 - 1043
Abstract: In this article, we present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC). Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities of raw video recordings of competitive collegiate debaters (N=716 6-minute recordings from 140 unique debaters). Each speech has an associated competition debate score (range: 67-96) from experienced judges as well as competitor demographic and per-round reflection surveys. We observe the fully multimodal model performs best in comparison to models trained on various compositions of individual modalities. We also find that the weights of some features (such as the expression of joy and the use of the word "we") change in direction between the aforementioned models. We use these results to highlight the value of a multimodal dataset for studying competitive, collegiate debate.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Receiving a Mediated Touch From Your Partner vs. a Male Stranger: How Visual Feedback of Touch and Its Sender Influence Touch Experience
Authors: Sima Ipakchian Askari;Ville J. Harjunen;Michiel M. Spapé;Antal Haans;Niklas Ravaja;Wijnand A. IJsselsteijn;
Pages: 1044 - 1055
Abstract: Social touch is essential to human development and communication. Mediated social touch is suggested as a solution for circumstances where distance prevents skin-to-skin contact. However, past research aimed at demonstrating the efficacy of mediated touch in reducing stress and promoting helping has produced mixed findings. These inconsistent findings could possibly be due to insufficient control of contextual factors combined with unnatural interaction scenarios. For example, touch occurs less frequently among strangers and is often accompanied with nonverbal visual cues. We investigated how the visual presentation of touch and the interpersonal relationship to the sender influence the perception, affective experiences, and autonomic responses that the touch evokes. Fifty couples of mixed gender were recruited. A mediated touch was repeatedly applied by either the male partner or a male confederate to female participants. The latter witnessed through a webcam as the sender caressed a rubber hand or touchpad to send the touch. Following our hypotheses, touch sent by one's partner was perceived as softer and more comforting than a stranger's touch. The partner's touch also resulted in weaker skin conductance responses, particularly when sent by touching a touchpad. In sum, how a mediated touch is experienced depends both on who is touching and on how the touch is visually represented.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Audio-Visual Automatic Group Affect Analysis
Authors: Garima Sharma;Abhinav Dhall;Jianfei Cai;
Pages: 1056 - 1069
Abstract: Affective computing has progressed well due to methods that can identify a person's posed and spontaneous perceived affect with high accuracy. This paper focuses on group-level affect analysis on videos, which is one of the first few multimodal group-level affect analysis studies. There are many challenges in video-based group-level affect analysis, as most of the work is focused on either a single person's affect recognition or image-based group affect analysis. To address this, first, we present an audio-visual perceived group affect dataset - ‘Video-level Group AFfect (VGAF)’. VGAF is a large-scale dataset consisting of 4,183 group videos. The videos are collected from YouTube with large variations in the keywords for collecting data across different genders, group settings, group sizes, illuminations and poses. The variety within the dataset will help the study of perception of group affect in a real environment. The data is manually annotated for three group affect classes - positive, neutral, and negative. Further, a fusion-based audio-visual method is proposed to set a benchmark performance on the proposed dataset. The experimental results show the effectiveness of facial, holistic and speech features for group-level affect analysis. The baseline code, dataset, and pre-trained models are available at [LINK].
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Modeling Multiple Temporal Scales of Full-Body Movements for Emotion Classification
Authors: Cigdem Beyan;Sukumar Karumuri;Gualtiero Volpe;Antonio Camurri;Radoslaw Niewiadomski;
Pages: 1070 - 1081
Abstract: This article investigates classification of emotions from full-body movements by using a novel Convolutional Neural Network-based architecture. The model is composed of two shallow networks that process, in parallel, 8-bit RGB images obtained from time intervals of 3D positional data. One network performs coarse-grained modelling in the time domain while the other applies fine-grained modelling. We show that combining different temporal scales into a single architecture improves the classification results of a dataset composed of short excerpts of the performances of professional dancers who interpreted four affective states: anger, happiness, sadness, and insecurity. Additionally, we investigate the effect of data chunk duration, overlapping, the size of the input images and the contribution of several data augmentation strategies for our proposed method. Better recognition results were obtained when the duration of a data chunk was longer, and this was further improved by applying balanced data augmentation. Moreover, we test our method on other existing motion capture datasets and compare the results with prior art. In all experiments, our results surpassed the state-of-the-art approaches, showing that this method generalizes across diverse settings and contexts.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Behavioral and Physiological Signals-Based Deep Multimodal Approach for Mobile Emotion Recognition
Authors: Kangning Yang;Chaofan Wang;Yue Gu;Zhanna Sarsenbayeva;Benjamin Tag;Tilman Dingler;Greg Wadley;Jorge Goncalves;
Pages: 1082 - 1097
Abstract: With the rapid development of mobile and wearable devices, it is increasingly possible to access users’ affective data in a more unobtrusive manner. On this basis, researchers have proposed various systems to recognize user’s emotional states. However, most of these studies rely on traditional machine learning techniques and a limited number of signals, leading to systems that either do not generalize well or would frequently lack sufficient information for emotion detection in realistic scenarios. In this paper, we propose a novel attention-based LSTM system that uses a combination of sensors from a smartphone (front camera, microphone, touch panel) and a wristband (photoplethysmography, electrodermal activity, and infrared thermopile sensor) to accurately determine user’s emotional states. We evaluated the proposed system by conducting a user study with 45 participants. Using collected behavioral (facial expression, speech, keystroke) and physiological (blood volume, electrodermal activity, skin temperature) affective responses induced by visual stimuli, our system was able to achieve an average accuracy of 89.2 percent for binary positive and negative emotion classification under leave-one-participant-out cross-validation. Furthermore, we investigated the effectiveness of different combinations of data signals to cover different scenarios of signal availability.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
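The system above feeds multimodal behavioural and physiological sequences into an attention-based LSTM. A hedged sketch of that pattern follows, with fusion by simple concatenation and illustrative feature dimensions; it is not the authors' exact design.

```python
# Attention-based LSTM over concatenated modality features; dimensions and
# fusion-by-concatenation are assumptions for illustration only.
import torch
import torch.nn as nn

class AttnLSTM(nn.Module):
    def __init__(self, behav_dim=64, physio_dim=16, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(behav_dim + physio_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # scores each time step
        self.out = nn.Linear(hidden, n_classes)   # positive vs. negative emotion

    def forward(self, behav, physio):             # (B, T, D_b), (B, T, D_p)
        h, _ = self.lstm(torch.cat([behav, physio], dim=-1))
        w = torch.softmax(self.attn(h), dim=1)    # temporal attention weights
        return self.out((w * h).sum(dim=1))

logits = AttnLSTM()(torch.randn(4, 50, 64), torch.randn(4, 50, 16))
print(logits.shape)                                # (4, 2)
```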
- Enforcing Semantic Consistency for Cross Corpus Emotion Prediction Using Adversarial Discrepancy Learning in Emotion
Authors: Chun-Min Chang;Gao-Yi Chao;Chi-Chun Lee;
Pages: 1098 - 1109
Abstract: Mismatch between databases entails a challenge in performing emotion recognition on a practical-condition unlabeled database with labeled source data. The alignment between the source and target is crucial for conventional neural networks; therefore, many studies have mapped the two domains into a common feature space. However, the effect of distortion in emotion semantics across different conditions has been neglected in such work, and a sample from the target may receive a high emotional annotation in the target but a low one in the source. In this article, we propose the maximum regression discrepancy (MRD) network, which enforces semantic consistency between a source and target by adjusting the acoustic feature encoder to minimize discrepancy in maximally distorted samples through adversarial training. We evaluate our framework in several experiments using three databases (the USC IEMOCAP, MSP-Improv, and MSP-Podcast) for cross corpus emotion prediction. Compared to the source-only neural network and DANN, the MRD network demonstrates a significant improvement between 5% and 10% in the concordance correlation coefficient (CCC) in cross-corpus prediction and between 3% and 10% for evaluation on MSP-Podcast. We also visualize the effect of MRD on feature representation to show the efficacy of the MRD structure we designed.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Discriminative Few Shot Learning of Facial Dynamics in Interview Videos for Autism Trait Classification
Authors: Na Zhang;Mindi Ruan;Shuo Wang;Lynn Paul;Xin Li;
Pages: 1110 - 1124
Abstract: Autism is a prevalent neurodevelopmental disorder characterized by impairments in social and communicative behaviors. Possible connections between autism and facial expression recognition have recently been studied in the literature. However, most works are based on facial images or short videos. Few works aim at Autism Diagnostic Observation Schedule (ADOS) videos due to their complexity (e.g., interaction between interviewer and interviewee) and length (e.g., usually last for hours). In this paper, we attempt to fill this gap by developing a novel discriminative few shot learning method to analyze hour-long video data and exploring the fusion of facial dynamics for the trait classification of ASD. Leveraging well-established computer vision tools from spatio-temporal feature extraction and marginal fisher analysis to few-shot learning and scene-level fusion, we have constructed a three-category system to classify an individual into Autism, Autism Spectrum, and Non-Spectrum. For the first time, we have shown that certain interview scenes carry more discriminative information for ASD trait classification than others. Experimental results are reported to demonstrate the potential of the proposed automatic ASD trait classification system (achieving 91.72% accuracy on the Caltech ADOS video dataset) and the benefits of few-shot learning and scene-level fusion strategy by extensive ablation studies.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Typical Facial Expression Network Using a Facial Feature Decoupler and Spatial-Temporal Learning
Authors: Jianing Teng;Dong Zhang;Wei Zou;Ming Li;Dah-Jye Lee;
Pages: 1125 - 1137
Abstract: Facial expression recognition (FER) accuracy is often affected by an individual’s unique facial characteristics. Recognition performance can be improved if the influence from these physical characteristics is minimized. Using video instead of single image for FER provides better results but requires extracting temporal features and the spatial structure of facial expressions in an integrated manner. We propose a new network called Typical Facial Expression Network (TFEN) to address both challenges. TFEN uses two deep two-dimensional (2D) convolutional neural networks (CNNs) to extract facial and expression features from input video. A facial feature decoupler decouples facial features from expression features to minimize the influence from inter-subject face variations. These networks combine with a 3D CNN and form a spatial-temporal learning network to jointly explore the spatial-temporal features in a video. A facial recognition network works as an adversarial network to refine the facial feature decoupler and the network performance by minimizing the residual influence of facial features after decoupling. The whole network is trained with an adversarial algorithm to improve FER performance. TFEN was evaluated on four popular dynamic FER datasets. Experimental results show TFEN achieves or outperforms the recognition accuracy of state-of-the-art approaches.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- The Rhythm of Flow: Detecting Facial Expressions of Flow Experiences Using CNNs
Authors: Lukas Humpe;Simon Friedrich Murillo;Janik Muires;Peter Gloor;
Pages: 1138 - 1147
Abstract: Recently, the flow state, a state in which individuals perform at the peak of their ability and are completely immersed in the task while experiencing a state of elatedness, has been the subject of active research. We introduce a novel approach of using convolutional neural networks to recognize flow in live performing musicians from analyzing their facial expression. A modified and partially re-trained version of the popular ResNet-50 architecture is employed for binary classification of flow, achieving a detection accuracy of 77.55 percent. This is done on labelled YouTube video-data of musicians with a labeling strategy that was verified through a perception experiment. Maximum accuracy within a 5-fold cross-validation is 74.98 percent with the mean exhibiting an accuracy of 65.10 percent. The results indicate that the state of flow is indeed recognizable through facial expressions of musicians. In addition, the utility of the presented model is demonstrated in two exemplary applications: Predicting the popularity of YouTube videos based on flow recognized in the faces through our system and correlating flow and six discrete emotions (neutral, happy, angry, fear, disgust, surprise).
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
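The study above re-trains a modified ResNet-50 for binary flow classification from faces. The sketch below shows the generic adaptation step (replacing the final layer with a two-class head); the frozen backbone and the randomly initialised rather than pretrained weights are simplifying assumptions, not the paper's exact fine-tuning setup.

```python
# Adapting ResNet-50 for a binary flow / no-flow head; freezing choice and
# random initialisation are assumptions made for brevity.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()                                  # in practice: load pretrained weights
for p in model.parameters():                        # freeze the backbone ...
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)       # ... and re-train a new 2-class head

faces = torch.randn(4, 3, 224, 224)                 # fake batch of face crops
logits = model(faces)
print(logits.shape)                                 # (4, 2)
```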
- Deep Siamese Neural Networks for Facial Expression Recognition in the Wild
Authors: Wassan Hayale;Pooran Singh Negi;Mohammad H. Mahoor;
Pages: 1148 - 1158
Abstract: This article introduces an algorithm for facial expression recognition (FER) using deep Siamese Neural Networks (SNNs) that preserve the local structure of images in the embedding similarity space. We designed the network to reveal the similarity of input pairs by comparing features through a designed metric. Furthermore, we developed a novel image pairing strategy (i.e., positive and negative pairs) to train our Siamese model. Our Siamese model comprises a verification framework and an identification framework to learn a joint embedding space. The verification path reduces the intra-class variations by minimizing the distance between the extracted features from the same class, while the identification path increases the inter-class variations by maximizing the distance between the features extracted from different classes. We apply transfer learning to only use the identification model for facial expression classification. We evaluated our algorithm using AffectNet, FER2013, and Compound Facial Expressions of Emotion (CFEE) datasets, where better results are achieved compared to other deep learning-based approaches.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
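The verification path described above pulls same-class pairs together and pushes different-class pairs apart in the embedding space. The sketch below illustrates that idea with a toy shared encoder and a contrastive-style pair loss; the architecture, margin, and input sizes are assumptions, not those of the paper.

```python
# Toy Siamese verification path: shared encoder + contrastive-style pair loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)      # unit-length embeddings

def contrastive_loss(z1, z2, same, margin=1.0):
    d = (z1 - z2).pow(2).sum(dim=1).sqrt()          # Euclidean distance per pair
    return (same * d.pow(2) +
            (1 - same) * F.relu(margin - d).pow(2)).mean()

enc = Encoder()
a, b = torch.randn(8, 1, 48, 48), torch.randn(8, 1, 48, 48)   # paired face images
same = torch.randint(0, 2, (8,)).float()                       # 1 = same expression
loss = contrastive_loss(enc(a), enc(b), same)
loss.backward()
print(loss.item())
```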
- Geometry-Aware Facial Expression Recognition via Attentive Graph Convolutional Networks
Authors: Rui Zhao;Tianshan Liu;Zixun Huang;Daniel P.K. Lun;Kin-Man Lam;
Pages: 1159 - 1174
Abstract: Learning discriminative representations with good robustness from facial observations serves as a fundamental step towards intelligent facial expression recognition (FER). In this article, we propose a novel geometry-aware FER framework to boost the FER performance based on both the geometric and appearance knowledge. Specifically, we propose an encoding strategy for facial landmarks, and adopt a graph convolutional network (GCN) to fully explore the structural information of the facial components behind different expressions. A convolutional neural network (CNN) is further applied to the whole facial observation to learn the global characteristics of different expressions. The features from these two networks are fused into a comprehensive high-semantic representation, which promotes the FER reasoning from both visual and structural perspectives. Moreover, to facilitate the networks to concentrate on the most informative facial regions and components, we introduce multi-level attention mechanisms into the proposed framework, which enhance the reliability of the learned representations for effective FER. Experiments on two challenging FER benchmarks demonstrate that the attentive graph-based learning on the facial geometry boosts the FER accuracy. Furthermore, the insensitivity of the geometric information to the appearance variations also improves the generalization of the proposed framework.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
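The framework above encodes facial landmarks with a graph convolutional network. The sketch below shows a single normalised-adjacency graph-convolution step over a toy landmark graph; the real system builds the graph from facial components and fuses the result with CNN features.

```python
# One GCN step over a toy chain-shaped "landmark" graph; the graph and
# coordinates are synthetic placeholders.
import torch
import torch.nn as nn

def normalize_adjacency(A):
    A_hat = A + torch.eye(A.size(0))                # add self-loops
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalisation

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, A_norm, H):                   # H: (num_nodes, in_dim)
        return torch.relu(A_norm @ self.lin(H))

n = 10
A = torch.zeros(n, n)
for i in range(n - 1):                              # toy chain of 10 landmarks
    A[i, i + 1] = A[i + 1, i] = 1.0
H = torch.rand(n, 2)                                # (x, y) landmark coordinates
out = GCNLayer(2, 16)(normalize_adjacency(A), H)
print(out.shape)                                    # (10, 16)
```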
- Looking at the Body: Automatic Analysis of Body Gestures and Self-Adaptors in Psychological Distress
Authors: Weizhe Lin;Indigo Orton;Qingbiao Li;Gabriela Pavarini;Marwa Mahmoud;
Pages: 1175 - 1187
Abstract: Psychological distress is a significant and growing issue in society. In particular, depression and anxiety are leading causes of disability that often go undetected or late-diagnosed. Automatic detection, assessment, and analysis of behavioural markers of psychological distress can help improve identification and support prevention and early intervention efforts. Compared to modalities such as face, head, and vocal, research investigating the use of the body modality for these tasks is relatively sparse, which is partly due to the limited available datasets and difficulty in automatically extracting useful body features. To enable our research, we have collected and analyzed a new dataset containing full body videos for interviews and self-reported distress labels. We propose a novel approach to automatically detect self-adaptors and fidgeting, a subset of self-adaptors that has been shown to correlate with psychological distress. We perform analysis on statistical body gestures and fidgeting features to explore how distress levels affect behaviors. We then propose a multi-modal approach that combines different feature representations using Multi-modal Deep Denoising Auto-Encoders and Improved Fisher Vector Encoding. We demonstrate that our proposed model, combining audio-visual features with detected fidgeting behavioral cues, can successfully predict depression and anxiety in the dataset.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Automatic Estimation of Action Unit Intensities and Inference of Emotional Appraisals
Authors: Dominik Seuss;Teena Hassan;Anja Dieckmann;Matthias Unfried;Klaus R. Scherer;Marcello Mortillaro;Jens Garbas;
Pages: 1188 - 1200
Abstract: The development of a two-stage approach for appraisal inference from automatically detected Action Unit (AU) intensities in recordings of human faces is described. AU intensity estimation is based on a hybrid approach fusing information from an individually fitted mesh model of the faces and texture information. Evaluation results for two datasets and a comparison against a state-of-the-art system, namely OpenFace, are provided. In the second stage, the emotional appraisals novelty, valence and control are predicted from estimated AU intensities by linear regressions. Prediction performance is evaluated based on face recordings from a market research study, which were rated by human observers in terms of perceived appraisals. Predictions of valence and control from automatically estimated AU intensities closely match those obtained from manually coded AUs in terms of agreement with human observers, while novelty predictions lag somewhat behind. Overall, results highlight the flexibility and interpretability of a two-stage approach to emotion inference.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
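In the second stage described above, appraisals are predicted from AU intensities by linear regression. A minimal sketch with ordinary least squares on synthetic stand-in data follows; the AU vectors and ratings below are fabricated purely to show the fitting step.

```python
# Ordinary least-squares regression from synthetic AU intensities to a
# synthetic appraisal rating (e.g., valence). Data are stand-ins only.
import numpy as np

rng = np.random.default_rng(1)
au = rng.uniform(0, 5, size=(200, 12))                   # 12 AU intensities per clip
true_w = rng.normal(size=12)
valence = au @ true_w + rng.normal(scale=0.5, size=200)  # synthetic observer ratings

X = np.hstack([au, np.ones((200, 1))])                   # add intercept column
coef, *_ = np.linalg.lstsq(X, valence, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((valence - pred) ** 2) / np.sum((valence - valence.mean()) ** 2)
print("R^2:", round(r2, 3))
```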
- Werewolf-XL: A Database for Identifying Spontaneous Affect in Large Competitive Group Interactions
Authors: Kejun Zhang;Xinda Wu;Xinhang Xie;Xiaoran Zhang;Hui Zhang;Xiaoyu Chen;Lingyun Sun;
Pages: 1201 - 1214
Abstract: Affective computing and natural human-computer interaction, which would be capable of interpreting and responding intelligently to the social cues of interaction in crowds, are more needed than ever as an individual's affective experience is often related to others in group activities. To develop the next-generation intelligent interactive systems, we require numerous human facial expressions with accurate annotations. However, existing databases usually consider nonspontaneous human behavior (posed or induced), individual or dyadic setting, and a single type of emotion annotation. To address this need, we created the Werewolf-XL database, which contains a total of 890 minutes of spontaneous audio-visual recordings of 129 subjects in a group interaction of nine individuals playing a conversational role-playing game called Werewolf. We provide 131,688 individual utterance-level video clips with internal self-assessment of 18 non-prototypical emotional categories and external assessment of pleasure, arousal, and dominance, including 14,632 speakers' samples and the rest of listeners' samples. Besides, the results of the annotation agreement analysis show fair reliability and validity. Role information and outcomes of the game are also recorded. Furthermore, we provided extensive benchmarks of unimodal and multimodal emotional recognition results. The database is made publicly available.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling
Authors: Wei-Cheng Lin;Carlos Busso;
Pages: 1215 - 1227
Abstract: A critical issue of current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling for speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences with varied length. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques that deal with varied length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks that have the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extractions and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results based on three databases demonstrate that the proposed framework provides: 1) improvement in recognition accuracy, 2) robustness toward different temporal length predictions, and 3) high model computational efficiency advantages.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
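The chunking idea above maps sentences of any duration to a fixed number of equal-length chunks by adjusting their overlap. A small sketch of that mapping follows; the frame counts and chunk settings are illustrative choices, not the paper's configuration.

```python
# Dynamic chunking: fixed number of fixed-length chunks with adaptive
# overlap; chunk_len/num_chunks are illustrative, not the paper's values.
import numpy as np

def dynamic_chunks(frames, num_chunks=11, chunk_len=100):
    """frames: (T, D) frame-level features -> (num_chunks, chunk_len, D)."""
    T = len(frames)
    if T < chunk_len:                               # very short sentence: pad by repetition
        reps = int(np.ceil(chunk_len / T))
        frames = np.tile(frames, (reps, 1))[:chunk_len]
        T = chunk_len
    starts = np.linspace(0, T - chunk_len, num_chunks).round().astype(int)
    return np.stack([frames[s:s + chunk_len] for s in starts])

for T in (80, 300, 1200):                           # sentences of varied duration
    x = np.random.randn(T, 40)                      # e.g., 40-dim acoustic features
    print(T, "->", dynamic_chunks(x).shape)         # always (11, 100, 40)
```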
- Data Augmentation via Face Morphing for Recognizing Intensities of Facial Emotions
Authors: Tsung-Ren Huang;Shin-Min Hsu;Li-Chen Fu;
Pages: 1228 - 1235
Abstract: Being able to recognize emotional intensity is a desirable feature for a facial emotional recognition (FER) system. However, the development of such a feature is hindered by the paucity of intensity-labeled data for model training. To ameliorate the situation, the present study proposes using face morphing as a novel way of data augmentation to synthesize faces that express different degrees of a designated emotion. Such an approach has been successfully validated on humans and machines. Specifically, humans indeed perceived different levels of intensified emotions in these parametrically synthesized faces, and FER systems based on neural networks indeed showed improved sensitivities to intensities of different emotions when additionally trained on the synthesized faces. Overall, the proposed data augmentation method is not only simple and effective but also useful for building FER systems that recognize facial expressions of mixed emotions.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
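The augmentation above synthesises faces expressing intermediate intensities of an emotion. The sketch below approximates this with a plain cross-dissolve between a neutral face and an apex-expression face; proper face morphing additionally warps the images toward interpolated landmark positions, which is omitted here for brevity.

```python
# Simplified intensity interpolation between a neutral and an apex face
# (cross-dissolve only; landmark warping omitted). Images are synthetic.
import numpy as np

def intensity_morphs(neutral, apex, levels=(0.25, 0.5, 0.75)):
    """neutral, apex: aligned float images in [0, 1] with identical shape."""
    return [(1.0 - a) * neutral + a * apex for a in levels]  # a = intensity level

neutral = np.random.rand(64, 64)        # stand-ins for aligned grayscale faces
apex = np.random.rand(64, 64)
for a, img in zip((0.25, 0.5, 0.75), intensity_morphs(neutral, apex)):
    print(f"intensity {a}: shape {img.shape}, range [{img.min():.2f}, {img.max():.2f}]")
```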
- Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion
Authors: Fuyan Ma;Bin Sun;Shutao Li;
Pages: 1236 - 1248
Abstract: Facial Expression Recognition (FER) in the wild is extremely challenging due to occlusions, variant head poses, face deformation and motion blur under unconstrained conditions. Although substantial progress has been made in automatic FER in the past few decades, previous studies were mainly designed for lab-controlled FER. Real-world occlusions, variant head poses and other issues definitely increase the difficulty of FER on account of these information-deficient regions and complex backgrounds. Different from previous pure CNN-based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective. Therefore, we propose the Visual Transformers with Feature Fusion (VTFF) to tackle FER in the wild by two main steps. First, we propose the attentional selective fusion (ASF) for leveraging two kinds of feature maps generated by two-branch CNNs. The ASF captures discriminative information by fusing multiple features with global-local attention. The fused feature maps are then flattened and projected into sequences of visual words. Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention. The proposed method is evaluated on three public in-the-wild facial expression datasets (RAF-DB, FERPlus and AffectNet). Under the same settings, extensive experiments demonstrate that our method shows superior performance over other methods, setting a new state of the art on RAF-DB with 88.14%, FERPlus with 88.81% and AffectNet with 61.85%. The cross-dataset evaluation on CK+ shows the promising generalization capability of the proposed method.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
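The first step above fuses two CNN feature maps with attentional selective fusion before flattening them into visual words for a Transformer. The sketch below is a loose interpretation of that step, with per-location branch weights predicted from the concatenated maps; the shapes and gating network are assumptions, not the exact VTFF design.

```python
# Loose sketch of selective fusion of two feature maps followed by a
# Transformer encoder over the flattened "visual words". Shapes are assumed.
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)   # 2 branch scores per pixel

    def forward(self, f1, f2):                     # both: (B, C, H, W)
        w = torch.softmax(self.gate(torch.cat([f1, f2], dim=1)), dim=1)
        fused = w[:, :1] * f1 + w[:, 1:] * f2      # per-location weighted sum
        return fused.flatten(2).transpose(1, 2)    # (B, H*W, C) visual-word sequence

f1, f2 = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
tokens = SelectiveFusion()(f1, f2)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=1)
print(tokens.shape, encoder(tokens).shape)         # (2, 196, 64) twice
```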
- Collecting Mementos: A Multimodal Dataset for Context-Sensitive Modeling of Affect and Memory Processing in Responses to Videos
Authors: Bernd Dudzik;Hayley Hung;Mark Neerincx;Joost Broekens;
Pages: 1249 - 1266
Abstract: In this article, we introduce Mementos: the first multimodal corpus for computational modeling of affect and memory processing in response to video content. It was collected online via crowdsourcing and captures 1995 individual responses collected from 297 unique viewers responding to 42 different segments of music videos. Apart from webcam recordings of their upper-body behavior (totaling 2012 minutes) and self-reports of their emotional experience, it contains detailed descriptions of the occurrence and content of 989 personal memories triggered by the video content. Finally, the dataset includes self-report measures related to individual differences in participants' background and situation (Demographics, Personality, and Mood), thereby facilitating the exploration of important contextual factors in research using the dataset. We 1) describe the construction and contents of the corpus itself, 2) analyse the validity of its content by investigating biases and consistency with existing research on affect and memory processing, 3) review previously published work that demonstrates the usefulness of the multimodal data in the corpus for research on automated detection and prediction tasks, and 4) provide suggestions for how the dataset can be used in future research on modeling Video-Induced Emotions, Memory-Associated Affect, and Memory Evocation.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
- Impact of Facial Landmark Localization on Facial Expression Recognition
Authors: Romain Belmonte;Benjamin Allaert;Pierre Tirilly;Ioan Marius Bilasco;Chaabane Djeraba;Nicu Sebe;
Pages: 1267 - 1279
Abstract: Although facial landmark localization (FLL) approaches are becoming increasingly accurate in identifying facial components, one question remains unanswered: what is the impact of these approaches on subsequent, related tasks? In this paper, we focus on facial expression recognition (FER), where facial landmarks are used for face registration, which is a common usage. Since the common datasets for facial landmark localization do not allow for a proper measurement of performance according to the different difficulties (e.g., pose, expression, illumination, occlusion, motion blur), we also quantify the performance of recent approaches in the presence of head pose variations and facial expressions. Finally, we conduct a study of the impact of these approaches on FER. We show that the landmark accuracy achieved so far by optimizing the Euclidean distance does not necessarily guarantee a gain in performance for FER. To deal with this issue, we propose a new evaluation metric for FLL that is more relevant to FER.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Learning Users Inner Thoughts and Emotion Changes for Social Media Based
Suicide Risk Detection-
Authors: Lei Cao;Huijun Zhang;Xin Wang;Ling Feng;
Pages: 1280 - 1296
Abstract: Suicide has become a serious problem, hurting the well-being of human society. Thanks to social media, suicide risk detection from people's linguistic posts has achieved good performance. The aim of this article is to investigate whether more significant accuracy could be achieved. Motivated by the observation that prior solutions strived to detect suicide risk based on users’ explicit outer post expressions on social media, and that no attempt was made to infer users’ inner true thoughts and emotion changes from their normal open posts for suicide risk detection, we propose to first learn the correlations between users’ normal open posts and hidden comments, trying to understand users’ inner true thoughts and emotion changes from the open posts, and then detect users’ suicide risk upon the generated intermediate results. The better detection performance on the microblog dataset (3,652 at-risk microblog users and 3,652 ordinary microblog users) and forum dataset (392 at-risk forum users and 108 ordinary forum users) verifies the insight that it is more effective to learn users’ inner thoughts and emotion changes for social media-based suicide risk detection.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Improving Textual Emotion Recognition Based on Intra- and Inter-Class
Variations-
Authors: Hassan Alhuzali;Sophia Ananiadou;
Pages: 1297 - 1307
Abstract: Textual Emotion Recognition (TER) is an important task in Natural Language Processing (NLP), due to its high impact in real-world applications. Prior research has tackled the automatic classification of emotion expressions in text by maximising the probability of the correct emotion class using cross-entropy loss. However, this approach does not account for intra- and inter-class variations within and between emotion classes. To overcome this problem, we introduce a variant of triplet centre loss as an auxiliary task to emotion classification. This allows TER models to learn compact and discriminative features. Furthermore, we introduce a method for evaluating the impact of intra- and inter-class variations on each emotion class. Experiments performed on three datasets demonstrate the effectiveness of our method when applied to each emotion class in comparison to previous approaches. Finally, we present analyses that illustrate the benefits of our method in terms of improving the prediction scores as well as producing discriminative features.
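As a rough illustration of how a triplet-centre-style objective can be attached to a standard cross-entropy classifier as an auxiliary task, the sketch below combines the two losses over learned class centres. The margin value, centre parameterisation and loss weighting are assumptions for illustration, not the exact formulation used in the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletCentreAuxiliary(nn.Module):
    """Cross-entropy plus a triplet-centre-style auxiliary term (illustrative only)."""
    def __init__(self, num_classes: int, feat_dim: int, margin: float = 1.0, weight: float = 0.1):
        super().__init__()
        self.centres = nn.Parameter(torch.randn(num_classes, feat_dim))  # one centre per emotion class
        self.margin, self.weight = margin, weight

    def forward(self, features: torch.Tensor, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, labels)
        dists = torch.cdist(features, self.centres)        # (B, C) distances to all class centres
        idx = torch.arange(features.size(0), device=features.device)
        pos = dists[idx, labels]                           # distance to the true-class centre
        masked = dists.clone()
        masked[idx, labels] = float("inf")
        neg = masked.min(dim=1).values                     # distance to the nearest wrong centre
        tcl = F.relu(pos - neg + self.margin).mean()       # pull towards own centre, push from others
        return ce + self.weight * tcl

# toy usage: 8 samples, 6 emotion classes, 32-d features
feats, logits, labels = torch.randn(8, 32), torch.randn(8, 6), torch.randint(0, 6, (8,))
loss = TripletCentreAuxiliary(num_classes=6, feat_dim=32)(feats, logits, labels)
```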
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Learning Enhanced Acoustic Latent Representation for Small Scale Affective
Corpus with Adversarial Cross Corpora Integration-
Authors: Chun-Min Chang;Chi-Chun Lee;
Pages: 1308 - 1321
Abstract: Achieving robust cross-context speech emotion recognition (SER) has become a critical next direction of research for wide adoption of SER technology. The core challenge is the large variability of affective speech, which is highly contextualized. Prior works have treated this as a transfer learning problem, mostly focusing on developing domain adaptation strategies. However, many of the existing speech emotion corpora, even those considered large scale, are still limited in size, resulting in unsatisfactory transfer results. On the other hand, directly collecting a context-specific corpus often results in an even smaller data size, leading to an inevitably non-robust accuracy. In order to mitigate this issue, we propose the concept of enhancing the affect-related variability when learning the in-context acoustic latent representation by integrating out-of-context emotion data. Specifically, we utilize an adversarial autoencoder network as our backbone, with multiple out-of-context emotion labels derived for each in-context sample serving as an auxiliary constraint in learning the latent representation. We extensively evaluate our framework using three in-context databases with three out-of-context databases. In this work, we demonstrate not only an improved recognition accuracy but also a comprehensive analysis of the effectiveness of this representation learning strategy.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Learning Transferable Sparse Representations for Cross-Corpus Facial
Expression Recognition-
Authors: Dongliang Chen;Peng Song;Wenming Zheng;
Pages: 1322 - 1333
Abstract: An assumption widely used in traditional facial expression recognition algorithms is that the training and testing are conducted on the same dataset. However, this assumption does not hold in practice, in which the training data and testing data are often from different datasets. In this scenario, directly deploying these algorithms would lead to severe information loss and performance degradation due to the domain shift. To address this challenging problem, in this article, we propose a novel transferable sparse subspace representation method (TSSR) for cross-corpus facial expression recognition. Specifically, in order to reduce the cross-corpus mismatch, inspired by sparse subspace clustering, we advocate reconstructing the source and target samples using the source data points based on $\ell_1$-norm sparse representation. Each data point in source and target corpora can be ideally represented as a combination of a few other source points from its own subspace. Moreover, we take into account the local geometrical information within the cross-corpus data by adopting a graph Laplacian regularizer, which can efficiently preserve the local manifold structure and better transfer knowledge between two corpora. Finally, extensive experiments on several facial expression datasets are conducted to evaluate the recognition performance of TSSR. Experimental results demonstrate the superiority of the proposed method over some state-of-the-art methods.
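To make the kind of objective described above concrete, the LaTeX snippet below writes out a generic sparse-reconstruction problem with an $\ell_1$ penalty and a graph Laplacian regularizer. The symbols ($X_s$, $X_t$, $Z$, $L$, $\lambda$, $\beta$) and the exact combination of terms are illustrative assumptions, not the published TSSR formulation.

```latex
% Generic sparse subspace reconstruction with a graph Laplacian regularizer (illustrative form).
% X_s, X_t: source/target feature matrices; Z: representation coefficients; L: graph Laplacian.
\begin{equation}
  \min_{Z}\;
  \bigl\| [X_s, X_t] - X_s Z \bigr\|_F^2
  \;+\; \lambda \, \| Z \|_1
  \;+\; \beta \, \operatorname{tr}\!\left( Z L Z^{\top} \right)
\end{equation}
```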
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset:
Collection, Insights and Improvements-
Authors: Lukas Stappen;Alice Baird;Lea Schumann;Björn Schuller;
Pages: 1334 - 1350
Abstract: Truly real-life data presents a strong but exciting challenge for sentiment and emotion research. The high variety of possible ‘in-the-wild’ properties makes large datasets such as these indispensable with respect to building robust machine learning models. A sufficient quantity of data covering a deep variety in the challenges of each modality, which would force the exploratory analysis of the interplay of all modalities, has not yet been made available in this context. In this contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available, as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges – predicting the level of trustworthiness – no participant outperformed the baseline model, so we propose a simple but highly efficient Multi-Head-Attention network that, using multimodal fusion, exceeds the baseline by around 0.2 CCC (almost 50 percent improvement).
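Since the improvement above is reported in CCC, a short helper may clarify what that metric measures; the sketch below computes the Concordance Correlation Coefficient between predicted and gold continuous annotations. This is the standard formula, not code from the MuSe baseline.

```python
import numpy as np

def concordance_correlation_coefficient(pred, gold) -> float:
    """Concordance Correlation Coefficient (CCC) between two 1-D continuous series."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    covariance = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * covariance / (var_p + var_g + (mean_p - mean_g) ** 2)

# toy check: a perfectly concordant prediction yields CCC = 1.0
print(concordance_correlation_coefficient([0.1, 0.4, 0.8], [0.1, 0.4, 0.8]))
```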
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- You're Not You When You're Angry: Robust Emotion Features Emerge by
Recognizing Speakers-
Authors: Zakaria Aldeneh;Emily Mower Provost;
Pages: 1351 - 1362
Abstract: The robustness of an acoustic emotion recognition system hinges on first having access to features that represent an acoustic input signal. These representations should abstract extraneous low-level variations present in acoustic signals and only capture speaker characteristics relevant for emotion recognition. Previous research has demonstrated that, in other classification tasks, when large labeled datasets are available, neural networks trained on these data learn to extract robust features from the input signal. However, the datasets used for developing emotion recognition systems remain significantly smaller than those used for developing other speech systems. Thus, acoustic emotion recognition systems remain in need of robust feature representations. In this article, we study the utility of speaker embeddings, representations extracted from a trained speaker recognition network, as robust features for detecting emotions. We first study the relationship between emotions and speaker embeddings and demonstrate how speaker embeddings highlight the differences that exist between neutral speech and emotionally expressive speech. We quantify the modulations that variations in emotional expression incur on speaker embeddings and show how these modulations are greater than those incurred from lexical variations in an utterance. Finally, we demonstrate how speaker embeddings can be used as a replacement for traditional low-level acoustic features for emotion recognition.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Multi-Modal Sarcasm Detection and Humor Classification in Code-Mixed
Conversations-
Authors: Manjot Bedi;Shivani Kumar;Md Shad Akhtar;Tanmoy Chakraborty;
Pages: 1363 - 1375
Abstract: Sarcasm detection and humor classification are inherently subtle problems, primarily due to their dependence on contextual and non-verbal information. Furthermore, existing studies on these two topics are usually limited in non-English languages such as Hindi, due to the unavailability of quality annotated datasets. In this work, we make two major contributions considering the above limitations: (1) we develop a Hindi-English code-mixed dataset, MaSaC, for multi-modal sarcasm detection and humor classification in conversational dialog, which to our knowledge is the first dataset of its kind; (2) we propose MSH-COMICS, a novel attention-rich neural architecture for utterance classification. We learn efficient utterance representations utilizing a hierarchical attention mechanism that attends to a small portion of the input sentence at a time. Further, we incorporate a dialog-level contextual attention mechanism to leverage the dialog history for multi-modal classification. We perform extensive experiments for both tasks by varying the multi-modal inputs and various submodules of MSH-COMICS. We also conduct comparative analysis against existing approaches. We observe that MSH-COMICS attains superior performance over the existing models by > 1 F1-score point for sarcasm detection and 10 F1-score points in humor classification. We diagnose our model and perform a thorough analysis of the results to understand its superiority and pitfalls.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Quantifying Emotional Similarity in Speech
-
Authors: John Harvill;Seong-Gyun Leem;Mohammed AbdelWahab;Reza Lotfian;Carlos Busso;
Pages: 1376 - 1390
Abstract: This study proposes a novel formulation for measuring emotional similarity between speech recordings. This formulation explores the ordinal nature of emotions by comparing emotional similarities instead of predicting an emotional attribute or recognizing an emotional category. The proposed task determines which of two alternative samples has the most similar emotional content to the emotion of a given anchor. This task raises some interesting questions. Which emotional descriptor provides the most suitable space to assess emotional similarities? Can deep neural networks (DNNs) learn representations to robustly quantify emotional similarities? We address these questions by exploring alternative emotional spaces created with attribute-based descriptors and categorical emotions. We create the representation using a DNN trained with the triplet loss function, which relies on triplets formed with an anchor, a positive example, and a negative example. We select a positive sample that has similar emotional content to the anchor, and a negative sample that has dissimilar emotional content to the anchor. The task of our DNN is to identify the positive sample. The experimental evaluations demonstrate that we can learn a meaningful embedding to assess emotional similarities, achieving higher performance than human evaluators asked to complete the same task.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Multimodal Affective States Recognition Based on Multiscale CNNs and
Biologically Inspired Decision Fusion Model-
Authors: Yuxuan Zhao;Xinyan Cao;Jinlong Lin;Dunshan Yu;Xixin Cao;
Pages: 1391 - 1403
Abstract: There has been encouraging progress in affective states recognition models based on single-modality signals such as electroencephalogram (EEG) signals or peripheral physiological signals in recent years. However, multimodal physiological signals-based affective states recognition methods have not been thoroughly exploited yet. Here we propose Multiscale Convolutional Neural Networks (Multiscale CNNs) and a biologically inspired decision fusion model for multimodal affective states recognition. First, the raw signals are pre-processed with baseline signals. Then, the High Scale CNN and Low Scale CNN in the Multiscale CNNs are utilized to predict the probability of affective states for EEG and each peripheral physiological signal, respectively. Finally, the fusion model calculates the reliability of each single-modality signal by the Euclidean distance between the various class labels and the classification probability from the Multiscale CNNs, and the decision is made using the more reliable modality information while information from the other modalities is retained. We use this model to classify four affective states from the arousal-valence plane in the DEAP and AMIGOS datasets. The results show that the fusion model improves the accuracy of affective states recognition significantly compared with the results on single-modality signals, and the recognition accuracy of the fusion result achieves 98.52 and 99.89 percent on the DEAP and AMIGOS datasets, respectively.
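The reliability rule sketched in the abstract can be pictured roughly as follows: each modality's class-probability vector is compared with the one-hot class labels, and the modality whose probabilities sit closest to some label is trusted for the final decision. The snippet below is a simplified reading of that idea, assuming the minimum Euclidean distance to a one-hot label as the reliability score; it is not the authors' exact fusion model.

```python
import numpy as np

def fuse_by_reliability(prob_per_modality) -> int:
    """Pick the class predicted by the modality whose probability vector lies
    closest (in Euclidean distance) to a one-hot class label.

    Illustrative simplification of distance-based decision fusion; the exact
    reliability definition in the published model may differ.
    """
    num_classes = prob_per_modality[0].shape[0]
    one_hot = np.eye(num_classes)
    best_modality, best_dist = 0, np.inf
    for m, probs in enumerate(prob_per_modality):
        dist = np.linalg.norm(one_hot - probs, axis=1).min()  # distance to the nearest label
        if dist < best_dist:
            best_modality, best_dist = m, dist
    return int(np.argmax(prob_per_modality[best_modality]))

# toy example: the EEG branch is more confident, so its decision (class 2) wins
eeg = np.array([0.05, 0.05, 0.85, 0.05])
gsr = np.array([0.30, 0.30, 0.20, 0.20])
print(fuse_by_reliability([eeg, gsr]))
```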
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Dual Learning for Joint Facial Landmark Detection and Action Unit
Recognition-
Authors: Shangfei Wang;Yanan Chang;Can Wang;
Pages: 1404 - 1416
Abstract: Facial landmark detection and action unit (AU) recognition are two essential tasks in facial analysis. Previous works rarely consider the relationship between these complementary tasks. In this article, we introduce a novel multi-task dual learning framework to exploit the relationship between facial landmark detection and AU recognition while simultaneously addressing both tasks. When both tasks share middle-level features, common patterns can be exploited and middle- and high-level features can be used to perform facial landmark detection and AU recognition, respectively. In addition, a dual learning mechanism is designed to convert the predicted landmarks and AUs of the label space to the corresponding facial image of the image space, further exploring the strong correlations between the tasks. By jointly training the proposed method at both the feature and label levels, each task improves the other. Experiments on two benchmark databases demonstrate that the proposed method can leverage dependencies to boost the generalization of both tasks.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- CPED: A Chinese Positive Emotion Database for Emotion Elicitation and
Analysis-
Authors: Yulin Zhang;Guozhen Zhao;Yezhi Shu;Yan Ge;Dan Zhang;Yong-Jin Liu;Xianghong Sun;
Pages: 1417 - 1430
Abstract: Positive emotions are of great significance to people's daily life, for example in human-computer/robot interaction. However, the structure of the broad space of positive emotions is not yet clear, and effective standardized inducing materials covering as many positive emotion categories as possible are lacking. Thus, this article aims to establish a Chinese positive emotion database (CPED) to (1) effectively elicit as many positive emotion categories as possible, (2) provide both the subjective feelings of different positive emotions and a corresponding peripheral physiological database, and (3) explore the structure and framework of positive emotion categories. 42 video clips of 16 positive emotion categories were screened from 1000+ online clips. Then a total of 312 participants watched and rated these video clips while GSR and PPG signals were recorded. 34 video clips that met hit-rate and intensity standards were systematically clustered into four emotion categories (empathy, fun, creativity and esteem). Eventually, 22 film clips of these four major categories formed the CPED database. A total of 84 features from GSR and PPG signals were extracted and entered into RF, SVM, DBN and LSTM classifiers that serve as baseline classification methods. A classification accuracy of 44.66 percent for the four major categories of positive emotions was achieved.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- MERASTC: Micro-Expression Recognition Using Effective Feature Encodings
and 2D Convolutional Neural Network-
Authors: Puneet Gupta;
Pages: 1431 - 1441
Abstract: Facial micro-expression (ME) can disclose genuine and concealed human feelings. This makes MEs extensively useful in real-world applications pertaining to affective computing and psychology. Unfortunately, they are induced by subtle facial movements over a short duration of time, which makes ME recognition a highly challenging problem, even for human beings. In automatic ME recognition, the well-known features encode either incomplete or redundant information, and there is a lack of sufficient training data. The proposed method, Micro-Expression Recognition by Analysing Spatial and Temporal Characteristics (MERASTC), mitigates these issues for improving ME recognition. It compactly encodes the subtle deformations using action units (AUs), landmarks, gaze, and appearance features of all the video frames while preserving most of the relevant ME information. Furthermore, it improves the efficacy by introducing a novel neutral face normalization for ME and initiating the utilization of gaze features in deep learning-based ME recognition. The features are provided to a 2D convolutional neural network that jointly analyses the spatial and temporal behavior for correct ME classification. Experimental results on publicly available datasets indicate that the proposed method exhibits better performance than the well-known methods.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Examining Emotion Perception Agreement in Live Music Performance
-
Authors: Simin Yang;Courtney N. Reed;Elaine Chew;Mathieu Barthet;
Pages: 1442 - 1460
Abstract: Current music emotion recognition (MER) systems rely on emotion data averaged across listeners and over time to infer the emotion expressed by a musical piece, often neglecting time- and listener-dependent factors. These limitations can restrict the efficacy of MER systems and cause misjudgements. We present two exploratory studies on music emotion perception. First, in a live music concert setting, fifteen audience members annotated perceived emotion in the valence-arousal space over time using a mobile application. Analyses of inter-rater reliability yielded widely varying levels of agreement in the perceived emotions. A follow-up lab-based study to uncover the reasons for such variability was conducted, where twenty-one participants annotated their perceived emotions whilst viewing and listening to a video recording of the original performance and offered open-ended explanations. Thematic analysis revealed salient features and interpretations that help describe the cognitive processes underlying music emotion perception. Some of the results confirm known findings of music perception and MER studies. Novel findings highlight the importance of less frequently discussed musical attributes, such as musical structure, performer expression, and stage setting, as perceived across audio and visual modalities. Musicians are found to attribute emotion change to musical harmony, structure, and performance technique more than non-musicians. We suggest that accounting for such listener-informed music features can benefit MER in helping to address variability in emotion perception by providing reasons for listener similarities and idiosyncrasies.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Improving Humanness of Virtual Agents and Users’ Cooperation
Through Emotions-
Authors: Moojan Ghafurian;Neil Budnarain;Jesse Hoey;
Pages: 1461 - 1471
Abstract: In this article, we analyze the performance of an agent developed according to a well-accepted appraisal theory of human emotion with respect to how it modulates play in the context of a social dilemma. We ask if the agent will be capable of generating interactions that are considered to be more human-like than machine-like. We conducted an experiment with 117 participants and show how participants rated our agent on dimensions of human-uniqueness (separating humans from animals) and human-nature (separating humans from machines). We show that our appraisal theoretic agent is perceived to be more human-like than the baseline models, by significantly improving both human-nature and human-uniqueness aspects of the intelligent agent. We also show that perception of humanness positively affects enjoyment and cooperation in the social dilemma, and discuss consequences for the task duration recall.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion
Recognition-
Authors: Maurice Gerczuk;Shahin Amiriparian;Sandra Ottl;Björn W. Schuller;
Pages: 1472 - 1487
Abstract: In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, EmoSet contains 84 181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus SER and general audio recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. The introduced residual adapter approach enables parameter-efficient training of a multi-domain SER model on all 26 corpora. A shared model with only 3.5 times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EmoSet. Using repeated training runs and Almost Stochastic Order with a significance level of $\alpha = 0.05$, these improvements are further significant for 15 datasets, while only three corpora see significant decreases across the residual adapter transfer experiments. Finally, we make our EmoNet framework publicly available for users and developers at https://github.com/EIHW/EmoNet.
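The residual adapter idea borrowed from multi-domain visual recognition can be illustrated with a small PyTorch module: a lightweight, corpus-specific 1x1 convolution is added to the residual path of a shared block, so only the adapters (and typically the normalisation layers) are trained per corpus. The sketch below follows the general recipe from the multi-domain literature; the exact placement and sizes used in EmoNet may differ.

```python
import torch
import torch.nn as nn

class AdapterResidualBlock(nn.Module):
    """Shared 3x3 conv block with a small per-corpus 1x1 'residual adapter'.

    Rough sketch of the residual-adapter recipe, not the exact EmoNet block:
    the shared conv is reused across corpora, while adapter + BN are corpus-specific.
    """
    def __init__(self, channels: int, num_corpora: int):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.adapters = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False) for _ in range(num_corpora)
        )
        self.norms = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_corpora))
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, corpus_id: int) -> torch.Tensor:
        h = self.shared(x)
        h = h + self.adapters[corpus_id](h)          # corpus-specific residual correction
        return self.act(self.norms[corpus_id](h) + x)

block = AdapterResidualBlock(channels=16, num_corpora=26)
out = block(torch.randn(2, 16, 32, 40), corpus_id=3)  # e.g., spectrogram-like input
```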
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Finding Needles in a Haystack: Recognizing Emotions Just From Your Heart
-
Authors: Wei Li;
Pages: 1488 - 1505
Abstract: Emotion plays an important role in human cognition and behavior. How to recognize emotions based on physiological signals has attracted increasing research interest worldwide. Both traditional eastern medicine and modern western medicine have confirmed the existence of a relationship between human emotionality and heart activity. However, in practice, emotion recognition using only Electrocardiogram (ECG) signals is quite challenging, not only due to severe noise interference and serious data variations, but also because of the ambiguous relationship between emotional states and ECG data. Such difficulty can even be compared to finding needles in a haystack. As an innovative endeavor to deal with the issue of only-ECG-based emotion recognition, this paper proposes a novel solution from the perspective of weak signal classification. The proposed solution extracts a static-dynamic representation under the principle of Yin-Yang balance from the heartbeat data, and then utilizes a set-based collaborative measurement built on the idea of data coopetition to classify these features for recognizing emotions. Experimental results have demonstrated the effectiveness, efficiency and adaptiveness of the solution for uncovering the potential relationship between emotions and ECG. Thus, this proposal also illuminates a promising research direction for the general problem of weak signal classification.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Indirect Identification of Perinatal Psychosocial Risks From Natural
Language-
Authors: Kristen C. Allen;Alex Davis;Tamar Krishnamurti;
Pages: 1506 - 1519
Abstract: During the perinatal period, psychosocial health risks, including depression and intimate partner violence, are associated with serious adverse health outcomes for birth parents and children. To appropriately intervene, healthcare professionals must first identify those at risk, yet stigma often prevents people from directly disclosing the information needed to prompt an assessment. In this research we use short diary entries to indirectly elicit information that could indicate psychosocial risks, then examine patterns that emerge in the language of those at risk. We find that diary entries exhibit consistent themes, extracted using topic modeling, and emotional perspective, drawn from dictionary-informed sentiment features. Using these features, we use regularized regression to predict screening measures for depression and psychological aggression by an intimate partner. Journal text entries quantified through topic models and sentiment features show promise for depression prediction, corresponding with self-reported screening measures almost as well as closed-form questions. Text-based features are less useful in predicting intimate partner violence, but topic models generate themes that align with known risk correlates. The indirect features uncovered in this research could aid in the detection and analysis of stigmatized risks.
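As a loose illustration of the "topic-model plus sentiment features into a regularized regression" pipeline described above, the sketch below combines LDA document-topic proportions with a toy lexicon-based sentiment count and fits an L2-regularized (ridge) regression against a screening score. The example entries, lexicon, feature choices and model settings are assumptions for illustration, not the study's actual data or pipeline.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# toy diary entries with made-up screening scores (higher = more depressive symptoms)
entries = ["slept badly and cried most of the day",
           "lovely walk with the baby, feeling hopeful",
           "argued again, feel alone and exhausted",
           "quiet day, tea with a friend, pretty calm"]
scores = np.array([14.0, 3.0, 12.0, 4.0])

# topic features: LDA document-topic proportions over a bag-of-words representation
counts = CountVectorizer().fit_transform(entries)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# crude dictionary-informed sentiment feature (placeholder for a proper lexicon)
negative_words = {"cried", "alone", "exhausted", "badly", "argued"}
sentiment = np.array([[sum(w in negative_words for w in e.split())] for e in entries])

features = np.hstack([topics, sentiment])
model = Ridge(alpha=1.0).fit(features, scores)   # regularized regression on combined features
print(model.predict(features))
```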
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- AT2GRU: A Human Emotion Recognition Model With Mitigated Device
Heterogeneity-
Authors: Pritam Khan;Priyesh Ranjan;Sudhir Kumar;
Pages: 1520 - 1532
Abstract: Device heterogeneity can cause a detrimental impact on the classification of healthcare data. In this work, we propose the Maximum Difference-based Heterogeneity Mitigation (MDHM) method to address device heterogeneity. Mitigating heterogeneity increases the reliability of using multiple devices from different manufacturers for measuring a particular physiological signal. Further, we propose an attention-based bilevel GRU (Gated Recurrent Unit) model, abbreviated as AT2GRU, to classify multi-modal healthcare time-series data for human emotion recognition. The physiological signals of Electroencephalogram (EEG) and Electrocardiogram (ECG) for twenty-three persons are leveraged from the DREAMER dataset for emotion recognition. Also, from the DEAP dataset, the biosignals namely EEG, Galvanic Skin Response (GSR), Respiration Amplitude (RA), Skin Temperature (ST), Blood Volume (BV), Electromyogram (EMG) and Electrooculogram (EOG) of thirty-two persons are used for emotion recognition. The EEG and the other biosignals are denoised by the wavelet filters for enhancing the model's classification accuracy. A multi-class classification is carried out considering valence, arousal, and dominance for each person in the datasets. The classification accuracy is validated against the self-assessment obtained from the respective person after watching a movie/video. The proposed AT2GRU model surpasses the other sequential models namely Long Short Term Memory (LSTM) and GRU in performance.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- User State Modeling Based on the Arousal-Valence Plane: Applications in
Customer Satisfaction and Health-Care-
Authors: Paula Andrea Pérez-Toro;Juan Camilo Vásquez-Correa;Tobias Bocklet;Elmar Nöth;Juan Rafael Orozco-Arroyave;
Pages: 1533 - 1546
Abstract: Acoustic analysis helps to discriminate emotions based on non-verbal information, while linguistic analysis aims to capture verbal information from written sources. Acoustic and linguistic analyses can be applied in different scenarios where information related to emotions, mood, or affect is involved. The Arousal-Valence plane is commonly used to model emotional states in a multidimensional space. This study proposes a methodology focused on modeling the user’s state based on the Arousal-Valence plane in different scenarios. Acoustic and linguistic information are used as input to feed different deep learning architectures, mainly based on convolutional and recurrent neural networks, which are trained to model the Arousal-Valence plane. The proposed approach is used for the evaluation of customer satisfaction in call-centers and for health-care applications in the assessment of depression in Parkinson’s disease and the discrimination of Alzheimer’s disease. F-scores of up to 0.89 are obtained for customer satisfaction, up to 0.82 for depression in Parkinson’s patients, and up to 0.80 for Alzheimer’s patients. The proposed approach confirms that there is information embedded in the Arousal-Valence plane that can be used for different purposes.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Exploring the Contextual Factors Affecting Multimodal Emotion Recognition
in Videos-
Authors: Prasanta Bhattacharya;Raj Kumar Gupta;Yinping Yang;
Pages: 1547 - 1557
Abstract: Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, there is a lack of deeper understanding of how visual and non-visual features can be used to better recognize emotions in certain contexts, but not others. This study analyzes the interplay between the effects of multimodal emotion features derived from facial expressions, tone and text in conjunction with two key contextual factors: i) the gender of the speaker, and ii) the duration of the emotional episode. Using a large public dataset of 2,176 manually annotated YouTube videos, we found that while multimodal features consistently outperformed bimodal and unimodal features, their performance varied significantly across different emotion, gender and duration contexts. Multimodal features performed notably better for male speakers in recognizing most emotions. Furthermore, multimodal features performed notably better for shorter than for longer videos in recognizing neutral and happiness, but not sadness and anger. These findings offer new insights towards the development of more context-aware emotion recognition and empathetic systems.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Cross-Task and Cross-Participant Classification of Cognitive Load in an
Emergency Simulation Game-
Authors: Tobias Appel;Peter Gerjets;Stefan Hoffmann;Korbinian Moeller;Manuel Ninaus;Christian Scharinger;Natalia Sevcenko;Franz Wortha;Enkelejda Kasneci;
Pages: 1558 - 1571
Abstract: Assessment of cognitive load is a major step towards adaptive interfaces. However, non-invasive assessment is rather subjective as well as task specific and generalizes poorly, mainly due to methodological limitations. Additionally, it heavily relies on performance data like game scores or test results. In this study, we present an eye-tracking approach that circumvents these shortcomings and allows for effective generalizing across participants and tasks. First, we established classifiers for predicting cognitive load individually for a typical working memory task (n-back), which we then applied to an emergency simulation game by considering the similar ones and weighting their predictions. Standardization steps helped achieve high levels of cross-task and cross-participant classification accuracy between 63.78 and 67.25 percent for the distinction between easy and hard levels of the emergency simulation game. These very promising results could pave the way for novel adaptive computer-human interaction across domains and particularly for gaming and learning environments.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- The Mediating Effect of Emotions on Trust in the Context of Automated
System Usage-
Authors: Md Abdullah Al Fahim;Mohammad Maifi Hasan Khan;Theodore Jensen;Yusuf Albayram;Emil Coman;Ross Buck;
Pages: 1572 - 1585
Abstract: Safety-critical systems are often equipped with warning mechanisms to alert users regarding imminent system failures. However, they can suffer from false alarms, and affect users’ emotions and trust in the system negatively. While providing feedback could be an effective way to calibrate trust under such scenarios, the effects of feedback and warning reliability on users’ emotions, trust, and compliance behavior are not clear. This article investigates this by designing a 2 (feedback: present/absent) × 2 (warning reliability: high/low) × 4 (sessions) mixed design study where participants interacted with a simulated unmanned aerial vehicle (UAV) system to identify and neutralize enemy targets. Results indicated that feedback containing both correctness and affective components decreased users’ positive emotions and trust in the system, and increased loneliness and hostility (negative) emotions. Emotions were found to mediate the relationship between feedback and trust. Implications of our findings for designing feedback and calibration of trust are discussed in the paper.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- EEG-Based Emotional Video Classification via Learning Connectivity
Structure-
Authors: Soobeom Jang;Seong-Eun Moon;Jong-Seok Lee;
Pages: 1586 - 1597
Abstract: Electroencephalography (EEG) is a useful way to implicitly monitor the user's perceptual state during multimedia consumption. One of the primary challenges for the practical use of EEG-based monitoring is to achieve a satisfactory level of accuracy in EEG classification. Connectivity between different brain regions is an important property for the classification of EEG. However, how to define the connectivity structure for a given task is still an open problem, because there is no ground truth about how the connectivity structure should be in order to maximize the classification performance. In this paper, we propose an end-to-end neural network model for EEG-based emotional video classification, which can extract an appropriate multi-layer graph structure and signal features directly from a set of raw EEG signals and perform classification using them. Experimental results demonstrate that our method yields improved performance in comparison to the existing approaches where manually defined connectivity structures and signal features are used. Furthermore, we show that the graph structure extraction process is reliable in terms of consistency, and the learned graph structures make much sense in the viewpoint of emotional perception occurring in the brain.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Discerning Affect From Touch and Gaze During Interaction With a Robot Pet
-
Authors: Xi Laura Cang;Paul Bucci;Jussi Rantala;Karon E. MacLean;
Pages: 1598 - 1612
Abstract: Practical affect recognition needs to be efficient and unobtrusive in interactive contexts. One approach to a robust realtime system is to sense and automatically integrate multiple nonverbal sources. We investigated how users’ touch, and secondarily gaze, perform as affect-encoding modalities during physical interaction with a robot pet, in comparison to more-studied biometric channels. To elicit authentically experienced emotions, participants recounted two intense memories of opposing polarity in Stressed-Relaxed or Depressed-Excited conditions. We collected data (N=30) from a touch sensor embedded under robot fur (force magnitude and location), a robot-adjacent gaze tracker (location), and biometric sensors (skin conductance, blood volume pulse, respiration rate). Cross-validation of Random Forest classifiers achieved best-case accuracy for combined touch-with-gaze approaching that of biometric results: where training and test sets include adjacent temporal windows, subject-dependent prediction was 94% accurate. In contrast, subject-independent Leave-One-participant-Out predictions resulted in 30% accuracy (chance 25%). Performance was best where participant information was available in both training and test sets. Addressing computational robustness for dynamic, adaptive realtime interactions, we analyzed subsets of our multimodal feature set, varying sample rates and window sizes. We summarize design directions based on these parameters for this touch-based, affective, and hard, realtime robot interaction application.
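The gap reported above between subject-dependent accuracy and Leave-One-participant-Out accuracy comes down to how the cross-validation folds are formed; the sketch below contrasts the two protocols with scikit-learn's Random Forest on synthetic touch/gaze-like feature windows. The feature dimensions and data are placeholders, not the study's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))               # placeholder touch/gaze feature windows
y = rng.integers(0, 4, size=300)             # four affect conditions
participants = np.repeat(np.arange(30), 10)  # 30 participants, 10 windows each

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# subject-dependent: windows from the same participant may appear in train and test folds
dep = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# subject-independent: all windows of one participant are held out at a time (LOPO)
lopo = cross_val_score(clf, X, y, groups=participants, cv=LeaveOneGroupOut())

print(dep.mean(), lopo.mean())               # random features give ~chance in both protocols
```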
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- A Region Group Adaptive Attention Model For Subtle Expression Recognition
-
Authors: Gan Chen;Junjie Peng;Wenqiang Zhang;Kanrun Huang;Feng Cheng;Haochen Yuan;Yansong Huang;
Pages: 1613 - 1626
Abstract: Facial expression recognition has received extensive attention in recent years due to its important applications in many fields. Most expression samples used in research are relatively easy to analyze emotions because they have explicit expressions with strong intensities. However, in situations such as video question and answer, business negotiation, polygraph detection in the security field, autism treatment and medical escort, emotions are expressed in suppressed manners with low intensive expression or subtle expressions, making it difficult to estimate emotions accurately. In these situations, how to effectively extract expression features from facial expression images is a critical problem that affects the accuracy of subtle expression recognition. To address this problem, we propose an end-to-end group adaptive attention model for subtle expression recognition. Cropping an image into several regions of interest (ROI) according to the correlations between facial skeleton and emotions, the proposed model analyses the relationship among regions of interest, and mutual relations between local regions and the holistic region. Using the region group adaptive attention mechanism, the model effectively trains the convolutional neural network to efficiently extract facial expressions representing features and increases the accuracy and robustness of the recognition, particularly in some subtle facial expression circumstances. To improve the ability of different regional features to discriminate expressions, a group adaptive loss function is introduced to verify and improve estimation accuracy. Extensive experiments are conducted on the existing public face datasets CK+, JAFFE, KDEF and the self-collected subtle expression dataset SFER. Results show that the proposed model achieves accuracies of 99.59%, 95.20%, and 93.47% with datasets CK+, JAFFE, and KDEF, respectively. The proposed model thus generally achieves better performance in facial expression recognition than other methods.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Few-Shot Learning in Emotion Recognition of Spontaneous Speech Using a
Siamese Neural Network With Adaptive Sample Pair Formation-
Authors: Kexin Feng;Theodora Chaspari;
Pages: 1627 - 1633
Abstract: Speech-based machine learning (ML) has been heralded as a promising solution for tracking prosodic and spectrotemporal patterns in real-life that are indicative of emotional changes, providing a valuable window into one's cognitive and mental state. Yet, the scarcity of labelled data in ambulatory studies prevents the reliable training of ML models, which usually rely on “data-hungry” distribution-based learning. Leveraging the abundance of labelled speech data from acted emotions, this paper proposes a few-shot learning approach for automatically recognizing emotion in spontaneous speech from a small number of labelled samples. Few-shot learning is implemented via a metric learning approach through a siamese neural network, which models the relative distance between samples rather than relying on learning absolute patterns of the corresponding distributions of each emotion. Results indicate the feasibility of the proposed metric learning in recognizing emotions from spontaneous speech in four datasets, even with a small amount of labelled samples. They further demonstrate superior performance of the proposed metric learning compared to commonly used adaptation methods, including network fine-tuning and adversarial learning. Findings from this work provide a foundation for the ambulatory tracking of human emotion in spontaneous speech contributing to the real-life assessment of mental health degradation.
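To give a rough sense of how the metric-learning setup above makes few-shot predictions, the sketch below embeds samples with a small shared ("siamese") network and labels a query utterance with the emotion of its nearest labelled support sample in the embedding space. The architecture, feature size and nearest-neighbour decision rule are illustrative assumptions, not the exact model in the article.

```python
import torch
import torch.nn as nn

class SiameseEmbedder(nn.Module):
    """Shared embedding network for metric-based few-shot emotion recognition (sketch)."""
    def __init__(self, in_dim: int = 40, emb_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def predict_few_shot(embedder, support_x, support_y, query_x):
    """Assign each query the label of its nearest support embedding (1-nearest neighbour)."""
    with torch.no_grad():
        d = torch.cdist(embedder(query_x), embedder(support_x))  # pairwise embedding distances
        return support_y[d.argmin(dim=1)]

# toy usage: 8 labelled spontaneous-speech samples (2 per emotion), 3 unlabelled queries
embedder = SiameseEmbedder()
support_x, support_y = torch.randn(8, 40), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(predict_few_shot(embedder, support_x, support_y, torch.randn(3, 40)))
```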
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Survey of Deep Representation Learning for Speech Emotion Recognition
-
Authors: Siddique Latif;Rajib Rana;Sara Khalifa;Raja Jurdak;Junaid Qadir;Björn Schuller;
Pages: 1634 - 1654
Abstract: Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning where hierarchical representations are automatically learned in a data-driven manner. This article presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques, related challenges and identify important future areas of research. Our survey bridges the gap in the literature since existing surveys either focus on SER with hand-engineered features or representation learning in the general setting without focusing on SER.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- A Review of Affective Computing Research Based on
Function-Component-Representation Framework-
Authors: Haiwei Ma;Svetlana Yarosh;
Pages: 1655 - 1674
Abstract: Affective computing (AC), a field that bridges the gap between human affect and computational technology, has witnessed remarkable technical advancement. However, theoretical underpinnings of affective computing are rarely discussed and reviewed. This paper provides a thorough conceptual analysis of the literature to understand theoretical questions essential to affective computing and current answers. Inspired by emotion theories, we proposed the function-component-representation (FCR) framework to organize different conceptions of affect along three dimensions that each address an important question: function of affect (why compute affect), component of affect (how to compute affect), and representation of affect (what affect to compute). We coded each paper by its underlying conception of affect and found preferences towards affect detection, behavioral component, and categorical representation. We also observed coupling of certain conceptions. For example, papers using the behavioral component tend to adopt the categorical representation, whereas papers using the physiological component tend to adopt the dimensional representation. The FCR framework is not only the first attempt to organize different theoretical perspectives in a systematic and quantitative way, but also a blueprint to help conceptualize an AC project and pinpoint new possibilities. Future work may explore how the identified frequencies of FCR framework combinations may be applied in practice.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- Automatic Emotion Recognition in Clinical Scenario: A Systematic Review of
Methods-
Authors: Lucia Pepa;Luca Spalazzi;Marianna Capecci;Maria Gabriella Ceravolo;
Pages: 1675 - 1695
Abstract: BACKGROUND - Automatic emotion recognition offers powerful and interesting opportunities in the clinical field, but several critical aspects are still open, such as the heterogeneity of methodologies and technologies tested mainly on healthy people. This systematic review aims to survey automatic emotion recognition systems applied in real clinical contexts (i.e., on a population of people with a pathology). METHODS - The literature review was conducted on the following scientific databases: IEEE Xplore®, ScienceDirect®, Scopus®, PubMed®, ACM®. Inclusion criteria were the presence of an automatic emotion recognition algorithm and the enrollment of at least 2 patients in the experimental protocol. The review process followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Moreover, the works were analysed according to a reference model in the form of a class diagram, to highlight the most important clinical and technical aspects and the relationships among them. RESULTS - 52 scientific papers passed the inclusion criteria. Based on our findings, most clinical applications involved neuro-developmental, neurological and psychiatric disorders with the aims of diagnosing, monitoring, or treating emotional symptoms. The study design seems to be mostly related to the aim of the study (it is generally observational for monitoring and diagnosis, interventional for treatment), the most adopted signals are video and audio, and supervised shallow learning emerged as the most used approach for the emotion recognition algorithm. DISCUSSION - Small samples, the absence of a control group, and the lack of tests in real-life conditions emerged as important clinical limitations. From a technical point of view, a great heterogeneity of performance metrics, datasets and algorithms challenges the comparability, robustness, reliability and reproducibility of results. Suggested guidelines are identified and discussed to help the scientific community overcome these limitations and provide direction for future works.
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-
- A Re-Analysis and Synthesis of Data on Affect Dynamics in Learning
-
Authors: Shamya Karumbaiah;Ryan S. Baker;Jaclyn Ocumpaugh;Juliana Ma. Alexandra L. Andres;
Pages: 1696 - 1710
Abstract: Affect dynamics, the study of how affect develops and manifests over time, has become a popular area of research in affective computing for learning. In this article, we first provide a detailed analysis of prior affect dynamics studies, elaborating both their findings and the contextual and methodological differences between these studies. We then address methodological concerns that have not been previously addressed in the literature, discussing how various edge cases should be treated. Next, we present mathematical evidence that several past studies applied the transition metric (L) incorrectly - leading to invalid conclusions of statistical significance - and provide a corrected method. Using this corrected analysis method, we reanalyze ten past affect datasets collected in diverse contexts and synthesize the results, determining that the findings do not match the most popular theoretical model of affect dynamics. Instead, our results highlight the need to focus on cultural factors in future affect dynamics research.
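The transition metric mentioned above is usually written as the likelihood L of moving from affective state A to state B beyond what B's base rate alone would predict; a commonly cited form is L(A→B) = (P(B|A) − P(B)) / (1 − P(B)). The helper below computes that commonly cited form from a sequence of affect labels. It follows the textbook definition only and does not reproduce the corrected procedure or edge-case handling proposed in the article.

```python
from collections import Counter

def transition_L(labels, state_a, state_b):
    """Commonly cited affect-transition likelihood L(A -> B).

    L = (P(next=B | prev=A) - P(next=B)) / (1 - P(next=B)).
    Textbook form only; the edge cases and corrections discussed in the article
    (e.g., how self-transitions and rare states are treated) are not reproduced here.
    """
    pairs = list(zip(labels[:-1], labels[1:]))
    next_counts = Counter(nxt for _, nxt in pairs)
    p_b = next_counts[state_b] / len(pairs)               # base rate of B as a "next" state
    from_a = [nxt for prev, nxt in pairs if prev == state_a]
    if not from_a or p_b == 1.0:
        return float("nan")                               # L is undefined in these cases
    p_b_given_a = sum(nxt == state_b for nxt in from_a) / len(from_a)
    return (p_b_given_a - p_b) / (1.0 - p_b)

sequence = ["confusion", "frustration", "engagement", "confusion", "frustration", "boredom"]
print(transition_L(sequence, "confusion", "frustration"))
```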
PubDate: April-June 1 2023
Issue No: Vol. 14, No. 2 (2023)
-