Abstract: Image memes, and specifically their widely known variation, image macros, are a special new media type that combines text with images; they are used in social media to playfully or subtly express humor, irony, sarcasm and even hate. It is important to accurately retrieve image memes from social media to better capture the cultural and social aspects of online phenomena and detect potential issues (hate-speech, disinformation). Essentially, the background image of an image macro is a regular image that humans easily recognize as such, but that machines struggle with because its feature maps closely resemble those of the complete image macro. Hence, accumulating suitable feature maps in such cases can lead to a deeper understanding of the notion of image memes. To this end, we propose a methodology, called visual part utilization, that uses the visual part of image memes as instances of the regular image class and the initial image memes as instances of the image meme class, forcing the model to concentrate on the critical parts that characterize an image meme. Additionally, we employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts and make the predictions interpretable. Several training and test scenarios involving web-scraped regular images with controlled text presence are considered for evaluating the model in terms of robustness and accuracy. The findings indicate that light visual part utilization combined with sufficient text presence during training provides the best and most robust model, surpassing the state of the art. Source code and dataset are available at https://github.com/mever-team/memetector. PubDate: 2023-05-13
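As an illustration of the kind of trainable attention the abstract places on top of a ViT backbone, the sketch below pools the patch tokens with a learned softmax weighting whose weights double as an interpretability map. The token count, embedding size, module names, and the single-linear scoring head are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class AttentionPooledViTHead(nn.Module):
    """Softmax attention over ViT patch tokens; the weights double as a saliency map."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)              # trainable per-token attention score
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (batch, num_patches, embed_dim) from a ViT backbone
        weights = torch.softmax(self.score(patch_tokens), dim=1)   # (B, N, 1)
        pooled = (weights * patch_tokens).sum(dim=1)               # (B, D)
        logits = self.classifier(pooled)                           # meme vs. regular image
        return logits, weights.squeeze(-1)

head = AttentionPooledViTHead()
tokens = torch.randn(4, 196, 768)        # e.g. 14x14 patches from a 224x224 input
logits, attn = head(tokens)              # attn can be reshaped to the patch grid for visualization
```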
Abstract: The human brain can process sound and visual information in overlapping areas of the cerebral cortex, which means that audio and visual information are deeply correlated with each other when we explore the world. To simulate this function of the human brain, audio-visual event retrieval (AVER) has been proposed. AVER is about using data from one modality (e.g., audio data) to query data from another. In this work, we aim to improve the performance of audio-visual event retrieval. To achieve this goal, we first propose a novel network, InfoIIM, which enhances the accuracy of intra-modal feature representation and inter-modal feature alignment. The backbone of this network is a parallel connection of two VAE models with two different encoders and a shared decoder. Secondly, to enable the VAE to learn better feature representations and to improve intra-modal retrieval performance, we use InfoMax-VAE instead of the vanilla VAE model. Additionally, we study the influence of modality-shared features on the effectiveness of audio-visual event retrieval. To verify the effectiveness of our proposed method, we validate our model on the AVE dataset, and the results show that our model outperforms several existing algorithms on most of the metrics. Finally, we present our future research directions, hoping to inspire relevant researchers. PubDate: 2023-05-04
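A minimal skeleton of the backbone the abstract describes, namely two modality-specific encoders feeding a shared decoder, is sketched below. The feature dimensions, layer sizes, and the choice of what the shared decoder reconstructs are assumptions; the InfoMax regularizer of InfoMax-VAE is omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedDecoderVAE(nn.Module):
    """Two modality-specific encoders feeding one shared decoder (InfoIIM-style skeleton)."""

    def __init__(self, audio_dim: int = 128, visual_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
        # what the shared decoder reconstructs is an assumption (here, the visual feature)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, visual_dim))

    @staticmethod
    def reparameterize(stats: torch.Tensor):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        z_a, mu_a, lv_a = self.reparameterize(self.audio_enc(audio))
        z_v, mu_v, lv_v = self.reparameterize(self.visual_enc(visual))
        # decoding both latents with the same weights pushes them toward a common, alignable space
        return self.decoder(z_a), self.decoder(z_v), (mu_a, lv_a), (mu_v, lv_v)
```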
Abstract: Visual-aware personalized recommendation systems can estimate potential demand by evaluating consumers' personalized preferences. In general, consumer feedback data is deduced from either explicit feedback or implicit feedback. However, both explicit and implicit feedback raise the chance of malicious operation or misoperation, which can lead to deviations in recommended outcomes. Adversarial learning, a regularization approach that can resist disturbances, is therefore a promising choice for enhancing model resilience. We propose a novel adversarial collaborative filtering with aesthetics (ACFA) model for visual recommendation that utilizes adversarial learning to improve resilience and performance in the presence of perturbations. The ACFA algorithm applies three types of input to visual Bayesian personalized ranking: negative, unobserved, and positive feedback. Through feedback at these various levels, it uses a probabilistic approach to obtain consumers' personalized preferences. Since aesthetic data are critical in visual recommendation for determining consumer preferences on products, we construct the consumer personalized preference model with aesthetic elements and then use them to enhance the sampling quality when training the algorithm. To mitigate the negative effects of feedback noise, we use minimax adversarial learning to optimize the ACFA objective function. Experiments using two datasets demonstrate that the ACFA model outperforms state-of-the-art algorithms on two metrics. PubDate: 2023-04-29
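The following sketch illustrates the general minimax idea of adversarially regularized BPR that the abstract builds on (in the spirit of adversarial personalized ranking); it is not the ACFA objective itself, and the aesthetic features, feedback levels, epsilon, and regularization weight are placeholders.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Bayesian personalized ranking: positive items should score above negative ones."""
    diff = (user_emb * pos_item_emb).sum(-1) - (user_emb * neg_item_emb).sum(-1)
    return -F.logsigmoid(diff).mean()

def adversarial_bpr_loss(user_emb, pos_item_emb, neg_item_emb, epsilon=0.5, reg=1.0):
    """Minimax objective: perturb item embeddings in the loss-increasing (gradient)
    direction and require the ranking to stay correct under that perturbation."""
    pos = pos_item_emb.detach().clone().requires_grad_(True)
    neg = neg_item_emb.detach().clone().requires_grad_(True)
    grad_pos, grad_neg = torch.autograd.grad(
        bpr_loss(user_emb.detach(), pos, neg), [pos, neg])
    delta_pos = epsilon * F.normalize(grad_pos, dim=-1)   # worst-case perturbation direction
    delta_neg = epsilon * F.normalize(grad_neg, dim=-1)
    adv = bpr_loss(user_emb, pos_item_emb + delta_pos, neg_item_emb + delta_neg)
    return bpr_loss(user_emb, pos_item_emb, neg_item_emb) + reg * adv
```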
Abstract: The textural and structural information contained in the images is very important for generating highly discriminative features for the task of image retrieval. Morphological operations are nonlinear mathematical operations that can provide such textural and structural information. In this work, a new residual block based on a module using morphological operations coupled with an edge extraction module is proposed. A novel pooling operation focusing on the edges of the images is also proposed. A deep convolutional network is then designed using the proposed residual block and the new pooling operation that significantly improves its representational capacity. Extensive experiments are carried out to show the effectiveness of the ideas used in the design of the proposed deep image retrieval network. The proposed network is shown to significantly outperform existing state-of-the-art image retrieval networks on various benchmark datasets. PubDate: 2023-04-26
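For readers unfamiliar with morphological operations inside a deep network, the sketch below approximates grayscale dilation and erosion with max pooling over a flat structuring element and uses their difference (the morphological gradient) as an edge cue. The kernel size is arbitrary, and the paper's actual residual block and edge-focused pooling may use trainable structuring elements rather than this fixed approximation.

```python
import torch
import torch.nn.functional as F

def dilation(x: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Grayscale dilation approximated with max pooling (flat structuring element)."""
    return F.max_pool2d(x, kernel_size, stride=1, padding=kernel_size // 2)

def erosion(x: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Grayscale erosion: min pooling, expressed as negated max pooling."""
    return -F.max_pool2d(-x, kernel_size, stride=1, padding=kernel_size // 2)

def morphological_gradient(x: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Dilation minus erosion highlights edges, the structural cue the block exploits."""
    return dilation(x, kernel_size) - erosion(x, kernel_size)

feature_map = torch.randn(1, 64, 56, 56)
edges = morphological_gradient(feature_map)   # same spatial size as the input
```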
Abstract: Alzheimer's disease (AD) is one of the most severe kinds of dementia affecting the elderly population. Since the disease is incurable and the changes in brain sub-regions start decades before symptoms are observed, early detection is especially challenging. Discriminating similar brain patterns for AD classification is difficult, as minute changes in biomarkers must be detected across different neuroimaging modalities and image projections. Deep learning models have provided excellent performance in analyzing various neuroimaging and clinical data. In this survey, we performed a comparative analysis of 134 papers published between 2017 and 2022 to gain a 360° view of the AD problem and of the work done to examine and analyze its contributing factors. Different pre-processing tools and techniques, various datasets, and the brain sub-regions mainly affected by AD are reviewed. The survey further provides a deep analysis of various biomarkers, feature extraction techniques, and deep learning and machine learning architectures. The latest research articles with valuable findings are summarized in multiple tables. Biomarkers, pre-processing techniques, and AD detection methods are classified in figures, and a stage-based classification of AD, showing the difference in accuracy between binary and multi-class settings, is presented in a table. We conclude the paper by addressing some challenges faced during classification and provide recommendations that can be considered for future research in diagnosing the various stages of AD. PubDate: 2023-03-17
Abstract: Playing a vitally important role in the operation of intelligent video surveillance systems and smart cities, video anomaly detection (VAD) has been widely practiced and studied in both industry and academia. In the present study, a new anomaly detection method based on multi-level memory embedding is proposed. In the proposed method, the feature prototypes of the samples are stored in a memory pool, which enhances the diversity of the sample feature prototype paradigm. Besides, the memory is embedded in the decoder in a hierarchical, integrating manner, which makes the feature information of the object more complete and improves the quality of the features. At the end of the model, the channel relationships between the object features are modeled in the channel dimension, making the model capable of more efficient anomaly detection. The method is verified through evaluation on three publicly available datasets: UCSD Ped2, CUHK Avenue, and ShanghaiTech. PubDate: 2023-03-15 DOI: 10.1007/s13735-023-00272-x
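A minimal sketch of a memory pool of feature prototypes addressed by cosine similarity (in the spirit of memory-augmented autoencoders) is given below; the slot count, feature dimension, and single-level read-out are assumptions and do not reproduce the paper's hierarchical, multi-level embedding into the decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """Memory pool of learned feature prototypes addressed by cosine similarity."""

    def __init__(self, num_slots: int = 100, feat_dim: int = 256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, feat_dim) encoder features of a frame or clip
        attn = torch.softmax(
            F.normalize(query, dim=-1) @ F.normalize(self.memory, dim=-1).t(), dim=-1)
        # read-out reconstructed from stored "normal" prototypes; anomalies reconstruct poorly
        return attn @ self.memory
```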
Abstract: Background subtraction is one of the most highly regarded steps in computer vision, especially in video surveillance applications. Although various approaches have been proposed to cope with the different difficulties of this field, many of these methods have not been able to fully tackle complicated situations in realistic scenes due to their sensitivity to many challenges. This paper presents a deep nested background subtraction algorithm based on residual micro-autoencoder blocks. Our method is implemented as a U-Net-like architecture with additional skip connections. The nested network uses residual connections between these micro-autoencoders, which can extract significant multi-scale features of a complex scene. We also show that the proposed method works in various challenging situations. A small set of training samples is used to train this end-to-end network. The experimental results demonstrate that our model outperforms other state-of-the-art methods on two well-known benchmark datasets: CDNet 2014 and SBI 2015. PubDate: 2023-03-07 DOI: 10.1007/s13735-023-00270-z
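A possible shape of a residual micro-autoencoder block, i.e. a small encoder-decoder wrapped in a skip connection, is sketched below; the channel count, layer choices, and the assumption of an even input resolution are illustrative only.

```python
import torch
import torch.nn as nn

class ResidualMicroAutoencoder(nn.Module):
    """A small encoder-decoder with a residual (skip) connection around it."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection around the micro-autoencoder (input size assumed even)
        return x + self.decoder(self.encoder(x))

block = ResidualMicroAutoencoder(64)
out = block(torch.randn(1, 64, 128, 128))   # output keeps the input resolution
```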
Abstract: Self-attention-based image captioning models suffer from the loss of visual features' spatial information; introducing relative position encoding can alleviate the problem to some extent, but it brings additional parameters and greater computational complexity. To solve the above problem, we propose a novel local–global MLFormer (LG-MLFormer) with a specifically designed encoder module, the local–global multi-layer perceptron (LG-MLP). The LG-MLP can capture the latent correlations between different images, and its linear stacking calculation mode reduces computational complexity. It consists of two independent local MLP (LM) modules and a cross-domain global MLP (CDGM) module. The LM specially designs the mapping dimension between linear layers to realize self-compensation of the visual features' spatial information without introducing relative position encoding. The CDGM module aggregates cross-domain potential correlations between grid-based features and region-based features to realize the complementary advantages of these global and local semantic associations. Experiments on the Karpathy test split and the online test server reveal that our approach provides superior or comparable performance to the state of the art (SOTA). Trained models and code for reproducing the experiments are publicly available at: https://github.com/wxx1921/LGMLFormer-local-and-global-mlp-for-image-captioning. PubDate: 2023-02-23 DOI: 10.1007/s13735-023-00266-9
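The token-mixing sketch below conveys how linear layers applied across the token dimension can stand in for self-attention over grid features without an explicit relative position encoding; the mapping dimensions and normalization placement are assumptions rather than the exact LM module.

```python
import torch
import torch.nn as nn

class LocalMLP(nn.Module):
    """Token-mixing MLP over grid features; linear layers stand in for self-attention."""

    def __init__(self, num_tokens: int = 49, embed_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        # mixing across the token dimension lets spatial relations be recovered
        # without introducing a relative position encoding
        self.token_mix = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, embed_dim), e.g. a flattened 7x7 feature grid
        y = self.token_mix(self.norm(x).transpose(1, 2)).transpose(1, 2)
        return x + y
```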
Abstract: Video-Text Retrieval (VTR) aims to search for the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task is composed of four successive steps: video and textual feature representation extraction, feature embedding and matching, and objective functions. In the last step, a list of samples retrieved from the dataset is ranked based on their matching similarities to the query. In recent years, significant and flourishing progress has been achieved with deep learning techniques; however, VTR is still a challenging task due to problems such as how to learn an efficient spatio-temporal video feature and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, demonstrate state-of-the-art performance on several commonly benchmarked datasets, and discuss potential challenges and directions, with the expectation to provide some insights for researchers in the field of video-text retrieval. PubDate: 2023-02-23 DOI: 10.1007/s13735-023-00267-8
Abstract: As multi-modal data proliferates, people are no longer content with a single mode of data retrieval for access to information. Deep hashing retrieval algorithms have attracted much attention for their advantages of efficient storage and fast query speed. Currently, existing unsupervised hashing methods generally have two limitations: (1) They fail to adequately capture the latent semantic relevance and coexistent information from the different modality data, resulting in the lack of effective feature and hash encoding representations to bridge the heterogeneous and semantic gaps in multi-modal data. (2) They typically construct a similarity matrix to guide the hash code learning, which suffers from inaccurate similarity problems, resulting in sub-optimal retrieval performance. To address these issues, we propose a novel CLIP-based fusion-modal reconstructing hashing method for large-scale unsupervised cross-modal retrieval. First, we use CLIP to encode cross-modal features of visual modalities, and learn the common representation space of the hash code using modality-specific autoencoders. Second, we propose an efficient fusion approach to construct a semantically complementary affinity matrix that can maximize the potential semantic relevance of different modal instances. Furthermore, to retain the intrinsic semantic similarity of all similar pairs in the learned hash codes, an objective function for similarity reconstruction based on semantic complementation is designed to learn high-quality hash code representations. Extensive experiments were carried out on four multi-modal benchmark datasets (WIKI, MIRFLICKR, NUS-WIDE, and MS COCO), and the proposed method achieves state-of-the-art image-text retrieval performance compared to several representative unsupervised cross-modal hashing methods. PubDate: 2023-02-22 DOI: 10.1007/s13735-023-00268-7
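A hedged sketch of what a similarity-reconstruction objective for relaxed hash codes can look like is given below: the cosine similarities of the learned codes are regressed onto a fused affinity matrix. The particular combination of cross-modal and intra-modal terms and the use of an MSE criterion are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def similarity_reconstruction_loss(img_codes, txt_codes, affinity):
    """Align the similarity structure of relaxed hash codes with a fused affinity matrix."""
    # img_codes, txt_codes: (batch, code_len) real-valued relaxations (e.g. tanh outputs)
    # affinity: (batch, batch) semantically complementary similarity matrix in [-1, 1]
    b_i = F.normalize(img_codes, dim=-1)
    b_t = F.normalize(txt_codes, dim=-1)
    cross = b_i @ b_t.t()          # cross-modal code similarity
    intra_i = b_i @ b_i.t()        # intra-modal (image) code similarity
    intra_t = b_t @ b_t.t()        # intra-modal (text) code similarity
    return (F.mse_loss(cross, affinity)
            + F.mse_loss(intra_i, affinity)
            + F.mse_loss(intra_t, affinity))
```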
Abstract: Human activity recognition is a theme commonly explored in computer vision. Its applications span various domains, including monitoring systems, video processing, robotics, and the healthcare sector. Activity recognition is a difficult task since there are structural changes among subjects, as well as inter-class and intra-class correlations between activities. As a result, a continuous intelligent control system that detects human behavior while grouping the maximum amount of information is necessary. Therefore, in this paper, a novel automatic system to identify human activity on the UTKinect dataset is implemented using the residual learning-based network ResNet-50 and transfer learning to represent more complicated features and improve model robustness. The experimental results show an excellent generalization capability when tested on the validation set, with a high accuracy of 98.60% and a loss of 0.02. Comparison with other state-of-the-art models indicates the efficiency of the designed residual learning-based system. PubDate: 2023-02-19 DOI: 10.1007/s13735-023-00269-6
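A minimal transfer-learning recipe of the kind described, i.e. an ImageNet-pretrained ResNet-50 with a re-trained classification head, might look as follows; the frozen backbone, class count, and optimizer settings are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-50 (torchvision >= 0.13; older versions use pretrained=True)
backbone = models.resnet50(weights="IMAGENET1K_V2")
for param in backbone.parameters():
    param.requires_grad = False                      # freeze the residual backbone

# Replace the classifier head; 10 activity classes is an assumed count for UTKinect.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-4)
```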
Abstract: Scene classification is a mature yet still active computer vision task that remains challenging due to the inherent ambiguity of visual scenes. The scene classification task aims to classify visual scene images into predefined categories based on the ambient content, the objects, and the layout of the images. Inspired by human visual scene understanding, visual scenes can be divided into two categories: (1) object-based scenes, which consist of scene-specific objects and can be understood through those objects, and (2) layout-based scenes, which are understandable based on the layout and the ambient content of the scene images. Scene-specific objects semantically help to understand object-based scenes, whereas the layout and the ambient content are effective in understanding layout-based scenes by representing the visual appearance of the scene images. Hence, one of the main challenges in scene classification is to create a discriminative representation that can provide a high-level perception of visual scenes. Accordingly, we present a discriminative hybrid representation of visual scenes, in which semantic features extracted from scene-specific objects are fused with visual features extracted from a deep CNN. The proposed scene representation method is used for the scene classification task and is applied to three benchmark scene datasets: MIT67, SUN397, and UIUC Sports. Moreover, a new scene dataset, called "Scene40," is introduced, and the results of our proposed method are also reported on it. Experimental results show that our proposed method achieves remarkable performance in the scene classification task and outperforms state-of-the-art methods. PubDate: 2022-12-01 DOI: 10.1007/s13735-022-00246-5
Abstract: Representation learning of knowledge graphs has emerged as a powerful technique for various downstream tasks. In recent years, numerous research efforts have been devoted to knowledge graph embedding. However, previous approaches usually have difficulty dealing with complex multi-relational knowledge graphs due to their shallow network architectures. In this paper, we propose a novel framework named Transformers with Contrastive learning for Knowledge Graph Embedding (TCKGE), which aims to learn complex semantics in multi-relational knowledge graphs with deep architectures. To effectively capture the rich semantics of knowledge graphs, our framework leverages powerful Transformers to build a deep hierarchical architecture that dynamically learns the embeddings of entities and relations. To obtain more robust knowledge embeddings with our deep architecture, we design a contrastive learning scheme that facilitates optimization by exploring the effectiveness of several different data augmentation strategies. The experimental results on two benchmark datasets show the superiority of TCKGE over state-of-the-art models. PubDate: 2022-11-27 DOI: 10.1007/s13735-022-00256-3
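The contrastive component can be pictured as a standard InfoNCE objective over two augmented views of the same entity or relation embedding, as sketched below; the temperature and in-batch negative sampling are assumptions, and the Transformer encoder itself is omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: two augmented views of the same embedding are positives,
    all other embeddings in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)   # matching indices are the positives
    return F.cross_entropy(logits, targets)
```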
Abstract: Micro-expressions can convey feelings that people are trying to hide. Most existing studies on micro-expression recognition use only the temporal or spatial information in the image and neglect its intrinsic features. To address this, we detect the subject's heart rate in long micro-expression videos; we extract the image's spatio-temporal features through a spatio-temporal network and then extract the heart rate feature using a heart rate network. A multimodal learning method that combines the heart rate and spatio-temporal features is used to recognize micro-expressions. The experimental results on CASMEII, SAMM, and SMIC show that the proposed method's unweighted F1-score and unweighted average recall are 0.8867 and 0.8962, respectively. The spatio-temporal fusion network combined with heart rate information provides an essential reference for multimodal approaches to micro-expression recognition. PubDate: 2022-10-09 DOI: 10.1007/s13735-022-00250-9
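A simple late-fusion head of the kind the multimodal method implies, concatenating spatio-temporal and heart-rate features before classification, is sketched below; the feature dimensions, class count, and fusion-by-concatenation choice are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFusionHead(nn.Module):
    """Concatenate spatio-temporal and heart-rate features before classification."""

    def __init__(self, st_dim: int = 512, hr_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(st_dim + hr_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes))

    def forward(self, st_feat: torch.Tensor, hr_feat: torch.Tensor) -> torch.Tensor:
        # st_feat: (batch, st_dim) from the spatio-temporal network
        # hr_feat: (batch, hr_dim) from the heart rate network
        return self.classifier(torch.cat([st_feat, hr_feat], dim=-1))
```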
Abstract: Estimating the preferences of consumers is of utmost importance for the fashion industry as appropriately leveraging this information can be beneficial in terms of profit. Trend detection in fashion is a challenging task due to the fast pace of change in the fashion industry. Moreover, forecasting the visual popularity of new garment designs is even more demanding due to the lack of historical data. To this end, we propose MuQAR, a Multimodal Quasi-AutoRegressive deep learning architecture that combines two modules: (1) a multimodal multilayer perceptron processing categorical, visual and textual features of the product and (2) a Quasi-AutoRegressive neural network modelling the "target" time series of the product's attributes along with the "exogenous" time series of all other attributes. We utilize computer vision techniques, namely image classification and image captioning, to automatically extract visual features and textual descriptions from the images of new products. Product design in fashion is initially expressed visually, and these features represent the products' unique characteristics without interfering with the creative process of their designers by requiring additional inputs (e.g. manually written texts). We employ the product's target attribute time series as a proxy of temporal popularity patterns, mitigating the lack of historical data, while exogenous time series help capture trends among interrelated attributes. We perform an extensive ablation analysis on two large-scale image fashion datasets, Mallzee-P and SHIFT15m, to assess the adequacy of MuQAR and also use the Amazon Reviews: Home and Kitchen dataset to assess generalization to other domains. A comparative study on the VISUELLE dataset shows that MuQAR is capable of competing with and surpassing the domain's current state of the art by 4.65% and 4.8% in terms of WAPE and MAE, respectively. PubDate: 2022-10-08 DOI: 10.1007/s13735-022-00262-5
Abstract: Image–text retrieval is a challenging task due to the requirement of thorough multimodal understanding and precise inter-modality relationship discovery. However, most previous approaches resort to doing global image–text alignment and neglect fine-grained correspondence. Although some works explore local region–word alignment, they usually suffer from a heavy computing burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval by jointly performing the fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter helps perceive hierarchical global semantics by exploring multi-scale global correlations between the image and text. Overall, the local and global alignment modules can boost their performances for each other via the unified model. Quantitative and qualitative experimental results on Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods. PubDate: 2022-10-06 DOI: 10.1007/s13735-022-00258-1
Abstract: Scene text recognition is a challenging task in computer vision due to the significant differences in text appearance, such as image distortion and rotation. However, linguistic priors help individuals reason about text in images even if some characters are missing or blurry. This paper investigates the fusion of visual cues and linguistic dependencies to boost recognition performance. We introduce a relational attention module to leverage visual patterns and word representations. We embed linguistic dependencies from a language model into the optimization framework to ensure that the predicted sequence captures the contextual dependencies within a word. We propose a dual mutual attention transformer that promotes cross-modality interactions such that the inter- and intra-correlations between the visual and linguistic modalities can be fully explored. The introduced gate function enables the model to learn to determine the contribution of each modality and further boosts model performance. Extensive experiments demonstrate that our method enhances the recognition performance on low-quality images and achieves state-of-the-art performance on datasets of texts from regular and irregular scenes. PubDate: 2022-10-06 DOI: 10.1007/s13735-022-00253-6
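The gate function mentioned in the abstract can be pictured as a learned sigmoid gate that blends the two modalities per position, as in the hedged sketch below; the dimensions and the element-wise blending scheme are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    """A learned gate decides how much each modality contributes at each position."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # visual, linguistic: (batch, seq_len, dim) aligned character-level features
        g = self.gate(torch.cat([visual, linguistic], dim=-1))
        return g * visual + (1.0 - g) * linguistic
```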
Abstract: Attention mechanisms and grid features are widely used in current vision-language tasks like image captioning. The attention scores are the key factor in the success of the attention mechanism. However, since the Transformer is a hierarchical structure, the connection between attention scores in different layers is not strong enough. Additionally, geometric information is inevitably lost when grid features are flattened to be fed into a transformer model. Therefore, bias scores encoding geometric position information should be added to the attention scores. Considering that there are three different kinds of attention modules in the transformer architecture, we build three independent paths (residual attention paths, RAPs) to propagate the attention scores from the previous layer as a prior for attention computation. This operation acts like a residual connection between attention scores, which strengthens the connection and lets each attention layer obtain a global comprehension. Then, we replace the traditional attention module with a novel residual attention with relative position module in the encoder to incorporate relative position scores into the attention scores. Residual attention may increase internal covariate shift. To optimize the data distribution, we introduce a residual attention with layer normalization on query vectors module in the decoder. Finally, we build our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model achieves performance competitive with state-of-the-art models on the MSCOCO benchmark. We gain 135.8% CIDEr on the MS COCO "Karpathy" offline test split and 135.3% CIDEr on the online testing server. PubDate: 2022-10-06 DOI: 10.1007/s13735-022-00260-7
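A residual attention path can be sketched as passing the previous layer's pre-softmax attention scores into the current layer as an additive prior, optionally together with a relative-position bias; the function below is a generic illustration of that idea, not the exact Tri-RAT module.

```python
import torch
import torch.nn.functional as F

def residual_attention(q, k, v, prev_scores=None, rel_pos_bias=None):
    """Scaled dot-product attention with the previous layer's scores (and an optional
    relative-position bias) added as a prior before the softmax."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5     # (batch, heads, N, N)
    if prev_scores is not None:
        scores = scores + prev_scores                        # residual attention path
    if rel_pos_bias is not None:
        scores = scores + rel_pos_bias                       # geometric prior for grid features
    out = F.softmax(scores, dim=-1) @ v
    return out, scores                                       # pass scores on to the next layer
```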
Abstract: The computer vision community considers pose-invariant face recognition (PIFR) one of the most challenging applications. Many works have been devoted to enhancing face recognition performance on profile samples. They have mainly focused on 2D- and 3D-based frontalization techniques that try to synthesize frontal views from profile ones. In the same context, we propose in this paper a new 2D PIFR technique based on Generative Adversarial Network image translation. The GAN used is the paired Pix2Pix architecture, covering many generator and discriminator models that are comprehensively evaluated on a new benchmark proposed in this paper, referred to as the Combined-PIFR database, which is composed of four datasets that provide profile images and their corresponding frontal ones. The paired architecture we use is based on computing the L1 distance between the generated image and the ground-truth one (pairs). Therefore, both the generator and discriminator architectures are paired ones. The Combined-PIFR database is partitioned respecting person-independent constraints to fairly evaluate our proposed framework's frontalization and classification sub-systems. Thanks to the GAN-based frontalization, the recorded results demonstrate an important improvement of 33.57% compared to the baseline. PubDate: 2022-09-15 DOI: 10.1007/s13735-022-00249-2
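For reference, the paired Pix2Pix objective that such frontalization builds on combines a conditional adversarial term with an L1 reconstruction term against the ground-truth frontal image; the sketch below uses the standard Pix2Pix weighting (lambda = 100), while the discriminator signature and channel-wise conditioning are assumptions about the setup.

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(discriminator, profile, generated_frontal, real_frontal,
                           lambda_l1: float = 100.0) -> torch.Tensor:
    """Paired (Pix2Pix-style) generator objective: fool the conditional discriminator
    while staying close to the ground-truth frontal view in L1 distance."""
    # the discriminator sees the input profile concatenated with the candidate frontal image
    pred_fake = discriminator(torch.cat([profile, generated_frontal], dim=1))
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    l1 = F.l1_loss(generated_frontal, real_frontal)
    return adv + lambda_l1 * l1
```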