Abstract: Searching for unseen objects in extensive visual archives is challenging, demanding efficient indexing methods that can support meaningful similarity retrieval. This paper presents the Stratified Graph (SG) approach for indexing similar deep descriptors by sorting them into distance-sensitive layers. The indexing algorithm incrementally constructs a bi-directional m-nearest-neighbor graph within each layer, with additional 1-nearest-neighbor links from outer layers, providing a distance-scaling property in the graph structure. The search process starts from the innermost layer; same-layer neighbors enhance Average Recall (AR), while the distance-scaling property enhances search speed, maintaining logarithmic complexity scaling. We compare SG with six state-of-the-art retrieval methods on four deep-descriptor and two classical-descriptor databases and show that SG indexing and search uses up to four times less memory, while Mean Average Precision and AR improve by up to 8% over the state of the art on all six datasets at five retrieval depths. PubDate: 2024-08-07
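To make the layered-search idea above easier to picture, here is a minimal, hedged sketch of greedy search over a stack of neighbor graphs, assuming a precomputed adjacency-dict representation; the names (`greedy_layered_search`, `layers`, `entry_point`) are illustrative, and this is not the paper's actual SG algorithm.

```python
import numpy as np

def greedy_layered_search(query, vectors, layers, entry_point, k=10):
    """Illustrative greedy search over distance-sensitive layers.

    `layers` is a list (innermost layer first) of adjacency dicts:
    node id -> list of neighbor ids. The search starts at the innermost
    layer and repeatedly moves to a closer neighbor; this only mirrors
    the general idea, not the paper's exact procedure."""
    def dist(i):
        return np.linalg.norm(vectors[i] - query)

    current = entry_point
    for adjacency in layers:                     # innermost layer first
        improved = True
        while improved:
            improved = False
            for nb in adjacency.get(current, []):
                if dist(nb) < dist(current):     # move to a closer neighbor
                    current, improved = nb, True
    # collect candidates around the final node in the last layer visited
    candidates = {current, *layers[-1].get(current, [])}
    return sorted(candidates, key=dist)[:k]
```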
Abstract: In computer graphics, Content-Based Information Retrieval (CBIR) refers to a database system designed to take an object as input and return a list of similar objects. Originally intended for images, CBIR systems now extend to various types of data, including sounds and three-dimensional models represented as geometrical meshes, broadening their applicability beyond images. In the field of information retrieval, it is customary to interpret the “I” in CBIR as “Information,” and we adopt this interpretation throughout the text. To evaluate the similarity between 3D meshes, several techniques transform meshes into feature vectors and measure the distance between these vectors. In our research, we propose leveraging an algorithm rooted in compressive sensing theory to extract features from 3D meshes. Additionally, we introduce a prototype CBIR system for 3D meshes that uses an order relation, rather than a distance function, to assess the similarity between objects. We introduce an order on \({\mathbb {R}}^n\) called the Extended Lexicographical Order (ELO), designed to incorporate all information present in the vectors being compared. Our comparative analysis includes traditional distance functions as well as classical \({\mathbb {R}}^n\) order relations such as the lexicographical and revlex orders. Furthermore, we employ two types of descriptors: a spectral descriptor based on compressive sensing theory, which builds on previous work from our research group, and a spherical-harmonics-based descriptor, already established in the literature as a successful extractor for medical models. Across all experiments, our prototype consistently outperforms traditional techniques, showcasing its efficacy in CBIR applications. PubDate: 2024-08-02
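Since the abstract contrasts order relations with distance functions, the following sketch ranks database descriptors by a plain lexicographical order on coordinate-wise deviations; it only illustrates the general idea of retrieval via an order relation, not the paper's Extended Lexicographical Order, and `lex_rank` is an invented name.

```python
import numpy as np

def lex_rank(query, descriptors):
    """Rank descriptors by a plain lexicographical order on the element-wise
    deviation from the query. Tuple comparison in Python is lexicographical,
    so sorting by the deviation tuples yields the ranking."""
    diffs = np.abs(descriptors - query)          # (N, n) element-wise deviation
    order = sorted(range(len(descriptors)), key=lambda i: tuple(diffs[i]))
    return order

db = np.random.rand(50, 16)                       # toy descriptor database
ranking = lex_rank(db[0], db)                     # query with the first item
```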
Abstract: Given a natural language query, mining a relevant chart image, i.e., one that contains the answer to the query, is an overlooked problem in the literature. Our study explores this novel problem. Consider the example of retrieving relevant chart images for the query “Which Indian city has the highest annual rainfall over the past decade?”. Retrieving relevant chart images for such natural language queries necessitates a deep semantic understanding of chart images. Toward addressing this problem, we make two key contributions: (a) we present a dataset, WebCIRD (Web Chart Image Retrieval), for studying this problem, and (b) we propose a solution, ChartSemBERT, that offers a deeper semantic understanding of chart images for effective natural-language-to-chart-image retrieval. Our proposed approach yields remarkable performance improvements compared to existing baselines, achieving an R@10 of 86.9%. PubDate: 2024-07-29
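For reference, the reported R@10 follows the standard Recall@k definition for single-relevant-item retrieval; a minimal sketch with made-up toy data is shown below.

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_id, k=10):
    """Recall@k for single-relevant-item retrieval: 1 if the relevant chart
    appears among the top-k retrieved ids, else 0. Averaged over queries,
    this gives metrics such as the R@10 reported above."""
    return int(relevant_id in ranked_ids[:k])

# toy example: two queries with their ranked result lists and gold ids
queries = [([3, 7, 1, 9], 9), ([2, 5, 8, 4], 6)]
r_at_10 = np.mean([recall_at_k(r, rel, k=10) for r, rel in queries])
print(r_at_10)    # 0.5 on this toy data
```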
Abstract: 3D human motion prediction, i.e., predicting future human poses on the basis of historically observed motion sequences, is a core task in computer vision. Thus far, it has been successfully applied to both autonomous driving and human–robot interaction. Previous research has usually employed Recurrent Neural Network (RNN)-based models to predict future human poses. However, as previous works have amply demonstrated, RNN-based prediction models suffer from unrealistic and discontinuous predictions due to the accumulation of prediction errors. To address this, we propose a feed-forward, 3D skeleton-based model for human motion prediction. This model, the Spatial–Temporal Graph Convolutional Network (ST-GCN), automatically learns the spatial and temporal patterns of human motion from input sequences and overcomes the limitations of previous approaches. Specifically, our ST-GCN model is based on an encoder-decoder architecture. The encoder consists of 5 ST-GCN modules, each comprising a spatial GCN layer and a 2D convolution-based TCN layer, which together encode the spatio-temporal dynamics of human motion. Subsequently, the decoder, consisting of 5 TCN layers, exploits the encoded spatio-temporal representation of human motion to predict future human poses. We leveraged the ST-GCN model to perform extensive experiments on various large-scale 3D human pose datasets (Human3.6M, AMASS, 3DPW), adopting MPJPE (Mean Per Joint Position Error) as the evaluation metric. The experimental results demonstrate that our ST-GCN model outperforms the baseline models in both short-term (< 400 ms) and long-term (> 400 ms) prediction, yielding the best prediction results. PubDate: 2024-07-29
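As a rough illustration of the described building block (a spatial GCN layer followed by a 2D-convolution-based TCN layer), here is a hedged PyTorch sketch; the layer sizes, the identity adjacency, and the class name `STGCNBlock` are assumptions made for the example, not the paper's configuration.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Illustrative spatial-temporal block: channel mixing plus aggregation
    over the joint graph, followed by a 2D-convolution-based temporal layer."""
    def __init__(self, in_ch, out_ch, adjacency, t_kernel=3):
        super().__init__()
        self.register_buffer("A", adjacency)             # (J, J) joint graph
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                 # x: (N, C, T, J)
        x = self.spatial(x)                               # mix channels
        x = torch.einsum("nctj,jk->nctk", x, self.A)      # aggregate over joints
        return self.relu(self.temporal(x))                # convolve over time

# toy usage: batch of 2 sequences, 3 input channels, 10 frames, 17 joints
A = torch.eye(17)                                         # placeholder adjacency
block = STGCNBlock(3, 64, A)
out = block(torch.randn(2, 3, 10, 17))                    # -> (2, 64, 10, 17)
```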
Abstract: Recently, video retrieval based on pre-trained models (e.g., CLIP) has achieved outstanding success. To further improve search performance, most existing methods utilize a multi-grained contrastive fine-tuning scheme. For example, frame features and word features are taken as fine-grained representations, while aggregated frame features and the [CLS] token on the textual side are used as global representations. However, this scheme remains challenging. The raw output features of the pre-trained encoders contain redundant and noisy information, leading to suboptimal retrieval performance. Besides, a video usually correlates with several text descriptions, while the video embedding is fixed in previous works, which may also reduce search performance. To address these problems, we propose a novel video-text retrieval model, named Local Semantic Enhancement and Cross Aggregation (LSECA). Specifically, we design a local semantic enhancement scheme, which utilizes the global feature for video and keyword information for text to augment fine-grained semantic representations. Moreover, a cross aggregation module is proposed to enhance the interaction between video and text modalities. In this way, the local semantic enhancement scheme increases the related representation of the modalities, and the developed cross aggregation module makes the representations of texts and videos more uniform. Extensive experiments on three popular text-video retrieval benchmark datasets demonstrate that our LSECA outperforms several state-of-the-art methods. PubDate: 2024-07-22
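One possible reading of using the video-level feature to strengthen fine-grained frame representations is sketched below; the weighting scheme and the name `enhance_frames` are assumptions made for illustration and are not LSECA's actual module.

```python
import torch
import torch.nn.functional as F

def enhance_frames(frame_feats, global_feat):
    """Illustrative local enhancement: frame features are reweighted by their
    similarity to the video-level feature, so frames aligned with the global
    semantics are emphasized."""
    weights = F.softmax(frame_feats @ global_feat.unsqueeze(-1), dim=1)  # (N, T, 1)
    return frame_feats * (1.0 + weights)

frames = torch.randn(4, 12, 512)                  # batch, frames, dim
video = frames.mean(dim=1)                        # simple global feature
enhanced = enhance_frames(frames, video)
```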
Abstract: Multimodal sentiment analysis has received much attention in recent years, especially in the context of the abundant multimodal data generated by social platforms. Models like CLIP, BLIP, and VisualBERT are all excellent but often come with a large number of parameters and require inputs in the form of image-text pairs, which constrains their flexibility. The integration of superior unimodal sentiment analysis models allows simultaneous processing of multimodal data, enabling arbitrary addition or removal of modalities for generalized multimodal sentiment analysis. An integrated model that preprocesses and fuses each modality can sometimes significantly improve the accuracy of sentiment analysis. Therefore, this study proposes a novel multimodal sentiment analysis approach, MuAL, based on cross-modal attention and a difference loss. Cross-modal attention is used to integrate information from the two modalities, and the difference loss is utilized to minimize the gap between image and text information, enhancing the model’s robustness. Additionally, MuAL uses the [CLS] token to capture overall sentiment information, further eliminating noise within modalities and reducing computational expense. The study evaluates MuAL on five real-world datasets, demonstrating superior performance over baseline methods with fewer parameters. Furthermore, since MuAL utilizes pre-trained models as encoders, the research also assesses its capability in transfer learning. Results reveal that even after freezing the parameters of the pre-trained models, MuAL outperforms the baselines on all five datasets, confirming its superior performance. PubDate: 2024-07-22
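A hedged sketch of the two ingredients named above, cross-modal attention plus a difference loss, is given below; the dimensions, the MSE choice of difference loss, and the class name `CrossModalFusion` are assumptions, not MuAL's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Illustrative fusion: text tokens attend to image tokens, and a simple
    difference loss pulls the two modality summaries closer together."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        fused, _ = self.attn(text_tokens, image_tokens, image_tokens)
        diff_loss = F.mse_loss(text_tokens.mean(dim=1),
                               image_tokens.mean(dim=1))   # modality gap penalty
        cls = fused[:, 0]                                   # [CLS]-style summary
        return cls, diff_loss

fusion = CrossModalFusion()
cls, loss = fusion(torch.randn(8, 20, 256), torch.randn(8, 49, 256))
```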
Abstract: Recognizing various human actions in videos is considered a highly complicated problem with many potential applications in solving real-world problems such as human behavior analysis, artificial intelligence, video surveillance, and smart manufacturing. Therefore, designing novel approaches for automatically understanding video data is in high demand. Toward this goal, different algorithms have been investigated, concentrating on extracting spatial information and temporal dependencies. However, motion feature extraction is typically hand-engineered and isolated from the learning operations. In this paper, to comprehend motion features along with spatial information and temporal dependencies, an innovative attempt is made by designing a new Gated Recurrent Unit (GRU) network. Moreover, a novel deep neural network is presented using the proposed GRU to recognize human actions. Evaluations on popular datasets (YouTube2011, UCF50, UCF101, and HMDB51) not only convey the superiority of the proposed GRU in action recognition using an end-to-end learning model but also emphasize the generalizability of the proposed method. Additionally, to show the applicability and functionality of the proposed model in solving real-world problems, an engine block assembly dataset was collected and the performance of the proposed method was measured on this dataset. Finally, the robustness of the proposed method against various kinds of noise was tested. The obtained results demonstrate the high performance of the proposed method and its robustness against noise. PubDate: 2024-07-16
Abstract: Human face retrieval has long been established as one of the most interesting research topics in computer vision. With the recent development of deep learning, many researchers have addressed this problem by building deep hashing models to learn binary codes from face images, while performing face retrieval as a classification task. Nevertheless, the performance is still unsatisfactory, since these models are incapable of handling inter-class variation between multiple persons, as a class label must be created for each identity. Against this backdrop, we propose in this paper an effective deep learning-based framework for face image retrieval. The key to our framework is the matching of face pairs, where a two-stream network, named \(\chi Net+\chi Match\), is designed to learn similarities in terms of person identity. Such similarities are investigated by embedding both deep local representations via face components and deep global face representations via the whole face image. Since the similarities captured over face components are expected to vary due to variation in pose, expression, and occlusion, we also introduce a Sparse Score Fusion layer that automatically learns the weight of each component according to its contribution to face matching. To allow fast retrieval, we further propose a method that generates binary codes corresponding to groups of similar faces through hierarchical k-means, where the path down the binary tree is exploited as a binary code for indexing. The final retrieval is then conducted within a privileged subset of images in the database. Our experiments on different challenging datasets show that our approach obtains outstanding results while outperforming most existing methods. PubDate: 2024-07-08
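The path-as-binary-code idea can be sketched with a recursive 2-means split, where the branch taken at each level contributes one bit; the following is an illustrative sketch with scikit-learn, not the paper's full indexing pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def path_codes(features, depth=8, seed=0):
    """Illustrative hierarchical 2-means indexing: at each level the current
    group is split in two, the chosen branch contributes one bit, and the path
    down the tree becomes the binary code of each sample."""
    codes = np.zeros((len(features), depth), dtype=np.uint8)
    groups = [(np.arange(len(features)), 0)]
    while groups:
        idx, level = groups.pop()
        if level == depth or len(idx) < 2:
            continue
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(features[idx])
        codes[idx, level] = labels               # one bit per level
        for b in (0, 1):
            groups.append((idx[labels == b], level + 1))
    return codes

codes = path_codes(np.random.rand(100, 64), depth=6)   # toy 64-d features
```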
Abstract: Transformers have achieved success in many computer vision tasks, but their potential in Zero-Shot Learning (ZSL) has yet to be fully explored. In this paper, a Transformer architecture is developed, termed DSPformer, which can discover semantic parts through token growth and clustering. This is achieved through two proposed methods: Adaptive Token Growth (ATG) and Semantic Part Clustering (SPC). Firstly, it is observed that the background may distract models, causing them to rely on irrelevant regions to make decisions. To alleviate this issue, the ATG is proposed to locate discriminative foreground regions and remove meaningless and even noisy backgrounds. Secondly, semantically similar parts may be distributed into different tokens. To address this problem, the SPC is proposed to group semantically consistent parts by token clustering. Extensive experiments on several challenging datasets demonstrate the effectiveness of the proposed DSPformer. PubDate: 2024-06-27
Abstract: Large language models (LLMs) have exhibited remarkable efficacy and proficiency in a wide array of NLP endeavors. Nevertheless, concerns are growing rapidly regarding the security and vulnerabilities linked to the adoption and incorporation of LLMs. In this work, a systematic study of the most up-to-date attack and defense frameworks for LLMs is presented. This work delves into the intricate landscape of adversarial attacks on language models (LMs) and presents a thorough problem formulation. It covers a spectrum of attack enhancement techniques and also addresses methods for strengthening LLMs. This study also highlights challenges in the field, such as the assessment of offensive or defensive performance, defense and attack transferability, high computational requirements, embedding space size, and perturbation. This survey encompasses more than 200 recent papers concerning adversarial attacks and techniques. By synthesizing a broad array of attack techniques, defenses, and challenges, this paper contributes to the ongoing discourse on securing LMs against adversarial threats. PubDate: 2024-06-25
Abstract: Adversarial examples have exposed the inherent vulnerabilities of deep neural networks. Although adversarial training has emerged as the leading strategy for adversarial defense, it is frequently hindered by a challenging balance between maintaining accuracy on unaltered examples and enhancing model robustness. Recent efforts on decoupling network components can effectively reduce the degradation of classification accuracy, but at the cost of unsatisfactory robust accuracy, and they may suffer from robust overfitting. In this paper, we delve into the underlying causes of this compromise and introduce a novel framework, the Regularized Decoupled Adversarial Training Mechanism (RDAT), to effectively deal with the trade-off and the overfitting. Specifically, RDAT comprises two distinct modules: a Regularization module, which mitigates harmful perturbations by controlling the distance between the data distributions of examples before and after adversarial attacks, and a Decoupling Training module, which separates clean and adversarial examples so that each can follow a dedicated optimization strategy, avoiding suboptimal results in adversarial training. With marginal compromise on classification accuracy, RDAT achieves remarkably better model robustness, improving robust accuracy by an average of 4.47% on CIFAR-10 and 3.23% on CIFAR-100 compared to state-of-the-art methods. PubDate: 2024-05-07 DOI: 10.1007/s13735-024-00330-y
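A hedged sketch of such a regularized, decoupled training step is shown below; the KL divergence as the distribution distance and the weighting factors are assumptions for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def rdat_style_step(model, x_clean, x_adv, y, alpha=1.0, beta=6.0):
    """Illustrative decoupled training objective: separate losses for the
    clean and adversarial branches, plus a distribution-distance regularizer
    between the pre- and post-attack output distributions (KL is an assumed
    choice, not the paper's stated one)."""
    logits_clean = model(x_clean)
    logits_adv = model(x_adv)
    loss_clean = F.cross_entropy(logits_clean, y)            # clean branch
    loss_adv = F.cross_entropy(logits_adv, y)                # adversarial branch
    reg = F.kl_div(F.log_softmax(logits_adv, dim=1),
                   F.softmax(logits_clean, dim=1),
                   reduction="batchmean")                     # distribution gap
    return loss_clean + alpha * loss_adv + beta * reg
```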
Abstract: Video saliency prediction aims to simulate human visual attention by selecting the most pertinent and important components within a video frame or sequence. When evaluating video saliency, both temporal and spatial information are essential, particularly in the presence of challenging conditions such as fast motion, shifting backgrounds, and nonrigid deformation. Current video saliency frameworks are highly prone to failure under such conditions. Moreover, it is unsuitable to perform video saliency identification by relying solely on image saliency models, disregarding the temporal information in videos. This research proposes a novel Spatiotemporal Bidirectional Network for Video Salient Object Detection using Multiscale Transfer Learning (SBMTL-Net) to address the problem of detecting important objects in videos. SBMTL-Net produces significant outcomes for a given sequence of frames by utilizing multiscale transfer learning with an encoder-decoder technique to learn and map spatial and temporal properties. The SBMTL-Net model consists of a bidirectional LSTM (Long Short-Term Memory) network and a CNN (Convolutional Neural Network), where VGG16 and VGG19 (Visual Geometry Group) networks are utilized for multi-scale feature extraction from the input video frames. The performance of the proposed model has been evaluated on five publicly available challenging datasets, DAVIS-T, SegTrack-V2, ViSal, VOS-T, and DAVSOD-T, in terms of MAE, F-measure, and S-measure. The experimental results show the effectiveness of the proposed model compared with other competitive models. PubDate: 2024-05-07 DOI: 10.1007/s13735-024-00331-x
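To illustrate the bidirectional-LSTM-over-CNN-features design pattern described above, here is a hedged PyTorch sketch with per-frame features assumed to be precomputed (e.g., by VGG16/VGG19); the dimensions and the per-frame scalar saliency output are simplifications, not SBMTL-Net's decoder.

```python
import torch
import torch.nn as nn

class BiLSTMTemporalHead(nn.Module):
    """Illustrative temporal head: precomputed per-frame CNN features are
    passed through a bidirectional LSTM and mapped to a per-frame saliency
    score (a stand-in for the full saliency map)."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats):                  # (N, T, feat_dim)
        temporal, _ = self.lstm(frame_feats)
        return torch.sigmoid(self.head(temporal))    # (N, T, 1) saliency scores

scores = BiLSTMTemporalHead()(torch.randn(2, 16, 1024))
```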
Abstract: Deep learning has achieved great success in computer vision, especially in image classification tasks. How to improve the generalization ability and compactness of deep neural networks has gradually attracted widespread attention from researchers. Knowledge distillation is an effective technique for model compression: it transfers general knowledge from a sophisticated teacher model to a smaller student model. Recently, some studies refine knowledge from feature maps or adopt complex attention mechanisms to better supervise students in imitating teachers. However, these methods focus too much on improving student accuracy and largely overlook the associated training costs, which runs counter to the original purpose of knowledge distillation, namely model compression. To achieve a balance between performance and efficiency, in this paper we introduce a straightforward and effective distillation method that utilizes the deepest feature maps to enhance shallow features. Specifically, our method operates only on the original feature maps, without an extra assisting network. Moreover, we use cross-layer feature fusion to enhance the attention on shallow feature maps. By visualizing the features of different layers, we demonstrate the importance of the fusion operation in our method. Our experimental results on the CIFAR-100, tinyImageNet, and miniImageNet datasets show that our approach outperforms previous methods, especially in the balance between performance and training cost. Further ablation studies verify the effectiveness of the design. PubDate: 2024-05-02 DOI: 10.1007/s13735-024-00332-w
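A hedged sketch of using the deepest feature map to enhance a shallow one before matching the teacher is shown below; the channel sizes, the upsample-and-add fusion, and the MSE objective are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepToShallowFusion(nn.Module):
    """Illustrative cross-layer fusion: the deepest student feature map is
    projected, upsampled, and added to a shallow map, and the enhanced map
    is matched against the teacher's corresponding feature map."""
    def __init__(self, shallow_ch, deep_ch):
        super().__init__()
        self.project = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, shallow_feat, deep_feat, teacher_feat):
        deep_up = F.interpolate(self.project(deep_feat),
                                size=shallow_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        enhanced = shallow_feat + deep_up                 # fused shallow map
        return F.mse_loss(enhanced, teacher_feat)         # distillation loss

fusion = DeepToShallowFusion(shallow_ch=64, deep_ch=512)
loss = fusion(torch.randn(2, 64, 56, 56),
              torch.randn(2, 512, 7, 7),
              torch.randn(2, 64, 56, 56))
```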
Abstract: A group recommender system (GRS) generates suggestions for a group of individuals, considering not only each person's preferences but also factors such as social dynamics and behavior, to deliver recommendations that balance personal taste and social factors. In this study, 225 papers from various journals and conference proceedings, covering a wide range of literature on group recommender systems, have been analyzed. The articles used for the review were published between 2010 and 2023. This overview of the literature focuses on several methods for creating group recommender systems. The review starts by providing an overview of group recommender systems, including the challenges and essential elements for their development. It then examines the existing literature on collaborative, content-based, and knowledge-based group recommendation techniques. Beyond traditional approaches, this study identifies a notable research gap in the integration of audio, image, and video recommendation systems within the group recommendation paradigm, and discusses the research gaps found in the existing papers. The review also covers the various aggregation techniques used to combine member preferences and the evaluation metrics used to assess these techniques. It concludes by discussing the limitations and potential future directions of group recommendation research, aiming to give a thorough understanding of the current state of group recommendation and to pinpoint potential areas for future study. PubDate: 2024-05-02 DOI: 10.1007/s13735-024-00329-5
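For context, two classical preference-aggregation strategies commonly discussed in the GRS literature (average and least misery) can be sketched as follows; this is illustrative background, not a method proposed by the surveyed review.

```python
import numpy as np

def aggregate_group_scores(member_scores, strategy="average"):
    """Classical aggregation strategies: 'average' takes the mean predicted
    rating per item across group members, 'least_misery' takes the minimum
    so that no member is left strongly dissatisfied."""
    member_scores = np.asarray(member_scores)        # (members, items)
    if strategy == "average":
        return member_scores.mean(axis=0)
    if strategy == "least_misery":
        return member_scores.min(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

group = [[4.0, 2.5, 5.0], [3.0, 4.0, 1.0]]           # 2 members, 3 items
print(aggregate_group_scores(group, "least_misery"))  # [3.0, 2.5, 1.0]
```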
Abstract: Cross-domain few-shot learning (CD-FSL) aims to develop a robust and generalizable model from a data-abundant source domain and apply it to the data-scarce target domain. An intrinsic challenge in CD-FSL is the domain shift problem, often manifested as a discrepancy in data distributions. This work addresses the domain shift problem from a model learning perspective, characterizing it in two specific aspects: over-sensitivity and excessive invariance. Specifically, we introduce a novel Relevance Equilibrium Network (ReqNet) to enhance the generalizability of few-shot models on target domain tasks. In particular, we design a Style Augmentation (StyleAug) module to diversify low-level visual styles of feature representations, alleviating the model’s over-sensitivity to class- or task-irrelevant changes. Furthermore, to mitigate the excessive invariance to features relevant to the class and task, we devise a Task Context Modeling (TCM) module that strategically employs non-local operations to incorporate comprehensive task-level information. Extensive experiments and ablation studies are conducted on eight datasets to demonstrate the competitive performance of our proposed ReqNet. PubDate: 2024-04-29 DOI: 10.1007/s13735-024-00333-9
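The style-diversification idea behind modules such as StyleAug is often realized by perturbing per-channel feature statistics; the sketch below mixes the mean and standard deviation across a shuffled batch and is an assumed, generic illustration rather than the paper's StyleAug definition.

```python
import torch

def style_mix(features, alpha=0.5, eps=1e-6):
    """Illustrative style perturbation on feature maps (N, C, H, W): normalize
    each channel, then re-apply a convex mix of its own statistics and those
    of a randomly paired sample, diversifying low-level styles."""
    mu = features.mean(dim=(2, 3), keepdim=True)
    sigma = features.std(dim=(2, 3), keepdim=True) + eps
    perm = torch.randperm(features.size(0))
    mu_mix = alpha * mu + (1 - alpha) * mu[perm]
    sigma_mix = alpha * sigma + (1 - alpha) * sigma[perm]
    return (features - mu) / sigma * sigma_mix + mu_mix

augmented = style_mix(torch.randn(8, 64, 32, 32))
```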
Abstract: An image caption is a sentence summarizing the semantic details of an image. Image captioning is a blended application of computer vision and natural language processing. Earlier research addressed this domain with machine learning approaches, modeling image captioning frameworks on hand-engineered feature extraction techniques. With the resurgence of deep-learning approaches, the development of improved and efficient image captioning frameworks is on the rise. Image captioning is witnessing tremendous growth in various domains such as medicine, remote sensing, security, visual assistance, and multimodal search engines. In this survey, we comprehensively study image captioning frameworks based on our proposed domain-specific taxonomy. We explore the benchmark datasets and metrics leveraged for training and evaluating image captioning models in various application domains. In addition, we perform a comparative analysis of the reviewed models. Natural image captioning, medical image captioning, and remote sensing image captioning are currently among the most prominent application domains of image captioning. Achieving effective real-time image captioning remains a challenging obstacle limiting its deployment in sensitive areas such as visual aid, remote security, and healthcare. Further challenges include the scarcity of rich domain-specific datasets, training complexity, evaluation difficulty, and a deficiency of cross-domain knowledge transfer techniques. Despite the significant contributions made, additional efforts are needed to develop robust and influential image captioning models. PubDate: 2024-04-18 DOI: 10.1007/s13735-024-00328-6
Abstract: Visible-Infrared Person Re-identification (VI-ReID) is challenging in social security surveillance because the semantic gap between cross-modal data significantly reduces VI-ReID performance. To overcome this challenge, this paper proposes a novel Multi Knowledge-driven Enhancement Module (MKEM) for high-performance VI-ReID. It focuses on explicitly learning appropriate transition modalities and effectively synthesizing them to reduce the burden on models of learning vastly different cross-modal knowledge. The MKEM consists of a Visible Knowledge-driven Enhancement Module (VKEM) and an Infrared Knowledge-driven Enhancement Module (IKEM), which generate knowledge-accumulating transition modalities for the visible and infrared modalities, respectively. To effectively leverage the transition modalities, the model needs to learn the original data distribution while accumulating knowledge of the transition modalities; thus, a Diversity Loss is designed to guide the representations of the generated transition modalities to be diverse, which facilitates the model’s knowledge accumulation. To prevent redundant knowledge accumulation, a Consistency Loss is proposed to maintain the semantic similarity between the original and the modeled transition modalities. Furthermore, we implement a Bias Adjustment Strategy (BAS) to effectively adjust the gap between the head and tail categories. We evaluated our proposed MKEM on two VI-ReID benchmark datasets, SYSU-MM01 and RegDB, and the experimental results demonstrate that our method significantly outperforms existing methods. The source code of our proposed MKEM is available at https://github.com/SWU-CS-MediaLab/MKEM. PubDate: 2024-04-16 DOI: 10.1007/s13735-024-00327-7
Abstract: In recent years, object detection has become one of the most prominent components of computer vision. State-of-the-art object detectors now employ convolutional neural network (CNN) techniques alongside other deep neural network techniques to improve detection performance and accuracy. Most recent object detectors employ the feature pyramid network (FPN) and its variants, while others use combinations of attention mechanisms to achieve better performance. An open question is the inconsistency, when detecting objects, between the lower-layer features (their resolution, receptive field, and semantic information) and the upper-layer features. Although some researchers have attempted to address this issue, we build on ideas from the field and propose a more prominent architecture called the dense attention feature pyramid network (DAF-Net) for multiscale object detection. DAF-Net consists of two attention models: a spatial attention model and a channel attention model. Different from other attention models, we propose lightweight attention models that are fully data-driven, and we implement a densely connected attention FPN to reduce the model’s complexity and avoid learning redundant feature maps. First, we developed the two attention models, then used only the spatial attention model in the backbone of our network, and finally used both attention models to filter and maintain a steady flow of semantic information from the lower layers to improve the model’s accuracy and efficiency. Experimental results on underwater images from the National Natural Science Foundation of China (NSFC) Underwater Image Dataset (online, retrieved from http://www.cnurpc.org/index.html), the MS COCO dataset, and the PASCAL VOC dataset indicate higher accuracy and better detection results with the proposed model compared to the benchmark model YOLOX-Darknet53 (Ge et al., YOLOX: Exceeding YOLO series in 2021, arXiv preprint arXiv:2107.08430). Our model achieved 70.2 mAP, 48.9 mAP, and 83.9 mAP on the NSFC, MS COCO, and PASCAL VOC datasets, respectively, compared with the benchmark model’s 68.9 mAP on NSFC, 47.7 mAP on MS COCO, and 82.4 mAP on PASCAL VOC. PubDate: 2024-04-08 DOI: 10.1007/s13735-024-00323-x
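A hedged sketch of a lightweight channel-plus-spatial attention module in the spirit described above is given below; the reduction ratio, kernel size, and class name `LightweightAttention` are assumptions, not the actual DAF-Net design.

```python
import torch
import torch.nn as nn

class LightweightAttention(nn.Module):
    """Illustrative channel + spatial attention: a squeeze-style channel gate
    followed by a single-channel spatial gate, both fully data-driven."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)        # reweight channels
        return x * self.spatial(x)     # reweight spatial positions

attn = LightweightAttention(256)
out = attn(torch.randn(2, 256, 40, 40))
```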
Abstract: Over the past decade, person re-identification (P-Reid) has become an increasingly widespread area of computer vision research. The technology is applied in fields such as pedestrian tracking, security, and video surveillance. Currently, person re-identification performs well when supervised with labeled data, but accuracy frequently suffers when learning unsupervised on unlabeled samples. Therefore, improving models learned from unlabeled samples is a challenging endeavor. To solve this problem, we propose a progressive spatial–temporal transfer model (PSTT), which consists of three stages: incremental tuning, spatial–temporal fusion, and target-domain learning. In the first stage, a high-performance multi-scale network that can initially cluster samples is obtained through a triplet loss function. In the next stage, to mine spatial–temporal and visual semantic information, we introduce a fusion model that fuses the visual information extracted by the trained network from the labeled and unlabeled datasets with its spatial–temporal information. In the final stage, with the assistance of the fusion model, we employ a strategy that extends learning from labeled to unlabeled samples. During training, the fusion model is used to select labeled and unlabeled samples, and multiple meta loss functions are used for transfer learning. During testing, the fusion model is employed to enhance the accuracy of the network. In the experiments, we evaluate our method on five standard P-Reid benchmarks: Market1501, DukeMTMC-ReID, CUHK03, MSMT17, and Occluded-DukeMTMC. Extensive experiments show that our proposed PSTT achieves state-of-the-art performance, exceeding the previous method by a clear margin. The source code is available at https://github.com/LiZX12/PSTT. PubDate: 2024-04-03 DOI: 10.1007/s13735-024-00324-w
Abstract: Multimodal hash technology maps high-dimensional multimodal data into hash codes, which greatly reduces data storage costs and improves query speed through Hamming similarity computation. However, existing unsupervised methods still face two key obstacles: (1) with the evolution of large multimodal models, how can the multimodal matching relationships of large models be efficiently distilled to train a powerful student model? (2) existing methods do not consider other adjacency relations between multimodal instances, resulting in limited similarity representation. To address these obstacles, a method called Unsupervised Graph Reasoning Distillation Hashing (UGRDH) is proposed. UGRDH uses CLIP as the teacher model, thus extracting fine-grained multimodal features and relations for teacher–student distillation. Specifically, the multimodal features of the teacher are used to construct a similarity–complementary relation graph matrix, and the proposed graph convolution auxiliary network performs feature aggregation guided by the relation graph matrix to generate more discriminative hash codes. In addition, a cross-attention module is designed to reason about potential instance relations to enable effective teacher–student distillation. Finally, UGRDH greatly improves search precision while remaining lightweight. Experimental results show that our method achieves about 1.5%, 3%, and 2.8% performance improvements on MS COCO, NUS-WIDE, and MIRFlickr, respectively. PubDate: 2024-03-30 DOI: 10.1007/s13735-024-00326-8
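The graph-guided aggregation step can be pictured as mixing instance features according to a row-normalized relation matrix before projecting to hash logits; the sketch below is an assumed illustration (the code length, tanh relaxation, and dot-product relation matrix are not from the paper).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGraphAggregation(nn.Module):
    """Illustrative graph-guided aggregation: features are mixed according to
    a relation matrix and projected to relaxed hash codes, with sign() giving
    the binary code at inference."""
    def __init__(self, dim, code_len=64):
        super().__init__()
        self.proj = nn.Linear(dim, code_len)

    def forward(self, features, relation):           # (N, D), (N, N)
        relation = F.softmax(relation, dim=1)         # row-normalize the graph
        aggregated = relation @ features              # neighborhood mixing
        logits = torch.tanh(self.proj(aggregated))    # relaxed codes for training
        return logits, torch.sign(logits)             # binary codes at test time

feats = torch.randn(16, 512)
rel = feats @ feats.t()                               # dot-product similarity
logits, codes = RelationGraphAggregation(512)(feats, rel)
```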