- IEEE Transactions on Circuits and Systems for Video Technology publication information
PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- A Progressive Difference Method for Capturing Visual Tempos on Action Recognition
Authors:
Xiaoxiao Sheng;Kunchang Li;Zhiqiang Shen;Gang Xiao;
Pages: 977 - 987 Abstract: Visual tempos show the dynamics of action instances, characterizing the diversity of the actions, such as walking slowly and running quickly. To facilitate action recognition, it is essential to capture visual tempos. To this end, previous methods sample raw videos at multiple frame rates or integrate multi-scale temporal features. These methods inevitably introduce two-stream networks or feature-level pyramid structures, leading to expensive computation. In this work, we propose a progressive difference method to capture visual tempos for efficient action recognition, by computing coarse-to-fine motion information within a small neighborhood around temporal frames. Specifically, the uniform sampling method is first applied to each video, and then first-order temporal differences around each frame are calculated to describe local motions. On the basis of differences, further computing the variations of differences, namely second-order differences, can gradually capture fine-grained spatiotemporal features and characterize the areas where the motion cues are more prominent. On one hand, multi-order motion differences can be combined with raw input to describe the diversity of the actions. On the other hand, the variations of first-order differences information can be used to activate first-order salient motion regions, thereby facilitating the discrimination of finer-grained actions. Our method can be combined with existing backbones in a plug-and-play manner. Extensive experiments are conducted on several video benchmarks, including Kinetics400, HMDB51, UCF101, UAV-Human, Something-Something V1 and V2. We also give detailed analysis and qualitative experiments to demonstrate the effectiveness of our method. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
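The first- and second-order temporal differences described in the abstract above are simple to compute. Below is a minimal PyTorch sketch, not the authors' code; the clip shape and the channel-wise fusion with the raw input are illustrative assumptions.

```python
import torch

def temporal_differences(clip: torch.Tensor):
    # clip: (T, C, H, W), frames uniformly sampled from a video
    d1 = clip[1:] - clip[:-1]   # first-order differences: local motion, (T-1, C, H, W)
    d2 = d1[1:] - d1[:-1]       # second-order differences: motion variation, (T-2, C, H, W)
    return d1, d2

# Combine multi-order motion differences with the raw input (illustrative fusion)
clip = torch.rand(8, 3, 112, 112)
d1, d2 = temporal_differences(clip)
fused = torch.cat([clip[2:], d1[1:], d2], dim=1)  # (T-2, 3*C, H, W)
```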
- A Perception-Aware Decomposition and Fusion Framework for Underwater Image Enhancement
Authors:
Yaozu Kang;Qiuping Jiang;Chongyi Li;Wenqi Ren;Hantao Liu;Pengjun Wang;
Pages: 988 - 1002 Abstract: This paper presents a perception-aware decomposition and fusion framework for underwater image enhancement (UIE). Specifically, a general structural patch decomposition and fusion (SPDF) approach is introduced. SPDF is built upon the fusion of two complementary pre-processed inputs in a perception-aware and conceptually independent image space. First, a raw underwater image is pre-processed to produce two complementary versions including a contrast-corrected image and a detail-sharpened image. Then, each of them is decomposed into three conceptually independent components, i.e., mean intensity, contrast, and structure, via structural patch decomposition (SPD). Afterwards, the corresponding components are fused using tailored strategies. The three components after fusion are finally integrated via inverting the decomposition to reconstruct a final enhanced underwater image. The main advantage of SPDF is that two complementary pre-processed images are fused in a perception-aware and conceptually independent image space and the fusions of different components can be performed separately without any interactions and information loss. Comprehensive comparisons on two benchmark datasets demonstrate that SPDF outperforms several state-of-the-art UIE algorithms qualitatively and quantitatively. Moreover, the effectiveness of SPDF is also verified on another two relevant tasks, i.e., low-light image enhancement and single image dehazing. The code will be made available soon. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
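For reference, the structural patch decomposition used in the SPDF pipeline above can be written in a few lines. This NumPy sketch follows the standard mean-intensity/contrast/structure factorization and is only illustrative; the tailored fusion rules are the paper's contribution.

```python
import numpy as np

def spd(patch: np.ndarray, eps: float = 1e-8):
    # Decompose a patch x as x = c * s + l:
    #   l: mean intensity, c: contrast (signal strength), s: unit-norm structure
    l = patch.mean()
    residual = patch - l
    c = np.linalg.norm(residual)
    s = residual / (c + eps)
    return l, c, s

def spd_recompose(l, c, s):
    # Invert the decomposition after the three components have been fused
    return c * s + l
```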
- Enabling Large-Capacity Reversible Data Hiding Over Encrypted JPEG Bitstreams
Authors:
Zhongyun Hua;Ziyi Wang;Yifeng Zheng;Yongyong Chen;Yuanman Li;
Pages: 1003 - 1018 Abstract: Cloud computing offers advantages in handling the exponential growth of images but also entails privacy concerns on outsourced private images. Reversible data hiding (RDH) over encrypted images has emerged as an effective technique for securely storing and managing confidential images in the cloud. Most existing schemes only work on uncompressed images. However, almost all images are transmitted and stored in compressed formats such as JPEG. Recently, some RDH schemes over encrypted JPEG bitstreams have been developed, but these works have some disadvantages such as a small embedding capacity (particularly for low quality factors), damage to the JPEG format, and file size expansion. In this study, we propose a permutation-based embedding technique that allows the embedding of significantly more data than existing techniques. Using the proposed embedding technique, we further design a large-capacity RDH scheme over encrypted JPEG bitstreams, in which a grouping method is designed to boost the number of embeddable blocks. The designed RDH scheme allows a content owner to encrypt a JPEG bitstream before uploading it to a cloud server. The cloud server can embed additional data (e.g., copyright and identification information) into the encrypted JPEG bitstream for storage, management, or other processing purpose. A receiver can losslessly recover the original JPEG bitstream using a decryption key. Comprehensive evaluation results demonstrate that our proposed design can achieve approximately twice the average embedding capacity compared to the best prior scheme while preserving the file format without file size expansion. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
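The permutation-based embedding idea can be illustrated with a toy sketch: a group of n encrypted blocks can carry about floor(log2(n!)) bits by reordering the blocks according to the data value. This is only a schematic of the principle, not the paper's scheme, which also covers block grouping, format preservation, and lossless recovery.

```python
from math import factorial

def nth_permutation(items, k):
    # k-th permutation in lexicographic order (factorial number system)
    items = list(items)
    order = []
    for i in range(len(items), 0, -1):
        idx, k = divmod(k, factorial(i - 1))
        order.append(items.pop(idx))
    return order

def embed_bits_by_permutation(blocks, bits):
    # Hide floor(log2(n!)) bits by permuting a group of n blocks (assumes n >= 2)
    n = len(blocks)
    capacity = factorial(n).bit_length() - 1
    value = int(bits[:capacity], 2)
    order = nth_permutation(range(n), value)
    return [blocks[i] for i in order], bits[capacity:]
```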
- Language-Augmented Pixel Embedding for Generalized Zero-Shot Learning
Authors:
Ziyang Wang;Yunhao Gou;Jingjing Li;Lei Zhu;Heng Tao Shen;
Pages: 1019 - 1030 Abstract: Zero-shot Learning (ZSL) aims to recognize novel classes through seen knowledge. The canonical approach to ZSL leverages a visual-to-semantic embedding to map the global features of an image sample to its semantic representation. These global features usually overlook the fine-grained information which is vital for knowledge transfer between seen and unseen classes, rendering these features sub-optimal for ZSL task, especially the more realistic Generalized Zero-shot Learning (GZSL) task where global features of similar classes could hardly be separated. To provide a remedy to this problem, we propose Language-Augmented Pixel Embedding (LAPE) that directly bridges the visual and semantic spaces in a pixel-based manner. To this end, we map the local features of each pixel to different attributes and then extract each semantic attribute from the corresponding pixel. However, the lack of pixel-level annotation conduces to an inefficient pixel-based knowledge transfer. To mitigate this dilemma, we adopt the text information of each attribute to augment the local features of image pixels which are related to the semantic attributes. Experiments on four ZSL benchmarks demonstrate that LAPE outperforms current state-of-the-art methods. Comprehensive ablation studies and analyses are provided to dissect what factors lead to this success. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Learning Spatiotemporal Interactions for User-Generated Video Quality Assessment
Authors:
Hanwei Zhu;Baoliang Chen;Lingyu Zhu;Shiqi Wang;
Pages: 1031 - 1042 Abstract: Distortions from spatial and temporal domains have been identified as the dominant factors that govern the visual quality. Though both have been studied independently in deep learning-based user-generated content (UGC) video quality assessment (VQA) by frame-wise distortion estimation and temporal quality aggregation, much less work has been dedicated to the integration of them with deep representations. In this paper, we propose a SpatioTemporal Interactive VQA (STI-VQA) model based upon the philosophy that video distortion can be inferred from the integration of both spatial characteristics and temporal motion, along with the flow of time. In particular, for each timestamp, both the spatial distortion explored by the feature statistics and local motion captured by feature difference are extracted and fed to a transformer network for the motion aware interaction learning. Meanwhile, the information flow of spatial distortion from the shallow layer to the deep layer is constructed adaptively during the temporal aggregation. The transformer network enjoys an advanced advantage for long-range dependencies modeling, leading to superior performance on UGC videos. Experimental results on five UGC video benchmarks demonstrate the effectiveness and efficiency of our STI-VQA model, and the source code will be available online at https://github.com/h4nwei/STI-VQA. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
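A rough sketch of how per-frame tokens could be formed from deep features, i.e., spatial feature statistics plus frame-to-frame feature differences, before being fed to a transformer. The specific statistics and token layout here are assumptions; the authors' implementation is at the URL given above.

```python
import torch

def frame_tokens(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, C, H, W) deep features of T frames
    mean = feats.mean(dim=(2, 3))                     # spatial statistics, (T, C)
    std = feats.std(dim=(2, 3))                       # (T, C)
    motion = feats[1:] - feats[:-1]                   # local motion via feature difference
    motion = torch.cat([torch.zeros_like(feats[:1]), motion]).abs().mean(dim=(2, 3))
    return torch.cat([mean, std, motion], dim=1)      # one token per frame, (T, 3C)
```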
- Quality Assessment of UGC Videos Based on Decomposition and Recomposition
Authors:
Yongxu Liu;Jinjian Wu;Leida Li;Weisheng Dong;Guangming Shi;
Pages: 1043 - 1054 Abstract: The prevalence of short-video applications imposes more requirements for video quality assessment (VQA). User-generated content (UGC) videos are captured under an unprofessional environment, thus suffering from various dynamic degradations, such as camera shaking. To cover the dynamic degradations, existing recurrent neural network-based UGC-VQA methods can only provide implicit modeling, which is unclear and difficult to analyze. In this work, we consider explicit motion representation for dynamic degradations, and propose a motion-enhanced UGC-VQA method based on decomposition and recomposition. In the decomposition stage, a dual-stream decomposition module is built, and VQA task is decomposed into single frame-based quality assessment problem and cross frames-based motion understanding. The dual streams are well grounded on the two-pathway visual system during perception, and require no extra UGC data due to knowledge transfer. Hierarchical features from shallow to deep layers are gathered to narrow the gaps from tasks and domains. In the recomposition stage, a progressively residual aggregation module is built to recompose features from the dual streams. Representations with different layers and pathways are interacted and aggregated in a progressive and residual manner, which keeps a good trade-off between representation deficiency and redundancy. Extensive experiments on UGC-VQA databases verify that our method achieves the state-of-the-art performance and keeps a good capability of generalization. The source code will be available in https://github.com/Sissuire/DSD-PRO. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation
Authors:
Yuehai Chen;Jing Yang;Badong Chen;Shaoyi Du;
Pages: 1055 - 1068 Abstract: In real-world crowd counting applications, the crowd densities in an image vary greatly. When facing density variation, humans tend to locate and count the targets in low-density regions, and reason the number in high-density regions. We observe that CNN focus on the local information correlation using a fixed-size convolution kernel and the Transformer could effectively extract the semantic crowd information by using the global self-attention mechanism. Thus, CNN could locate and estimate crowds accurately in low-density regions, while it is hard to properly perceive the densities in high-density regions. On the contrary, Transformer has a high reliability in high-density regions, but fails to locate the targets in sparse regions. Neither CNN nor Transformer can well deal with this kind of density variation. To address this problem, we propose a CNN and Transformer Adaptive Selection Network (CTASNet) which can adaptively select the appropriate counting branch for different density regions. Firstly, CTASNet generates the prediction results of CNN and Transformer. Then, considering that CNN/Transformer is appropriate for low/high-density regions, a density guided adaptive selection module is designed to automatically combine the predictions of CNN and Transformer. Moreover, to reduce the influences of annotation noise, we introduce a Correntropy based optimal transport loss. Extensive experiments on four challenging crowd counting datasets have validated the proposed method. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
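One simple way to realize a density guided selection between the two branches is a soft gate that favors the Transformer prediction in dense regions and the CNN prediction in sparse regions. The sigmoid gate below is an illustrative stand-in for the paper's adaptive selection module, not its actual design.

```python
import torch

def density_guided_selection(den_cnn, den_trans, tau=1.0):
    # den_cnn, den_trans: (B, 1, H, W) density maps from the CNN and Transformer branches
    w_trans = torch.sigmoid(den_trans - tau)   # ~1 where the scene looks dense
    return w_trans * den_trans + (1.0 - w_trans) * den_cnn
```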
- Minimum Noticeable Difference-Based Adversarial Privacy Preserving Image Generation
Authors:
Wen Sun;Jian Jin;Weisi Lin;
Pages: 1069 - 1081 Abstract: Deep learning models are found to be vulnerable to adversarial examples, as wrong predictions can be caused by small perturbation in input for deep learning models. Most of the existing works of adversarial image generation try to achieve attacks for most models, while few of them make efforts on guaranteeing the perceptual quality of the adversarial examples. High quality adversarial examples matter for many applications, especially for the privacy preserving. In this work, we develop a framework based on the Minimum Noticeable Difference (MND) concept to generate adversarial privacy preserving images that have minimum perceptual difference from the clean ones but are able to attack deep learning models. To achieve this, an adversarial loss is firstly proposed to make the deep learning models attacked by the adversarial images successfully. Then, a perceptual quality-preserving loss is developed by taking the magnitude of perturbation and perturbation-caused structural and gradient changes into account, which aims to preserve high perceptual quality for adversarial image generation. To the best of our knowledge, this is the first work on exploring quality-preserving adversarial image generation based on the MND concept for privacy preserving. To evaluate its performance in terms of perceptual quality, the deep models on image classification and face recognition are tested with the proposed method and several anchor methods in this work. Extensive experimental results demonstrate that the proposed MND framework is capable of generating adversarial images with remarkably improved performance metrics (e.g., PSNR, SSIM, and MOS) than that generated with the anchor methods. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
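In spirit, the MND framework balances an attack term against a perceptual-quality term. The toy objective below, a misclassification loss plus penalties on perturbation magnitude and gradient (edge) change, only illustrates that trade-off; the paper's actual loss terms and weights differ.

```python
import torch
import torch.nn.functional as F

def mnd_style_objective(model, x_clean, x_adv, y_true, lam=1.0):
    # Attack term: push the model's prediction away from the true label
    adv_loss = -F.cross_entropy(model(x_adv), y_true)

    # Quality term: small perturbation plus small structural (gradient) change
    delta = x_adv - x_clean
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    quality = delta.pow(2).mean() \
            + (dx(x_adv) - dx(x_clean)).pow(2).mean() \
            + (dy(x_adv) - dy(x_clean)).pow(2).mean()
    return adv_loss + lam * quality
```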
- G2LP-Net: Global to Local Progressive Video Inpainting Network
Authors:
Zhong Ji;Jiacheng Hou;Yimu Su;Yanwei Pang;Xuelong Li;
Pages: 1082 - 1092 Abstract: The self-attention based video inpainting methods have achieved promising progress by establishing long-range correlation over the whole video. However, existing methods generally relied on the global self-attention that directly searches missing contents among all reference frames but lacks accurate matching and effective organization on contents, which often blurs the result owing to the loss of local textures. In this paper, we propose a Global-to-Local Progressive Inpainting Network (G2LP-Net) consisting of the following innovative ideas. First, we present a global to local self-attention mechanism by incorporating local self-attention into global self-attention to improve searching efficiency and accuracy, where the self-attention is implemented in multi-scale regions to fully exploit local redundancy for the texture recovery. Second, we propose a progressive video inpainting (PVI) method to organize the generated contents, which completes the target video frames from periphery to core to ensure reliable contents serve first. Last, we develop a window-sliding method for sampling reference frames to obtain rich available information for inpainting. In addition, we release a wire-removal video (WRV) dataset that consists of 150 video clips masked by wires to evaluate the video inpainting on irregularly slender regions. Both quantitative and qualitative experiments on benchmark datasets, DAVIS, YouTube-VOS and our WRV dataset have demonstrated the superiority of our proposed G2LP-Net method. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- CrossDet++: Growing Crossline Representation for Object Detection
Authors:
Heqian Qiu;Hongliang Li;Qingbo Wu;Jianhua Cui;Zichen Song;Lanxiao Wang;Minjian Zhang;
Pages: 1093 - 1108 Abstract: In object detection, precise object representation is a key factor to successfully classify and locate objects of an image. Existing methods usually use rectangular anchor boxes or a set of points to represent objects. However, these methods either introduce background noise or miss the continuous appearance information inside the object, and thus cause incorrect detection results. In this paper, we propose a novel anchor-free object detection network, called CrossDet++, which uses a set of growing crosslines along horizontal and vertical axes as object representations. An object can be flexibly represented as crosslines in different combinations, which inspires us to select the expressive crossline to effectively reduce the interference of noise. Meanwhile, the crossline representation takes into account the continuous adjacent object information, which is useful to enhance the discriminability of object features and find the object boundaries. Based on the learned crosslines, we propose an axis-query crossline growing module to adaptively capture features of crosslines and query surrounding pixels related to the line features for subsequent growing of crosslines. Their growing offsets and scales can be supervised by a decoupled regression mechanism, which limits the regression target to a specific direction for decreasing the optimization difficulty. During the training, we design a semantic-guided label assignment to emphasize the importance of crossline targets with higher semantic richness, further improving the detection performance. The experiment results demonstrate the effectiveness of our proposed method. Code can be available at: https://github.com/QiuHeqian/CrossDet. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Hybrid CNN-Transformer Features for Visual Place Recognition
Authors:
Yuwei Wang;Yuanying Qiu;Peitao Cheng;Junyu Zhang;
Pages: 1109 - 1122 Abstract: Visual place recognition is a challenging problem in robotics and autonomous systems because the scene undergoes appearance and viewpoint changes in a changing world. Existing state-of-the-art methods heavily rely on CNN-based architectures. However, CNN cannot effectively model image spatial structure information due to the inherent locality. To address this issue, this paper proposes a novel Transformer-based place recognition method to combine local details, spatial context, and semantic information for image feature embedding. Firstly, to overcome the inherent locality of the convolutional neural network (CNN), a hybrid CNN-Transformer feature extraction network is introduced. The network utilizes the feature pyramid based on CNN to obtain the detailed visual understanding, while using the vision Transformer to model image contextual information and aggregate task-related features dynamically. Specifically, the multi-level output tokens from the Transformer are fed into a single Transformer encoder block to fuse multi-scale spatial information. Secondly, to acquire the multi-scale semantic information, a global semantic NetVLAD aggregation strategy is constructed. This strategy employs semantic enhanced NetVLAD, imposing prior knowledge on the terms of the Vector of Locally Aggregated Descriptors (VLAD), to aggregate multi-level token maps, and further concatenates the multi-level semantic features globally. Finally, to alleviate the disadvantage that the fixed margin of triplet loss leads to the suboptimal convergence, an adaptive triplet loss with dynamic margin is proposed. Extensive experiments on public datasets show that the learned features are robust to appearance and viewpoint changes and achieve promising performance compared to state-of-the-arts. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
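As one possible reading of the adaptive triplet loss with a dynamic margin, the sketch below enlarges the margin for triplets whose positive and negative descriptors are hard to tell apart; the actual margin rule in the paper may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_triplet_loss(anchor, positive, negative, base_margin=0.1, alpha=0.5):
    # anchor/positive/negative: (B, D) image descriptors
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    d_pn = F.pairwise_distance(positive, negative)
    margin = base_margin + alpha * torch.exp(-d_pn)   # harder triplets get a larger margin
    return F.relu(d_ap - d_an + margin).mean()
```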
- Generation-Based Joint Luminance-Chrominance Learning for Underwater Image Quality Assessment
Authors:
Zheyin Wang;Liquan Shen;Zhengyong Wang;Yufei Lin;Yanliang Jin;
Pages: 1123 - 1139 Abstract: Underwater enhanced images (UEIs) are affected by not only the color cast and haze effect due to light attenuation and scattering, but also the over-enhancement and texture distortion caused by enhancement algorithms. However, existing underwater image quality assessment (UIQA) methods mainly focus on the inherent distortion caused by underwater optical imaging, and ignore the widespread artificial distortion, which leads to poor performance in evaluating UEIs. In this paper, a novel mapping-based underwater image quality representation is proposed. We divide underwater enhanced images into different domains and utilize a feature vector to measure the distance from the raw image domain to each enhanced image domain. The length and direction of the vector are defined as the enhancement degree and enhancement direction of the image. We construct a best enhancement direction and map other vectors to this direction to obtain the corresponding quality representation. Based on this, a novel network, called generation-based joint luminance-chrominance underwater image quality evaluation (GLCQE), is proposed, which is mainly divided into three parts: bi-directional reference generation module (BRGM), chromatic distortion evaluation network (CDEN), and sharpness distortion evaluation network (SDEN). BRGM is designed to generate two reference images about the unenhanced and the optimal enhanced versions of input UEI. In addition, the distortions in the luminance and chrominance domains of the UEI are analyzed. The luminance and chrominance channels of images are separated and input to SDEN and CDEN respectively to detect different distortions. A multi-scale feature mapping module is proposed in CDEN and SDEN to extract the feature representation of quality in chrominance and luminance of these images respectively. Moreover, a parallel spatial attention module is designed to focus on distortions in structural space by utilizing the different receptive fields of the convolution layer, due to the diverse manifestations of structural loss in the image. Finally, the mapped features extracted by two collaborative networks help the model evaluate the quality of underwater images more accurately. Extensive experiments demonstrate the superiority of our model against other representative state-of-the-art models. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- A Format Compliant Framework for HEVC Selective Encryption After Encoding
Authors:
Bo Tang;Cheng Yang;Yana Zhang;
Pages: 1140 - 1156 Abstract: The security protection of Ultra-High-Definition (UHD) video is facing grand challenges due to changeable application scenarios. The video business is highly dependent on the video format structure, which makes the format compliance of the encryption algorithm essential. Existing HEVC Selective Encryption (SE) algorithms struggle to achieve format compliance while remaining independent of the encoding process, which limits their practical application. To realize format compliant encryption on an encoded bitstream, this paper first proposes a non-diffusion rule by analyzing the coding format specification of High Efficiency Video Coding (HEVC). Following the non-diffusion rule, a Bit Flipping (BF) method and a Bit Insertion-Deletion (BID) method are proposed based on the lower bound mapping of the bypass syntax element bit coding interval. Then, considering the various binarization methods and the correlation between syntax elements when decoding, this paper proposes the Validity Principle (VP) and Independence Principle (IP) of format compliant encryption. Focusing on the binary bits of the primary bypass elements in HEVC, their encryptability is analyzed in detail, and the corresponding format compliant encryption schemes based on the BF and BID methods are constructed. Based on the above, a Format Compliance Encryption (FCE) framework after encoding is formulated. The flexibility, security, and adaptability of the framework are analyzed. Finally, experiments on format compliance, encryption speed, and perceptual effect show that the proposed methods, schemes, and framework achieve format compliance while meeting the basic performance and security requirements. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Deep Texture-Aware Features for Camouflaged Object Detection
Authors:
Jingjing Ren;Xiaowei Hu;Lei Zhu;Xuemiao Xu;Yangyang Xu;Weiming Wang;Zijun Deng;Pheng-Ann Heng;
Pages: 1157 - 1167 Abstract: Camouflaged object detection is a challenging task that aims to identify objects having similar texture to the surroundings. This paper presents to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn the texture-aware features in a deep convolutional neural network. The texture-aware refinement module computes the biased co-variance matrices of feature responses to extract the texture information, adopts an affinity loss to learn a set of parameter maps that help to separate the texture between camouflaged objects and the background, and leverages a boundary-consistency loss to explore the structures of object details. We evaluate our network on the benchmark datasets for camouflaged object detection both qualitatively and quantitatively. Experimental results show that our approach outperforms various state-of-the-art methods by a large margin. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
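The core texture statistic, a biased covariance of feature responses, is easy to reproduce. The per-image channel covariance below is a simplified version of what the texture-aware refinement module computes over regions; the affinity and boundary-consistency losses are not shown.

```python
import torch

def channel_covariance(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) -> biased channel covariance (B, C, C)
    b, c, h, w = feat.shape
    x = feat.flatten(2)                      # (B, C, N), N = H*W
    x = x - x.mean(dim=2, keepdim=True)      # zero-mean per channel
    return torch.bmm(x, x.transpose(1, 2)) / (h * w)
```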
- Consistent Intra-Video Contrastive Learning With Asynchronous Long-Term Memory Bank
Authors:
Zelin Chen;Kun-Yu Lin;Wei-Shi Zheng;
Pages: 1168 - 1180 Abstract: Unsupervised representation learning for videos has recently achieved remarkable performance owing to the effectiveness of contrastive learning. Most works on video contrastive learning (VCL) pull all snippets from the same video into the same category, even if some of them are from different actions, leading to temporal collapse, i.e., the snippet representations of a video are invariable with the evolution of time. In this paper, we introduce a novel intra-video contrastive learning (intra-VCL) that further distinguishes intra-video actions to alleviate this issue, which includes an asynchronous long-term memory bank (that caches the representations of all snippets of each video) and mines an extra positive/negative snippet within a video based on the asynchronous long-term memory bank. In addition, since an asynchronous long-term memory bank is required for performing intra-VCL and asynchronous update of the long-term memory leads to inconsistencies when performing contrastive learning, we further propose a consistent contrastive module (CCM) to perform consistent intra-VCL. Specifically, in the CCM, we propose an intra-video self-attention refinement function to reduce the inconsistencies within the asynchronously updated representations (of all snippets of each video) in the long-term memory and an adaptive loss re-weighting to reduce unreliable self-supervision produced by inconsistent contrastive pairs. We call our method as consistent intra-VCL. Extensive experiments demonstrate the effectiveness of the proposed consistent intra-VCL, which achieves state-of-the-art performance on the standard benchmarks of self-supervised action recognition, with top-1 accuracies of 64.2% and 91.0% on HMDB-51 and UCF-101, respectively. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Towards Zero-Shot Learning: A Brief Review and an Attention-Based Embedding Network
Authors:
Guo-Sen Xie;Zheng Zhang;Huan Xiong;Ling Shao;Xuelong Li;
Pages: 1181 - 1197 Abstract: Zero-shot learning (ZSL), an emerging topic in recent years, targets at distinguishing unseen class images by taking images from seen classes for training the classifier. Existing works often build embeddings between global feature space and attribute space, which, however, neglect the treasure in image parts. Discrimination information is usually contained in the image parts, e.g., black and white striped area of a zebra is the key difference from a horse. As such, image parts can facilitate the transfer of knowledge among the seen and unseen categories. In this paper, we first conduct a brief review on ZSL with detailed descriptions of these methods. Next, to discover meaningful parts, we propose an end-to-end attention-based embedding network for ZSL, which contains two sub-streams: the attention part embedding (APE) stream, and the attention second-order embedding (ASE) stream. APE is used to discover multiple image parts based on attention. ASE is introduced for ensuring knowledge transfer stably by second-order collaboration. Furthermore, an adaptive thresholding strategy is proposed to suppress noise and redundant parts. Finally, a global branch is incorporated for the full use of global information. Experiments on four benchmarks demonstrate that our models achieve superior results under both ZSL and GZSL settings. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- CorrI2P: Deep Image-to-Point Cloud Registration via Dense Correspondence
Authors:
Siyu Ren;Yiming Zeng;Junhui Hou;Xiaodong Chen;
Pages: 1198 - 1208 Abstract: Motivated by the intuition that the critical step of localizing a 2D image in the corresponding 3D point cloud is establishing 2D-3D correspondence between them, we propose the first feature-based dense correspondence framework for addressing the challenging problem of 2D image-to-3D point cloud registration, dubbed CorrI2P. CorrI2P is mainly composed of three modules, i.e., feature embedding, symmetric overlapping region detection, and pose estimation through the established correspondence. Specifically, given a pair of a 2D image and a 3D point cloud, we first transform them into high-dimensional feature spaces and feed the resulting features into a symmetric overlapping region detector to determine the region where the image and point cloud overlap. Then we use the features of the overlapping regions to establish dense 2D-3D correspondence, on which EPnP within RANSAC is performed to estimate the camera pose, i.e., translation and rotation matrices. Experimental results on KITTI and NuScenes datasets show that our CorrI2P outperforms state-of-the-art image-to-point cloud registration methods significantly. The code will be publicly available at https://github.com/rsy6318/CorrI2P. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
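The final pose-estimation step, EPnP inside RANSAC on the established 2D-3D correspondences, maps directly onto OpenCV; the threshold and iteration values below are illustrative, and the correspondence and overlap-detection stages are of course the paper's contribution.

```python
import cv2
import numpy as np

def pose_from_correspondences(pts3d: np.ndarray, pts2d: np.ndarray, K: np.ndarray):
    # pts3d: (N, 3) point-cloud points, pts2d: (N, 2) matched pixels, K: (3, 3) intrinsics
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0, iterationsCount=1000)
    if not ok:
        raise RuntimeError("PnP-RANSAC failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec, inliers
```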
- Local-to-Global Cost Aggregation for Semantic Correspondence
Authors:
Zi Wang;Zhiheng Fu;Yulan Guo;Zhang Li;Qifeng Yu;
Pages: 1209 - 1222 Abstract: Establishing visual correspondences across semantically similar images is challenging due to intra-class variations, viewpoint changes, repetitive patterns, and background clutter. Recent approaches focus on cost aggregation to achieve promising performance. However, these methods fail to jointly utilize local and global cues to suppress unreliable matches. In this paper, we propose a cost aggregation network with convolutions and transformers, dubbed CACT. Different from existing methods, CACT refines the correlation map in a local-to-global manner by utilizing the strengths of convolutions and transformers in different stages. Additionally, considering the bidirectional nature of the correlation map, we propose a dual-path learning framework to work parallelly. Benefiting from the proposed framework, we can use 2D blocks to construct a cost aggregator to improve the efficiency of our model. Experimental results on the SPair-71k, PF-PASCAL, and PF-WILLOW datasets show that the proposed method outperforms the most state-of-the-art methods. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- RGB-T Semantic Segmentation With Location, Activation, and Sharpening
Authors:
Gongyang Li;Yike Wang;Zhi Liu;Xinpeng Zhang;Dan Zeng;
Pages: 1223 - 1235 Abstract: Semantic segmentation is important for scene understanding. To address natural-image scenes captured under adverse illumination conditions, thermal infrared (TIR) images are introduced. Most existing RGB-T semantic segmentation methods follow three cross-modal fusion paradigms, i.e., encoder fusion, decoder fusion, and feature fusion. Some methods, unfortunately, ignore the properties of RGB and TIR features or the properties of features at different levels. In this paper, we propose a novel feature fusion-based network for RGB-T semantic segmentation, named LASNet, which follows three steps of location, activation, and sharpening. The highlight of LASNet is that we fully consider the characteristics of cross-modal features at different levels, and accordingly propose three specific modules for better segmentation. Concretely, we propose a Collaborative Location Module (CLM) for high-level semantic features, aiming to locate all potential objects. We propose a Complementary Activation Module for middle-level features, aiming to activate exact regions of different objects. We propose an Edge Sharpening Module (ESM) for low-level texture features, aiming to sharpen the edges of objects. Furthermore, in the training phase, we attach a location supervision and an edge supervision after CLM and ESM, respectively, and impose two semantic supervisions in the decoder part to facilitate network convergence. Experimental results on two public datasets demonstrate the superiority of our LASNet over relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/LASNet. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Iterative Class Prototype Calibration for Transductive Zero-Shot Learning
Authors:
Hairui Yang;Baoli Sun;Baopu Li;Caifei Yang;Zhihui Wang;Jenhui Chen;Lei Wang;Haojie Li;
Pages: 1236 - 1246 Abstract: Zero-shot learning (ZSL) typically suffers from the domain shift issue since the projected feature embedding of unseen samples mismatch with the corresponding class semantic prototypes, making it very challenging to fine-tune an optimal visual-semantic mapping for the unseen domain. Some existing transductive ZSL methods solve this problem by introducing unlabeled samples of the unseen domain, in which the projected features of unseen samples are still not discriminative and tend to be distributed around prototypes of seen classes. Therefore, how to effectively align the projection features of samples in unseen classes with corresponding predefined class prototypes is crucial for promoting the generalization of ZSL models. In this paper, we propose a novel Iterative Class Prototype Calibration (ICPC) framework for transductive ZSL which consists of a pseudo-labeling stage and a model retraining stage to address the above key issue. First, in the labeling stage, we devise a Class Prototype Calibration (CPC) module to calibrate the predefined class prototypes of the unseen domain by estimating the real center of projected feature distribution, which achieves better matching of sample points and class prototypes. Next, in the retraining stage, we devise a Certain Samples Screening (CSS) module to select relatively certain unseen samples with high confidence and align them with predefined class prototypes in the embedding space. A progressive training strategy is adopted to select more certain samples and update the proposed model with augmented training data. Extensive experiments on AwA2, CUB, and SUN datasets demonstrate that the proposed scheme achieves new state-of-the-art in the conventional setting under both standard split (SS) and proposed split (PS). PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
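A bare-bones version of the prototype calibration idea: shift each unseen-class prototype toward the centre of the projected features currently assigned to it. The momentum update is an illustrative choice rather than the paper's exact CPC rule, and the certain-sample screening step is not shown.

```python
import torch

def calibrate_prototypes(proj_feats, pseudo_labels, prototypes, momentum=0.9):
    # proj_feats: (N, D) projected unseen-domain features, pseudo_labels: (N,), prototypes: (C, D)
    updated = prototypes.clone()
    for c in range(prototypes.size(0)):
        members = proj_feats[pseudo_labels == c]
        if members.numel() > 0:
            updated[c] = momentum * prototypes[c] + (1.0 - momentum) * members.mean(dim=0)
    return updated
```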
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Authors:
Linsen Song;Wayne Wu;Chaoyou Fu;Chen Change Loy;Ran He;
Pages: 1247 - 1261 Abstract: Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges to design a method for UGC: 1) the appearances of speakers are diverse and arbitrary as the method needs to generalize across users; 2) the available video data of one speaker are very limited. In order to tackle the above challenges, we first introduce a new Style Translation Network to integrate the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module. It enables our model to quickly adapt to a new speaker. Then, we further develop a semi-parametric video renderer, which takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, generating more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles, yet with considerably lower amount of training data and training time in comparison to existing methods. Besides, our method achieves a faster testing speed than most recent methods. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
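The cross-modal AdaIN step, re-normalizing source-content features with the statistics of the target speaker's style features, follows the standard adaptive instance normalization formula; the (B, C, T) sequence layout below is an assumption, not the paper's exact module.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    # content, style: (B, C, T) feature sequences
    c_mean, c_std = content.mean(dim=2, keepdim=True), content.std(dim=2, keepdim=True)
    s_mean, s_std = style.mean(dim=2, keepdim=True), style.std(dim=2, keepdim=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean
```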
- Fast CNN-Based Single-Person 2D Human Pose Estimation for Autonomous Systems
Authors:
Christos Papaioannidis;Ioannis Mademlis;Ioannis Pitas;
Pages: 1262 - 1275 Abstract: This paper presents a novel Convolutional Neural Network (CNN) architecture for 2D human pose estimation from RGB images that balances between high 2D human pose/skeleton estimation accuracy and rapid inference. Thus, it is suitable for safety-critical embedded AI scenarios in autonomous systems, where computational resources are typically limited and fast execution is often required, but accuracy cannot be sacrificed. The architecture is composed of a shared feature extraction backbone and two parallel heads attached on top of it: one for 2D human body joint regression and one for global human body structure modelling through Image-to-Image Translation (I2I). A corresponding multitask loss function allows training of the unified network for both tasks, through combining a typical 2D body joint regression with a novel I2I term. Along with enhanced information flow between the parallel neural heads via skip synapses, this strategy is able to extract both ample semantic and rich spatial information, while using a less complex CNN; thus it permits fast execution. The proposed architecture is evaluated on public 2D human pose estimation datasets, achieving the best accuracy-speed ratio compared to the state-of-the-art. Additionally, it is evaluated on a pedestrian intention recognition task for self-driving cars, leading to increased accuracy and speed in comparison to competing approaches. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- PointFilterNet: A Filtering Network for Point Cloud Denoising
Authors:
Xingtao Wang;Xiaopeng Fan;Debin Zhao;
Pages: 1276 - 1290 Abstract: Point clouds obtained by 3D scanning or reconstruction are usually accompanied by noise. Filtering-based point cloud denoising methods are simple and effective, but they are limited by the manually defined coefficients. Deep learning has shown excellent ability in automatically learning parameters. In this paper, a filtering network named PointFilterNet (PFN for short) is proposed to denoise point clouds by the combination of filtering and deep learning. Instead of directly outputting denoised points using networks, PFN generates the filtering denoised points through learned coefficients. Specifically, PFN outputs three coefficient vectors. These coefficient vectors are then used to filter the coordinates of points in the neighborhood of the noisy point. PFN consists of two parts: an outlier recognizer and a denoiser, both of which generate different but indispensable filtering coefficients. The outlier recognizer reduces the interference of outliers by assigning small coefficients to them. The denoiser is designed to progressively denoise point clouds which accords to the perception process of the human visual system. The experiments on synthetic and real scanned point clouds demonstrate that PFN outperforms state-of-the-art point cloud denoising works. Compared with DMR, the Chamfer distance of PFN is reduced by 22.07% on PCN-dataset. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
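The filtering step itself, denoising a point as a coefficient-weighted combination of its neighbors' coordinates, reduces to a few lines once the coefficients are predicted. The softmax normalization here is an assumption, not necessarily how PFN constrains its coefficient vectors.

```python
import torch

def filter_point(neighbors: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    # neighbors: (K, 3) coordinates around a noisy point, coeffs: (K,) learned filtering coefficients
    weights = torch.softmax(coeffs, dim=0)                  # outliers should receive small weights
    return (weights.unsqueeze(1) * neighbors).sum(dim=0)    # denoised 3D position
```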
- Switch and Refine: A Long-Term Tracking and Segmentation Framework
Authors:
Xiang Xu;Jian Zhao;Jianmin Wu;Furao Shen;
Pages: 1291 - 1304 Abstract: In long-term video object tracking (VOT) tasks, most long-term trackers are modified from short-term trackers, which contain more and more machine learning modules to improve their performance. However, we empirically find that more modules do not necessarily lead to better results. In this paper, we make the long-term tracking framework simple by carefully selecting the cutting-edge trackers. Specifically, we propose a new long-term VOT framework that combines the benefits of two mainstream short-term tracking pipelines, i.e., the discriminative online tracker and the one-shot Siamese tracker, with a global re-detector awakened when the target is lost. Such a framework fully exploits existing advanced works from three complementary perspectives. Experimental results show that by exploiting the capabilities of existing methods instead of designing new neural networks, we can still achieve remarkable results on seven long-term VOT datasets. By introducing a continuous adjustable speed control parameter, our tracker reaches 20+FPS with only a small performance loss. The refine module not only improves the bounding box estimations but also outputs segmentation masks, so that our framework can handle the video object segmentation (VOS) tasks by using only VOT trackers. We obtain a trade-off between time and accuracy on two representative VOS datasets by only using bounding boxes as the initial input. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning
Authors:
Anyang Tong;Chao Tang;Wenjian Wang;
Pages: 1305 - 1319 Abstract: Semi-supervised learning for video action recognition is a very challenging research area. Existing state-of-the-art methods perform data augmentation on the temporality of actions, which are combined with the mainstream consistency-based semi-supervised learning framework FixMatch for action recognition. However, these approaches have the following limitations: (1) data augmentation based on video clips lacks coarse-grained and fine-grained representations of actions in temporal sequences, and the models have difficulty understanding synonymous representations of actions in different motion phases. (2) Pseudo labeling selection based on constant thresholds lacks a "make-up curriculum" for difficult actions, which results in low utilization of unlabeled data corresponding to difficult actions. To address the above shortcomings, we propose a semi-supervised action recognition algorithm based on temporal augmentation using curriculum learning (TACL). Compared to previous works, TACL explores different representations of the same semantics of actions in temporal sequences for video and uses the idea of curriculum learning (CL) to reduce the difficulty of the model training process. First, for different action expressions with the same semantics, we designed the temporal action augmentation (TAA) for videos to obtain coarse-grained and fine-grained action expressions based on constant-velocity and hetero-velocity methods, respectively. Second, we construct a temporal signal to constrain the model such that fine-grained action expressions containing different movement phases have the same prediction results, and achieve action consistency learning (ACL) by combining the label and pseudo-label signals. Finally, we propose action curriculum pseudo labeling (ACPL), a loosely and strictly parallel dynamic threshold evaluation algorithm for selecting and labeling unlabeled data. We evaluate TACL on three standard public datasets: UCF101, HMDB51, and Kinetics. The experiments show that TACL significantly improves the accuracy of models trained on a small amount of labeled data and better evaluates the learning effects for different actions. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Pareto Refocusing for Drone-View Object Detection
Authors:
Jiaxu Leng;Mengjingcheng Mo;Yinghua Zhou;Chenqiang Gao;Weisheng Li;Xinbo Gao;
Pages: 1320 - 1334 Abstract: Drone-view Object Detection (DOD) is a meaningful but challenging task. It hits a bottleneck due to two main reasons: (1) The high proportion of difficult objects (e.g., small objects, occluded objects, etc.) makes the detection performance unsatisfactory. (2) The unevenly distributed objects make detection inefficient. These two factors also lead to a phenomenon, obeying the Pareto principle, that some challenging regions occupying a low area proportion of the image have a significant impact on the final detection while the vanilla regions occupying the major area have a negligible impact due to the limited room for performance improvement. Motivated by the human visual system that naturally attempts to invest unequal energies in things of hierarchical difficulty for recognizing objects effectively, this paper presents a novel Pareto Refocusing Detection (PRDet) network that distinguishes the challenging regions from the vanilla regions under reverse-attention guidance and refocuses the challenging regions with the assistance of the region-specific context. Specifically, we first propose a Reverse-attention Exploration Module (REM) that excavates the potential position of difficult objects by suppressing the features which are salient to the commonly used detector. Then, we propose a Region-specific Context Learning Module (RCLM) that learns to generate specific contexts for strengthening the understanding of challenging regions. It is noteworthy that the specific context is not shared globally but unique for each challenging region with the exploration of spatial and appearance cues. Extensive experiments and comprehensive evaluations on the VisDrone2021-DET and UAVDT datasets demonstrate that the proposed PRDet can effectively improve the detection performance, especially for those difficult objects, outperforming state-of-the-art detectors. Furthermore, our method also achieves significant performance improvements on the DTU-Drone dataset for power inspection. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
- Learning to Reduce Scale Differences for Large-Scale Invariant Image Matching
Authors:
Yujie Fu;Pengju Zhang;Bingxi Liu;Zheng Rong;Yihong Wu;
Pages: 1335 - 1348 Abstract: Most image matching methods perform poorly when encountering large scale changes in images. To solve this problem, we propose a Scale-Difference-Aware Image Matching method (SDAIM) that reduces image scale differences before local feature extraction, via resizing both images of an image pair according to an estimated scale ratio. In order to accurately estimate the scale ratio for the proposed SDAIM, we propose a Covisibility-Attention-Reinforced Matching module (CVARM) and then design a novel neural network, termed as Scale-Net, based on CVARM. The proposed CVARM can lay more stress on covisible areas within the image pair and suppress the distraction from those areas visible in only one image. Quantitative and qualitative experiments confirm that the proposed Scale-Net has higher scale ratio estimation accuracy and much better generalization ability compared with all the existing scale ratio estimation methods. Further experiments on image matching and relative pose estimation tasks demonstrate that our SDAIM and Scale-Net are able to greatly boost the performance of representative local features and state-of-the-art local feature matching methods. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
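Once the scale ratio between the two images has been estimated (e.g., by Scale-Net), reducing the scale difference is just a pair of resizes before local feature extraction. Splitting the correction evenly between the two images (square root of the ratio) is an illustrative choice, not necessarily the paper's.

```python
import cv2

def reduce_scale_difference(img_a, img_b, scale_ratio: float):
    # scale_ratio ~ apparent scale of img_a divided by that of img_b
    s = scale_ratio ** 0.5
    img_a = cv2.resize(img_a, None, fx=1.0 / s, fy=1.0 / s, interpolation=cv2.INTER_AREA)
    img_b = cv2.resize(img_b, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
    return img_a, img_b
```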
- Weakly Supervised Pedestrian Segmentation for Person Re-Identification
Authors:
Ziqi Jin;Jinheng Xie;Bizhu Wu;Linlin Shen;
Pages: 1349 - 1362 Abstract: Person re-identification (ReID) is an important problem in intelligent surveillance and public security. Among all the solutions to this problem, existing mask-based methods first use a well-pretrained segmentation model to generate a foreground mask, in order to exclude the background from ReID. Then they perform the ReID task directly on the segmented pedestrian image. However, such a process requires extra datasets with pixel-level semantic labels. In this paper, we propose a Weakly Supervised Pedestrian Segmentation (WSPS) framework to produce the foreground mask directly from the ReID datasets. In contrast, our WSPS only requires image-level subject ID labels. To better utilize the pedestrian mask, we also propose the Image Synthesis Augmentation (ISA) technique to further augment the dataset. Experiments show that the features learned from our proposed framework are robust and discriminative. Compared with the baseline, the mAP of our framework is about 4.4%, 11.7%, and 4.0% higher on three widely used datasets including Market-1501, CUHK03, and MSMT17. The code will be available soon. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
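A minimal sketch of an image-synthesis style augmentation under the assumption that ISA composites a masked pedestrian onto a background taken from another image; the exact synthesis procedure in the paper may differ.

    import numpy as np

    def synthesize(person_img, person_mask, background_img):
        """person_img, background_img: (H, W, 3) uint8; person_mask: (H, W) in {0, 1}."""
        m = person_mask[..., None].astype(np.float32)
        out = person_img.astype(np.float32) * m + background_img.astype(np.float32) * (1.0 - m)
        return out.astype(np.uint8)

    h, w = 256, 128
    person = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
    mask = (np.random.rand(h, w) > 0.5).astype(np.uint8)   # placeholder for a WSPS mask
    bg = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
    print(synthesize(person, mask, bg).shape)               # (256, 128, 3)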
- Weakly-Supervised Semantic Feature Refinement Network for MMW Concealed
Object Detection-
Authors:
Shuiping Gou;Xinlin Wang;Shasha Mao;Licheng Jiao;Zhen Liu;Yinghai Zhao;
Pages: 1363 - 1373 Abstract: Concealed object detection in millimeter-wave human body images is a challenging task due to noise and dim, small objects. Exploiting spatial dependencies to mine the difference between objects and noise is vital for discriminating objects, yet most approaches ignore the context around the object. In this paper, a concealed object detection framework based on structural context is proposed to suppress noise interference and refine localizable semantic features. The framework consists of two subnetworks: structural region-based multi-scale weakly supervised feature refinement and local context-based concealed object detection. The multi-scale weakly supervised feature refinement is constructed to learn position-aware semantics of objects of various sizes while suppressing background noise in structural regions. Specifically, a multi-scale pooling method is proposed to better localize objects of different sizes, and an object-activated region enhancement module is designed to strengthen object semantic representations and suppress background interference. Moreover, an adaptive local context aggregation module is designed to integrate the local context around the bounding box in concealed object detection, which improves the discrimination of the model for dim, small objects. Experimental results on the AMMW and PMMW datasets demonstrate that the proposed approach improves detection performance with lower false alarm rates. A sketch of multi-scale pooling appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
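A minimal sketch of multi-scale pooling over a feature map, an SPP-like stand-in for the paper's multi-scale pooling; the kernel sizes and aggregation by concatenation are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def multi_scale_pool(feat, sizes=(3, 5, 7)):
        """feat: (B, C, H, W) -> concatenation of the input and its pooled versions."""
        pooled = [feat]
        for k in sizes:
            pooled.append(F.avg_pool2d(feat, kernel_size=k, stride=1, padding=k // 2))
        return torch.cat(pooled, dim=1)   # larger kernels respond to larger objects

    x = torch.randn(2, 64, 32, 32)
    print(multi_scale_pool(x).shape)      # torch.Size([2, 256, 32, 32])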
- Self-Attention Memory-Augmented Wavelet-CNN for Anomaly Detection
-
Authors:
Kun Wu;Lei Zhu;Weihang Shi;Wenwu Wang;Jin Wu;
Pages: 1374 - 1385 Abstract: Anomaly detection plays an important role in manufacturing quality control and assurance. Among approaches adopting computer vision techniques, reconstruction-based methods learn a content-aware mapping function that transfers abnormal regions to normal regions in an unsupervised manner. Such methods usually have difficulty improving both the reconstruction quality and the capacity for anomaly discovery. We observe that high-level semantic contextual features demonstrate a strong ability for anomaly discovery, while variational features help to preserve fine image details. Inspired by this observation, we propose a new anomaly detection model that utilizes features for different purposes depending on their frequency characteristics. The 2D discrete wavelet transform (DWT) is introduced to obtain the low-frequency and high-frequency components of features, which are further used to generate the two essential features following different routing paths in our encoder. To further improve the capacity for anomaly discovery, we propose a novel feature augmentation module informed by a customized self-attention mechanism. Extensive experiments are conducted on two popular datasets, MVTec AD and BTAD. The experimental results illustrate that the proposed method outperforms other state-of-the-art approaches in terms of image-level AUROC. In particular, our method achieves a 100% image-level AUROC score on 8 out of 15 classes of the MVTec dataset. A sketch of the wavelet-based frequency split appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
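A minimal sketch of splitting a feature map into low- and high-frequency parts with a one-level Haar DWT, a simple stand-in for the 2D-DWT routing described above; how the two branches are processed afterwards is not shown.

    import torch

    def haar_dwt2(x):
        """x: (B, C, H, W) with even H, W. Returns (low, high) where high stacks LH/HL/HH."""
        a = x[:, :, 0::2, 0::2]
        b = x[:, :, 0::2, 1::2]
        c = x[:, :, 1::2, 0::2]
        d = x[:, :, 1::2, 1::2]
        ll = (a + b + c + d) / 2.0                     # low-frequency (approximation)
        lh = (a - b + c - d) / 2.0
        hl = (a + b - c - d) / 2.0
        hh = (a - b - c + d) / 2.0
        return ll, torch.cat([lh, hl, hh], dim=1)      # route low/high through different paths

    low, high = haar_dwt2(torch.randn(2, 32, 64, 64))
    print(low.shape, high.shape)                       # (2, 32, 32, 32) (2, 96, 32, 32)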
- A Softmax-Free Loss Function Based on Predefined Optimal-Distribution of
Latent Features for Deep Learning Classifier-
Authors:
Qiuyu Zhu;Xuewen Zu;
Pages: 1386 - 1397 Abstract: In pattern classification, deep learning classifiers are mostly trained end to end, and the loss function constrains only the final output (posterior probability) of the network, so the presence of Softmax is essential. In end-to-end learning there is usually no effective loss function that relies entirely on the features of the intermediate layers to constrain learning, so the distribution of sample latent features is not optimal and there is still room to improve classification accuracy. Based on the concept of Predefined Evenly-Distributed Class Centroids (PEDCC), this article proposes a Softmax-free loss function based on a predefined optimal distribution of latent features, POD Loss. The loss function only constrains the latent features of the samples, including the norm-adaptive cosine distance between a sample's latent feature vector and the center of its predefined evenly-distributed class, and the correlation between the latent features of the samples. Finally, cosine distance is used for classification. Compared with the commonly used Softmax Loss, several typical Softmax-related loss functions, and PEDCC-Loss, experiments with several typical deep learning classification networks on several commonly used datasets show that the classification performance of POD Loss is consistently and significantly better and that it converges more easily. Code is available at https://github.com/TianYuZu/POD-Loss. A sketch of a PEDCC-style cosine-distance loss appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
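A minimal sketch of a cosine-distance loss toward predefined class centroids, assuming the evenly distributed centroids have already been generated offline as in PEDCC; the norm-adaptive weighting and feature decorrelation terms of the actual POD Loss are omitted.

    import torch
    import torch.nn.functional as F

    def pod_like_loss(latent, labels, centroids):
        """latent: (B, D); labels: (B,); centroids: (K, D), fixed and L2-normalized."""
        z = F.normalize(latent, dim=1)
        cos_sim = z @ centroids.t()                                   # (B, K) cosine similarities
        target_sim = cos_sim[torch.arange(len(labels)), labels]
        return (1.0 - target_sim).mean()                              # pull features toward their centroid

    K, D, B = 10, 128, 4
    centroids = F.normalize(torch.randn(K, D), dim=1)                 # placeholder for real PEDCC centroids
    loss = pod_like_loss(torch.randn(B, D), torch.randint(0, K, (B,)), centroids)
    print(loss.item())   # at inference, classify by the nearest centroid in cosine distance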
- RFS-Net: Railway Track Fastener Segmentation Network With Shape Guidance
-
Authors:
Shixiang Su;Songlin Du;Xuan Wei;Xiaobo Lu;
Pages: 1398 - 1412 Abstract: The fastener is one of the main components of a rail track system. In recent years, deep learning methods such as image segmentation have greatly boosted fastener state detection. However, there is still a need to improve segmentation accuracy and speed, especially for fasteners in complex environments. To handle this problem, a fast and accurate fastener semantic segmentation network named RFS-Net is proposed based on shape guidance, which offers a better speed/accuracy trade-off via a very shallow architecture. Specifically, in the encoder, a two-stream structure (i.e., a regular stream and a shape stream) that processes the fastener image and a shape image in parallel is introduced. The shape image is created from the geometric structure of the fastener and serves as the input to the shape stream to guide the segmentation of the fastener. The decoder integrates deep features from the two-stream encoder and then recovers the shape information through shape attention blocks with skip connections. We provide two versions of RFS-Net: RFS-Net_S (1.0M parameters, 1014 FPS) and RFS-Net_L (12.01M parameters, 453 FPS) on an NVIDIA RTX 3060. Experimental results demonstrate the effectiveness of our method, achieving a promising trade-off between accuracy and inference speed. In particular, our method is faster and more accurate on a challenging dataset, from the fast mode (1014 FPS for RFS-Net_S versus 724 FPS for Segmenter) to high-quality segmentation (better performance than STDC by nearly one percentage point: 92.36% versus 91.48% Mean IoU). A sketch of a two-stream encoder with shape guidance appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
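A minimal sketch of a two-stream block in which a shape stream gates the regular stream, a simplified stand-in for RFS-Net's shape attention; the layer widths and the sigmoid gating are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TwoStreamBlock(nn.Module):
        def __init__(self, c_in, c_out):
            super().__init__()
            self.regular = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
            self.shape = nn.Sequential(nn.Conv2d(1, c_out, 3, padding=1), nn.ReLU(inplace=True))

        def forward(self, image_feat, shape_img):
            f = self.regular(image_feat)
            s = self.shape(shape_img)
            return f * torch.sigmoid(s)   # shape-guided attention on the regular stream

    block = TwoStreamBlock(3, 16)
    out = block(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
    print(out.shape)   # torch.Size([1, 16, 128, 128])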
- INENet: Inliers Estimation Network With Similarity Learning for Partial
Overlapping Registration-
Authors:
Yue Wu;Yue Zhang;Xiaolong Fan;Maoguo Gong;Qiguang Miao;Wenping Ma;
Pages: 1413 - 1426 Abstract: Point cloud registration is a key problem in applying computer vision to robotics, autonomous driving and other fields. However, because objects are partially occluded or the resolutions of 3D scanners differ, point clouds collected from the same scene may be inconsistent and even incomplete. Inspired by recently proposed learning-based approaches, we propose the Inliers Estimation Network (INENet), which includes a self-designed threshold prediction network and a probability estimation network with adaptive similarity mutual attention to help find the overlapping area of the point clouds. To solve the above problems, we divide the partially overlapping point cloud registration task into two sub-tasks: overlapping area detection and registration. The threshold prediction network automatically calculates the threshold according to the input point clouds, and the probability estimation network then estimates the overlapping points using this threshold. The advantages of the proposed approach are: (1) the threshold prediction network avoids the bias and complexity of manually adjusting the threshold; (2) the probability estimation network with a similarity matrix can deeply fuse the information between a pair of point clouds, which helps improve accuracy; (3) INENet can be easily integrated into other overlap-sensitive algorithms without adjusting parameters. We conduct experiments on the ModelNet40, S3DIS and 3DMatch datasets. Specifically, the rotation error of the registration algorithm integrated with INENet is improved by at least 25% compared with direct partial-overlap registration, and our method improves the $F_{1}$ score by 5% and has better anti-noise ability compared with existing overlap detection methods, showing the effectiveness of the proposed method. A sketch of threshold-based inlier selection appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
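A minimal sketch of the overlap-detection step: keep only points whose predicted overlap probability exceeds a predicted threshold, then hand the surviving points to any registration method. The probabilities and threshold are placeholders for the outputs of an INENet-like model.

    import numpy as np

    def select_overlap(points, overlap_prob, threshold):
        """points: (N, 3); overlap_prob: (N,) in [0, 1]; threshold: scalar from a predictor."""
        mask = overlap_prob > threshold
        return points[mask], mask

    pts = np.random.rand(1000, 3)
    prob = np.random.rand(1000)                                  # stand-in for predicted probabilities
    inliers, mask = select_overlap(pts, prob, threshold=0.6)     # threshold would come from the network
    print(inliers.shape, int(mask.sum()))                        # registration then runs on `inliers`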
- A Regularized Projection-Based Geometry Compression Scheme for LiDAR Point
Cloud-
Authors:
Youguang Yu;Wei Zhang;Ge Li;Fuzheng Yang;
Pages: 1427 - 1437 Abstract: Owing to their ability to depict large-scale 3D scenes, point clouds acquired by Light Detection And Ranging (LiDAR) devices have played an indispensable role in various fields. The growing data volume of point clouds, however, brings huge challenges to existing point cloud processing networks, and developing point cloud compression algorithms has become an active research area in recent years. Representative compression frameworks include the MPEG Geometry-based Point Cloud Compression (G-PCC) standard, in which a dedicated profile is designed for spinning LiDAR point clouds. In that design, prior knowledge of the LiDAR device is used to project points to nodes in a predictive structure that better reflects the spatial correlation of LiDAR point clouds. In this paper, an analysis is conducted to explain the observed irregular point distribution in the predictive structure. A regularized projection algorithm is then proposed to construct a reliable prediction relationship in the predictive structure. Simplified geometry prediction techniques are further proposed based on the regularized projection pattern. Experimental results show that an average BD-rate gain of 18% can be achieved with lower encoding runtime compared with MPEG G-PCC. A sketch of LiDAR range-image projection appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
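A minimal sketch of projecting a spinning-LiDAR point cloud onto a (laser ring, azimuth) grid using assumed sensor priors; this is the generic projection the paper's regularized scheme builds on, and the beam count, elevation angles, and bin count are illustrative.

    import numpy as np

    def project_to_grid(points, elev_angles_deg, azimuth_bins=2048):
        """points: (N, 3) xyz. Assign each point to its nearest laser ring and azimuth bin."""
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        r_xy = np.sqrt(x * x + y * y) + 1e-9
        elev = np.degrees(np.arctan2(z, r_xy))                           # vertical angle per point
        ring = np.abs(elev[:, None] - elev_angles_deg[None, :]).argmin(axis=1)
        azi = (np.degrees(np.arctan2(y, x)) + 360.0) % 360.0
        col = (azi / 360.0 * azimuth_bins).astype(int) % azimuth_bins
        return ring, col

    pts = np.random.randn(10000, 3)
    rings = np.linspace(-25.0, 15.0, 64)                                 # assumed 64-beam sensor
    ring, col = project_to_grid(pts, rings)
    print(ring.shape, col.min(), col.max())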
- Temporal Multimodal Graph Transformer With Global-Local Alignment for
Video-Text Retrieval-
Authors:
Zerun Feng;Zhimin Zeng;Caili Guo;Zheng Li;
Pages: 1438 - 1453 Abstract: Video-text retrieval is a crucial task and a powerful application of multimedia data analysis that has attracted tremendous interest in the research community. The core steps are feature representation and alignment, which must overcome the heterogeneous gap between videos and texts. Existing methods not only take advantage of the multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods are deficient in three respects: a) the semantic correlations between different modal features are not considered, which introduces irrelevant noise into the feature representations; b) the cross-modal relations and temporal associations are ambiguously learned by a single self-attention operation; c) the training signal to optimize the semantic topic assignment for local alignment is missing. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to effectively learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between residuals belonging to the same topics are minimized. To optimize the assignments, a minimum-entropy regularization term is proposed for training the overall framework. Experiments are carried out on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance. A sketch of residual-based local alignment appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
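A minimal sketch of topic-based local alignment: video and text tokens are soft-assigned to shared topic vectors and the per-topic residuals are compared. This is an illustrative reading of the abstract, not the exact model; the temperature, normalization, and loss form are assumptions.

    import torch
    import torch.nn.functional as F

    def topic_residuals(feats, topics, temperature=0.1):
        """feats: (N, D); topics: (K, D). Returns per-topic aggregated residuals (K, D)."""
        assign = F.softmax(feats @ topics.t() / temperature, dim=1)     # (N, K) soft assignment
        resid = feats.unsqueeze(1) - topics.unsqueeze(0)                # (N, K, D)
        return (assign.unsqueeze(-1) * resid).sum(dim=0)                # (K, D)

    K, D = 8, 64
    topics = torch.randn(K, D)                                          # shared semantic topics
    v = topic_residuals(torch.randn(20, D), topics)                     # video tokens
    t = topic_residuals(torch.randn(12, D), topics)                     # text tokens
    local_align_loss = (F.normalize(v, dim=1) - F.normalize(t, dim=1)).pow(2).sum(dim=1).mean()
    print(local_align_loss.item())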
- ERM: Energy-Based Refined-Attention Mechanism for Video Question Answering
-
Authors:
Fuwei Zhang;Ruomei Wang;Fan Zhou;Yuanmao Luo;
Pages: 1454 - 1467 Abstract: Spatiotemporal attention learning remains a challenging part of video question answering (VideoQA), as it requires a sufficient understanding of cross-modal spatiotemporal information. Existing methods usually leverage different cross-modal attention mechanisms to reveal potential associations between the video and the question. While these methods effectively remove irrelevant information from the spatiotemporal attention, they ignore the pseudo-related information within the cross-modal interaction attention. To address this problem, we propose a novel energy-based refined-attention mechanism (ERM). ERM uses the distribution of significant differences, derived from question-guided cross-modal interaction information, as a discriminative criterion to separate question-related from question-irrelevant cross-modal interaction information. Concretely, it measures the linear separability between a target neuron and the other neurons in the network to determine the importance of each neuron. In addition, to address the statistical bias caused by the differences between modalities in video tasks, the proposed ERM has learnable parameters, through which the correlation between modalities can be learned adaptively. The advantages of the proposed ERM are that it is flexible and modular while remaining lightweight. With the help of the ERM, we construct a lightweight VideoQA model that efficiently integrates cross-modal feature representations in an energy-based manner. To evaluate the effectiveness of our method, we carried out extensive experiments on five publicly available datasets and compared it with state-of-the-art VideoQA methods. The results demonstrate that our method brings a noticeable performance improvement over state-of-the-art VideoQA methods, and ERM can be flexibly integrated into different VideoQA methods to improve their question-answering performance. A sketch of an energy-based attention weight appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
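A minimal sketch of an energy-style attention weight in the spirit described above: a neuron that is easy to separate from the others in its channel receives a larger weight. This is a SimAM-like stand-in for illustration, without ERM's learnable cross-modal parameters.

    import torch

    def energy_attention(x, lam=1e-4):
        """x: (B, C, H, W). Returns re-weighted features of the same shape."""
        n = x.shape[2] * x.shape[3] - 1
        mu = x.mean(dim=(2, 3), keepdim=True)
        d = (x - mu).pow(2)
        var = d.sum(dim=(2, 3), keepdim=True) / n
        inv_energy = d / (4.0 * (var + lam)) + 0.5     # larger for more linearly separable neurons
        return x * torch.sigmoid(inv_energy)

    x = torch.randn(2, 16, 14, 14)
    print(energy_attention(x).shape)                   # torch.Size([2, 16, 14, 14])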
- Learning Features of Intra-Consistency and Inter-Diversity: Keys Toward
Generalizable Deepfake Detection-
Authors:
Han Chen;Yuzhen Lin;Bin Li;Shunquan Tan;
Pages: 1468 - 1480 Abstract: Public concern about deepfake face forgery has risen continually in recent years. Most deepfake detection approaches attempt to learn discriminative features between real and fake faces through end-to-end trained deep neural networks. However, the majority of them suffer from poor generalization across different data sources, forgery methods, and/or post-processing operations. In this paper, following the simple but effective principle in discriminative representation learning of learning features with intra-class consistency and inter-class diversity, we leverage a novel transformer-based self-supervised learning method and an effective data augmentation strategy for generalizable deepfake detection. Considering that the differences between real and fake images are often subtle and local, the proposed method first utilizes Self Prediction Learning (SPL) to learn rich hidden representations by predicting masked patches at a pre-training stage; intra-class consistency cues can thus be mined from images without deepfake labels. After pre-training, the discrimination model is fine-tuned via multi-task learning, including a deepfake classification task and a forgery mask estimation task. This is facilitated by our new data augmentation method, the Adjustable Forgery Synthesizer (AFS), which conveniently simulates the process of synthesizing deepfake images with various levels of visual realism in an explicit manner. AFS greatly reduces overfitting due to insufficient diversity in the training data. Comprehensive experiments demonstrate that our method outperforms state-of-the-art competitors on several popular benchmark datasets in terms of generalization to unseen forgery methods and untrained datasets. A sketch of masked-patch pre-training appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
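A minimal sketch of the masked-patch pretext idea: hide random patches and train a model to predict the missing pixels, which forces it to learn consistent facial statistics. The patch size, mask ratio, and the hypothetical reconstruction network `net` are assumptions for illustration.

    import torch

    def mask_patches(img, patch=16, ratio=0.5):
        """img: (B, 3, H, W) with H, W divisible by patch. Returns masked image and the keep-mask."""
        B, _, H, W = img.shape
        gh, gw = H // patch, W // patch
        keep = (torch.rand(B, 1, gh, gw) > ratio).float()
        mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
        return img * mask, mask

    x = torch.rand(4, 3, 224, 224)
    masked, mask = mask_patches(x)
    # pretext loss with a reconstruction network `net` (hypothetical):
    #   loss = ((net(masked) - x) ** 2 * (1 - mask)).mean()
    print(masked.shape, mask.mean().item())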
- DFCE: Decoder-Friendly Chrominance Enhancement for HEVC Intra Coding
-
Authors:
Renwei Yang;Hewei Liu;Shuyuan Zhu;Xiaozhen Zheng;Bing Zeng;
Pages: 1481 - 1486 Abstract: We propose a decoder-friendly chrominance enhancement method for compressed images. The method is built on a luminance-guided chrominance enhancement network (LGCEN) and online learning. With LGCEN, the textures of the compressed chrominance components are enhanced under the guidance of the luminance component. Moreover, LGCEN adopts a recursive design and a light-weight channel attention mechanism to achieve high performance at low complexity, and it is deployed at both the encoder and decoder sides. Given the input image, we train LGCEN at the encoder side using online learning, partially updating the network parameters and transmitting them to the decoder to update the LGCEN deployed there. The adoption of online learning effectively reduces the workload of the decoder and guarantees high robustness. Compared with state-of-the-art methods, our proposed approach achieves superior performance. A sketch of luminance-guided chrominance enhancement appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
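A minimal sketch of luminance-guided chrominance enhancement: a small CNN refines the compressed Cb/Cr planes using the Y plane as guidance. This is an illustrative stand-in, not LGCEN itself, and it omits the recursion, channel attention, and online parameter updates.

    import torch
    import torch.nn as nn

    class LumaGuidedChromaNet(nn.Module):
        def __init__(self, c=16):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(3, c, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c, 2, 3, padding=1))

        def forward(self, y, cbcr):
            """y: (B, 1, H, W); cbcr: (B, 2, H, W), already upsampled to the luma resolution."""
            return cbcr + self.body(torch.cat([y, cbcr], dim=1))   # predict a chroma residual

    net = LumaGuidedChromaNet()
    out = net(torch.rand(1, 1, 128, 128), torch.rand(1, 2, 128, 128))
    print(out.shape)   # torch.Size([1, 2, 128, 128])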
- Comment-Guided Semantics-Aware Image Aesthetics Assessment
-
Authors:
Yuzhen Niu;Shanshan Chen;Bingrui Song;Zhixian Chen;Wenxi Liu;
Pages: 1487 - 1492 Abstract: Existing image aesthetics assessment methods mainly rely on the visual features of images but ignore their rich semantics. Nowadays, with the widespread use of social media, the comments that accompany images can be easily accessed and provide rich semantic information, which can effectively complement image features. This paper proposes a comment-guided semantics-aware image aesthetics assessment method built upon a multi-task learning framework for image aesthetics prediction and comment-guided semantics classification. To assist image aesthetics assessment, we first model the semantics of an image as the topic features of its corresponding comments using Latent Dirichlet Allocation. We then propose a two-stream multi-task learning framework for both topic feature prediction and aesthetic score distribution prediction. The topic feature prediction task makes it possible to infer the semantics from images alone, since comments are usually unavailable during inference and comment-guided semantics can only serve as supervision during training. We further propose to deeply fuse aesthetic and semantic features using a layer-wise feature fusion method. Experimental results demonstrate that the proposed method outperforms state-of-the-art image aesthetics assessment methods. A sketch of the two-task training objective appears after this entry. PubDate:
March 2023
Issue No: Vol. 33, No. 3 (2023)
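A minimal sketch of a two-task objective consistent with the description above: one head predicts the aesthetic score distribution and another regresses the LDA topic proportions derived from comments. The loss forms, heads, and weighting are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def multitask_loss(pred_scores, true_scores, pred_topics, true_topics, w=0.5):
        """pred_scores/true_scores: (B, 10) score histograms; *_topics: (B, T) topic proportions."""
        log_p = F.log_softmax(pred_scores, dim=1)
        dist_loss = F.kl_div(log_p, true_scores, reduction='batchmean')     # score distribution task
        topic_loss = F.mse_loss(F.softmax(pred_topics, dim=1), true_topics) # topic prediction task
        return dist_loss + w * topic_loss

    B, T = 8, 20
    loss = multitask_loss(torch.randn(B, 10), F.softmax(torch.randn(B, 10), dim=1),
                          torch.randn(B, T), F.softmax(torch.randn(B, T), dim=1))
    print(loss.item())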