IEEE Transactions on Circuits and Systems for Video Technology
Journal Prestige (SJR): 0.977
Citation Impact (CiteScore): 5
Number of Followers: 33  
 
  Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1051-8215
Published by IEEE
  • IEEE Transactions on Circuits and Systems for Video Technology Publication
           Information

      Abstract: Presents a listing of the editorial board, board of governors, current staff, committee members, and/or society editors for this issue of the publication.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • IEEE Circuits and Systems Society Information

      Abstract: Presents a listing of the editorial board, board of governors, current staff, committee members, and/or society editors for this issue of the publication.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Rethinking Camouflaged Object Detection: Models and Datasets

      Authors: Hongbo Bi;Cong Zhang;Kang Wang;Jinghui Tong;Feng Zheng;
      Pages: 5708 - 5724
      Abstract: Camouflaged object detection (COD) is an emerging visual detection task which aims to locate and distinguish disguised targets in complex backgrounds by imitating the human visual detection system. Recently, COD has attracted increasing attention in computer vision, and a few camouflaged object detection models have been successfully explored. However, most existing works focus primarily on building new COD models rather than analyzing existing COD structures in depth. To the best of our knowledge, a systematic review of COD has not been publicly reported, especially for recently proposed deep learning-based COD models. To fill this gap, we first present a comprehensive review of both COD models and public benchmark datasets and provide potential directions for future COD studies. Specifically, we conduct a comprehensive summary of 39 existing COD models from 1998 to 2021. Then, to facilitate subsequent research on COD, we classify the existing structures into two categories: 27 traditional handcrafted feature-based structures and 12 deep learning-based structures. In addition, we further group the traditional handcrafted feature-based structures into six sub-classes according to the detection mechanism: texture, color, motion, intensity, optical flow, and multi-modal fusion. Furthermore, we carry out an in-depth analysis of the deep learning-based structures in terms of both detection motivation and detection performance and evaluate the performance of each structure. Moreover, we summarize four widely used COD datasets and describe the details of each one. Finally, we discuss the limitations of COD and the corresponding solutions to improve detection accuracy, as well as the relevant applications of camouflaged object detection and its future research directions, to promote the development of camouflaged object detection.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Reversible Data Hiding for Color Images Based on Adaptive
           Three-Dimensional Histogram Modification

      Authors: Qi Chang;Xiaolong Li;Yao Zhao;
      Pages: 5725 - 5735
      Abstract: Reversible data hiding (RDH) is a hot research topic, and many related techniques are proposed, but only a few are devised for color images. Most current RDH schemes, including those for color images, follow a well-developed framework: histogram-based embedding, which consists of two main steps, i.e., prediction-error histogram (PEH) generation by a pixel predictor and PEH modification through exploring efficient reversible mappings. The reversible mappings employed in these RDH approaches for color images, on the other hand, are empirically designed ignoring the specific image content, resulting in limited embedding performance. To address this issue, a novel RDH method for color images based on adaptive mapping selection is proposed in this paper. First, to leverage high inter-channel correlation of color images, a three-dimensional (3D) PEH is generated. Then, an effective reversible mapping selection mechanism is proposed, in which 3D mappings are adjusted in an ordered iterative manner according to PEH frequency ranking so that the embedding performance is optimized. By the proposed approach, the optimal reversible mapping can be acquired with low computing complexity and better embedding performance, and its efficiency is experimentally validated in comparison to various state-of-the-art studies.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
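
A rough sketch of the 3D prediction-error histogram idea described in this abstract, using a deliberately simple left-neighbor predictor (an assumption for illustration only; the paper's predictor and its mapping-selection mechanism are more elaborate):

```python
import numpy as np

def prediction_error_histogram_3d(img, bins=7):
    """Build a 3D prediction-error histogram for an RGB image.

    Each pixel is predicted from its left neighbor (a deliberately simple
    predictor used only for illustration); the (eR, eG, eB) error triple of
    the three channels indexes one 3D histogram bin. Errors are clipped to
    [-bins//2, bins//2] before binning.
    """
    img = img.astype(np.int32)
    pred = np.roll(img, 1, axis=1)          # left-neighbor prediction
    err = img[:, 1:, :] - pred[:, 1:, :]    # skip the first column (no left neighbor)
    half = bins // 2
    err = np.clip(err, -half, half) + half  # shift errors to [0, bins-1]
    hist = np.zeros((bins, bins, bins), dtype=np.int64)
    np.add.at(hist, (err[..., 0].ravel(), err[..., 1].ravel(), err[..., 2].ravel()), 1)
    return hist

# Tiny demo on random data
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
h = prediction_error_histogram_3d(demo)
print(h.shape, h.sum())  # (7, 7, 7), number of counted pixels
```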
       
  • Unsupervised Image Restoration With Quality-Task-Perception Loss

      Authors: Wei Xu;Xinyuan Chen;Haoming Guo;Xiaolin Huang;Wei Liu;
      Pages: 5736 - 5747
      Abstract: Image restoration includes various kinds of tasks, such as image denoising, image deraining and low-light image enhancement, etc. Due to the domain shift problem of current supervised methods, researchers tend to adopt unsupervised image restoration methods. However, fake color or blur image, insufficient restoration and missing semantic information are three common problems when utilizing these methods. In this paper, we propose a new hybrid loss named Quality-Task-Perception (QTP) to deal with these three problems simultaneously. Specifically, this hybrid loss includes three components: quality, task and perception. The quality part overcomes the fake color or blur image problem by enforcing image quality scores of the restored images and those of the unpaired clean images to be similar. For the task part, we tackle the insufficient restoration problem by proposing to apply a task probability network to convert the unsupervised image restoration into a supervised classification problem, and this task probability network is learned from our proposed pipeline. The perception part handles the missing semantic information by restricting the multi-scale phase consistency between the degraded image and its restored version. Comprehensive experiments on both supervised and unsupervised datasets in three image restoration tasks demonstrate the superiority of our proposed approach.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
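
A hedged sketch of how a three-part quality/task/perception loss could be assembled; `quality_net`, `task_net`, `feat_net`, and the loss weights are illustrative placeholders, not the paper's exact components:

```python
import torch
import torch.nn.functional as F

def hybrid_qtp_loss(restored, degraded, quality_net, task_net, feat_net,
                    clean_quality_target, w_q=1.0, w_t=1.0, w_p=0.1):
    """Combine quality, task, and perception terms into one scalar loss.

    quality_net predicts a no-reference quality score, task_net outputs
    two-class logits (clean vs. degraded), and feat_net extracts features
    used for a consistency ("perception") term; all three are stand-ins.
    """
    # Quality term: push restored-image quality scores toward clean-image scores.
    q_loss = F.mse_loss(quality_net(restored), clean_quality_target)

    # Task term: the task network should classify the restored image as "clean" (label 1).
    logits = task_net(restored)
    t_loss = F.cross_entropy(
        logits, torch.ones(logits.shape[0], dtype=torch.long, device=logits.device))

    # Perception term: keep high-level content of restored and degraded images consistent.
    p_loss = F.l1_loss(feat_net(restored), feat_net(degraded).detach())

    return w_q * q_loss + w_t * t_loss + w_p * p_loss
```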
       
  • Auto-Perceiving Correlation Filter for UAV Tracking

      Authors: Lei Wang;Jianan Li;Bo Huang;Junjie Chen;Xiangmin Li;Jihui Wang;Tingfa Xu;
      Pages: 5748 - 5761
      Abstract: Discriminative correlation filter (DCF)-based methods have demonstrated superior performance in UAV tracking via fusing multiple types of features and updating models online. However, most DCF-based trackers simply cascade different features, failing to fully take advantage of their complementary strength. In addition, online update strategies are limited to using a single and fixed learning rate, which often leads to model degradation when suffering tracking challenges. In this paper, we present an Auto-Perceiving Correlation Filter (APCF) which explicitly models the target and context with a novel Target State and Background Perception (TSBP) feature. Concretely, we first propose a simple yet effective State Evaluation Metric (SEM) to estimate target states by analyzing the spatial distribution of responses. Based on SEM, we extract TSBP features by adaptively selecting effective features depending on the current target state. Accordingly, a new online model update strategy is also introduced to avoid model degradation. Moreover, we further introduce a perception regularization term to make the extracted feature emphasis more on the target rather than background. Extensive experiments on four widely-used UAV benchmarks have well demonstrated the superiority of the proposed method compared with both DCF and deep learning based trackers while running at a high speed of 76.7 FPS on a single CPU. In addition, APCF with deep features also performs favorably against state-of-the-art trackers.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Adaptive Path Selection for Dynamic Image Captioning

      Authors: Tiantao Xian;Zhixin Li;Zhenjun Tang;Huifang Ma;
      Pages: 5762 - 5775
      Abstract: Image captioning is a challenging task, i.e., given an image machine automatically generates natural language that matches its semantic content and has attracted much attention in recent years. However, most existing models are designed manually, and their performance depends heavily on the expert experience of the designer. In addition, the computational flow of the model is predefined, and hard and easy samples will share the same coding path and easily interfere with each other, thus confusing the learning of the model. In this paper, we propose a Dynamic Transformer to change the encoding procedure from sequential to adaptive, i.e., data-dependent computing paths. Specifically, we design three different types of visual feature extraction blocks and deploy them in parallel at each layer to construct a multi-layer routing space in a fully connected manner. Each block contains a calculation unit that performs the corresponding operations and a routing gate that learns to adaptively select the direction to pass the signal based on the input image. Thus, our model can achieve a robust visual representation by exploring potential visual feature extraction paths. We evaluate our method quantitatively and qualitatively using a benchmark MSCOCO image caption dataset and perform extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method is significantly superior to previous state-of-the-art methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
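
The data-dependent routing can be pictured as a soft gate mixing the outputs of parallel blocks per input. The sketch below is a minimal stand-in (linear blocks, softmax gate), not the paper's Transformer-based visual extraction blocks:

```python
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    """One routing layer: parallel feature blocks mixed by an input-dependent gate."""

    def __init__(self, dim, num_blocks=3):
        super().__init__()
        # Illustrative parallel blocks; the paper deploys three kinds of visual extractors.
        self.blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                                     for _ in range(num_blocks)])
        self.gate = nn.Linear(dim, num_blocks)  # routing gate over the blocks

    def forward(self, x):                                 # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)     # (batch, num_blocks)
        outs = torch.stack([b(x) for b in self.blocks], -1)  # (batch, dim, num_blocks)
        return (outs * weights.unsqueeze(1)).sum(-1)      # weighted mix per sample

layer = RoutedLayer(dim=16)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```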
       
  • EDFLOW: Event Driven Optical Flow Camera With Keypoint Detection and
           Adaptive Block Matching

      Authors: Min Liu;Tobi Delbruck;
      Pages: 5776 - 5789
      Abstract: Event cameras such as the Dynamic Vision Sensor (DVS) are useful because of their low latency, sparse output, and high dynamic range. In this paper, we propose a DVS+FPGA camera platform and use it to demonstrate the hardware implementation of event-based corner keypoint detection and adaptive block-matching optical flow. To adapt the sample rate dynamically, events are accumulated in event slices using the area event count slice exposure method. The area event count is feedback-controlled by the average optical flow matching distance. Corners are detected by streaks of accumulated events on event slice rings of radius 3 and 4 pixels. Corner detection takes about 6 clock cycles (16 MHz event rate at the 100 MHz clock frequency). At the corners, flow vectors are computed in 100 clock cycles (1 MHz event rate). The multiscale block match size is $25\times 25$ pixels and the flow vectors span up to a 30-pixel match distance. The FPGA processes the sum-of-absolute-distance block matching at 123 GOp/s, the equivalent of 1230 Op/clock cycle. EDFLOW is several times more accurate on MVSEC drone and driving optical flow benchmarking sequences than the previous best DVS FPGA optical flow implementation, and achieves similar accuracy to the CNN-based EV-FlowNet, although it burns about 100 times less power. The EDFLOW design and benchmarking videos are available at https://sites.google.com/view/edflow21/home.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
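
The sum-of-absolute-differences block matching at the core of the flow computation can be sketched in software; the 25x25 block and 30-pixel search range follow the numbers in the abstract, while the exhaustive loop below stands in for the FPGA's multiscale pipeline:

```python
import numpy as np

def sad_block_match(prev_slice, curr_slice, y, x, block=25, search=30):
    """Find the displacement of the block centered at (y, x) by exhaustive SAD matching."""
    h = block // 2
    ref = prev_slice[y - h:y + h + 1, x - h:x + h + 1].astype(np.int32)
    best, best_dy, best_dx = None, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            cand = curr_slice[yy - h:yy + h + 1, xx - h:xx + h + 1].astype(np.int32)
            if cand.shape != ref.shape:      # skip candidates falling off the slice
                continue
            sad = np.abs(ref - cand).sum()
            if best is None or sad < best:
                best, best_dy, best_dx = sad, dy, dx
    return best_dy, best_dx                  # flow vector for this keypoint

rng = np.random.default_rng(1)
a = rng.integers(0, 16, (128, 128), dtype=np.uint8)
b = np.roll(a, (3, -5), axis=(0, 1))         # shift the scene by a known amount
print(sad_block_match(a, b, 64, 64))         # expected (3, -5)
```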
       
  • An Automatically Layer-Wise Searching Strategy for Channel Pruning Based
           on Task-Driven Sparsity Optimization

      Authors: Kai-Yuan Feng;Xia Fei;Maoguo Gong;A. K. Qin;Hao Li;Yue Wu;
      Pages: 5790 - 5802
      Abstract: Deep convolutional neural networks (CNNs) have achieved tremendous successes but tend to suffer from high computation costs, mainly due to heavy over-parameterization, which makes it difficult to apply them directly to the ever-growing application demands of low-end edge devices with strict power restrictions and real-time inference requirements. Recently, much research attention has been devoted to compressing networks via pruning to address this issue. Most of the existing methods rely on hand-designed pruning rules, which suffer from several limitations. First, manually designed rules are only applicable to limited application scenarios and can hardly generalize to a broader scope. Moreover, these rules are typically designed based on human experience and via trial and error, and are thus highly subjective. Furthermore, channels of different layers in a network may have diverse distributions, which means the same pruning rule is not appropriate for every layer. To address these limitations, we propose a novel channel pruning scheme in which task-irrelevant channels are removed in a task-driven manner. Specifically, an adaptively differentiable search module is proposed to find the best pruning rule automatically for different layers in CNNs under sparsity constraints. In addition, we employ knowledge distillation to alleviate excessive performance loss. Once the training process is finished, a compact network is obtained by removing channels based on the layer-wise pruning rules. We have evaluated the proposed method on several well-known benchmark datasets, including CIFAR, MNIST, and ImageNet, in comparison to several state-of-the-art pruning methods. Experimental results demonstrate the superiority of our method over the compared ones in terms of both parameter and FLOPs reduction.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
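
As a rough point of reference for channel pruning in general (not the paper's task-driven differentiable search), the sketch below scores a convolution's output channels by the L1 norm of their filters and keeps a chosen fraction per layer:

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Return a new Conv2d keeping only the output channels with the largest L1 filter norm.

    This is a simple magnitude criterion used for illustration; the paper instead
    searches a layer-wise pruning rule under sparsity constraints.
    """
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per out-channel
    k = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(importance, k).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

conv = nn.Conv2d(16, 64, 3, padding=1)
print(prune_conv_channels(conv, keep_ratio=0.5))  # Conv2d(16, 32, ...)
```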
       
  • Attentive Feature Augmentation for Long-Tailed Visual Recognition

      Authors: Weiqiu Wang;Zhicheng Zhao;Pingyu Wang;Fei Su;Hongying Meng;
      Pages: 5803 - 5816
      Abstract: Deep neural networks have achieved great success on many visual recognition tasks. However, training data with a long-tailed distribution dramatically degenerates the performance of recognition models. In order to relieve this imbalance problem, an effective Long-Tailed Visual Recognition (LTVR) framework is proposed based on learned balance and robust features under long-tailed distribution circumstances. In this framework, a plug-and-play Attentive Feature Augmentation (AFA) module is designed to mine class-related and variation-related features of original samples via a novel hierarchical channel attention mechanism. Then, those features are aggregated to synthesize fake features to cope with the imbalance of the original dataset. Moreover, a Lay-Back Learning Schedule (LBLS) is developed to ensure a good initialization of feature embedding. Extensive experiments are conducted with a two-stage training method to verify the effectiveness of the proposed framework on both feature learning and classifier rebalancing in the long-tailed image recognition task. Experimental results show that, when trained with imbalanced datasets, the proposed framework achieves superior performance over the state-of-the-art methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • A Robust Coverless Steganography Based on Generative Adversarial Networks
           and Gradient Descent Approximation

      Authors: Fei Peng;Guanfu Chen;Min Long;
      Pages: 5817 - 5829
      Abstract: Aiming at resolving the problem of the irreversibility in some common neural networks for secret data extraction, a novel image steganography framework is proposed based on the generator of GAN (Generative Adversarial Networks) and gradient descent approximation. During data embedding, the secret data is first mapped into a stego noise vector by a specific mapping rule, and it is input into the generator of a GAN to produce a stego image. The data extraction is accomplished by iteratively updating the noise vector using the gradient descent with the generator. When the error is declined within the allowable error, the output image of the generator is approximate to the stego image, and the updated noise vector will also approach to the stego noise vector. Finally, the secret data is extracted from the updated noise vector. Experiments and analysis with WGAN-GP (Wasserstein GAN-Gradient Penalty) show that it can achieve good performance in extraction accuracy, capacity and robustness. Furthermore, the discussions also illustrate its good generalization with different GAN models and image datasets.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
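
The extraction step, recovering the stego noise vector by gradient descent against the generator output, can be sketched as follows; the generator and the bit-to-noise mapping are placeholders rather than the paper's WGAN-GP setup:

```python
import torch
import torch.nn.functional as F

def recover_noise(generator, stego_image, z_dim=128, steps=500, lr=0.05):
    """Approximate the stego noise vector by minimizing ||G(z) - stego_image||.

    generator is any differentiable module mapping a (1, z_dim) noise vector to an
    image tensor; gradient descent is run on z while the generator stays fixed.
    """
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), stego_image)
        loss.backward()
        opt.step()
    return z.detach()   # the hidden bits would then be decoded from this vector

# Toy check with a linear "generator": the recovery error shrinks as z approaches the true vector.
g = torch.nn.Linear(128, 64 * 64)
with torch.no_grad():
    z_true = torch.randn(1, 128)
    stego = g(z_true)
z_rec = recover_noise(g, stego)
print(F.mse_loss(z_rec, z_true).item())  # recovery error; decreases with more steps
```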
       
  • Designing CNNs for Multimodal Image Restoration and Fusion via Unfolding
           the Method of Multipliers

      Authors: Iman Marivani;Evaggelia Tsiligianni;Bruno Cornelis;Nikos Deligiannis;
      Pages: 5830 - 5845
      Abstract: Multimodal, alias, guided, image restoration is the reconstruction of a degraded image from a target modality with the aid of a high quality image from another modality. A similar task is image fusion; it refers to merging images from different modalities into a composite image. Traditional approaches for multimodal image restoration and fusion include analytical methods that are computationally expensive at inference time. Recently developed deep learning methods have shown a great performance at a reduced computational cost; however, since these methods do not incorporate prior knowledge about the problem at hand, they result in a “black box” model, that is, one can hardly say what the model has learned. In this paper, we formulate multimodal image restoration and fusion as a coupled convolutional sparse coding problem, and adopt the Method of Multipliers (MM) for its solution. Then, we use the MM-based solution to design a convolutional neural network (CNN) encoder that follows the principle of deep unfolding. To address multimodal image restoration and fusion, we design two multimodal models which employ the proposed encoder followed by an appropriately designed decoder that maps the learned representations to the desired output. Unlike most existing deep learning designs comprising multiple encoding branches followed by a concatenation or a linear combination fusion block, the proposed design provides an efficient and structured way to fuse information at different stages of the network, providing representations that can lead to accurate image reconstruction. The proposed models are applied to three image restoration tasks, as well as two image fusion tasks. Quantitative and qualitative comparisons against various state-of-the-art analytical and deep learning methods corroborate the superior performance of the proposed framework.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Towards More Realistic Human Motion Prediction With Attention to Motion
           Coordination

      Authors: Pengxiang Ding;Jianqin Yin;
      Pages: 5846 - 5858
      Abstract: Joint relation modeling is a crucial component in human motion prediction. Most existing methods rely on skeletal-based graphs to build the joint relations, where local interactive relations between joint pairs are well learned. However, the motion coordination, a global joint relation reflecting the simultaneous cooperation of all joints, is usually weakened because it is learned from part to whole progressively and asynchronously. Thus, the final predicted motions usually appear unrealistic. To tackle this issue, we learn a medium, called coordination attractor (CA), from the spatiotemporal features of motion to characterize the global motion features, which is subsequently used to build new relative joint relations. Through the CA, all joints are related simultaneously, and thus the motion coordination of all joints can be better learned. Based on this, we further propose a novel joint relation modeling module, the Comprehensive Joint Relation Extractor (CJRE), to combine this motion coordination with the local interactions between joint pairs in a unified manner. Additionally, we also present a Multi-timescale Dynamics Extractor (MTDE) to extract enriched dynamics from the raw position information for effective prediction. Extensive experiments show that the proposed framework outperforms state-of-the-art methods in both short- and long-term predictions on H3.6M, CMU-Mocap, and 3DPW.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Large-Scale Crowdsourced Subjective Assessment of Picturewise Just
           Noticeable Difference

      Authors: Hanhe Lin;Guangan Chen;Mohsen Jenadeleh;Vlad Hosu;Ulf-Dietrich Reips;Raouf Hamzaoui;Dietmar Saupe;
      Pages: 5859 - 5873
      Abstract: The picturewise just noticeable difference (PJND) for a given image, compression scheme, and subject is the smallest distortion level that the subject can perceive when the image is compressed with this compression scheme. The PJND can be used to determine the compression level at which a given proportion of the population does not notice any distortion in the compressed image. To obtain accurate and diverse results, the PJND must be determined for a large number of subjects and images. This is particularly important when experimental PJND data are used to train deep learning models that can predict a probability distribution model of the PJND for a new image. To date, such subjective studies have been carried out in laboratory environments. However, the number of participants and images in all existing PJND studies is very small because of the challenges involved in setting up laboratory experiments. To address this limitation, we develop a framework to conduct PJND assessments via crowdsourcing. We use a new technique based on slider adjustment and a flicker test to determine the PJND. A pilot study demonstrated that our technique could decrease the study duration by 50% and double the perceptual sensitivity compared to the standard binary search approach that successively compares a test image side by side with its reference image. Our framework includes a robust and systematic scheme to ensure the reliability of the crowdsourced results. Using 1,008 source images and distorted versions obtained with JPEG and BPG compression, we apply our crowdsourcing framework to build the largest PJND dataset, KonJND-1k (Konstanz just noticeable difference 1k dataset). A total of 503 workers participated in the study, yielding 61,030 PJND samples that resulted in an average of 42 samples per source image. The KonJND-1k dataset is available at http://database.mmsp-kn.de/konjnd-1k-database.html
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
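
For context, the standard binary-search protocol that the proposed slider-and-flicker test is compared against can be sketched as below, with `subject_notices` standing in for the human side-by-side judgment:

```python
def pjnd_binary_search(subject_notices, min_level=0, max_level=100):
    """Binary search for the smallest distortion level the subject can perceive.

    subject_notices(level) -> bool is assumed monotone: once a distortion level
    is noticeable, every higher level is noticeable too.
    """
    lo, hi = min_level, max_level        # invariant: the PJND lies in (lo, hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if subject_notices(mid):
            hi = mid                     # still noticeable: PJND is at most mid
        else:
            lo = mid                     # not noticeable yet: PJND is above mid
    return hi

# Simulated subject whose true PJND level is 37.
print(pjnd_binary_search(lambda level: level >= 37))   # -> 37
```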
       
  • High-Capacity Framework for Reversible Data Hiding in Encrypted Image
           Using Pixel Prediction and Entropy Encoding

      Authors: Yingqiang Qiu;Qichao Ying;Yuyan Yang;Huanqiang Zeng;Sheng Li;Zhenxing Qian;
      Pages: 5874 - 5887
      Abstract: While the existing reserving room before encryption (RRBE) based reversible data hiding in encrypted image (RDHEI) schemes can achieve decent embedding capacity, the capacity of the existing vacating room by encryption (VRBE) based schemes is relatively low. To address this issue, this paper proposes a generalized framework for high-capacity RDHEI for both the RRBE and VRBE cases. First, an efficient embedding room generation algorithm (ERGA) is designed to produce large embedding room using pixel prediction and entropy encoding. Then, we propose two RDHEI schemes, one for RRBE, another for VRBE. In the RRBE scenario, the image owner generates the embedding room with ERGA and encrypts the preprocessed image using stream cipher with two encryption keys. Then, the data hider locates the embedding room and embeds the additional encrypted data. In the VRBE scenario, the cover image is encrypted by an improved block modulation and permutation encryption algorithm, where the spatial redundancy in the plain-text image is greatly preserved. Then, the data hider applies ERGA on the encrypted image to generate the embedding room and conducts data embedding. For both schemes, receivers with different authentication keys can conduct either error-free data extraction or error-free image recovery. The experimental results show that the two proposed schemes outperform many state-of-the-art RDHEI schemes. Besides, they can ensure high security level, where the original image can be hardly discovered from the encrypted version before or after data hiding by unauthorized users.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Grayscale-Invariant Reversible Data Hiding Based on Multiple Histograms
           Modification

      Authors: Chun-Liang Jhong;Hsin-Lung Wu;
      Pages: 5888 - 5901
      Abstract: Grayscale-invariant reversible data hiding (GI-RDH) in color images is a data embedding framework in which the grayscales of a marked color image must be identical to those of the host color image. Recently, some state-of-the-art GI-RDH schemes were proposed. However, their performance in embedding distortion is unsatisfactory. In order to obtain better image quality, a well-known histogram-shifting-based RDH method called multiple histograms modification (MHM) is considered. In this paper, we propose an MHM-based GI-RDH scheme. First, we modified our previous GI-RDH scheme using a multiple-histogram-shifting approach instead of a difference expansion approach. Next, we designed a procedure to select expansion–bin pairs for generated histograms to achieve low embedding distortion through further data embedding. Specifically, we analyzed the expected embedding distortion of our MHM-based GI-RDH scheme given any set of expansion–bin pairs. We then formulated an optimization problem called the GI-MHM minimization problem to identify the optimal expansion–bin pairs for further embedding tasks. Finally, we generated an approximated solution for the GI-MHM minimization problem and conducted the embedding task with these selected expansion–bin pairs. The experimental results revealed that the proposed GI-RDH scheme outperformed previous methods when the embedding capacity was small.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Progressive Dual-Attention Residual Network for Salient Object Detection

      Authors: Liqian Zhang;Qing Zhang;Rui Zhao;
      Pages: 5902 - 5915
      Abstract: Due to the rapid development of deep learning, the performance of salient object detection has been constantly refreshed. Nevertheless, it is still challenging for existing methods to distinguish the location of salient objects and retain fine structural details. In this paper, a novel progressive dual-attention residual network (PDRNet) is proposed to exploit two complementary attention maps to guide residual learning, thus progressively refining prediction in a coarse-to-fine manner. We design a dual-attention residual module (DRM) to achieve residual refinement with the help of the dual attention (DA) scheme. Specifically, an attention map and its corresponding reverse attention map are used to make the network be aware of learning residual details from the perspective of the salient and non-salient regions, thus utilizing their complementarity to correct the mistakes of object parts and boundary details. Besides, a hierarchical feature screening module (HFSM) is designed to capture more powerful global contextual knowledge for locating salient objects. It establishes cross-scale skip connections among multi-scale features and utilizes the intra-channel dependency of these scales to enhance information interaction and feature representation. Extensive experiments have proved that our proposed PDRNet performs favorably against 18 state-of-the-art competitors on five benchmark datasets, demonstrating the effectiveness and superiority of our method.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
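
The dual-attention scheme, an attention map and its reverse guiding residual refinement of a coarse saliency prediction, can be sketched roughly as follows (the two convolution branches are placeholders for the paper's DRM):

```python
import torch
import torch.nn as nn

class DualAttentionResidual(nn.Module):
    """Refine a coarse saliency map with residuals guided by attention and its reverse."""

    def __init__(self, channels):
        super().__init__()
        self.salient_branch = nn.Conv2d(channels, 1, 3, padding=1)      # residual for salient regions
        self.non_salient_branch = nn.Conv2d(channels, 1, 3, padding=1)  # residual for background

    def forward(self, feat, coarse_pred):
        attn = torch.sigmoid(coarse_pred)   # attention map in [0, 1]
        rev_attn = 1.0 - attn               # reverse attention on non-salient regions
        residual = attn * self.salient_branch(feat) + rev_attn * self.non_salient_branch(feat)
        return coarse_pred + residual       # refined prediction (still a logit map)

m = DualAttentionResidual(channels=32)
print(m(torch.randn(2, 32, 64, 64), torch.randn(2, 1, 64, 64)).shape)  # (2, 1, 64, 64)
```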
       
  • Progressive Meta-Learning With Curriculum

      Authors: Ji Zhang;Jingkuan Song;Lianli Gao;Ye Liu;Heng Tao Shen;
      Pages: 5916 - 5930
      Abstract: Meta-learning offers an effective solution to learn new concepts under scarce supervision through an episodic-training scheme: a series of target-like tasks sampled from base classes are sequentially fed into a meta-learner to extract cross-task knowledge, which can facilitate the quick acquisition of task-specific knowledge of the target task with few samples. Despite its noticeable improvements, the episodic-training strategy samples tasks randomly and uniformly, without considering their hardness and quality, which may not progressively improve the meta-leaner’s generalization. In this paper, we propose Progressive Meta-learning using tasks from easy to hard. First, based on a predefined curriculum, we develop a Curriculum-Based Meta-learning (CubMeta) method. CubMeta is in a stepwise manner, and in each step, we design a BrotherNet module to establish harder tasks and an effective learning scheme for obtaining an ensemble of stronger meta-learners. Then we move a step further to propose an end-to-end Self-Paced Meta-learning (SepMeta) method. The curriculum in SepMeta is effectively integrated as a regularization term into the objective so that the meta-learner can measure the hardness of tasks adaptively, according to what the model has already learned. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed methods. Our code is available at https://github.com/nobody-777.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Video Snapshot Compressive Imaging Using Residual Ensemble Network

      Authors: Yubao Sun;Xunhao Chen;Mohan S. Kankanhalli;Qingshan Liu;Junxia Li;
      Pages: 5931 - 5943
      Abstract: Video snapshot compressive imaging (SCI) system enables high-frame-rate imaging by projecting multiple frames into a 2D snapshot measurement during a single exposure, and the original video frames can be reconstructed by solving an optimization problem. However, existing methods usually cannot achieve a good balance between reconstruction time and reconstruction quality, which has become a major obstacle for practical application of video SCI. In order to cope with this issue, we propose a residual ensemble network to learn the explicit inverse mapping from the 2D snapshot measurement to the original video. Specifically, the proposed network aims to exploit the spatiotemporal correlations between video frames for improving reconstruction quality. The spatiotemporal correlations of video frames demonstrate multiple types, including intra-frame spatial correlation, inter-frame forward and backward temporal correlation. With the purpose of fully capturing these differentiated correlations, we design four sub-networks, namely, a pseudo-3D U-shape sub-network, two residual sub-networks, and a serial forward and backward recurrent sub-network, and further assemble these four sub-networks into an ensemble network through alternate residual links. This ensemble network can effectively fuse the predictions of each sub-network and maintain spatiotemporal consistency between video frames. We further design a compound loss function to guide the network learning, and the new video can be fast reconstructed by simply feeding its 2D snapshot measurement into the learned network. The experimental results demonstrate that our network can significantly improve the reconstruction quality while maintaining low computational cost.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
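
The snapshot measurement model that such a network inverts can be written out directly: the frames of one exposure are modulated by per-frame masks and summed into a single 2D measurement. A minimal sketch, assuming binary random masks:

```python
import numpy as np

def sci_snapshot(frames, masks):
    """Forward model of video snapshot compressive imaging.

    frames: (B, H, W) video frames captured within one exposure
    masks:  (B, H, W) per-frame modulation masks (e.g., binary codes)
    returns the single 2D snapshot measurement y = sum_t masks[t] * frames[t]
    """
    return (frames * masks).sum(axis=0)

rng = np.random.default_rng(0)
frames = rng.random((8, 64, 64))             # 8 frames in one exposure
masks = rng.integers(0, 2, (8, 64, 64))      # binary modulation patterns
y = sci_snapshot(frames, masks)
print(y.shape)   # (64, 64); the reconstruction network maps y (and the masks) back to the frames
```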
       
  • Blindly Assess Quality of In-the-Wild Videos via Quality-Aware
           Pre-Training and Motion Perception

      Authors: Bowen Li;Weixia Zhang;Meng Tian;Guangtao Zhai;Xianpei Wang;
      Pages: 5944 - 5958
      Abstract: Perceptual quality assessment of the videos acquired in the wilds is of vital importance for quality assurance of video services. The inaccessibility of reference videos with pristine quality and the complexity of authentic distortions pose great challenges for this kind of blind video quality assessment (BVQA) task. Although model-based transfer learning is an effective and efficient paradigm for the BVQA task, it remains to be a challenge to explore what and how to bridge the domain shifts for better video representation. In this work, we propose to transfer knowledge from image quality assessment (IQA) databases with authentic distortions and large-scale action recognition with rich motion patterns. We rely on both groups of data to learn the feature extractor and use a mixed list-wise ranking loss function to train the entire model on the target VQA databases. Extensive experiments on six benchmarking databases demonstrate that our method performs very competitively under both individual database and mixed databases training settings. We also verify the rationality of each component of the proposed method and explore a simple ensemble trick for further improvement.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Underwater Image Enhancement Quality Evaluation: Benchmark Dataset and
           Objective Metric

      Authors: Qiuping Jiang;Yuese Gu;Chongyi Li;Runmin Cong;Feng Shao;
      Pages: 5959 - 5974
      Abstract: Due to the attenuation and scattering of light by water, there are many quality defects in raw underwater images, such as color casts, decreased visibility, and reduced contrast. Many different underwater image enhancement (UIE) algorithms have been proposed to enhance underwater image quality. However, how to fairly compare the performance among UIE algorithms remains a challenging problem. So far, the lack of a comprehensive human subjective user study with a large-scale benchmark dataset and a reliable objective image quality assessment (IQA) metric makes it difficult to fully understand the true performance of UIE algorithms. In this paper, we make efforts in both the subjective and objective aspects to fill these gaps. Firstly, we construct a new Subjectively-Annotated UIE benchmark Dataset (SAUD) which simultaneously provides real-world raw underwater images, readily available enhanced results produced by representative UIE algorithms, and subjective ranking scores of each enhanced result. Secondly, we propose an effective No-reference (NR) Underwater Image Quality metric (NUIQ) to automatically evaluate the visual quality of enhanced underwater images. Experiments on the constructed SAUD dataset demonstrate the superiority of the proposed NUIQ metric, achieving higher consistency with subjective rankings than 22 mainstream NR-IQA metrics. The dataset and source code will be made available at https://github.com/yia-yuese/SAUD-Dataset.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Dual-Pyramidal Image Inpainting With Dynamic Normalization

      Authors: Chao Wang;Mingwen Shao;Deyu Meng;Wangmeng Zuo;
      Pages: 5975 - 5988
      Abstract: Deep autoencoder-based approaches have achieved significant improvements on restoring damaged images, yet they still suffer from artifacts due to the inadequate representation and inaccurate regularization of existing features. In this paper, we propose a dual-pyramidal inpainting framework called DPNet to address these two limitations, which seamlessly integrates sufficient feature learning and dynamic regularization within an autoencoder network. Specifically, to exhaustively extract multi-scale features, we adopt layer-wise pyramidal convolution in encoder, which provides an arbitrary combination pool of various receptive fields. Subsequently, to tackle the patch deterioration problem in previous cross-scale non-local schemes, we further propose a Pyramidal Attention Mechanism (PAM) in decoder to acquire finer patches directly from learned layers. Mutually benefited with pyramidal features extraction in encoder, the dissemination space for non-local pixels in our PAM is notably enlarged to pyramidal level, thus significantly benefiting the feature representation. Moreover, to avoid the mask error accumulation in existing works, a dynamic normalization mechanism utilizing the spatial mask information updated in encoder is introduced, which further ensures the feature integrity and consistency. Such a dual-pyramidal structure along with dynamic normalization significantly improve the inpainting quality, outperforming existing competitors. Comprehensive experiments conducted on three benchmark datasets demonstrate that our DPNet performs favorably against the state-of-the-arts.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • General Expansion-Shifting Model for Reversible Data Hiding: Theoretical
           Investigation and Practical Algorithm Design

      Authors: Haorui Wu;Xiaolong Li;Xiangyang Luo;Xinpeng Zhang;Yao Zhao;
      Pages: 5989 - 6001
      Abstract: As a specific data hiding technique, reversible data hiding (RDH) has recently received extensive attention. With this technique, both the embedded data and the original cover image can be exactly extracted from the marked image. In our previous work, a general expansion-shifting model for RDH was proposed by introducing the so-called reversible embedding function (REF). With REF, RDH can be designed and the corresponding rate-distortion formulations can be established, providing an approach to optimize the reversible embedding performance. In this paper, by extending our previous work, the optimal REF for a one-dimensional histogram is investigated, and all optimal REF are derived for the case where the maximum modification to the cover pixel is limited to a small value. Moreover, based on the derived optimal REF for one-dimensional histograms and multiple histograms modification, a practical RDH scheme is presented and experimentally verified to outperform some state-of-the-art algorithms in terms of capacity-distortion performance.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Detecting Steganography in JPEG Images Recompressed With the Same
           Quantization Matrix

      Authors: Qingxiao Guan;Kaimeng Chen;Hefeng Chen;Weiming Zhang;Nenghai Yu;
      Pages: 6002 - 6016
      Abstract: JPEG steganalysis aims to detect stego JPEG images. For some robust steganography methods, in order to enhance the robustness of stego images against JPEG recompression by lossy channels such as SNS or photo-sharing websites, the steganographer may intentionally recompress the cover image several times with the quantization matrix of the targeted channel, thereby making it possible to transmit stego data in such channels with better disguise. In addition, a huge number of cover JPEG images may be recompressed for various reasons, such as processing by some tools, so a better steganalysis method for such images is needed. In this paper, we investigate the steganalysis of images recompressed with the same quantization matrix, namely, discriminating recompressed JPEG cover images from their stego images. We present some observed phenomena of recompressed JPEG images and design methods to enhance the sensitivity of feature-based and deep-model-based steganalysis methods for this task. To verify their effectiveness under different acquisition of recompression prior knowledge, we conduct experiments in various settings, including the conventional setting and training with mixed samples of different recompression counts. The results demonstrate that the proposed methods can notably improve detection accuracy on recompressed JPEG images.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Semi-Supervised Action Quality Assessment With Self-Supervised Segment
           Feature Recovery

      Authors: Shao-Jie Zhang;Jia-Hui Pan;Jibin Gao;Wei-Shi Zheng;
      Pages: 6017 - 6028
      Abstract: Action quality assessment aims to evaluate how well an action is performed. Existing methods have achieved remarkable progress on fully-supervised action assessment. However, in real-world applications, where annotation requires expert experience, it is not always feasible to manually label all samples. Therefore, it is important to study the problem of semi-supervised action assessment with only a small number of annotated samples. A major challenge for semi-supervised action assessment is how to exploit the temporal pattern of unlabeled videos. Inspired by the temporal dependencies of action execution, we propose self-supervised learning on the unlabeled videos by recovering the feature of a masked segment of an unlabeled video. Furthermore, we leverage adversarial learning to align the representation distributions of the labeled and unlabeled samples to close their gap in the sample space, since unlabeled samples always come from unseen actions. Finally, we propose an adversarial self-supervised framework for semi-supervised action quality assessment. Extensive experimental results on the MTL-AQA and Rhythmic Gymnastics datasets demonstrate the effectiveness of our framework, which achieves state-of-the-art performance in semi-supervised action quality assessment.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • IRDCLNet: Instance Segmentation of Ship Images Based on Interference
           Reduction and Dynamic Contour Learning in Foggy Scenes

      Authors: Yuxin Sun;Li Su;Yongkang Luo;Hao Meng;Zhi Zhang;Wen Zhang;Shouzheng Yuan;
      Pages: 6029 - 6043
      Abstract: Frequent bad weather at sea severely damages the quality of visual images captured by imaging equipment. Ship instance segmentation in adverse weather conditions remains a major challenge because of poor visibility at sea. Existing approaches for instance segmentation are primarily designed for clear days and rarely consider the aforementioned severe weather. Blurred ship objects can easily cause missed ship detection and decrease the instance segmentation performance on ship images, especially in the case of frequent fog at sea. To this end, we propose a ship instance segmentation framework (IRDCLNet) based on Interference Reduction and Dynamic Contour Learning in foggy scenes. The Interference Reduction Module is proposed to reduce the interference caused by fog and solves the problem of missed ship detection. Meanwhile, we present Dynamic Contour Learning to predict the overall contour of the blurred ships to assist in mask prediction. To handle the scarcity of ocean data in foggy weather, we build the Foggy ShipInsseg dataset, which contains 5,739 real and simulated foggy ship images with 10,900 fine instance mask annotations. Experiments on the Foggy ShipInsseg dataset show that our IRDCLNet outperforms the Mask R-CNN and CondInst baselines and achieves the state-of-the-art performance.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Depth Estimation From a Single Image of Blast Furnace Burden Surface Based
           on Edge Defocus Tracking

      Authors: Jiancai Huang;Zhaohui Jiang;Weihua Gui;Zunhui Yi;Dong Pan;Ke Zhou;Chuan Xu;
      Pages: 6044 - 6057
      Abstract: Continuous and accurate depth information of blast furnace burden surface is important for optimizing charging operations, thereby reducing its energy consumption and CO2 emissions. However, depth estimation for a single image is challenging, especially when estimating the depth of burden surface images in the harsh internal environment of the blast furnace. In this paper, a novel method that is based on edge defocus tracking is proposed to estimate the depth of burden surface images with different morphological characteristics. First, an endoscopic video acquisition system is designed, key frames of burden surface video in stable state are extracted based on feature point optical flow method, and the sparse depth is estimated by using the defocus-based method. Next, the burden surface image is divided into four subregions according to the distribution characteristics of the burden surface, the edge line trajectories and an eight-direction depth gradient template are designed to develop depth propagation rules. Finally, the depth is propagated from edge to the entire image based on edge line tracking method. The experimental results show that the proposed method can accurately and efficiently estimate the depth of the burden surface and provide key data support for optimizing the operation of blast furnace.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Ambiguousness-Aware State Evolution for Action Prediction

      Authors: Lei Chen;Jiwen Lu;Zhanjie Song;Jie Zhou;
      Pages: 6058 - 6072
      Abstract: In this paper, we propose an ambiguousness-aware state evolution (AASE) method which represents the uncertainty of the input sequence and evolves the subsequent skeletons to generate a reasonable full-length sequence for action prediction. Unlike most existing methods that enforce partial sequences with the labels of full-length videos and ignore the semantic information of the subsequent action, we develop an evolution method by predicting the instructional actions and generating the reasonable candidate subsequent actions, so that the ambiguity of the full sequence’s label supervising for the partial actions can be effectively alleviated. Our method generates the rational subsequent actions under the instructional action class to complement the partially observed action sequence. We design two criteria for a rational generation: 1) the instruction of subsequent action keeps the semantic consistency with the observed sequence; 2) the generation sequence is satisfied with the distribution of the sequence of real data. Moreover, we design an uncertainty module to decide the instructional action class for the generation network. AASE predicts instructional actions with uncertainty learning and evolves different instructional actions by generating the subsequent skeletons, which find the most probable action to represent the partially observed action by learning the way of perceiving the tendency of the ongoing action. We conduct experiments on seven widely used action datasets: NTU-60, NTU-120, UCF101, UT-Interaction, BIT, PKU-MMD and HMDB51, and our experimental results clearly demonstrate that our method achieves very competitive performance with state-of-the-art.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Industrial Scene Text Detection With Refined Feature-Attentive Network

      Authors: Tongkun Guan;Chaochen Gu;Changsheng Lu;Jingzheng Tu;Qi Feng;Kaijie Wu;Xinping Guan;
      Pages: 6073 - 6085
      Abstract: Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded surfaces, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods could not locate low-contrast text areas very well. In this paper, we propose a refined feature-attentive network (RFN) to solve the inaccurate localization problem. Specifically, we first design a parallel feature integration mechanism to construct an adaptive feature representation from multi-resolution features, which enhances the perception of multi-scale texts at each scale-specific level to generate a high-quality attention map. Then, an attentive proposal refinement module is developed by the attention map to rectify the location deviation of candidate boxes. Besides, a re-scoring mechanism is designed to select text boxes with the best rectified location. To promote the research towards industrial scene text detection, we contribute two industrial scene text datasets, including a total of 102156 images and 1948809 text instances with various character structures and metal parts. Extensive experiments on our dataset and four public datasets demonstrate that our proposed method achieves the state-of-the-art performance. Both code and dataset are available at: https://github.com/TongkunGuan/RFN.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Spatio-Temporal Player Relation Modeling for Tactic Recognition in Sports
           Videos

      Authors: Longteng Kong;Duoxuan Pei;Rui He;Di Huang;Yunhong Wang;
      Pages: 6086 - 6099
      Abstract: Tactic recognition in sports videos is a challenging task. To address this, we present a novel spatio-temporal relation modeling approach, which captures both detailed player interactions and long-range group dynamics in tactics. In spatial modeling, we propose an Adaptive Graph Convolutional Network (A-GCN), and it represents individual and common patterns of data through local and global graphs to learn diverse player interactions. In temporal modeling, we propose an Attentive Temporal Convolutional Network (A-TCN) and with spatial configurations as input, it builds group dynamics and is robust to redundant content by considering sequence dependencies. Due to adaptive interaction and attentive dynamics modeling, our approach is able to comprehensively describe team cooperation over time in a tactic. We extensively evaluate the proposed approach on the Volleyball dataset and a newly collected VolleyTactic dataset, and the experimental results show its advantage.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Saliency and Granularity: Discovering Temporal Coherence for Video-Based
           Person Re-Identification

      Authors: Cuiqun Chen;Mang Ye;Meibin Qi;Jingjing Wu;Yimin Liu;Jianguo Jiang;
      Pages: 6100 - 6112
      Abstract: Video-based person re-identification (ReID) matches the same people across the video sequences with rich spatial and temporal information in complex scenes. It is highly challenging to capture discriminative information when occlusions and pose variations exist between frames. A key solution to this problem rests on extracting the temporal invariant features of video sequences. In this paper, we propose a novel method for discovering temporal coherence by designing a region-level saliency and granularity mining network (SGMN). Firstly, to address the varying noisy frame problem, we design a temporal spatial-relation module (TSRM) to locate frame-level salient regions, adaptively modeling the temporal relations on spatial dimension through a probe-buffer mechanism. It avoids the information redundancy between frames and captures the informative cues of each frame. Secondly, a temporal channel-relation module (TCRM) is proposed to further mine the small granularity information of each frame, which is complementary to TSRM by concentrating on discriminative small-scale regions. TCRM exploits a one-and-rest difference relation on channel dimension to enhance the granularity features, leading to stronger robustness against misalignments. Finally, we evaluate our SGMN with four representative video-based datasets, including iLIDS-VID, MARS, DukeMTMC-VideoReID, and LS-VID, and the results indicate the effectiveness of the proposed method.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • A Novel Deep Learning Framework for Automatic Recognition of Thyroid Gland
           and Tissues of Neck in Ultrasound Image

      Authors: Laifa Ma;Guanghua Tan;Hongxia Luo;Qing Liao;Shengli Li;Kenli Li;
      Pages: 6113 - 6124
      Abstract: Recognition of thyroid glands and tissues of the neck is vital for screening related diseases in ultrasound videos. This task is subjective, challenging, and dependent on the experience of sonographer in current clinical practice. The purpose is to develop a fully automated thyroid gland and tissues of neck recognition framework to assist doctors in distinguishing the boundaries of different tissues. In this paper, we propose a novel deep learning framework that consists of a feature extraction network, region proposal network, object detection head, and spatial pyramid RoIAlign-based segmentation head. Designed spatial pyramid RoIAlign can efficiently capture local and global context features, and aggregates the multiple context information that makes the result much more reliable. A large dataset is constructed to train the proposed method. The performance is evaluated using the COCO metrics. The experimental results demonstrate that the proposed deep learning method can effectively realize the automatic recognition of the thyroid gland and tissues of neck in ultrasound videos. Considering the clinical practical application scenarios, we developed an automatic recognition system of thyroid and neck tissue based on edge computing, which can expediently assist doctors in distinguishing the boundaries between different tissues.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Multibranch Adversarial Regression for Domain Adaptative Hand Pose
           Estimation

      Authors: Rui Jin;Jing Zhang;Jianyu Yang;Dacheng Tao;
      Pages: 6125 - 6136
      Abstract: Although hand pose estimation has achieved a great success in recent years, there are still challenges with RGB-based estimation tasks, the most significant of which is the absence of labeled training data. At present, the synthetic dataset has plenty of images with accurate annotation, but the difference from real-world datasets affects generalization. Therefore, a transfer learning strategy, which tries to transfer knowledge from a labeled source domain to an unlabeled target domain, is a frequent solution. Existing methods such as mean-teacher, Cyclegan, and MCD will train models with the help of some easily accessible domains such as synthetic data. However, these methods are not guaranteed to operate well in real-world settings due to the domain shift. In this paper, we design a new unsupervised domain adaptation method named Multi-branch Adversarial Regressors (MarsDA) in hand pose estimation, where it could be better for feature migration. Specifically, we first generate pseudo-labels for unlabeled target domain data. Then, the new adversarial training loss between multiple regression branches we designed for hand pose estimation is introduced to narrow the domain gap. In this way, our model can reduce the noise of pseudo labels caused by the domain gap and improve the accuracy of pseudo labels. We evaluate our method on two publicly available real-world datasets, H3D and STB. Experimental results show that our method outperforms existing methods by a large margin.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Finding Stars From Fireworks: Improving Non-Cooperative Iris Tracking

      Authors: Chengdong Lin;Xinlin Li;Zhenjiang Li;Junhui Hou;
      Pages: 6137 - 6147
      Abstract: We revisit the problem of iris tracking with RGB cameras, aiming to obtain iris contours from captured images of eyes. We find the reason that limits the performance of the state-of-the-art method in more general non-cooperative environments, which prohibits a wider adoption of this useful technique in practice. We believe that because the iris boundary could be inherently unclear and blocked, as its pixels occupy only an extremely limited percentage of those on the entire image of the eye, similar to the stars hidden in fireworks, we should not treat the boundary pixels as one class to conduct end-to-end recognition directly. Thus, we propose to learn features from iris and sclera regions first, and then leverage entropy to sketch the thin and sharp iris boundary pixels, where we can trace more precise parameterized iris contours. In this work, we also collect a new dataset by smartphone with 22 K images of eyes from video clips. We annotate a subset of 2 K images, so that label propagation can be applied to further enhance the system performance. Extensive experiments over both public and our own datasets show that our method outperforms the state-of-the-art method. The results also indicate that our method can improve the coarsely labeled data to enhance the iris contour’s accuracy and support the downstream application better than the prior method.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
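      The entropy step above can be illustrated very simply: given per-pixel class probabilities for the iris and sclera regions, the high-entropy pixels concentrate on the thin boundary between them. A minimal sketch follows; the tensor shapes and the normalization are assumptions for illustration only.

      import torch

      def boundary_entropy_map(probs, eps=1e-8):
          """probs: (N, C, H, W) softmax output of the region-feature network."""
          entropy = -(probs * (probs + eps).log()).sum(dim=1)        # (N, H, W)
          return entropy / entropy.amax(dim=(1, 2), keepdim=True)    # scale to [0, 1]

      # Pixels with the largest normalized entropy sketch the thin iris boundary and can
      # then be traced and fitted with a parameterized (e.g., elliptical) contour.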
       
  • Multi-Source Aggregation Transformer for Concealed Object Detection in
           Millimeter-Wave Images


      Authors: Peng Sun;Ting Liu;Xiaotong Chen;Shiyin Zhang;Yao Zhao;Shikui Wei;
      Pages: 6148 - 6159
      Abstract: The active millimeter wave (AMMW) scanner has been widely used for detecting objects concealed underneath a person’s clothing in security inspection and anti-terrorism. However, AMMW images suffer from a low signal-to-noise ratio, motion blur, and small object sizes, making it challenging to detect concealed objects efficiently and accurately. The scanner usually captures a sequence of images in different views around a human body at once, yet existing algorithms only utilize a single image without considering the relationships among images. In this paper, we design a multi-source aggregation transformer (MATR) with two different attention mechanisms to model spatial correlations within an image and contextual interactions across images. Specifically, a self-attention module is introduced to encode local relationships between the region proposals in each image, while a cross-attention mechanism is built to model the cross-correlations between different images. Besides, to handle the problem of small object sizes and suppress the noise in AMMW images, we present a selective context module (SCM), which designs a dynamic selection mechanism to enhance the high-resolution feature with spatial details and make it more distinguishable from the noisy background. Experiments on two AMMW image datasets demonstrate that the proposed methods lead to a remarkable improvement over the previous state of the art and will benefit concealed object detection in practice.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
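      The two attention mechanisms described above, self-attention within an image and cross-attention across views, can be sketched with standard multi-head attention layers. The configuration below is an illustrative assumption (feature dimension, head count, layer arrangement), not the MATR architecture.

      import torch
      import torch.nn as nn

      class CrossViewAttention(nn.Module):
          def __init__(self, dim=256, heads=8):
              super().__init__()
              self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

          def forward(self, props_cur, props_other):
              """props_*: (B, num_proposals, dim) proposal features from two AMMW views."""
              local, _ = self.self_attn(props_cur, props_cur, props_cur)     # within-image relations
              fused, _ = self.cross_attn(local, props_other, props_other)    # cross-image context
              return fused

      # Usage: fused = CrossViewAttention()(proposals_view_i, proposals_view_j)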
       
  • SDTP: Semantic-Aware Decoupled Transformer Pyramid for Dense Image
           Prediction


      Authors: Zekun Li;Yufan Liu;Bing Li;Bailan Feng;Kebin Wu;Chengwei Peng;Weiming Hu;
      Pages: 6160 - 6173
      Abstract: Although transformers have achieved great progress on computer vision tasks, scale variation in dense image prediction remains the key challenge. Few effective multi-scale techniques have been applied in transformers, and there are two main limitations in current methods. On the one hand, the self-attention module in the vanilla transformer fails to sufficiently exploit the diversity of semantic information because of its rigid mechanism. On the other hand, it is difficult to build attention and interaction among different levels due to the heavy computational burden. To alleviate this problem, we first revisit the multi-scale problem in dense prediction, verifying the significance of diverse semantic representations and multi-scale interaction, and exploring the adaptation of the transformer to a pyramidal structure. Inspired by these findings, we propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and an Attention Refinement Function (ARF). ISP explores the semantic diversity in different receptive spaces through a more flexible self-attention strategy. CDI builds global attention and interaction among different levels in a decoupled space, which also solves the problem of heavy computation. Besides, ARF is further added to refine the attention in the transformer. Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state of the art by a significant margin in dense image prediction tasks. Furthermore, the proposed components are all plug-and-play and can be embedded in other methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Temporal Moment Localization via Natural Language by Utilizing Video
           Question Answers as a Special Variant and Bypassing NLP for Corpora


      Authors: Hafiza Sadia Nawaz;Zhensheng Shi;Yanhai Gan;Amanuel Hirpa;Junyu Dong;Haiyong Zheng;
      Pages: 6174 - 6185
      Abstract: Temporal moment localization using natural language (TMLNL) is an emerging problem in computer vision: localizing, inside a long untrimmed video, the specific moment that is most relevant to an input query. Previous research focused on the visual portion of TMLNL, such as objects, backdrops, and other visual attributes, while the textual portion was largely handled with natural language processing (NLP) techniques. A long query requires sufficient context to properly localize moments within a long untrimmed video, so performance deteriorates when queries are not fully understood, especially when the query is long. In this paper, we treat the TMLNL challenge as a special variant of VQA, which gives equal consideration to the visual elements through our proposed VQA joint visual-textual framework (JVTF). We also manage complex and long input queries without employing NLP, by refining coarse-grained query representations into fine-grained ones at distinct granularities. To address the equal importance of videos and queries, we propose a novel bidirectional context predictor network (BCPN) in addition to the JVTF. The BCPN supplies missing context for long input queries through a query handler (QH) and helps the JVTF find the most relevant moment. Previously, increasing the number of encoding layers in transformers, LSTMs, and other NLP models caused repetition of words; our QH reduces such repetition of word locations. The output of the BCPN is combined with the JVTF’s guided attention to further improve the end result. Through extensive experiments on three benchmark datasets, we show that the proposed BCPN outperforms the state-of-the-art methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Spreading Fine-Grained Prior Knowledge for Accurate Tracking


      Authors: Jiahao Nie;Han Wu;Zhiwei He;Mingyu Gao;Zhekang Dong;
      Pages: 6186 - 6199
      Abstract: With the widespread use of deep learning in single object tracking, mainstream tracking algorithms treat tracking as a combined classification and regression problem. Classification aims at locating an arbitrary target, and regression aims at estimating the corresponding bounding box. In this paper, we focus on regression and propose a novel box estimation network, which consists of a transformer-encoder target pyramid guide (TPG) and a transformer-decoder target pyramid spread (TPS). Specifically, the transformer-encoder TPG is designed to generate fine-grained prior knowledge with an explicit representation of template targets. In contrast to the raw transformer encoder, we capture visual dependence through local-global self-attention and treat the multi-scale target regions as the “local” region. Using this fine-grained prior knowledge, we design the transformer-decoder TPS to spread it to the subsequent search regions with high affinity, so as to accurately estimate the bounding boxes. Considering that self-attention fails to model information interaction across channels between the template target and the search regions, we develop a channel-wise cross-attention block within the TPS as compensation. Extensive experiments on the OTB100, UAV123, NFS, VOT2020, VOT2021, LaSOT, LaSOT_ext, TrackingNet and GOT-10k benchmarks show that the proposed box estimation network outperforms most existing box estimation methods. Furthermore, our trackers based on this estimation network exhibit competitive performance against state-of-the-art trackers.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Efficient and Robust MultiView Clustering With Anchor Graph Regularization


      Authors: Ben Yang;Xuetao Zhang;Zhiping Lin;Feiping Nie;Badong Chen;Fei Wang;
      Pages: 6200 - 6213
      Abstract: Multi-view clustering has received widespread attention owing to its effectiveness in integrating multi-view data appropriately, but traditional algorithms have limited applicability to large-scale real-world data due to their high computational complexity and low robustness. Focusing on these issues, we propose an efficient and robust multi-view clustering algorithm with anchor graph regularization (ERMC-AGR). In this work, a novel anchor graph regularization (AGR) is designed to improve the quality of the learned embedded anchor graph (EAG), and the obtained EAG is decomposed by nonnegative matrix factorization (NMF) under the correntropy criterion to acquire clustering results directly. Different from traditional graph regularization, which needs to construct a large-scale Laplacian matrix over the all-sample graph, our lightweight AGR, constructed from the perspective of anchors, can reduce the computational complexity significantly while improving the EAG quality. Moreover, a factor matrix of the NMF is constrained to be the cluster indicator matrix, so that no additional k-means step is needed after optimization. Subsequently, correntropy is utilized to improve the effectiveness and robustness of ERMC-AGR owing to its robustness to complex noise and outliers. Extensive experiments on real-world and noisy datasets show that ERMC-AGR can improve clustering efficiency and robustness while ensuring comparable or even better effectiveness.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
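      The anchor-graph idea above can be made concrete with a small sketch: each sample is connected only to its k nearest anchors (e.g., k-means centers), so the graph is n x m with m much smaller than n, instead of the n x n all-sample graph used by traditional graph regularization. The kernel width, k, and the closing comment on the implied affinity matrix are illustrative assumptions, not the paper's exact AGR term.

      import numpy as np

      def anchor_graph(X, anchors, k=5, sigma=1.0):
          """X: (n, d) samples; anchors: (m, d). Returns the (n, m) sample-to-anchor graph Z."""
          d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # squared distances
          Z = np.zeros_like(d2)
          idx = np.argsort(d2, axis=1)[:, :k]                         # k nearest anchors per sample
          rows = np.repeat(np.arange(X.shape[0]), k)
          Z[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / (2 * sigma ** 2))
          Z /= Z.sum(axis=1, keepdims=True)                           # row-normalize
          return Z

      # A lightweight regularizer can then be built on the anchor side, using the low-rank
      # approximation W ~ Z diag(Z.sum(0))^-1 Z.T of the full sample affinity matrix.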
       
  • Hierarchical Dynamic Programming Module for Human Pose Refinement


      Authors: Chunyang Xie;Dongheng Zhang;Yang Hu;Yan Chen;
      Pages: 6214 - 6226
      Abstract: Remarkable performance on image-based human pose estimation has been achieved by deep Convolutional Neural Networks (CNNs). Nevertheless, directly applying these image-based models to videos is not only computationally intensive but may also cause jitter and missed detections. The main reason is that image-based models focus purely on the local features of individual frames and ignore the temporal information among adjacent frames. Some existing methods address this temporal coherency issue, but they need to be designed carefully and cannot be combined with existing image-based methods. In this paper, we propose a simple yet effective module to refine the estimated pose by exploiting the temporal coherency among the heatmaps of adjacent frames, which can be easily inserted into image-based networks as a plug-in. We show that the temporal coherency issue among heatmap frames can be re-formulated as a graph path selection optimization problem. Moreover, to speed up the refinement process, we propose a hierarchical graph optimization that performs refinement from coarse to fine. Experimental results on two large-scale video pose estimation benchmarks show that our module improves performance with little speed loss when combined with image-based methods as an efficient plug-in.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
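      The graph path selection view mentioned above can be illustrated for a single joint: each frame offers several candidate peaks from its heatmap, and a dynamic-programming pass picks one candidate per frame so that confidence stays high while frame-to-frame motion stays smooth. The smoothness weight and scoring below are illustrative assumptions; the paper's hierarchical coarse-to-fine optimization is not reproduced.

      import numpy as np

      def select_temporal_path(conf, coords, smooth_w=0.1):
          """conf: (T, K) candidate confidences; coords: (T, K, 2) candidate positions."""
          T, K = conf.shape
          score = np.empty((T, K))
          back = np.zeros((T, K), dtype=int)
          score[0] = conf[0]
          for t in range(1, T):
              # Transition cost penalizes large displacement between consecutive frames.
              disp = np.linalg.norm(coords[t][None, :, :] - coords[t - 1][:, None, :], axis=-1)
              total = score[t - 1][:, None] - smooth_w * disp + conf[t][None, :]
              back[t] = total.argmax(axis=0)
              score[t] = total.max(axis=0)
          path = [int(score[-1].argmax())]
          for t in range(T - 1, 0, -1):
              path.append(int(back[t, path[-1]]))
          return path[::-1]   # index of the candidate chosen at each frame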
       
  • Dynamic Hand Gesture Recognition Using Improved Spatio-Temporal Graph
           Convolutional Network


      Authors: Jae-Hun Song;Kyeongbo Kong;Suk-Ju Kang;
      Pages: 6227 - 6239
      Abstract: Hand gesture recognition is essential to human-computer interaction as the most natural way of communicating. Furthermore, with the development of 3D hand pose estimation technology and the performance improvement of low-cost depth cameras, skeleton-based dynamic hand gesture recognition has received much attention. This paper proposes a novel multi-stream improved spatio-temporal graph convolutional network (MS-ISTGCN) for skeleton-based dynamic hand gesture recognition. We adopt an adaptive spatial graph convolution that can learn the relationship between distant hand joints and propose an extended temporal graph convolution with multiple dilation rates that can extract informative temporal features from short to long periods. Furthermore, we add a new attention layer consisting of effective spatio-temporal attention and channel attention between the spatial and temporal graph convolution layers to find and focus on key features. Finally, we propose a multi-stream structure that feeds multiple data modalities (i.e., joints, bones, and motions) as inputs to improve performance using the ensemble technique. Each of the three-stream networks is independently trained and fused to predict the final hand gesture. The performance of the proposed method is verified through extensive experiments with two widely used public dynamic hand gesture datasets: SHREC’17 Track and DHG-14/28. Our proposed method achieves the highest recognition accuracy in various gesture categories for both datasets compared with state-of-the-art methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
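      The extended temporal graph convolution with multiple dilation rates can be sketched as parallel temporal convolutions over a skeleton sequence laid out as (batch, channels, frames, joints); each branch covers a different temporal span and the outputs are concatenated. Channel counts, kernel size, and dilation rates below are illustrative assumptions, not the MS-ISTGCN configuration.

      import torch
      import torch.nn as nn

      class MultiDilationTemporalConv(nn.Module):
          def __init__(self, channels=64, dilations=(1, 2, 3, 4), kernel_t=3):
              super().__init__()
              self.branches = nn.ModuleList([
                  nn.Conv2d(channels, channels // len(dilations),
                            kernel_size=(kernel_t, 1),
                            padding=(d * (kernel_t - 1) // 2, 0),   # keeps the frame count unchanged
                            dilation=(d, 1))
                  for d in dilations
              ])

          def forward(self, x):                     # x: (B, C, T, V) = batch, channels, frames, joints
              # Each branch sees a different temporal receptive field; concatenation restores C channels.
              return torch.cat([b(x) for b in self.branches], dim=1)

      # Usage: y = MultiDilationTemporalConv()(torch.randn(8, 64, 100, 22))   # -> (8, 64, 100, 22)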
       
  • Task Encoding With Distribution Calibration for Few-Shot Learning


      Authors: Jing Zhang;Xinzhou Zhang;Zhe Wang;
      Pages: 6240 - 6252
      Abstract: Few-shot learning is an extremely challenging task in computer vision that has attracted increasing research attention in recent years. However, most recent methods do not fully use the information in a task, and the few available samples lead to large intra-class differences within the same class. In this paper, we propose a novel task encoding with distribution calibration (TEDC) model for few-shot learning, which uses the relationships among the feature distributions to reduce intra-class differences. In the TEDC model, an integrated feature extraction module (IFEM) is proposed, which extracts the multi-angle visual features of an image and fuses them to obtain more representative features. To effectively utilize the task information, a novel task encoding module (TEM) is proposed, which obtains task features by fusing the information of all seen samples and uses them to adjust all the samples’ features toward more generalizable task-specific representations. We also propose a distribution calibration module (DCM) to reduce the bias between the distribution of the support features and that of the query features in the same class. Extensive experiments show that our proposed TEDC model achieves excellent performance and outperforms the state-of-the-art methods on three widely used few-shot classification benchmarks, namely miniImageNet, tieredImageNet and CUB-200-2011.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Adaptive Multilayer Perceptual Attention Network for Facial Expression
           Recognition


      Authors: Hanwei Liu;Huiling Cai;Qingcheng Lin;Xuefeng Li;Hui Xiao;
      Pages: 6253 - 6266
      Abstract: In complex real-world situations, problems such as illumination changes, facial occlusion, and variant poses make facial expression recognition (FER) a challenging task. To solve the robustness problem, this paper proposes an adaptive multilayer perceptual attention network (AMP-Net) that is inspired by the facial attributes and the facial perception mechanism of the human visual system. AMP-Net extracts global, local, and salient facial emotional features with different fine-grained features to learn the underlying diversity and key information of facial emotions. Different from existing methods, AMP-Net can adaptively guide the network to focus on multiple finer and distinguishable local patches with robustness to occlusion and variant poses, improving the effectiveness of learning potential facial diversity information. In addition, the proposed global perception module can learn different receptive field features in the global perception domain, and AMP-Net also supplements salient facial region features with high emotion correlation based on prior knowledge to capture key texture details and avoid important information loss. Many experiments show that AMP-Net achieves good generalizability and state-of-the-art results on several real-world datasets, including RAF-DB, AffectNet-7, AffectNet-8, SFEW 2.0, FER-2013, and FED-RO, with accuracies of 89.25%, 64.54%, 61.74%, 61.17%, 74.48%, and 71.75%, respectively. All codes and training logs are publicly available at https://github.com/liuhw01/AMP-Net.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Target-Distractor Aware Deep Tracking With Discriminative Enhancement
           Learning Loss


      Authors: Huanlong Zhang;Liyun Cheng;Tianzhu Zhang;Yanfeng Wang;W. J. Zhang;Jie Zhang;
      Pages: 6267 - 6278
      Abstract: Numerous tracking approaches attempt to improve target representation through target-aware or distractor-aware modeling. However, the unbalanced consideration of target or distractor information makes it difficult for these methods to benefit from both aspects at the same time. In this paper, we propose a target-distractor aware model with a discriminative enhancement learning loss to learn a target representation that can better distinguish the target in complex scenes. Firstly, to enlarge the gap between the target and distractors, we design a discriminative enhancement learning loss. By highlighting hard negatives that are similar to the target and shrinking easy negatives that are pure background, the features sensitive to the target or distractor representation can be mined more conveniently. On this basis, we further propose a target-distractor aware model. Unlike existing methods that favor either the target or the distractor, we construct a target-specific feature space by activating target-sensitive and distractor-silent features. Therefore, the appearance model can not only represent the target well but also suppress background distractors. Finally, the target-distractor aware representation model is integrated with a Siamese matching network to achieve robust and real-time visual tracking. Extensive experiments performed on eight tracking benchmarks show that the proposed algorithm achieves favorable performance.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Implicit Motion-Compensated Network for Unsupervised Video Object
           Segmentation


      Authors: Lin Xi;Weihai Chen;Xingming Wu;Zhong Liu;Zhengguo Li;
      Pages: 6279 - 6292
      Abstract: Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence. Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic backgrounds and inaccurate flow (flow-based). To overcome these limitations, we propose an implicit motion-compensated network (IMCNet) that combines complementary cues (i.e., appearance and motion) by aligning motion information from adjacent frames to the current frame at the feature level, without estimating optical flow. The proposed IMCNet consists of an affinity computing module (ACM), an attention propagation module (APM), and a motion compensation module (MCM). The lightweight ACM extracts commonality between neighboring input frames based on appearance features. The APM then transmits global correlation in a top-down manner. Through coarse-to-fine iterative refinement, the APM refines object regions at multiple resolutions so as to avoid losing details. Finally, the MCM aligns motion information from temporally adjacent frames to the current frame, which achieves implicit motion compensation at the feature level. We perform extensive experiments on DAVIS16 and YouTube-Objects. Our network achieves favorable performance while running at a faster speed compared to the state-of-the-art methods. Our code is available at https://github.com/xilin1991/IMCNet.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Bridging Video and Text: A Two-Step Polishing Transformer for Video
           Captioning


      Authors: Wanru Xu;Zhenjiang Miao;Jian Yu;Yi Tian;Lili Wan;Qiang Ji;
      Pages: 6293 - 6307
      Abstract: Video captioning is a joint task of computer vision and natural language processing that aims to describe video content in several natural-language sentences. Most current methods cast this task as a mapping problem, learning a mapping from visual features to natural language and generating captions directly from videos. However, the underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. To address this problem, we introduce a polishing mechanism that mimics the human polishing process and propose a generate-and-polish framework for video captioning. In this paper, we propose a two-step transformer-based polishing network (TSTPN) consisting of two sub-modules: a generation module that generates a caption candidate and a polishing module that gradually refines the generated candidate. Specifically, the candidate provides a global view of the visual contents in a semantically meaningful order: it first serves as a semantic intermediary to bridge the semantic gap between text and video, with a cross-modal attention mechanism for better cross-modal modeling; and it second provides a global planning ability that maintains the semantic consistency and fluency of the whole sentence for better sequence mapping. In experiments, we present adequate evaluations showing that the proposed TSTPN achieves comparable and even better performance than the state-of-the-art methods on the benchmark datasets.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • CGMDRNet: Cross-Guided Modality Difference Reduction Network for RGB-T
           Salient Object Detection


      Authors: Gang Chen;Feng Shao;Xiongli Chai;Hangwei Chen;Qiuping Jiang;Xiangchao Meng;Yo-Sung Ho;
      Pages: 6308 - 6323
      Abstract: How to explore the interaction between the RGB and thermal modalities is key to the success of RGB-T salient object detection (SOD). Most existing methods integrate multi-modality information by designing various fusion strategies. However, the modality gap between RGB and thermal features leads to unsatisfactory performance with simple feature concatenation. To solve this problem, we propose a cross-guided modality difference reduction network (CGMDRNet) that achieves intrinsically consistent feature fusion by reducing the modality differences. Specifically, we design a modality difference reduction (MDR) module, which is embedded in each layer of the backbone network and uses a cross-guided strategy to reduce the modality difference between RGB and thermal features. Then, a cross-attention fusion (CAF) module is designed to fuse cross-modality features with small modality differences. In addition, we use a transformer-based feature enhancement (TFE) module to enhance the high-level feature representation, which contributes more to performance. Finally, the high-level features guide the fusion of low-level features to obtain a saliency map with clear boundaries. Extensive experiments on three public RGB-T datasets show that the proposed CGMDRNet achieves competitive performance compared with state-of-the-art (SOTA) RGB-T SOD models.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Decoupled R-CNN: Sensitivity-Specific Detector for Higher Accurate
           Localization


      Authors: Dong Wang;Kun Shang;Huaming Wu;Ce Wang;
      Pages: 6324 - 6336
      Abstract: Object detection, as a fundamental problem in computer vision, has been widely used in many industrial applications, such as intelligent manufacturing and intelligent video surveillance. In this work, through an investigation of highly overlapping proposals, we find that classification and regression have different sensitivities to object translation. More specifically, the regressor head is intrinsically more sensitive to translation than the classifier. Based on this, we propose a decoupled sampling strategy for a deep detector, named Decoupled R-CNN, which decouples the proposal sampling for the two tasks and induces two sensitivity-specific heads. Furthermore, we adopt a cascaded structure for the single regressor head of Decoupled R-CNN, which is an extremely simple but highly effective way of improving the performance of object detection. Extensive empirical analyses on real-world datasets demonstrate the value of the proposed method when compared with state-of-the-art models. The reproducing code is available at https://github.com/shouwangzhe134/Decoupled-R-CNN.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
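      The decoupled sampling strategy can be pictured as the classifier and the regressor drawing their positive training proposals under different overlap criteria, reflecting the regressor's higher sensitivity to translation. The IoU thresholds below are illustrative assumptions only, not the values used by Decoupled R-CNN.

      import numpy as np

      def decoupled_sampling(proposals, ious, cls_pos_iou=0.5, reg_pos_iou=0.65):
          """proposals: (N, 4) boxes; ious: (N,) IoU of each proposal with its matched ground truth."""
          cls_pos = proposals[ious >= cls_pos_iou]    # looser positive set for the classifier head
          reg_pos = proposals[ious >= reg_pos_iou]    # tighter, well-localized set for the regressor head
          cls_neg = proposals[ious < cls_pos_iou]
          return cls_pos, cls_neg, reg_pos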
       
  • Multiple Resolution Prediction With Deep Up-Sampling for Depth Video
           Coding


      Authors: Ge Li;Jianjun Lei;Zhaoqing Pan;Bo Peng;Nam Ling;
      Pages: 6337 - 6346
      Abstract: Depth videos contain large smooth regions with sharp edges. Since deep learning-based intra prediction methods oriented toward color video pay no attention to the characteristics of depth video, they are unsuitable for optimizing the coding efficiency of depth video. In this paper, a multiple resolution prediction method with deep up-sampling is proposed to improve the coding efficiency of depth video. To efficiently encode depth blocks of different complexity, each depth block is selectively encoded at one of several resolutions: ×1, ×1/2, or ×1/4. If a block is encoded at a low resolution (LR), the resolution of the reconstructed LR depth block is recovered by an up-sampling network. To constrain the quality of both the reconstructed high-resolution depth block and its synthesized view, a view synthesis distortion guidance mechanism is proposed for the up-sampling network. In addition, a distillation-based lightweight up-sampling network is proposed to reduce the computational complexity. Experimental results demonstrate that the proposed multiple resolution prediction method achieves an average BD-rate saving of 10.84% compared with 3D-HEVC.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • RDEN: Residual Distillation Enhanced Network-Guided Lightweight
           Synthesized View Quality Enhancement for 3D-HEVC


      Authors: Zhaoqing Pan;Feng Yuan;Weijie Yu;Jianjun Lei;Nam Ling;Sam Kwong;
      Pages: 6347 - 6359
      Abstract: In a three-dimensional video system, depth image-based rendering is a key technique for generating synthesized views, which provides audiences with depth perception and interactivity. However, inaccurate depth information leads to geometrical rendering position errors, and the compression distortion of texture and depth videos degrades the quality of the synthesized views. Although existing quality enhancement methods can eliminate distortions in the synthesized views, their huge computational complexity hinders their application in real-time multimedia systems. To this end, a residual distillation enhanced network (RDEN)-guided lightweight synthesized view quality enhancement (SVQE) method is proposed to minimize holes and compression distortions in the synthesized views while reducing model complexity. First, we rethink deep-learning-based SVQE methods. Then, a feature distillation attention block is proposed to effectively reduce the distortions in the synthesized views and make the model suitable for more real-time tasks; it is a lightweight and flexible feature extraction block using an information distillation mechanism and a lightweight multi-scale spatial attention mechanism. Third, a residual feature fusion block is proposed to improve the enhancement performance by using a feature fusion mechanism, which efficiently improves the feature extraction capability without introducing any additional parameters. Experimental results show that the proposed RDEN efficiently improves SVQE performance at low computational complexity compared with state-of-the-art SVQE methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Overview and Efficiency of Decoder-Side Depth Estimation in MPEG Immersive
           Video


      Authors: Dawid Mieloch;Patrick Garus;Marta Milovanović;Joël Jung;Jun Young Jeong;Smitha Lingadahalli Ravi;Basel Salahieh;
      Pages: 6360 - 6374
      Abstract: This paper presents the overview and rationale behind the Decoder-Side Depth Estimation (DSDE) mode of the MPEG Immersive Video (MIV) standard, using the Geometry Absent profile, for efficient compression of immersive multiview video. A MIV bitstream generated by an encoder operating in the DSDE mode does not include depth maps. It only contains the information required to reconstruct them in the client or in the cloud: decoded views and metadata. The paper explains the technical details and techniques supported by this novel MIV DSDE mode. The description additionally includes the specification on Geometry Assistance Supplemental Enhancement Information which helps to reduce the complexity of depth estimation, when performed in the cloud or at the decoder side. The depth estimation in MIV is a non-normative part of the decoding process, therefore, any method can be used to compute the depth maps. This paper lists a set of requirements for depth estimation, induced by the specific characteristics of the DSDE. The depth estimation reference software, continuously and collaboratively developed with MIV to meet these requirements, is presented in this paper. Several original experimental results are presented. The efficiency of the DSDE is compared to two MIV profiles. The combined non-transmission of depth maps and efficient coding of textures enabled by the DSDE leads to efficient compression and rendering quality improvement compared to the usual encoder-side depth estimation. Moreover, results of the first evaluation of state-of-the-art multiview depth estimators in the DSDE context, including machine learning techniques, are presented.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Depth-Wise Split Unit Coding Order for Video Compression


      Authors: Yinji Piao;Kiho Choi;Kwang Pyo Choi;Minsoo Park;Min Woo Park;
      Pages: 6375 - 6384
      Abstract: In this paper, we propose a depth-wise flexible block processing method called split unit coding order (SUCO) for video coding. Conventionally, block-based image and video compression frameworks apply raster scans to process blocks in order from left to right. Owing to this fixed coding order, the information available for predicting a coding block is limited to the adjacent blocks on its left and top. To address this limitation, the proposed SUCO provides more flexibility in the order in which coding blocks are handled, so that coding blocks can take advantage of information from adjacent blocks on the right, such as reconstructed pixels and motion information. The flexibility is achieved by depth-wise signaling of the preferred coding order for the given partitions. Experimental results demonstrate that the proposed SUCO effectively improves the coding efficiency of both intra and inter prediction in the latest video coding standards.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • FPX-NIC: An FPGA-Accelerated 4K Ultra-High-Definition Neural Video Coding
           System


      Authors: Chuanmin Jia;Xinyu Hang;Shanshe Wang;Yaqiang Wu;Siwei Ma;Wen Gao;
      Pages: 6385 - 6399
      Abstract: Recent neural image compression (NIC) research can be broadly grouped into two categories: improvements of the analysis-synthesis transform networks and optimization of entropy estimation. They promote the compression efficiency of NIC by leveraging more expressive network structures and advanced entropy models, respectively. From a different but more systematic viewpoint, we extend the horizon of NIC from software-based to hardware-based lossy compression on more resource-constrained platforms, such as field programmable gate arrays (FPGA) or deep-learning processor units (DPU). In this paper, we propose a novel hardware-oriented NIC system for real-time edge-computing video services. We present, for the first time, FPX-NIC, an FPGA-accelerated NIC framework designed for hardware encoding, which consists of a novel NIC scheme and an energy-efficient neural network (NN) deployment method. The former contribution is a block-based adaptive NIC approach driven by local content characteristics, in which essential side information is signalled to realize adaptive patch representation. The critical advantage of the latter contribution lies in a network-reconfigurable framework plus a fixed-precision weight quantization method that uses quantization-aware post-training to compensate for the performance degradation caused by quantization error, thereby improving both processing speed and energy efficiency. We finally establish an intelligent video coding system using the proposed scheme, enabling visual capturing, neural encoding, decoding, and display, and realizing 4K ultra-high-definition (UHD) all-intra neural video coding on edge-computing devices.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
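      The fixed-precision weight quantization step mentioned above can be sketched as a standard symmetric quantizer followed by quantization-aware fine-tuning; the bit width and scaling rule below are generic illustrative assumptions, not the FPX-NIC deployment flow.

      import torch

      def quantize_weights(w, num_bits=8):
          """Symmetric fixed-precision quantization of a weight tensor."""
          qmax = 2 ** (num_bits - 1) - 1
          scale = w.abs().max().clamp(min=1e-8) / qmax
          w_q = torch.clamp(torch.round(w / scale), -qmax, qmax)
          return w_q * scale, scale    # de-quantized weights for simulation, plus the scale

      # Quantization-aware post-training then fine-tunes with these simulated weights so the
      # network can compensate for the rounding error before deployment on the FPGA/DPU.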
       
  • An Overview of Panoramic Video Projection Schemes in the IEEE 1857.9
           Standard for Immersive Visual Content Coding


      Authors: Yangang Cai;Xufeng Li;Yueming Wang;Ronggang Wang;
      Pages: 6400 - 6413
      Abstract: Panoramic video contains 360-degree video content, making it convenient to render the content that the user wants to watch on the head-mounted display (HMD) of a virtual reality device. However, few encoders code panoramic video directly. Panoramic video content first needs to be projected onto a compression-friendly rectangular plane; existing video coding standards such as AVC, HEVC, VVC, and AVS are then used to code the projected content. To improve panoramic video coding efficiency, the IEEE 1857.9 immersive video content coding working group was established to develop efficient immersive video projection and coding methods. This paper presents an overview of the panoramic video projection schemes in the IEEE 1857.9 Standard for Immersive Visual Content Coding.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
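      As a worked example of the kind of mapping such projection schemes define, the most common format, equirectangular projection (ERP), maps each pixel of a W x H panorama to a direction on the unit sphere as sketched below. This is the widely used ERP convention shown purely for illustration; the specific projection formats defined in IEEE 1857.9 are described in the paper itself.

      import math

      def erp_pixel_to_sphere(u, v, width, height):
          """Returns (x, y, z) on the unit sphere for the center of ERP pixel (u, v)."""
          lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi     # longitude in (-pi, pi)
          lat = (0.5 - (v + 0.5) / height) * math.pi          # latitude in (-pi/2, pi/2)
          x = math.cos(lat) * math.cos(lon)
          y = math.sin(lat)
          z = -math.cos(lat) * math.sin(lon)
          return x, y, z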
       
  • RB-Net: Training Highly Accurate and Efficient Binary Neural Networks With
           Reshaped Point-Wise Convolution and Balanced Activation


      Authors: Chunlei Liu;Wenrui Ding;Peng Chen;Bohan Zhuang;Yufeng Wang;Yang Zhao;Baochang Zhang;Yuqi Han;
      Pages: 6414 - 6424
      Abstract: In this paper, we find that the conventional convolution operation becomes the bottleneck for extremely efficient binary neural networks (BNNs). To address this issue, we open up a new direction by introducing a reshaped point-wise convolution (RPC) to replace the conventional convolution when building BNNs. Specifically, we conduct a point-wise convolution after rearranging the spatial information into depth, with which at least a 2.25× computation reduction can be achieved. Such an efficient RPC allows us to explore a more powerful representational capacity of BNNs under a given computation complexity budget. Moreover, we propose a balanced activation (BA) to adjust the distribution of the scaled activations after binarization, which enables a significant performance improvement of BNNs. After integrating RPC and BA, the proposed network, dubbed RB-Net, strikes a good trade-off between accuracy and efficiency, achieving superior performance with lower computational cost than state-of-the-art BNN methods. Specifically, our RB-Net achieves 66.8% Top-1 accuracy with a ResNet-18 backbone on ImageNet, exceeding the state-of-the-art Real-to-Binary Net (65.4%) by 1.4% while achieving more than a 3× reduction (52M vs. 165M) in computational complexity.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
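      The reshaped point-wise convolution (RPC) described above, rearranging spatial information into depth and then applying a 1x1 convolution, can be sketched with a pixel-unshuffle step followed by a point-wise layer. Channel counts and the downscale factor are illustrative assumptions, and the binarization of weights and activations is omitted.

      import torch
      import torch.nn as nn

      class ReshapedPointwiseConv(nn.Module):
          def __init__(self, in_ch=64, out_ch=64, downscale=2):
              super().__init__()
              self.unshuffle = nn.PixelUnshuffle(downscale)   # (C, H, W) -> (C*r*r, H/r, W/r)
              self.pw = nn.Conv2d(in_ch * downscale ** 2, out_ch, kernel_size=1, bias=False)

          def forward(self, x):
              return self.pw(self.unshuffle(x))

      # Usage: y = ReshapedPointwiseConv()(torch.randn(1, 64, 56, 56))   # -> (1, 64, 28, 28)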
       
  • MMatch: Semi-Supervised Discriminative Representation Learning for
           Multi-View Classification


      Authors: Xiaoli Wang;Liyong Fu;Yudong Zhang;Yongli Wang;Zechao Li;
      Pages: 6425 - 6436
      Abstract: Semi-supervised multi-view learning has been an important research topic due to its capability to exploit complementary information from unlabeled multi-view data. This work proposes MMatch, a new semi-supervised discriminative representation learning method for multi-view classification. Unlike existing multi-view representation learning methods, which seldom consider the negative impact caused by particular views with unclear classification structures (weakly discriminative views), MMatch jointly learns view-specific representations and class probabilities of the training data. The representations are concatenated to integrate the information of multiple views into a global representation. Moreover, MMatch imposes a smoothness constraint on the class probabilities of the global representation to improve the pseudo labels, while the pseudo labels in turn regularize the structure of the view-specific representations. A discriminative global representation is mined during training, and the negative impact of weakly discriminative views is overcome. Besides, MMatch learns consistent classification while preserving diverse information from multiple views. Experiments on several multi-view datasets demonstrate the effectiveness of MMatch.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Hierarchical Feature Aggregation Based on Transformer for Image-Text
           Matching


      Authors: Xinfeng Dong;Huaxiang Zhang;Lei Zhu;Liqiang Nie;Li Liu;
      Pages: 6437 - 6447
      Abstract: To carry out more accurate retrieval across image-text modalities, some scholars use fine-grained features to align images and text. Most of them directly use an attention mechanism to align image regions with words in the sentence, ignoring the fact that the semantics related to an object are abstract and cannot be accurately expressed by object information alone. To overcome this weakness, we propose a hierarchical feature aggregation algorithm based on graph convolutional networks (GCN) that promotes object semantic integrity by hierarchically integrating the attributes of an object and the relations between objects in both the image and text modalities. To eliminate the semantic gap between modalities, we propose a transformer-based cross-modal feature fusion method that generates modality-specific feature representations by integrating the object features with the global feature of the other modality, and then maps the fused features into a common space. Experimental results on the most frequently used datasets, MSCOCO and Flickr30K, show the effectiveness of the proposed model compared with the latest methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Unsupervised Domain Adaptation for Disguised-Gait-Based Person
           Identification on Micro-Doppler Signatures


      Authors: Yang Yang;Xiaoyi Yang;Takuya Sakamoto;Francesco Fioranelli;Beichen Li;Yue Lang;
      Pages: 6448 - 6460
      Abstract: In recent years, gait-based person identification has gained significant interest for a variety of applications, including security systems and public security forensics. Meanwhile, this task faces the challenge of disguised gaits: when a human subject changes what he or she is wearing or carrying, it becomes difficult to reliably identify the subject from gait data. In this paper, we propose an unsupervised domain adaptation (UDA) model, named Guided Subspace Alignment under the Class-aware condition (G-SAC), to recognize human subjects from their disguised gait data by fully exploiting the intrinsic information in gait biometrics. To accomplish this, we employ neighbourhood component analysis (NCA) to create an intrinsic feature subspace from which we can obtain similarities between normal and disguised gaits. With the aid of a proposed constraint for adaptive class-aware alignment, class-level discriminative feature representations can be learned under the guidance of this subspace. Our experimental results on a measured micro-Doppler radar dataset demonstrate the effectiveness of our approach. Comparisons with several state-of-the-art methods indicate that our work provides a promising domain adaptation solution for the problem at hand, even in cases where the disguised pattern differs significantly from the normal gaits. Additionally, we extend our approach to the more complex multi-target domain adaptation (MTDA) challenge and to video-based gait recognition tasks; the superior results demonstrate that the proposed model has great potential for tackling increasingly difficult problems.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
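      The NCA subspace step described above can be sketched with scikit-learn's NeighborhoodComponentsAnalysis: the subspace is learned on labeled normal-gait features and then used to map both normal and disguised gaits into the same space that guides the class-aware alignment. The feature dimensionality and projection size are illustrative assumptions.

      from sklearn.neighbors import NeighborhoodComponentsAnalysis

      def learn_gait_subspace(X_src, y_src, X_tgt, dim=32):
          """X_src: (n, d) labeled normal-gait features; X_tgt: (m, d) disguised-gait features (d >= dim)."""
          nca = NeighborhoodComponentsAnalysis(n_components=dim, random_state=0)
          nca.fit(X_src, y_src)
          return nca.transform(X_src), nca.transform(X_tgt)   # both mapped into the intrinsic subspace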
       
  • A Self-Supervised Metric Learning Framework for the Arising-From-Chair
           Assessment of Parkinsonians With Graph Convolutional Networks


      Authors: Rui Guo;Jie Sun;Chencheng Zhang;Xiaohua Qian;
      Pages: 6461 - 6471
      Abstract: The onset and progression of Parkinson’s disease (PD) gradually affect the patient’s motor functions and quality of life. PD motor symptoms are usually assessed using the Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS). Automated MDS-UPDRS assessment has recently been pursued as an invaluable tool for PD diagnosis and telemedicine, especially with the recent novel coronavirus pandemic outbreak. This paper proposes a novel vision-based method for automated assessment of the arising-from-chair task, which is one of the key MDS-UPDRS components. The proposed method is based on a self-supervised metric learning scheme with a graph convolutional network (SSM-GCN). Specifically, for human skeleton sequences extracted from videos, a self-supervised intra-video quadruplet learning strategy is proposed to construct a metric learning formulation with prior knowledge, improving the spatial-temporal representations. Afterwards, a vertex-specific convolution operation is designed to achieve effective aggregation of all skeletal joint features, where each joint or feature is weighted differently based on its relative importance. Finally, a graph representation supervised mechanism is developed to maximize the potential consistency between the joint and bone information streams. Experimental results on a clinical dataset demonstrate the superiority of the proposed method over existing sensor-based methods, with an accuracy of 70.60% and an acceptable accuracy of 98.65%. The analysis of discriminative spatial connections makes our predictions more clinically interpretable. This method can achieve reliable automated PD assessment using only easily obtainable videos, thus providing an effective tool for real-time PD diagnosis or remote continuous monitoring.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Temporal Relation Inference Network for Multimodal Speech Emotion
           Recognition


      Authors: Guan-Nan Dong;Chi-Man Pun;Zheng Zhang;
      Pages: 6472 - 6485
      Abstract: Speech emotion recognition (SER) is a non-trivial task even for humans, and it remains challenging for automatic systems due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems always treated multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and the temporal relations of speech are both meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) to help tackle multi-modal SER, which fully considers the underlying hierarchy of the phonetic structure and its associations between various modalities under sequential temporal guidance. Mainly, we design a temporal reasoning calibration module to imitate real and abundant contextual conditions. Unlike previous works, which assume all modalities are related, it infers the dependency relationship between the semantic information at the temporal level and learns to handle the multi-modal interaction sequence with a flexible order. To enhance the feature representation, an innovative temporal attentive fusion unit is developed to magnify the details embedded in a single modality at the semantic level. Meanwhile, it aggregates the feature representation from both the temporal and semantic levels through an adaptive feature fusion mechanism that selectively collects implicit complementary information to strengthen the dependencies between different information subspaces, maximizing the integrity of the feature representation. Extensive experiments conducted on two benchmark datasets demonstrate the superiority of our TRIN method against some state-of-the-art SER methods.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
  • Call for IEEE T-CSVT Associate Editors Nomination


      Pages: 6486 - 6486
      Abstract: Advertisement.
      PubDate: Sept. 2022
      Issue No: Vol. 32, No. 9 (2022)
       
 