International Journal on Document Analysis and Recognition (IJDAR)
Journal Prestige (SJR): 0.456
Citation Impact (citeScore): 2
Number of Followers: 2
Hybrid journal (It can contain Open Access articles)
ISSN (Print) 1433-2833 - ISSN (Online) 1433-2825
Published by Springer-Verlag [2469 journals]
• Boosting modern and historical handwritten text recognition with
deformable convolutions

Abstract: Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typically couple recurrent structures for sequence modeling with Convolutional Neural Networks for visual feature extraction. Since convolutional kernels are defined on fixed grids and focus on all input pixels independently while moving over the input image, this strategy disregards the fact that handwritten characters can vary in shape, scale, and orientation even within the same document and that the ink pixels are more relevant than the background ones. To cope with these specific HTR difficulties, we propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text. We design two deformable architectures and conduct extensive experiments on both modern and historical datasets. Experimental results confirm the suitability of deformable convolutions for the HTR task.
PubDate: 2022-05-09
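
The idea of a kernel that "deforms depending on the input" can be illustrated with a minimal sketch. This is not the authors' architecture: it is a single-channel, 3x3, stride-1 deformable convolution in plain Python, and the names `bilinear` and `deform_conv2d` are hypothetical. In a real network the offsets are predicted by a learned layer; here they are passed in by hand.

```python
import math

def bilinear(img, y, x):
    """Sample img at a real-valued (y, x); points outside the image read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy][xx]
    return value

def deform_conv2d(img, kernel, offsets):
    """Single-channel 3x3 deformable convolution (stride 1, zero padding 1).

    offsets[i][j] holds 9 (dy, dx) pairs, one per kernel tap, for output
    pixel (i, j). Assumption: offsets are given; a real model predicts them.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for k in range(9):
                ky, kx = divmod(k, 3)          # tap position on the regular grid
                dy, dx = offsets[i][j][k]      # input-dependent displacement
                acc += kernel[ky][kx] * bilinear(img, i + ky - 1 + dy, j + kx - 1 + dx)
            out[i][j] = acc
    return out
```

With all offsets set to zero the function reduces to an ordinary 3x3 convolution, which is a convenient sanity check; non-zero offsets let each tap sample away from the fixed grid.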

• Correction to: Radical-based extract and recognition networks for Oracle
character recognition

PubDate: 2022-05-07

• YOLO-table: disclosure document table detection with involution

Abstract: As financial document automation becomes more widespread, table detection is receiving more and more attention as an important part of document automation. Disclosure documents contain both bordered and borderless tables of varying lengths, and there is currently no model that performs well on these types of documents. To solve this problem, we propose a table detection model based on YOLO-table. We introduce involution into the backbone of the network to improve the network’s ability to learn table spatial layout features and design a simple Feature Pyramid Network to improve model effectiveness. In addition, this paper proposes a table-based augmentation method. We experiment on a disclosure document dataset, and the results show that the F1-measure of YOLO-table reaches 97.3%. Compared with YOLOv3, our method improves the accuracy by 2.8% and the speed by 1.25 times. We also evaluate on the ICDAR2013 and ICDAR2019 Table Competition datasets and achieve state-of-the-art performance.
PubDate: 2022-05-02

• Fusion of visual representations for multimodal information extraction
from unstructured transactional documents

Abstract: The importance of automated document understanding in terms of today’s businesses’ speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured ones remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents as they serve to track the financial flow and are accordingly the most studied type. The processing of unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information. However, the impact of their visual composition remains unexplored and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., for complex relation extraction from money transfer order documents). It introduces and experiments with five different visual representation approaches (i.e., word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their possible fusion with five different strategies (i.e., three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement by combining diverse visual information, from which unstructured transactional document understanding obtains different benefits depending on the context. While different visual representations have little effect when added individually to a pure textual baseline, their fusion provides a relative error reduction of up to 33%.
PubDate: 2022-04-22
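
For readers unfamiliar with the metric, relative error reduction can be computed as below. This is only an illustration of the arithmetic: the function name and the error rates in the comment are hypothetical and are not taken from the article.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system.

    Both arguments are error rates on the same scale (e.g. 1 - F1).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Hypothetical numbers: going from 12% error to 8% error removes
# one third of the baseline error.
reduction = relative_error_reduction(0.12, 0.08)
```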

• Radical-based extract and recognition networks for Oracle character
recognition

Abstract: The recognition of Oracle bone inscriptions (OBI) is one of the most fundamental aspects of OBI study. However, the complex glyph structure and the many variants of OBI hinder the advancement of automatic recognition research. To solve these problems, this paper designs an Oracle radical extract and recognition framework (ORERF) based on deep learning. First, we combine the maximally stable extremal regions (MSER) algorithm and a self-defined post-processing algorithm to generate Oracle single-radical data annotations. Then, the generated Oracle radical-level annotation dataset is input into the detection network, which integrates multi-scale features, uses the attention mechanism to implicitly extract Oracle single-radical features, and feeds the feature map to the detection module for radical detection. Finally, we feed the detected radicals to the auxiliary classifier network for recognition. Treating an OBI character as a composition of radicals rather than as a character category is a human-like method that can reduce the size of the vocabulary and ignore redundant information among similar characters. The experimental results are highlighted and compared to demonstrate the efficiency of the method. Furthermore, we also introduce two new datasets: the Oracle radical character dataset (ORCD) and the Oracle combined-character dataset (OCCD).
PubDate: 2022-04-13

• CarveNet: a channel-wise attention-based network for irregular scene text
recognition

Abstract: Although considerable progress has been made in recent years, recognizing irregular text in natural scenes is still a challenging problem due to distortion and background interference. Prior works use either a spatial transformation network (STN) or a 2D attention mechanism to improve the recognition accuracy. However, STN-based methods are not robust because of their limited network capacity, while 2D attention-based methods are strongly affected by fuzziness, distortion and background. In this paper, we propose a text recognition model, CarveNet, which consists of three substructures: a feature extractor, a feature filter and a decoder. The feature extractor utilizes an FPN (Feature Pyramid Network) to aggregate multi-scale hierarchical feature maps and obtain a larger receptive field. Then, a feature filter composed of stacked Residual Channel Attention Blocks separates text features from background interference. The 2D self-attention-based decoder generates the text sequence according to the output of the feature filter and the previously generated symbols. Extensive evaluation results show that CarveNet achieves state-of-the-art results on both regular and irregular scene text recognition benchmark datasets. Compared with previous work based on 2D self-attention, CarveNet achieves accuracy increases of 2.3% and 4.6% on the irregular datasets SVTP and CT80, respectively.
PubDate: 2022-04-05

• Scene text detection via decoupled feature pyramid networks

Abstract: Detecting arbitrary-shape scene text is challenging mainly due to the varied aspect ratios, curves, and scales. In this paper, we propose a novel arbitrary-shape scene text detection method via Decoupled Feature Pyramid Networks (DFPN) and regression-based linking (RegLink). Our DFPN decouples the width and height of feature maps generated by FPN to enhance the discriminability of features for varied aspect ratios. As quadrilateral regression results cannot directly represent curved text, we propose a simple yet effective RegLink to link pixels into text instances, because pixels in the same curved text have an identical target quadrilateral. Thus, RegLink extends the ability of a rotated-rectangle text detector to detect curved text. In addition, we propose a Feature Scale Module to enhance the robustness of features for varied scales. In this way, our method can effectively detect scene text of arbitrary shape. Experimental results on three publicly available challenging datasets demonstrate the effectiveness of our method. The code and model of our method are available at https://github.com/lmplayer/DFPN-master.
PubDate: 2022-03-29

• Arbitrary-shaped scene text detection with keypoint-based shape
representation

Abstract: Recently, scene text detection has become a hot research topic. Arbitrary-shaped text detection is more challenging due to the irregular geometry of the texts, such as long curved shapes. Most existing works attempt to solve the problem by using bottom-up methods, followed by heuristic post-processing, or top-down methods with boundary regression. Through analysis and comparison, we present an efficient framework to detect arbitrary-shaped text by fusing bottom-up and top-down methods. Specifically, we use a segmentation method as the bottom-up detector to regress the text areas. We employ an anchor-free method as the top-down detector to represent and distinguish each text based on the results of the bottom-up detector. To detect text with arbitrary shapes, we propose a keypoint-based shape representation method, which treats a text as several keypoints linked together. Then, keypoints are regressed by the top-down detector. With the keypoint-based shape representation, the detected text can be easily rectified by Thin Plate Spline (TPS) transformation, and the framework can be directly extended to support end-to-end text spotting. Extensive experiments on several public benchmarks, including both regular-shaped and arbitrary-shaped scene texts in natural images, demonstrate that our method achieves state-of-the-art performance.
PubDate: 2022-03-25

• Robust text line detection in historical documents: learning and
evaluation methods

Abstract: Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of fonts, contents, writing styles and the quality of documents that have degraded through the years. In this paper, we address the limitations that currently prevent people from building line segmentation models with a high generalization capacity. We present a study conducted using three state-of-the-art systems, Doc-UFCN, dhSegment and ARU-Net, and show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy using standard pixel-level metrics and object-level ones, and introducing goal-oriented metrics.
PubDate: 2022-03-23

• MRZ code extraction from visa and passport documents using convolutional
neural networks

Abstract: Detecting and extracting information from the machine-readable zone (MRZ) on passports and visas is becoming increasingly important for verifying document authenticity. However, computer vision methods for performing similar tasks, such as optical character recognition, fail to extract the MRZ from digital images of passports with reasonable accuracy. We present a specially designed model based on convolutional neural networks that is able to successfully extract MRZ information from digital images of passports of arbitrary orientation and size. Our model achieves a 100% MRZ detection rate and a 99.25% character recognition macro-F1 score on a passport and visa dataset.
PubDate: 2022-03-01
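
The macro-F1 score quoted above is the unweighted mean of per-class F1 scores. A minimal sketch of the standard definition follows; the function name `macro_f1` is hypothetical, and the article's exact evaluation protocol (for example, how characters are aligned before scoring) is not specified here.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, rare characters in the MRZ alphabet weigh as much as frequent ones such as the filler character `<`.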

• Feature learning and encoding for multi-script writer identification

Abstract: Writer identification from handwriting samples has been an interesting research problem for the pattern recognition community in general and the handwriting recognition community in particular. In most cases, however, it is assumed that writers produce writing samples in a single script only. A more challenging scenario is multi-script writer identification, where the training and test samples of writers belong to different scripts. This paper presents a deep learning-based solution for writer identification in a multi-script scenario. The technique relies on identifying keypoints in handwriting and extracting small patches around these keypoints. These patches are intended to capture the writing gestures of individuals, which are likely to be common across multiple scripts. Robust feature representations are learned from these patches using a deep convolutional neural network, and the features are encoded using a newly proposed variant of the Vector of Locally Aggregated Descriptors (VLAD). Experiments on three bilingual handwriting datasets including writing samples in Arabic, English, French, Chinese and Farsi report promising identification rates and significantly outperform the current state-of-the-art on this problem.
PubDate: 2022-02-14
DOI: 10.1007/s10032-022-00394-8
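
As background for the encoding step, the following is a minimal sketch of the standard VLAD encoding, not the new variant proposed in the article. It assumes the codebook (`centroids`) has already been learned, for example with k-means; the function name `vlad` is hypothetical.

```python
import math

def vlad(descriptors, centroids):
    """Standard VLAD: sum residuals to the nearest centroid, then L2-normalise.

    descriptors: list of d-dimensional feature vectors (e.g. CNN patch features)
    centroids:   list of k d-dimensional codebook vectors (assumed precomputed)
    Returns a flat vector of length k * d.
    """
    k, d = len(centroids), len(centroids[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean).
        nearest = min(
            range(k),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c])),
        )
        for i in range(d):
            acc[nearest][i] += x[i] - centroids[nearest][i]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm else flat
```

The result is one fixed-length vector per handwriting sample, regardless of how many patches were extracted, which is what makes it usable for comparing writers.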

• Correction to: Personalizing image enhancement for critical visual tasks:
improved legibility of papyri using color processing and visual illusions

Abstract: This article develops theoretical, algorithmic, perceptual, and interaction aspects of script legibility enhancement in the visible light spectrum for the purpose of scholarly editing of papyri texts.
PubDate: 2022-02-09
DOI: 10.1007/s10032-022-00393-9

• Segmentation for document layout analysis: not dead yet

Abstract: Document layout analysis is often the first task in document understanding systems, where a document is broken down into identifiable sections. One of the most common approaches to this task is image segmentation, where each pixel in a document image is classified. However, this task is challenging because, as the number of classes increases, small and infrequent objects often get missed. In this paper, we propose a weighted bounding box regression loss methodology to improve accuracy for segmentation of document layouts, demonstrating our results on our dense article dataset (DAD) and the existing PubLayNet dataset. First, we collect and annotate 43 document object classes across 450 open access research articles, constructing DAD. After benchmarking several segmentation networks, we achieve an F1 score of 96.26% on DAD and 97.11% on PubLayNet with DeeplabV3+, while also showing a bounding box regression method for segmentation results that improves the F1 by 1.99 points on DAD. Finally, we demonstrate that the networks trained on DAD can be used as a bootstrapped annotation tool for existing document layout datasets, decreasing annotation time by 38% with DeeplabV3+.
PubDate: 2022-01-13
DOI: 10.1007/s10032-021-00391-3

• Personalizing image enhancement for critical visual tasks: improved
legibility of papyri using color processing and visual illusions

Abstract: This article develops theoretical, algorithmic, perceptual, and interaction aspects of script legibility enhancement in the visible light spectrum for the purpose of scholarly editing of papyri texts. Novel legibility enhancement algorithms based on color processing and visual illusions are compared to classic methods in a user experience experiment. (1) The proposed methods outperformed the comparison methods. (2) Users exhibited a broad behavioral spectrum, under the influence of factors such as personality and social conditioning, tasks and application domains, expertise level and image quality, and affordances of software, hardware, and interfaces. No single enhancement method satisfied all factor configurations. Therefore, it is suggested to offer users a broad choice of methods to facilitate personalization, contextualization, and complementarity. (3) A distinction is made between casual and critical vision on the basis of signal ambiguity and error consequences. The criteria of a paradigm for enhancing images for critical applications comprise: interpreting images skeptically; approaching enhancement as a system problem; considering all image structures as potential information; and making uncertainty and alternative interpretations explicit, both visually and numerically.
PubDate: 2021-12-27
DOI: 10.1007/s10032-021-00386-0

• A novel normal to tangent line (NTL) algorithm for scale invariant feature
extraction for Urdu OCR

Abstract: The font-invariant recognition of Urdu optical characters is a difficult task due to the nature of the Nastalique script. Urdu Nastalique is a complex script as it is excessively cursive and contains overlapping characters. Characters also change shape with context. Identifying the starting position of the same character in different contexts further increases complexity. Hence, an optical character recognition (OCR) system that is trained to recognize characters of a particular font size may not show the same level of accuracy if the font size varies. Considering this complexity, the current research has focused on discovering a feature set that may provide sufficient information for scale-invariant Urdu optical character recognition. For this task, calligraphic properties of Urdu Nastalique, the thickness of the ligature, the direction of movement of the calligraphic pen and global geometric features (height and weight) are used as the feature set. The thickness feature is extracted using two novel algorithms, i.e. “Normal to Tangent Line Algorithm (NTL)” and “Angle to Tangent Line Algorithm (ATL)”. These features are fed to three different models, i.e. correlation, C4.5 and a feedforward artificial neural network, and the performance of these models is also compared with SIFT (Scale-Invariant Feature Transform). For training and testing, both real and fabricated datasets are employed. A new benchmark dataset of extracted features, named Urdu OCR—Scale Invariant Feature Vectors (SIFVs), is developed and released on Kaggle. The newly developed SIFVs dataset, when used to train correlation, C4.5 and ANN-based models, outperformed SIFT descriptors and yielded 94.56%, 90.54% and 94.65% accuracy, respectively, while SIFT descriptors achieved only 75.45% accuracy on average.
PubDate: 2021-11-30
DOI: 10.1007/s10032-021-00389-x

• TableSegNet: a fully convolutional network for table detection and
segmentation in document images

Abstract: Advances in image object detection have led to the application of deep convolutional neural networks in the document image analysis domain. Unlike general colorful and pattern-rich objects, tables in document images have properties that limit the capacity of deep learning structures. Significant variation in size and aspect ratio and the local similarity among document components are the main challenges, requiring both global features for detection and local features for the separation of nearby objects. To deal with these challenges, we present TableSegNet, a compact fully convolutional network architecture that detects and separates tables simultaneously. TableSegNet consists of a deep convolution path to detect table regions in low resolution and a shallower path to locate tables in high resolution and split the detected regions into individual tables. To improve the detection and separation capacity, TableSegNet uses convolution blocks of wide kernel sizes in the feature extraction process and an additional table-border class in the main output. With only 8.1 million parameters and trained purely on document images from the beginning, TableSegNet has achieved a state-of-the-art F1 score at the IoU threshold of 0.9 on the ICDAR2019 dataset and the highest number of correctly detected tables on the ICDAR2013 table detection dataset.
PubDate: 2021-11-22
DOI: 10.1007/s10032-021-00390-4
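
To make "F1 score at the IoU threshold of 0.9" concrete, here is a simplified sketch. It assumes axis-aligned boxes given as `(x0, y0, x1, y1)` and uses greedy one-to-one matching; the official ICDAR2019 evaluation protocol may differ in its matching details, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def f1_at_iou(preds, gts, threshold=0.9):
    """F1 where a prediction counts as correct if it overlaps an unmatched
    ground-truth table with IoU >= threshold (greedy matching, an assumption)."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue
            score = iou(p, g)
            if score > best:
                best, best_j = score, j
        if best_j is not None and best >= threshold:
            matched.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

At a threshold of 0.9 a detection that clips even a single table row is typically rejected, which is why the metric rewards precise table boundaries.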

• An end-to-end network for irregular printed Mongolian recognition

Abstract: Mongolian is a language spoken in Inner Mongolia, China. In the recognition process, due to the shooting angle and other factors, the image and text are deformed, which makes recognition difficult. This paper proposes a triplet attention Mogrifier network (TAMN) for printed Mongolian text recognition. The network uses a spatial transformation network to correct deformed Mongolian images. It uses gated recurrent convolution layers (GRCL) combined with a triplet attention module to extract image features from the corrected images. The Mogrifier long short-term memory (LSTM) network obtains the context sequence information in the features, and finally the decoder’s LSTM attention produces the prediction result. Experimental results show that, with the spatial transformation network, deformed Mongolian images can be recognized effectively, and the recognition accuracy reaches 90.30%. This network achieves good performance in Mongolian text recognition compared with current mainstream text recognition networks. The dataset is publicly available at https://github.com/ShaoDonCui/Mongolian-recognition.
PubDate: 2021-10-18
DOI: 10.1007/s10032-021-00388-y

• TG²: text-guided transformer GAN for restoring document

Abstract: Most image enhancement methods focused on restoration of digitized textual documents are limited to cases where the text information is still preserved in the input image, which may often not be the case. In this work, we propose a novel generative document restoration method which allows conditioning the restoration on a guiding signal in the form of a target text transcription and which does not need paired high- and low-quality images for training. We introduce a neural network architecture with an implicit text-to-image alignment module. We demonstrate good results on inpainting, debinarization and deblurring tasks, and we show that the trained models can be used to manually alter text in document images. A user study shows that human observers confuse the outputs of the proposed enhancement method with reference high-quality images in as many as 30% of cases.
PubDate: 2021-09-22
DOI: 10.1007/s10032-021-00387-z

• Extracting text from scanned Arabic books: a large-scale benchmark dataset
and a fine-tuned Faster-R-CNN model

Abstract: Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
PubDate: 2021-06-30
DOI: 10.1007/s10032-021-00382-4

• A hybrid approach to recognize generic sections in scholarly documents

Abstract: Discourse parsing of scholarly documents is the premise and basis for standardizing the writing of scholarly documents, understanding their content, and quickly locating and extracting specific information from them. With the continuous emergence of a large number of scholarly documents, how to analyze them automatically, quickly and effectively has become a research hotspot. In this paper, we propose a hybrid model, which considers both section headers and body texts, to recognize generic sections in scholarly documents automatically. We conduct a comprehensive analysis of the semantic difference between short phrases and long narrative text chunks on the SectLabel dataset. The experimental results show that our model achieves a 91.67% F1-value in generic section recognition, which is better than the baseline.
PubDate: 2021-06-21
DOI: 10.1007/s10032-021-00381-5

