Publisher: Centre national de la recherche scientifique   (Total: 1 journal)

Journal of Data Mining and Digital Humanities
Number of Followers: 39  

  This is an Open Access journal
ISSN (Online) 2416-5999
Published by Centre national de la recherche scientifique
  • Linguistic Fingerprints on Translation's Lens

    • Abstract: What happens to the language fingerprints of a work when it is translated into another language? While translation studies has often prioritized concepts of equivalence (of form and function), and of textual function, digital humanities methodologies can provide a new analytical lens onto ways that stylistic traces of a text's source language can persist in a translated text. This paper presents initial findings of a project undertaken by the Stanford Literary Lab, which has identified distinctive grammatical features in short stories that have been translated into English. While the phenomenon of "translationese" has been well established particularly in corpus translation studies, we argue that digital humanities methods can be valuable for identifying specific traits for a vision of a world atlas of literary style.
      PubDate: Thu, 27 Jan 2022 15:25:29 +010
  • Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

    • Abstract: Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the used open source OCR software.
      PubDate: Thu, 04 Nov 2021 10:38:36 +010
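
      As a rough illustration of the horizontal histogram projection idea named in the abstract above, the following minimal sketch sums the ink pixels in each row of a binarized page and treats runs of non-empty rows as text lines. It is not the paper's implementation: the morphological step is omitted, and the function name `segment_lines`, the `threshold` parameter, and the toy `page` are assumptions made for this example.

```python
def segment_lines(image, threshold=0):
    """Split a binarized page into text lines by horizontal projection.

    image: list of rows, each a list of 0/1 values (1 = ink pixel).
    threshold: hypothetical parameter; rows with at most this many ink
               pixels are treated as gaps between lines.
    Returns a list of (top, bottom) row indices, bottom exclusive.
    """
    # Horizontal projection profile: number of ink pixels per row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > threshold and start is None:
            start = y                  # a text line begins
        elif ink <= threshold and start is not None:
            lines.append((start, y))   # the text line ends at a gap row
            start = None
    if start is not None:
        # The last line runs to the bottom edge of the page.
        lines.append((start, len(profile)))
    return lines


# Toy 6x4 "page": two text lines separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))
```

      On degraded scans a raw profile rarely drops cleanly to zero between lines, which is presumably why the abstract pairs the projection with morphological operations; raising `threshold` is only a crude stand-in for that.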
  • Topic models do not model topics: epistemological remarks and steps towards best practices

    • Abstract: The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological concerns centering around the interface between topic modeling and linguistic concepts and the argumentative embedding of evidence obtained through topic modeling. It concludes that topic modeling in its present state of methodological integration does not meet the requirements of an independent research method. It operates from relevantly unrealistic assumptions, is non-deterministic, cannot effectively be validated against a reasonable number of competing models, does not lock into a well-defined linguistic interface, and does not scholarly model topics in the sense of themes or content. These features are intrinsic and make the interpretation of its results prone to apophenia (the human tendency to perceive random sets of elements as meaningful patterns) and confirmation bias (the human tendency to perceptually prefer patterns that are in alignment with pre-existing biases). While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceptual topics in any reliable way.
      PubDate: Wed, 27 Oct 2021 15:38:17 +020
  • Publishing open-access bibliographical data on Ancient Greek and Latin texts: challenges, constraints, progression

    • Abstract: We present here both some of our thoughts on methodology in relation to the specific constraints that complexify the ways of structuring and accessing bibliographical data in the Sciences of Antiquity, and the solutions adopted by the IPhiS-CIRIS project for dealing with these constraints. The project began in 2014 in a general scientific environment that was still being standardised and structured, with digital bibliographical resources in this disciplinary field becoming increasingly numerous, although of uneven quality and hard to access and/or private.
      PubDate: Tue, 05 Oct 2021 08:14:33 +020
  • The renewal of the digital humanities. An overview of the transformation of professions in the humanities and social sciences

    • Abstract: This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris) which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also provided information for this overview thanks to its role as a space for discussion about the digital humanities along with training and structuring of this field at the ENS and the Université Paris Sciences & Lettres (PSL).
      PubDate: Tue, 28 Sep 2021 11:40:10 +020
  • French vital records data gathering and analysis through image processing and machine learning algorithms

    • Abstract: Vital records are rich in meaningful historical data concerning city as well as countryside inhabitants that can be used, among other things, to study former populations and then reveal the social, economic and demographic characteristics of those populations. However, these studies face a major difficulty in collecting the data needed, since most of these records are scanned documents that need a manual transcription step in order to gather all the data and start exploiting it from a historical point of view. This step consequently slows down the historical research and is an obstacle to a better knowledge of the population habits depending on their social conditions. Therefore, in this paper, we present a modular and self-sufficient analysis pipeline that uses state-of-the-art algorithms, works mostly regardless of the document layout, and aims to automate this data extraction process.
      PubDate: Thu, 15 Jul 2021 13:55:51 +020
  • Conceptual modeling of prosopographic databases integrating quality

    • Abstract: Prosopographic databases, which allow the study of social groups through their bibliography, are used today by a significant number of historians. Computerization has allowed intensive and large-scale exploitation of these databases. The modeling of these prosopographic databases has given rise to several data models. An important problem is to ensure a level of quality of the stored information. In this article, we propose a generic data model that can describe most of the existing prosopographic databases and enrich them by integrating several quality concepts such as uncertainty, reliability, accuracy or completeness.
      PubDate: Fri, 07 May 2021 04:20:20 +020
  • Corpus and Models for Lemmatisation and POS-tagging of Classical French

    • Abstract: This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves to be robust during out-of-domain tests, i.e. up to 20th c. novels.
      PubDate: Sun, 14 Feb 2021 09:36:20 +010
  • Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

    • Abstract: The design of models that govern diseases in population is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibit their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.
      PubDate: Wed, 20 Jan 2021 10:39:56 +010
  • Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    • Abstract: The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.
      PubDate: Tue, 19 Jan 2021 11:29:40 +010
  • Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface

    • Abstract: Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' position. The goal of this research is to assist in obtaining more labelled data through user interaction and provide retrieval tools that use only standard character typefaces extracted from font files. In this paper, a character segmentation method is proposed to predict the candidate characters' area without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and matching. The experimental results demonstrate that the proposed character segmentation method performs well on Asian collector's seals, with 85% of the test data being correctly segmented.
      PubDate: Mon, 11 Jan 2021 13:56:45 +010
  • Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

    • Abstract: Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period of search and filter options as the only features, the availability of more advanced tools and the desire for more options among users has ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper will not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.
      PubDate: Mon, 11 Jan 2021 13:54:32 +010
  • Indigenous frameworks for data-intensive humanities: recalibrating the past through knowledge engineering and generative modelling.

    • Abstract: Identifying, contacting and engaging missing shareholders constitutes an enormous challenge for Māori incorporations, iwi and hapū across Aotearoa New Zealand. Without accurate data or tools to harmonise existing fragmented or conflicting data sources, issues around land succession, opportunities for economic development, and maintenance of whānau relationships are all negatively impacted. This unique three-way research collaboration between Victoria University of Wellington (VUW), Parininihi ki Waitotara Incorporation (PKW), and University of Auckland funded by the National Science Challenge Science for Technological Innovation catalyses innovation through new digital humanities-inflected data science modelling and analytics with the kaupapa of reconnecting missing Māori shareholders for a prosperous economic, cultural, and socially revitalised future. This paper provides an overview of VUW's culturally-embedded social network approach to the project, discusses the challenges of working within an indigenous worldview, and emphasises the importance of decolonising digital humanities.
      PubDate: Fri, 08 Jan 2021 10:13:14 +010
  • TraduXio Project: Latest Upgrades and Feedback

    • Abstract: TraduXio is a digital environment for computer assisted multilingual translation which is web-based, free to use and with an open source code. Its originality is threefold: whereas traditional technologies are limited to two languages (source/target), TraduXio enables the comparison of different versions of the same text in various languages; its concordancer provides relevant and multilingual suggestions through a classification of the source according to the history, genre and author; it uses collaborative devices (privilege management, forums, networks, history of modification, etc.) to promote collective (and distributed) translation. TraduXio is designed to encourage the diversification of language learning and to promote a reappraisal of translation as a professional skill. It can be used in many different ways, by very diverse kinds of people. In this presentation, I will present the recent developments of the software (its version 2.1) and illustrate how specific groups (language teaching, social sciences, literature) use it on a regular basis. In this paper, I present the technology but concentrate more on the possible uses of TraduXio, thus focusing on translators' feedback about their experience when working in this digital environment in a truly collaborative way.
      PubDate: Fri, 08 Jan 2021 05:52:04 +010