International Journal on Digital Libraries
Journal Prestige (SJR): 0.441
Citation Impact (citeScore): 2
Number of Followers: 796  
  Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
Published by Springer-Verlag
  • Feature selection for classifying multi-labeled past events
    • Abstract: The study and analysis of past events can provide numerous benefits. While event categorization has been studied before, previous work usually assigned only one category to each event. In this study, we focus on multi-label classification of past events, a more general and challenging problem than those approached in previous studies. We categorize events into thirteen different types using a range of diverse features and classifiers trained on a dataset that has at least 50 labeled news articles per category. We confirmed that using all the features to train classifiers yields statistically significant improvements in micro- and macro-average \(F_1\), multi-label accuracy, average precision@5, area under the receiver operating characteristic curve, and example-based loss functions.
      PubDate: 2020-09-08
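The micro- and macro-averaged \(F_1\) scores mentioned above can be computed directly from per-label true/false positive counts. A minimal sketch in Python, with illustrative label sets rather than the paper's event data:

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(gold, pred, n_labels):
    """gold/pred: lists of label-index sets, one set per example."""
    per_label = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        for lbl in range(n_labels):
            if lbl in p and lbl in g:
                per_label[lbl][0] += 1
            elif lbl in p:
                per_label[lbl][1] += 1
            elif lbl in g:
                per_label[lbl][2] += 1
    tp = sum(c[0] for c in per_label)
    fp = sum(c[1] for c in per_label)
    fn = sum(c[2] for c in per_label)
    micro = f1(tp, fp, fn)                             # pool counts, then F1
    macro = sum(f1(*c) for c in per_label) / n_labels  # F1 per label, then mean
    return micro, macro

# Illustrative gold labels and predictions for three examples, three labels.
gold = [{0, 2}, {1}, {0, 1, 2}]
pred = [{0}, {1, 2}, {0, 1}]
micro, macro = micro_macro_f1(gold, pred, 3)
```

Micro-averaging pools counts across labels (frequent labels dominate), while macro-averaging weights every label equally, which is why the paper reports both.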
  • A crowdsourcing approach to construct mono-lingual plagiarism detection
           corpus
    • Abstract: Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier to commit but, on the other hand, also easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Plagiarism detection corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform was developed, and crowd workers were asked to paraphrase fragments of text in order to simulate real cases of plagiarism. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.
      PubDate: 2020-09-07
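As a rough illustration of the text-reuse detection that such corpora are built to evaluate, here is a minimal character-n-gram containment baseline. The function names and n-gram size are illustrative assumptions, not the paper's method:

```python
def ngrams(text, n=5):
    """Set of character n-grams after whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def containment(suspicious, source, n=5):
    """Fraction of the suspicious document's n-grams found in the source."""
    s = ngrams(suspicious, n)
    return len(s & ngrams(source, n)) / len(s)

# Illustrative pair: heavy overlap should give a high (but not 1.0) score.
score = containment("the quick brown fox jumps", "a quick brown fox jumped today")
```

Containment is asymmetric by design: it asks how much of the suspicious text is covered by the source, which suits plagiarism detection better than a symmetric similarity.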
  • Identification of tweets that mention books
    • Abstract: We address the task of identifying tweets that mention books from among tweets that contain the same strings as book titles. Assuming the existence of a comprehensive list of book titles, this task can be defined as text classification targeting tweets that contain the same string as a book title. In carrying out the task, we need to exclude two types of tweets. The first is automatically posted, spam-like tweets that promote book sales or post recommendations (bot tweets). These tweets are excluded because we are developing an online surrogate for book exposure embedded within human communication on social media, and the results of the present task are to be used in this system. The second is tweets that contain the same string as a book title but are not about books (noise tweets). We propose a two-step, machine-learning-based pipeline consisting of bot filtering followed by noise reduction. Experimental evaluation showed that our proposed method achieved an F1-score of 0.76, which is comparable to the best performance reported in similar tasks and sufficient as a first step for use in practical applications. We also analysed the detailed performance and errors, which suggested that the proposed method maintains an appropriate balance between precision and recall and can be further improved by increasing the data size and taking word senses into account.
      PubDate: 2020-09-01
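The two-step pipeline described above (bot filtering, then noise reduction) can be sketched with toy rule-based stand-ins for the paper's machine-learned classifiers. The rules and sample tweets are purely illustrative:

```python
def is_bot(tweet):
    """Toy bot filter: promotional link plus hashtag spam."""
    return "http" in tweet["text"] and tweet["text"].count("#") >= 2

def mentions_book(tweet, title):
    """Toy noise filter: require a reading-related cue near the title string."""
    cues = ("read", "novel", "author", "chapter")
    return title in tweet["text"] and any(c in tweet["text"] for c in cues)

def pipeline(tweets, title):
    survivors = [t for t in tweets if not is_bot(t)]          # step 1: bot filter
    return [t for t in survivors if mentions_book(t, title)]  # step 2: noise reduction

tweets = [
    {"text": "Buy Dune now!! #ad #sale http://x.co"},          # bot tweet
    {"text": "finished reading Dune last night, what a novel"},  # true mention
    {"text": "Dune buggies are fun at the beach"},             # noise tweet
]
hits = pipeline(tweets, "Dune")
```

The paper replaces both rules with trained classifiers; the point here is only the cascade structure, where the cheap bot filter runs before the harder word-sense decision.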
  • Improving semantic change analysis by combining word embeddings and word
           frequencies
    • Abstract: Language is constantly evolving. As part of diachronic linguistics, semantic change analysis examines how the meanings of words evolve over time. Such semantic awareness is important for retrieving content from digital libraries. Recent research on semantic change analysis relying on word embeddings has yielded significant improvements over previous work. However, a recent but so far somewhat neglected observation is that the rate of semantic shift correlates negatively with word-usage frequency. In this article, we therefore propose SCAF, Semantic Change Analysis with Frequency. It abstracts from the concrete embeddings and includes word frequencies as an orthogonal feature. SCAF allows using different combinations of embedding type, optimization algorithm and alignment method. Additionally, we leverage existing approaches for time series analysis, using change detection methods to identify semantic shifts. In an evaluation with a realistic setup, SCAF achieves better detection rates than prior approaches, 95% instead of 51%. On the Google Books Ngram data set, our approach detects both known and previously unknown shifts for popular words.
      PubDate: 2020-09-01
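The core idea summarized above (semantic distance between a word's aligned embeddings in two periods, with usage frequency as an orthogonal feature) can be sketched as follows. The vectors and frequency value are illustrative; SCAF's actual feature set is not specified at this level of detail:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def shift_features(vec_t1, vec_t2, rel_freq):
    """Feature pair for a change detector: (semantic distance, log frequency).

    vec_t1/vec_t2: the word's embeddings in two time slices, already aligned
    to a common space; rel_freq: the word's relative corpus frequency.
    """
    return 1.0 - cosine(vec_t1, vec_t2), math.log10(rel_freq)

# Illustrative toy vectors for one word in two decades, plus its frequency.
dist, logf = shift_features([1.0, 0.0], [0.6, 0.8], 1e-5)
```

A downstream change-detection method can then treat frequency as a correction term, so that rare words are not flagged merely because rare words drift faster.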
  • Choice overload and recommendation effectiveness in related-article
           recommendations
    • Abstract: Choice overload describes a situation in which a person has difficulty making decisions due to too many options. We examine choice overload when displaying related-article recommendations in digital libraries, and we examine the effectiveness of recommendation algorithms in this domain. We first analyzed existing digital libraries and found that only 30% of digital libraries show related-article recommendations to their users. Of these libraries, the majority (74%) display 3–5 related articles; 28% display 6–10 related articles; and no digital library displayed more than ten related-article recommendations. We then conducted our experimental evaluation through GESIS’ digital library Sowiport, with recommendations delivered by the recommendations-as-a-service provider Mr. DLib. We use four metrics to analyze 41.3 million delivered recommendations: click-through rate (CTR), percentage of clicked recommendation sets (clicked set rate, CSR), average clicks per clicked recommendation set (ACCS), and time to first click (TTFC), the time between delivery of a set of recommendations and the first click. These metrics help us analyze choice overload and can yield evidence for finding the ideal number of recommendations to display. We found that with increasing recommendation set size, i.e., the number of displayed recommendations, CTR decreases from 0.41% for one recommendation to 0.09% for 15 recommendations. Most recommendation sets receive only one click. ACCS increases with set size, but more slowly from six recommendations onward. When displaying 15 recommendations, the average number of clicks per set is at its maximum (1.15). Similarly, TTFC increases with larger recommendation set size, but more slowly for sets of more than five recommendations. While CTR and CSR do not indicate choice overload, ACCS and TTFC point toward 5–6 recommendations as being optimal for Sowiport. Content-based filtering yields the highest CTR with 0.118%, while stereotype recommendations yield the highest ACCS (1.28). Stereotype recommendations also yield the highest TTFC, meaning that users take more time before clicking stereotype recommendations compared to recommendations based on other algorithms.
      PubDate: 2020-09-01
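The four engagement metrics defined above can be computed from a recommendation-delivery log. A minimal sketch with an assumed log schema (the field names are illustrative; the study's actual schema is not given in the abstract):

```python
def metrics(sets):
    """Each delivered set: {'shown': int, 'click_times': [seconds after delivery]}."""
    shown = sum(s["shown"] for s in sets)
    clicks = sum(len(s["click_times"]) for s in sets)
    clicked = [s for s in sets if s["click_times"]]
    ctr = clicks / shown                                # click-through rate
    csr = len(clicked) / len(sets)                      # clicked set rate
    accs = clicks / len(clicked) if clicked else 0.0    # avg clicks per clicked set
    ttfc = (sum(min(s["click_times"]) for s in clicked) / len(clicked)
            if clicked else 0.0)                        # mean time to first click
    return ctr, csr, accs, ttfc

# Illustrative log: three deliveries of five recommendations each.
log = [
    {"shown": 5, "click_times": [12.0, 40.0]},
    {"shown": 5, "click_times": []},
    {"shown": 5, "click_times": [8.0]},
]
ctr, csr, accs, ttfc = metrics(log)
```

Note that CTR normalizes by recommendations shown while CSR normalizes by sets delivered, which is why the two can diverge as set size grows.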
  • Thinking digital libraries for preservation as digital cultural heritage:
           by R to R4 facet of FAIR principles
    • Abstract: Article 2 of the EU Council conclusions of 21 May 2014 on cultural heritage as a strategic resource for a sustainable Europe (2014/C 183/08) states: “Cultural heritage consists of the resources inherited from the past in all forms and aspects—tangible, intangible and digital (born digital and digitized), including monuments, sites, landscapes, skills, practices, knowledge and expressions of human creativity, as well as collections conserved and managed by public and private bodies such as museums, libraries and archives”. Starting from this assumption, we have to rethink digital and digitization as social and cultural expressions of the contemporary age. We need to rethink digital libraries produced by digitization as cultural entities and no longer as mere datasets for enhancing the fruition of cultural heritage, by defining clear and homogeneous criteria to validate and certify them as memory and sources of knowledge for future generations. By expanding R: Re-usable of the FAIR Guiding Principles for scientific data management and stewardship into R4: Re-usable, Relevant, Reliable and Resilient, this paper proposes a more reflective approach to the creation of descriptive metadata for managing digital resources of cultural heritage, one that can guarantee their long-term preservation.
      PubDate: 2020-08-27
  • Citation recommendation: approaches and datasets
    • Abstract: Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.
      PubDate: 2020-08-11
  • Representing quantitative documentation of 3D cultural heritage artefacts
           with CIDOC CRMdig
    • Abstract: In this paper, we explore the theme of the documentation of 3D cultural heritage assets, not only as entire artefacts but also including the features of the object that are interesting from an archaeological perspective. Indeed, the goal is to support archaeological research and curation, providing a different approach to enrich the documentation of digital resources and their components with corresponding measurements, combining semantic and geometric techniques. A documentation scheme based on CIDOC is discussed, in which measurements on digital data have been included by extending CIDOC CRMdig. To accurately annotate the components and features of the artefacts, a controlled vocabulary named Cultural Heritage Artefact Partonomy (CHAP) has been defined and integrated into the scheme as a SKOS taxonomy to showcase the proposed methodology. CHAP concerns Coroplastic, the study of ancient terracotta figurines, and in particular the Cypriot production. Two case studies have been considered: the terracotta statues from the port of Salamis and the small clay statuettes from the Ayia Irini sanctuary. Focussing both on the artefacts and their digital counterparts, the proposed methodology effectively supports typical operations within digital libraries and repositories (e.g. search, part-based annotation), as well as more specific objectives such as archaeological interpretation and digitally assisted classification, as proved in a real archaeological scenario. The proposed approach is general and applies to different contexts, since it is able to support any archaeological research whose goal is an extensive digital documentation of tangible findings including quantitative attributes.
      PubDate: 2020-08-08
  • OrgBR-M: a method to assist in organizing bibliographic material based on
           formal concept analysis—a case study in educational data mining
    • Abstract: Conducting a literature review requires a preliminary organization of the available bibliographic material. In this article, we present a novel method called OrgBR-M (method to organize bibliographic references), based on formal concept analysis theory, to assist in organizing bibliographic material. Our method systematizes the organization of bibliography and proposes metrics to assist in guiding the literature review. As a case study, we apply the OrgBR-M method to perform a literature review of the educational data mining field of study.
      PubDate: 2020-08-01
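Formal concept analysis, which OrgBR-M builds on, derives concepts as (extent, intent) pairs from an object-attribute context. A minimal sketch with an illustrative bibliographic context (the papers and attributes are invented):

```python
# Toy context: which papers exhibit which topical attributes.
CONTEXT = {
    "paper_a": {"clustering", "education"},
    "paper_b": {"clustering", "prediction"},
    "paper_c": {"prediction", "education"},
}

def intent(objects):
    """Attributes shared by all objects in the set."""
    attrs = None
    for o in objects:
        attrs = set(CONTEXT[o]) if attrs is None else attrs & CONTEXT[o]
    return attrs or set()

def extent(attrs):
    """Objects possessing all the given attributes."""
    return {o for o, a in CONTEXT.items() if attrs <= a}

# A formal concept is a pair (X, Y) with extent(Y) == X and intent(X) == Y;
# applying extent∘intent closes an object set into a concept extent.
ext = extent(intent({"paper_a", "paper_b"}))
```

Enumerating all such closed pairs yields the concept lattice that a method like OrgBR-M can use to group bibliographic references by shared attributes.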
  • A fuzzy-based framework for evaluation of website design quality index
    • Abstract: The often unrecognized significance of the web acts as a driving force for the massive and rapid growth of websites in every domain of social life. To build a successful website, developers need to embrace appropriate web testing and evaluation methodologies. Some valuable works in the past have striven to appraise web applications quantitatively. Various parameters have been considered, which are in turn sub-parameterized into measurable indicators. But their weighting criteria have not been appropriately chosen according to the domain of the website, and the relative degrees of interaction among parameters have not been taken into consideration. The work presented in this paper describes a framework, the Quality Index Evaluation Method, to gauge the design quality of a website in the form of an index value. An automated tool has been designed and coded to measure the metrics quantitatively. A weighting technique based on Fuzzy-DEMATEL (Decision Making Trial and Evaluation Laboratory Method) has been applied to these metrics. Fuzzy trapezoidal numbers have been used for the assessment of parameters and the final design quality index value. To verify the use of the framework in different website domains, it has been exercised on eight academic (four institutional and four digital library), five informative and four commercial websites. The results have been validated through the most widely used method in the literature, i.e., user judgment. Opinions of users for each website have been quantified and aggregated with a fuzzy aggregation technique. Experimental results show that the proposed framework provides accurate and consistent results in much less time.
      PubDate: 2020-07-29
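A minimal sketch of the trapezoidal fuzzy arithmetic that such a Fuzzy-DEMATEL weighting step relies on: component-wise aggregation of trapezoidal numbers (a, b, c, d) and centroid defuzzification. The sample ratings are illustrative, not the paper's data:

```python
def aggregate(ratings):
    """Component-wise mean of trapezoidal fuzzy numbers (a, b, c, d)."""
    n = len(ratings)
    return tuple(sum(r[i] for r in ratings) / n for i in range(4))

def defuzzify(t):
    """Centroid of a trapezoidal membership function -> crisp value."""
    a, b, c, d = t
    num = (d ** 2 + c ** 2 + c * d) - (a ** 2 + b ** 2 + a * b)
    den = 3 * ((d + c) - (a + b))
    # A degenerate (rectangular/point) shape reduces to the simple mean.
    return (a + b + c + d) / 4 if den == 0 else num / den

# Two illustrative expert ratings of one design parameter.
agg = aggregate([(0, 1, 2, 3), (2, 3, 4, 5)])
crisp = defuzzify(agg)
```

Defuzzifying after aggregation (rather than before) preserves the spread of expert opinions until the final index value is needed.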
  • PVAF: an environment for disambiguation of scientific publication venues
    • Abstract: A publication venue authority file stores variants of the names of journals and conferences that publish scientific articles. It is useful in the construction of search tools and data disambiguation, and it is of special interest to agencies funding research and evaluating graduate programs, which use the quality of publication venues as a basis for evaluating researchers’ and research groups’ publications. However, keeping an updated authority file is not a trivial task. Different names are used to refer to the same publication venue, these venues sometimes change their name, new venues emerge regularly, and journal bibliometrics are updated frequently. This paper presents the publication venue authority file (PVAF), an environment for the disambiguation of scientific publication venues. It consists of an authority file and a set of tools for updating and querying its data. We describe and experimentally evaluate each of these tools. We also propose a search algorithm based on an associative classifier, which allows for incremental updates of its learning model. The results show that the PVAF has coverage greater than 86% for publication venues in several fields of knowledge, and its tools attain a good accuracy in the classification of publication venues from curricula vitae formatted in various citation styles.
      PubDate: 2020-07-26
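One small ingredient of the venue-disambiguation problem PVAF addresses is normalizing name variants before matching them against an authority file. A minimal sketch, with an illustrative abbreviation table (PVAF's actual matching uses an associative classifier, not this rule table):

```python
import re

# Illustrative expansion table for common bibliographic abbreviations.
ABBREV = {"intl": "international", "int": "international", "j": "journal",
          "proc": "proceedings", "conf": "conference"}

def normalize(name):
    """Lowercase, strip punctuation, expand abbreviations, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREV.get(t, t) for t in tokens
                    if t not in {"of", "the", "on"})

def same_venue(a, b):
    """True if two name variants normalize to the same canonical form."""
    return normalize(a) == normalize(b)

match = same_venue("Int. J. on Digital Libraries",
                   "International Journal of Digital Libraries")
```

Normalization of this kind typically serves as a cheap blocking step; ambiguous cases that survive it are what the learned classifier has to resolve.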
  • A framework for modelling and visualizing the US Constitutional Convention
           of 1787
    • Abstract: Abstract This paper describes a new approach to the presentation of records relating to formal negotiations and the texts that they create. It describes the architecture of a model, platform, and web interface ( that can be used by domain experts to convert the records typical of formal negotiations into a model of decision-making (with minimal training). This model has implications for both research and teaching, by allowing for better qualitative and quantitative analysis of negotiations. The platform emphasizes the reconstruction as closely as possible of the context within which proposals and decisions are made. The usability and benefits of a generic platform are illustrated by a presentation of the records relating to the 1787 Constitutional Convention that wrote the Constitution of the USA.
      PubDate: 2020-06-01
  • On the effectiveness of the scientific peer-review system: a case study of
           the Journal of High Energy Physics
    • Abstract: The importance of and need for the peer-review system are highly debated in the academic community, and recently there has been a growing consensus to get rid of it completely. This is one of the steps in the publication pipeline that usually requires the publishing house to invest a significant portion of its budget to ensure quality editing and reviewing of the submissions received. A very pertinent question, therefore, is whether such investments are worth making at all. To answer this question, in this paper we perform a rigorous measurement study on a massive dataset (29k papers with 70k distinct review reports) to unfold the detailed characteristics of the peer-review process, considering the three most important entities of this process: (i) the paper, (ii) the authors, and (iii) the referees. We thereby identify different factors related to these three entities that can be leveraged to predict the long-term impact of a submitted paper. These features, when plugged into a regression model, achieve a high \(R^2\) of 0.85 and an RMSE of 0.39. Analysis of feature importance indicates that reviewer- and author-related features are most indicative of the long-term impact of a paper. We believe that our framework could be used to assist editors in deciding the fate of a paper.
      PubDate: 2020-06-01
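The two fit statistics reported above, \(R^2\) and RMSE, computed from scratch on toy predicted-vs-observed impact values (the numbers are illustrative, not the paper's data):

```python
import math

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-square error."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual SS
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total SS
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

r2, rmse = r2_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

\(R^2\) measures variance explained relative to a mean-only baseline, while RMSE reports error in the target's own units, which is why papers commonly report both.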
  • Toward comprehensive event collections
    • Abstract: Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which includes not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.
      PubDate: 2020-06-01
  • Building and querying semantic layers for web archives (extended version)
    • Abstract: Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata about the archived documents, annotating them with useful semantic information (like entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems cannot sufficiently satisfy.
      PubDate: 2020-06-01
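In miniature, a semantic layer is a set of (subject, predicate, object) triples with pattern-based querying, which RDF stores and SPARQL provide at scale. The entity names and predicates below are illustrative, not the paper's vocabulary:

```python
# Toy semantic layer over two archived documents.
TRIPLES = {
    ("doc42", "mentionsEntity", "Internet_Archive"),
    ("doc42", "publishedIn", "2015"),
    ("doc7", "mentionsEntity", "Internet_Archive"),
    ("doc7", "publishedIn", "2012"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    like an unbound variable in a SPARQL basic graph pattern."""
    return {t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# "Which archived documents mention the Internet Archive?"
docs = {s for s, _, _ in query(p="mentionsEntity", o="Internet_Archive")}
```

This is exactly the kind of entity-centric question a keyword index struggles with, since the entity may be mentioned under many surface forms that the annotation step has already resolved.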
  • Toward meaningful notions of similarity in NLP embedding models
    • Abstract: Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, conducting two series of experiments. The first examines how the distribution of similarity values depends on the embedding model algorithms and parameters. The second starts by showing that intuitive similarity thresholds do not exist; we then propose a method that states which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we calculate how these thresholds, when taken into account during evaluation, change the evaluation scores of the models on similarity test sets. In more abstract terms, our insights lead to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
      PubDate: 2020-06-01
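One way to derive a data-driven similarity threshold, in the spirit of the analysis above, is from the observed distribution of similarity values rather than from intuition. A minimal sketch using mean plus two standard deviations (an illustrative cutoff, not the paper's method):

```python
import math

def threshold(similarities, k=2.0):
    """Cutoff at mean + k standard deviations of the observed distribution."""
    n = len(similarities)
    mu = sum(similarities) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in similarities) / n)
    return mu + k * sd

# Illustrative pairwise similarities: background noise plus one genuine pair.
sims = [0.10, 0.12, 0.08, 0.11, 0.09, 0.50]
cut = threshold(sims)
meaningful = [s for s in sims if s >= cut]
```

The point the paper makes is that a fixed value like 0.5 means different things in different models; a distribution-relative cutoff adapts to each model's similarity statistics.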
  • Introduction to the focused issue on the 2017 ACM/IEEE-CS Joint Conference
           on Digital Libraries JCDL 2017
    • PubDate: 2020-05-18
  • An analysis and comparison of keyword recommendation methods for
           scientific data
    • Abstract: To classify and search various kinds of scientific data, it is useful to annotate those data with keywords from a controlled vocabulary. Data providers, such as researchers, annotate their own data with keywords from the provided vocabulary. However, selecting suitable keywords requires extensive knowledge of both the research domain and the controlled vocabulary, so annotating scientific data with keywords from a controlled vocabulary is a time-consuming task for data providers. In this paper, we discuss methods for recommending relevant keywords from a controlled vocabulary for the annotation of scientific data through their metadata. Many previous studies have proposed approaches based on keywords in similar existing metadata; we call this the indirect method. However, when the quality of the existing metadata set is insufficient, the indirect method tends to be ineffective. Because the controlled vocabularies for scientific data usually provide a definition sentence for each keyword, it is also possible to recommend keywords based on the target metadata and the keyword definitions; we call this the direct method. The direct method does not use the existing metadata set and is therefore independent of its quality. For the evaluation of keyword recommendation methods, we also propose evaluation metrics based on the hierarchical vocabulary structure that is a distinctive feature of most controlled vocabularies. Using our proposed evaluation metrics, we can evaluate keyword recommendation methods with an emphasis on keywords that are more difficult for data providers to select. In experiments using real earth science datasets, we compare the direct and indirect methods to verify their effectiveness, and observe how the indirect method depends on the quality of the existing metadata set. The results show the importance of metadata quality in recommending keywords.
      PubDate: 2020-02-07
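The "direct method" described above can be sketched as ranking controlled-vocabulary keywords by lexical overlap between the target metadata and each keyword's definition sentence, with no reliance on existing annotations. The vocabulary entries and the bag-of-words cosine are illustrative simplifications of whatever text matching the paper actually uses:

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between two texts as bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(metadata, definitions, k=2):
    """Top-k keywords whose definitions best match the target metadata."""
    return sorted(definitions,
                  key=lambda kw: bow_cosine(metadata, definitions[kw]),
                  reverse=True)[:k]

# Illustrative controlled vocabulary with definition sentences.
definitions = {
    "sea surface temperature": "temperature of the ocean surface water",
    "soil moisture": "water content held in the soil",
    "aerosol": "particles suspended in the atmosphere",
}
top = recommend("monthly ocean surface water temperature grids", definitions, k=1)
```

Because it only needs the definitions, this sketch reflects the direct method's key property: it works even when no existing annotated metadata set is available.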
  • Extending the IFLA Library Reference Model for a Brazilian popular music
           digital library
    • Abstract: Brazil is recognized as a musical country, with a diverse collection of musical resources served by many digital repositories and music libraries. Historically, those systems have been supported by cataloging schemes that are insufficient because they follow standards focused more on the catalog record than on the structure of the cataloged works. At the same time, multi-entity bibliographic conceptual models such as IFLA LRM have become popular; they seem to be an interesting solution because they (i) have a better capacity to represent the internal architecture of musical objects; (ii) support the creation of cataloging codes and international standards that enable interoperability between collections; and (iii) can be adapted through extension mechanisms. The purpose of this paper is to present an experiment in extending IFLA LRM in the context of a Brazilian popular music digital library application, in order to identify its expressive power for this specific domain. Instead of making direct use of the entities, attributes and relationships of IFLA LRM, the adopted method was to map concepts of a specific conceptual model (adherent to the digital library in question) that represents aspects of Brazilian popular music onto IFLA LRM elements. The resulting extended model proved to be aligned with users’ information needs and demonstrated the capacity of IFLA LRM to adapt to specific domains. In addition, the strategy of extending IFLA LRM from a specific model seemed appropriate for dealing with its level of generality.
      PubDate: 2020-01-31
  • Historical document layout analysis using anisotropic diffusion and
           geometric features
    • Abstract: Several digital libraries worldwide maintain valuable historical manuscripts. Usually, digital copies of these manuscripts are offered to researchers and readers in raster-image format. These images carry several document degradations that may hinder automatic information-retrieval solutions such as manuscript indexing, categorization, retrieval by content, etc. In this paper, we propose a learning-free, hybrid document layout analysis method for handwritten historical manuscripts. It has two main phases: page characterization and segmentation. First, the proposed method locates the main content initially using top-down whitespace analysis, employing anisotropic diffusion filtering to find whitespaces. Then, it extracts template features representing the manuscript author’s writing behavior. After that, moving windows are used to scan the manuscript page and define main-content boundaries more precisely. We evaluated the proposed method on two datasets: one is publicly available and contains 38 historical manuscript pages, and the other consists of 51 historical manuscript pages collected from the online Harvard Library. Experiments on both datasets show promising results, with a main-content segmentation success rate of up to 98.5%.
      PubDate: 2020-01-23
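A single explicit step of 1-D Perona-Malik (anisotropic) diffusion, the filter family the method above uses to smooth page images while preserving edges between ink and whitespace. The parameters and sample pixel row are illustrative, and a real implementation works on 2-D images:

```python
import math

def diffuse_step(row, kappa=30.0, lam=0.2):
    """One explicit diffusion step on a 1-D row of grayscale values.

    Edge-stopping function g(grad) = exp(-(grad/kappa)^2): small gradients
    (noise) diffuse freely, large gradients (edges) are preserved.
    """
    out = row[:]
    for i in range(1, len(row) - 1):
        east = row[i + 1] - row[i]
        west = row[i - 1] - row[i]
        g_e = math.exp(-(east / kappa) ** 2)
        g_w = math.exp(-(west / kappa) ** 2)
        out[i] = row[i] + lam * (g_e * east + g_w * west)
    return out

# Small noise bump at index 2; strong ink/whitespace edge at indices 4-5.
row = [0, 0, 5, 0, 0, 255, 255, 255]
smoothed = diffuse_step(row)
```

After one step the noise bump shrinks toward its neighbours, while the 0-to-255 edge is left essentially untouched because g is nearly zero there; this selectivity is what makes the filter useful for whitespace analysis.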