International Journal on Digital Libraries
Journal Prestige (SJR): 0.441
Citation Impact (citeScore): 2
Number of Followers: 668  
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
Published by Springer-Verlag [2351 journals]
  • Guest editors’ introduction to the special issue on web archiving
    • Authors: Edward A. Fox; Martin Klein; Zhiwu Xie
      Pages: 1 - 2
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0203-5
      Issue No: Vol. 19, No. 1 (2018)
  • API-based social media collecting as a form of web archiving
    • Authors: Justin Littman; Daniel Chudnov; Daniel Kerchner; Christie Peterson; Yecheng Tan; Rachel Trent; Rajat Vij; Laura Wrubel
      Pages: 21 - 38
      Abstract: Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0201-7
      Issue No: Vol. 19, No. 1 (2018)
  • ArchiveWeb: collaboratively extending and exploring web archive collections—How would you like to work with your collections?
    • Authors: Zeon Trevor Fernando; Ivana Marenzi; Wolfgang Nejdl
      Pages: 39 - 55
      Abstract: Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring, and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0206-2
      Issue No: Vol. 19, No. 1 (2018)
  • Heuristic and supervised approaches to handwritten annotation extraction for musical score images
    • Authors: Eamonn Bell; Laurent Pugin
      Abstract: Performers’ copies of musical scores are typically rich in handwritten annotations, which capture historical and institutional performance practices. The development of interactive interfaces to explore digital archives of these scores and the systematic investigation of their meaning and function will be facilitated by the automatic extraction of handwritten score annotations. We present several approaches to the extraction of handwritten annotations of arbitrary content from digitized images of musical scores. First, we show promising results in certain contexts when using simple unsupervised clustering techniques to identify handwritten annotations in conductors’ scores. Next, we compare annotated scores to unannotated copies and use a printed sheet music comparison tool, Aruspix, to recover handwritten annotations as additions to the clean copy. Using both of these techniques in a combined annotation pipeline qualitatively improves the recovery of handwritten annotations. Recent work has shown the effectiveness of reframing classical optical musical recognition tasks as supervised machine learning classification tasks. In the same spirit, we pose the problem of handwritten annotation extraction as a supervised pixel classification task, where the feature space for the learning task is derived from the intensities of neighboring pixels. After an initial investment of time required to develop dependable training data, this approach can reliably extract annotations for entire volumes of score images without further supervision. These techniques are demonstrated using a sample of orchestral scores annotated by professional conductors of the New York Philharmonic Orchestra. Handwritten annotation extraction in musical scores has applications to the systematic investigation of score annotation practices by performers, annotator attribution, and to the interactive presentation of annotated scores, which we briefly discuss.
      PubDate: 2018-07-11
      DOI: 10.1007/s00799-018-0249-7
  • Image libraries and their scholarly use in the field of art and architectural history
    • Authors: Sander Münster; Christina Kamposiori; Kristina Friedrichs; Cindy Kröber
      Abstract: The use of image libraries in the field of art and architectural history has been the subject of numerous research studies over the years. However, since previous investigations have focused primarily either on user behavior or on reviewing repositories, our aim is to bring together both approaches. Against this background, this paper identifies the main characteristics of the research and information behavior of art and architectural history scholars and students in the UK and Germany and presents a structured overview of currently available scholarly image libraries. Finally, the implications for a user-centered design of information resources and, in particular, image libraries are provided.
      PubDate: 2018-07-07
      DOI: 10.1007/s00799-018-0250-1
  • Building and querying semantic layers for web archives (extended version)
    • Authors: Pavlos Fafalios; Helge Holzmann; Vaibhav Kasturia; Wolfgang Nejdl
      Abstract: Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities, and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
      PubDate: 2018-07-05
      DOI: 10.1007/s00799-018-0251-0
  • Characterising online museum users: a study of the National Museums Liverpool museum website
    • Authors: David Walsh; Mark M. Hall; Paul Clough; Jonathan Foster
      Abstract: Museums are increasing access to their collections and providing richer user experiences via web-based interfaces. However, they are seeing high numbers of users looking at only one or two pages within 10 s and then leaving. To reduce this rate, a better understanding of the type of user who visits a museum website is required. Existing models for museum website users tend to focus on groups that are readily accessible for study or provide little detail in their definitions of the groups. This paper presents the results of a large-scale user survey for the National Museums Liverpool museum website in which data on a wide range of user characteristics were collected regarding their current visit to provide a better understanding of their motivations, tasks, engagement and domain knowledge. Results show that the frequently understudied general public and non-professional users make up the majority (approximately 77%) of the respondents.
      PubDate: 2018-07-05
      DOI: 10.1007/s00799-018-0248-8
  • Open information extraction as an intermediate semantic structure for Persian text summarization
    • Authors: Mahmoud Rahat; Alireza Talebpour
      Abstract: Semantic applications typically exploit structures such as dependency parse trees, phrase-chunking, semantic role labeling or open information extraction. In this paper, we introduce a novel application of Open IE as an intermediate layer for text summarization. Text summarization is an important method for providing relevant information in large digital libraries. Open IE refers to the process of extracting machine-understandable structural propositions from text. We use these propositions as building blocks to shorten sentences and generate a summary of the text. The proposed system offers a new form of summarization that is able to break the structure of the sentence and extract the most significant sub-sentential elements. Other advantages include the ability to identify and eliminate less important sections of the sentence (such as adverbs, adjectives, appositions or dependent clauses) or duplicate pieces of sentences, which in turn frees up space for more sentences in the summary, enhancing its coverage and coherence. The proposed system is localized for the Persian language; however, it can be adapted to other languages. Experiments performed on the standard data set “Pasokh” with a standard comparison tool showed promising results for the proposed approach. We used summaries produced by the system in a real-world application in the virtual library of Shahid Beheshti University and received good feedback from users.
      PubDate: 2018-06-28
      DOI: 10.1007/s00799-018-0244-z
  • Toward comprehensive event collections
    • Authors: Federico Nanni; Simone Paolo Ponzetto; Laura Dietz
      Abstract: Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which includes not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.
      PubDate: 2018-06-22
      DOI: 10.1007/s00799-018-0246-x
  • On the effectiveness of the scientific peer-review system: a case study of the Journal of High Energy Physics
    • Authors: Sandipan Sikdar; Paras Tehria; Matteo Marsili; Niloy Ganguly; Animesh Mukherjee
      Abstract: The importance of and the need for the peer-review system are highly debated in the academic community, and recently there has been a growing consensus to completely get rid of it. This is one of the steps in the publication pipeline that usually requires the publishing house to invest a significant portion of their budget in order to ensure quality editing and reviewing of the submissions received. Therefore, a very pertinent question is whether such investments are worth making. To answer this question, in this paper, we perform a rigorous measurement study on a massive dataset (29k papers with 70k distinct review reports) to unfold the detailed characteristics of the peer-review process considering the three most important entities of this process: (i) the paper, (ii) the authors, and (iii) the referees. We thereby identify different factors related to these three entities which can be leveraged to predict the long-term impact of a submitted paper. These features, when plugged into a regression model, achieve a high \(R^2\) of 0.85 and an RMSE of 0.39. Analysis of feature importance indicates that reviewer- and author-related features are most indicative of the long-term impact of a paper. We believe that our framework could be utilized in assisting editors to decide the fate of a paper.
      PubDate: 2018-06-15
      DOI: 10.1007/s00799-018-0247-9
  • Analyzing the network structure and gender differences among the members of the Networked Knowledge Organization Systems (NKOS) community
    • Authors: Fariba Karimi; Philipp Mayr; Fakhri Momeni
      Abstract: In this paper, we analyze a major part of the research output of the Networked Knowledge Organization Systems (NKOS) community in the period 2000–2016 from a network analytical perspective. We focus on the papers presented at the European and US NKOS workshops and, in addition, four special issues on NKOS in the last 16 years. For this purpose, we have generated an open dataset, the “NKOS bibliography”, which covers the bibliographic information of the research output. We analyze the co-authorship network of this community, which comprises 123 papers with a total of 256 distinct authors. We use standard network analytic measures such as degree, betweenness and closeness centrality to describe the co-authorship network of the NKOS dataset. First, we investigate global properties of the network over time. Second, we analyze the centrality of the authors in the NKOS network. Lastly, we investigate gender differences in collaboration behavior in this community. Our results show that, apart from differences in their centrality measures, scholars have a higher tendency to collaborate with those in the same institution or in close geographic proximity. We also find that homophily is higher among women in this community. Apart from small differences in closeness and clustering among men and women, we do not find any significant dissimilarities with respect to other centralities.
      PubDate: 2018-06-14
      DOI: 10.1007/s00799-018-0243-0
  • Promoting user engagement with digital cultural heritage collections
    • Authors: Maristella Agosti; Nicola Orio; Chiara Ponchia
      Abstract: In the context of cooperating in a project whose central aim has been the production of a corpus-agnostic research environment supporting access to and exploitation of digital cultural heritage collections, we have worked towards promoting user engagement with the collections. The aim of this paper is to present the methods and the solutions that have been envisaged and implemented to engage a diversified range of users with digital collections. Innovative solutions to stimulate and enhance user engagement have been achieved through a sustained and broad-based involvement of different cohorts of users. In particular, we propose the use of narratives to support and guide users within the collection and to present the main available tools to them. In moving beyond the specialized, search-based and stereotyped norm, the environment that we have contributed to developing offers a new approach for accessing and interacting with cultural heritage collections. It shows the value of an adaptive interface that dynamically responds to support the user, whatever his or her level of experience with digital environments or familiarity with the content may be.
      PubDate: 2018-06-11
      DOI: 10.1007/s00799-018-0245-y
  • Knowledge Organization Systems (KOS) in the Semantic Web: a multi-dimensional review
    • Abstract: Since the Simple Knowledge Organization System (SKOS) specification and its SKOS eXtension for Labels (SKOS-XL) became formal W3C recommendations in 2009, a significant number of conventional Knowledge Organization Systems (KOS) (including thesauri, classification schemes, name authorities, and lists of codes and terms, produced before the arrival of the ontology-wave) have made their journeys to join the Semantic Web mainstream. This paper uses “LOD KOS” as an umbrella term to refer to all of the value vocabularies and lightweight ontologies within the Semantic Web framework. The paper provides an overview of what the LOD KOS movement has brought to various communities and users. These are not limited to the colonies of the value vocabulary constructors and providers, nor the catalogers and indexers who have a long history of applying the vocabularies to their products. The LOD dataset producers and LOD service providers, the information architects and interface designers, and researchers in sciences and humanities, are also direct beneficiaries of LOD KOS. The paper examines a set of the collected cases (experimental or in real applications) and aims to find the usages of LOD KOS in order to share the practices and ideas among communities and users. Through the viewpoints of a number of different user groups, the functions of LOD KOS are examined from multiple dimensions. This paper focuses on the LOD dataset producers, vocabulary producers, and researchers (as end-users of KOS).
      PubDate: 2018-05-25
      DOI: 10.1007/s00799-018-0241-2
  • Neural ParsCit: a deep learning-based reference string parser
    • Authors: Animesh Prasad; Manpreet Kaur; Min-Yen Kan
      Abstract: We present a deep learning approach for the core digital libraries task of parsing bibliographic reference strings. We deploy the state-of-the-art long short-term memory (LSTM) neural network architecture, a variant of a recurrent neural network, to capture long-range dependencies in reference strings. We explore word embeddings and character-based word embeddings as an alternative to handcrafted features. We incrementally experiment with features, architectural configurations, and the diversity of the dataset. Our final model is an LSTM-based architecture, which layers a linear chain conditional random field (CRF) over the LSTM output. In extensive experiments in both English in-domain (computer science) and out-of-domain (humanities) test cases, as well as multilingual data, our results show a significant gain (\(p<0.01\)) over the reported state-of-the-art CRF-only-based parser.
      PubDate: 2018-05-19
      DOI: 10.1007/s00799-018-0242-1
  • Bias-aware news analysis using matrix-based news aggregation
    • Authors: Felix Hamborg; Norman Meuschke; Bela Gipp
      Abstract: Media bias describes differences in the content or presentation of news. It is a ubiquitous phenomenon in news coverage that can have severely negative effects on individuals and society. Identifying media bias is a challenging problem, for which current information systems offer little support. News aggregators are the most important class of systems to support users in coping with the large amount of news that is published nowadays. These systems focus on identifying and presenting important, common information in news articles, but do not reveal different perspectives on the same topic. Due to this analysis approach, current news aggregators cannot effectively reveal media bias. To address this problem, we present matrix-based news aggregation, a novel approach for news exploration that helps users gain a broad and diverse news understanding by presenting various perspectives on the same news topic. Additionally, we present NewsBird, an open-source news aggregator that implements matrix-based news aggregation for international news topics. The results of a user study showed that NewsBird more effectively broadens the user’s news understanding than the list-based visualization approach employed by established news aggregators, while achieving comparable effectiveness and efficiency for the two main use cases of news consumption: getting an overview of and finding details on current news topics.
      PubDate: 2018-05-18
      DOI: 10.1007/s00799-018-0239-9
  • Fusion architectures for automatic subject indexing under concept drift
    • Authors: Martin Toepfer; Christin Seifert
      Abstract: Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While stability of generative processes behind the underlying data is often assumed tacitly, it is violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on fusion of different indexing approaches with special consideration of information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study.
      PubDate: 2018-05-15
      DOI: 10.1007/s00799-018-0240-3
  • Ranking Dublin Core descriptor lists from user interactions: a case study with Dublin Core Terms using the Dendro platform
    • Authors: João Rocha da Silva; Cristina Ribeiro; João Correia Lopes
      Abstract: Dublin Core descriptors capture metadata in most repositories, and this includes recent repositories dedicated to datasets. DC descriptors are generic and are being adapted to the requirements of different communities with the so-called Dublin Core Application Profiles that rely on the agreement within user communities, taking into account their evolving needs. In this paper, we propose an automated process to help curators and users discover the descriptors that best suit the needs of a specific research group in the task of describing and depositing datasets. Our approach is built on Dendro, a prototype research data management platform, where an experimental method is used to rank and present DC Terms descriptors to the users based on their usage patterns. User interaction is recorded and used to score descriptors. In a controlled experiment, we gathered the interactions of two groups as they used Dendro to describe datasets from selected sources. One of the groups viewed descriptors according to the ranking, while the other had the same list of descriptors throughout the experiment. Preliminary results show that (1) some DC Terms are filled in more often than others, with different distributions in the two groups, (2) descriptors in higher ranks were increasingly accepted by users to the detriment of manual selection, (3) users were satisfied with the performance of the platform, and (4) the quality of description was not hindered by descriptor ranking.
      PubDate: 2018-04-26
      DOI: 10.1007/s00799-018-0238-x
  • Toward meaningful notions of similarity in NLP embedding models
    • Authors: Ábel Elekes; Adrian Englhardt; Martin Schäler; Klemens Böhm
      Abstract: Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, conducting two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we calculate how these thresholds, when taken into account during evaluation, change the evaluation scores of the models in similarity test sets. In more abstract terms, our insights give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
      PubDate: 2018-04-20
      DOI: 10.1007/s00799-018-0237-y
  • Capisco: low-cost concept-based access to digital libraries
    • Authors: Annika Hinze; David Bainbridge; Sally Jo Cunningham; Craig Taube-Schock; Rangi Matamua; J. Stephen Downie; Edie Rasmussen
      Abstract: In this article, we present the conceptual design and report on the implementation of Capisco—a low-cost approach to concept-based access to digital libraries. Capisco avoids the need for complete semantic document markup using ontologies by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system disambiguates the semantics of terms in the documents by their semantics and context and identifies the relevant CiC concepts. Supplementary to this, the disambiguation of search queries is done interactively, to fully utilize the domain knowledge of the scholar. For established digital library systems, completely replacing, or even making significant changes to the document retrieval mechanism (document analysis, indexing strategy, query processing, and query interface) would require major technological effort and would most likely be disruptive. In addition to presenting Capisco, we describe ways to harness the results of our developed semantic analysis and disambiguation, while retaining the existing keyword-based search and lexicographic index. We engineer this so the output of semantic analysis (performed off-line) is suitable for import directly into existing digital library metadata and index structures, and thus incorporated without the need for architecture modifications.
      PubDate: 2018-03-14
      DOI: 10.1007/s00799-018-0232-3
  • Content-based video retrieval in historical collections of the German Broadcasting Archive
    • Authors: Markus Mühling; Manja Meister; Nikolaus Korfhage; Jörg Wehling; Angelika Hörth; Ralph Ewerth; Bernd Freisleben
      Abstract: The German Broadcasting Archive maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material foster considerable scientific interest in the video content. In this paper, we present a system for automatic video content analysis and retrieval to facilitate search in historical collections of GDR television recordings. It relies on a distributed, service-oriented architecture and includes video analysis algorithms for shot boundary detection, concept classification, person recognition, text recognition and similarity search. The combination of different search modalities allows users to obtain answers for a wide range of queries, leading to satisfactory results in a short time. The performance of the system is evaluated using 2500 h of GDR television recordings.
      PubDate: 2018-03-08
      DOI: 10.1007/s00799-018-0236-z