International Journal on Digital Libraries
  [SJR: 0.375]   [H-I: 28]   [657 followers]
   Hybrid journal (may contain Open Access articles)
   ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
   Published by Springer-Verlag
  • Guest editors’ introduction to the special issue on web archiving
    • Authors: Edward A. Fox; Martin Klein; Zhiwu Xie
      Pages: 1 - 2
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0203-5
      Issue No: Vol. 19, No. 1 (2018)
  • Focused crawler for events
    • Authors: Mohamed M. G. Farag; Sunshin Lee; Edward A. Fox
      Pages: 3 - 19
      Abstract: There is a need for an Integrated Event Focused Crawling system to collect Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of event information. We propose intelligent event focused crawling for automatic event tracking and archiving, ultimately leading to effective access. We developed an event model that can capture key event information, and incorporated that model into a focused crawling algorithm. For the focused crawler to leverage the event model in predicting webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model-based representation outperformed the baseline method (topic-only); it showed better results in precision, recall, and F1-score with an improvement of 20% in F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art baseline focused crawler (best-first); it showed better results in harvest ratio with an average improvement of 40%.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0207-1
      Issue No: Vol. 19, No. 1 (2018)
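The abstract above mentions a function measuring the similarity between two event representations but does not give its form. A minimal sketch, assuming a hypothetical event model of topic terms, location names, and a date; the weights and the 30-day decay window are illustrative, not the paper's:

```python
from datetime import date

def event_similarity(e1, e2, w_topic=0.5, w_loc=0.3, w_date=0.2):
    """Hypothetical similarity between two event representations.

    Each event is a dict with 'topic' (set of terms), 'location'
    (set of place names), and 'date' (a datetime.date).
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    topic_sim = jaccard(e1["topic"], e2["topic"])
    loc_sim = jaccard(e1["location"], e2["location"])
    # Date proximity decays linearly over a 30-day window.
    days = abs((e1["date"] - e2["date"]).days)
    date_sim = max(0.0, 1 - days / 30)
    return w_topic * topic_sim + w_loc * loc_sim + w_date * date_sim

event = {"topic": {"shooting", "victims", "police"},
         "location": {"san", "bernardino", "california"},
         "date": date(2015, 12, 2)}
candidate = {"topic": {"shooting", "police", "suspects"},
             "location": {"california"},
             "date": date(2015, 12, 3)}
score = event_similarity(event, candidate)
```

A crawler using such a function would compare each fetched page's extracted event representation against the seed event and follow only links from pages whose score clears a relevance threshold.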
  • API-based social media collecting as a form of web archiving
    • Authors: Justin Littman; Daniel Chudnov; Daniel Kerchner; Christie Peterson; Yecheng Tan; Rachel Trent; Rajat Vij; Laura Wrubel
      Pages: 21 - 38
      Abstract: Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application that assists scholars in collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0201-7
      Issue No: Vol. 19, No. 1 (2018)
  • ArchiveWeb: collaboratively extending and exploring web archive
           collections—How would you like to work with your collections?
    • Authors: Zeon Trevor Fernando; Ivana Marenzi; Wolfgang Nejdl
      Pages: 39 - 55
      Abstract: Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring, and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0206-2
      Issue No: Vol. 19, No. 1 (2018)
  • Quantifying retrieval bias in Web archive search
    • Authors: Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries
      Pages: 57 - 75
      Abstract: A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. 
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-017-0215-9
      Issue No: Vol. 19, No. 1 (2018)
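Retrievability scoring of the kind this abstract quantifies can be sketched as follows. The flat rank cutoff and the Gini coefficient as a bias summary follow the common formulation in the retrievability literature; they are not necessarily this paper's exact setup:

```python
def retrievability(run_lists, cutoff=10):
    """r(d): how often document d appears in the top-`cutoff` results
    across a set of query runs (flat rank-cutoff scoring)."""
    scores = {}
    for ranked_docs in run_lists:
        for d in ranked_docs[:cutoff]:
            scores[d] = scores.get(d, 0) + 1
    return scores

def gini(values):
    """Gini coefficient of the retrievability distribution:
    0 = all documents equally retrievable, 1 = maximally biased."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n

collection = ["d1", "d2", "d3"]
runs = [["d1", "d2", "d3"], ["d1", "d2"], ["d1"]]  # ranked results per query
r = retrievability(runs, cutoff=2)
bias = gini([r.get(d, 0) for d in collection])  # zeros for never-retrieved docs
```

Collapsing near-duplicate versions of a document before scoring, as the paper does by content similarity or by URL, changes which "documents" accumulate retrievability mass and hence the measured bias.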
  • Avoiding spoilers: wiki time travel with Sheldon Cooper
    • Authors: Shawn M. Jones; Michael L. Nelson; Herbert Van de Sompel
      Pages: 77 - 93
      Abstract: A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if fans are behind in their viewing, they run the risk of encountering “spoilers”—information that gives away key plot points before the time intended by the show’s writers. Because the wiki history is indexed by revisions, finding specific dates can be tedious, especially for pages with hundreds or thousands of edits. A wiki’s history interface does not permit browsing across historic pages without visiting current ones, thus revealing spoilers in the current page. Enterprising fans can resort to web archives and navigate there across wiki pages that were live prior to a specific episode date. In this paper, we explore the use of Memento with the Internet Archive as a means of avoiding spoilers in fan wikis. We conduct two experiments: one to determine the probability of encountering a spoiler when using Memento with the Internet Archive for a given wiki page, and a second to determine which date prior to an episode to choose when trying to avoid spoilers for that specific episode. Our results indicate that the Internet Archive is not safe for avoiding spoilers, and therefore we highlight the inherent capability of fan wikis to address the spoiler problem internally using existing, off-the-shelf technology. We use the spoiler use case to define and analyze different ways of discovering the best past version of a resource to avoid spoilers. We propose Memento as a structural solution to the problem, distinguishing it from prior content-based solutions to the spoiler problem. This research promotes the idea that content management systems can benefit from exposing their version information in the standardized Memento way used by other archives. We support the idea that there are use cases for which specific prior versions of web resources are invaluable.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0200-8
      Issue No: Vol. 19, No. 1 (2018)
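Memento-style time travel, as used above, is content negotiation in the datetime dimension (RFC 7089): the client sends an Accept-Datetime header to a TimeGate, which redirects to the memento closest to that datetime. A minimal sketch of building such a request; picking a safe datetime just before an episode's airdate is the user's strategy, not part of the protocol:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_headers(before):
    """Headers for a TimeGate request asking for the memento closest
    to `before` (a timezone-aware UTC datetime). The original resource
    URI (URI-R) goes into the request line; Accept-Datetime must be in
    RFC 1123 format per RFC 7089."""
    return {"Accept-Datetime": format_datetime(before, usegmt=True)}

# e.g. to see a wiki page as it stood before an episode aired on 22 Sep:
safe_date = datetime(2015, 9, 21, tzinfo=timezone.utc)
headers = memento_headers(safe_date)
# These headers would accompany a GET to a TimeGate, such as the
# Internet Archive's https://web.archive.org/web/<URI-R>.
```

The paper's point is that the closest archived snapshot may still postdate the episode, so a wiki exposing its own revision history via Memento is the safer TimeGate.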
  • The colors of the national Web: visual data analysis of the historical
           Yugoslav Web domain
    • Authors: Anat Ben-David; Adam Amram; Ron Bekkerman
      Pages: 95 - 106
      Abstract: This study examines the use of visual data analytics as a method for historical investigation of national Webs, using Web archives. It empirically analyzes all graphically designed (non-photographic) images extracted from Websites hosted in the historical .yu domain and archived by the Internet Archive between 1997 and 2000, to assess the utility and value of visual data analytics as a measure of nationality of a Web domain. First, we report that only 23.5% of Websites hosted in the .yu domain over the studied years had their graphically designed images properly archived. Second, we detect significant differences between the color palettes of .yu sub-domains (commercial, organizational, academic, and governmental), as well as between Montenegrin and Serbian Websites. Third, we show that the similarity of the domains’ colors to the colors of the Yugoslav national flag decreases over time. However, there are spikes in the use of Yugoslav national colors that correlate with major developments on the Kosovo frontier.
      PubDate: 2018-03-01
      DOI: 10.1007/s00799-016-0202-6
      Issue No: Vol. 19, No. 1 (2018)
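The flag-similarity measurement described above can be illustrated with a toy computation. The RGB values and the distance threshold below are illustrative assumptions, not the study's actual parameters:

```python
import math

# Approximate pan-Slavic flag colors (illustrative RGB values).
FLAG_COLORS = {"blue": (0, 56, 147), "white": (255, 255, 255), "red": (222, 0, 0)}

def flag_similarity(palette, threshold=60):
    """Fraction of palette colors lying within Euclidean `threshold`
    of some flag color in RGB space: a crude proxy for how 'national'
    a site's color scheme is."""
    if not palette:
        return 0.0
    near = sum(1 for c in palette
               if min(math.dist(c, f) for f in FLAG_COLORS.values()) <= threshold)
    return near / len(palette)

# e.g. a palette extracted from one site's graphically designed images:
palette = [(250, 250, 250), (210, 10, 10), (90, 160, 60)]
share = flag_similarity(palette)  # off-white and dark red match; green does not
```

Tracking such a score per year would surface the temporal pattern the study reports: a decline interrupted by spikes around major events.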
  • Fusion architectures for automatic subject indexing under concept drift
    • Authors: Martin Toepfer; Christin Seifert
      Abstract: Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While the stability of the generative processes behind the underlying data is often tacitly assumed, this assumption is frequently violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on fusion of different indexing approaches with special consideration on information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study.
      PubDate: 2018-05-15
      DOI: 10.1007/s00799-018-0240-3
  • Ranking Dublin Core descriptor lists from user interactions: a case study
           with Dublin Core Terms using the Dendro platform
    • Authors: João Rocha da Silva; Cristina Ribeiro; João Correia Lopes
      Abstract: Dublin Core descriptors capture metadata in most repositories, and this includes recent repositories dedicated to datasets. DC descriptors are generic and are being adapted to the requirements of different communities with the so-called Dublin Core Application Profiles that rely on the agreement within user communities, taking into account their evolving needs. In this paper, we propose an automated process to help curators and users discover the descriptors that best suit the needs of a specific research group in the task of describing and depositing datasets. Our approach is built on Dendro, a prototype research data management platform, where an experimental method is used to rank and present DC Terms descriptors to the users based on their usage patterns. User interaction is recorded and used to score descriptors. In a controlled experiment, we gathered the interactions of two groups as they used Dendro to describe datasets from selected sources. One of the groups viewed descriptors according to the ranking, while the other had the same list of descriptors throughout the experiment. Preliminary results show that (1) some DC Terms are filled in more often than others, with different distribution in the two groups, (2) descriptors in higher ranks were increasingly accepted by users in preference to manual selection, (3) users were satisfied with the performance of the platform, and (4) the quality of description was not hindered by descriptor ranking.
      PubDate: 2018-04-26
      DOI: 10.1007/s00799-018-0238-x
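Interaction-based descriptor scoring of the kind described above can be sketched as follows; the action names and weights are invented for illustration and are not Dendro's actual scheme:

```python
from collections import Counter

def rank_descriptors(interactions, accept_weight=2, fill_weight=1):
    """Toy scoring: each recorded interaction with a DC Terms descriptor
    bumps its score (accepting a suggested descriptor counts more than
    merely filling it in); descriptors are then presented to the next
    user in descending score order."""
    scores = Counter()
    for descriptor, action in interactions:
        scores[descriptor] += accept_weight if action == "accept" else fill_weight
    return [d for d, _ in scores.most_common()]

log = [("dcterms:title", "fill"), ("dcterms:creator", "fill"),
       ("dcterms:title", "accept"), ("dcterms:abstract", "fill"),
       ("dcterms:title", "fill")]
ranking = rank_descriptors(log)  # dcterms:title rises to the top
```

Re-ranking after each session is what lets frequently useful descriptors surface for a given research group, which is the behavior the experiment's ranked group was shown.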
  • Toward meaningful notions of similarity in NLP embedding models
    • Authors: Ábel Elekes; Adrian Englhardt; Martin Schäler; Klemens Böhm
      Abstract: Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, conducting two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we calculate how these thresholds, when taken into account during evaluation, change the evaluation scores of the models in similarity test sets. In more abstract terms, our insights pave the way for a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
      PubDate: 2018-04-20
      DOI: 10.1007/s00799-018-0237-y
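One way to make a model-specific "meaningful" threshold concrete, in the spirit of the abstract: estimate the distribution of cosine similarities between randomly chosen word pairs, and treat only scores clearing a high percentile as unlikely to arise by chance. A sketch with toy random vectors standing in for a trained model; the 95th-percentile choice is illustrative:

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def similarity_threshold(vectors, n_pairs=10_000, percentile=0.95, seed=0):
    """Similarity value that randomly chosen (distinct) word pairs
    rarely exceed in this embedding space; higher scores are unlikely
    to be chance."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)
        sims.append(cosine(vectors[i], vectors[j]))
    sims.sort()
    return sims[int(percentile * (n_pairs - 1))]

# toy stand-in for an embedding matrix: 200 random 50-d vectors
rng = random.Random(1)
vecs = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(200)]
threshold = similarity_threshold(vecs)
```

Because the distribution depends on the model's algorithm and parameters, the threshold must be recomputed per model, which is the paper's core point against universal "intuitive" cutoffs.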
  • Capisco: low-cost concept-based access to digital libraries
    • Authors: Annika Hinze; David Bainbridge; Sally Jo Cunningham; Craig Taube-Schock; Rangi Matamua; J. Stephen Downie; Edie Rasmussen
      Abstract: In this article, we present the conceptual design and report on the implementation of Capisco—a low-cost approach to concept-based access to digital libraries. Capisco avoids the need for complete semantic document markup using ontologies by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system disambiguates terms in the documents by their semantics and context and identifies the relevant CiC concepts. Supplementary to this, the disambiguation of search queries is done interactively, to fully utilize the domain knowledge of the scholar. For established digital library systems, completely replacing, or even making significant changes to the document retrieval mechanism (document analysis, indexing strategy, query processing, and query interface) would require major technological effort and would most likely be disruptive. In addition to presenting Capisco, we describe ways to harness the results of our developed semantic analysis and disambiguation, while retaining the existing keyword-based search and lexicographic index. We engineer this so the output of semantic analysis (performed off-line) is suitable for import directly into existing digital library metadata and index structures, and thus incorporated without the need for architecture modifications.
      PubDate: 2018-03-14
      DOI: 10.1007/s00799-018-0232-3
  • Content-based video retrieval in historical collections of the German
           Broadcasting Archive
    • Authors: Markus Mühling; Manja Meister; Nikolaus Korfhage; Jörg Wehling; Angelika Hörth; Ralph Ewerth; Bernd Freisleben
      Abstract: The German Broadcasting Archive maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material fosters a large scientific interest in the video content. In this paper, we present a system for automatic video content analysis and retrieval to facilitate search in historical collections of GDR television recordings. It relies on a distributed, service-oriented architecture and includes video analysis algorithms for shot boundary detection, concept classification, person recognition, text recognition and similarity search. The combination of different search modalities allows users to obtain answers for a wide range of queries, leading to satisfactory results in a short time. The performance of the system is evaluated using 2500 h of GDR television recordings.
      PubDate: 2018-03-08
      DOI: 10.1007/s00799-018-0236-z
  • Time-focused analysis of connectivity and popularity of historical persons
           in Wikipedia
    • Authors: Adam Jatowt; Daisuke Kawai; Katsumi Tanaka
      Abstract: Wikipedia contains large amounts of content related to history. It is being used extensively for many knowledge-intensive tasks within computer science, digital humanities and related fields. In this paper, we study Wikipedia articles on historical people, examining the link-related temporal features of these articles. Our study sheds new light on the characteristics of information about historical people recorded in the English Wikipedia and quantifies user interest in such data. We propose a novel style of analysis in which we use signals derived from the hyperlink structure of Wikipedia as well as from article view logs, and we overlay them on the temporal dimension to understand relations between time periods, link structure and article popularity. In the latter part of the paper, we also demonstrate several ways for estimating person importance based on the temporal aspects of the link structure as well as a method for ranking cities using the computed importance scores of their related persons.
      PubDate: 2018-02-08
      DOI: 10.1007/s00799-018-0231-4
  • Crowdsourcing Linked Data on listening experiences through reuse and
           enhancement of library data
    • Authors: Alessandro Adamou; Simon Brown; Helen Barlow; Carlo Allocca; Mathieu d’Aquin
      Abstract: Research has approached the practice of musical reception in a multitude of ways, such as the analysis of professional critique, sales figures and psychological processes activated by the act of listening. Studies in the Humanities, on the other hand, have been hindered by the lack of structured evidence of actual experiences of listening as reported by the listeners themselves, a concern that has been voiced since the early Web era. It was however assumed that such evidence existed, albeit in pure textual form, but could not be leveraged until it was digitised and aggregated. The Listening Experience Database (LED) responds to this research need by providing a centralised hub for evidence of listening in the literature. Not only does LED support search and reuse across nearly 10,000 records, but it also provides machine-readable structured data of the knowledge around the contexts of listening. To take advantage of the mass of formal knowledge that already exists on the Web concerning these contexts, the entire framework adopts Linked Data principles and technologies. This also allows LED to directly reuse open data from the British Library for the source documentation that is already published. Reused data are re-published as open data with enhancements obtained by expanding over the model of the original data, such as the partitioning of published books and collections into individual stand-alone documents. The database was populated through crowdsourcing and seamlessly incorporates data reuse from the very early data entry phases. As the sources of the evidence often contain vague, fragmentary, or uncertain information, facilities were put in place to generate structured data out of such fuzziness.
Alongside elaborating on these functionalities, this article provides insights into the most recent features of the latest instalment of the dataset and portal, such as the interlinking with the MusicBrainz database, the relaxation of geographical input constraints through text mining, and the plotting of key locations in an interactive geographical browser.
      PubDate: 2018-02-06
      DOI: 10.1007/s00799-018-0235-0
  • Comparing published scientific journal articles to their pre-print versions
    • Authors: Martin Klein; Peter Broadwell; Sharon E. Farb; Todd Grappone
      Abstract: Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: US academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. We have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers from two distinct science, technology, and medicine corpora and their final published counterparts. This comparison had two working assumptions: (1) If the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and (2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.
      PubDate: 2018-02-05
      DOI: 10.1007/s00799-018-0234-1
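A pre-print versus published-version comparison of the kind described can be sketched with a standard sequence-similarity measure; difflib here is a stand-in for the study's own set of similarity measures:

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Symmetric similarity in [0, 1] between two document bodies,
    computed over word sequences so that punctuation and line-break
    differences matter less than actual wording changes."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

# Toy texts standing in for a pre-print body and its published body:
preprint = "We measure the similarity of preprint and published texts"
published = "We measure the similarity of pre-print and final published texts"
score = text_similarity(preprint, published)  # close to 1: small edits only
```

Scores near 1.0 across a corpus would be the empirical signal the study reports: the published text differs only marginally from its pre-print.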
  • Benchmarking and evaluating the interpretation of bibliographic records
    • Authors: Trond Aalberg; Fabien Duchateau; Naimdjon Takhirov; Joffrey Decourselle; Nicolas Lumineau
      Abstract: In a global context which promotes the use of explicit semantics for sharing information and developing new services, the MAchine Readable Cataloguing (MARC) format that is commonly used by libraries worldwide has demonstrated its many limitations. The conceptual reference model for bibliographic information presented in the Functional Requirements for Bibliographic Records (FRBR) is expected to be the foundation for a new generation of catalogs that will replace MARC and the digital card catalog. The need for transformation of legacy MARC records to FRBR representation (FRBRization) has led to the proposal of various tools and approaches. However, these projects and the results they achieve are difficult to compare due to the lack of common datasets and of well-defined, appropriate metrics. Our contributions fill this gap by proposing BIB-R, the first public benchmark for the FRBRization process. It is composed of two datasets that enable the identification of the strengths and weaknesses of a FRBRization tool. It also defines a set of well-defined metrics that evaluate the different steps of the FRBRization process. Those resources, as well as the results of a large experiment involving three FRBRization tools tested against our benchmark, are available to the community under an open licence.
      PubDate: 2018-01-30
      DOI: 10.1007/s00799-018-0233-2
  • Extending, mapping, and focusing the CIDOC CRM
    • Authors: Franco Niccolucci
      Pages: 251 - 252
      PubDate: 2017-11-01
      DOI: 10.1007/s00799-016-0198-y
      Issue No: Vol. 18, No. 4 (2017)
  • Harmonizing the CRMba and CRMarchaeo models
    • Authors: Paola Ronzino
      Pages: 253 - 261
      Abstract: This work presents the initial thoughts towards the harmonization of the CRMba and CRMarchaeo models, two extensions of the CIDOC CRM, the former developed to model the complexity of a built structure from the perspective of buildings archaeology, while the latter was developed to model the processes involved in the investigation of subsurface archaeological deposits. The paper describes the modelling principles of CRMba and CRMarchaeo, and identifies common concepts that will allow the two ontological models to be merged.
      PubDate: 2017-11-01
      DOI: 10.1007/s00799-016-0193-3
      Issue No: Vol. 18, No. 4 (2017)
  • Scripta manent: a CIDOC CRM semiotic reading of ancient texts
    • Authors: Achille Felicetti; Francesca Murano
      Pages: 263 - 270
      Abstract: This paper identifies the most important concepts involved in the study of ancient texts and proposes the use of CIDOC CRM to encode them and to model the related scientific process of investigation, in order to foster integration with other cultural heritage research fields. After identifying the key concepts, assessing the available technologies and analysing the entities provided by CIDOC CRM and by its extensions, we introduce more specific classes to be used as the basis for creating a new extension, CRMtex, which is more responsive to the specific needs of the various disciplines involved (including papyrology, palaeography, codicology and epigraphy).
      PubDate: 2017-11-01
      DOI: 10.1007/s00799-016-0189-z
      Issue No: Vol. 18, No. 4 (2017)
  • Introduction to the special issue on bibliometric-enhanced information
           retrieval and natural language processing for digital libraries (BIRNDL)
    • Authors: Philipp Mayr; Ingo Frommholz; Guillaume Cabanac; Muthu Kumar Chandrasekaran; Kokil Jaidka; Min-Yen Kan; Dietmar Wolfram
      Abstract: The large scale of scholarly publications poses a challenge for scholars in information seeking and sensemaking. Bibliometric, information retrieval (IR), text mining, and natural language processing techniques can help address this challenge, but have yet to be widely used in digital libraries (DL). This special issue on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL) was compiled after the first joint BIRNDL workshop that was held at the joint conference on digital libraries (JCDL 2016) in Newark, New Jersey, USA. It brought together IR and DL researchers and professionals to elaborate on new approaches in natural language processing, information retrieval, scientometric, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. This special issue includes 14 papers: four extended papers originating from the first BIRNDL workshop 2016 and the BIR workshop at ECIR 2016, four extended system reports of the CL-SciSumm Shared Task 2016 and six original research papers submitted via the open call for papers.
      PubDate: 2017-11-09
      DOI: 10.1007/s00799-017-0230-x