International Journal on Digital Libraries
  [SJR: 0.375]   [H-I: 28]   [578 followers]
   Hybrid journal (can contain Open Access articles)
   ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
   Published by Springer-Verlag  [2341 journals]
  • Guest editors’ introduction to the special issue on knowledge maps and
           information retrieval (KMIR)
    • Authors: Peter Mutschke; Andrea Scharnhorst; Nicholas J. Belkin; André Skupin; Philipp Mayr
      Pages: 1 - 3
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0204-4
      Issue No: Vol. 18, No. 1 (2017)
  • Font attributes enrich knowledge maps and information retrieval
    • Authors: Richard Brath; Ebad Banissi
      Pages: 5 - 24
      Abstract: Typography is overlooked in knowledge maps (KM) and information retrieval (IR), and some deficiencies in these systems can potentially be improved by encoding information into font attributes. A review of font use across domains is used to itemize font attributes and information visualization theory is used to characterize each attribute. Tasks associated with KM and IR, such as skimming, opinion analysis, character analysis, topic modelling and sentiment analysis can be aided through the use of novel representations using font attributes such as skim formatting, proportional encoding, textual stem and leaf plots and multi-attribute labels.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0168-4
      Issue No: Vol. 18, No. 1 (2017)
  • Mapping metadata to DDC classification structures for searching and
            browsing
    • Authors: Xia Lin; Michael Khoo; Jae-Wook Ahn; Doug Tudhope; Ceri Binding; Diana Massam; Hilary Jones
      Pages: 25 - 39
      Abstract: In this paper, we introduce a metadata visual interface based on metadata aggregation and automatic classification mapping. We demonstrate that it is possible to aggregate metadata records from multiple unrelated repositories, enhance them through automatic classification, and present them in a unified visual interface. The main features of the interface include dynamic querying using DDC classes as filters, interactive visual views of search results and related DDC classes, and drill-down options for searching and browsing at different levels of detail. The interface was tested in a user study with 30 subjects. A comparison was made of three modules of the interface, namely the ‘search interface’, the ‘hierarchical interface’, and the ‘visual interface’. The results indicate that subjects performed well with all three interfaces, and that they had a more positive experience with the hierarchical interface than with the search interface and the visual interface.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0197-z
      Issue No: Vol. 18, No. 1 (2017)
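Since DDC notation is hierarchical and decimal, dynamic filtering by DDC class of the kind the abstract describes can be reduced to a prefix match on class numbers. The sketch below is a minimal illustration with hypothetical records, not the interface's actual implementation:

```python
# Sketch: using DDC's hierarchical notation for dynamic query filters.
# DDC class numbers are decimal and hierarchical, so selecting a class as
# a filter reduces to a prefix match on the notation. Records are toy data.

def filter_by_ddc(records, ddc_class):
    """Keep records whose DDC notation falls under ddc_class."""
    return [r for r in records if r["ddc"].startswith(ddc_class)]

records = [
    {"title": "Intro to databases",  "ddc": "005.74"},
    {"title": "Library cataloguing", "ddc": "025.3"},
    {"title": "Web programming",     "ddc": "005.2"},
]

# Drilling down from the broad class 005 ...
print([r["title"] for r in filter_by_ddc(records, "005")])
# ... to the narrower class 005.7.
print([r["title"] for r in filter_by_ddc(records, "005.7")])
```

Drill-down then amounts to lengthening the prefix, and rolling up to shortening it.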
  • Creating knowledge maps using Memory Island
    • Authors: Bin Yang; Jean-Gabriel Ganascia
      Pages: 41 - 57
      Abstract: Knowledge maps are useful tools, now beginning to be widely applied to the management and sharing of large-scale hierarchical knowledge. In this paper, we discuss how knowledge maps can be generated using Memory Island. Memory Island is our cartographic visualization technique, which was inspired by the ancient “Art of Memory”. It consists of automatically creating a spatial cartographic representation of a given knowledge hierarchy (e.g., an ontology). With the help of its interactive functions, users can navigate through an artificial landscape to learn about the knowledge and retrieve information from it. We also present some preliminary results of representing different knowledge hierarchies to show how the knowledge maps created by our technique work.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0196-0
      Issue No: Vol. 18, No. 1 (2017)
  • Supporting academic search tasks through citation visualization and
            interaction
    • Authors: Taraneh Khazaei; Orland Hoeber
      Pages: 59 - 72
      Abstract: Despite ongoing advances in information retrieval algorithms, people continue to experience difficulties when conducting online searches within digital libraries. Because their information-seeking goals are often complex, searchers may experience difficulty in precisely describing what they are seeking. Current search interfaces provide limited support for navigating and exploring among the search results and helping searchers to more accurately describe what they are looking for. In this paper, we present a novel visual library search interface, designed with the goal of providing interactive support for common library search tasks and behaviours. This system takes advantage of the rich metadata available in academic collections and employs information visualization techniques to support search results evaluation, forward and backward citation exploration, and interactive query refinement.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0170-x
      Issue No: Vol. 18, No. 1 (2017)
  • Quantifying retrieval bias in Web archive search
    • Authors: Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries
      Abstract: A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
      PubDate: 2017-04-18
      DOI: 10.1007/s00799-017-0215-9
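The retrievability score at the heart of this analysis can be sketched briefly. In the standard formulation, a document's score counts how often it appears within a rank cutoff over a large query set, and the skew of the score distribution (for example, its Gini coefficient) quantifies bias. The toy ranked lists below are hypothetical:

```python
# Sketch of the retrievability measure the abstract builds on: a document's
# retrievability r(d) counts how often it appears in the top-c results over
# a query set; the skew of the r(d) distribution (e.g., its Gini
# coefficient) quantifies retrieval bias. The ranked lists are toy data.

def retrievability(ranked_lists, cutoff=2):
    """r(d) = number of queries for which d is ranked within `cutoff`."""
    scores = {}
    for ranking in ranked_lists:
        for doc in ranking[:cutoff]:
            scores[doc] = scores.get(doc, 0) + 1
    return scores

def gini(values):
    """Gini coefficient of non-negative scores (0 = perfectly even)."""
    vals = sorted(values)
    n = len(vals)
    total = sum(vals)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n

rankings = [["d1", "d2", "d3"], ["d1", "d3", "d2"], ["d1", "d2", "d3"]]
r = retrievability(rankings, cutoff=2)
print(r)  # d1 is always within the cutoff, d3 rarely
print(round(gini([r.get(d, 0) for d in ["d1", "d2", "d3"]]), 3))
```

A flatter score distribution (lower Gini) means the system gives documents a more equal chance of being retrieved.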
  • Automatic summarization of scientific publications using a feature
           selection approach
    • Authors: Hazem Al Saied; Nicolas Dugué; Jean-Charles Lamirel
      Abstract: Feature Maximization is a feature selection method that deals efficiently with textual data and makes it possible to design systems that are language-agnostic, parameter-free, and require no additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first experiment with this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that despite the unstructured nature of the corpus, our system is above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of the SciSumm challenge, which is dedicated to research publication summarization. These publications are structured in sections and our class-based approach is thus relevant. We more specifically focus on the task that aims to summarize papers using those that refer to them. We consider and evaluate several systems based on our approach, each dealing with specific bags of words. Furthermore, in these systems, we also evaluate cosine and graph-based distances for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, improving on the best results known so far and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.
      PubDate: 2017-04-13
      DOI: 10.1007/s00799-017-0214-x
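The selection metric itself can be sketched compactly. In Lamirel's published formulation, as I understand it, a feature's F-measure for a class is the harmonic mean of its "recall" (the share of the feature's corpus-wide weight falling in that class) and its "predominance" (the share of the class's total weight carried by the feature); features scoring above average are kept. The counts below are toy data, not from the paper:

```python
# Sketch of the Feature Maximization metric (after Lamirel): the feature
# F-measure FF(c, f) is the harmonic mean of feature recall FR (share of
# f's total weight in class c) and feature predominance FP (share of c's
# weight carried by f). The class/feature counts below are toy data.

def feature_f_measure(classes):
    """classes: {class: {feature: weight}} -> {(class, feature): FF}."""
    feat_total = {}
    for feats in classes.values():
        for f, w in feats.items():
            feat_total[f] = feat_total.get(f, 0) + w
    ff = {}
    for c, feats in classes.items():
        class_total = sum(feats.values())
        for f, w in feats.items():
            fr = w / feat_total[f]   # feature recall
            fp = w / class_total     # feature predominance
            ff[(c, f)] = 2 * fr * fp / (fr + fp)
    return ff

classes = {"intro":   {"method": 1, "goal": 4},
           "results": {"method": 5, "score": 4}}
ff = feature_f_measure(classes)
# "method" characterizes "results" far more than it does "intro"
print(round(ff[("results", "method")], 3), round(ff[("intro", "method")], 3))
```

Because FR and FP are both ratios of weights, the metric needs no tuned thresholds, which is what makes the method parameter-free.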
  • Reuse and plagiarism in Speech and Natural Language Processing
    • Authors: Joseph Mariani; Gil Francopoulo; Patrick Paroubek
      Abstract: The aim of this experiment is to present an easy way to compare fragments of texts in order to detect (supposed) results of copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space of the comparisons is a corpus labeled NLP4NLP, which includes 34 different conferences and journals and gathers a large part of the NLP activity over the past 50 years. This study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four different types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper in the corpus (which we will call the source paper), or, in the reverse direction, fragments of text from the source paper being borrowed and inserted into another paper in the corpus. The results show that self-reuse is a rather common practice, but that plagiarism seems to be very unusual, and that both stay within legal and ethical limits.
      PubDate: 2017-03-21
      DOI: 10.1007/s00799-017-0211-0
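Fragment comparison of this kind is commonly implemented with word n-gram "shingles": fragments sharing long word sequences are flagged as reuse candidates. The window size and the Jaccard measure below are illustrative choices, not necessarily those of the study:

```python
# Sketch of copy-and-paste detection via word n-gram "shingles". Two
# fragments that share many word 4-grams are likely reuse candidates.
# The window size and any threshold are illustrative assumptions.

def shingles(text, n=4):
    """Set of word n-grams of a text fragment."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=4):
    """Jaccard overlap of word n-grams between two fragments."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

source = "we describe a corpus of fifty years of nlp publications"
reused = "we describe a corpus of fifty years of speech papers"
print(round(overlap(source, reused), 2))
```

In practice a pair would be flagged only above some tuned overlap threshold, with short or boilerplate fragments filtered out first.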
  • The references of references: a method to enrich humanities library
           catalogs with citation data
    • Authors: Giovanni Colavizza; Matteo Romanello; Frédéric Kaplan
      Abstract: The advent of large-scale citation indexes has greatly impacted the retrieval of scientific information in several domains of research. The humanities have largely remained outside of this shift, despite their increasing reliance on digital means for information seeking. Given that publications in the humanities have a longer than average life-span, mainly due to the importance of monographs for the field, this article proposes to use domain-specific reference monographs to bootstrap the enrichment of library catalogs with citation data. Reference monographs are works considered to be of particular importance in a research library setting, and likely to possess characteristic citation patterns. The article shows how to select a corpus of reference monographs, and proposes a pipeline to extract the network of publications they refer to. Results using a set of reference monographs in the domain of the history of Venice show that only 7% of extracted citations are made to publications already within the initial seed. Furthermore, the resulting citation network suggests the presence of a core set of works in the domain, cited more frequently than average.
      PubDate: 2017-03-08
      DOI: 10.1007/s00799-017-0210-1
  • Task-oriented search for evidence-based medicine
    • Authors: Bevan Koopman; Jack Russell; Guido Zuccon
      Abstract: Research on how clinicians search shows that they pose queries according to three common clinical tasks: searching for diagnoses, searching for treatments and searching for tests. We hypothesise, therefore, that structuring an information retrieval system around these three tasks would be beneficial when searching for evidence-based medicine (EBM) resources in medical digital libraries. Task-oriented (diagnosis, test and treatment) information was extracted from free-text medical articles using a natural language processing pipeline. This information was integrated into a retrieval and visualisation system for EBM search that allowed searchers to interact with the system via task-oriented filters. The effectiveness of the system was empirically evaluated using TREC CDS—a gold standard of medical articles and queries designed for EBM search. Task-oriented information was successfully extracted from 733,138 articles taken from a medical digital library. Task-oriented search led to improvements in the quality of search results and savings in searcher workload. An analysis of how different tasks affected retrieval showed that searching for treatments was the most challenging, and that the task-oriented approach improved search for treatments. The greatest workload savings were observed when searching for treatments and tests. Overall, taking into account different clinical tasks can improve search according to these tasks. Each task displayed different results, making systems that are more adaptive to the clinical task type desirable. A future user study would help quantify the actual cost-saving estimates.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-017-0209-7
  • Tape music archives: from preservation to access
    • Authors: Carlo Fantozzi; Federica Bressan; Niccolò Pretto; Sergio Canazza
      Abstract: This article presents a methodology for the active preservation of, and access to, magnetic tapes in audio archives. The methodology has been defined and implemented by a multidisciplinary team involving engineers as well as musicians, composers and archivists. The strong point of the methodology is the philological awareness that influenced the development of the digital tools, which take into account the questions critical to the historian's and musicologist's approach: the secondary information and the history of transmission of an audio document.
      PubDate: 2017-02-28
      DOI: 10.1007/s00799-017-0208-8
  • ArchiveWeb: collaboratively extending and exploring web archive
           collections—How would you like to work with your collections?
    • Authors: Zeon Trevor Fernando; Ivana Marenzi; Wolfgang Nejdl
      Abstract: Abstract Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring, and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
      PubDate: 2017-01-18
      DOI: 10.1007/s00799-016-0206-2
  • Focused crawler for events
    • Authors: Mohamed M. G. Farag; Sunshin Lee; Edward A. Fox
      Abstract: There is a need for an Integrated Event Focused Crawling system to collect Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of event information. We propose intelligent event focused crawling for automatic event tracking and archiving, ultimately leading to effective access. We developed an event model that can capture key event information, and incorporated that model into a focused crawling algorithm. For the focused crawler to leverage the event model in predicting webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model-based representation outperformed the baseline method (topic-only); it showed better results in precision, recall, and F1-score, with an improvement of 20% in F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art baseline focused crawler (best-first); it showed better results in harvest ratio, with an average improvement of 40%.
      PubDate: 2017-01-07
      DOI: 10.1007/s00799-016-0207-1
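A similarity function between an event representation and a webpage, as the abstract describes, might be sketched as a weighted combination of facet similarities. The facets (topic, location, date), the term-overlap measure, and the weights below are illustrative assumptions, not the authors' actual model:

```python
# Sketch of scoring webpage relevance against an event model. The facet
# structure (topic terms, location terms, date), the cosine measure, and
# the weights are assumptions for illustration only.

def cosine(a, b):
    """Cosine similarity between two bag-of-words dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def event_similarity(event, page, weights=(0.5, 0.3, 0.2)):
    """Weighted similarity over topic, location and date facets."""
    wt, wl, wd = weights
    return (wt * cosine(event["topic"], page["topic"])
            + wl * cosine(event["location"], page["location"])
            + wd * (1.0 if event["date"] == page["date"] else 0.0))

event = {"topic": {"attack": 2, "airport": 1},
         "location": {"brussels": 1}, "date": "2016-03-22"}
page = {"topic": {"attack": 1, "airport": 1, "metro": 1},
        "location": {"brussels": 1}, "date": "2016-03-22"}
score = event_similarity(event, page)
print(round(score, 3))  # the crawler would follow links above a threshold
```

A focused crawler would compute such a score for each fetched page and prioritize the frontier accordingly, which is what distinguishes it from a topic-only best-first baseline.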
  • Applications of RISM data in digital libraries and digital musicology
    • Authors: Klaus Keil; Jennifer A. Ward
      Abstract: Abstract Information about manuscripts and printed music indexed in RISM (Répertoire International des Sources Musicales), a large, international project that records and describes musical sources, was for decades available solely through book publications, CD-ROMs, or subscription services. Recent initiatives to make the data available on a wider scale have resulted in, most significantly, a freely accessible online database and the availability of its data as open data and linked open data. Previously, the task of increasing the amount of data was primarily carried out by RISM national groups and the Zentralredaktion (Central Office). The current opportunities available by linking to other freely accessible databases and importing data from other resources open new perspectives and prospects. This paper describes the RISM data and their applications for digital libraries and digital musicological projects. We discuss the possibilities and challenges in making available a large and growing quantity of data and how the data have been utilized in external library and musicological projects. Interactive functions in the RISM OPAC are planned for the future, as is closer collaboration with the projects that use RISM data. Ultimately, RISM would like to arrange a “take and give” system in which the RISM data are used in external projects, enhanced by the project participants, and then delivered back to the RISM Zentralredaktion.
      PubDate: 2017-01-06
      DOI: 10.1007/s00799-016-0205-3
  • Research-paper recommender systems: a literature survey
    • Authors: Joeran Beel; Bela Gipp; Stefan Langer; Corinna Breitinger
      Pages: 305 - 338
      Abstract: In the last 16 years, more than 200 research articles were published about research-paper recommender systems. We reviewed these articles and present some descriptive statistics in this paper, as well as a discussion about the major advancements and shortcomings and an overview of the most common recommendation concepts and approaches. We found that more than half of the recommendation approaches applied content-based filtering (55%). Collaborative filtering was applied by only 18% of the reviewed approaches, and graph-based recommendations by 16%. Other recommendation concepts included stereotyping, item-centric recommendations, and hybrid recommendations. The content-based filtering approaches mainly utilized papers that the users had authored, tagged, browsed, or downloaded. TF-IDF was the most frequently applied weighting scheme. In addition to simple terms, n-grams, topics, and citations were utilized to model users’ information needs. Our review revealed some shortcomings of the current research. First, it remains unclear which recommendation concepts and approaches are the most promising. For instance, researchers reported different results on the performance of content-based and collaborative filtering. Sometimes content-based filtering performed better than collaborative filtering and sometimes it performed worse. We identified three potential reasons for the ambiguity of the results. (A) Several evaluations had limitations. They were based on strongly pruned datasets, few participants in user studies, or did not use appropriate baselines. (B) Some authors provided little information about their algorithms, which makes it difficult to re-implement the approaches. Consequently, researchers use different implementations of the same recommendation approaches, which might lead to variations in the results. (C) We speculated that minor variations in datasets, algorithms, or user populations inevitably lead to strong variations in the performance of the approaches. Hence, finding the most promising approaches is a challenge. As a second limitation, we noted that many authors neglected to take into account factors other than accuracy, for example overall user satisfaction. In addition, most approaches (81%) neglected the user-modeling process and did not infer information automatically but let users provide keywords, text snippets, or a single paper as input. Information on runtime was provided for 10% of the approaches. Finally, few research papers had an impact on research-paper recommender systems in practice. We also identified a lack of authority and long-term research interest in the field: 73% of the authors published no more than one paper on research-paper recommender systems, and there was little cooperation among different co-author groups. We concluded that several actions could improve the research landscape: developing a common evaluation framework, agreement on the information to include in research papers, a stronger focus on non-accuracy aspects and user modeling, a platform for researchers to exchange information, and an open-source framework that bundles the available recommendation approaches.
      PubDate: 2016-11-01
      DOI: 10.1007/s00799-015-0156-0
      Issue No: Vol. 17, No. 4 (2016)
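The survey's most common configuration, content-based filtering with TF-IDF term weighting and cosine similarity between a user's papers and candidate papers, can be sketched in a few lines. The toy corpus is hypothetical:

```python
# A minimal sketch of content-based filtering with TF-IDF weights, the
# dominant approach the survey reports. Titles are toy data; real systems
# would tokenize properly, remove stop words, and index at scale.
import math

def tfidf_vectors(docs):
    """TF-IDF weight per term per document (raw tf, idf = log(N/df))."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc.split():
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = ["recommender systems survey",
          "citation based recommender approaches",
          "deep parsing of queries"]
vecs = tfidf_vectors(corpus)
# Recommend for a user whose profile is document 0, ranking the rest.
scores = sorted(((cosine(vecs[0], v), i) for i, v in enumerate(vecs[1:], 1)),
                reverse=True)
print(scores)
```

Note how the shared term "recommender" is down-weighted by its idf but still drives the ranking, which is exactly the behaviour TF-IDF is chosen for.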
  • Location-triggered mobile access to a digital library of audio books using
            Tipple
    • Authors: Annika Hinze; David Bainbridge
      Pages: 339 - 365
      Abstract: This paper explores the role of audio as a means to access books while being at locations referred to within the books, through a mobile app, called Tipple. The books are sourced from a digital library—either self-contained on the mobile phone, or else over the network—and can either be accompanied by pre-recorded audio or synthesized using text-to-speech. The paper details the functional requirements, design and implementation of Tipple. The developed concept was explored and evaluated through three field studies.
      PubDate: 2016-11-01
      DOI: 10.1007/s00799-015-0165-z
      Issue No: Vol. 17, No. 4 (2016)
  • API-based social media collecting as a form of web archiving
    • Authors: Justin Littman; Daniel Chudnov; Daniel Kerchner; Christie Peterson; Yecheng Tan; Rachel Trent; Rajat Vij; Laura Wrubel
      Abstract: Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.
      PubDate: 2016-12-28
      DOI: 10.1007/s00799-016-0201-7
  • Avoiding spoilers: wiki time travel with Sheldon Cooper
    • Authors: Shawn M. Jones; Michael L. Nelson; Herbert Van de Sompel
      Abstract: A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if fans are behind in their viewing they run the risk of encountering “spoilers”—information that gives away key plot points before the time intended by the show’s writers. Because the wiki history is indexed by revisions, finding specific dates can be tedious, especially for pages with hundreds or thousands of edits. A wiki’s history interface does not permit browsing across historic pages without visiting current ones, thus revealing spoilers in the current page. Enterprising fans can resort to web archives and navigate there across wiki pages that were live prior to a specific episode date. In this paper, we explore the use of Memento with the Internet Archive as a means of avoiding spoilers in fan wikis. We conduct two experiments: one to determine the probability of encountering a spoiler when using Memento with the Internet Archive for a given wiki page, and a second to determine which date prior to an episode to choose when trying to avoid spoilers for that specific episode. Our results indicate that the Internet Archive is not safe for avoiding spoilers, and therefore we highlight the inherent capability of fan wikis to address the spoiler problem internally using existing, off-the-shelf technology. We use the spoiler use case to define and analyze different ways of discovering the best past version of a resource to avoid spoilers. We propose Memento as a structural solution to the problem, distinguishing it from prior content-based solutions to the spoiler problem. This research promotes the idea that content management systems can benefit from exposing their version information in the standardized Memento way used by other archives. We support the idea that there are use cases for which specific prior versions of web resources are invaluable.
      PubDate: 2016-12-21
      DOI: 10.1007/s00799-016-0200-8
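The core selection step the paper analyses, picking the best past version of a page given an episode's air date, can be sketched as follows. The snapshot timestamps (in the 14-digit form used by many web archives) are hypothetical:

```python
# Sketch of "which past version to pick": given archived snapshots
# (mementos) of a wiki page, choose the most recent one captured strictly
# before the episode's air date, so the page cannot yet describe that
# episode. Timestamps are hypothetical, in the common 14-digit archive form.
from datetime import datetime

def best_memento(memento_dates, air_date):
    """Latest snapshot strictly before air_date, or None if there is none."""
    fmt = "%Y%m%d%H%M%S"
    earlier = [d for d in memento_dates
               if datetime.strptime(d, fmt) < air_date]
    # Fixed-width timestamps sort chronologically as strings.
    return max(earlier, default=None)

snapshots = ["20130301120000", "20130910090000", "20140102110000"]
air = datetime(2013, 9, 26)  # episode air date
print(best_memento(snapshots, air))  # → 20130910090000
```

As the paper notes, this only guarantees spoiler safety if the archive's timestamps are trustworthy and no snapshot leaked future plot details early.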
  • The colors of the national Web: visual data analysis of the historical
           Yugoslav Web domain
    • Authors: Anat Ben-David; Adam Amram; Ron Bekkerman
      Abstract: This study examines the use of visual data analytics as a method for historical investigation of national Webs, using Web archives. It empirically analyzes all graphically designed (non-photographic) images extracted from Websites hosted in the historical .yu domain and archived by the Internet Archive between 1997 and 2000, to assess the utility and value of visual data analytics as a measure of nationality of a Web domain. First, we report that only 23.5% of Websites hosted in the .yu domain over the studied years had their graphically designed images properly archived. Second, we detect significant differences between the color palettes of .yu sub-domains (commercial, organizational, academic, and governmental), as well as between Montenegrin and Serbian Websites. Third, we show that the similarity of the domains’ colors to the colors of the Yugoslav national flag decreases over time. However, there are spikes in the use of Yugoslav national colors that correlate with major developments on the Kosovo frontier.
      PubDate: 2016-12-18
      DOI: 10.1007/s00799-016-0202-6
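One measurable notion of "similarity to the national flag" is the distance from each palette color to the nearest flag color in RGB space. The sketch below illustrates that idea with made-up palettes and flag values; it is not the paper's actual method:

```python
# Sketch: closeness of a site's color palette to a set of flag colors,
# via Euclidean distance in RGB space. Flag RGB values and the two site
# palettes are illustrative assumptions.

FLAG = {"blue": (0, 0, 170), "white": (255, 255, 255), "red": (222, 0, 0)}

def distance(c1, c2):
    """Euclidean distance between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def flag_similarity(palette, max_dist=441.673):
    """Mean closeness (1 = exact flag color) of palette colors to the flag.

    max_dist is the RGB-space diagonal, sqrt(3 * 255**2)."""
    closeness = []
    for color in palette:
        nearest = min(distance(color, f) for f in FLAG.values())
        closeness.append(1 - nearest / max_dist)
    return sum(closeness) / len(closeness)

site_1998 = [(10, 10, 160), (250, 250, 250)]  # near flag blue and white
site_2000 = [(30, 120, 30), (200, 180, 40)]   # greens/yellows, far from flag
print(round(flag_similarity(site_1998), 3), round(flag_similarity(site_2000), 3))
```

Tracking such a score per site per crawl year would yield exactly the kind of declining trend line, with spikes, that the abstract reports.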
  • Documenting archaeological science with CIDOC CRM
    • Authors: Franco Niccolucci
      Abstract: The paper proposes to use CIDOC CRM and its extensions CRMsci and CRMdig to document the scientific experiments involved in archaeological investigations. The nature of such experiments is analysed and ways to document their important aspects are provided using existing classes and properties from the CRM or from the above-mentioned schemas, together with newly defined ones, forming an extension of the CRM called CRMas.
      PubDate: 2016-11-30
      DOI: 10.1007/s00799-016-0199-x