Journal Cover International Journal on Digital Libraries
  [SJR: 0.375]   [H-I: 28]   [583 followers]  Follow
   Hybrid Journal Hybrid journal (It can contain Open Access articles)
   ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
   Published by Springer-Verlag Homepage  [2329 journals]
  • On research data publishing
    • Authors: Leonardo Candela; Donatella Castelli; Paolo Manghi; Sarah Callaghan
      Pages: 73 - 75
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-017-0213-y
      Issue No: Vol. 18, No. 2 (2017)
  • Key components of data publishing: using current best practices to develop
           a reference model for data publishing
    • Authors: Claire C. Austin; Theodora Bloom; Sünje Dallmeier-Tiessen; Varsha K. Khodiyar; Fiona Murphy; Amy Nurnberger; Lisa Raymond; Martina Stockhause; Jonathan Tedds; Mary Vardigan; Angus Whyte
      Pages: 77 - 92
      Abstract: The availability of workflows for data publishing could have an enormous impact on researchers, research practices and publishing paradigms, as well as on funding strategies and career and research evaluations. We present the generic components of such workflows to provide a reference model for these stakeholders. The RDA-WDS Data Publishing Workflows group set out to study the current data-publishing workflow landscape across disciplines and institutions. A diverse set of workflows were examined to identify common components and standard practices, including basic self-publishing services, institutional data repositories, long-term projects, curated data repositories, and joint data journal and repository arrangements. The results of this examination have been used to derive a data-publishing reference model comprising generic components. From an assessment of the current data-publishing landscape, we highlight important gaps and challenges to consider, especially when dealing with more complex workflows and their integration into wider community frameworks. It is clear that the data-publishing landscape is varied and dynamic and that there are important gaps and challenges. The different components of a data-publishing system need to work, to the greatest extent possible, in a seamless and integrated way to support the evolution of commonly understood and utilized standards and—eventually—to increased reproducibility. We therefore advocate the implementation of existing standards for repositories and all parts of the data-publishing process, and the development of new standards where necessary. Effective and trustworthy data publishing should be embedded in documented workflows. As more research communities seek to publish the data associated with their research, they can build on one or more of the components identified in this reference model.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0178-2
      Issue No: Vol. 18, No. 2 (2017)
  • Automating data sharing through authoring tools
    • Authors: John R. Kitchin; Ana E. Van Gulick; Lisa D. Zilinski
      Pages: 93 - 98
      Abstract: In the current scientific publishing landscape, there is a need for an authoring workflow that easily integrates data and code into manuscripts and that enables the data and code to be published in reusable form. Automated embedding of data and code into published output will enable superior communication and data archiving. In this work, we demonstrate a proof of concept for a workflow, org-mode, which successfully provides this authoring capability and workflow integration. We illustrate this concept in a series of examples for potential uses of this workflow. First, we use data on citation counts to compute the h-index of an author, and show two code examples for calculating the h-index. The source for each example is automatically embedded in the PDF during the export of the document. We demonstrate how data can be embedded in image files, which themselves are embedded in the document. Finally, metadata about the embedded files can be automatically included in the exported PDF, and accessed by computer programs. In our customized export, we embedded metadata about the attached files in the PDF in an Info field. A computer program could parse this output to get a list of embedded files and carry out analyses on them. Authoring tools such as Emacs + org-mode can greatly facilitate the integration of data and code into technical writing. These tools can also automate the embedding of data into document formats intended for consumption.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0173-7
      Issue No: Vol. 18, No. 2 (2017)
  • Experiences in integrated data and research object publishing using GigaDB
    • Authors: Scott C Edmunds; Peter Li; Christopher I Hunter; Si Zhe Xiao; Robert L Davidson; Nicole Nogoy; Laurie Goodman
      Pages: 99 - 111
      Abstract: In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit-for-purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery. The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research. GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources. Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including as urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis. Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles have been archived and shared in reproducible, transparent and usable formats. With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications. Here data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure in a more democratized manner.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0174-6
      Issue No: Vol. 18, No. 2 (2017)
  • Advancing research data publishing practices for the social sciences: from
           archive activity to empowering researchers
    • Authors: Veerle Van den Eynden; Louise Corti
      Pages: 113 - 121
      Abstract: Sharing and publishing social science research data have a long history in the UK, through long-standing agreements with government agencies for sharing survey data and the data policy, infrastructure, and data services supported by the Economic and Social Research Council. The UK Data Service and its predecessors developed data management, documentation, and publishing procedures and protocols that stand today as robust templates for data publishing. As the ESRC research data policy requires grant holders to submit their research data to the UK Data Service after a grant ends, setting standards and promoting them has been essential in raising the quality of the resulting research data being published. In the past, received data were all processed, documented, and published for reuse in-house. Recent investments have focused on guiding and training researchers in good data management practices and skills for creating shareable data, as well as a self-publishing repository system, ReShare. ReShare also receives data sets described in published data papers and achieves scientific quality assurance through peer review of submitted data sets before publication. Social science data are reused for research, to inform policy, in teaching and for methods learning. Over a 10 years period, responsive developments in system workflows, access control options, persistent identifiers, templates, and checks, together with targeted guidance for researchers, have helped raise the standard of self-publishing social science data. Lessons learned and developments in shifting publishing social science data from an archivist responsibility to a researcher process are showcased, as inspiration for institutions setting up a data repository.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0177-3
      Issue No: Vol. 18, No. 2 (2017)
  • Meeting the challenge of environmental data publication: an operational
           infrastructure and workflow for publishing data
    • Authors: Daniel G. Wright; Philip Trembath; Kathryn A. Harrison
      Pages: 123 - 132
      Abstract: Here we describe the defined workflow and its supporting infrastructure, which are used by the Natural Environment Research Council’s (NERC) Environmental Information Data Centre (EIDC) ( to enable publication of environmental data in the fields of ecology and hydrology. The methods employed and issues discussed are also relevant to publication in other domains. By utilising a clearly defined workflow for data publication, we operate a fully auditable, quality controlled series of steps permitting publication of environmental data. The described methodology meets the needs of both data producers and data users, whose requirements are not always aligned. A stable, logically created infrastructure supporting data publication allows the process to occur in a well-managed and secure fashion, while remaining flexible enough to deal with a range of data types and user requirements. We discuss the primary issues arising from data publication, and describe how many of them have been resolved by the methods we have employed, with demonstrable results. In conclusion, we expand on future directions we wish to develop to aid data publication by both solving problems for data generators and improving the end-user experience.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0176-4
      Issue No: Vol. 18, No. 2 (2017)
  • Implementation of a workflow for publishing citeable environmental data:
           successes, challenges and opportunities from a data centre perspective
    • Authors: Kathryn A. Harrison; Daniel G. Wright; Philip Trembath
      Pages: 133 - 143
      Abstract: In recent years, the development and implementation of a robust way to cite data have encouraged many previously sceptical environmental researchers to publish the data they create, thus ensuring that more data than ever are now open and available for re-use within and between research communities. Here, we describe a workflow for publishing citeable data in the context of the environmental sciences—an area spanning many domains and generating a vast array of heterogeneous data products. The processes and tools we have developed have enabled rapid publication of quality data products including datasets, models and model outputs which can be accessed, re-used and subsequently cited. However, there are still many challenges that need to be addressed before researchers in the environmental sciences fully accept the notion that datasets are valued outputs and time should be spent in properly describing, storing and citing them. Here, we identify current challenges such as citation of dynamic datasets and issues of recording and presenting citation metrics. In conclusion, whilst data centres may have the infrastructure, tools, resources and processes available to publish citeable datasets, further work is required before large-scale uptake of the services offered is achieved. We believe that once current challenges are met, data resources will be viewed similarly to journal publications as valued outputs in a researcher’s portfolio, and therefore both the quality and quantity of data published will increase.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0175-5
      Issue No: Vol. 18, No. 2 (2017)
  • Semantic representation and enrichment of information retrieval
           experimental data
    • Authors: Gianmaria Silvello; Georgeta Bordea; Nicola Ferro; Paul Buitelaar; Toine Bogers
      Pages: 145 - 172
      Abstract: Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of information retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a resource description framework model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as linked open data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles. In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0172-8
      Issue No: Vol. 18, No. 2 (2017)
  • Guest editors’ introduction to the special issue on knowledge maps and
           information retrieval (KMIR)
    • Authors: Peter Mutschke; Andrea Scharnhorst; Nicholas J. Belkin; André Skupin; Philipp Mayr
      Pages: 1 - 3
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0204-4
      Issue No: Vol. 18, No. 1 (2017)
  • Font attributes enrich knowledge maps and information retrieval
    • Authors: Richard Brath; Ebad Banissi
      Pages: 5 - 24
      Abstract: Typography is overlooked in knowledge maps (KM) and information retrieval (IR), and some deficiencies in these systems can potentially be improved by encoding information into font attributes. A review of font use across domains is used to itemize font attributes and information visualization theory is used to characterize each attribute. Tasks associated with KM and IR, such as skimming, opinion analysis, character analysis, topic modelling and sentiment analysis can be aided through the use of novel representations using font attributes such as skim formatting, proportional encoding, textual stem and leaf plots and multi-attribute labels.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0168-4
      Issue No: Vol. 18, No. 1 (2017)
  • Creating knowledge maps using Memory Island
    • Authors: Bin Yang; Jean-Gabriel Ganascia
      Pages: 41 - 57
      Abstract: Knowledge maps are useful tools, now beginning to be widely applied to the management and sharing of large-scale hierarchical knowledge. In this paper, we discuss how knowledge maps can be generated using Memory Island. Memory Island is our cartographic visualization technique, which was inspired by the ancient “Art of Memory”. It consists of automatically creating the spatial cartographic representation of a given hierarchical knowledge (e.g., ontology). With the help of its interactive functions, users can navigate through an artificial landscape, to learn and retrieve information from the knowledge. We also present some preliminary results of representing different hierarchical knowledge to show how the knowledge maps created by our technique work.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0196-0
      Issue No: Vol. 18, No. 1 (2017)
  • Supporting academic search tasks through citation visualization and
    • Authors: Taraneh Khazaei; Orland Hoeber
      Pages: 59 - 72
      Abstract: Despite ongoing advances in information retrieval algorithms, people continue to experience difficulties when conducting online searches within digital libraries. Because their information-seeking goals are often complex, searchers may experience difficulty in precisely describing what they are seeking. Current search interfaces provide limited support for navigating and exploring among the search results and helping searchers to more accurately describe what they are looking for. In this paper, we present a novel visual library search interface, designed with the goal of providing interactive support for common library search tasks and behaviours. This system takes advantage of the rich metadata available in academic collections and employs information visualization techniques to support search results evaluation, forward and backward citation exploration, and interactive query refinement.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-016-0170-x
      Issue No: Vol. 18, No. 1 (2017)
  • Bag of works retrieval: TF*IDF weighting of works co-cited with a seed
    • Authors: Howard D. White
      Abstract: Although not presently possible in any system, the style of retrieval described here combines familiar components—co-citation linkages of documents and TF*IDF weighting of terms—in a way that could be implemented in future databases. Rather than entering keywords, the user enters a string identifying a work—a seed—to retrieve the strings identifying other works that are co-cited with it. Each of the latter is part of a “bag of works,” and it presumably has both a co-citation count with the seed and an overall citation count in the database. These two counts can be plugged into a standard formula for TF*IDF weighting such that all the co-cited items can be ranked for relevance to the seed, given that the entire retrieval is relevant to it by evidence from multiple co-citing authors. The result is analogous to, but different from, traditional “bag of words” retrieval, which it supplements. Some properties of the ranking are illustrated by works co-cited with three seeds: an article on search behavior, an information retrieval textbook, and an article on centrality in networks. While these are case studies, their properties apply to bag of works retrievals in general and have implications for users (e.g., humanities scholars, domain analysts) that go beyond any one example.
      PubDate: 2017-05-19
      DOI: 10.1007/s00799-017-0217-7
  • Section mixture models for scientific document summarization
    • Authors: John M. Conroy; Sashka T. Davis
      Abstract: In this paper, we present a system for summarization of scientific and structured documents that has three components: section mixture models are used for estimation of the weights of terms; a hypothesis test to select a subset of these terms; and a sentence extractor based on techniques for combinatorial optimization. The section mixture models approach is an adaptation of a bigram mixture model based on the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work done on Biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the Biomedical data was used also on the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model gives rise to machine-generated summaries with ROUGE scores that are nearly as strong as those seen on the Biomedical data and was also the highest scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012) which takes the sentences from the original document and the assignment of weights of the terms computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore the importance of an appropriate background model for the hypothesis test to select terms to achieve the best quality summaries.
      PubDate: 2017-05-17
      DOI: 10.1007/s00799-017-0218-6
  • Scientific document summarization via citation contextualization and
           scientific discourse
    • Authors: Arman Cohan; Nazli Goharian
      Abstract: The rapid growth of scientific literature has made it difficult for the researchers to quickly learn about the developments in their respective fields. Scientific summarization addresses this challenge by providing summaries of the important contributions of scientific papers. We present a framework for scientific summarization which takes advantage of the citations and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are even sometimes inaccurate. We first address the problem of inaccuracy of the citation texts by finding the relevant context from the cited paper. We propose three approaches for contextualizing citations which are based on query reformulation, word embeddings, and supervised learning. We then train a model to identify the discourse facets for each citation. We finally propose a method for summarizing scientific papers by leveraging the faceted citations and their corresponding contexts. We evaluate our proposed method on two scientific summarization datasets in the biomedical and computational linguistics domains. Extensive evaluation results show that our methods can improve over the state of the art by large margins.
      PubDate: 2017-05-09
      DOI: 10.1007/s00799-017-0216-8
  • Quantifying retrieval bias in Web archive search
    • Authors: Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries
      Abstract: A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
      PubDate: 2017-04-18
      DOI: 10.1007/s00799-017-0215-9
  • Automatic summarization of scientific publications using a feature
           selection approach
    • Authors: Hazem Al Saied; Nicolas Dugué; Jean-Charles Lamirel
      Abstract: Feature Maximization is a feature selection method that deals efficiently with textual data: to design systems that are altogether language-agnostic, parameter-free and do not require additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first experiment this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that despite the unstructured nature of the corpus, our system is above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of SciSumm challenge, which is dedicated to research publications summarization. These publications are structured in sections and our class-based approach is thus relevant. We more specifically focus on the task that aims to summarize papers using those that refer to them. We consider and evaluate several systems using our approach dealing with specific bag of words. Furthermore, in these systems, we also evaluate cosine and graph-based distance for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than the known results so far, and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.
      PubDate: 2017-04-13
      DOI: 10.1007/s00799-017-0214-x
  • Reuse and plagiarism in Speech and Natural Language Processing
    • Authors: Joseph Mariani; Gil Francopoulo; Patrick Paroubek
      Abstract: The aim of this experiment is to present an easy way to compare fragments of texts in order to detect (supposed) results of copy and paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space of the comparisons is a corpus labeled as NLP4NLP, which includes 34 different conferences and journals and gathers a large part of the NLP activity over the past 50 years. This study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four different types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (that we will call the source paper), or in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is rather a common practice, but that plagiarism seems to be very unusual, and that both stay within legal and ethical limits.
      PubDate: 2017-03-21
      DOI: 10.1007/s00799-017-0211-0
  • The references of references: a method to enrich humanities library
           catalogs with citation data
    • Authors: Giovanni Colavizza; Matteo Romanello; Frédéric Kaplan
      Abstract: The advent of large-scale citation indexes has greatly impacted the retrieval of scientific information in several domains of research. The humanities have largely remained outside of this shift, despite their increasing reliance on digital means for information seeking. Given that publications in the humanities have a longer than average life-span, mainly due to the importance of monographs for the field, this article proposes to use domain-specific reference monographs to bootstrap the enrichment of library catalogs with citation data. Reference monographs are works considered to be of particular importance in a research library setting, and likely to possess characteristic citation patterns. The article shows how to select a corpus of reference monographs, and proposes a pipeline to extract the network of publications they refer to. Results using a set of reference monographs in the domain of the history of Venice show that only 7% of extracted citations are made to publications already within the initial seed. Furthermore, the resulting citation network suggests the presence of a core set of works in the domain, cited more frequently than average.
      PubDate: 2017-03-08
      DOI: 10.1007/s00799-017-0210-1
  • Task-oriented search for evidence-based medicine
    • Authors: Bevan Koopman; Jack Russell; Guido Zuccon
      Abstract: Research on how clinicians search shows that they pose queries according to three common clinical tasks: searching for diagnoses, searching for treatments and searching for tests. We hypothesise, therefore, that structuring an information retrieval system around these three tasks would be beneficial when searching for evidence-based medicine (EBM) resources in medical digital libraries. Task-oriented (diagnosis, test and treatment) information was extracted from free-text medical articles using a natural language processing pipeline. This information was integrated into a retrieval and visualisation system for EBM search that allowed searchers to interact with the system via task-oriented filters. The effectiveness of the system was empirically evaluated using TREC CDS—a gold standard of medical articles and queries designed for EBM search. Task-oriented information was successfully extracted from 733,138 articles taken from a medical digital library. Task-oriented search led to improvements in the quality of search results and savings in searcher workload. An analysis of how different tasks affected retrieval showed that searching for treatments was the most challenging and that the task-oriented approach improved search for treatments. The most savings in terms of workload were observed when searching for treatments and tests. Overall, taking into account different clinical tasks can improve search according to these tasks. Each task displayed different results, making systems that are more adaptive to the clinical task type desirable. A future user study would help quantify the actual cost-saving estimates.
      PubDate: 2017-03-01
      DOI: 10.1007/s00799-017-0209-7
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
Home (Search)
Subjects A-Z
Publishers A-Z
Your IP address:
About JournalTOCs
News (blog, publications)
JournalTOCs on Twitter   JournalTOCs on Facebook

JournalTOCs © 2009-2016