for Journals by Title or ISSN
for Articles by Keywords
help
Followed Journals
Journal you Follow: 0
 
Sign Up to follow journals, search in your chosen journals and, optionally, receive Email Alerts when new issues of your Followed Journals are published.
Already have an account? Sign In to see the journals you follow.
Journal Cover International Journal on Digital Libraries
  [SJR: 0.375]   [H-I: 28]   [603 followers]  Follow
    
   Hybrid Journal Hybrid journal (It can contain Open Access articles)
   ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
   Published by Springer-Verlag Homepage  [2353 journals]
  • On research data publishing
    • Authors: Leonardo Candela; Donatella Castelli; Paolo Manghi; Sarah Callaghan
      Pages: 73 - 75
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-017-0213-y
      Issue No: Vol. 18, No. 2 (2017)
       
  • Key components of data publishing: using current best practices to develop
           a reference model for data publishing
    • Authors: Claire C. Austin; Theodora Bloom; Sünje Dallmeier-Tiessen; Varsha K. Khodiyar; Fiona Murphy; Amy Nurnberger; Lisa Raymond; Martina Stockhause; Jonathan Tedds; Mary Vardigan; Angus Whyte
      Pages: 77 - 92
      Abstract: Abstract The availability of workflows for data publishing could have an enormous impact on researchers, research practices and publishing paradigms, as well as on funding strategies and career and research evaluations. We present the generic components of such workflows to provide a reference model for these stakeholders. The RDA-WDS Data Publishing Workflows group set out to study the current data-publishing workflow landscape across disciplines and institutions. A diverse set of workflows were examined to identify common components and standard practices, including basic self-publishing services, institutional data repositories, long-term projects, curated data repositories, and joint data journal and repository arrangements. The results of this examination have been used to derive a data-publishing reference model comprising generic components. From an assessment of the current data-publishing landscape, we highlight important gaps and challenges to consider, especially when dealing with more complex workflows and their integration into wider community frameworks. It is clear that the data-publishing landscape is varied and dynamic and that there are important gaps and challenges. The different components of a data-publishing system need to work, to the greatest extent possible, in a seamless and integrated way to support the evolution of commonly understood and utilized standards and—eventually—to increased reproducibility. We therefore advocate the implementation of existing standards for repositories and all parts of the data-publishing process, and the development of new standards where necessary. Effective and trustworthy data publishing should be embedded in documented workflows. As more research communities seek to publish the data associated with their research, they can build on one or more of the components identified in this reference model.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0178-2
      Issue No: Vol. 18, No. 2 (2017)
       
  • Automating data sharing through authoring tools
    • Authors: John R. Kitchin; Ana E. Van Gulick; Lisa D. Zilinski
      Pages: 93 - 98
      Abstract: Abstract In the current scientific publishing landscape, there is a need for an authoring workflow that easily integrates data and code into manuscripts and that enables the data and code to be published in reusable form. Automated embedding of data and code into published output will enable superior communication and data archiving. In this work, we demonstrate a proof of concept for a workflow, org-mode, which successfully provides this authoring capability and workflow integration. We illustrate this concept in a series of examples for potential uses of this workflow. First, we use data on citation counts to compute the h-index of an author, and show two code examples for calculating the h-index. The source for each example is automatically embedded in the PDF during the export of the document. We demonstrate how data can be embedded in image files, which themselves are embedded in the document. Finally, metadata about the embedded files can be automatically included in the exported PDF, and accessed by computer programs. In our customized export, we embedded metadata about the attached files in the PDF in an Info field. A computer program could parse this output to get a list of embedded files and carry out analyses on them. Authoring tools such as Emacs + org-mode can greatly facilitate the integration of data and code into technical writing. These tools can also automate the embedding of data into document formats intended for consumption.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0173-7
      Issue No: Vol. 18, No. 2 (2017)
       
  • Experiences in integrated data and research object publishing using GigaDB
    • Authors: Scott C Edmunds; Peter Li; Christopher I Hunter; Si Zhe Xiao; Robert L Davidson; Nicole Nogoy; Laurie Goodman
      Pages: 99 - 111
      Abstract: Abstract In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit-for-purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery. The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research. GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources. Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including as urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis. Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles have been archived and shared in reproducible, transparent and usable formats. With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications. Here data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure in a more democratized manner.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0174-6
      Issue No: Vol. 18, No. 2 (2017)
       
  • Advancing research data publishing practices for the social sciences: from
           archive activity to empowering researchers
    • Authors: Veerle Van den Eynden; Louise Corti
      Pages: 113 - 121
      Abstract: Abstract Sharing and publishing social science research data have a long history in the UK, through long-standing agreements with government agencies for sharing survey data and the data policy, infrastructure, and data services supported by the Economic and Social Research Council. The UK Data Service and its predecessors developed data management, documentation, and publishing procedures and protocols that stand today as robust templates for data publishing. As the ESRC research data policy requires grant holders to submit their research data to the UK Data Service after a grant ends, setting standards and promoting them has been essential in raising the quality of the resulting research data being published. In the past, received data were all processed, documented, and published for reuse in-house. Recent investments have focused on guiding and training researchers in good data management practices and skills for creating shareable data, as well as a self-publishing repository system, ReShare. ReShare also receives data sets described in published data papers and achieves scientific quality assurance through peer review of submitted data sets before publication. Social science data are reused for research, to inform policy, in teaching and for methods learning. Over a 10 years period, responsive developments in system workflows, access control options, persistent identifiers, templates, and checks, together with targeted guidance for researchers, have helped raise the standard of self-publishing social science data. Lessons learned and developments in shifting publishing social science data from an archivist responsibility to a researcher process are showcased, as inspiration for institutions setting up a data repository.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0177-3
      Issue No: Vol. 18, No. 2 (2017)
       
  • Meeting the challenge of environmental data publication: an operational
           infrastructure and workflow for publishing data
    • Authors: Daniel G. Wright; Philip Trembath; Kathryn A. Harrison
      Pages: 123 - 132
      Abstract: Abstract Here we describe the defined workflow and its supporting infrastructure, which are used by the Natural Environment Research Council’s (NERC) Environmental Information Data Centre (EIDC) (http://eidc.ceh.ac.uk/) to enable publication of environmental data in the fields of ecology and hydrology. The methods employed and issues discussed are also relevant to publication in other domains. By utilising a clearly defined workflow for data publication, we operate a fully auditable, quality controlled series of steps permitting publication of environmental data. The described methodology meets the needs of both data producers and data users, whose requirements are not always aligned. A stable, logically created infrastructure supporting data publication allows the process to occur in a well-managed and secure fashion, while remaining flexible enough to deal with a range of data types and user requirements. We discuss the primary issues arising from data publication, and describe how many of them have been resolved by the methods we have employed, with demonstrable results. In conclusion, we expand on future directions we wish to develop to aid data publication by both solving problems for data generators and improving the end-user experience.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0176-4
      Issue No: Vol. 18, No. 2 (2017)
       
  • Implementation of a workflow for publishing citeable environmental data:
           successes, challenges and opportunities from a data centre perspective
    • Authors: Kathryn A. Harrison; Daniel G. Wright; Philip Trembath
      Pages: 133 - 143
      Abstract: Abstract In recent years, the development and implementation of a robust way to cite data have encouraged many previously sceptical environmental researchers to publish the data they create, thus ensuring that more data than ever are now open and available for re-use within and between research communities. Here, we describe a workflow for publishing citeable data in the context of the environmental sciences—an area spanning many domains and generating a vast array of heterogeneous data products. The processes and tools we have developed have enabled rapid publication of quality data products including datasets, models and model outputs which can be accessed, re-used and subsequently cited. However, there are still many challenges that need to be addressed before researchers in the environmental sciences fully accept the notion that datasets are valued outputs and time should be spent in properly describing, storing and citing them. Here, we identify current challenges such as citation of dynamic datasets and issues of recording and presenting citation metrics. In conclusion, whilst data centres may have the infrastructure, tools, resources and processes available to publish citeable datasets, further work is required before large-scale uptake of the services offered is achieved. We believe that once current challenges are met, data resources will be viewed similarly to journal publications as valued outputs in a researcher’s portfolio, and therefore both the quality and quantity of data published will increase.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0175-5
      Issue No: Vol. 18, No. 2 (2017)
       
  • Semantic representation and enrichment of information retrieval
           experimental data
    • Authors: Gianmaria Silvello; Georgeta Bordea; Nicola Ferro; Paul Buitelaar; Toine Bogers
      Pages: 145 - 172
      Abstract: Abstract Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of information retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a resource description framework model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as linked open data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles. In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.
      PubDate: 2017-06-01
      DOI: 10.1007/s00799-016-0172-8
      Issue No: Vol. 18, No. 2 (2017)
       
  • The context of multiple in-text references and their signification
    • Authors: Marc Bertin; Iana Atanassova
      Abstract: Abstract In this paper, we consider sentences that contain multiple in-text references (MIR) and their position in the rhetorical structure of articles. We carry out the analysis of MIR in a large-scale dataset of about 80,000 research articles published by the Public Library of Science in 7 journals. We analyze two major characteristics of MIR: their positions in the IMRaD structure of articles and the number of in-text references that make up a MIR in the different journals. We show that MIR are rather frequent in all sections of the rhetorical structure. In the Introduction section, sentences containing MIR account for more than half of the sentences with references. We examine the syntactic patterns that are most used in the contexts of both multiple and single in-text references and show that they are composed, for the most part, of noun groups. We point out the specificity of the Methods section in this respect.
      PubDate: 2017-07-21
      DOI: 10.1007/s00799-017-0225-7
       
  • Retrieval by recommendation: using LOD technologies to improve digital
           library search
    • Authors: Lisa Wenige; Johannes Ruhland
      Abstract: Abstract This paper investigates how Linked Open Data (LOD) can be used for recommendations and information retrieval within digital libraries. While numerous studies on both research paper recommender systems and Linked Data-enabled recommender systems have been conducted, no previous attempt has been undertaken to explore opportunities of LOD in the context of search and discovery interfaces. We identify central advantages of Linked Open Data with regard to scientific search and propose two novel recommendation strategies, namely flexible similarity detection and constraint-based recommendations. These strategies take advantage of key characteristics of data that adheres to LOD principles. The viability of Linked Data recommendations was extensively evaluated within the scope of a web-based user experiment in the domain of economics. Findings indicate that the proposed methods are well suited to enhance established search functionalities and are thus offering novel ways of resource access. In addition to that, RDF triples from LOD repositories can complement local bibliographic records that are sparse or of poor quality.
      PubDate: 2017-07-19
      DOI: 10.1007/s00799-017-0224-8
       
  • Discovering the structure and impact of the digital library evaluation
           domain
    • Authors: Leonidas Papachristopoulos; Giannis Tsakonas; Moses Boudourides; Michalis Sfakakis; Nikos Kleidis; Sergios Lenis; Christos Papatheodorou
      Abstract: Abstract The multidimensional nature of digital libraries evaluation domain poses several challenges to the research communities that intend to assess criteria, methods, products and tools, and also practice them. The amount of scientific production that is published in the domain hinders and disorientates the interested researchers. These researchers need guidance to exploit effectively the considerable amount of data and the diversity of methods, as well as to identify new research goals and develop their plans for future studies. This paper proposes a methodological pathway to investigate the core topics that structure the digital library evaluation domain and their impact. Further to the exploration of these topical entities, this study investigates also the researchers that contribute substantially to key topics, their communities and their relationships. The proposed methodology exploits topic modeling and network analysis in combination with citation and altmetrics analysis on a corpus consisting of the digital library evaluation papers presented in JCDL, ECDL/TDPL and ICADL conferences in the period 2001–2013.
      PubDate: 2017-06-24
      DOI: 10.1007/s00799-017-0222-x
       
  • Identifying reference spans: topic modeling and word embeddings help IR
    • Authors: Luis Moraes; Shahryar Baki; Rakesh Verma; Daniel Lee
      Abstract: Abstract The CL-SciSumm 2016 shared task introduced an interesting problem: given a document D and a piece of text that cites D, how do we identify the text spans of D being referenced by the piece of text' The shared task provided the first annotated dataset for studying this problem. We present an analysis of our continued work in improving our system’s performance on this task. We demonstrate how topic models and word embeddings can be used to surpass the previously best performing system.
      PubDate: 2017-06-20
      DOI: 10.1007/s00799-017-0220-z
       
  • Insights from CL-SciSumm 2016: the faceted scientific document
           summarization Shared Task
    • Authors: Kokil Jaidka; Muthu Kumar Chandrasekaran; Sajal Rustagi; Min-Yen Kan
      Abstract: Abstract We describe the participation and the official results of the 2nd Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm), held as a part of the BIRNDL workshop at the Joint Conference for Digital Libraries 2016 in Newark, New Jersey. CL-SciSumm is the first medium-scale Shared Task on scientific document summarization in the computational linguistics (CL) domain. Participants were provided a training corpus of 30 topics, each comprising of a reference paper (RP) and 10 or more citing papers, all of which cite the RP. For each citation, the text spans (i.e., citances) that pertain to the RP have been identified. Participants solved three sub-tasks in automatic research paper summarization using this text corpus. Fifteen teams from six countries registered for the Shared Task, of which ten teams ultimately submitted and presented their results. The annotated corpus comprised 30 target papers—currently the largest available corpora of its kind. The corpus is available for free download and use at https://github.com/WING-NUS/scisumm-corpus.
      PubDate: 2017-06-14
      DOI: 10.1007/s00799-017-0221-y
       
  • Computational linguistics literature and citations oriented citation
           linkage, classification and summarization
    • Authors: Lei Li; Liyuan Mao; Yazhao Zhang; Junqi Chi; Taiwen Huang; Xiaoyue Cong; Heng Peng
      Abstract: Scientific literature is currently the most important resource for scholars, and their citations have provided researchers with a powerful latent way to analyze scientific trends, influences and relationships of works and authors. This paper is focused on automatic citation analysis and summarization for the scientific literature of computational linguistics, which are also the shared tasks in the 2016 workshop of the 2nd Computational Linguistics Scientific Document Summarization at BIRNDL 2016 (The Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries). Each citation linkage between a citation and the spans of text in the reference paper is recognized according to their content similarities via various computational methods. Then the cited text span is classified to five pre-defined facets, i.e., Hypothesis, Implication, Aim, Results and Method, based on various features of lexicons and rules via Support Vector Machine and Voting Method. Finally, a summary of the reference paper from the cited text spans is generated within 250 words. hLDA (hierarchical Latent Dirichlet Allocation) topic model is adopted for content modeling, which provides knowledge about sentence clustering (subtopic) and word distributions (abstractiveness) for summarization. We combine hLDA knowledge with several other classical features using different weights and proportions to evaluate the sentences in the reference paper. Our systems have been ranked top one and top two according to the evaluation results published by BIRNDL 2016, which has verified the effectiveness of our methods.
      PubDate: 2017-06-13
      DOI: 10.1007/s00799-017-0219-5
       
  • Bag of works retrieval: TF*IDF weighting of works co-cited with a seed
    • Authors: Howard D. White
      Abstract: Abstract Although not presently possible in any system, the style of retrieval described here combines familiar components—co-citation linkages of documents and TF*IDF weighting of terms—in a way that could be implemented in future databases. Rather than entering keywords, the user enters a string identifying a work—a seed—to retrieve the strings identifying other works that are co-cited with it. Each of the latter is part of a “bag of works,” and it presumably has both a co-citation count with the seed and an overall citation count in the database. These two counts can be plugged into a standard formula for TF*IDF weighting such that all the co-cited items can be ranked for relevance to the seed, given that the entire retrieval is relevant to it by evidence from multiple co-citing authors. The result is analogous to, but different from, traditional “bag of words” retrieval, which it supplements. Some properties of the ranking are illustrated by works co-cited with three seeds: an article on search behavior, an information retrieval textbook, and an article on centrality in networks. While these are case studies, their properties apply to bag of works retrievals in general and have implications for users (e.g., humanities scholars, domain analysts) that go beyond any one example.
      PubDate: 2017-05-19
      DOI: 10.1007/s00799-017-0217-7
       
  • Section mixture models for scientific document summarization
    • Authors: John M. Conroy; Sashka T. Davis
      Abstract: Abstract In this paper, we present a system for summarization of scientific and structured documents that has three components: section mixture models are used for estimation of the weights of terms; a hypothesis test to select a subset of these terms; and a sentence extractor based on techniques for combinatorial optimization. The section mixture models approach is an adaptation of a bigram mixture model based on the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work done on Biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the Biomedical data was used also on the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model gives rise to machine-generated summaries with ROUGE scores that are nearly as strong as those seen on the Biomedical data and was also the highest scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012) which takes the sentences from the original document and the assignment of weights of the terms computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore the importance of an appropriate background model for the hypothesis test to select terms to achieve the best quality summaries.
      PubDate: 2017-05-17
      DOI: 10.1007/s00799-017-0218-6
       
  • Scientific document summarization via citation contextualization and
           scientific discourse
    • Authors: Arman Cohan; Nazli Goharian
      Abstract: Abstract The rapid growth of scientific literature has made it difficult for the researchers to quickly learn about the developments in their respective fields. Scientific summarization addresses this challenge by providing summaries of the important contributions of scientific papers. We present a framework for scientific summarization which takes advantage of the citations and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are even sometimes inaccurate. We first address the problem of inaccuracy of the citation texts by finding the relevant context from the cited paper. We propose three approaches for contextualizing citations which are based on query reformulation, word embeddings, and supervised learning. We then train a model to identify the discourse facets for each citation. We finally propose a method for summarizing scientific papers by leveraging the faceted citations and their corresponding contexts. We evaluate our proposed method on two scientific summarization datasets in the biomedical and computational linguistics domains. Extensive evaluation results show that our methods can improve over the state of the art by large margins.
      PubDate: 2017-05-09
      DOI: 10.1007/s00799-017-0216-8
       
  • Quantifying retrieval bias in Web archive search
    • Authors: Thaer Samar; Myriam C. Traub; Jacco van Ossenbruggen; Lynda Hardman; Arjen P. de Vries
      Abstract: Abstract A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
      PubDate: 2017-04-18
      DOI: 10.1007/s00799-017-0215-9
       
  • Automatic summarization of scientific publications using a feature
           selection approach
    • Authors: Hazem Al Saied; Nicolas Dugué; Jean-Charles Lamirel
      Abstract: Abstract Feature Maximization is a feature selection method that deals efficiently with textual data: to design systems that are altogether language-agnostic, parameter-free and do not require additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first experiment this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that despite the unstructured nature of the corpus, our system is above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of SciSumm challenge, which is dedicated to research publications summarization. These publications are structured in sections and our class-based approach is thus relevant. We more specifically focus on the task that aims to summarize papers using those that refer to them. We consider and evaluate several systems using our approach dealing with specific bag of words. Furthermore, in these systems, we also evaluate cosine and graph-based distance for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than the known results so far, and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.
      PubDate: 2017-04-13
      DOI: 10.1007/s00799-017-0214-x
       
  • Reuse and plagiarism in Speech and Natural Language Processing
           publications
    • Authors: Joseph Mariani; Gil Francopoulo; Patrick Paroubek
      Abstract: Abstract The aim of this experiment is to present an easy way to compare fragments of texts in order to detect (supposed) results of copy and paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space of the comparisons is a corpus labeled as NLP4NLP, which includes 34 different conferences and journals and gathers a large part of the NLP activity over the past 50 years. This study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four different types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (that we will call the source paper), or in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is rather a common practice, but that plagiarism seems to be very unusual, and that both stay within legal and ethical limits.
      PubDate: 2017-03-21
      DOI: 10.1007/s00799-017-0211-0
       
 
 
JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
 
Home (Search)
Subjects A-Z
Publishers A-Z
Customise
APIs
Your IP address: 54.158.253.134
 
About JournalTOCs
API
Help
News (blog, publications)
JournalTOCs on Twitter   JournalTOCs on Facebook

JournalTOCs © 2009-2016