for Journals by Title or ISSN
for Articles by Keywords
Followed Journals
Journal you Follow: 0
Sign Up to follow journals, search in your chosen journals and, optionally, receive Email Alerts when new issues of your Followed Jurnals are published.
Already have an account? Sign In to see the journals you follow.
Journal Cover International Journal on Digital Libraries
   [528 followers]  Follow    
   Hybrid Journal Hybrid journal (It can contain Open Access articles)
     ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
     Published by Springer-Verlag Homepage  [2210 journals]   [SJR: 0.649]   [H-I: 22]
  • How to assess image quality within a workflow chain: an overview
    • Abstract: Abstract Image quality assessment (IQA) is a multi-dimensional research problem and an active and evolving research area. This paper aims to provide an overview of the state of the art of the IQA methods, putting in evidence their applicability and limitations in different application domains. We outline the relationship between the image workflow chain and the IQA approaches reviewing the literature on IQA methods, classifying and summarizing the available metrics. We present general guidelines for three workflow chains in which IQA policies are required. The three workflow chains refer to: high-quality image archives, biometric system and consumer collections of personal photos. Finally, we illustrate a real case study referring to a printing workflow chain, where we suggest and actually evaluate the performance of a set of specific IQA methods.
      PubDate: 2014-08-15
  • A comprehensive evaluation of scholarly paper recommendation using
           potential citation papers
    • Abstract: Abstract To help generate relevant suggestions for researchers, recommendation systems have started to leverage the latent interests in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, in our former work, we identified “potential citation papers” through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution, we investigated which sections of papers can be leveraged to represent papers effectively. While this initial approach works well for researchers vested in a single discipline, it generates poor predictions for scientists who work on several different topics in the discipline (hereafter, “intra-disciplinary”). We thus extend our previous work in this paper by proposing an adaptive neighbor selection method to overcome this problem in our imputation-based collaborative filtering framework. On a publicly-available scholarly paper recommendation dataset, we show that recommendation accuracy significantly outperforms state-of-the-art recommendation baselines as measured by nDCG and MRR, when using our adaptive neighbor selection method. While recommendation performance is enhanced for all researchers, improvements are more marked for intra-disciplinary researchers, showing that our method does address the targeted audience.
      PubDate: 2014-08-10
  • Evaluating sliding and sticky target policies by measuring temporal drift
           in acyclic walks through a web archive
    • Abstract: Abstract When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to \(<\) 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
      PubDate: 2014-08-05
  • Evaluating distance-based clustering for user (browse and click) sessions
           in a domain-specific collection
    • Abstract: Abstract We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at the core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions, worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.
      PubDate: 2014-08-01
  • Information-theoretic term weighting schemes for document clustering and
    • Abstract: Abstract We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB) which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF) which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performances of the proposed methods compared to classic TF*IDF. Particularly, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
      PubDate: 2014-07-30
  • Profiling web archive coverage for top-level domain and content language
    • Abstract: Abstract The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) by only sending queries to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and fulltext queries to archives) and use these profiles as resource descriptor. These profiles are used in matching the URI-lookup requests to the most probable web archives. We define \(Recall_{TM}(n)\) as the percentage of a TimeMap that was returned using \(n\) web archives. We discover that only sending queries to the top three web archives (i.e., 80 % reduction in the number of queries) for any request reaches on average \(Recall_{TM}=0.96\) . If we exclude the Internet Archive from the list, we can reach \(Recall_{TM}=0.647\) on average using only the remaining top three web archives.
      PubDate: 2014-06-27
  • Unsupervised document structure analysis of digital scientific articles
    • Abstract: Abstract Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles is highly varying across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
      PubDate: 2014-06-08
  • Sustainability of digital libraries: a conceptual model and a research
    • Abstract: Abstract This paper aims to develop a conceptual model and a research framework for study of the economic, social and environmental sustainability of digital libraries. The major factors that are related to the economic, social and environmental sustainability of digital libraries have been identified. Relevant research in digital information systems and services in general, and digital libraries in particular, have been discussed to illustrate different issues and challenges associated with each of the three forms of sustainability. Based on the discussions of relevant research that have implications on sustainability of information systems and services, the paper proposes a conceptual model and a theoretical research framework for study of the sustainability of digital libraries. It shows that the sustainable business models to support digital libraries should also support equitable access supported by specific design and usability guidelines that facilitate easier, better and cheaper access; support the personal, institutional and social culture of users; and at the same time conform with the policy and regulatory frameworks of the respective regions, countries and institutions. It is also shown that measures taken to improve the economic and social sustainability should also support the environmental sustainability guidelines, i.e. reduce the overall environmental impact of digital libraries. It is argued that the various factors affecting the different sustainability issues of digital libraries need to be studied together to build digital libraries that are economically, socially and environmentally sustainable.
      PubDate: 2014-06-07
  • Introduction to the focused issue on the 17th International Conference on
           Theory and Practice of Digital Libraries (TPDL 2013)
    • PubDate: 2014-06-06
  • Who and what links to the Internet Archive
    • Abstract: Abstract The Internet Archive’s (IA) Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on Web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by the European languages. Most human users come to Web archives because they do not find the requested pages on the live Web. About 65 % of the requested archived pages no longer exist on the live Web. We find that more than 82 % of human sessions connect to the Wayback Machine via referrals from other Web sites, while only 15 % of robots have referrers. Most of the links (86 %) from Websites are to individual archived pages at specific points in time, and of those 83 % no longer exist on the live Web. Finally, we find that users who come from search engines browse more pages than users who come from external Web sites.
      PubDate: 2014-04-23
  • Metadata management, interoperability and Linked Data publishing support
           for Natural History Museums
    • Abstract: Abstract Natural history museums (NHMs) form a rich source of knowledge about Earth’s biodiversity and natural history. However, an impressive abundance of high-quality scientific content available in NHMs around Europe remains largely unexploited due to a number of barriers, such as the lack of interconnection and interoperability between the management systems used by museums, the lack of centralized access through a European point of reference such as Europeana and the inadequacy of the current metadata and content organization. The Natural Europe project offers a coordinated solution at European level that aims to overcome those barriers. In this article, we present the architecture, deployment and evaluation of the Natural Europe infrastructure allowing the curators to publish, semantically describe and manage the museums’ cultural heritage objects, as well as disseminate them to and BioCASE/GBIF. Additionally, we discuss the methodology followed for the transition of the infrastructure to the Semantic Web and the publishing of NHMs’ cultural heritage metadata as Linked Data, supporting the Europeana Data Model.
      PubDate: 2014-04-11
  • A system for high quality crowdsourced indigenous language transcription
    • Abstract: Abstract In this article, a crowdsourcing method is proposed to transcribe manuscripts from the Bleek and Lloyd Collection, where non-expert volunteers transcribe pages of the handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts have been made to convert the approximately 20,000 pages of text to a machine-readable form using machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. This article presents details of the system used to enable transcription by volunteers as well as results from experiments that were conducted to determine the quality and consistency of transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80 % for Xam text and 95 % for English text. When the Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75 %, which exceeded that in previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
      PubDate: 2014-04-11
  • Word occurrence based extraction of work contributors from statements of
    • Abstract: Abstract This paper addresses the identification of all contributors of an intellectual work, when they are recorded in bibliographic data but in unstructured form. National bibliographies are very reliable on representing the first author of a work; however, secondary contributors are frequently represented only in the statements of responsibility that are transcribed by the cataloguer from the book into the bibliographic records. The identification of work contributors mentioned in statements of responsibility is a typical motivation for the application of information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure being deployed in several European countries to assist in the determination of the copyright status of works that may not be under public domain. An evaluation of our approach was performed in catalogues of nine European national libraries of countries that are available in the ARROW rights infrastructure, which cover eight different languages. The evaluation has shown that it performs reliably across languages and bibliographic datasets. It achieved an overall precision of 98.7 % and recall of 96.7 %.
      PubDate: 2014-04-05
  • Improved bibliographic reference parsing based on repeated patterns
    • Abstract: Abstract Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue that has received considerable attention recently. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. They do not perform well with the wide variety of reference styles used in older, historic publications. Thus, they are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents a generic approach to bibliographic reference parsing, named RefParse, which is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. In addition, our approach learns names of authors, journals, and publishers to increase the accuracy in scenarios where human users double check parsing results to increase data quality. Our evaluation shows that our approach performs comparably to existing ones with contemporary reference lists and also works well with older ones.
      PubDate: 2014-04-01
  • Researcher Name Resolver: identifier management system for Japanese
    • Abstract: Abstract We built a researcher identifier management system called the Researcher Name Resolver (RNR) to assist with the name disambiguation of authors in digital libraries on the Web. RNR, which is designed to cover all researchers in Japan, is a Web-oriented service that can be openly connected with external scholastic systems. We expect it to be widely used for enriched scholarly communications. In this paper, we first outline the conceptual framework of RNR, which is jointly focused on researcher identifier management and Web resource linking. We based our researcher identifier scheme on the reuse of multiple sets of existing researcher identifiers belonging to the Japanese grant database KAKEN and the researcher directory ReaD & Researchmap. Researcher identifiers are associated by direct links to related resources on the Web through a combination of methods, including descriptive mapping, focused crawling on campus directories and researcher identification by matching names and affiliations. Second, we discuss our implementation of RNR based on this framework. Researcher identifiers construct uniform resource identifiers to show Web pages that describe researcher profiles and provide links to related external resources. We have adapted Web-friendly technologies—e.g., OpenSearch and the RDFs of Linked Data technology—in this implementation to provide Web-friendly services. Third, we discuss our application of RNR to a name disambiguation task for the search portal of the Japanese Institutional Repositories Online to determine how well the researcher identifier management system cooperates with external systems. Finally, we discuss lessons learned from the entire project as well as the future development directions we intend to take.
      PubDate: 2014-04-01
  • Moved but not gone: an evaluation of real-time methods for discovering
           replacement web pages
    • Abstract: Abstract Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost $70\,\%$ of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to $77\,\%$ . The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
      PubDate: 2014-04-01
  • Harvesting latent and usage-based metadata in a course management system
           to enrich the underlying educational digital library
    • Abstract: Abstract In this case study, we demonstrate how in an integrated digital library and course management system, metadata can be generated using a bootstrapping mechanism. The integration encompasses sequencing of content by teachers and deployment of content to learners. We show that taxonomy term assignments and a recommender system can be based almost solely on usage data (especially correlations on what teachers have put in the same course or assignment). In particular, we show that with minimal human intervention, taxonomy terms, quality measures, and an association ruleset can be established for a large pool of fine-granular educational assets.
      PubDate: 2013-11-29
  • A vision towards Scientific Communication Infrastructures
    • Abstract: Abstract The two pillars of the modern scientific communication are Data Centers and Research Digital Libraries (RDLs), whose technologies and admin staff support researchers at storing, curating, sharing, and discovering the data and the publications they produce. Being realized to maintain and give access to the results of complementary phases of the scientific research process, such systems are poorly integrated with one another and generally do not rely on the strengths of the other. Today, such a gap hampers achieving the objectives of the modern scientific communication, that is, publishing, interlinking, and discovery of all outcomes of the research process, from the experimental and observational datasets to the final paper. In this work, we envision that instrumental to bridge the gap is the construction of “Scientific Communication Infrastructures”. The main goal of these infrastructures is to facilitate interoperability between Data Centers and RDLs and to provide services that simplify the implementation of the large variety of modern scientific communication patterns.
      PubDate: 2013-07-14
  • The different roles of ‘Design Process Champions’ for digital
           libraries in African higher education
    • Abstract: Abstract The concept of design stakeholders is central to effective design of digital libraries. We report on research findings that identified the presence of a key subset of stakeholders which we term ‘design process champions’. Our findings have identified that these champions can change interaction patterns and the eventual output of the other stakeholders (project participants) in the design process of digital library projects. This empirical research is based upon 38 interviews with key stakeholders and a review of documentary evidence in 10 innovative digital library design projects (e.g. mobile clinical libraries) located in three African universities in Kenya, Uganda, and South Africa. Through a grounded theory approach, two different types of the ‘design process champions’ emerged from the data with varying levels of effectiveness in the design process: (i) domain champions and (ii) multidisciplinary champions. The domain champions assume a ‘siloed’ approach of engagement while the multidisciplinary champions take on a participatory engagement throughout the design process. A discussion of the implications of information specialists functioning as domain champions is highlighted. We conclude by suggesting that the multidisciplinary champions’ approach is particularly useful in supporting sustainability of digital library design projects.
      PubDate: 2013-03-28
  • On the applicability of word sense discrimination on 201 years of
           modern english
    • Abstract: Abstract As language evolves over time, documents stored in long- term archives become inaccessible to users. Automatically, detecting and handling language evolution will become a necessity to meet user’s information needs. In this paper, we investigate the performance of modern tools and algorithms applied on modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm on all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correct OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves. We increase the number of clusters by 24 %. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied on The Times Archive is comparable to the ground truth.
      PubDate: 2013-03-16
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
About JournalTOCs
News (blog, publications)
JournalTOCs on Twitter   JournalTOCs on Facebook

JournalTOCs © 2009-2014