International Journal on Digital Libraries
   Hybrid journal (may contain Open Access articles)
     ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
     Published by Springer-Verlag   [SJR: 0.649]   [H-I: 22]
  • A locality-aware similar information searching scheme
    • Abstract: In a database, a similar information search means finding data records which contain the majority of the search keywords. Due to the rapid accumulation of information nowadays, the size of databases has increased dramatically. An efficient information searching scheme can speed up searching and retrieve all relevant records. This paper proposes a Hilbert curve-based similarity searching scheme (HCS). HCS considers a database to be a multidimensional space and each data record to be a point in that space. By using a Hilbert space-filling curve, each point is projected from the high-dimensional space to a low-dimensional space, so that points close to each other in the high-dimensional space are gathered together in the low-dimensional space. Because the database is divided into many clusters of close points, a query is mapped to a certain cluster instead of being matched against the entire database. Experimental results show that HCS dramatically reduces search latency and exhibits high effectiveness in retrieving similar information.
      PubDate: 2014-10-12
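      A minimal Python sketch of the clustering idea described above: records quantized to a small 2-D grid are mapped to Hilbert-curve indices so that nearby records fall into the same bucket, and a query is answered from a single bucket. The grid size, bucket width and record coordinates are invented for illustration and are not the paper's implementation.

        def hilbert_index(n, x, y):
            """Map (x, y) on an n x n grid (n a power of two) to its Hilbert-curve
            index; points that are close in 2-D receive close indices."""
            d, s = 0, n // 2
            while s > 0:
                rx = 1 if x & s else 0
                ry = 1 if y & s else 0
                d += s * s * ((3 * rx) ^ ry)
                if ry == 0:                       # rotate/flip the quadrant
                    if rx == 1:
                        x, y = n - 1 - x, n - 1 - y
                    x, y = y, x
                s //= 2
            return d

        def build_clusters(records, grid=64, bucket=64):
            """Group records (already quantized to grid coordinates) into clusters
            of neighbouring Hilbert indices."""
            clusters = {}
            for rec_id, (x, y) in records.items():
                clusters.setdefault(hilbert_index(grid, x, y) // bucket, []).append(rec_id)
            return clusters

        records = {"r1": (3, 5), "r2": (4, 5), "r3": (60, 9)}
        clusters = build_clusters(records)
        query_cluster = hilbert_index(64, 4, 6) // 64     # only this cluster is searched
        print(clusters.get(query_cluster, []))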
       
  • Exploring publication metadata graphs with the LODmilla browser and editor
    • Abstract: With the LODmilla browser, we try to support linked data exploration in a generic way, learning from 20 years of web browser evolution as well as from the opinions of scholars who use it as a research exploration tool. In this paper, generic functions for linked open data (LOD) browsing are presented, and we explain what kinds of information search tactics they enable with linked data describing publications. Furthermore, LODmilla also supports the sharing of graph views and the correction of LOD data during browsing.
      PubDate: 2014-10-12
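      The core interaction such a browser supports is expanding a node: dereferencing its URI and listing the triples it participates in. A rough Python sketch of that step using rdflib follows; the starting URI is an arbitrary example and the snippet is not LODmilla's actual code.

        from rdflib import Graph, URIRef

        def expand_node(uri):
            """Dereference a linked-data URI and return the node's outgoing edges,
            i.e. what a LOD browser would display when the node is opened."""
            g = Graph()
            g.parse(uri)                    # fetches RDF via content negotiation
            node = URIRef(uri)
            return [(str(p), str(o)) for _, p, o in g.triples((node, None, None))]

        for predicate, value in expand_node("http://dbpedia.org/resource/Digital_library"):
            print(predicate, "->", value)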
       
  • Linked data authority records for Irish place names
    • Abstract: Linked Data technologies are increasingly being implemented to enhance cataloguing workflows in libraries, archives and museums. We review current best practice in library cataloguing, how Linked Data is used to link collections and provide consistency in indexing, and briefly describe the relationship between Linked Data, library data models and descriptive standards. As an example we look at the Logainm.ie dataset, an online database holding the authoritative hierarchical list of Irish and English language place names in Ireland. This paper describes the process of creating the new Linked Logainm dataset, including the transformation of the data from XML to RDF and the generation of links to external geographic datasets like DBpedia and the Faceted Application of Subject Terminology. This dataset was then used to enhance the National Library of Ireland’s MARCXML metadata records for its Longfield maps collection. We also describe the potential benefits of Linked Data for libraries, focusing on the use of the Linked Logainm dataset and its future potential for Irish heritage institutions.
      PubDate: 2014-10-10
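      A small sketch, using rdflib, of what publishing one place-name authority record as Linked Data with an owl:sameAs link to DBpedia can look like; the namespace, identifier and property choices are illustrative and do not reproduce the actual Linked Logainm model.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import OWL, RDFS

        PLACE = Namespace("http://data.logainm.ie/place/")    # illustrative namespace

        g = Graph()
        place = PLACE["100987"]                                # hypothetical identifier
        g.add((place, RDFS.label, Literal("Baile Átha Cliath", lang="ga")))
        g.add((place, RDFS.label, Literal("Dublin", lang="en")))
        # Link the authority record to an external geographic dataset.
        g.add((place, OWL.sameAs, URIRef("http://dbpedia.org/resource/Dublin")))

        print(g.serialize(format="turtle"))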
       
  • Digital field scholarship and the liberal arts: results from a 2012–13 sandbox
    • Abstract: We summarize a recent multi-institutional collaboration in digital field scholarship involving four liberal arts colleges: Davidson College, Lewis & Clark College, Muhlenberg College, and Reed College. Digital field scholarship (DFS) can be defined as scholarship in the arts and sciences for which field-based research and concepts are significant, and digital tools support data collection, analysis, and communication; DFS thus gathers together and extends a wide range of existing scholarship, offering new possibilities for appreciating the connections that define liberal education. Our collaboration occurred as a sandbox, a collective online experiment using a modified WordPress platform (including mapping and other advanced capabilities) built and supported by Lewis & Clark College, with sponsorship provided by the National Institute for Technology in Liberal Education. Institutions selected course-based DFS projects for fall 2012 and/or spring 2013. Projects ranged from documentary photojournalism to home energy efficiency assessment. One key feature was the use of readily available mobile devices and apps for field-based reconnaissance and data collection; another was our public digital scholarship approach, in which student participants shared the process and products of their work via public posts on the DFS website. Descriptive and factor analysis results from anonymous assessment data suggest strong participant response and likely future potential of digital field scholarship across class level and gender. When set into the context of the four institutions that supported the 2012–2013 sandbox, we see further opportunities for digital field scholarship on our and other campuses, provided that an optimal balance is struck between challenges and rewards along technical, pedagogical, and practical axes. Ultimately, digital field scholarship will be judged for its scholarship and for extending the experimental, open-ended inquiry that characterizes liberal education.
      PubDate: 2014-09-20
       
  • Evaluating a digital humanities research environment: the CULTURA approach
    • Abstract: Digital humanities initiatives play an important role in making cultural heritage collections accessible to the global community of researchers and general public for the first time. Further work is needed to provide useful and usable tools to support users in working with those digital contents in virtual environments. The CULTURA project has developed a corpus agnostic research environment integrating innovative services that guide, assist and empower a broad spectrum of users in their interaction with cultural artefacts. This article presents (1) the CULTURA system and services and the two collections that have been used for testing and deploying the digital humanities research environment, and (2) an evaluation methodology and formative evaluation study with apprentice researchers. An evaluation model was developed which has served as a common ground for systematic evaluations of the CULTURA environment with user communities around the two test bed collections. The evaluation method has proven to be suitable for accommodating different evaluation strategies and allows meaningful consolidation of evaluation results. The evaluation outcomes indicate a positive perception of CULTURA. A range of useful suggestions for future improvement has been collected and fed back into the development of the next release of the research environment.
      PubDate: 2014-09-16
       
  • A case study on propagating and updating provenance information using the CIDOC CRM
    • Abstract: Provenance information of digital objects maintained by digital libraries and archives is crucial for authenticity assessment, reproducibility and accountability. Such information is commonly stored as metadata in various Metadata Repositories (MRs) or Knowledge Bases (KBs). Nevertheless, in various settings it is prohibitive to store the provenance of each digital object due to the high storage space required for complete provenance. In this paper, we introduce provenance-based inference rules as a means to complete the provenance information, to reduce the amount of provenance information that has to be stored, and to ease quality control (e.g., corrections). Roughly, we show how provenance information can be propagated by identifying a number of basic inference rules over a core conceptual model for representing provenance. The propagation of provenance concerns fundamental modelling concepts such as actors, activities, events, devices and information objects, and their associations. However, since an MR/KB is not static but changes over time due to several factors, the question that arises is how we can satisfy update requests while still supporting the aforementioned inference rules. Towards this end, we elaborate on the specification of the required add/delete operations, consider two different semantics for deletion of information, and provide the corresponding update algorithms. Finally, we report extensive comparative results for different repository policies regarding the derivation of new knowledge, in datasets containing up to one million RDF triples. The results allow us to understand the tradeoffs related to the use of inference rules on storage space and performance of queries and updates.
      PubDate: 2014-08-29
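      A toy illustration of provenance propagation by inference rules: statements are held as triples and one hand-written rule derives new provenance for an object from the activity that produced it. The property names are simplified stand-ins, not CIDOC CRM identifiers, and the rule set in the paper is considerably richer.

        # Facts as (subject, property, value) triples.
        facts = {
            ("scan_0042", "was_produced_by", "digitisation_run_7"),
            ("digitisation_run_7", "used_device", "scanner_A3"),
            ("digitisation_run_7", "carried_out_by", "curator_jane"),
        }

        def propagate(facts):
            """Apply the propagation rules until no new provenance facts appear."""
            while True:
                new = set()
                for obj, p, activity in facts:
                    if p != "was_produced_by":
                        continue
                    for subj, q, value in facts:
                        if subj != activity:
                            continue
                        if q == "used_device":
                            new.add((obj, "derived_with_device", value))
                        if q == "carried_out_by":
                            new.add((obj, "attributed_to", value))
                if new <= facts:
                    return facts
                facts = facts | new

        for fact in sorted(propagate(facts)):
            print(fact)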
       
  • Sifting useful comments from Flickr Commons and YouTube
    • Abstract: Cultural institutions are increasingly contributing content to social media platforms to raise awareness and promote use of their collections. Furthermore, they are often the recipients of user comments containing information that may be incorporated in their catalog records. However, not all user-generated comments can be used for the purpose of enriching metadata records. Judging the usefulness of a large number of user comments is a labor-intensive task. Accordingly, our aim was to provide automated support for curation of potentially useful social media comments on digital objects. In this paper, the notion of usefulness is examined in the context of social media comments and compared from the perspective of both end-users and expert users. A machine-learning approach is then introduced to automatically classify comments according to their usefulness. This approach uses syntactic and semantic comment features while taking user context into consideration. We present the results of an experiment we conducted on user comments collected from Flickr Commons collections and YouTube. A study is then carried out on the correlation between the commenting culture of a platform (YouTube vs. Flickr) and usefulness prediction. Our findings indicate that a few relatively straightforward features can be used for inferring useful comments. However, the influence of features on usefulness classification may vary according to the commenting cultures of platforms.
      PubDate: 2014-08-20
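      A minimal sketch of training a usefulness classifier for comments with scikit-learn. The toy training comments, labels and the plain TF-IDF features are placeholders; the paper's feature set (syntactic, semantic and user-context features) is richer than this.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy labels: 1 = useful for enriching a catalog record, 0 = not useful.
        comments = [
            "This photo shows the old harbour at Aberdeen, taken around 1910.",
            "The man on the left is my great-grandfather, a local shipwright.",
            "wow nice pic!!!",
            "first!!! check out my channel",
        ]
        labels = [1, 1, 0, 0]

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        model.fit(comments, labels)
        print(model.predict(["The building in the background was demolished in 1963."]))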
       
  • How to assess image quality within a workflow chain: an overview
    • Abstract: Image quality assessment (IQA) is a multi-dimensional research problem and an active and evolving research area. This paper aims to provide an overview of the state of the art of IQA methods, highlighting their applicability and limitations in different application domains. We outline the relationship between the image workflow chain and the IQA approaches, reviewing the literature on IQA methods and classifying and summarizing the available metrics. We present general guidelines for three workflow chains in which IQA policies are required: high-quality image archives, biometric systems and consumer collections of personal photos. Finally, we illustrate a real case study of a printing workflow chain, where we suggest and evaluate the performance of a set of specific IQA methods.
      PubDate: 2014-08-15
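      As one concrete instance of the full-reference metrics such a survey covers, the sketch below computes PSNR with NumPy; it is a single, simple metric and does not stand in for the paper's guidelines.

        import numpy as np

        def psnr(reference, test, max_value=255.0):
            """Peak signal-to-noise ratio between a reference and a processed image."""
            mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
            if mse == 0:
                return float("inf")                 # identical images
            return 10.0 * np.log10(max_value ** 2 / mse)

        rng = np.random.default_rng(0)
        original = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
        noisy = np.clip(original + rng.normal(0, 5, size=original.shape), 0, 255)
        print(f"PSNR: {psnr(original, noisy):.1f} dB")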
       
  • A comprehensive evaluation of scholarly paper recommendation using potential citation papers
    • Abstract: To help generate relevant suggestions for researchers, recommendation systems have started to leverage the latent interests in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, in our earlier work, we identified “potential citation papers” through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution, we investigated which sections of papers can be leveraged to represent papers effectively. While this initial approach works well for researchers vested in a single discipline, it generates poor predictions for scientists who work on several different topics in the discipline (hereafter, “intra-disciplinary”). We thus extend our previous work in this paper by proposing an adaptive neighbor selection method to overcome this problem in our imputation-based collaborative filtering framework. On a publicly available scholarly paper recommendation dataset, we show that our adaptive neighbor selection method significantly outperforms state-of-the-art recommendation baselines as measured by nDCG and MRR. While recommendation performance is enhanced for all researchers, improvements are more marked for intra-disciplinary researchers, showing that our method does address the targeted audience.
      PubDate: 2014-08-10
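      The evaluation metrics mentioned above can be stated compactly; the sketch below computes nDCG and MRR for ranked recommendation lists with binary relevance, using invented example judgments.

        import math

        def dcg(relevances):
            """Discounted cumulative gain of a ranked list of relevance scores."""
            return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

        def ndcg(relevances, k):
            """DCG of the system ranking divided by the DCG of the ideal ranking."""
            ideal = dcg(sorted(relevances, reverse=True)[:k])
            return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

        def mrr(ranked_lists):
            """Mean reciprocal rank of the first relevant item over several lists."""
            total = 0.0
            for rels in ranked_lists:
                hit = next((i for i, r in enumerate(rels) if r > 0), None)
                total += 0.0 if hit is None else 1.0 / (hit + 1)
            return total / len(ranked_lists)

        # Two recommendation lists; 1 marks papers the researcher actually cited later.
        lists = [[1, 0, 1, 0, 0], [0, 0, 1, 0, 0]]
        print(ndcg(lists[0], k=5), mrr(lists))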
       
  • Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive
    • Abstract: When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to under 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
      PubDate: 2014-08-05
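      A small simulation sketch of the two target-datetime policies described above. Capture dates and walk lengths are drawn at random purely for illustration; the study itself used 200,000 walks over real archive holdings.

        import random

        def walk_drift(policy, steps=50, captures_per_page=5, span_days=3650, seed=7):
            """Return the temporal drift (days) at the end of one simulated walk.
            'sliding': the target datetime follows each capture that is displayed.
            'sticky':  the target datetime stays at the originally selected value."""
            rng = random.Random(seed)
            original = rng.uniform(0, span_days)
            target, drift = original, 0.0
            for _ in range(steps):
                captures = [rng.uniform(0, span_days) for _ in range(captures_per_page)]
                shown = min(captures, key=lambda c: abs(c - target))  # closest capture wins
                drift = abs(shown - original)
                if policy == "sliding":
                    target = shown            # drift can accumulate step by step
            return drift

        print("sliding:", round(walk_drift("sliding")), "days of drift")
        print("sticky: ", round(walk_drift("sticky")), "days of drift")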
       
  • Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection
    • Abstract: We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms uses a distance measure at its core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to the same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.
      PubDate: 2014-08-01
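      A sketch of the kind of distance comparison the study performs, using toy bag-of-pages session vectors and two common measures; the real feature construction and the full set of distance measures evaluated are in the paper.

        import numpy as np

        def cosine_distance(a, b):
            return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        def euclidean_distance(a, b):
            return float(np.linalg.norm(a - b))

        # Click counts per document for three sessions (invented values).
        session_a = np.array([3, 1, 0, 0, 2], dtype=float)   # user A, question 1
        session_b = np.array([2, 2, 0, 0, 1], dtype=float)   # user B, question 1
        session_c = np.array([0, 0, 4, 1, 0], dtype=float)   # user C, question 2

        for name, dist in (("cosine", cosine_distance), ("euclidean", euclidean_distance)):
            print(name, "same question:", round(dist(session_a, session_b), 2),
                  "different question:", round(dist(session_a, session_c), 2))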
       
  • Information-theoretic term weighting schemes for document clustering and classification
    • Abstract: We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. Particularly, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
      PubDate: 2014-07-30
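      The abstract does not give the LIB and LIF formulas themselves, so the sketch below only shows the structure of the classic TF*IDF baseline they are compared against, computed over a toy collection.

        import math
        from collections import Counter

        def tf_idf(docs):
            """Classic TF*IDF term weights per document: the baseline representation
            the proposed LIB/LIF schemes are evaluated against."""
            df = Counter()
            for doc in docs:
                df.update(set(doc))              # document frequency of each term
            n = len(docs)
            return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                    for doc in docs]

        docs = [
            ["library", "digital", "metadata", "metadata"],
            ["library", "catalogue", "record"],
            ["digital", "archive", "web"],
        ]
        for weights in tf_idf(docs):
            print(weights)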
       
  • Profiling web archive coverage for top-level domain and content language
    • Abstract: The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) by only sending queries to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and fulltext queries to archives) and use these profiles as resource descriptors. These profiles are used to match URI-lookup requests to the most probable web archives. We define Recall_TM(n) as the percentage of a TimeMap that is returned using n web archives. We discover that only sending queries to the top three web archives (i.e., an 80 % reduction in the number of queries) for any request reaches on average Recall_TM = 0.96. If we exclude the Internet Archive from the list, we can reach Recall_TM = 0.647 on average using only the remaining top three web archives.
      PubDate: 2014-06-27
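      Recall_TM(n) as defined above can be computed directly once each archive's mementos for a URI are known. The holdings below are invented, and archives are ranked here simply by holding size rather than by the paper's profile-matching step.

        def recall_tm(holdings, n):
            """Share of the full TimeMap returned when only the top-n archives are queried."""
            full = set().union(*holdings.values())
            top_n = sorted(holdings, key=lambda a: len(holdings[a]), reverse=True)[:n]
            returned = set().union(*(holdings[a] for a in top_n))
            return len(returned & full) / len(full)

        holdings = {                                  # memento ids per archive (invented)
            "archive.org":       {"m1", "m2", "m3", "m4", "m5", "m6"},
            "archive-it.org":    {"m2", "m7"},
            "webarchive.org.uk": {"m8"},
            "loc.gov":           {"m3"},
        }
        for n in (1, 2, 3):
            print(f"Recall_TM({n}) = {recall_tm(holdings, n):.3f}")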
       
  • Unsupervised document structure analysis of digital scientific articles
    • Abstract: Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on the resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
      PubDate: 2014-06-08
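      A crude sketch of the first stage described above, building contiguous text blocks from word bounding boxes by merging vertically adjacent words; the coordinates are hypothetical and the real pipeline uses unsupervised techniques well beyond this single heuristic.

        def group_into_blocks(words, line_gap=6.0):
            """Merge word boxes (x0, y0, x1, y1, text), read top to bottom, into
            contiguous text blocks whenever the vertical gap is small."""
            words = sorted(words, key=lambda w: w[1])
            blocks, current = [], [words[0]]
            for w in words[1:]:
                if w[1] - current[-1][3] <= line_gap:   # close enough to the previous line
                    current.append(w)
                else:
                    blocks.append(current)
                    current = [w]
            blocks.append(current)
            return [" ".join(w[4] for w in b) for b in blocks]

        page_words = [                                  # hypothetical boxes, in points
            (72, 90, 160, 102, "1"), (165, 90, 330, 102, "Introduction"),
            (72, 110, 300, 122, "Digital libraries hold large"),
            (72, 124, 310, 136, "collections of scientific articles."),
            (72, 400, 250, 412, "2 Related work"),
        ]
        print(group_into_blocks(page_words))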
       
  • Sustainability of digital libraries: a conceptual model and a research framework
    • Abstract: This paper aims to develop a conceptual model and a research framework for the study of the economic, social and environmental sustainability of digital libraries. The major factors related to the economic, social and environmental sustainability of digital libraries have been identified. Relevant research in digital information systems and services in general, and digital libraries in particular, is discussed to illustrate the issues and challenges associated with each of the three forms of sustainability. Based on the discussion of relevant research that has implications for the sustainability of information systems and services, the paper proposes a conceptual model and a theoretical research framework for the study of the sustainability of digital libraries. It shows that sustainable business models to support digital libraries should also support equitable access, underpinned by specific design and usability guidelines that facilitate easier, better and cheaper access; support the personal, institutional and social culture of users; and at the same time conform to the policy and regulatory frameworks of the respective regions, countries and institutions. It is also shown that measures taken to improve economic and social sustainability should also support the environmental sustainability guidelines, i.e. reduce the overall environmental impact of digital libraries. It is argued that the various factors affecting the different sustainability issues of digital libraries need to be studied together to build digital libraries that are economically, socially and environmentally sustainable.
      PubDate: 2014-06-07
       
  • Introduction to the focused issue on the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013)
    • PubDate: 2014-06-06
       
  • Who and what links to the Internet Archive
    • Abstract: The Internet Archive’s (IA) Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on Web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by the European languages. Most human users come to Web archives because they do not find the requested pages on the live Web. About 65 % of the requested archived pages no longer exist on the live Web. We find that more than 82 % of human sessions connect to the Wayback Machine via referrals from other Web sites, while only 15 % of robots have referrers. Most of the links (86 %) from Websites are to individual archived pages at specific points in time, and of those 83 % no longer exist on the live Web. Finally, we find that users who come from search engines browse more pages than users who come from external Web sites.
      PubDate: 2014-04-23
       
  • Metadata management, interoperability and Linked Data publishing support for Natural History Museums
    • Abstract: Natural history museums (NHMs) form a rich source of knowledge about Earth’s biodiversity and natural history. However, an impressive abundance of high-quality scientific content available in NHMs around Europe remains largely unexploited due to a number of barriers, such as the lack of interconnection and interoperability between the management systems used by museums, the lack of centralized access through a European point of reference such as Europeana and the inadequacy of the current metadata and content organization. The Natural Europe project offers a coordinated solution at European level that aims to overcome those barriers. In this article, we present the architecture, deployment and evaluation of the Natural Europe infrastructure allowing the curators to publish, semantically describe and manage the museums’ cultural heritage objects, as well as disseminate them to Europeana.eu and BioCASE/GBIF. Additionally, we discuss the methodology followed for the transition of the infrastructure to the Semantic Web and the publishing of NHMs’ cultural heritage metadata as Linked Data, supporting the Europeana Data Model.
      PubDate: 2014-04-11
       
  • A system for high quality crowdsourced indigenous language transcription
    • Abstract: In this article, a crowdsourcing method is proposed to transcribe manuscripts from the Bleek and Lloyd Collection, where non-expert volunteers transcribe pages of the handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts have been made to convert the approximately 20,000 pages of text to a machine-readable form using machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. This article presents details of the system used to enable transcription by volunteers as well as results from experiments that were conducted to determine the quality and consistency of transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80 % for Xam text and 95 % for English text. When the Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75 %, which exceeded that in previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
      PubDate: 2014-04-11
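      One plausible way to measure inter-transcriber agreement on a page is average pairwise character-level similarity; the sketch below uses difflib for that purpose. The example strings are invented and the paper's exact agreement measure is not stated in the abstract.

        from difflib import SequenceMatcher
        from itertools import combinations

        def pairwise_agreement(transcriptions):
            """Average character-level similarity over all pairs of transcriptions
            of the same notebook page."""
            ratios = [SequenceMatcher(None, a, b).ratio()
                      for a, b in combinations(transcriptions, 2)]
            return sum(ratios) / len(ratios)

        page_transcriptions = [
            "the lion said to the man that he would return",
            "the lion said to the man that he wou1d return",
            "the lion sayd to the man that he would return",
        ]
        print(f"agreement: {pairwise_agreement(page_transcriptions):.2f}")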
       
  • Word occurrence based extraction of work contributors from statements of responsibility
    • Abstract: This paper addresses the identification of all contributors to an intellectual work, when they are recorded in bibliographic data but in unstructured form. National bibliographies are very reliable in representing the first author of a work; however, secondary contributors are frequently represented only in the statements of responsibility that are transcribed by the cataloguer from the book into the bibliographic records. The identification of work contributors mentioned in statements of responsibility is a typical motivation for the application of information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure being deployed in several European countries to assist in the determination of the copyright status of works that may not be in the public domain. An evaluation of our approach was performed on the catalogues of nine European national libraries from countries covered by the ARROW rights infrastructure, spanning eight different languages. The evaluation has shown that it performs reliably across languages and bibliographic datasets. It achieved an overall precision of 98.7 % and recall of 96.7 %.
      PubDate: 2014-04-05
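      A deliberately simplified sketch of pulling candidate contributors out of a statement of responsibility with a few English role cues and a regular expression. The real approach is word-occurrence based, works across eight languages, and is far more robust than this.

        import re

        ROLE_PATTERN = re.compile(
            r"(?:translated|edited|illustrated|compiled)\s+by\s+"
            r"([A-Z][\w.'-]+(?:\s+[A-Z][\w.'-]+)*)"
        )

        def extract_contributors(statement):
            """Return candidate secondary contributors named after a role cue."""
            return ROLE_PATTERN.findall(statement)

        sor = "by Fyodor Dostoevsky ; translated by Constance Garnett ; edited by A. Smith"
        print(extract_contributors(sor))     # ['Constance Garnett', 'A. Smith']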
       
 
 