International Journal on Digital Libraries
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
Published by Springer-Verlag [2208 journals] [SJR: 0.649] [H-I: 22]
• Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection
• Abstract: We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at their core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to the same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.
PubDate: 2014-08-01

• Profiling web archive coverage for top-level domain and content language
• Abstract: The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) by only sending queries to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and fulltext queries to archives) and use these profiles as resource descriptors. These profiles are used in matching the URI-lookup requests to the most probable web archives. We define $Recall_{TM}(n)$ as the percentage of a TimeMap that was returned using $n$ web archives. We discover that only sending queries to the top three web archives (i.e., 80 % reduction in the number of queries) for any request reaches on average $Recall_{TM}=0.96$. If we exclude the Internet Archive from the list, we can reach $Recall_{TM}=0.647$ on average using only the remaining top three web archives.
PubDate: 2014-06-27
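
The $Recall_{TM}(n)$ measure defined in the abstract above can be illustrated with a minimal sketch. It assumes each archive's holdings for one URI are available as a set of memento identifiers and that a ranking of archives has already been derived from the profiles; the archive names, the memento identifiers, and the function `recall_tm` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def recall_tm(holdings: Dict[str, Set[str]], ranked_archives: List[str], n: int) -> float:
    """Fraction of the full TimeMap recovered by querying only the top-n archives.

    holdings: hypothetical map from archive name to the mementos it holds for one URI.
    ranked_archives: archives ordered from most to least likely to hold the page.
    """
    # The full TimeMap is the union of what every archive would return.
    full = set().union(*holdings.values()) if holdings else set()
    if not full:
        return 0.0
    # Collect the mementos returned by the n highest-ranked archives only.
    found: Set[str] = set()
    for archive in ranked_archives[:n]:
        found |= holdings.get(archive, set())
    return len(found) / len(full)


# Toy data (invented for illustration): four archives, ten distinct mementos.
holdings = {
    "archive_a": {"m1", "m2", "m3", "m4", "m5", "m6"},
    "archive_b": {"m6", "m7"},
    "archive_c": {"m8"},
    "archive_d": {"m9", "m10"},
}
ranking = ["archive_a", "archive_b", "archive_c", "archive_d"]

print(recall_tm(holdings, ranking, 1))  # 0.6
print(recall_tm(holdings, ranking, 2))  # 0.7
print(recall_tm(holdings, ranking, 4))  # 1.0
```

In this toy case, querying two of four archives halves the number of queries while still returning 70 % of the TimeMap, which is the kind of trade-off the abstract reports at larger scale.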

• Unsupervised document structure analysis of digital scientific articles
• Abstract: Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
PubDate: 2014-06-08

• Sustainability of digital libraries: a conceptual model and a research framework
• Abstract: This paper aims to develop a conceptual model and a research framework for the study of the economic, social and environmental sustainability of digital libraries. The major factors that are related to the economic, social and environmental sustainability of digital libraries have been identified. Relevant research in digital information systems and services in general, and digital libraries in particular, has been discussed to illustrate different issues and challenges associated with each of the three forms of sustainability. Based on the discussions of relevant research that has implications for the sustainability of information systems and services, the paper proposes a conceptual model and a theoretical research framework for the study of the sustainability of digital libraries. It shows that the sustainable business models to support digital libraries should also support equitable access supported by specific design and usability guidelines that facilitate easier, better and cheaper access; support the personal, institutional and social culture of users; and at the same time conform with the policy and regulatory frameworks of the respective regions, countries and institutions. It is also shown that measures taken to improve the economic and social sustainability should also support the environmental sustainability guidelines, i.e. reduce the overall environmental impact of digital libraries. It is argued that the various factors affecting the different sustainability issues of digital libraries need to be studied together to build digital libraries that are economically, socially and environmentally sustainable.
PubDate: 2014-06-07

• Introduction to the focused issue on the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013)
PubDate: 2014-06-06

• Who and what links to the Internet Archive
• Abstract: The Internet Archive’s (IA) Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on Web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by the European languages. Most human users come to Web archives because they do not find the requested pages on the live Web. About 65 % of the requested archived pages no longer exist on the live Web. We find that more than 82 % of human sessions connect to the Wayback Machine via referrals from other Web sites, while only 15 % of robots have referrers. Most of the links (86 %) from Websites are to individual archived pages at specific points in time, and of those 83 % no longer exist on the live Web. Finally, we find that users who come from search engines browse more pages than users who come from external Web sites.
PubDate: 2014-04-23

for Natural History Museums
• Abstract: Natural history museums (NHMs) form a rich source of knowledge about Earth’s biodiversity and natural history. However, an impressive abundance of high-quality scientific content available in NHMs around Europe remains largely unexploited due to a number of barriers, such as the lack of interconnection and interoperability between the management systems used by museums, the lack of centralized access through a European point of reference such as Europeana and the inadequacy of the current metadata and content organization. The Natural Europe project offers a coordinated solution at the European level that aims to overcome those barriers. In this article, we present the architecture, deployment and evaluation of the Natural Europe infrastructure, which allows curators to publish, semantically describe and manage the museums’ cultural heritage objects, as well as disseminate them to Europeana.eu and BioCASE/GBIF. Additionally, we discuss the methodology followed for the transition of the infrastructure to the Semantic Web and the publishing of NHMs’ cultural heritage metadata as Linked Data, supporting the Europeana Data Model.
PubDate: 2014-04-11

• A system for high quality crowdsourced indigenous language transcription
• Abstract: In this article, a crowdsourcing method is proposed to transcribe manuscripts from the Bleek and Lloyd Collection, where non-expert volunteers transcribe pages of the handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts have been made to convert the approximately 20,000 pages of text to a machine-readable form using machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. This article presents details of the system used to enable transcription by volunteers as well as results from experiments that were conducted to determine the quality and consistency of transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80 % for Xam text and 95 % for English text. When the Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75 %, which exceeds that of previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
PubDate: 2014-04-11

• Word occurrence based extraction of work contributors from statements of responsibility
• Abstract: This paper addresses the identification of all contributors of an intellectual work, when they are recorded in bibliographic data but in unstructured form. National bibliographies are very reliable in representing the first author of a work; however, secondary contributors are frequently represented only in the statements of responsibility that are transcribed by the cataloguer from the book into the bibliographic records. The identification of work contributors mentioned in statements of responsibility is a typical motivation for the application of information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure being deployed in several European countries to assist in the determination of the copyright status of works that may not be in the public domain. An evaluation of our approach was performed on catalogues of nine European national libraries of countries that are available in the ARROW rights infrastructure, which cover eight different languages. The evaluation has shown that it performs reliably across languages and bibliographic datasets. It achieved an overall precision of 98.7 % and recall of 96.7 %.
PubDate: 2014-04-05
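
The precision and recall figures quoted in the abstract above follow the standard set-based definitions. The sketch below shows how such figures could be computed for one bibliographic record, assuming the extracted and gold-standard contributors are available as sets of name strings; the names and the helper `precision_recall` are hypothetical, and the paper's own evaluation procedure may differ.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Standard set-based precision and recall of extracted names against a gold standard."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted names that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard names that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: four names extracted, five contributors in the gold standard.
extracted = {"A. Author", "B. Editor", "C. Translator", "D. Wrongly Matched"}
gold = {"A. Author", "B. Editor", "C. Translator", "E. Illustrator", "F. Preface Writer"}

print(precision_recall(extracted, gold))  # (0.75, 0.6)
```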

• Improved bibliographic reference parsing based on repeated patterns
• Abstract: Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue that has received considerable attention recently. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. They do not perform well with the wide variety of reference styles used in older, historic publications. Thus, they are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents a generic approach to bibliographic reference parsing, named RefParse, which is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. In addition, our approach learns names of authors, journals, and publishers to increase the accuracy in scenarios where human users double-check parsing results to increase data quality. Our evaluation shows that our approach performs comparably to existing ones with contemporary reference lists and also works well with older ones.
PubDate: 2014-04-01

• Researcher Name Resolver: identifier management system for Japanese researchers
• Abstract: We built a researcher identifier management system called the Researcher Name Resolver (RNR) to assist with the name disambiguation of authors in digital libraries on the Web. RNR, which is designed to cover all researchers in Japan, is a Web-oriented service that can be openly connected with external scholastic systems. We expect it to be widely used for enriched scholarly communications. In this paper, we first outline the conceptual framework of RNR, which is jointly focused on researcher identifier management and Web resource linking. We based our researcher identifier scheme on the reuse of multiple sets of existing researcher identifiers belonging to the Japanese grant database KAKEN and the researcher directory ReaD & Researchmap. Researcher identifiers are associated by direct links to related resources on the Web through a combination of methods, including descriptive mapping, focused crawling on campus directories and researcher identification by matching names and affiliations. Second, we discuss our implementation of RNR based on this framework. Researcher identifiers construct uniform resource identifiers to show Web pages that describe researcher profiles and provide links to related external resources. We have adapted Web-friendly technologies (e.g., OpenSearch and the RDFs of Linked Data technology) in this implementation to provide Web-friendly services. Third, we discuss our application of RNR to a name disambiguation task for the search portal of the Japanese Institutional Repositories Online to determine how well the researcher identifier management system cooperates with external systems. Finally, we discuss lessons learned from the entire project as well as the future development directions we intend to take.
PubDate: 2014-04-01

• Moved but not gone: an evaluation of real-time methods for discovering replacement web pages
• Abstract: Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation realm as well as in Information Retrieval. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, the complexity required to generate them, and their evolution over time. Our least complex single method results in a rediscovery rate of almost $70\,\%$ of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to $77\,\%$. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
PubDate: 2014-04-01

• Harvesting latent and usage-based metadata in a course management system to enrich the underlying educational digital library
• Abstract: In this case study, we demonstrate how, in an integrated digital library and course management system, metadata can be generated using a bootstrapping mechanism. The integration encompasses sequencing of content by teachers and deployment of content to learners. We show that taxonomy term assignments and a recommender system can be based almost solely on usage data (especially correlations on what teachers have put in the same course or assignment). In particular, we show that with minimal human intervention, taxonomy terms, quality measures, and an association ruleset can be established for a large pool of fine-granular educational assets.
PubDate: 2013-11-29

• A vision towards Scientific Communication Infrastructures
• Abstract: The two pillars of modern scientific communication are Data Centers and Research Digital Libraries (RDLs), whose technologies and admin staff support researchers in storing, curating, sharing, and discovering the data and the publications they produce. Being realized to maintain and give access to the results of complementary phases of the scientific research process, such systems are poorly integrated with one another and generally do not rely on the strengths of the other. Today, such a gap hampers achieving the objectives of modern scientific communication, that is, publishing, interlinking, and discovery of all outcomes of the research process, from the experimental and observational datasets to the final paper. In this work, we envision that the construction of “Scientific Communication Infrastructures” is instrumental to bridging the gap. The main goal of these infrastructures is to facilitate interoperability between Data Centers and RDLs and to provide services that simplify the implementation of the large variety of modern scientific communication patterns.
PubDate: 2013-07-14

• The different roles of ‘Design Process Champions’ for digital libraries in African higher education
• Abstract: The concept of design stakeholders is central to effective design of digital libraries. We report on research findings that identified the presence of a key subset of stakeholders which we term ‘design process champions’. Our findings have identified that these champions can change interaction patterns and the eventual output of the other stakeholders (project participants) in the design process of digital library projects. This empirical research is based upon 38 interviews with key stakeholders and a review of documentary evidence in 10 innovative digital library design projects (e.g. mobile clinical libraries) located in three African universities in Kenya, Uganda, and South Africa. Through a grounded theory approach, two different types of the ‘design process champions’ emerged from the data with varying levels of effectiveness in the design process: (i) domain champions and (ii) multidisciplinary champions. The domain champions assume a ‘siloed’ approach of engagement while the multidisciplinary champions take on a participatory engagement throughout the design process. A discussion of the implications of information specialists functioning as domain champions is highlighted. We conclude by suggesting that the multidisciplinary champions’ approach is particularly useful in supporting sustainability of digital library design projects.
PubDate: 2013-03-28

• On the applicability of word sense discrimination on 201 years of modern English
• Abstract: As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users’ information needs. In this paper, we investigate the performance of modern tools and algorithms applied on modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm on all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correct OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves. We increase the number of clusters by 24 %. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied on The Times Archive is comparable to the ground truth.
PubDate: 2013-03-16

• Can the Web turn into a digital library?
• Abstract: There is no doubt that the enormous amounts of information on the WWW are influencing how we work, live, learn and think. However, information on the WWW is in general so chaotic and unreliable, and specific material so often difficult to locate, that it cannot be considered a serious digital library. In this paper we concentrate on the question of how we can retrieve reliable information from the Web, a task that is fraught with problems, but essential if the WWW is supposed to be used as a serious digital library. It turns out that the use of search engines has many dangers. We will point out some of the possible ways in which those dangers can be reduced and dangerous traps can be avoided. Another approach to finding useful information on the Web is to use “classical” resources of information like specialized dictionaries, lexica or encyclopaedias in electronic form, such as the Britannica. Although it seemed for a while that such resources might more or less disappear from the Web due to attempts such as Wikipedia, some of the classical encyclopaedias and specialized offerings have picked up steam again and should not be ignored. They do sometimes suffer from what we will call the “wishy-washy” syndrome explained in this paper. It is interesting to note that Wikipedia, which is also larger than all other encyclopaedias (at least the English version), is less afflicted by this syndrome, yet has some other serious drawbacks. We discuss how those could be avoided and present a system that is halfway between prototype and production system that does take care of many of the aforementioned problems and hence may be a model for further undertakings in turning (part of) the Web into a useable digital library.
PubDate: 2013-03-01

• Symbiosis between the TRECVid benchmark and video libraries at the Netherlands Institute for Sound and Vision
• Abstract: Audiovisual archives are investing in large-scale digitization efforts of their analogue holdings and, in parallel, ingesting an ever-increasing amount of born-digital files in their digital storage facilities. Digitization opens up new access paradigms and boosts re-use of audiovisual content. Query-log analyses show the shortcomings of manual annotation; therefore, archives are complementing these annotations by developing novel search engines that automatically extract information from both the audio and the visual tracks. Over the past few years, the TRECVid benchmark has developed a novel relationship with the Netherlands Institute of Sound and Vision (NISV) which goes beyond the NISV just providing data and use cases to TRECVid. Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the quality of search engines at the NISV and will ultimately help other audiovisual archives to offer more efficient and more fine-grained access to their collections. This paper reports the experiences of NISV in leveraging the activities of the TRECVid benchmark.
PubDate: 2013-03-01

• What impact do healthcare digital libraries have? An evaluation of national resource of infection control at the point of care using the Impact-ED framework
• Abstract: Over the last decade billions of dollars’ worth of investments have been directed into ICT solutions for healthcare. In particular, new evidence-based digital libraries and web portals designed to keep busy clinicians up to date with the latest evidence were created in the UK and US. While usability and performance of digital libraries were widely researched, evaluation of impact did not seem to be sufficiently addressed. This is of major concern for healthcare digital libraries as their success or failure has a direct impact on patients’ health, clinical practice, government policies and funding initiatives. In order to fill this gap, we developed the Impact-ED evaluation framework measuring impact on four dimensions of digital libraries: content, community, services and technology. Applying a triangulation technique we analysed pre- and post-visit questionnaires to assess the clinical query or aim of the visit and subsequent satisfaction with each visit, mapped it against weblogs analysis for each session and triangulated with data from semi-structured interviews. In this paper, we present the complete description of the Impact-ED framework, a definition of the comparative Impact score and application of the framework to a real-world medical digital library, the National Resource of Infection Control (NRIC, http://www.nric.org.uk), to evaluate its impact at the point of care and demonstrate the generalisability of this novel methodology. We analysed the data from a cohort of 53 users who completed the registration questionnaire, of whom 32 completed pre- and post-visit questionnaires, yielding 72 matched sets for analysis; five of these users were interviewed using Dervin’s method. NRIC is generally perceived to be a useful resource, with 93 % of users reporting that it provides relevant information regularly or occasionally ($n=28$), and it provided relevant information in over 65 % of visits ($n=47$). NRIC has a positive impact on user knowledge in over half of visits to the library (52.8 %), with an actual impact score of $I_{\text{a}} = 0.65$, and the study revealed several areas for potential development to increase its impact.
PubDate: 2013-03-01

• Developing a video game metadata schema for the Seattle Interactive Media Museum
• Abstract: As interest in video games increases, so does the need for intelligent access to them. However, traditional organizational systems and standards fall short. To fill this gap, we are collaborating with the Seattle Interactive Media Museum to develop a formal metadata schema for video games. In this paper, we describe how the schema was established from a user-centered design approach and introduce the core elements from our schema. We also discuss the challenges we encountered as we were conducting a domain analysis and cataloging real-world examples of video games. Inconsistent, vague, and subjective sources of information for title, genre, release date, feature, region, language, developer and publisher information confirm the importance of developing a standardized description model for video games.
PubDate: 2013-03-01
