International Journal on Digital Libraries
Hybrid journal (It can contain Open Access articles)
ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
• Who and what links to the Internet Archive
• Abstract: The Internet Archive’s (IA) Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on Web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by pages in European languages. Most human users come to Web archives because they do not find the requested pages on the live Web. About 65% of the requested archived pages no longer exist on the live Web. We find that more than 82% of human sessions connect to the Wayback Machine via referrals from other Web sites, while only 15% of robots have referrers. Most of the links (86%) from Web sites are to individual archived pages at specific points in time, and of those 83% no longer exist on the live Web. Finally, we find that users who come from search engines browse more pages than users who come from external Web sites.
PubDate: 2014-04-23

for Natural History Museums
• Abstract: Natural history museums (NHMs) form a rich source of knowledge about Earth’s biodiversity and natural history. However, an impressive abundance of high-quality scientific content available in NHMs around Europe remains largely unexploited due to a number of barriers, such as the lack of interconnection and interoperability between the management systems used by museums, the lack of centralized access through a European point of reference such as Europeana, and the inadequacy of the current metadata and content organization. The Natural Europe project offers a coordinated solution at the European level that aims to overcome those barriers. In this article, we present the architecture, deployment and evaluation of the Natural Europe infrastructure, which allows curators to publish, semantically describe and manage the museums’ cultural heritage objects, as well as disseminate them to Europeana.eu and BioCASE/GBIF. Additionally, we discuss the methodology followed for the transition of the infrastructure to the Semantic Web and the publishing of NHMs’ cultural heritage metadata as Linked Data, supporting the Europeana Data Model.
PubDate: 2014-04-11

• A system for high quality crowdsourced indigenous language transcription
• Abstract: In this article, a crowdsourcing method is proposed to transcribe manuscripts from the Bleek and Lloyd Collection, where non-expert volunteers transcribe pages of the handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts have been made to convert the approximately 20,000 pages of text to a machine-readable form using machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. This article presents details of the system used to enable transcription by volunteers as well as results from experiments that were conducted to determine the quality and consistency of transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80% for Xam text and 95% for English text. When the Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75%, which exceeds that reported in previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
PubDate: 2014-04-11
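As an illustration of how agreement among transcribers might be quantified, the sketch below averages a character-level similarity ratio over all pairs of transcriptions of the same page. The abstract does not state which agreement measure the authors used; the function name `pairwise_agreement`, the use of Python's `difflib.SequenceMatcher`, and the sample strings are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_agreement(transcriptions):
    """Mean similarity ratio (0.0 to 1.0) over all pairs of transcriptions.

    Hypothetical helper: the similarity measure is an assumption, not the
    measure used in the article.
    """
    pairs = list(combinations(transcriptions, 2))
    if not pairs:
        # Zero or one transcription: nothing to disagree with.
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Three made-up volunteer transcriptions of the same line.
    samples = ["the moon rose", "the moon rose", "the moon rise"]
    print(f"agreement: {pairwise_agreement(samples):.2f}")
```

Under this assumption, a page whose transcriptions score low could be flagged for expert review, which mirrors the abstract's suggestion that agreement can serve as a proxy for accuracy on unseen data.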

• Word occurrence based extraction of work contributors from statements of responsibility
• Abstract: This paper addresses the identification of all contributors of an intellectual work when they are recorded in bibliographic data but in unstructured form. National bibliographies are very reliable in representing the first author of a work; however, secondary contributors are frequently represented only in the statements of responsibility that are transcribed by the cataloguer from the book into the bibliographic records. The identification of work contributors mentioned in statements of responsibility is a typical motivation for the application of information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure, which is being deployed in several European countries to assist in determining the copyright status of works that may not be in the public domain. We evaluated our approach on catalogues of nine European national libraries from countries available in the ARROW rights infrastructure, covering eight different languages. The evaluation has shown that the approach performs reliably across languages and bibliographic datasets. It achieved an overall precision of 98.7% and recall of 96.7%.
PubDate: 2014-04-05

• Improved bibliographic reference parsing based on repeated patterns
• Abstract: Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue that has received considerable attention recently. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. They do not perform well with the wide variety of reference styles used in older, historic publications. Thus, they are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents a generic approach to bibliographic reference parsing, named RefParse, which is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. In addition, our approach learns names of authors, journals, and publishers to increase the accuracy in scenarios where human users double check parsing results to increase data quality. Our evaluation shows that our approach performs comparably to existing ones with contemporary reference lists and also works well with older ones.
PubDate: 2014-04-01

• Researcher Name Resolver: identifier management system for Japanese researchers
• Abstract: We built a researcher identifier management system called the Researcher Name Resolver (RNR) to assist with the name disambiguation of authors in digital libraries on the Web. RNR, which is designed to cover all researchers in Japan, is a Web-oriented service that can be openly connected with external scholastic systems. We expect it to be widely used for enriched scholarly communications. In this paper, we first outline the conceptual framework of RNR, which is jointly focused on researcher identifier management and Web resource linking. We based our researcher identifier scheme on the reuse of multiple sets of existing researcher identifiers belonging to the Japanese grant database KAKEN and the researcher directory ReaD & Researchmap. Researcher identifiers are associated by direct links to related resources on the Web through a combination of methods, including descriptive mapping, focused crawling on campus directories and researcher identification by matching names and affiliations. Second, we discuss our implementation of RNR based on this framework. Researcher identifiers construct uniform resource identifiers to show Web pages that describe researcher profiles and provide links to related external resources. We have adapted Web-friendly technologies (e.g., OpenSearch and the RDFs of Linked Data technology) in this implementation to provide Web-friendly services. Third, we discuss our application of RNR to a name disambiguation task for the search portal of the Japanese Institutional Repositories Online to determine how well the researcher identifier management system cooperates with external systems. Finally, we discuss lessons learned from the entire project as well as the future development directions we intend to take.
PubDate: 2014-04-01

• Moved but not gone: an evaluation of real-time methods for discovering replacement web pages
• Abstract: Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
PubDate: 2014-04-01

• Harvesting latent and usage-based metadata in a course management system to enrich the underlying educational digital library
• Abstract: In this case study, we demonstrate how in an integrated digital library and course management system, metadata can be generated using a bootstrapping mechanism. The integration encompasses sequencing of content by teachers and deployment of content to learners. We show that taxonomy term assignments and a recommender system can be based almost solely on usage data (especially correlations on what teachers have put in the same course or assignment). In particular, we show that with minimal human intervention, taxonomy terms, quality measures, and an association ruleset can be established for a large pool of fine-granular educational assets.
PubDate: 2013-11-29

• A vision towards Scientific Communication Infrastructures
• Abstract: The two pillars of modern scientific communication are Data Centers and Research Digital Libraries (RDLs), whose technologies and administrative staff support researchers in storing, curating, sharing, and discovering the data and the publications they produce. Because they were built to maintain and give access to the results of complementary phases of the scientific research process, such systems are poorly integrated with one another and generally do not rely on each other's strengths. Today, this gap hampers achieving the objectives of modern scientific communication, that is, publishing, interlinking, and discovery of all outcomes of the research process, from the experimental and observational datasets to the final paper. In this work, we envision that the construction of “Scientific Communication Infrastructures” is instrumental to bridging the gap. The main goal of these infrastructures is to facilitate interoperability between Data Centers and RDLs and to provide services that simplify the implementation of the large variety of modern scientific communication patterns.
PubDate: 2013-07-14

• The different roles of ‘Design Process Champions’ for digital libraries in African higher education
• Abstract: The concept of design stakeholders is central to effective design of digital libraries. We report on research findings that identified the presence of a key subset of stakeholders which we term ‘design process champions’. Our findings show that these champions can change interaction patterns and the eventual output of the other stakeholders (project participants) in the design process of digital library projects. This empirical research is based upon 38 interviews with key stakeholders and a review of documentary evidence in 10 innovative digital library design projects (e.g. mobile clinical libraries) located in three African universities in Kenya, Uganda, and South Africa. Through a grounded theory approach, two different types of ‘design process champions’ emerged from the data with varying levels of effectiveness in the design process: (i) domain champions and (ii) multidisciplinary champions. The domain champions assume a ‘siloed’ approach of engagement while the multidisciplinary champions take on a participatory engagement throughout the design process. We discuss the implications of information specialists functioning as domain champions. We conclude by suggesting that the multidisciplinary champions’ approach is particularly useful in supporting sustainability of digital library design projects.
PubDate: 2013-03-28

• On the applicability of word sense discrimination on 201 years of modern English
• Abstract: As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users’ information needs. In this paper, we investigate the performance of modern tools and algorithms applied on modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm on all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correct OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves. We increase the number of clusters by 24%. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied on The Times Archive is comparable to the ground truth.
PubDate: 2013-03-16

• Can the Web turn into a digital library?
• Abstract: There is no doubt that the enormous amounts of information on the WWW are influencing how we work, live, learn and think. However, information on the WWW is in general so chaotic and unreliable, and specific material so often difficult to locate, that it cannot be considered a serious digital library. In this paper we concentrate on the question of how we can retrieve reliable information from the Web, a task that is fraught with problems, but essential if the WWW is supposed to be used as a serious digital library. It turns out that the use of search engines has many dangers. We will point out some of the ways in which those dangers can be reduced and how dangerous traps can be avoided. Another approach to finding useful information on the Web is to use “classical” resources of information like specialized dictionaries, lexica or encyclopaedias in electronic form, such as the Britannica. Although it seemed for a while that such resources might more or less disappear from the Web due to attempts such as Wikipedia, some of the classical encyclopaedias and specialized offerings have picked up steam again and should not be ignored. They do sometimes suffer from what we will call the “wishy-washy” syndrome explained in this paper. It is interesting to note that Wikipedia, which is also larger than all other encyclopaedias (at least the English version), is less afflicted by this syndrome, yet has some other serious drawbacks. We discuss how those could be avoided and present a system, halfway between prototype and production system, that takes care of many of the aforementioned problems and hence may be a model for further undertakings in turning (part of) the Web into a useable digital library.
PubDate: 2013-03-01

• Symbiosis between the TRECVid benchmark and video libraries at the Netherlands Institute for Sound and Vision
• Abstract: Audiovisual archives are investing in large-scale digitization efforts of their analogue holdings and, in parallel, ingesting an ever-increasing amount of born-digital files in their digital storage facilities. Digitization opens up new access paradigms and boosts re-use of audiovisual content. Query-log analyses show the shortcomings of manual annotation; therefore, archives are complementing these annotations by developing novel search engines that automatically extract information from both the audio and the visual tracks. Over the past few years, the TRECVid benchmark has developed a novel relationship with the Netherlands Institute for Sound and Vision (NISV) which goes beyond the NISV just providing data and use cases to TRECVid. Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the quality of search engines at the NISV and will ultimately help other audiovisual archives to offer more efficient and more fine-grained access to their collections. This paper reports the experiences of NISV in leveraging the activities of the TRECVid benchmark.
PubDate: 2013-03-01

• What impact do healthcare digital libraries have? An evaluation of national resource of infection control at the point of care using the Impact-ED framework
• Abstract: Over the last decade billions of dollars’ worth of investments have been directed into ICT solutions for healthcare. In particular, new evidence-based digital libraries and web portals designed to keep busy clinicians up to date with the latest evidence were created in the UK and US. While usability and performance of digital libraries were widely researched, evaluation of impact did not seem to be sufficiently addressed. This is of major concern for healthcare digital libraries as their success or failure has a direct impact on patients’ health, clinical practice, government policies and funding initiatives. In order to fill this gap, we developed the Impact-ED evaluation framework measuring impact on four dimensions of digital libraries: content, community, services and technology. Applying a triangulation technique, we analysed pre- and post-visit questionnaires to assess the clinical query or aim of the visit and subsequent satisfaction with each visit, mapped it against weblogs analysis for each session and triangulated with data from semi-structured interviews. In this paper, we present the complete description of the Impact-ED framework, a definition of the comparative Impact score and application of the framework to a real-world medical digital library, the National Resource of Infection Control (NRIC, http://www.nric.org.uk), to evaluate its impact at the point of care and demonstrate the generalisability of this novel methodology. We analysed data from a cohort of 53 users who completed the registration questionnaire; 32 of them completed pre- and post-visit questionnaires, yielding 72 matched sets for analysis, and five of these users were interviewed using Dervin’s method. NRIC is generally perceived to be a useful resource, with 93% of users reporting that it provides relevant information regularly or occasionally ($n=28$), and it provided relevant information in over 65% of visits ($n=47$). NRIC has a positive impact on user knowledge in over half of visits to the library (52.8%), its actual impact score is $I_{\text{a}} = 0.65$, and the study revealed several areas for potential development to increase its impact.
PubDate: 2013-03-01

• Developing a video game metadata schema for the Seattle Interactive Media Museum
• Abstract: As interest in video games increases, so does the need for intelligent access to them. However, traditional organizational systems and standards fall short. To fill this gap, we are collaborating with the Seattle Interactive Media Museum to develop a formal metadata schema for video games. In the paper, we describe how the schema was established from a user-centered design approach and introduce the core elements from our schema. We also discuss the challenges we encountered as we were conducting a domain analysis and cataloging real-world examples of video games. Inconsistent, vague, and subjective sources of information for title, genre, release date, feature, region, language, developer and publisher information confirm the importance of developing a standardized description model for video games.
PubDate: 2013-03-01

• Archiving the web using page changes patterns: a case study
• Abstract: A pattern is a model or a template used to summarize and describe the behavior (or the trend) of data that generally has some recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these different contexts, discovered patterns were useful to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or to improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to efficiently archive Web sites. We first define our pattern model that describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of French public TV channels France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation based on real Web pages shows the utility of patterns to improve archive quality and to optimize indexing or storing.
PubDate: 2012-12-01

• SharedCanvas: a collaborative model for digital facsimiles
• Abstract: In this article, we present a model based on the principles of Linked Data that can be used to describe the interrelationships of images, texts and other resources to facilitate the interoperability of repositories of medieval manuscripts or other culturally important handwritten documents. The model is designed from a set of requirements derived from the real-world use cases of some of the largest digitized medieval content holders, and instantiations of the model are intended as the description to be provided as input to collection-independent page turning and scholarly presentation interfaces. A canvas painting paradigm, such as in PDF and SVG, was selected due to the lack of a one-to-one correlation between image and page, and to fulfill complex requirements such as when the full text of a page is known, but only fragments of the physical object remain. The model is implemented using technologies such as OAI-ORE Aggregations and Open Annotations, as the fundamental building blocks of emerging Linked Digital Libraries. The model and implementation are evaluated through prototypes of both content providing and consuming applications. Although the system was designed from requirements drawn from the medieval manuscript domain, it is applicable to any layout-oriented presentation of images of text.
PubDate: 2012-12-01

• Interactive context-aware user-driven metadata correction in digital libraries
• Abstract: Personal name variants are a common problem in digital libraries, reducing the precision of searches and complicating browsing-based interaction. The book-centric approach of name authority control has not scaled to match the growth and diversity of digital repositories. In this paper, we present a novel system for user-driven integration of name variants when interacting with web-based information systems, in particular digital library systems. We approach these issues via a client-side JavaScript browser extension that can reorganize web content and also integrate remote data sources. The system is designed to be agnostic towards the web sites it is applied to, and we illustrate the developed proof-of-concept through worked examples using three different digital libraries. We discuss the extensibility of the approach in the context of other user-driven information systems and the growth of the Semantic Web.
PubDate: 2012-12-01

• Joint conference on digital libraries (JCDL) 2011
• PubDate: 2012-12-01

• Automated approaches to characterizing educational digital library usage: linking computational methods with qualitative analyses
• Abstract: The need for automatic methods capable of characterizing adoption and use has grown in operational digital libraries. This paper describes a computational method for producing two inter-related user typologies based on use diffusion. Furthermore, a case study is described that demonstrates the utility and applicability of the method: it is used to understand how middle and high school science teachers participating in an academic year-long field trial adopted and integrated digital library resources into their instructional planning and teaching. Use diffusion theory views technology adoption as a process that can lead to widely different patterns of use across a given population of potential users; these models use measures of frequency and variety to characterize and describe such usage patterns. By using computational techniques such as clickstream entropy and clustering, the method produces both coarse- and fine-grained user typologies. As a part of improving the initial coarse-grained typology, clickstream entropy improvements are described that aim at better separation of users. In addition, a fine-grained user typology is described that identifies five different types of teacher-users, including “interactive resource specialists” and “community seeker specialists.” This typology was validated through comparison with qualitative and quantitative data collected using traditional educational field research methods. Results indicate that qualitative analyses correlate with the computational results, suggesting automatic methods may prove an important tool in discovering valid usage characteristics and user types.
PubDate: 2012-12-01
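To make the notion of clickstream entropy concrete, the sketch below computes the Shannon entropy of the distribution of click types in one user's session, which is one plausible way to capture the "variety" dimension of use diffusion. The abstract does not give the authors' exact definition; the function name `clickstream_entropy` and the click-type labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def clickstream_entropy(clicks):
    """Shannon entropy, in bits, of the click-type distribution.

    Hypothetical helper: 0.0 means every click was of the same type;
    higher values mean usage is spread over more types.
    """
    counts = Counter(clicks)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed click types.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Made-up sessions: one focused user, one exploring user.
    focused = ["search", "search", "search", "search"]
    varied = ["search", "browse", "save", "share"]
    print(clickstream_entropy(focused), clickstream_entropy(varied))
```

Per-user values like these, together with a frequency measure such as the number of clicks, could then be passed to a clustering step to separate user types, which is the general shape of the method the abstract outlines.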
