International Journal on Digital Libraries
  [SJR: 0.203]   [H-I: 24]   [542 followers]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1432-5012 - ISSN (Online) 1432-1300
   Published by Springer-Verlag  [2281 journals]
  • Guest editors’ introduction to the special issue on the digital libraries conference 2014
    • PubDate: 2015-09-01
       
  • On the composition of ISO 25964 hierarchical relations (BTG, BTP, BTI)
    • Abstract: Knowledge organization systems (KOS) can use different types of hierarchical relations: broader generic (BTG), broader partitive (BTP), and broader instantial (BTI). The latest ISO standard on thesauri (ISO 25964) has formalized these relations in a corresponding OWL ontology (De Smedt et al., ISO 25964 part 1: thesauri for information retrieval: RDF/OWL vocabulary, extension of SKOS and SKOS-XL. http://purl.org/iso25964/skos-thes, 2013) and expressed them as properties: broaderGeneric, broaderPartitive, and broaderInstantial, respectively. These relations are used in actual thesaurus data. The compositionality of these types of hierarchical relations has not been investigated systematically yet. They all contribute to the general broader (BT) thesaurus relation and its transitive generalization broader transitive defined in the SKOS model for representing KOS. But specialized relationship types cannot be arbitrarily combined to produce new statements that have the same semantic precision, leading to cases where inference of broader transitive relationships may be misleading. We define Extended properties (BTGE, BTPE, BTIE) and analyze which compositions of the original “one-step” properties and the Extended properties are appropriate. This enables providing the new properties with valuable semantics usable, e.g., for fine-grained information retrieval purposes. In addition, we relax some of the constraints assigned to the ISO properties, namely the fact that hierarchical relationships apply to SKOS concepts only. This allows us to apply them to the Getty Art and Architecture Thesaurus (AAT), where they are also used for non-concepts (facets, hierarchy names, guide terms). In this paper, we present extensive examples derived from the recent publication of AAT as linked open data.
      PubDate: 2015-08-20
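
The entry above turns on which compositions of one-step hierarchical relations still carry precise semantics. As a rough illustration of the mechanism only (not the composition table derived in the paper), the following Python sketch reduces a chain of one-step relations to an extended relation where a rule exists and falls back to the generic broader transitive otherwise; the table entries are assumptions for demonstration.

# Illustrative sketch of composing ISO 25964 hierarchical relations.
# The composition table below is a simplified assumption for demonstration,
# not the table derived in the paper.

ONE_STEP = {"BTG", "BTP", "BTI"}

# (first step, second step) -> extended relation inferred for the composition.
# Pairs absent from the table only support the generic skos:broaderTransitive.
COMPOSITION = {
    ("BTG", "BTG"): "BTGE",   # generic over generic stays generic
    ("BTP", "BTP"): "BTPE",   # a part of a part is still a part
    ("BTI", "BTG"): "BTIE",   # an instance of a class is an instance of its superclass
}

def compose(path):
    """Reduce a chain of one-step relations to the most specific extended relation."""
    result = path[0] + "E" if path[0] in ONE_STEP else path[0]
    for step in path[1:]:
        result = COMPOSITION.get((result.rstrip("E"), step), "broaderTransitive")
        if result == "broaderTransitive":
            break
    return result

# Example: "violin" BTG "chordophone" BTG "musical instrument" -> BTGE
print(compose(["BTG", "BTG"]))   # BTGE
print(compose(["BTP", "BTG"]))   # broaderTransitive (no precise extended relation assumed)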
       
  • A sharing-oriented design strategy for networked knowledge organization systems
    • Abstract: Designers of networked knowledge organization systems often follow a service-oriented design strategy, assuming an organizational model where one party outsources clearly delineated business processes to another party. But the logic of outsourcing is a poor fit for some knowledge organization practices. When knowledge organization is understood as a process of exchange among peers, a sharing-oriented design strategy makes more sense. As an example of a sharing-oriented strategy for designing networked knowledge organization systems, we describe the design of the PeriodO period gazetteer. We analyze the PeriodO data model, its representation using JavaScript Object Notation-Linked Data, and the management of changes to the PeriodO dataset. We conclude by discussing why a sharing-oriented design strategy is appropriate for organizing scholarly knowledge.
      PubDate: 2015-08-14
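
For readers unfamiliar with the JSON-LD representation mentioned above, a minimal sketch of what a period definition in the spirit of the PeriodO data model might look like is shown below; the field names, values, and context URL are illustrative approximations, not the authoritative PeriodO schema.

# A minimal, illustrative period definition in the spirit of the PeriodO
# gazetteer's JSON-LD model. Field names here are approximations for the
# sketch, not the authoritative PeriodO schema.
import json

period = {
    "@context": "http://example.org/periodo-context.jsonld",  # placeholder context URL
    "type": "PeriodDefinition",
    "label": "Early Bronze Age",
    "language": "en",
    "spatialCoverageDescription": "Southern Levant",
    "start": {"label": "3300 BC", "in": {"year": "-3299"}},
    "stop": {"label": "2000 BC", "in": {"year": "-1999"}},
    "source": {"citation": "An illustrative authority, not a real reference"},
}

print(json.dumps(period, indent=2))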
       
  • CRMba: a CRM extension for the documentation of standing buildings
    • Abstract: Exploring the connections between successive phases and overlapping layers from different ages in an ancient building is paramount for its understanding and study. Archaeologists and cultural heritage experts are always eager to unveil the hidden relations of an archaeological building to reconstruct its history and for its interpretation. This paper presents CRMba, a CIDOC CRM extension developed to facilitate the discovery and the interpretation of archaeological resources through the definition of new concepts required to describe the complexity of historic buildings. The CRMba contributes to solving the datasets interoperability issue by exploiting the use of the CIDOC CRM to overcome data fragmentation, to investigate the semantics of building components, of functional spaces and of the construction phases of historic buildings and complexes, making explicit their physical and topological relations through time and space. The approach used for the development of the CRMba makes the model valid for the documentation of different kinds of buildings, across periods, styles and conservation state.
      PubDate: 2015-08-04
       
  • Research-paper recommender systems: a literature survey
    • Abstract: In the last 16 years, more than 200 research articles were published about research-paper recommender systems. We reviewed these articles and present some descriptive statistics in this paper, as well as a discussion about the major advancements and shortcomings and an overview of the most common recommendation concepts and approaches. We found that more than half of the recommendation approaches applied content-based filtering (55 %). Collaborative filtering was applied by only 18 % of the reviewed approaches, and graph-based recommendations by 16 %. Other recommendation concepts included stereotyping, item-centric recommendations, and hybrid recommendations. The content-based filtering approaches mainly utilized papers that the users had authored, tagged, browsed, or downloaded. TF-IDF was the most frequently applied weighting scheme. In addition to simple terms, n-grams, topics, and citations were utilized to model users’ information needs. Our review revealed some shortcomings of the current research. First, it remains unclear which recommendation concepts and approaches are the most promising. For instance, researchers reported different results on the performance of content-based and collaborative filtering. Sometimes content-based filtering performed better than collaborative filtering and sometimes it performed worse. We identified three potential reasons for the ambiguity of the results. (A) Several evaluations had limitations. They were based on strongly pruned datasets, few participants in user studies, or did not use appropriate baselines. (B) Some authors provided little information about their algorithms, which makes it difficult to re-implement the approaches. Consequently, researchers use different implementations of the same recommendation approaches, which might lead to variations in the results. (C) We speculated that minor variations in datasets, algorithms, or user populations inevitably lead to strong variations in the performance of the approaches. Hence, finding the most promising approaches is a challenge. As a second limitation, we noted that many authors neglected to take into account factors other than accuracy, for example overall user satisfaction. In addition, most approaches (81 %) neglected the user-modeling process and did not infer information automatically but let users provide keywords, text snippets, or a single paper as input. Information on runtime was provided for 10 % of the approaches. Finally, few research papers had an impact on research-paper recommender systems in practice. We also identified a lack of authority and long-term research interest in the field: 73 % of the authors published no more than one paper on research-paper recommender systems, and there was little cooperation among different co-author groups. We concluded that several actions could improve the research landscape: developing a common evaluation framework, agreement on the information to include in research papers, a stronger focus on non-accuracy aspects and user modeling, a platform for researchers to exchange information, and an open-source framework that bundles the available recommendation approaches.
      PubDate: 2015-07-26
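
Since content-based filtering with TF-IDF weighting is reported above as the dominant approach, here is a minimal sketch of that idea using scikit-learn; the paper titles are placeholders, and a real system would build the user profile from the papers a researcher has authored, tagged, browsed, or downloaded.

# Minimal content-based filtering sketch: represent papers with TF-IDF and
# rank candidates by cosine similarity to the papers in a user's profile.
# Titles below are placeholders; a real system would use full text or metadata.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

user_papers = [
    "collaborative filtering for research paper recommendation",
    "citation based user modeling in digital libraries",
]
candidates = [
    "graph based recommendation of scholarly articles",
    "deep learning for image classification",
    "content based filtering with tf idf term weighting",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(user_papers + candidates)

# Average the user's papers into a single profile vector, then score candidates.
user_profile = np.asarray(matrix[: len(user_papers)].mean(axis=0))
scores = cosine_similarity(user_profile, matrix[len(user_papers):]).ravel()

for title, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {title}")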
       
  • Knowledge infrastructures in science: data, diversity, and digital libraries
    • Abstract: Digital libraries can be deployed at many points throughout the life cycles of scientific research projects from their inception through data collection, analysis, documentation, publication, curation, preservation, and stewardship. Requirements for digital libraries to manage research data vary along many dimensions, including life cycle, scale, research domain, and types and degrees of openness. This article addresses the role of digital libraries in knowledge infrastructures for science, presenting evidence from long-term studies of four research sites. Findings are based on interviews (n = 208), ethnographic fieldwork, document analysis, and historical archival research about scientific data practices, conducted over the course of more than a decade. The Transformation of Knowledge, Culture, and Practice in Data-Driven Science: A Knowledge Infrastructures Perspective project is based on a 2 × 2 design, comparing two “big science” astronomy sites with two “little science” sites that span physical sciences, life sciences, and engineering, and on dimensions of project scale and temporal stage of life cycle. The two astronomy sites invested in digital libraries for data management as part of their initial research design, whereas the smaller sites made smaller investments at later stages. Role specialization varies along the same lines, with the larger projects investing in information professionals, and smaller teams carrying out their own activities internally. Sites making the largest investments in digital libraries appear to view their datasets as their primary scientific legacy, while other sites stake their legacy elsewhere. Those investing in digital libraries are more concerned with the release and reuse of data; types and degrees of openness vary accordingly. The need for expertise in digital libraries, data science, and data stewardship is apparent throughout all four sites. Examples are presented of the challenges in designing digital libraries and knowledge infrastructures to manage and steward research data.
      PubDate: 2015-07-25
       
  • Representing gazetteers and period thesauri in four-dimensional space–time
    • Abstract: Gazetteers, i.e., lists of place-names, enable having a global vision of places of interest through the assignment of a point, or a region, to a place name. However, such identification of the location corresponding to a place name is often a difficult task. There is no one-to-one correspondence between the two sets, places and names, because of name variants, different names for the same place and homonymy; the location corresponding to a place name may vary in time, changing its extension or even the position; and, in general, there is the imprecision deriving from the association of a concept belonging to language (the place name) to a precise concept (the spatial location). Also for named time periods, e.g., early Bronze Age, which are of current use in archaeology, the situation is similar: they depend on the location to which they refer as the same period may have different time-spans in different locations. The present paper avails of a recent extension of the CIDOC CRM called CRMgeo, which embeds events in a spatio-temporal 4-dimensional framework. The paper uses concepts from CRMgeo and introduces extensions to model gazetteers and period thesauri. This approach enables dealing with time-varying location appellations as well as with space-varying period appellations on a robust basis. For this purpose a refinement/extension of CRMgeo is proposed and a discretization of space and time is used to approximate real space–time extents occupied by events. Such an approach solves the problem and suggests further investigations in various directions.
      PubDate: 2015-07-21
       
  • On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method
    • Abstract: Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but with the burden of having to rely on manually labeled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize such parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive when compared to these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.
      PubDate: 2015-07-07
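
The nearest cluster idea described above assigns each citation record to the most similar existing author cluster or starts a new one. The sketch below illustrates that control flow only; the similarity heuristic, weights, and threshold are placeholders rather than the domain-specific functions tuned in the article.

# Illustrative nearest-cluster assignment for author name disambiguation.
# The similarity heuristic (coauthor and title-term overlap) and the threshold
# are placeholders; the article tunes domain-specific functions per dataset.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(record, cluster):
    coauthor = max(jaccard(record["coauthors"], r["coauthors"]) for r in cluster)
    title = max(jaccard(record["title"].lower().split(), r["title"].lower().split()) for r in cluster)
    return 0.7 * coauthor + 0.3 * title   # weights are arbitrary for this sketch

def assign(record, clusters, threshold=0.2):
    """Put the record in the nearest cluster, or open a new cluster."""
    scored = [(similarity(record, c), c) for c in clusters]
    best = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    if best[0] >= threshold:
        best[1].append(record)
    else:
        clusters.append([record])

clusters = []
assign({"coauthors": ["j smith"], "title": "Digital library evaluation"}, clusters)
assign({"coauthors": ["j smith", "a lee"], "title": "Evaluation of digital libraries"}, clusters)
print(len(clusters))   # 1: the second record joined the first cluster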
       
  • When should I make preservation copies of myself?
    • Abstract: We investigate how different replication policies ranging from least aggressive to most aggressive affect the level of preservation achieved by autonomic processes used by web objects (WOs). Based on simulations of small-world graphs of WOs created by the Unsupervised Small-World algorithm, we report quantitative and qualitative results for graphs ranging in order from 10 to 5000 WOs. Our results show that a moderately aggressive replication policy makes the best use of distributed host resources by not causing spikes in CPU resources nor spikes in network activity while meeting preservation goals. We examine different approaches by which WOs can communicate with each other and determine how long it would take for a message from one WO to reach a specific WO, or all WOs.
      PubDate: 2015-06-21
       
  • Systems integration of heterogeneous cultural heritage information systems in museums: a case study of the National Palace Museum
    • Abstract: This study addresses the process of information systems integration in museums. Research emphasis has concentrated on systems integration in the business community after restructuring of commercial enterprises. Museums fundamentally differ from commercial enterprises and thus cannot wholly rely on the business model for systems integration. A case study of the National Palace Museum in Taiwan was conducted to investigate its systems integration of five legacy systems into one information system for museum and public use. Participatory observation methods were used to collect data for inductive analysis. The results suggested that museums are motivated to integrate their systems by internal cultural and administrative operations, external cultural and creative industries, public expectations, and information technology attributes. Four factors were related to the success of the systems integration project: (1) the unique attributes of a museum’s artifacts, (2) the attributes and needs of a system’s users, (3) the unique demands of museum work, and (4) the attributes of existing information technology resources within a museum. The results provide useful reference data for other museums when they carry out systems integration.
      PubDate: 2015-06-06
       
  • Lost but not forgotten: finding pages on the unarchived web
    • Abstract: Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
      PubDate: 2015-06-03
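
A minimal sketch of the reconstruction idea described above: anchor text of links found in crawled pages is aggregated per unarchived target URL, and page-level evidence is further aggregated to the host level. The input records are invented for illustration.

# Sketch: derive representations for unarchived pages from the anchor text of
# links found in crawled pages, then aggregate page evidence to the host level.
# The input records are made up for illustration.
from collections import defaultdict
from urllib.parse import urlparse

archived = {"http://example.org/"}   # pages the crawl actually captured
links = [  # (source page, target URL, anchor text) extracted from crawled HTML
    ("http://example.org/", "http://lost.example.net/report", "annual report 2009"),
    ("http://example.org/", "http://lost.example.net/report", "report of the society"),
    ("http://example.org/", "http://lost.example.net/about", "about us"),
]

page_repr = defaultdict(list)
for _, target, anchor in links:
    if target not in archived:               # evidence for an unarchived page
        page_repr[target].append(anchor)

host_repr = defaultdict(list)
for url, anchors in page_repr.items():
    host_repr[urlparse(url).netloc].extend(anchors)

print(page_repr["http://lost.example.net/report"])
print(host_repr["lost.example.net"])   # richer, aggregated host-level representation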
       
  • A comprehensive evaluation of scholarly paper recommendation using potential citation papers
    • Abstract: To help generate relevant suggestions for researchers, recommendation systems have started to leverage the latent interests in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, in our former work, we identified “potential citation papers” through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution, we investigated which sections of papers can be leveraged to represent papers effectively. While this initial approach works well for researchers vested in a single discipline, it generates poor predictions for scientists who work on several different topics in the discipline (hereafter, “intra-disciplinary”). We thus extend our previous work in this paper by proposing an adaptive neighbor selection method to overcome this problem in our imputation-based collaborative filtering framework. On a publicly-available scholarly paper recommendation dataset, we show that recommendation accuracy significantly outperforms state-of-the-art recommendation baselines as measured by nDCG and MRR, when using our adaptive neighbor selection method. While recommendation performance is enhanced for all researchers, improvements are more marked for intra-disciplinary researchers, showing that our method does address the targeted audience.
      PubDate: 2015-06-01
       
  • Sifting useful comments from Flickr Commons and YouTube
    • Abstract: Cultural institutions are increasingly contributing content to social media platforms to raise awareness and promote use of their collections. Furthermore, they are often the recipients of user comments containing information that may be incorporated in their catalog records. However, not all user-generated comments can be used for the purpose of enriching metadata records. Judging the usefulness of a large number of user comments is a labor-intensive task. Accordingly, our aim was to provide automated support for curation of potentially useful social media comments on digital objects. In this paper, the notion of usefulness is examined in the context of social media comments and compared from the perspective of both end-users and expert users. A machine-learning approach is then introduced to automatically classify comments according to their usefulness. This approach uses syntactic and semantic comment features while taking user context into consideration. We present the results of an experiment we conducted on user comments collected from Flickr Commons collections and YouTube. A study is then carried out on the correlation between the commenting culture of a platform (YouTube and Flickr) and usefulness prediction. Our findings indicate that a few relatively straightforward features can be used for inferring useful comments. However, the influence of features on usefulness classification may vary according to the commenting cultures of platforms.
      PubDate: 2015-06-01
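
As a rough illustration of the classification task described above, the sketch below extracts a few shallow features from comments and trains a logistic regression model; the features and the tiny training set are placeholders, not the feature set or data used in the paper.

# Sketch of classifying social media comments as useful for metadata enrichment.
# The features and the tiny training set are placeholders for illustration.
from sklearn.linear_model import LogisticRegression

def features(comment):
    tokens = comment.split()
    return [
        len(tokens),                                   # longer comments often carry facts
        int("http" in comment),                        # links to external sources
        int(any(ch.isdigit() for ch in comment)),      # dates, identifiers
        int("?" in comment),                           # questions are rarely enrichment
    ]

train_comments = [
    ("This is the No. 4 locomotive, built in 1923 at the Vulcan Foundry.", 1),
    ("Taken near the old harbour, see http://example.org/source", 1),
    ("lol nice pic", 0),
    ("what camera did you use?", 0),
]
X = [features(c) for c, _ in train_comments]
y = [label for _, label in train_comments]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("Shot in 1931 during the flood, see the town archive.")]))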
       
  • A generalized topic modeling approach for automatic document annotation
    • Abstract: Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have not only increased in amount, but also in diversity and heterogeneity of the data sources that spread throughout the world. This heterogeneity poses a huge challenge for scientists who have to manually search for desired data. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories, and makes them searchable. However, harvested metadata records sometimes are poorly annotated or lacking meaningful keywords, which could impede effective retrieval. We propose a methodology that learns the annotation from well-annotated collections of metadata records to automatically annotate poorly annotated ones. The problem is first transformed into the tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. The experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the natures of different datasets. We also discuss relevant topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
      PubDate: 2015-06-01
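
The annotation task above is cast as tag recommendation against a controlled tag library. The sketch below is a simplified stand-in: well-annotated records vote tags onto a poorly annotated one in proportion to their textual similarity. The paper's own methods are topic-model based; this neighbor-voting variant only illustrates the task setup, and the records shown are invented.

# Simplified tag recommendation sketch: nearest well-annotated records vote on
# tags from a controlled library. The paper's actual methods are topic-model
# based; this neighbor-voting variant only illustrates the overall task.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

annotated = [
    ("stream chemistry and nitrate concentrations in forested watersheds", {"water quality", "nitrogen"}),
    ("soil carbon flux measurements under elevated co2", {"soil", "carbon cycle"}),
    ("long term nitrate deposition in alpine lakes", {"water quality", "nitrogen", "lakes"}),
]
query = "nitrate levels measured in mountain lake inflows"   # poorly annotated record

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform([text for text, _ in annotated] + [query])
sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

votes = Counter()
for sim, (_, tags) in zip(sims, annotated):
    for tag in tags:
        votes[tag] += sim

print(votes.most_common(3))   # recommended tags for the poorly annotated record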
       
  • Information-theoretic term weighting schemes for document clustering and classification
    • Abstract: We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB) which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF) which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performances of the proposed methods compared to classic TF*IDF. Particularly, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
      PubDate: 2015-06-01
       
  • Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive
    • Abstract: When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
      PubDate: 2015-06-01
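
The two policies compared above can be illustrated with a small simulation: at each step of a walk the memento closest to the current target datetime is chosen, and the Sliding policy then moves the target to that memento while the Sticky policy keeps the original target. The memento datetimes below are random and the resulting numbers are not the paper's results.

# Sketch: compare Sliding vs. Sticky target-datetime policies on random walks.
# Each page has a sparse set of memento datetimes (days since an epoch, made up);
# at every step the memento closest to the current target is selected.
import random

def walk(policy, steps=20, mementos_per_page=5, horizon=5000):
    target = original = random.randrange(horizon)
    drift = 0
    for _ in range(steps):
        mementos = [random.randrange(horizon) for _ in range(mementos_per_page)]
        chosen = min(mementos, key=lambda m: abs(m - target))
        drift = abs(chosen - original)
        if policy == "sliding":
            target = chosen          # archive UI behaviour: target follows the page shown
        # "sticky" keeps the originally selected datetime as the target
    return drift

random.seed(0)
for policy in ("sliding", "sticky"):
    mean_drift = sum(walk(policy) for _ in range(2000)) / 2000
    print(policy, round(mean_drift, 1), "days of drift after 20 steps")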
       
  • Bridging the gap between real world repositories and scalable preservation environments
    • Abstract: Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved to be a daunting task. In this paper, we will show how this integration can be achieved using software developed in the scalable preservation environments (SCAPE) project, and also how it can be achieved using a local, more direct implementation at the Danish State and University Library inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is to perform feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with various-sized sets of JP2 images, to test the scalability and correctness of the solution. The first use case considered from the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated from the project specification by hand. We further look at two other use cases: generation of histograms of the tonal distributions of the images; and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when having to perform preservation actions on massive collections stored in traditional digital repositories.
      PubDate: 2015-05-29
       
  • Results of a digital library curriculum field test
    • Abstract: The DL Curriculum Development project was launched in 2006, responding to an urgent need for consensus on DL curriculum across the fields of computer science and information and library science. Over the course of several years, 13 modules of a digital libraries (DL) curriculum were developed and were ready for field testing. The modules were evaluated in DL courses in real classroom environments in 37 classes by 15 instructors and their students. Interviews with instructors and questionnaires completed by their students were used to collect evaluative feedback. Findings indicate that the modules have been well designed to educate students on important topics and issues in DLs, in general. Suggestions to improve the modules based on the interviews and questionnaires were discussed as well. After the field test, module development has continued, not only for the DL community but also for related areas such as information retrieval, big data, and multimedia. Currently, 56 modules are readily available for use through the project website or the Wikiversity site.
      PubDate: 2015-05-20
       
  • Introduction to the focused issue of award-nominated papers from JCDL 2013
    • PubDate: 2015-05-14
       
  • Not all mementos are created equal: measuring the impact of missing resources
    • Abstract: Web archives do not always capture every resource on every page that they attempt to archive. This results in archived pages missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users’ perceptions of damage are not accurately estimated by the proportion of missing embedded resources. In fact, the proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that provides closer alignment to Web user perception, improving overall agreement with users on memento damage by 17 %, and by 51 % if the mementos have a damage rating delta greater than 0.30. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from a damage rating of 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time. Alternatively, the damage in WebCite is increasing over time (going from 0.375 in 2007 to 0.475 in 2014), while the missing embedded resources remain constant (13 % of the resources are missing on average). Finally, we investigate the impact of JavaScript on the damage of the archives, showing that a crawler that can archive JavaScript-dependent representations will reduce memento damage by 13.5 %.
      PubDate: 2015-05-06
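
The core idea above, that a flat proportion of missing embedded resources misstates damage, can be sketched as a weighted rating in which each missing resource contributes according to an importance weight. The weights by resource type below are illustrative assumptions, not the algorithm evaluated in the paper.

# Sketch of a weighted damage rating for an archived page: missing embedded
# resources contribute according to an importance weight rather than a flat
# count. The type weights below are illustrative, not the paper's algorithm.
TYPE_WEIGHT = {"stylesheet": 3.0, "image": 1.0, "multimedia": 2.0, "script": 0.5}

def damage(resources):
    """resources: list of (type, is_missing) for the page's embedded resources."""
    total = sum(TYPE_WEIGHT.get(rtype, 1.0) for rtype, _ in resources)
    missing = sum(TYPE_WEIGHT.get(rtype, 1.0) for rtype, lost in resources if lost)
    return missing / total if total else 0.0

page = [("stylesheet", True), ("image", False), ("image", False), ("multimedia", True)]
print(round(damage(page), 2))   # 0.71: high damage although only 2 of 4 resources are missing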
       
 
 