International Journal on Digital Libraries
  [SJR: 0.203]   [H-I: 24]   [545 followers]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1432-1300 - ISSN (Online) 1432-5012
   Published by Springer-Verlag  [2292 journals]
  • When should I make preservation copies of myself?
    • Abstract: We investigate how different replication policies, ranging from least to most aggressive, affect the level of preservation achieved by autonomic processes used by web objects (WOs). Based on simulations of small-world graphs of WOs created by the Unsupervised Small-World algorithm, we report quantitative and qualitative results for graphs ranging in order from 10 to 5000 WOs. Our results show that a moderately aggressive replication policy makes the best use of distributed host resources, causing spikes in neither CPU usage nor network activity while meeting preservation goals. We also examine different ways in which WOs can communicate with each other, and determine how long it would take for a message from one WO to reach a specific WO, or all WOs.
      PubDate: 2015-06-21
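The abstract above does not give the replication model itself; the following is a toy sketch (all names, parameters, and the copy-making rule are assumptions, not the paper's algorithm) of how an "aggressiveness" knob can trade peak host activity against time-to-goal when web objects replicate themselves:

```python
import random

def simulate_replication(n_objects, target_copies, aggressiveness, steps, seed=0):
    """Toy model: each web object tracks its replica count and, with
    probability `aggressiveness` per step, makes one preservation copy
    until `target_copies` is reached. Returns (fraction of objects that
    met the goal, peak copies made in any single step)."""
    rng = random.Random(seed)
    copies = [1] * n_objects          # every object starts as a single copy
    peak_activity = 0                 # proxy for CPU/network spikes
    for _ in range(steps):
        activity = 0
        for i in range(n_objects):
            if copies[i] < target_copies and rng.random() < aggressiveness:
                copies[i] += 1        # replicate onto another host
                activity += 1
        peak_activity = max(peak_activity, activity)
    met_goal = sum(c >= target_copies for c in copies) / n_objects
    return met_goal, peak_activity

# A moderate policy spreads replication over time, avoiding a spike in
# which every object replicates at once.
moderate = simulate_replication(500, 4, 0.2, 40)
aggressive = simulate_replication(500, 4, 1.0, 40)
```

In this toy setting the fully aggressive policy causes a large first-step activity spike, while the moderate policy still (almost always) meets the preservation goal within the simulated horizon.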
  • Systems integration of heterogeneous cultural heritage information systems
           in museums: a case study of the National Palace Museum
    • Abstract: This study addresses the process of information systems integration in museums. Research emphasis has concentrated on systems integration in the business community after the restructuring of commercial enterprises. Museums fundamentally differ from commercial enterprises and thus cannot wholly rely on the business model for systems integration. A case study of the National Palace Museum in Taiwan was conducted to investigate its integration of five legacy systems into one information system for museum and public use. Participatory observation methods were used to collect data for inductive analysis. The results suggest that museums are motivated to integrate their systems by internal cultural and administrative operations, external cultural and creative industries, public expectations, and information technology attributes. Four factors were related to the success of the systems integration project: (1) the unique attributes of a museum’s artifacts, (2) the attributes and needs of a system’s users, (3) the unique demands of museum work, and (4) the attributes of existing information technology resources within a museum. The results provide useful reference data for other museums carrying out systems integration.
      PubDate: 2015-06-06
  • Lost but not forgotten: finding pages on the unarchived web
    • Abstract: Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
      PubDate: 2015-06-03
  • A comprehensive evaluation of scholarly paper recommendation using
           potential citation papers
    • Abstract: To help generate relevant suggestions for researchers, recommendation systems have started to leverage the latent interests in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, in our earlier work we identified “potential citation papers” through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution we investigated which sections of papers can be leveraged to represent papers effectively. While this initial approach works well for researchers vested in a single discipline, it generates poor predictions for scientists who work on several different topics within the discipline (hereafter, “intra-disciplinary”). In this paper we therefore extend our previous work by proposing an adaptive neighbor selection method to overcome this problem in our imputation-based collaborative filtering framework. On a publicly available scholarly paper recommendation dataset, we show that our adaptive neighbor selection method significantly outperforms state-of-the-art recommendation baselines as measured by nDCG and MRR. While recommendation performance is enhanced for all researchers, improvements are more marked for intra-disciplinary researchers, showing that our method does address the targeted audience.
      PubDate: 2015-06-01
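The imputation step the abstract builds on can be illustrated with a minimal user-based collaborative filtering sketch (the adaptive neighbor selection itself is not described in the abstract and is not reproduced; the data layout and the overlap similarity here are assumptions for illustration):

```python
def impute_profile(matrix, target_index, k=2):
    """Rows are binary citation profiles (1 = researcher cites paper i).
    Missing entries of the target profile are filled in when any of the
    k most similar neighbours cites that paper."""
    target = matrix[target_index]

    def overlap(row):
        # similarity = number of co-cited papers
        return sum(a and b for a, b in zip(target, row))

    neighbours = sorted(
        (row for i, row in enumerate(matrix) if i != target_index),
        key=overlap, reverse=True)[:k]
    return [v or int(any(n[i] for n in neighbours))
            for i, v in enumerate(target)]

profiles = [
    [1, 1, 0, 0],   # target researcher
    [1, 1, 1, 0],   # closest neighbour (2 co-citations)
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
imputed = impute_profile(profiles, 0)
```

Choosing `k` per researcher, rather than fixing it globally, is roughly where an adaptive neighbor selection scheme would intervene.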
  • Sifting useful comments from Flickr Commons and YouTube
    • Abstract: Cultural institutions are increasingly contributing content to social media platforms to raise awareness of and promote the use of their collections. Furthermore, they are often the recipients of user comments containing information that may be incorporated into their catalog records. However, not all user-generated comments can be used for the purpose of enriching metadata records, and judging the usefulness of a large number of user comments is a labor-intensive task. Accordingly, our aim was to provide automated support for the curation of potentially useful social media comments on digital objects. In this paper, the notion of usefulness is examined in the context of social media comments and compared from the perspectives of both end-users and expert users. A machine-learning approach is then introduced to automatically classify comments according to their usefulness. This approach uses syntactic and semantic comment features while taking user context into consideration. We present the results of an experiment we conducted on user comments collected from Flickr Commons collections and YouTube. A study is then carried out on the correlation between the commenting culture of a platform (YouTube or Flickr) and usefulness prediction. Our findings indicate that a few relatively straightforward features can be used for inferring useful comments. However, the influence of features on usefulness classification may vary according to the commenting cultures of the platforms.
      PubDate: 2015-06-01
  • A generalized topic modeling approach for automatic document annotation
    • Abstract: Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have increased not only in amount, but also in the diversity and heterogeneity of the data sources, which are spread throughout the world. This heterogeneity poses a huge challenge for scientists who have to manually search for desired data. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories, and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library, and two variants of an algorithm for automatic tag recommendation are then presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss relevant topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
      PubDate: 2015-06-01
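The task set-up (not the paper's algorithms, which the abstract does not detail) can be sketched as nearest-neighbour tag recommendation: well-annotated records vote tags from a controlled library onto a poorly annotated one. All record contents and tag names below are made up for illustration:

```python
from collections import Counter

def recommend_tags(doc_terms, annotated, k=2):
    """Score each well-annotated record by term overlap with the poorly
    annotated document, then pool the tags of the top-k matches."""
    ranked = sorted(annotated,
                    key=lambda rec: len(set(doc_terms) & set(rec["terms"])),
                    reverse=True)
    votes = Counter()
    for rec in ranked[:k]:
        votes.update(rec["tags"])
    return [tag for tag, _ in votes.most_common()]

library = [
    {"terms": ["soil", "carbon", "flux"], "tags": ["soil science", "carbon cycle"]},
    {"terms": ["ocean", "salinity"], "tags": ["oceanography"]},
    {"terms": ["soil", "moisture"], "tags": ["soil science", "hydrology"]},
]
tags = recommend_tags(["soil", "carbon"], library)
```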
  • Information-theoretic term weighting schemes for document clustering and
           classification
    • Abstract: We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate the strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
      PubDate: 2015-06-01
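The abstract does not give the LIB/LIF formulas, so they are not reproduced here; for orientation, the TF*IDF baseline it evaluates against is the standard term weighting below (the toy counts are made up):

```python
import math

def tf_idf(term_counts, doc_freq, n_docs):
    """Classic TF*IDF: term frequency in the document times the log
    inverse document frequency across the collection."""
    return {term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in term_counts.items()}

# "the" appears in every document, so its weight collapses to zero;
# the rarer "entropy" is weighted up despite a lower raw count.
weights = tf_idf({"entropy": 3, "the": 5}, {"entropy": 2, "the": 100}, 100)
```

Like TF*IDF, the LIB/LIF schemes contrast a term's collection-wide (prior) distribution with its in-document (posterior) distribution, but via the least information measure rather than a log-frequency heuristic.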
  • Evaluating sliding and sticky target policies by measuring temporal drift
           in acyclic walks through a web archive
    • Abstract: When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list, and the archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can amount to many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change, as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that drift under the Sliding Target policy increases with the number of walk steps, the number of domains visited, and choice (the number of links available). The Sticky Target policy, however, controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase in drift as choice increases, but this may be caused by other factors. We conclude that, based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
      PubDate: 2015-06-01
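The intuition for why the Sliding Target policy drifts can be shown with a toy walk model (the snapshot-offset distribution and all numbers below are illustrative assumptions, not the paper's experimental set-up):

```python
import random

def walk_drift(policy, steps, seed=1):
    """At each step, the nearest available snapshot to the current
    target datetime is off by a random number of days. 'sliding'
    re-targets to the page just visited, so errors accumulate;
    'sticky' keeps the originally selected datetime throughout.
    Returns the drift (days from the original datetime) at walk end."""
    rng = random.Random(seed)
    original = 0.0            # datetime selected by the user (day 0)
    target = original
    drifts = []
    for _ in range(steps):
        visited = target + rng.uniform(-30, 30)  # nearest snapshot found
        drifts.append(abs(visited - original))
        if policy == "sliding":
            target = visited  # drift compounds step after step
    return drifts[-1]

# Average final drift over 100 simulated 50-step walks per policy.
sliding = sum(walk_drift("sliding", 50, s) for s in range(100)) / 100
sticky = sum(walk_drift("sticky", 50, s) for s in range(100)) / 100
```

Under sticky targeting the final drift stays bounded by the snapshot spacing, while sliding targeting behaves like a random walk whose drift grows with the number of steps, matching the qualitative finding above.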
  • Bridging the gap between real world repositories and scalable preservation
           environments
    • Abstract: Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved a daunting task. In this paper, we show how this integration can be achieved using software developed in the scalable preservation environments (SCAPE) project, and also how it can be achieved using a local, more direct implementation at the Danish State and University Library inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without placing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is to perform feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary, and the feature extraction is done using the Jpylyzer tool. We perform an experiment with various-sized sets of JP2 images to test the scalability and correctness of the solution. The first use case considered for the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated from the project specification by hand. We further look at two other use cases: generation of histograms of the tonal distributions of the images, and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when preservation actions must be performed on massive collections stored in traditional digital repositories.
      PubDate: 2015-05-29
  • Results of a digital library curriculum field test
    • Abstract: The DL Curriculum Development project was launched in 2006, responding to an urgent need for consensus on a DL curriculum across the fields of computer science and information and library science. Over the course of several years, 13 modules of a digital libraries (DL) curriculum were developed and made ready for field testing. The modules were evaluated in real classroom environments across 37 DL classes, by 15 instructors and their students. Interviews with instructors and questionnaires completed by their students were used to collect evaluative feedback. Findings indicate that, in general, the modules are well designed to educate students on important topics and issues in DLs. Suggestions to improve the modules based on the interviews and questionnaires are discussed as well. After the field test, module development continued, not only for the DL community but also for others associated with DLs, such as information retrieval, big data, and multimedia. Currently, 56 modules are readily available for use through the project website or the Wikiversity site.
      PubDate: 2015-05-20
  • Introduction to the focused issue of award-nominated papers from JCDL 2013
    • PubDate: 2015-05-14
  • Not all mementos are created equal: measuring the impact of missing
           resources
    • Abstract: Web archives do not always capture every resource on every page that they attempt to archive. This results in archived pages missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users’ perceptions of damage are not accurately estimated by the proportion of missing embedded resources. In fact, the proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that aligns more closely with Web user perception, improving overall agreement with users on memento damage by 17 %, and by 51 % if the mementos have a damage rating delta > 0.30. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from a damage rating of 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time. In contrast, the damage in WebCite is increasing over time (going from 0.375 in 2007 to 0.475 in 2014), while the missing embedded resources remain constant (13 % of resources are missing on average). Finally, we investigate the impact of JavaScript on the damage of the archives, showing that a crawler that can archive JavaScript-dependent representations would reduce memento damage by 13.5 %.
      PubDate: 2015-05-06
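The core contrast in the abstract, proportion of missing resources versus an importance-weighted damage rating, can be sketched as follows (the weights and the example page are assumptions; the paper's actual algorithm derives importance from richer features than a single number):

```python
def proportion_damage(resources):
    """The naive measure: fraction of embedded resources missing."""
    missing = [r for r in resources if r["missing"]]
    return len(missing) / len(resources)

def weighted_damage(resources):
    """Importance-weighted rating: the missing share of total importance,
    so losing one central image can outweigh several intact icons."""
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total

page = [
    {"weight": 10.0, "missing": True},   # central image, missing
    {"weight": 0.5, "missing": False},   # small decorative icon
    {"weight": 0.5, "missing": False},
    {"weight": 0.5, "missing": False},
]
```

Here only one of four resources is missing (proportion 0.25), yet the weighted rating is far higher because that resource carries most of the page's importance, which is the perceptual mismatch the abstract describes.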
  • Exploring publication metadata graphs with the LODmilla browser and editor
    • Abstract: With the LODmilla browser, we try to support linked data exploration in a generic way, learning from 20 years of web browser evolution as well as from the opinions of scholars who use it as a research exploration tool. In this paper, generic functions for linked open data (LOD) browsing are presented, and we explain what kinds of information search tactics they enable with linked data describing publications. Furthermore, LODmilla also supports the sharing of graph views and the correction of LOD data during browsing.
      PubDate: 2015-05-01
  • Digital field scholarship and the liberal arts: results from a
           2012–13 sandbox
    • Abstract: We summarize a recent multi-institutional collaboration in digital field scholarship involving four liberal arts colleges: Davidson College, Lewis & Clark College, Muhlenberg College, and Reed College. Digital field scholarship (DFS) can be defined as scholarship in the arts and sciences for which field-based research and concepts are significant, and for which digital tools support data collection, analysis, and communication; DFS thus gathers together and extends a wide range of existing scholarship, offering new possibilities for appreciating the connections that define liberal education. Our collaboration occurred as a sandbox, a collective online experiment using a modified WordPress platform (including mapping and other advanced capabilities) built and supported by Lewis & Clark College, with sponsorship provided by the National Institute for Technology in Liberal Education. Institutions selected course-based DFS projects for fall 2012 and/or spring 2013. Projects ranged from documentary photojournalism to home energy efficiency assessment. One key feature was the use of readily available mobile devices and apps for field-based reconnaissance and data collection; another was our public digital scholarship approach, in which student participants shared the process and products of their work via public posts on the DFS website. Descriptive and factor analysis results from anonymous assessment data suggest strong participant response and likely future potential of digital field scholarship across class level and gender. When set into the context of the four institutions that supported the 2012–2013 sandbox, we see further opportunities for digital field scholarship on our and other campuses, provided that an optimal balance is struck between challenges and rewards along technical, pedagogical, and practical axes. Ultimately, digital field scholarship will be judged for its scholarship and for extending the experimental, open-ended inquiry that characterizes liberal education.
      PubDate: 2015-05-01
  • Towards robust tags for scientific publications from natural language
           processing tools and Wikipedia
    • Abstract: In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed; a second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show a qualitatively similar rank–frequency dependence, which is best approximated by a stretched exponential curve. The distribution of the number of distinct tags per document also follows the same form for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.
      PubDate: 2015-05-01
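The two distributions named in the abstract have simple closed forms worth writing out; the parameter values below are illustrative placeholders, not the fitted values from the paper:

```python
import math

def stretched_exponential(rank, a, b, beta):
    """Rank-frequency model f(r) = a * exp(-(r/b)**beta). For beta < 1
    the curve decays more slowly than a pure exponential, giving the
    heavy tail typical of tag frequency data."""
    return a * math.exp(-((rank / b) ** beta))

def neg_binomial_pmf(k, r, p):
    """Negative binomial P(K = k): the reported fit for the number of
    distinct tags per document."""
    return math.comb(k + r - 1, k) * (p ** r) * ((1 - p) ** k)
```

A quick sanity check on these forms: the rank-frequency curve must decrease in rank, and the negative binomial probabilities must sum to one over the support.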
  • The new knowledge infrastructure
    • PubDate: 2015-02-27
  • Introduction to the special issue on digital scholarship
    • PubDate: 2015-02-24
  • What lies beneath?: Knowledge infrastructures in the subseafloor
           biosphere and beyond
    • Abstract: We present preliminary findings from a three-year research project comprising longitudinal qualitative case studies of data practices in four large, distributed, highly multidisciplinary scientific collaborations. This project follows a 2 × 2 research design: two of the collaborations are big science while two are little science, and two have completed data collection activities while two are ramping up data collection. This paper is centered on one of these collaborations, a project bringing together scientists to study subseafloor microbial life. This collaboration is little science, characterized by small teams using small amounts of data to address specific questions. Our case study employs participant observation in a laboratory, interviews (n = 49 to date) with scientists in the collaboration, and document analysis. We present a data workflow that is typical for many of the scientists working in the observed laboratory. In particular, we show that, although this workflow results in datasets apparently similar in form, a large degree of heterogeneity nevertheless exists across scientists in this laboratory in terms of the methods they employ to produce these datasets, even between scientists working on adjacent benches. To date, most studies of data in little science focus on heterogeneity in terms of the types of data produced; this paper adds another dimension of heterogeneity to existing knowledge about data in little science. This additional dimension complicates the task of managing and curating data for subsequent reuse. Furthermore, the nature of the factors that contribute to heterogeneity of methods suggests that this dimension of heterogeneity is a persistent and unavoidable feature of little science.
      PubDate: 2015-02-15
  • VisInfo: a digital library system for time series research data based on
           exploratory search—a user-centered design approach
    • Abstract: To this day, data-driven science is a widely accepted concept in the digital library (DL) context (Hey et al. in The fourth paradigm: data-intensive scientific discovery. Microsoft Research, 2009). In the same way, domain knowledge from information visualization, visual analytics, and exploratory search has found its way into the DL workflow. This trend is expected to continue, considering future DL challenges such as content-based access to new document types, visual search and exploration for information landscapes, or big data in general. To cope with these challenges, DL actors need to collaborate with external specialists from different domains to complement each other and succeed in given tasks such as making research data publicly available. Through these interdisciplinary approaches, the DL ecosystem may contribute to applications focused on data-driven science and digital scholarship. In this work, we present VisInfo (2014), a web-based digital library system (DLS) whose goal is to provide visual access to time series research data. Based on an exploratory search (ES) concept (White and Roth in Synth Lect Inf Concepts Retr Serv 1(1):1–98, 2009), VisInfo first provides a content-based overview visualization of large amounts of time series research data. The system then enables the user to define visual queries by example or by sketch. Finally, VisInfo presents visual-interactive capabilities for the exploration of search results. The development process of VisInfo was based on the user-centered design principle. Experts from computer science, a scientific digital library, usability engineering, and the earth and environmental sciences were involved in an interdisciplinary approach. We report on comprehensive user studies in the requirement analysis phase based on paper prototyping, user interviews, screen casts, and user questionnaires. Heuristic evaluations and two usability testing rounds were applied during the system implementation and deployment phases and certify measurable improvements for our DLS. Based on the lessons learned in VisInfo, we suggest a generalized project workflow that may be applied in related, prospective approaches.
      PubDate: 2014-12-03
  • A pipeline for digital restoration of deteriorating photographic negatives
    • Abstract: Extending work presented at the Second International Workshop on Historical Document Imaging and Processing, we demonstrate a digitization pipeline to capture and restore negatives in low-dynamic-range file formats. The majority of early photographs were captured on acetate-based film. However, these negatives will deteriorate beyond repair even with proper conservation, and no suitable restoration method is available that does not physically alter each negative. In this paper, we present an automatic method to remove various non-linear illumination distortions caused by deteriorating photographic support material. First, using a high-dynamic-range structured-light scanning method, a 2D Gaussian model for light transmission is estimated for each pixel of the negative image. The estimated amplitude at each pixel provides an accurate model of light transmission, but also includes regions of lower transmission caused by damaged areas. Principal component analysis is then used to estimate the photometric error and effectively restore the original illumination information of the negative. A novel tone mapping approach is then used to produce the final restored image. Using both the shift in the Gaussian light stripes between pixels and their variations in standard deviation, a 3D surface estimate is calculated. Experiments on real historical negatives show promising results for widespread implementation in memory institutions.
      PubDate: 2014-11-08
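The per-pixel amplitude estimation step can be illustrated in one dimension (a stand-in sketch: the paper fits a 2D Gaussian from structured-light scans, and the set-up, names, and synthetic data here are assumptions for illustration only):

```python
import math

def gaussian(x, mu, sigma):
    """Unit-amplitude Gaussian profile of a light stripe."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def estimate_amplitude(samples, mu, sigma):
    """Least-squares amplitude A for the model y = A * gaussian(x).
    With one linear parameter, the closed-form solution is
    A = sum(y*g) / sum(g*g)."""
    num = sum(y * gaussian(x, mu, sigma) for x, y in samples)
    den = sum(gaussian(x, mu, sigma) ** 2 for x, _ in samples)
    return num / den

# Synthetic stripe samples for a pixel whose true transmission is 0.7;
# the estimator should recover that amplitude exactly from clean data.
samples = [(x, 0.7 * gaussian(x, 0.0, 2.0)) for x in range(-5, 6)]
amplitude = estimate_amplitude(samples, 0.0, 2.0)
```

In the pipeline described above, such per-pixel amplitudes form the transmission map whose damaged regions are then corrected via PCA.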

JournalTOCs © 2009-2015