Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: In this paper, we compare the performance of several popular pre-trained reference extraction and segmentation toolkits combined in different pipeline configurations on three different datasets. The extraction is end-to-end, i.e. the input is PDF documents, and the output is parsed reference objects. The evaluation is for reference strings and individual fields in the reference objects using alignment by identical fields and close-to-identical values. Our results show that Grobid and AnyStyle perform best of all compared tools, although one may want to use them in combination. Our work is meant to serve as a reference for researchers interested in applying out-of-the-box reference extraction and -parsing tools, for example, as a preprocessing step to a more complex research question. Our detailed results on different datasets with results for individual parsed fields will allow them to focus on aspects that are particularly important to them. PubDate: 2024-06-20
Abstract: This special issue features selected works by authors who presented papers at the 2022 iteration of the Joint Conference on Digital Libraries (JCDL) in Cologne, Germany. The conference, whose motto was “Bridging Worlds”, was run as a fully hybrid event. Ten papers covering all aspects of Digital Libraries, namely Natural Language Processing, Information Retrieval, User Behavior, Scholarly Communication, Classification, and Information Extraction, are included in this issue. PubDate: 2024-06-06 DOI: 10.1007/s00799-024-00407-3
Abstract: Digital Library Systems are widely used in the Higher Education sector, through the use of Institutional Repositories (IRs), to collect, store, manage and make available scholarly research output produced by Higher Education Institutions (HEIs). This wide application of IRs is a direct response to the increase in scholarly research output produced. In order to facilitate discoverability of digital content in IRs, accurate, consistent and comprehensive association of descriptive metadata to digital objects during ingestion into IRs is crucial. However, due to human errors resulting from complex IR ingestion workflows, most digital content in IRs has incorrect and inconsistent descriptive metadata. While there exists a broad spectrum of descriptive metadata elements, subject headings present a classic example of a crucial metadata element that adversely affects discoverability of digital content when incorrectly and inconsistently specified. This paper outlines a case study conducted at an HEI, The University of Zambia, in order to demonstrate the effectiveness of integrating controlled subject vocabularies during the ingestion of digital objects into IRs. A situational analysis was conducted to understand how subject headings are associated with digital objects and to analyse subject headings associated with already ingested digital objects. In addition, an exploratory study was conducted to determine domain-specific subject headings to be integrated with the IR. Furthermore, a usability study was conducted in order to comparatively determine the usefulness of using controlled vocabularies during the ingestion of digital objects into IRs. Finally, multi-label classification experiments were carried out where digital objects were assigned more than one class.
The results of the study revealed that a noticeable amount of digital content is associated with incorrect subject categories and, additionally, with few subject headings: two or fewer subject headings (71.2%), with a significant number of subject headings (92.1%) being associated with a single publication. A comparative study conducted suggests that IRs integrated with controlled vocabularies are perceived to be more usable (SUS Score = 68.9) when compared with IRs without controlled vocabularies (SUS Score = 66.2). Furthermore, the effectiveness of the multi-label arXiv subjects classifier demonstrates the viability of integrating automated techniques for subject classification. PubDate: 2024-06-01
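For readers unfamiliar with the SUS scores quoted above, the following is a minimal sketch of the standard System Usability Scale scoring rule, not the study's own code. It assumes ten responses on a 1 to 5 scale in questionnaire order; the function name `sus_score` is hypothetical.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant.

    `responses` holds the ten answers (1-5) in questionnaire order.
    Odd-numbered items are positively worded, even-numbered items negatively.
    """
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1   # positive item: higher answer is better
        else:
            total += 5 - answer   # negative item: lower answer is better
    return total * 2.5            # rescale the 0-40 sum to 0-100


# A participant answering "neutral" (3) everywhere lands at the midpoint.
print(sus_score([3] * 10))
```

A reported value such as 68.9 would then be the mean of these per-participant scores.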
Abstract: Historians and researchers rely on web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this study, we document and analyze the problems in archiving Twitter after Twitter switched to a new user interface (UI) in June 2020. Most web archives could not archive the new UI, resulting in archived Twitter pages displaying Twitter’s “Something went wrong” error. The challenges in archiving the new UI forced web archives to continue using the old UI. But, features such as Twitter labels were a part of the new UI; hence, web archives archiving Twitter’s old UI would be missing these labels. To analyze the potential loss of information in web archival data due to this change, we used the personal Twitter account of the 45th President of the USA, @realDonaldTrump, which was suspended by Twitter on January 8, 2021. Trump’s account was heavily labeled by Twitter for spreading misinformation; however, we discovered that there is no evidence in web archives to prove that some of his tweets ever had a label assigned to them. We also studied the possibility of temporal violations in archived versions of the new UI, which may result in the replay of pages that never existed on the live web. We also discovered that when some tweets with embedded media are replayed, portions of the rewritten t.co URL, meant to be hidden from the end-user, are partially exposed in the replayed page. Our goal is to educate researchers who may use web archives and caution them when drawing conclusions based on archived Twitter pages. PubDate: 2024-06-01
Abstract: The digitalisation of indigenous knowledge has been challenging considering epistemological differences and the lack of involvement of indigenous people. Drawing from our most recent community projects in Namibia, we share insights on indigenous ecospatial worldviews guiding the design of digital information organization and access of indigenous knowledge. With emerging technologies, such as augmented and virtual reality, offering new opportunities for richer and more meaningful spatial and embodied accounts of indigenous knowledge, we re-imagine digital libraries inclusive of indigenous people and their worldviews. PubDate: 2024-06-01
Abstract: Our civilization creates enormous volumes of digital data, a substantial fraction of which is preserved and made publicly available for present and future usage. Additionally, historical born-analog records are progressively being digitized and incorporated into digital document repositories. While professionals often have a clear idea of what they are looking for in document archives, average users are likely to have no precise search needs when accessing available archives (e.g., through their online interfaces). Thus, if the results are to be relevant and appealing to average people, they should include engaging and recognizable material. However, state-of-the-art document archival retrieval systems essentially use the same approaches as search engines for synchronic document collections. In this article, we develop unique ranking criteria for assessing the usefulness of archived contents based on their estimated relationship with current times, which we call contemporary relevance. Contemporary relevance may be utilized to enhance access to archival document collections, increasing the likelihood that users will discover interesting or valuable material. We next present an effective strategy for estimating contemporary relevance degrees of news articles by using a learning-to-rank approach based on a variety of diverse features, and we then successfully test it on the New York Times news collection. The incorporation of the contemporary relevance computation into archival retrieval systems should enable a new search style in which search results are meant to relate to the context of searchers’ times, and thereby have the potential to engage the archive users. As a proof of concept, we develop and demonstrate a working prototype of a simplified ranking model that operates on top of the Portuguese Web Archive portal (arquivo.pt). PubDate: 2024-06-01
Abstract: Due to the growing number of scholarly publications, finding relevant articles becomes increasingly difficult. Scholarly knowledge graphs can be used to organize the scholarly knowledge presented within those publications and represent them in machine-readable formats. Natural language processing (NLP) provides scalable methods to automatically extract knowledge from articles and populate scholarly knowledge graphs. However, NLP extraction is generally not sufficiently accurate and, thus, fails to generate high-granularity, high-quality data. In this work, we present TinyGenius, a methodology to validate NLP-extracted scholarly knowledge statements using microtasks performed with crowdsourcing. TinyGenius is employed to populate a paper-centric knowledge graph, using five distinct NLP methods. We extend our previous work on the TinyGenius methodology in various ways. Specifically, we discuss the NLP tasks in more detail and include an explanation of the data model. Moreover, we present a user evaluation where participants validate the generated NLP statements. The results indicate that employing microtasks for statement validation is a promising approach despite the varying participant agreement for different microtasks. PubDate: 2024-06-01
Abstract: Retrievability measures the influence a retrieval system has on the access to information in a given collection of items. This measure can help in making an evaluation of the search system based on which insights can be drawn. In this paper, we investigate the retrievability in an integrated search system consisting of items from various categories, particularly focussing on datasets, publications and variables in a real-life digital library. The traditional metrics, that is, the Lorenz curve and Gini coefficient, are employed to visualise the diversity in retrievability scores of the three retrievable document types (specifically datasets, publications, and variables). Our results show a significant popularity bias with certain items being retrieved more often than others. Particularly, it has been shown that certain datasets are more likely to be retrieved than other datasets in the same category. In contrast, the retrievability scores of items from the variable or publication category are more evenly distributed. We have observed that the distribution of document retrievability is more diverse for datasets as compared to publications and variables. PubDate: 2024-06-01
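To make the two metrics named in this abstract concrete, here is a minimal sketch of how a Gini coefficient and Lorenz curve points can be computed from a list of retrievability scores. It is an illustration only, not the authors' implementation; the function names and the sample scores are hypothetical.

```python
def gini(scores):
    """Gini coefficient of non-negative scores: 0 means perfectly even."""
    xs = sorted(scores)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def lorenz_points(scores):
    """(population share, cumulative score share) pairs, starting at (0, 0)."""
    xs = sorted(scores)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return [(0.0, 0.0)]
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


# Hypothetical scores: one item attracts every retrieval, three get none.
skewed = [0, 0, 0, 4]
print(gini(skewed))           # 0.75
print(lorenz_points(skewed))  # curve stays flat until the last item
```

A higher Gini value for one document type than another would correspond to the less even distribution the abstract describes for datasets.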
Abstract: The peer review process is the main academic resource to ensure that science advances and is disseminated. To contribute to this important process, classification models were created to perform two tasks: the review score prediction (RSP) and the paper decision prediction (PDP). But what challenges prevent us from having a fully efficient system responsible for these tasks? And how far are we from having an automated system to take care of these two tasks? To answer these questions, in this work, we evaluated the general performance of existing state-of-the-art models for RSP and PDP tasks and investigated what types of instances these models tend to have difficulty classifying and how impactful they are. We found, for example, that the performance of a model to predict the final decision of a paper is 23.31% lower when it is exposed to difficult instances and that the classifiers make mistakes with very high confidence. These and other results lead us to conclude that there are groups of instances that can negatively impact the model’s performance. Thus, the current state-of-the-art models have the potential to help editors decide whether to approve or reject a paper; however, we are still far from having a system that is fully responsible for scoring a paper and deciding whether it will be accepted or rejected. PubDate: 2024-06-01
Abstract: Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper is an extension of our original work and tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), Pharmacy, and Political Sciences. As an extension, we analyze the extractions in more detail, verify our findings on a second extraction method, discuss another canonicalizing method, and give an outlook on how non-English texts can be handled. We thus report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows. PubDate: 2024-06-01
Abstract: The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Recent work has tried to address this problem by developing methods for automated summarization in the scholarly domain, but concentrated so far only on monolingual settings, primarily English. In this paper, we consequently explore how state-of-the-art neural abstract summarization models based on a multilingual encoder–decoder architecture can be used to enable cross-lingual extreme summaries of scholarly texts. To this end, we compile a new abstractive cross-lingual summarization dataset for the scholarly domain in four different languages, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage pipeline approach that independently summarizes and translates, as well as a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks and analyze performance in zero- and few-shot scenarios. Finally, we investigate how to make our approach more efficient on the basis of knowledge distillation methods, which make it possible to shrink the size of our models, so as to reduce the computational complexity of the summarization inference. PubDate: 2024-06-01
Abstract: When searching within an academic digital library, a variety of information seeking strategies may be employed. The purpose of this study is to determine whether graduate students choose appropriate information seeking strategies for the complexity of a given search scenario and to explore other factors that could influence their decisions. We used a survey method in which participants (\(n=176\)) were asked to recall their most recent instance of an academic digital library search session that matched two given scenarios (randomly chosen from four alternatives) and, for each scenario, identify whether they employed search strategies associated with four different information seeking models. Among the search strategies, only lookup search was used in a manner that was consistent with the complexity of the search scenario. Other factors that influenced the choice of strategy were the discipline of study and the type of academic search training received. Patterns of search tool use with respect to the complexity of the search scenarios were also identified. These findings highlight that not only is it important to train graduate students on how to conduct academic digital library searches, but more work is also needed to train them on matching the information seeking strategies to the complexity of their search tasks and to develop interfaces that guide their search process. PubDate: 2024-06-01
Abstract: Metadata is a key data source for researchers seeking to apply machine learning (ML) to the vast collections of digitized biological specimens that can be found online. Unfortunately, the associated metadata is often sparse and, at times, erroneous. This paper extends previous research conducted with the Illinois Natural History Survey (INHS) collection (7244 specimen images) that uses computational approaches to analyze image quality, and then automatically generates 22 metadata properties representing the image quality and morphological features of the specimens. In the research reported here, we demonstrate the extension of our initial work to the University of Wisconsin Zoological Museum (UWZM) collection (4155 specimen images). Further, we enhance our computational methods in four ways: (1) augmenting the training set, (2) applying contrast enhancement, (3) upscaling small objects, and (4) refining our processing logic. Together these new methods improved our overall error rates from 4.6% to 1.1%. These enhancements also allowed us to compute an additional set of 17 image-based metadata properties. The new metadata properties provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex area, eccentricity, perimeter, and skew. The newly refined process further outperforms humans in terms of time and labor cost, as well as accuracy, providing a novel solution for leveraging digitized specimens with ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories world-wide by generating accurate and valuable metadata for those repositories. PubDate: 2024-06-01
Abstract: Systematic literature reviews in educational research have become a popular research method. A key point hereby is the choice of bibliographic databases to reach a maximum probability of finding all potentially relevant literature that deals with the research question analyzed in a systematic literature review. Guidelines and handbooks on review recommend proper databases and information sources for education, along with specific search strategies. However, in many disciplines, among them educational research, there is a lack of evidence on the relevance of databases that need to be considered to find relevant literature and lessen the risk of missing relevant publications. Educational research is an interdisciplinary field and has no core database. Instead, the field is covered by multiple disciplinary and multidisciplinary information sources that have either a national or international focus. In this article, we discuss the relevance of seven databases in systematic literature reviews in education, based on results of an empirical data analysis of three recently published reviews. To evaluate the relevance of a database, the relevant literature of those reviews served as the gold standard. Results indicate that discipline-specific databases outperform international multidisciplinary sources, and a combination of discipline-specific international and national sources is most efficient in finding a high proportion of relevant literature. The article discusses the relevance of the databases in relation to their coverage of relevant literature, while considering practical implications for researchers performing a systematic literature search. We, thus, present evidence for proper database choices for educational and discipline-related systematic literature reviews. PubDate: 2024-06-01
Abstract: The automatic semantic structuring of scientific text allows for more efficient reading of research articles and is an important indexing step for academic search engines. Sequential sentence classification is an essential structuring task and targets the categorisation of sentences based on their content and context. However, the potential of transfer learning for sentence classification across different scientific domains and text types, such as full papers and abstracts, has not yet been explored in prior work. In this paper, we present a systematic analysis of transfer learning for scientific sequential sentence classification. For this purpose, we derive seven research questions and present several contributions to address them: (1) We suggest a novel uniform deep learning architecture and multi-task learning for cross-domain sequential sentence classification in scientific text. (2) We tailor two transfer learning methods to deal with the given task, namely sequential transfer learning and multi-task learning. (3) We compare the results of the two best models using qualitative examples in a case study. (4) We provide an approach for the semi-automatic identification of semantically related classes across annotation schemes and analyse the results for four annotation schemes. The clusters and underlying semantic vectors are validated using k-means clustering. (5) Our comprehensive experimental results indicate that when using the proposed multi-task learning architecture, models trained on datasets from different scientific domains benefit from one another. Our approach significantly outperforms state of the art on full paper datasets while being on par for datasets consisting of abstracts. PubDate: 2024-06-01
Abstract: In the past two decades, digital libraries (DL) have increasingly supported computational studies of digitized books (Jett et al. The hathitrust research center extracted features dataset (2.0), 2020; Underwood, Distant horizons: digital evidence and literary change, University of Chicago Press, Chicago, 2019; Organisciak et al. J Assoc Inf Sci Technol 73:317–332, 2022; Michel et al. Science 331:176–182, 2011). Nonetheless, there remains a dearth of DL data provisions or infrastructures for research on book reception, and user-generated book reviews have opened up unprecedented research opportunities in this area. However, insufficient attention has been paid to real-world complexities and limitations of using these datasets in scholarly research, which may cause analytical oversights (Crawford and Finn, Geo J 80:491–502, 2015), methodological pitfalls (Olteanu et al. Front Big Data 2:13, 2019), and ethical concerns (Hu et al. Research with user-generated book review data: legal and ethical pitfalls and contextualized mitigations, Springer, Berlin, 2023; Diesner and Chin, Gratis, libre, or something else? regulations and misassumptions related to working with publicly available text data, 2016). In this paper, we present three case studies that contextually and empirically investigate book reviews for their temporal, cultural, and socio-participatory complexities: (1) a longitudinal analysis of a ranked book list across ten years and over one month; (2) a text classification of 20,000 sponsored and 20,000 non-sponsored book reviews; and (3) a comparative analysis of 537 book ratings from Anglophone and non-Anglophone readerships. Our work reflects on both (1) data curation challenges that researchers may encounter (e.g., platform providers’ lack of bibliographic control) when studying book reviews and (2) mitigations that researchers might adopt to address these challenges (e.g., how to align data from various platforms).
Taken together, our findings illustrate some of the sociotechnical complexities of working with user-generated book reviews by revealing the transiency, power dynamics, and cultural dependency in these datasets. This paper explores some of the limitations and challenges of using user-generated book reviews for scholarship and calls for critical and contextualized usage of user-generated book reviews in future scholarly research. PubDate: 2024-06-01
Abstract: Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods in tasks that are specifically suited to identify and extract key elements of ETD documents. We explain how we construct the datasets by manually labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs. PubDate: 2024-05-03 DOI: 10.1007/s00799-024-00395-4