International Journal of Digital Curation
ISSN (Online): 1746-8256
Publisher: U of Edinburgh Journal Hosting Service
- Infra Finder: a New Tool to Enhance Transparency, Discoverability and
Trust in Open Infrastructure
Authors: Lauren Collister, Emmy Tsang, Chrys Wu
Abstract: This paper describes Infra Finder, a new tool built by Invest in Open Infrastructure (IOI) to help institutional budget holders and libraries make more informed decisions around adoption of and investment in open infrastructure. Through increased transparency and discoverability, we aim for this tool to foster trust in the decision-making process and to help build connections between services, users, and funders. The design of Infra Finder is intended to contribute to ongoing discussions and developments regarding trust and transparency in open scholarly infrastructure, as well as to help level the playing field between organizations with limited resources for extensive due diligence and those with their own analyst teams. In this work, we describe the landscape analysis that led to the creation of Infra Finder, the use cases for the tool, and the approach IOI is taking to create and foster use of Infra Finder in the open infrastructure environment. We also address some of the principles of trust in open source and open infrastructure that have informed and shaped the Infra Finder project and our work in creating this tool.
PubDate: 2024-08-15
DOI: 10.2218/ijdc.v18i1.927
Issue No: Vol. 18, No. 1 (2024)
- Transparent Disclosure, Curation & Preservation of Dynamic Digital
Resources
Authors: Deirdre Lungley, Darren Bell, Hervé L’Hours
Abstract: This paper explores an enhanced curation lifecycle being developed at the UK Data Service (UKDS) with our Data Product Builder. Through a graphical user interface, we aim to provide the researcher with a tailored digital resource. We detail the threefold motivation behind this initiative: data dissemination scalability, researcher satisfaction, and the reduction of nationwide duplication of research effort. Subsequent sections detail the technical components and challenges involved. In addition to more standard data subsetting, filtering, and linking components, this data dissemination platform offers dynamic disclosure assessments – identifying combinations of variables that present a potential disclosure risk (a minimal illustrative sketch follows this entry). All components are underpinned by the Data Documentation Initiative’s new Cross-Domain Integration standard (DDI-CDI), designed to handle the many structures in which data may be organised. Ever conscious of the scale of the task we are embarking on, we remain motivated by the need for such advances in data dissemination and optimistic about the feasibility of such a system meeting the needs of the researcher while balancing the data disclosivity concerns of the data depositor.
PubDate: 2024-08-13
DOI: 10.2218/ijdc.v18i1.937
Issue No: Vol. 18, No. 1 (2024)
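The dynamic disclosure assessments mentioned in this abstract are described only at a high level. As a purely illustrative sketch (not UKDS code), one simple way to flag risky variable combinations is a threshold rule over cross-tabulation cell counts; the threshold of 10 and the `records`/`quasi_identifiers` names below are our assumptions.

```python
from collections import Counter
from itertools import combinations

THRESHOLD = 10  # assumed minimum cell count; real rules vary by dataset

def risky_combinations(records, quasi_identifiers, max_vars=3):
    """Flag variable combinations whose cross-tabulation produces at least
    one cell with fewer than THRESHOLD records (a simple threshold rule)."""
    flagged = []
    for size in range(2, max_vars + 1):
        for combo in combinations(quasi_identifiers, size):
            cells = Counter(tuple(rec[var] for var in combo) for rec in records)
            if min(cells.values()) < THRESHOLD:
                flagged.append(combo)
    return flagged

# e.g. risky_combinations(rows, ["age_band", "region", "occupation"])
```

A production service would apply a far richer rule set (dominance checks, population uniqueness, and so on), but a cell-count floor like this is a common first filter.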
- Curation is Communal: Transparency, Trust, and (In)visible Labour
Authors: Halle Burns, Sand Caldrone, Mikala Narlock
Abstract: Research about trust and transparency within the realm of research data management and sharing typically centres on accreditation and compliance. Missing from many of these conversations are the social systems and enabling structures that are built on interpersonal connections. As members of the Data Curation Network (DCN), a consortium of United States-based institutional and non-profit data repositories, we have experienced first-hand the effort required to develop and sustain interpersonal trust and the benefits it provides to curation. In this paper, we reflect on the well-documented realities of curator and labour invisibility; the importance of fostering active communities (such as the DCN); and how trust, vulnerability, and connectivity among colleagues lead to better curation practices. Through an investigation into data curators in the DCN, we found that, while curation can be isolating and invisible work, having a network of trusted peers helps alleviate these burdens and makes us better curators. We conclude with practical suggestions for implementing trust and transparency in relationships with colleagues and researchers.
PubDate: 2024-08-12
DOI: 10.2218/ijdc.v18i1.938
Issue No: Vol. 18, No. 1 (2024)
- Reproducible and Attributable Materials Science Curation Practices: A Case
Study
Authors: Ye Li, Sara Wilson, Micah Altman
Abstract: While small labs produce much of the fundamental experimental research in Materials Science and Engineering (MSE), little is known about their data management and sharing practices and the extent to which they promote trust in, and transparency of, the published research. In this research, we conduct a case study of a leading MSE research lab to characterize the limits of current data management and sharing practices with respect to reproducibility and attribution. We systematically reconstruct the workflows underpinning four research projects by combining interviews, document review, and digital forensics. We then apply information graph analysis and computer-assisted retrospective auditing to identify where critical research information is unavailable or at risk. We find that while data management and sharing practices in this leading lab protect against computer and disk failure, they are insufficient to ensure reproducibility or correct attribution of work, especially when a group member withdraws before project completion. We conclude with recommendations for adjusting MSE data management and sharing practices to promote trustworthiness and transparency by adding lightweight automated file-level auditing and automated data transfer processes (a minimal sketch of such an audit follows this entry).
PubDate: 2024-07-28
DOI: 10.2218/ijdc.v18i1.940
Issue No: Vol. 18, No. 1 (2024)
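The recommendation for lightweight automated file-level auditing is stated in the abstract only in general terms. The following is a minimal sketch of one way such auditing could work, using SHA-256 fixity manifests; all names are illustrative, not taken from the paper.

```python
import hashlib
from pathlib import Path

def manifest(root: str) -> dict:
    """Record a SHA-256 fixity value for every file under the project root."""
    base = Path(root)
    return {str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(base.rglob("*")) if p.is_file()}

def audit(root: str, previous: dict) -> dict:
    """Diff the current state of the directory against an earlier manifest."""
    current = manifest(root)
    return {"added":   sorted(current.keys() - previous.keys()),
            "removed": sorted(previous.keys() - current.keys()),
            "changed": sorted(k for k in current.keys() & previous.keys()
                              if current[k] != previous[k])}
```

Run `audit` on a schedule and archive each manifest; the diffs then show when files appeared, vanished, or changed, even after a group member has left the lab.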
- Trusted Research Environments: Analysis of Characteristics and Data
Availability
Authors: Martin Weise, Andreas Rauber
Abstract: Trusted Research Environments (TREs) enable the analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, on descriptions of their building blocks, and on their slight technical variations. To shed light on these problems, an overview of the existing, publicly described TREs is provided, together with a bibliography linking to their system descriptions. Their technical characteristics, especially their commonalities and variations, are analysed, and insight is provided into their data type characteristics and availability. The literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data predominantly via secure remote access. Statistical offices (SOs) make the majority of sensitive data records included in this study available.
PubDate: 2024-07-22
DOI: 10.2218/ijdc.v18i1.939
Issue No: Vol. 18, No. 1 (2024)
- Preserving Secondary Knowledge
Authors: Klaus Rechert, Rafael Gieschke
Abstract: Emulation and migration are still our main tools in digital curation and preservation practice. Both strategies have been discussed extensively and have been demonstrated to be effective and applicable in various scenarios. Discussions have primarily centered on technical feasibility, workflow integration, and usability. However, one important aspect remains when discussing these two techniques: managing and preserving operational knowledge. Both approaches require specialized knowledge, but emulation in particular requires future users to have a wide range of knowledge about past software and computer systems for successful operation. We investigate how this knowledge can be stored and utilized, and to what extent it can be rendered machine-actionable using modern large language models. We demonstrate a proof-of-concept implementation that operates an emulated software environment through natural language (a hypothetical sketch of such glue code follows this entry).
PubDate: 2024-07-08
DOI: 10.2218/ijdc.v18i1.930
Issue No: Vol. 18, No. 1 (2024)
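The proof-of-concept that operates an emulated environment through natural language is not specified in detail in this abstract. The sketch below shows the general shape such glue code could take; the `emulator` and `llm` interfaces are hypothetical placeholders, not APIs from the paper.

```python
# All interfaces here are hypothetical placeholders, not APIs from the paper.
SYSTEM_PROMPT = (
    "You operate a legacy desktop environment inside an emulator. "
    "Translate the user's goal into the exact keystrokes to type, "
    "one command per line, and output nothing else."
)

def run_task(goal: str, emulator, llm) -> None:
    """Ask the language model for concrete input, then replay it in the emulator."""
    screen = emulator.read_screen_text()                 # assumed emulator API
    reply = llm.complete(system=SYSTEM_PROMPT,
                         user=f"Screen:\n{screen}\n\nGoal: {goal}")
    for line in reply.splitlines():                      # one command per line
        emulator.send_keys(line + "\n")                  # assumed emulator API
```

The design point the paper raises is that the operational knowledge (which commands a 1990s system expects, in what order) lives in the model rather than in the user.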
- DMPs as Management Tool for Intellectual Assets by SMART-metrics
Authors: Federico Grasso Toro
Abstract: Data Management Plans (DMPs) are vital components of effective research data management (RDM). They serve not only as organisational tools but also as a structured framework dictating the collection, processing, sharing/publishing, and management of data throughout the research data life cycle. This can include existing data curation standards, the establishment of data handling protocols, and the creation, when necessary, of community curation policies. DMPs therefore present a unique opportunity to harmonise project management efforts and to optimise the formulation and execution of project objectives. To harness the full potential of DMPs as project management tools, the SMART approach (i.e., Specific, Measurable, Achievable, Relevant, and Time-bound) emerges as a compelling methodology. During the initial stage of the project proposal, drafted SMART metrics can offer a systematic approach to mapping work packages (WPs) and deliverables to the overarching project objectives. The Principal Investigators (PIs) can then assure the consortia that all of the project’s potential intellectual assets (i.e., expected research results), along with their necessary timelines, resources, and execution, have been properly considered. It becomes imperative for data stewards (DSs) and governance policymakers to educate researchers and provide guidelines on the advantages of developing well-curated DMPs that align results with SMART metrics. This alignment ensures that every intellectual asset intended as a research result (e.g., intellectual property, publications, datasets, and software) within the project is subject to rigorous planning, execution, and accountability. Consequently, the risk of unforeseen setbacks and/or deviations from the original objectives is minimised, increasing the traceability and transparency of the research data life cycle. In addition, the integration of Technology Readiness Levels (TRLs) into this proposed enhanced DMP provides a systematic method to evaluate the maturity and readiness of technologies across scientific disciplines. Regular TRL assessments allow PIs (1) to monitor WP progress, (2) to adapt research strategies if required, and (3) to ensure the projects remain in line with the SMART metrics drafted in the enhanced DMP before the project started. The TRLs can also help PIs maintain their focus on project milestones and specific tasks aligned with the original objectives, contributing to the overall success of their endeavours while improving transparency in the reporting and dissemination of research results. The paper presents the overall framework for enhancing DMPs as project management tools for any intellectual assets using SMART metrics and TRLs, and introduces suggested support services for data stewardship teams to assist PIs in implementing this novel framework effectively. (A minimal sketch of such an enhanced DMP record follows this entry.)
PubDate: 2024-06-17
DOI: 10.2218/ijdc.v18i1.919
Issue No: Vol. 18, No. 1 (2024)
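To make the proposed enhanced-DMP structure concrete, here is a minimal sketch of how one intellectual asset might be recorded against SMART metrics and TRL checkpoints. The field names are our illustration; the paper does not prescribe a schema.

```python
from dataclasses import dataclass, field

@dataclass
class SmartMetric:
    specific: str      # what exactly will be delivered
    measurable: str    # how progress is quantified
    achievable: str    # resources and skills that make it feasible
    relevant: str      # which overarching project objective it serves
    time_bound: str    # deadline or milestone

@dataclass
class IntellectualAsset:
    name: str          # e.g. a dataset, publication, or software package
    work_package: str
    metric: SmartMetric
    trl_checkpoints: dict = field(default_factory=dict)  # {"2025-06": 4, ...}
```

With records of this shape, a data steward could report, per work package, which assets have overdue TRL checkpoints or metrics with no measurable criterion.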
- Factors Influencing Perceptions of Trust in Data Infrastructures
Authors: Katharina Flicker, Andreas Rauber, Bettina Kern, Fajar J. Ekaputra
Abstract: Trust is an essential precondition for the acceptance of digital infrastructures and services. Transparency has been identified as one mechanism for increasing trustworthiness. Yet it is difficult to assess to what extent, and how exactly, different aspects of transparency contribute to trust, or potentially impede it when the information provided is overwhelmingly complex. To address these issues, we performed two initial studies to help determine the factors that influence trust, focusing on transparency across a range of elements associated with data, data infrastructures, and virtual research environments. On the one hand, we performed a survey among IT experts in the field of data science focusing on quality aspects in the context of reusing and sharing open source software, assessing issues such as the need for documentation, test cases, and accountability. On the other hand, we complemented this with a set of semi-structured interviews with senior researchers to address specific issues of the degree of transparency achievable with different approaches: for example, the amount of transparency we can achieve with approaches from explainable AI, or the usefulness and limitations of data provenance in determining the suitability of data for reuse. Specifically, we consider mechanisms on three levels: technical, process-oriented, and social. Starting from attributes of trust in the “analogue world”, we aim to understand which of these can be applied in the digital world, how they differ, and what additional mechanisms need to be established to support trust in complex socio-technological processes and their emergent results when traditional approaches can no longer be applied.
PubDate: 2024-05-13
DOI: 10.2218/ijdc.v18i1.921
Issue No: Vol. 18, No. 1 (2024)
- Assessing Quality Variations in Early Career Researchers’ Data
Management Plans
Authors: Jukka Rantasaari
Abstract: This paper aims to better understand early career researchers’ (ECRs’) research data management (RDM) competencies by assessing the contents and quality of data management plans (DMPs) developed during a multi-stakeholder RDM course. We also aim to identify differences between DMPs in relation to several background variables (e.g., discipline, course track). The Basics of Research Data Management (BRDM) course has been held in two multi-faculty, research-intensive universities in Finland since 2020. In this study, 223 ECRs’ DMPs created in the BRDM course of 2020–2022 were assessed using the recommendations and criteria of the Finnish DMP Evaluation Guide + General Finnish DMP Guidance (FDEG). The median quality of DMPs appeared to be satisfactory. The differences in rating according to FDEG’s three-point performance criteria were not statistically significant between DMPs developed in separate years, course tracks, or disciplines. However, using content analysis, differences were found between disciplines and course tracks regarding DMPs’ key characteristics, such as sharing, storing, and preserving data. DMPs that contained a data table (DtDMPs) also differed highly significantly from prose DMPs: DtDMPs better acknowledged the data handling needs of different data types and improved the overall quality of a DMP. The results illustrate that the ECRs had learned basic RDM competencies and grasped their significance for the integrity, reliability, and reusability of data. However, further, more focused training is needed to reach advanced competency, especially in handling and sharing personal data, legal issues, long-term preservation, and funders’ data policies. Equally important to the cultural change, in which RDM becomes an organic part of research practice, is merging research support services, processes, and infrastructure into research projects’ processes. Additionally, incentives are needed for sharing and reusing data.
PubDate: 2024-04-14
DOI: 10.2218/ijdc.v18i1.873
Issue No: Vol. 18, No. 1 (2024)
- Artificial Intelligence Assisted Curation of Population Groups in
Biomedical Literature
Authors: Latrice Landry, Mary Lucas, Anietie Andy, Ebelechukwu Nwafor
Pages: 9 - 9
Abstract: Curation of the growing body of published biomedical research is of great importance to both the synthesis of contemporary science and the archiving of historical biomedical literature. Each of these tasks has become increasingly challenging given the expansion of journal titles, preprint repositories, and electronic databases. Added to this challenge is the need for curation of biomedical literature across population groups to better capture study populations for improved understanding of the generalizability of findings. To address this, our study aims to explore the use of generative artificial intelligence (AI) in the form of large language models (LLMs) such as GPT-4 as an AI curation assistant for the task of curating biomedical literature for population groups. We conducted a series of experiments which qualitatively and quantitatively evaluate the performance of OpenAI’s GPT-4 in curating population information from biomedical literature. Using OpenAI’s GPT-4 and curation instructions, executed through prompts, we evaluate the ability of GPT-4 to classify study ‘populations’, ‘continents’ and ‘countries’ from a previously curated dataset of public health COVID-19 studies. Using three different experimental approaches, we examined performance by: A) evaluating accuracy (concordance with human curation) using both exact and approximate string matches within a single experimental approach (a minimal sketch of such a check follows this entry); B) evaluating accuracy across experimental approaches; and C) conducting a qualitative phenomenology analysis to describe and classify the nature of differences between human curation and GPT curation. Our study shows that GPT-4 has the potential to provide assistance in the curation of population groups in biomedical literature. Additionally, phenomenology provided key information for prompt design that further improved the LLM’s performance in these tasks. Future research should aim to improve prompt design, as well as explore other generative AI models to improve curation performance. An increased understanding of the populations included in research studies is critical for the interpretation of findings, and we believe this study provides keen insight into the potential to increase the scalability of population curation in biomedical studies.
PubDate: 2024-08-18
DOI: 10.2218/ijdc.v18i1.950
Issue No: Vol. 18, No. 1 (2024)
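Experimental approach A scores concordance with human curation via exact and approximate string matches. Below is a minimal sketch of such a check using Python’s standard library; the 0.9 similarity threshold is our assumption, not the paper’s.

```python
from difflib import SequenceMatcher

def concordance(model_value: str, human_value: str, threshold: float = 0.9):
    """Return (exact_match, approximate_match) against the human-curated value."""
    a, b = model_value.strip().lower(), human_value.strip().lower()
    exact = a == b
    approximate = SequenceMatcher(None, a, b).ratio() >= threshold
    return exact, approximate

print(concordance("United States of America", "United States"))
# -> (False, False) with the 0.9 threshold; loosen it to tolerate such variants
```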
- Community-based Curate-a-Thons to Enhance Preservation of Global Genetic
Biodiversity Data
Authors: Andrea L Pritt, Briana E Wham, Rachel H Toczydlowski, Eric D Crandall
Pages: 13 - 13
Abstract: Science, Technology, Engineering, and Mathematics (STEM) and Research Data Librarians collaborated with an international research team of conservation geneticists to create an instructional and practical guide combining genetic biodiversity initiatives and data curation. Over the course of two months, the academic librarians held multiple community-based Curate-a-Thons in which an international group of students, researchers, librarians, and faculty researchers participated in tracking down publications and metadata for genomic sequence data, thus crowd-sourcing this effort of metadata enhancement. This article details the successful Curate-a-Thon design and implementation process; the openly available instructional materials created and used to host the Curate-a-Thons; and the challenges and successes of these community-based events.
PubDate: 2024-02-11
DOI: 10.2218/ijdc.v18i1.891
Issue No: Vol. 18, No. 1 (2024)
- Closing Gaps: A Model of Cumulative Curation and Preservation Levels for
Trustworthy Digital Repositories
Authors: Jonas Recker, Mari Kleemola, Hervé L'Hours
Pages: 16 - 16
Abstract: Curation and preservation measures carried out by digital repository staff are an important building block in maintaining the accessibility and usability of digital resources over time. The measures adequate to achieve long-term usability for a given audience strongly depend on scenarios of (re)use, the (intended) users’ needs and skills, the organisational setting (e.g., mission, resources, policies), as well as the characteristics of the digital objects to be preserved. The assessment of curation and preservation measures also forms an important part of existing certification procedures for trustworthy digital repositories (TDRs) as offered, for example, by the CoreTrustSeal foundation, the nestor network, or ISO. The digital curation community is presented with the challenge of finding community-, organisation-, and object-specific approaches to curation and preservation at the same time as defining the minimum level of curation and preservation measures expected from a TDR in sufficiently generic terms to ensure applicability to a wide array of repositories. Against this backdrop, this paper discusses the need for and benefits of community-agreed levels of curation and preservation to address this challenge, and considers the tiered model proposed by the CoreTrustSeal Board as an example. The proposed model is then applied in an analysis of successful CoreTrustSeal applications from 2018–2022 in an effort to better understand the capacity of the curation and preservation levels to capture the respective practices of repositories and to identify potential gaps.
PubDate: 2024-08-15
DOI: 10.2218/ijdc.v18i1.926
Issue No: Vol. 18, No. 1 (2024)
- The Generation of Revision Identifier (rsid) Numbers in MS Word
Authors: Dirk H.R. Spennemann, Claire L Singh
Pages: 22 - 22
Abstract: The 2007 implementation of the Office Open XML standard for Microsoft Word introduced the assignment of individual revision save identifiers (Rsid) to document editing sessions that end in a save action. The relevant standards, ECMA (2016) and ISO/IEC 29500-1:2016, stipulate that these Rsid should be allocated randomly but with increasing numerical value, thereby documenting the progress of the editing. As MS Word is the most ubiquitous word-processing software, Rsid appear to be a useful tool to examine, and provide evidence for, a wide range of common document generation, editing, and modification processes and file management operations, with implications for document analysis including, but not limited to, academic integrity issues in student assignment submissions (e.g., contract cheating). This paper presents the results of a series of experiments conducted to assess whether and how well MS Word implements the ECMA and ISO/IEC standards. The results show that the number of allocated Rsid indeed increases with each edit and save action, with previous Rsids carried over and retained. The newly allocated Rsid, however, do not conform to the standard, as the numerical value of a Rsid associated with a save action may be larger or smaller than any or all of those allocated during previous save actions. The allocation of a new Rsid is not necessarily caused by an edit event: a new Rsid can also be generated if a file is saved as RTF or sent as an e-mail from within MS Word, even though the file was not edited in any way. Rsid numbers are not generated if a person opens an MS Word document, reads it, and closes the file without saving, making this action impossible to detect. MS Word template files on a given machine contain document (root) Rsid numbers that are generated when a newly installed application is launched for the first time. As these will be embedded as legacy Rsid into every new file generated from that template file, they act as signatures for all MS Word documents created from it. The experiments have shown that user behaviour has a direct influence on the number of Rsid represented in a given file. Although the implementation of Office Open XML chosen by Microsoft is not compliant with the relevant standards, and thus Rsid cannot be used to determine the exact chronological order of all editing sequences within a given document, Rsid retain their value for document forensics, as they are associated with specific edit events and illuminate the document writing and editing process. (A minimal sketch for reading Rsid values follows this entry.)
PubDate: 2024-02-11
DOI: 10.2218/ijdc.v18i1.870
Issue No: Vol. 18, No. 1 (2024)
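For readers who want to inspect Rsid values themselves: a .docx file is a ZIP archive, and Word records the Rsid table in `word/settings.xml` (a `w:rsids` element holding `w:rsidRoot` and one `w:rsid` element per value). A minimal reader, independent of the paper’s own tooling:

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def list_rsids(path: str):
    """Return (rsidRoot, [rsid values]) from a .docx file's settings part."""
    with zipfile.ZipFile(path) as docx:
        settings = ET.fromstring(docx.read("word/settings.xml"))
    rsids = settings.find(f"{W}rsids")
    if rsids is None:                       # e.g. a file not produced by Word
        return None, []
    root = rsids.find(f"{W}rsidRoot")
    values = [el.get(f"{W}val") for el in rsids.findall(f"{W}rsid")]
    return (root.get(f"{W}val") if root is not None else None), values
```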
- Selecting Efficient and Reliable Preservation Strategies:
Authors: Micah Altman, Richard Landau
Pages: 24 - 24
Abstract: This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modelling, discrete-event-based simulation, and hierarchical modelling, and then use empirically calibrated sensitivity analysis to identify effective strategies. Specifically, the framework formally defines an objective function for preservation that maps a set of preservation policies and a risk profile to a set of preservation costs and an expected collection loss distribution. In this framework, a curator’s objective is to select optimal policies that minimize expected loss subject to budget constraints (a toy version of this objective follows this entry). To estimate preservation loss under different policy conditions, we develop a statistical hierarchical risk model that includes four sources of risk: the storage hardware; the physical environment; the curating institution; and the global environment. We then employ a general discrete-event-based simulation framework to evaluate the expected loss and the cost of employing varying preservation strategies under specific parameterizations of risk. Source code is available at: https://github.com/MIT-Informatics/PreservationSimulation. The framework offers flexibility for modelling a wide range of preservation policies and threats. Since it is open source and easily deployed in a cloud computing environment, it can be used to produce analyses based on independent estimates of scenario-specific costs, reliability, and risk. We present results summarizing hundreds of thousands of simulations using this framework. This exploratory analysis points to a number of robust and broadly applicable preservation strategies, provides novel insights into specific preservation tactics, and provides evidence that challenges received wisdom. An earlier version of this paper was published in IJDC 15(1), 2020.
PubDate: 2024-03-04
DOI: 10.2218/ijdc.v18i1.743
Issue No: Vol. 18, No. 1 (2024)
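The objective function described above (policies plus a risk profile mapped to cost and expected loss, minimized subject to a budget) can be illustrated with a deliberately tiny model. This toy assumes independent copy failures and invented numbers; the paper’s actual framework uses hierarchical risks and discrete-event simulation rather than a closed form like this.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    copies: int
    annual_cost: float   # invented units, per collection

def expected_loss(p_copy_fail: float, copies: int, docs: int) -> float:
    """A document is lost only if every copy fails in the same year
    (assumes independent failures, which real-world threats violate)."""
    return docs * (p_copy_fail ** copies)

def best_policy(policies, budget, p_copy_fail=0.01, docs=100_000):
    """Minimize expected loss subject to the budget constraint."""
    affordable = [p for p in policies if p.annual_cost <= budget]
    return min(affordable, key=lambda p: expected_loss(p_copy_fail, p.copies, docs))

policies = [Policy("2 copies", 2, 10.0),
            Policy("3 copies", 3, 15.0),
            Policy("5 copies", 5, 25.0)]
print(best_policy(policies, budget=20.0).name)   # -> 3 copies
```

Correlated risks (an institution failing, a global event) are exactly what break the independence assumption here, which is why the authors model them as separate hierarchical risk sources.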