Subjects -> COMPUTER SCIENCE (Total: 2313 journals)
    - ANIMATION AND SIMULATION (33 journals)
    - ARTIFICIAL INTELLIGENCE (133 journals)
    - AUTOMATION AND ROBOTICS (116 journals)
    - CLOUD COMPUTING AND NETWORKS (75 journals)
    - COMPUTER ARCHITECTURE (11 journals)
    - COMPUTER ENGINEERING (12 journals)
    - COMPUTER GAMES (23 journals)
    - COMPUTER PROGRAMMING (25 journals)
    - COMPUTER SCIENCE (1305 journals)
    - COMPUTER SECURITY (59 journals)
    - DATA BASE MANAGEMENT (21 journals)
    - DATA MINING (50 journals)
    - E-BUSINESS (21 journals)
    - E-LEARNING (30 journals)
    - ELECTRONIC DATA PROCESSING (23 journals)
    - IMAGE AND VIDEO PROCESSING (42 journals)
    - INFORMATION SYSTEMS (109 journals)
    - INTERNET (111 journals)
    - SOCIAL WEB (61 journals)
    - SOFTWARE (43 journals)
    - THEORY OF COMPUTING (10 journals)

COMPUTER SCIENCE (1305 journals)

Showing 1 - 200 of 872 Journals sorted alphabetically
3D Printing and Additive Manufacturing     Full-text available via subscription   (Followers: 27)
Abakós     Open Access   (Followers: 3)
ACM Computing Surveys     Hybrid Journal   (Followers: 29)
ACM Inroads     Full-text available via subscription   (Followers: 1)
ACM Journal of Computer Documentation     Free   (Followers: 4)
ACM Journal on Computing and Cultural Heritage     Hybrid Journal   (Followers: 5)
ACM Journal on Emerging Technologies in Computing Systems     Hybrid Journal   (Followers: 12)
ACM SIGACCESS Accessibility and Computing     Free   (Followers: 2)
ACM SIGAPP Applied Computing Review     Full-text available via subscription  
ACM SIGBioinformatics Record     Full-text available via subscription  
ACM SIGEVOlution     Full-text available via subscription  
ACM SIGHIT Record     Full-text available via subscription  
ACM SIGHPC Connect     Full-text available via subscription  
ACM SIGITE Newsletter     Open Access   (Followers: 1)
ACM SIGMIS Database: the DATABASE for Advances in Information Systems     Hybrid Journal  
ACM SIGUCCS plugged in     Full-text available via subscription  
ACM SIGWEB Newsletter     Full-text available via subscription   (Followers: 3)
ACM Transactions on Accessible Computing (TACCESS)     Hybrid Journal   (Followers: 3)
ACM Transactions on Algorithms (TALG)     Hybrid Journal   (Followers: 13)
ACM Transactions on Applied Perception (TAP)     Hybrid Journal   (Followers: 3)
ACM Transactions on Architecture and Code Optimization (TACO)     Hybrid Journal   (Followers: 9)
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)     Hybrid Journal  
ACM Transactions on Autonomous and Adaptive Systems (TAAS)     Hybrid Journal   (Followers: 10)
ACM Transactions on Computation Theory (TOCT)     Hybrid Journal   (Followers: 11)
ACM Transactions on Computational Logic (TOCL)     Hybrid Journal   (Followers: 5)
ACM Transactions on Computer Systems (TOCS)     Hybrid Journal   (Followers: 19)
ACM Transactions on Computer-Human Interaction     Hybrid Journal   (Followers: 15)
ACM Transactions on Computing Education (TOCE)     Hybrid Journal   (Followers: 9)
ACM Transactions on Computing for Healthcare     Hybrid Journal  
ACM Transactions on Cyber-Physical Systems (TCPS)     Hybrid Journal   (Followers: 1)
ACM Transactions on Design Automation of Electronic Systems (TODAES)     Hybrid Journal   (Followers: 5)
ACM Transactions on Economics and Computation     Hybrid Journal  
ACM Transactions on Embedded Computing Systems (TECS)     Hybrid Journal   (Followers: 4)
ACM Transactions on Information Systems (TOIS)     Hybrid Journal   (Followers: 18)
ACM Transactions on Intelligent Systems and Technology (TIST)     Hybrid Journal   (Followers: 11)
ACM Transactions on Interactive Intelligent Systems (TiiS)     Hybrid Journal   (Followers: 6)
ACM Transactions on Internet of Things     Hybrid Journal   (Followers: 2)
ACM Transactions on Modeling and Performance Evaluation of Computing Systems (ToMPECS)     Hybrid Journal  
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)     Hybrid Journal   (Followers: 10)
ACM Transactions on Parallel Computing     Full-text available via subscription  
ACM Transactions on Reconfigurable Technology and Systems (TRETS)     Hybrid Journal   (Followers: 6)
ACM Transactions on Sensor Networks (TOSN)     Hybrid Journal   (Followers: 9)
ACM Transactions on Social Computing     Hybrid Journal  
ACM Transactions on Spatial Algorithms and Systems (TSAS)     Hybrid Journal   (Followers: 1)
ACM Transactions on Speech and Language Processing (TSLP)     Hybrid Journal   (Followers: 11)
ACM Transactions on Storage     Hybrid Journal  
ACS Applied Materials & Interfaces     Hybrid Journal   (Followers: 41)
Acta Informatica Malaysia     Open Access  
Acta Universitatis Cibiniensis. Technical Series     Open Access   (Followers: 1)
Ad Hoc Networks     Hybrid Journal   (Followers: 12)
Adaptive Behavior     Hybrid Journal   (Followers: 8)
Additive Manufacturing Letters     Open Access   (Followers: 3)
Advanced Engineering Materials     Hybrid Journal   (Followers: 32)
Advanced Science Letters     Full-text available via subscription   (Followers: 9)
Advances in Adaptive Data Analysis     Hybrid Journal   (Followers: 9)
Advances in Artificial Intelligence     Open Access   (Followers: 33)
Advances in Catalysis     Full-text available via subscription   (Followers: 7)
Advances in Computational Mathematics     Hybrid Journal   (Followers: 20)
Advances in Computer Engineering     Open Access   (Followers: 13)
Advances in Computer Science : an International Journal     Open Access   (Followers: 19)
Advances in Computing     Open Access   (Followers: 3)
Advances in Data Analysis and Classification     Hybrid Journal   (Followers: 52)
Advances in Engineering Software     Hybrid Journal   (Followers: 27)
Advances in Geosciences (ADGEO)     Open Access   (Followers: 19)
Advances in Human-Computer Interaction     Open Access   (Followers: 19)
Advances in Image and Video Processing     Open Access   (Followers: 20)
Advances in Materials Science     Open Access   (Followers: 20)
Advances in Multimedia     Open Access   (Followers: 1)
Advances in Operations Research     Open Access   (Followers: 13)
Advances in Remote Sensing     Open Access   (Followers: 59)
Advances in Science and Research (ASR)     Open Access   (Followers: 8)
Advances in Technology Innovation     Open Access   (Followers: 5)
AEU - International Journal of Electronics and Communications     Hybrid Journal   (Followers: 8)
African Journal of Information and Communication     Open Access   (Followers: 6)
African Journal of Mathematics and Computer Science Research     Open Access   (Followers: 5)
AI EDAM     Hybrid Journal   (Followers: 2)
Air, Soil & Water Research     Open Access   (Followers: 6)
AIS Transactions on Human-Computer Interaction     Open Access   (Followers: 5)
Al-Qadisiyah Journal for Computer Science and Mathematics     Open Access   (Followers: 2)
AL-Rafidain Journal of Computer Sciences and Mathematics     Open Access   (Followers: 3)
Algebras and Representation Theory     Hybrid Journal  
Algorithms     Open Access   (Followers: 13)
American Journal of Computational and Applied Mathematics     Open Access   (Followers: 8)
American Journal of Computational Mathematics     Open Access   (Followers: 6)
American Journal of Information Systems     Open Access   (Followers: 4)
American Journal of Sensor Technology     Open Access   (Followers: 2)
Analog Integrated Circuits and Signal Processing     Hybrid Journal   (Followers: 15)
Animation Practice, Process & Production     Hybrid Journal   (Followers: 4)
Annals of Combinatorics     Hybrid Journal   (Followers: 3)
Annals of Data Science     Hybrid Journal   (Followers: 14)
Annals of Mathematics and Artificial Intelligence     Hybrid Journal   (Followers: 16)
Annals of Pure and Applied Logic     Open Access   (Followers: 4)
Annals of Software Engineering     Hybrid Journal   (Followers: 12)
Annual Reviews in Control     Hybrid Journal   (Followers: 7)
Anuario Americanista Europeo     Open Access  
Applicable Algebra in Engineering, Communication and Computing     Hybrid Journal   (Followers: 3)
Applied and Computational Harmonic Analysis     Full-text available via subscription  
Applied Artificial Intelligence: An International Journal     Hybrid Journal   (Followers: 17)
Applied Categorical Structures     Hybrid Journal   (Followers: 4)
Applied Clinical Informatics     Hybrid Journal   (Followers: 4)
Applied Computational Intelligence and Soft Computing     Open Access   (Followers: 16)
Applied Computer Systems     Open Access   (Followers: 6)
Applied Computing and Geosciences     Open Access   (Followers: 3)
Applied Mathematics and Computation     Hybrid Journal   (Followers: 31)
Applied Medical Informatics     Open Access   (Followers: 11)
Applied Numerical Mathematics     Hybrid Journal   (Followers: 4)
Applied Soft Computing     Hybrid Journal   (Followers: 13)
Applied Spatial Analysis and Policy     Hybrid Journal   (Followers: 5)
Applied System Innovation     Open Access   (Followers: 1)
Archive of Applied Mechanics     Hybrid Journal   (Followers: 4)
Archive of Numerical Software     Open Access  
Archives and Museum Informatics     Hybrid Journal   (Followers: 97)
Archives of Computational Methods in Engineering     Hybrid Journal   (Followers: 5)
arq: Architectural Research Quarterly     Hybrid Journal   (Followers: 7)
Array     Open Access   (Followers: 1)
Artifact : Journal of Design Practice     Open Access   (Followers: 8)
Artificial Life     Hybrid Journal   (Followers: 7)
Asian Journal of Computer Science and Information Technology     Open Access   (Followers: 3)
Asian Journal of Control     Hybrid Journal  
Asian Journal of Research in Computer Science     Open Access   (Followers: 4)
Assembly Automation     Hybrid Journal   (Followers: 2)
Automatic Control and Computer Sciences     Hybrid Journal   (Followers: 6)
Automatic Documentation and Mathematical Linguistics     Hybrid Journal   (Followers: 5)
Automatica     Hybrid Journal   (Followers: 13)
Automatika : Journal for Control, Measurement, Electronics, Computing and Communications     Open Access  
Automation in Construction     Hybrid Journal   (Followers: 8)
Balkan Journal of Electrical and Computer Engineering     Open Access  
Basin Research     Hybrid Journal   (Followers: 7)
Behaviour & Information Technology     Hybrid Journal   (Followers: 32)
BenchCouncil Transactions on Benchmarks, Standards, and Evaluations     Open Access   (Followers: 7)
Big Data and Cognitive Computing     Open Access   (Followers: 5)
Big Data Mining and Analytics     Open Access   (Followers: 10)
Biodiversity Information Science and Standards     Open Access   (Followers: 2)
Bioinformatics     Hybrid Journal   (Followers: 226)
Bioinformatics Advances : Journal of the International Society for Computational Biology     Open Access   (Followers: 1)
Biomedical Engineering     Hybrid Journal   (Followers: 11)
Biomedical Engineering and Computational Biology     Open Access   (Followers: 11)
Briefings in Bioinformatics     Hybrid Journal   (Followers: 43)
British Journal of Educational Technology     Hybrid Journal   (Followers: 96)
Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics and Mathematics     Open Access  
c't Magazin fuer Computertechnik     Full-text available via subscription   (Followers: 1)
Cadernos do IME : Série Informática     Open Access  
CALCOLO     Hybrid Journal  
CALICO Journal     Full-text available via subscription   (Followers: 3)
Calphad     Hybrid Journal  
Canadian Journal of Electrical and Computer Engineering     Full-text available via subscription   (Followers: 14)
Catalysis in Industry     Hybrid Journal  
CCF Transactions on High Performance Computing     Hybrid Journal  
CCF Transactions on Pervasive Computing and Interaction     Hybrid Journal  
CEAS Space Journal     Hybrid Journal   (Followers: 6)
Cell Communication and Signaling     Open Access   (Followers: 3)
Central European Journal of Computer Science     Hybrid Journal   (Followers: 5)
CERN IdeaSquare Journal of Experimental Innovation     Open Access  
Chaos, Solitons & Fractals     Hybrid Journal   (Followers: 1)
Chaos, Solitons & Fractals : X     Open Access   (Followers: 1)
Chemometrics and Intelligent Laboratory Systems     Hybrid Journal   (Followers: 13)
ChemSusChem     Hybrid Journal   (Followers: 7)
China Communications     Full-text available via subscription   (Followers: 8)
Chinese Journal of Catalysis     Full-text available via subscription   (Followers: 2)
Chip     Full-text available via subscription   (Followers: 5)
Ciencia     Open Access  
CIN : Computers Informatics Nursing     Hybrid Journal   (Followers: 11)
Circuits and Systems     Open Access   (Followers: 16)
CLEI Electronic Journal     Open Access  
Clin-Alert     Hybrid Journal   (Followers: 1)
Clinical eHealth     Open Access  
Cluster Computing     Hybrid Journal   (Followers: 1)
Cognitive Computation     Hybrid Journal   (Followers: 2)
Cognitive Computation and Systems     Open Access  
COMBINATORICA     Hybrid Journal  
Combinatorics, Probability and Computing     Hybrid Journal   (Followers: 4)
Combustion Theory and Modelling     Hybrid Journal   (Followers: 18)
Communication Methods and Measures     Hybrid Journal   (Followers: 12)
Communication Theory     Hybrid Journal   (Followers: 29)
Communications in Algebra     Hybrid Journal   (Followers: 1)
Communications in Partial Differential Equations     Hybrid Journal   (Followers: 2)
Communications of the ACM     Full-text available via subscription   (Followers: 59)
Communications of the Association for Information Systems     Open Access   (Followers: 15)
Communications on Applied Mathematics and Computation     Hybrid Journal   (Followers: 1)
COMPEL: The International Journal for Computation and Mathematics in Electrical and Electronic Engineering     Hybrid Journal   (Followers: 4)
Complex & Intelligent Systems     Open Access   (Followers: 1)
Complex Adaptive Systems Modeling     Open Access  
Complex Analysis and Operator Theory     Hybrid Journal   (Followers: 2)
Complexity     Hybrid Journal   (Followers: 8)
Computación y Sistemas     Open Access  
Computation     Open Access   (Followers: 1)
Computational and Applied Mathematics     Hybrid Journal   (Followers: 3)
Computational and Mathematical Methods     Hybrid Journal  
Computational and Mathematical Methods in Medicine     Open Access   (Followers: 2)
Computational and Mathematical Organization Theory     Hybrid Journal   (Followers: 1)
Computational and Structural Biotechnology Journal     Open Access   (Followers: 1)
Computational and Theoretical Chemistry     Hybrid Journal   (Followers: 11)
Computational Astrophysics and Cosmology     Open Access   (Followers: 7)
Computational Biology and Chemistry     Hybrid Journal   (Followers: 13)
Computational Biology Journal     Open Access   (Followers: 6)
Computational Brain & Behavior     Hybrid Journal   (Followers: 1)
Computational Chemistry     Open Access   (Followers: 3)
Computational Communication Research     Open Access   (Followers: 1)
Computational Complexity     Hybrid Journal   (Followers: 5)
Computational Condensed Matter     Open Access   (Followers: 1)


Biodiversity Information Science and Standards
Number of Followers: 2  

  This is an Open Access journal
ISSN (Online) 2535-0897
Published by Pensoft
  • Challenges in Curating Interdisciplinary Data in the Biodiversity Research
           Community

    • Abstract: Biodiversity Information Science and Standards 5: e79084
      DOI : 10.3897/biss.5.79084
      Authors : Inna Kouper, Kimberly Cook : Panelists: James Macklin, Agriculture and Agri-Food Canada; Anne Thessen, University of Colorado Anschutz Medical Campus; Robbie Burger, University of Kentucky; Ben Norton, North Carolina Museum of Natural Sciences. Organizers: Kimberly Cook, University of Kentucky; Inna Kouper, Indiana University.
      As research incentives become increasingly focused on collaborative work, addressing the challenges of curating interdisciplinary data becomes a priority. A panel convened at the TDWG 2021 virtual conference on October 19 discussed these issues and provided the space where people with a variety of experience curating interdisciplinary biodiversity data shared their knowledge and expertise.
      The panel started with a brief introduction to the challenges of interdisciplinary and highly collaborative research (IHCR), which the panel organizers have previously observed (Kouper et al. 2021). In addition to varying definitions that focus on crossing the disciplinary boundaries or synthesizing knowledge, IHCR is characterized by an increasing emphasis on computation, integration of heterogeneous data sources, and work with multiple stakeholders. As such, IHCR data does not fit with traditional lifecycle models as it requires more iterations, coordination, and shared language.
      Narrowing the scope to biodiversity data, the panelists acknowledged that biodiversity is a truly interdisciplinary domain where researchers and practitioners bring their diverse expertise to take care of data. The domain has a variety of contributors, including data producers, users, and curators. While they share common goals, these contributors are often fragmented in separate projects that prioritize academic disciplines or public engagement. Lack of knowledge and awareness about contributors and their projects and expertise, as well as a certain vulnerability in branching out into new areas, are among the factors that make it difficult to tear down silos. As James Macklin put it, “... you're crossing a boundary into a place you don't maybe know a lot about, and for some people, that's hard to do. Right? It takes a lot of listening and thinking.”
      Due to their complex and interactive nature, IHCR projects almost always have a higher overhead in terms of communication, coordination, and management. Panelists agreed that for such projects there needs to be a collaboration handbook that assigns roles and responsibilities and establishes rules for various aspects of collaboration, including authorship and handling disagreements. Successful IHCR projects create such handbooks at the beginning and revisit them regularly. Another useful strategy mentioned was to hold debriefing sessions that evaluate what went well and what didn’t.
      Strong leadership that takes IHCR complexities into account and builds a network of capable facilitators and “bridge-builders” or “translators” is a big factor that makes projects succeed. Recognizing and encouraging the role of facilitators from the onset of the project helps to develop productive relationships across disciplines and areas of expertise. It also enables everyone to focus on their strengths and build trust.
      Data and metadata integration is one of the big challenges in biodiversity, although it is not unique to it. Biodiversity brings together many disciplines and each of them identifies its own problems and collects data to address them. Data silos stem from disciplinary silos, and it will take a different, more integrated, kind of cyberinfrastructure and modeling to bring these pieces together. Creating such infrastructures and standards around interdisciplinary data and metadata are serious needs, although they are not valued and rewarded enough compared to, say, publishing academic papers.
      Lack of standardization and infrastructure also stands in the way of improving the quality of data in biodiversity. To evaluate the quality of data and to trust its creators, data users need to know who gathered and processed the data and how. When the data is re-used within a collaborative project, there is an opportunity to ask questions and find out why, for example, someone had certain naming conventions or processing and analytical approaches. Long-term data such as species’ life history traits, however, can be collected over long periods of time. Improving the quality of biodiversity data requires going beyond interpersonal communication and addressing the issues of metadata and standards more systematically.
      Panelists also discussed the issue of openness in connection to biodiversity data. Openness contributes to the improved quality of data and an increased return on public investment in science and research. Panelists’ positions diverged in the degree to which biodiversity data should be open and approaches to address competitiveness and sensitivity in research. On one hand, they acknowledged the need for some form of embargo on data sharing to allow data originators to benefit from their effort; on the other, they argued that lack of openness promotes silos and diminishes the quality of research and its reproducibility. Panelists briefly discussed the COVID pandemic data as an example of how lack of openness and silos can be detrimental to finding solutions: “COVID has given us the best example we have of how silos do damage to things that could have gone better. ... the data wasn't available, if it had been open or not even necessarily open but had anybody had any idea that it existed somewhere, that would have helped a lot. … We are learning those lessons, governments are changing the way they do business because of it. And so for us, I mean, our community, I think this has been one of the best things that could have happened to us in some ways, simply because it forced a change ...
      PubDate: Wed, 8 Dec 2021 20:30:00 +0200
       
  • Translating TDWG Controlled Vocabularies

    • Abstract: Biodiversity Information Science and Standards 5: e79050
      DOI : 10.3897/biss.5.79050
      Authors : Steven J Baskauf, Paula Zermoglio : Users may be more likely to understand and utilize standards if they are able to read labels and definitions of terms in their own languages. Increasing standards usage in non-English speaking parts of the world will be important for making biodiversity data from across the globe more uniformly available. For these reasons, it is important for Biodiversity Information Standards (TDWG) to make its standards widely available in as many languages as possible. Currently, TDWG has six ratified controlled vocabularies that were originally available only in English. As an outcome of this workshop, we have made term labels and definitions in those vocabularies available in the languages of translators who participated in its sessions. In the introduction, we reviewed the concept of vocabularies, explained the distinction between term labels and controlled value strings, and described how multilingual labels and definitions fit into the standards development process. The introduction was followed by working sessions in which individual translators or small groups working in a single language filled out Google Sheets with their translations. The resulting translations were compiled along with attribution information for the translators and made freely available in JavaScript Object Notation (JSON) and comma separated values (CSV) formats.
      [An illustrative sketch of this compilation step follows this entry.]
      PubDate: Wed, 8 Dec 2021 11:30:00 +0200
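
      The abstract above describes compiling translators' Google Sheets into JSON and CSV. As a rough illustration of that compilation step (not the authors' actual tooling), the sketch below reads a hypothetical translations.csv with columns term_iri, language, label, definition and translator, and writes a JSON file that keeps attribution alongside the translated labels and definitions.

```python
import csv
import json
from collections import defaultdict

# Hypothetical export of the translators' sheets:
# columns: term_iri, language, label, definition, translator
def compile_translations(csv_path: str, json_path: str) -> None:
    translations = defaultdict(dict)
    contributors = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # one entry per term, keyed by language
            translations[row["term_iri"]][row["language"]] = {
                "label": row["label"],
                "definition": row["definition"],
            }
            contributors.add(row["translator"])
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(
            {"contributors": sorted(contributors), "terms": translations},
            f, ensure_ascii=False, indent=2,
        )

compile_translations("translations.csv", "translations.json")
```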
       
  • Authoritative Taxonomic Databases for Progress in Edible Insect and Host
           Plant Inventories

    • Abstract: Biodiversity Information Science and Standards 5: e75908
      DOI : 10.3897/biss.5.75908
      Authors : Papy Nsevolo : Insects play a vital role for humans. Apart from well-known ecosystem services (e.g., pollination, biological control, decomposition), they also serve as food for humans. An increasing number of research reports (Mitsuhashi 2017, Jongema 2018) indicate that entomophagy (the practice of eating insects by humans) is a long-standing practice in many countries around the globe. In Africa notably, more than 524 insects have been reported to be consumed by different ethnic groups, serving as a cheap, ecofriendly and renewable source of nutrients on the continent.
      Given the global recession due to the pandemic (COVID-19) and the threat induced to food security and food production systems, edible insects are of special interest in African countries, particularly the Democratic Republic of the Congo (DRC), where they have been reported as vital to sustain food security. Indeed, to date, the broadest lists of edible insects of the DRC reported (a maximum) 98 insects identified at species level (Monzambe 2002, Mitsuhashi 2017, Jongema 2018). But these lists are hampered by spelling mistakes or by redundancy. An additional problem is raised by insects only known by their vernacular names (ethnospecies) as local languages (more than 240 living ones) do not necessarily give rigorous information due to polysemy concerns.
      Based on the aforementioned challenges, entomophagy practices and edible insect species reported for DRC (from the independence year, 1960, to date) have been reviewed using four authoritative taxonomic databases: Catalogue of Life (CoL), Integrated Taxonomic Information System, Global Biodiversity Information Facility taxonomic backbone, and the Global Lepidoptera Names Index. Results confirm the top position of edible caterpillars (Lepidoptera, 50.8%) followed by Orthoptera (12.5%), Coleoptera and Hymenoptera (10.0% each). A total of 120 edible species (belonging to eighty genera, twenty-nine families and nine orders of insects) have been listed and mapped on a national scale. Likewise, host plants of edible insects have been inventoried after checking (using CoL, Plant Resources of Tropical Africa, and the International Union for Conservation of Nature's Red List of Threatened Species). The host plant diversity is dominated by multi-use trees belonging to Fabaceae (34.4%) followed by Phyllanthaceae (10.6%) and Meliaceae (4.9%). However, data indicated endangered (namely Millettia laurentii, Prioria balsamifera) or critically endangered (Autranella congolensis) host plant species that call for conservation strategies. To the best of our knowledge, aforementioned results are the very first reports of such findings in Africa.
      Moreover, given issues encountered during data compilation and during cross-checking of scientific names, a call was made for greater collaboration between local people and expert taxonomists (through citizen science), in order to unravel unidentified ethnospecies. Given the challenge of information technology infrastructure in Africa, such a target could be achieved thanks to mobile apps. Likewise, a further call should be made for:
      • better synchronization of taxonomic databases,
      • the need of qualitative scientific photographs in taxonomic databases, and
      • additional data (i.e., conservational status, proteins or DNA sequences notably) as edible insects need to be rigorously identified and durably managed.
      Indeed, these complementary data are very crucial, given the limitations and issues of conventional/traditional identification methods based on morphometric or dichotomous keys and the lack of voucher specimens in many African museums and/or collections. This could be achieved by QR (Quick Response) coding insect species and centralizing data about edible insects in a main authoritative taxonomic database whose role is undebatable, as edible insects are today earmarked as a nutrient-rich source of proteins, fat, vitamins and fiber to mitigate food insecurity and poor diets, which are an aggravating factor for the impact of COVID-19.
      [An illustrative sketch of checking names against the GBIF backbone follows this entry.]
      PubDate: Wed, 29 Sep 2021 10:00:00 +0300
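
      The review above relies on cross-checking reported names against authoritative taxonomic databases. The sketch below shows one hedged way to run such a check against the GBIF backbone taxonomy using GBIF's public species-match web service; the insect names are only illustrative examples, and network access is assumed.

```python
import requests

# Check a few reported edible-insect names against the GBIF backbone taxonomy.
# The endpoint is GBIF's public species-match API; fields of interest include
# matchType, scientificName and status.
names = ["Imbrasia epimethea", "Rhynchophorus phoenicis", "Macrotermes bellicosus"]

for name in names:
    r = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name, "strict": "false"},
        timeout=30,
    )
    r.raise_for_status()
    match = r.json()
    print(name, "->", match.get("matchType"), match.get("scientificName"), match.get("status"))
```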
       
  • Challenges in Curating 2D Multimedia Data in the Application of Machine
           Learning in Biodiversity Image Analysis

    • Abstract: Biodiversity Information Science and Standards 5: e75856
      DOI : 10.3897/biss.5.75856
      Authors : Yasin Bakış, Xiaojun Wang, Hank Bart : Over 1 billion biodiversity collection specimens ranging from fungi to fish to fossils are housed in more than 1,600 natural history collections across the United States. The digitization of these specimens has risen significantly within the last few decades and this is only likely to increase, as the use of digitized data gains more importance every day. Numerous experiments with automated image analysis have proven the practicality and usefulness of digitized biodiversity images by computational techniques such as neural networks and image processing. However, most of the computational techniques to analyze images of biodiversity collection specimens require a good curation of this data. One of the challenges in curating multimedia data of biodiversity collection specimens is the quality of the multimedia objects—in our case, two dimensional images. To tackle the image quality problem, multimedia needs to be captured in a specific format and presented with appropriate descriptors.
      In this study we present an analysis of two image repositories each consisting of 2D images of fish specimens from several institutions—the Integrated Digitized Biocollections (iDigBio) and the Great Lakes Invasives Network (GLIN). Approximately 70 thousand images have been processed from the GLIN repository and 450 thousand images have been processed from the iDigBio repository and their suitability assessed for use in neural network-based species identification and trait extraction applications. Our findings showed that images that came from the GLIN dataset were more successful for image processing and machine learning purposes. Almost 40% of the species have been represented with less than 10 images while only 20% have more than 100 images per species.
      We identified and captured 20 metadata descriptors that define quality and usability of the image. According to the captured metadata information, 70% of the GLIN dataset images were found to be useful for further analysis according to the overall image quality score. Quality issues with the remaining images included: curved specimens, non-fish objects in the images such as tags, labels and rocks that obstructed the view of the specimen; color, focus and brightness issues; folded or overlapping parts as well as missing parts.
      We used both the web interface and the API (Application Programming Interface) for downloading images from iDigBio. We searched for all fish genera, families and classes in three different searches with the images-only option selected. Then we combined all of the search results and removed duplicates. Our search on the iDigBio database for fish taxa returned approximately 450 thousand records with images. We narrowed this down to 90 thousand fish images aided by the multimedia metadata with the downloaded search results, excluding some non-fish images, fossil samples, X-ray and CT (computed tomography) scans and several others. Only 44% of these 90 thousand images were found to be suitable for further analysis.
      In this study, we discovered some of the limitations of biodiversity image datasets and built an infrastructure for assessing the quality of biodiversity images for neural network analysis. Our experience with the fish images gathered from two different image repositories has enabled describing image quality metadata features. With the help of these metadata descriptors, one can simply create a dataset for a desired image quality for the purpose of analysis. Likewise, the availability of the metadata descriptors will help advance our understanding of quality issues, while helping data technicians, curators and the other digitization staff be more aware of multimedia.
      [An illustrative sketch of filtering images by quality metadata follows this entry.]
      PubDate: Tue, 28 Sep 2021 14:15:00 +0300
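
      The study above selects images suitable for machine learning based on quality metadata. The sketch below illustrates that kind of filtering with pandas; the column names and the 0.7 threshold are hypothetical stand-ins, not the authors' 20-descriptor schema.

```python
import pandas as pd

# Toy table of per-image quality metadata (hypothetical columns).
images = pd.DataFrame([
    {"image_id": "glin-001", "quality_score": 0.91, "non_fish_objects": False, "curved_specimen": False},
    {"image_id": "glin-002", "quality_score": 0.45, "non_fish_objects": True,  "curved_specimen": False},
    {"image_id": "idig-003", "quality_score": 0.78, "non_fish_objects": False, "curved_specimen": True},
])

# Keep only images that clear an assumed quality threshold and have none of
# the obstruction issues mentioned in the abstract.
suitable = images[
    (images["quality_score"] >= 0.7)
    & ~images["non_fish_objects"]
    & ~images["curved_specimen"]
]
print(suitable["image_id"].tolist())  # e.g. ['glin-001']
```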
       
  • GBIF Data Processing and Validation

    • Abstract: Biodiversity Information Science and Standards 5: e75686
      DOI : 10.3897/biss.5.75686
      Authors : John Waller, Nikolay Volik, Federico Mendez, Andrea Hahn : GBIF (Global Biodiversity Information Facility) is the largest data aggregator of biological occurrences in the world. GBIF was officially established in 2001 and has since aggregated 1.8 billion occurrence records from almost 2000 publishers. GBIF relies heavily on Darwin Core (DwC) for organising the data it receives.
      GBIF Data Processing Pipelines
      Every single occurrence record that gets published to GBIF goes through a series of three processing steps until it becomes available on GBIF.org:
      • source downloading
      • parsing into verbatim occurrences
      • interpreting verbatim values
      Once all records are available in the standard verbatim form, they go through a set of interpretations. In 2018, GBIF processing underwent a significant rewrite in order to improve speed and maintainability. One of the main goals of this rewrite was to improve the consistency between GBIF's processing and that of the Living Atlases. In connection with this, GBIF's current data validator fell out of sync with GBIF pipelines processing.
      New GBIF Data Validator
      The current GBIF data validator is a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. By submitting a dataset to the validator, users can go through the validation and interpretation procedures usually associated with publishing in GBIF and quickly determine potential issues in data, without having to publish it. GBIF is planning to rework the current validator because the current validator does not exactly match current GBIF pipelines processing.
      Planned Changes
      • The new validator will match the processing of the GBIF pipelines project.
      • Validations will be saved and show up on user pages similar to the way downloads and derived datasets appear now (no more bookmarking validations!)
      • A downloadable report of issues found will be produced.
      Suggested Changes/Ideas
      One of the main guiding philosophies for the new validator user interface will be avoiding information overload. The current validator is often quite verbose in its feedback, highlighting data issues that may or may not be fixable or particularly important. The new validator will:
      • generate a map of record geolocations;
      • give users issues by order of importance;
      • give "What", "Where", "When" flags priority;
      • give some possible solutions or suggested fixes for flagged records.
      We see the hosted portal environment as a way to quickly implement a pre-publication validation environment that is interactive and visual.
      Potential New Data Quality Flags
      The GBIF team has been compiling a list of new data quality flags. Not all of the suggested flags are easy to implement, so GBIF cannot promise the flags will get implemented, even if they are a great idea. The advantage of the new processing pipelines is that almost any new data quality flag or processing step in pipelines will be available for the data validator. Easy new potential flags:
      • country centroid flag: Country/province centroids are a known data quality problem.
      • any zero coordinate flag: Sometimes publishers leave either the latitude or longitude field as zero when it should have been left blank or NULL.
      • default coordinate uncertainty in meters flag: Sometimes a default value or code is used for dwc:coordinateUncertaintyInMeters, which might indicate that it is incorrect. This is especially the case for values 301, 3036, 999, 9999.
      • no higher taxonomy flag: Often publishers will leave out the higher taxonomy of a record. This can cause problems for matching to the GBIF backbone taxonomy.
      • null coordinate uncertainty in meters flag: There has been some discussion that GBIF should encourage publishers more to fill in dwc:coordinateUncertaintyInMeters. This is because every record, even ones taken from a Global Positioning System (GPS) reading, has an associated dwc:coordinateUncertaintyInMeters.
      It is also nice when a data quality flag has an escape hatch, such that a data publisher can get rid of false positives or remove a flag through filling in a value.
      Batch-type validations that are doable for pipelines, but probably not in the validator, include:
      • outlier: Outliers are a known data quality problem. There are generally two types of outliers: environmental outliers and distance outliers. Currently GBIF does not flag either type of outlier.
      • record is sensitive species: A sensitive species would be a record where the species is considered vulnerable in some way. Usually this is due to poaching threat or the species is only found in one area.
      • gridded dataset: Rasterized or gridded datasets are common on GBIF. These are datasets where location information is pinned to a low-resolution grid. This is already available with an experimental API (Application Programming Interface).
      Conclusion
      Data quality and data processing are moving targets. Variable source data will always be an issue when aggregating large amounts of data. With GBIF's new processing architecture, we hope that new features and data quality flags can be added more easily. Time and staffing resources are always in short supply, so we plan to prioritise the feedback we give to publishers, in order for them to work on correcting the most important and fixable issues. With new GBIF projects like the vocabulary server, we also hope that GBIF data processing can have more community participation.
      [An illustrative sketch implementing a few of these proposed flags follows this entry.]
      PubDate: Mon, 27 Sep 2021 11:00:00 +0300
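
      As a rough illustration of the "easy" flags proposed above (not GBIF's implementation), the sketch below applies three of them to a Darwin Core-style occurrence record. The flag names and function are invented for the example; the Darwin Core field names and the suspicious default values 301, 3036, 999 and 9999 come from the abstract.

```python
# Suspicious default values for dwc:coordinateUncertaintyInMeters, per the abstract.
SUSPICIOUS_UNCERTAINTY_VALUES = {301, 3036, 999, 9999}

def flag_occurrence(record: dict) -> list[str]:
    """Return data-quality flags for one occurrence record (hypothetical flag names)."""
    flags = []
    if record.get("decimalLatitude") == 0 or record.get("decimalLongitude") == 0:
        flags.append("ZERO_COORDINATE")
    if record.get("coordinateUncertaintyInMeters") in SUSPICIOUS_UNCERTAINTY_VALUES:
        flags.append("DEFAULT_COORDINATE_UNCERTAINTY")
    if not any(record.get(rank) for rank in ("kingdom", "phylum", "class", "order", "family")):
        flags.append("NO_HIGHER_TAXONOMY")
    return flags

print(flag_occurrence({
    "scientificName": "Carex limosa",
    "decimalLatitude": 0,
    "decimalLongitude": 23.9,
    "coordinateUncertaintyInMeters": 9999,
}))
# -> ['ZERO_COORDINATE', 'DEFAULT_COORDINATE_UNCERTAINTY', 'NO_HIGHER_TAXONOMY']
```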
       
  • bddashboard: An infrastructure for biodiversity dashboards in R

    • Abstract: Biodiversity Information Science and Standards 5: e75684
      DOI : 10.3897/biss.5.75684
      Authors : Tomer Gueta, Rahul Chauhan, Thiloshon Nagarajah, Vijay Barve, Povilas Gibas, Martynas Jočys, Rahul Saxena, Sunny Dhoke, Yohay Carmel : The bdverse is a collection of packages that form a general framework for facilitating biodiversity science in R (programming language). Exploratory and diagnostic visualization can unveil hidden patterns and anomalies in data and allow quick and efficient exploration of massive datasets. The development of an interactive yet flexible dashboard that can be easily deployed locally or remotely is a highly valuable biodiversity informatics tool. To this end, we have developed 'bddashboard', which serves as an agile framework for biodiversity dashboard development. This project is built in R, using the Shiny package (RStudio, Inc 2021) that helps build interactive web apps in R. The following key components were developed:
      Core Interactive Components
      The basic building blocks of every dashboard are interactive plots, maps, and tables. We have explored all major visualization libraries in R and have concluded that 'plotly' (Sievert 2020) is the most mature and showcases the best value for effort. Additionally, we have concluded that 'leaflet' (Graul 2016) shows the most diverse and high-quality mapping features, and DT (DataTables library) (Xie et al. 2021) is best for rendering tabular data. Each component was modularized to better adjust it for biodiversity data and to enhance its flexibility.
      Field Selector
      The field selector is a unique module that makes each interactive component much more versatile. Users have different data and needs; thus, every combination or selection of fields can tell a different story. The field selector allows users to change the X and Y axis on plots, to choose the columns that are visible on a table, and to easily control map settings. All that in real-time, without reloading the page or disturbing the reactivity. The field selector automatically detects how many columns a plot needs and what type of columns can be passed to the X-axis or Y-axis. The field selector also displays the completeness of each field.
      Plot Navigation
      We developed the plot navigation module to prevent unwanted extreme cases. Technically, drawing 1,000 bars on a single bar plot is possible, but this visualization is not human-friendly. Navigation allows users to decide how many values they want to see on a single plot. This technique allows for fast drawing of extensive datasets without affecting page reactivity, dramatically improving performance and functioning as a fail-safe mechanism.
      Reactivity
      Reactivity creates the connection between different components. The changes in input values automatically flow to the plots, text, maps, and tables that use the input, and cause them to update. Reactivity facilitates drilling down functionality, which enhances the user’s ability to explore and investigate the data. We developed a novel and robust reactivity technique that allows us to add a new component and effectively connect it with all existing components within a dashboard tab, using only one line of code.
      Generic Biodiversity Tabs
      We developed five useful dashboard tabs (Fig. 1): (i) the Data Summary tab to give a quick overview of a dataset; (ii) the Data Completeness tab helps users get valuable information about missing records and missing Darwin Core fields; (iii) the Spatial tab is dedicated to spatial visualizations; (iv) the Taxonomic tab is designed to visualize taxonomy; and (v) the Temporal tab is designed to visualize time-related aspects.
      Performance and Agility
      To make a dashboard work smoothly and react quickly, hundreds of small and large modules, functions, and techniques must work together. Our goal was to minimize dashboard latency and maximize its data capacity. We used asynchronous modules to write non-blocking code, clusters in map components, and preprocessing and filtering data before passing it to plots to reduce the load. The 'bddashboard' package's modularized architecture allows us to develop completely different interactive and reactive dashboards within mere minutes.
      [An illustrative sketch of the plot-navigation idea follows this entry.]
      PubDate: Mon, 27 Sep 2021 11:00:00 +0300
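
      The "Plot Navigation" component described above caps how many values are drawn on a single plot and pages through the rest. The sketch below illustrates the same idea in Python with pandas (the project itself is written in R/Shiny); the function and column names are made up for the example.

```python
import math
import pandas as pd

def paginate_counts(series: pd.Series, per_page: int = 25, page: int = 1) -> pd.Series:
    """Return one 'page' of value counts instead of drawing every category at once."""
    counts = series.value_counts()
    pages = max(1, math.ceil(len(counts) / per_page))
    page = min(max(page, 1), pages)          # clamp to a valid page number
    start = (page - 1) * per_page
    return counts.iloc[start:start + per_page]

# Toy occurrence table; in a dashboard, only the returned page would be plotted.
occ = pd.DataFrame({"family": ["Cyperaceae", "Fabaceae", "Poaceae", "Fabaceae"] * 50})
print(paginate_counts(occ["family"], per_page=2, page=1))
```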
       
  • Internet of Samples: Progress report

    • Abstract: Biodiversity Information Science and Standards 5: e75797
      DOI : 10.3897/biss.5.75797
      Authors : Dave Vieglais, Stephen Richard, Hong Cui, Neil Davies, John Deck, Quan Gan, Eric Kansa, Sarah Whitcher Kansa, John Kunze, Danny Mandel, Christopher Meyer, Thomas Orrell, Sarah Ramdeen, Rebecca Snyder, Ramona Walls, Yuxuan Zhou, Kerstin Lehnert : Material samples form an important portion of the data infrastructure for many disciplines. Here, a material sample is a physical object, representative of some physical thing, on which observations can be made. Material samples may be collected for one project initially, but can also be valuable resources for other studies in other disciplines. Collecting and curating material samples can be a costly process. Integrating institutionally managed sample collections, along with those sitting in individual offices or labs, is necessary to facilitate large-scale evidence-based scientific research. Many have recognized the problems and are working to make data related to material samples FAIR: findable, accessible, interoperable, and reusable. The Internet of Samples (i.e., iSamples) is one of these projects. iSamples was funded by the United States National Science Foundation in 2020 with the following aims:
      • enable previously impossible connections between diverse and disparate sample-based observations;
      • support existing research programs and facilities that collect and manage diverse sample types;
      • facilitate new interdisciplinary collaborations; and
      • provide an efficient solution for FAIR samples, avoiding duplicate efforts in different domains (Davies et al. 2021).
      The initial sample collections that will make up the internet of samples include those from the System for Earth Sample Registration (SESAR), Open Context, the Genomic Observatories Meta-Database (GEOME), and Smithsonian Institution Museum of Natural History (NMNH), representing the disciplines of geoscience, archaeology/anthropology, and biology.
      To achieve these aims, the proposed iSamples infrastructure (Fig. 1) has two key components: iSamples in a Box (iSB) and iSamples Central (iSC). The iSC component will be a permanent Internet service that preserves, indexes, and provides access to sample metadata aggregated from iSBs. It will also ensure that persistent identifiers and sample descriptions assigned and used by individual iSBs are synchronized with the records in iSC and with identifier authorities like International Geo Sample Number (IGSN) or Archival Resource Key (ARK). The iSBs create and maintain identifiers and metadata for their respective collection of samples. While providing access to the samples held locally, an iSB also allows iSC to harvest its metadata records. The metadata modeling strategy adopted by the iSamples project is a metadata profile-based approach, where core metadata fields that are applicable to all samples form the core metadata schema for iSamples. Each individual participating collection is free to include additional metadata in their records, which will also be harvested by iSC and are discoverable through the iSC user interface or APIs (Application Programming Interfaces), just like the core. In-depth analysis of metadata profiles used by participating collections, including Darwin Core, has resulted in an iSamples core schema currently being tested and refined through use. See the current version of the iSamples core schema.
      A number of properties require a controlled vocabulary. Controlled vocabularies used by existing records are kept, while new vocabularies are also being developed to support high-level grouping with consistent semantics across collection types. Examples include vocabularies for Context Category, Material Category, and Specimen Type (Table 1). These vocabularies were also developed in a bottom-up manner, based on the terms used in the existing collections. For each vocabulary, a decision tree graph was created to illustrate relations among the terms, and a card sorting exercise was conducted within the project team to collect feedback. Domain experts are invited to take part in this exercise here, here, and here. These terms will be used as upper-level terms to the existing category terms used in the participating collections and hence create connections among individual participating collections.
      iSamples project members are also active in the TDWG Material Sample Task Group and the global consultation on Digital Extended Specimens. Many members of the iSamples project also lead or participate in a sister research coordination network (RCN), Sampling Nature. The goal of this RCN is to develop and refine metadata standards and controlled vocabularies for the iSamples and other projects focusing on material samples. We cordially invite you to participate in the Sampling Nature RCN and help shape the future standards for material samples. Contact Sarah Ramdeen (sramdeen@ideo.columbia.edu) to engage with the RCN.
      [An illustrative sketch of the core-plus-extension metadata idea follows this entry.]
      PubDate: Mon, 27 Sep 2021 10:45:00 +0300
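
      The abstract above describes a core metadata schema shared by all samples plus collection-specific extensions harvested alongside it. The sketch below illustrates that split on a single record; the field names are hypothetical stand-ins, not the actual iSamples core schema linked from the project.

```python
# Hypothetical "core" field names for illustration only.
CORE_FIELDS = {"@id", "label", "description", "hasContextCategory",
               "hasMaterialCategory", "hasSpecimenCategory", "registrant"}

def split_core_and_extensions(record: dict):
    """Separate core metadata from collection-specific extensions and report missing core fields."""
    core = {k: v for k, v in record.items() if k in CORE_FIELDS}
    extensions = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    missing = CORE_FIELDS - record.keys()
    return core, extensions, missing

sample = {
    "@id": "ark:/99999/fk4example",   # ARK test namespace, illustrative identifier
    "label": "Basalt core section",
    "hasMaterialCategory": "rock",
    "collector": "J. Doe",            # extension field kept, harvested alongside the core
}
core, extensions, missing = split_core_and_extensions(sample)
print(core, extensions, sorted(missing), sep="\n")
```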
       
  • Open Forest Data: Digitalizing and building an online repository

    • Abstract: Biodiversity Information Science and Standards 5: e75783
      DOI : 10.3897/biss.5.75783
      Authors : Beata Bramorska : Poland is characterised by a relatively high variety of living organisms attributed to terrestrial and water environments. Currently, close to 57,000 species of living organisms are described that occur in Poland (Symonides 2008), including lowland and mountain species, those attributed to oceanic and continental areas, as well as species from forested and open habitats. Poland comprehensively represents biodiversity of living organisms on a continental scale and thus, is considered to have an important role for biodiversity maintenance.
      The Mammal Research Institute of Polish Academy of Sciences (MRI PAS), located in Białowieża Forest, a UNESCO Heritage Site, has been collecting biodiversity data for 90 years. However, a great amount of data gathered over the years, especially old data, is gradually being forgotten and hard to access. Old catalogues and databases have never been digitalized or publicly shared, and not many Polish scientists are aware of the existence of such resources, not to mention the rest of the scientific world.
      Recognizing the need for an online, interoperable platform, following FAIR data principles (findable, accessible, interoperable, reusable), where biodiversity and scientific data can be shared, MRI PAS took a lead in creation of an Open Forest Data (OFD) repository. OpenForestData.pl (Fig. 1) is a newly created (2020) digital repository, designed to provide access to natural sciences data and provide scientists with an infrastructure for storing, sharing and archiving their research outcomes. Creating such a platform is a part of an ongoing development of life sciences in Poland, aiming for an open, modern science, where data are published as free-access. OFD also allows for the consolidation of natural science data, enabling the use and processing of shared data, including API (Application Programming Interface) tools. OFD is indexed by the Directory of Open Repositories (OpenDOAR) and Registry of Research Data Repositories (re3data).
      The OFD platform is based entirely on reliable, globally recognized open source software: DATAVERSE, an interactive database app which supports sharing, storing, exploration, citation and analysis of scientific data; GEONODE, a content management geospatial system used for storing, publicly sharing and visualising vector and raster layers; GRAFANA, a system meant for storing and analysis of metrics and large scale measurement data, as well as visualisation of historical graphs at any time range and analysis for trends; and external tools for database storage (Orthanc) and data visualisation (the Orthanc plugin Osimis Web Viewer and Online 3D Viewer, https://3dviewer.net/), which were integrated with the system mechanism of Dataverse. Furthermore, according to the need for specimen description, the Darwin Core (Wieczorek et al. 2012) metadata schema was decided to be the most suitable for specimen and collections description and mapped into a Dataverse additional metadata block. The use of Darwin Core is based on the same file format, the Darwin Core Archive (DwC-A), which allows for sharing data using common terminology and provides the possibility for easy evaluation and comparison of biodiversity datasets. It allows the contributors to OFD to optionally choose Darwin Core for object descriptions, making it possible to share biodiversity datasets in a standardized way for users to download, analyse and compare.
      Currently, OFD stores more than 10,000 datasets and objects from the collections of Mammal Research Institute of Polish Academy of Sciences and Forest Science Institute of Białystok University of Technology. The objects from natural collections were digitalized, described, catalogued and made public in free-access. OFD manages seven types of collection materials:
      • 3D and 2D scans of specimens in the Herbarium, Fungarium, Insect and Mammal Collections,
      • images from microscopes (including stereoscopic and scanning electron microscopes),
      • morphometric measurements,
      • computed tomography and microtomography scans in the Mammal Collection,
      • mammal telemetry data,
      • satellite imagery, geospatial climatic and environmental data,
      • georeferenced historical maps.
      In the OFD repository, researchers have the possibility to share data in a standardized way, which nowadays is often a requirement during the publishing process of a scientific article. Beside scientists, OFD is designed to be open and free for students and specialists in nature protection, but also for officials, foresters and nature enthusiasts. Creation of the OFD repository supports the development of citizen science in Poland, increases visibility and access to published data, improves scientific collaboration, exchange and reuse of data within and across borders.
      [An illustrative sketch of a Darwin Core field mapping follows this entry.]
      PubDate: Mon, 27 Sep 2021 10:45:00 +0300
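
      OFD lets contributors describe objects with Darwin Core. The sketch below shows a minimal mapping from hypothetical local field names onto real Darwin Core terms; it illustrates the idea only and is not OFD's Dataverse metadata block.

```python
# Left-hand keys are hypothetical local field names; right-hand values are
# standard Darwin Core terms.
DWC_MAPPING = {
    "species_name": "dwc:scientificName",
    "collection_date": "dwc:eventDate",
    "lat": "dwc:decimalLatitude",
    "lon": "dwc:decimalLongitude",
    "catalogue_no": "dwc:catalogNumber",
    "collector": "dwc:recordedBy",
}

def to_darwin_core(local_record: dict) -> dict:
    """Rename mapped fields to Darwin Core terms; unmapped fields are dropped here."""
    return {DWC_MAPPING[k]: v for k, v in local_record.items() if k in DWC_MAPPING}

print(to_darwin_core({
    "species_name": "Bison bonasus",
    "collection_date": "1967-05-12",
    "lat": 52.70,
    "lon": 23.87,
    "catalogue_no": "MRI-12345",
}))
```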
       
  • Architectural Pattern: Study of orchid architecture using tools to take
           quick measurements of virtual specimens

    • Abstract: Biodiversity Information Science and Standards 5: e75752
      DOI : 10.3897/biss.5.75752
      Authors : Aurore Gourraud, Régine Vignes Lebbe, Adeline Kerner, Marc Pignal : The joint use of two tools applied to plant description, XPER3 and Recolnat Annotate, made it possible to study vegetative architectural patterns (Fig. 1) of the Dendrobium (Orchidaceae) in New Caledonia defined by N. Hallé (1977). This approach is not directly related to taxonomy, but to the definition of sets of species grouped according to a growth pattern. In the course of this work, the characters stated by N. Hallé were analysed and eventually amended to produce a data matrix and generate an identification key.
      Study materials: Dendrobium Sw. in New Caledonia
      New Caledonia is an archipelago in the Pacific Ocean, a French overseas territory located east of Australia. It is one of the 36 biodiversity hotspots in the world. The genus Dendrobium Sw. sensu lato is one of the largest in the family Orchidaceae and contains over 1220 species. In New Caledonia, it includes 46 species. In his revision of the family, N. Hallé (1977) defined 14 architectural groups, into which he divided the 31 species known at that time. These models are based on those defined by F. Hallé and Oldeman (1970). But they are clearly intended to group species together for identification purposes.
      Architectural pattern
      A pattern is a set of vegetative or reproductive characters that define the general shape of an individual. Developed by mechanisms linked to the dominance of the terminal buds, the architectural groups are differentiated by the arrangement of the leaves, the position of the inflorescences or the shape of the stem (Fig. 1). Plants obeying a given pattern do not necessarily have phylogenetic relationships. These models have a useful application in the field for identifying groups of plants. Monocotyledonous plants, and in particular the Orchidaceae, lend themselves well to this approach, which produces stable architectural patterns.
      Recolnat Annotate
      Recolnat Annotate is a free tool for observing qualitative features and making physical measurements (angle, length, area) of images. It can be used offline and downloaded from https://www.recolnat.org/en/annotate. The software is based on setting up observation projects that group together a batch of herbarium images to be studied, associating it with a descriptive model. A file of measurements can be exported in comma separated value (csv) format for further analysis (Fig. 2).
      XPER3
      Usually used in the context of systematics in which the items studied are taxa, XPER3 can also be used to distinguish architectural groups that are not phylogenetically related. Developed by the Laboratoire d'Informatique et Systématique (LIS) of the Institut de Systématique, Evolution, Biodiversité in Paris, XPER3 is an online collaborative platform that allows the editing of descriptive data (https://www.xper3.fr/?language=en). This tool allows the cross-referencing of items (in this case architectural groups) and descriptors (or characters). It allows the development of free-access identification keys (that is, without a fixed sequence of identification steps). The latter can be used directly online. But it also offers to produce single-access keys, with or without using character weighting and dependencies between characters.
      Links between XPER3 and Recolnat Annotate
      The descriptive model used by Recolnat Annotate can be developed within the framework of XPER3, which provides for characters and character states. Thus the observations made by the Recolnat Annotate measurement tool can be integrated into the XPER3 platform. Specimens can then be compared, or several descriptions can be merged to express the description of a species (Fig. 3).
      Results
      The joint use of XPER3 and Recolnat Annotate to manage both herbarium specimens and architectural patterns has proven to be relevant. Moreover, the measurements on the virtual specimens are fast and reliable. N. Hallé (1977) had produced a dichotomous single-access key that allowed the identification and attribution of a pattern to a plant observed in the field or in a herbarium. The project to build a polytomous and interactive key with XPER3 required completing the observations to give a status for each character of each vegetative architectural model. Recolnat Annotate was used to produce observations from the herbarium network in France. The use of XPER3 has allowed us to redefine these models in the light of new data from the herbaria and to publish the interactive key available at dendrobium-nc.identificationkey.org.
      [An illustrative sketch of summarising measurement exports into character states follows this entry.]
      PubDate: Mon, 27 Sep 2021 10:45:00 +0300
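
      Recolnat Annotate exports measurements as CSV for further analysis, while XPER3 works with discrete character states. The sketch below illustrates one way to bridge the two: summarising a hypothetical measurements.csv (columns specimen, character, value_mm) into coarse states around an assumed threshold. The column names and the threshold are not taken from the paper.

```python
import csv
from statistics import mean

def summarize(csv_path: str, character: str = "stem length", short_max_mm: float = 150.0) -> dict:
    """Average the measurements per specimen for one character and discretise them."""
    values: dict[str, list[float]] = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["character"] == character:
                values.setdefault(row["specimen"], []).append(float(row["value_mm"]))
    # One state per specimen: 'short' vs 'long' around the assumed threshold.
    return {spec: ("short" if mean(v) <= short_max_mm else "long")
            for spec, v in values.items()}

print(summarize("measurements.csv"))
```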
       
  • Author-Driven Computable Data and Ontology Production for Taxonomists

    • Abstract: Biodiversity Information Science and Standards 5: e75741
      DOI : 10.3897/biss.5.75741
      Authors : Hong Cui, Bruce Ford, Julian Starr, James Macklin, Anton Reznicek, Noah Giebink, Dylan Longert, Étienne Léveillé-Bourret, Limin Zhang : It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and of translating keywords to ontology concepts in automated information extraction, makes it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component, and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform. The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken (we also report on how a custom color palette of RGB values, obtained from real specimens or high-quality specimen images, can be used to help authors choose standardized color descriptions for plant specimens); and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies. HTML XML PDF
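As a concrete illustration of the kind of computable output described above, the sketch below shows one Entity-Quality style character statement expressed as RDF with Python's rdflib. It is only a minimal sketch: the ontology namespace, property names and example values are invented placeholders, not the Character Recorder project's actual vocabulary.

```python
# Minimal sketch only: one Entity-Quality style character statement as RDF (rdflib).
# The namespaces, property names and values below are invented placeholders,
# not the Character Recorder project's actual ontology.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/carex-ontology#")   # hypothetical project ontology
OBS = Namespace("http://example.org/observations/")

g = Graph()
g.bind("ex", EX)

stmt = OBS["obs-001"]
g.add((stmt, RDF.type, EX.CharacterObservation))
g.add((stmt, EX.taxon, Literal("Carex scoparia")))
g.add((stmt, EX.entity, EX.perigynium_beak))   # the "Entity" part of the statement
g.add((stmt, EX.quality, EX.length))           # the "Quality" part of the statement
g.add((stmt, EX.valueInMillimeters, Literal(1.8)))

print(g.serialize(format="turtle"))
```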
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • The Digital Extended Specimen will Enable New Science and Applications

    • Abstract: Biodiversity Information Science and Standards 5: e75736
      DOI : 10.3897/biss.5.75736
      Authors : Michael Webster, Jutta Buschbom, Alex Hardisty, Andrew Bentley : Specimens have long been viewed as critical to research in the natural sciences because each specimen captures the phenotype (and often the genotype) of a particular individual at a particular point in space and time. In recent years there has been considerable focus on digitizing the many physical specimens currently in the world’s natural history research collections. As a result, a growing number of specimens are each now represented by their own “digital specimen”, that is, a findable, accessible, interoperable and re-usable (FAIR) digital representation of the physical specimen, which contains data about it. At the same time, there has been growing recognition that each digital specimen can be extended, and made more valuable for research, by linking it to data/samples derived from the curated physical specimen itself (e.g., computed tomography (CT) scan imagery, DNA sequences or tissue samples), directly related specimens or data about the organism's life (e.g., specimens of parasites collected from it, photos or recordings of the organism in life, immediate surrounding ecological community), and the wide range of associated specimen-independent data sets and model-based contextualisations (e.g., taxonomic information, conservation status, bioclimatological region, remote sensing images, environmental-climatological data, traditional knowledge, genome annotations). The resulting connected network of extended digital specimens will enable new research on a number of fronts, and indeed this has already begun. The new types of research enabled fall into four distinct but overlapping categories. First, because the digital specimen is a surrogate, acting on the Internet for a physical specimen in a natural science collection, it is amenable to analytical approaches that are simply not possible with physical specimens. For example, digital specimens can serve as training, validation and test sets for predictive process-based or machine learning algorithms, which are opening new doors of discovery and forecasting. Such sophisticated and powerful analytical approaches depend on FAIR principles, and on extended digital specimen data being as open as possible. These analytical approaches yield biodiversity monitoring outputs that are critically needed by the biodiversity community because they are central to conservation efforts at all levels of analysis, from genetic to species to ecosystem diversity. Second, linking specimens to closely associated specimens (potentially across multiple disparate collections) allows for the coordinated co-analysis of those specimens. For example, linking specimens of parasites/pathogens to specimens of the hosts from which they were collected allows for a powerful new understanding of coevolution, including pathogen range expansion and shifts to new hosts. Similarly, linking specimens of pollinators, their food plants, and their predators can help untangle complex food webs and multi-trophic interactions. Third, linking derived data to their associated voucher specimens increases information richness, density, and robustness, thereby allowing for novel types of analyses, strengthening validation through linked independent data and thus improving confidence levels and risk assessment. For example, digital representations of specimens, which incorporate, e.g., images, CT scans, or vocalizations, may capture important information that otherwise is lost during preservation, such as coloration or behavior. In addition, permanently linking genetic and genomic data to the specimen of the individual from which they were derived (something that is currently done inconsistently) allows for detailed studies of the connections between genotype and phenotype. Furthermore, persistent links of additional information and associated transactions to physical specimens are the building blocks of documentation and preservation of chains of custody. The links will also facilitate data cleaning, updating, and maintenance of digital specimens and their derived and associated datasets, with ever-expanding research questions and applied uses materializing over time. The resulting high-quality data resources are needed for fact-based decision-making and forecasting based on monitoring, forensics and prediction workflows in conservation, sustainable management and policy-making. Finally, linking specimens to diverse but associated datasets allows for detailed, often transdisciplinary, studies of topics ranging from local adaptation, through the forces driving range expansion and contraction (critically important to our understanding of the consequences of climate change), to social vectors in disease transmission. A network of extended digital specimens will enable new and critically important research and applications in all of these categories, as well as science and uses that we cannot yet envision. HTML XML PDF
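To make the linking idea above concrete, here is a purely illustrative sketch of an extended digital specimen as a small linked record; the field names and identifiers are invented for illustration and do not follow openDS or any other formal specification.

```python
# Purely illustrative: an "extended digital specimen" as a small linked record.
# Every field name and identifier below is invented; this is not a formal schema.
digital_specimen = {
    "id": "https://example.org/ds/ABC123",              # placeholder persistent identifier
    "physicalSpecimenId": "EXAMPLE-MUSEUM-000123",
    "scientificName": "Parus major",
    "derivedData": {                                     # data derived from the specimen itself
        "ctScan": "https://example.org/media/ctscan-987",
        "dnaSequence": "https://example.org/sequences/seq-42",
        "tissueSample": "https://example.org/samples/tis-55",
    },
    "relatedSpecimens": [                                # directly related specimens
        {"relation": "parasiteCollectedFrom", "id": "https://example.org/ds/XYZ789"},
    ],
    "associatedData": {                                  # specimen-independent context
        "taxonomy": "https://example.org/taxa/parus-major",
        "conservationStatus": "Least Concern",
    },
}

# Following the links is then ordinary graph traversal:
for kind, ref in digital_specimen["derivedData"].items():
    print(kind, "->", ref)
```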
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • ecoTeka, Urban Forestry Data Management

    • Abstract: Biodiversity Information Science and Standards 5: e75705
      DOI : 10.3897/biss.5.75705
      Authors : Mathias Aloui, Gaëtan Duhamel, Manon Frédout, Olivier Rovellotti : It is now well known that a healthy urban ecosystem is a crucial element for healthier citizens (Astell-Burt and Feng 2019), better air (Ning et al. 2016) and water quality (Livesley et al. 2016), and overall, a more resilient urban environment (Huff et al. 2020). With ecoTeka, an open-source platform for tree management, we leverage the power of OpenStreetMap (Mooney 2015), Mapillary, and open data to allow decision makers to improve their urban forestry practices. To have the most comprehensive data about these ecosystems, we plan to use all available sources, from satellite imagery to LIDAR (light detection and ranging), and process them with the DeepForest (Weinstein et al. 2020) deep learning algorithm. We have also teamed up with the French government to build an open standard for tree data to improve the interoperability of the system. Finally, we calculate a Shannon-Wiener diversity index (used by ecologists to estimate species diversity from the relative abundance of species in a habitat) to inform decision-making about urban ecosystems. HTML XML PDF
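For reference, the Shannon-Wiener index mentioned above is H' = -sum over species of p_i ln(p_i), where p_i is the relative abundance of species i. The snippet below is a minimal illustration of that calculation, not ecoTeka's actual implementation; the tree inventory is invented.

```python
# Minimal illustration (not ecoTeka's code): Shannon-Wiener diversity index
# H' = -sum_i p_i * ln(p_i), where p_i is the relative abundance of species i.
import math
from collections import Counter

def shannon_wiener(counts):
    """counts: per-species individual counts, e.g., street trees per species."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# Invented example inventory with three species:
inventory = Counter({"Platanus x hispanica": 120, "Tilia cordata": 45, "Acer platanoides": 35})
print(round(shannon_wiener(inventory.values()), 3))
```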
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • ecoBalade, Reconnecting People to Their Bioregion

    • Abstract: Biodiversity Information Science and Standards 5: e75706
      DOI : 10.3897/biss.5.75706
      Authors : Lilya Dif, Eric Woloch, Olivier Rovellotti : Many studies have identified that people, the stories they tell (Prévot-Julliard et al. 2014) and the products they buy (Kesebir and Kesebir 2017), are getting more and more disconnected from nature. As a side effect, it is getting harder to understand the complexity of conservation issues (Zhang et al. 2014). The result (Pyle 2003) is an inexorable cycle of disconnection, apathy, and progressive depletion of awareness. Even though remarkable progress has been made by software technologies to help us give names to plants, birds and animals, we need a deeper connection to our environment. By permitting exploration of an ecopath and the surrounding species (thanks to identification keys), the new version of ecoBalade aims to reconnect people to one another and to nature. It provides a new way for local authorities to put their natural heritage in the spotlight. We also believe that this new version will showcase the local bioregions (Pezzoli and Leiter 2016, Ebach et al. 2013) and will provide a key to understanding the interconnections of local biodiversity. HTML XML PDF
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • GeoNature, Open-Source FAIR Biodiversity Data Management

    • Abstract: Biodiversity Information Science and Standards 5: e75704
      DOI : 10.3897/biss.5.75704
      Authors : Adrien Pajot, Aurélie Jambon, Camille Monchicourt, Olivier Rovellotti : Huge improvements have been made throughout the years in collecting and standardising biodiversity data (Bisby 2000, Osawa 2019, Hardisty and Roberts 2013) and in overhauling how to make information in the field of biodiversity data management more FAIR (Findable, Accessible, Interoperable, Reusable) (Simons 2021), but there is still room for improvement. Most professionals working in protected areas, conservation groups, and research organisations lack the required know-how to improve the reuse ratio of their data. GeoNature and GeoNature-Atlas (Monchicourt 2018, Corny et al. 2019) are a set of open-source software tools that facilitate data collection, management, validation, sharing (e.g., via the Darwin Core standard) and visualisation. It is a powerful case study of collaborative work, which includes teams from the private and public sectors, with at least fifteen national parks and forty other organisations currently using and contributing to the package in France and Belgium (view it on GitHub). HTML XML PDF
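As an illustration of the Darwin Core sharing step mentioned above, the sketch below maps one field observation onto a few standard Darwin Core terms. The term names are real DwC terms; the source record, its field names and its values are invented, and this is not GeoNature's internal data model.

```python
# Illustrative only: mapping a field observation to Darwin Core (DwC) terms.
# The DwC term names are standard; the source record and values are invented,
# and this is not GeoNature's internal schema.
observation = {
    "species": "Gypaetus barbatus",
    "observer": "J. Dupont",
    "date": "2021-06-14",
    "lat": 44.92,
    "lon": 6.36,
    "count": 2,
}

dwc_record = {
    "scientificName": observation["species"],
    "recordedBy": observation["observer"],
    "eventDate": observation["date"],
    "decimalLatitude": observation["lat"],
    "decimalLongitude": observation["lon"],
    "individualCount": observation["count"],
    "basisOfRecord": "HumanObservation",
}
print(dwc_record)
```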
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • TreatmentBank: Plazi's strategies and its implementation to most
           efficiently liberate data from scholarly publications 

    • Abstract: Biodiversity Information Science and Standards 5: e75690
      DOI : 10.3897/biss.5.75690
      Authors : Marcus Guidoti, Carolina Sokolowicz, Felipe Simoes, Valdenar Gonçalves, Tatiana Ruschel, Diego Alvares, Donat Agosti : Plazi's TreatmentBank is a research infrastructure and partner of the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project to provide a single knowledge portal to open, interlinked and machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi is liberating published biodiversity data that is trapped in so-called flat formats, such as the portable document format (PDF), to increase its FAIRness. This can pose a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges that the immense variety of graphic layouts existing in the biodiversity publishing landscape can impose. These challenges may vary according to the origin of the document: scanned documents that were not initially digital need optical character recognition in order to be processed. Processing a document can either be an individual, one-time-only process, or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process, and lead to more consistent results than manual extraction. In order to produce such templates, a set of tests and accompanying statistics are evaluated, and these same statistics are constantly checked against ongoing processing tasks in order to assess the template performance in a continuous manner. In addition to these steps that are intrinsically associated with the automated process, different granularity levels (e.g., a low granularity level might consist of a treatment and its subsections, versus a high granularity level that includes material citations down to named entities such as collection codes, collector, and collecting date) were defined to accommodate specific needs for particular projects and user requirements. The higher the granularity level, the more thoroughly checked the resulting data are expected to be. Additionally, steps related to quality control (qc), such as the “pre-qc”, “qc” and “extended qc”, were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these different stages of the processing workflow are constantly being collected and assessed in order to improve these very same stages, aiming for a more reliable and efficient operation. This is also associated with a current Data Architecture plan to move this data assessment to a cloud provider to promote real-time assessment and constant analyses of template performance and processing stages as a whole. In this talk, the steps of this entire process are explained in detail, highlighting how data are being used to improve these steps towards a more efficient, accurate, and less costly operation. HTML XML PDF
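To give a feel for what such a template's "set of parameters" might contain, here is a purely illustrative sketch; the keys, patterns and values are invented for illustration and are not Plazi's actual template format.

```python
# Purely illustrative: a parameter set for batch-processing one journal layout.
# The keys, patterns and values are invented; this is not Plazi's real template format.
template = {
    "journal": "Example Journal of Systematics",
    "requires_ocr": False,                        # scanned (non-born-digital) PDFs need OCR first
    "metadata": {
        "title_region": {"page": 1, "bbox": [50, 700, 545, 780]},   # where to read the article title
        "doi_pattern": r"10\.\d{4,9}/\S+",
    },
    "treatment_markers": {
        "start_pattern": r"^[A-Z][a-z]+ [a-z]+ (sp\. nov\.|comb\. nov\.)",
        "section_headings": ["Material examined", "Diagnosis", "Distribution"],
    },
    "granularity": "material-citations",          # how deeply to atomize the extracted data
}

def describe(t):
    print(f"Template for {t['journal']}: OCR={t['requires_ocr']}, granularity={t['granularity']}")

describe(template)
```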
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • pyOpenSci: Open and reproducible research, powered by Python

    • Abstract: Biodiversity Information Science and Standards 5: e75688
      DOI : 10.3897/biss.5.75688
      Authors : Michael Trizna, Leah Wasser, David Nicholson : pyOpenSci (short for Python Open Science), funded by the Alfred P. Sloan Foundation, is building a diverse community that supports well documented, open source Python software that enables open reproducible science. pyOpenSci will work with the community to openly develop best practice guidelines and open standards for scientific Python software, which will be reinforced through a community-led peer review process and training. Packages that complete the peer review process become a part of the pyOpenSci ecosystem, where maintenance can be shared to ensure longevity and stability in code. pyOpenSci packages are also eligible for a “fast tracked” acceptance to JOSS (Journal of Open Source Software). In addition, we provide review for open science tools that would be of interest to TDWG members but are not within scope for JOSS, such as API (Application Programming Interface) wrappers. pyOpenSci is built on top of the successful model of rOpenSci, founded in 2011, which has fostered the development of several useful biodiversity informatics R packages. The pyOpenSci team looks to follow the lessons learned by rOpenSci to create a similarly successful community. We invite TDWG members developing open source software tools in Python to become part of the pyOpenSci community. HTML XML PDF
      PubDate: Mon, 27 Sep 2021 10:45:00 +030
       
  • CitSci.org & PPSR Core: Sharing biodiversity observations across
           platforms

    • Abstract: Biodiversity Information Science and Standards 5: e75666
      DOI : 10.3897/biss.5.75666
      Authors : Brandon Budnicki, Gregory Newman : CitSci.org is a global citizen science software platform and support organization housed at Colorado State University. The mission of CitSci is to help people do high quality citizen science by amplifying impacts and outcomes. This platform hosts over one thousand projects and a diverse volunteer base that has amassed over one million observations of the natural world, focused on biodiversity and ecosystem sustainability. It is a custom platform built using open source components, including PostgreSQL, Symfony and Vue.js, with React Native for the mobile apps. CitSci sets itself apart from other Citizen Science platforms through the flexibility in the types of projects it supports rather than having a singular focus. This flexibility allows projects to define their own datasheets and methodologies. The diversity of programs we host motivated us to take a founding role in the design of the PPSR Core, a set of global, transdisciplinary data and metadata standards for use in Public Participation in Scientific Research (Citizen Science) projects. Through an international partnership between the Citizen Science Association, the European Citizen Science Association, and the Australian Citizen Science Association, the PPSR team and associated standards enable interoperability of citizen science projects, datasets, and observations. Here we share our experience over the past 10+ years of supporting biodiversity research both as developers of the CitSci.org platform and as stewards of, and contributors to, the PPSR Core standard. Specifically, we share details about: the origin, development, and informatics infrastructure for CitSci; our support for biodiversity projects such as population and community surveys; our experiences in platform interoperability through PPSR Core, working with the Zooniverse, SciStarter, and CyberTracker; data quality; and data sharing goals and use cases. We conclude by sharing overall successes, limitations, and recommendations as they pertain to trust and rigor in citizen science data sharing and interoperability. As the scientific community moves forward, we show that Citizen Science is a key tool for enabling a systems-based approach to ecosystem problems. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Components of a Digital Specimen Architecture for Biological Collections

    • Abstract: Biodiversity Information Science and Standards 5: e75655
      DOI : 10.3897/biss.5.75655
      Authors : Aimee Stewart : In 2020, we began developing software components for an Application Programming Interface (API)-based integration architecture (the “Specify Network”) to leverage the global footprint of the Specify 7 collections management platform (www.specifysoftware.org) and the analytical services of the Lifemapper (lifemapper.org) and Biotaphy (biotaphy.org) Projects. The University of Kansas Lifemapper Project is a community gateway for species distribution and macroecological modeling. The Biotaphy Project, an extension of Lifemapper, is the product of a six-year, U.S. National Science Foundation-funded collaboration among researchers at the Universities of Michigan, Florida, and Kansas. Biotaphy's primary scope is to use big data methods and high-performance computing to integrate species occurrence data with phylogenetic and biogeographic data sets for large taxonomic and spatial scale analyses. Our initial integrations between Biotaphy and the Specify Network enable Specify users to easily discover remote information related to the specimens in their collection. The widely discussed digital specimen architecture being championed by DiSSCo (Distributed System of Scientific Collections, www.dissco.eu) and others (https://bit.ly/3jfsAgz) will change data communications between biodiversity collections and the broader biodiversity data community. Those network interactions will evolve from being predominantly one-way, batch-oriented transfers of information from museums to aggregators, to an n-way communications topology that will make specimen record discovery, updates and usage much easier to accomplish. But museum specimens and their catalogs will no longer be an intellectual endpoint of species documentation. Rather, records in collections management systems will increasingly serve as a point of departure for data synthesis, which takes place outside of institutional data domains, and which will overlay the legacy role of museums as authoritative sources of information about the diversity and distribution of life on Earth. Biological museum institutions will continue to play a vital role as the foundation of a global data infrastructure connecting aggregators, collaborative databases, analysis engines, journal publishers, and data set archives. In this presentation, we will provide an update on the components and capabilities that make up integrations in the Specify Network as an exemplar of the global architecture envisaged by the biodiversity research community. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Biodiversity 4.0: Standardizing biodiversity protocols for the private
           sector

    • Abstract: Biodiversity Information Science and Standards 5: e75652
      DOI : 10.3897/biss.5.75652
      Authors : Andre Acosta, Valeria Tavares, Guilherme Oliveira, Leonardo Trevelin, Ronnie Alves, Nubia Marques, Tereza Giannini : Ensuring the preservation of biodiversity is essential for humankind, as the ecosystem services it provides are directly linked to human well-being and health. The private sector has increasingly recognized the need to achieve Environmental, Social, and Corporate Governance (ESG) through measurable indicators and effective data collection (Rashed 2021). Extensive field research is often needed for private sector initiatives to generate socio-economic and environmental assessments, which usually requires hiring service providers. Regarding environmental and biodiversity information collections, the wide variety of data requires service providers to be specialized in many types of information, and therefore able to collect data on fauna and flora, soil and its microorganisms, genetic and evolutionary data, monitoring of the climate, conservation, and restoration areas, among many others. Long-term monitoring, a generally common demand for the private sector (e.g., Shackelford 2018), also relies on collecting various types of data that are often surveyed, gathered, and stored in a non-standardized fashion. The lack of data standardization makes it difficult to integrate information into central databases (Henle 2013), creating a new demand to extract and convert data from different reports, which is often time- and energy-consuming, and cost-ineffective. This task is generally conducted by non-specialists and may result in misinterpretation and digitization failures, compromising information quality. The digital standardization of data is a key solution for solving these problems (Kuhl 2020), increasing efficiency in the collection, curation, and sharing of data, improving the quality and accuracy of the information, and reducing the risk of misinterpretation. The primary advantage is that the same professional who collects the data will digitize it into a common database. The direct population of raw information into the database eliminates intermediate data conversion steps, optimizing quality. Here, we propose to generate a protocol for data collection in our institution (from the field, labs, museums, and herbaria). This protocol is based on consolidated data standards, namely Darwin Core (DwC). DwC is a glossary of terms that aims to standardize biodiversity information, which enables sharing data publicly. However, we are also creating new customized terms, classes, and respective metadata, such as species interactions, primarily to meet our need for long-term monitoring and assessments that are not covered by standard repositories. To assess the types of surveyed and stored data required, we are interviewing biodiversity researchers from diverse scientific backgrounds about their specific data needs and the definitions of their recommended terms (metadata). Using this method, we aim to involve people in the development process, creating a more inclusive data protocol, ensuring that all possible data demands are covered, and making the protocol more likely to be generally accepted. Based on our interviews, one of the main difficulties in using a standardized glossary of terms is the large number of unnecessary or unfillable fields; this results from a search for comprehensiveness that also generates excessiveness. Taking this into account, we created a modular logic, selecting the best set of data (from a complete standardized database) for the specific demand or use. For example, if this standard database is used to guide a floral survey, it will most likely not require variables on fauna, caves, hydrology, etc. In this way, the system exports a perfectly customized digital spreadsheet containing the variables that the research team wants to collect, while also recommending other variables of interest that can be obtained during fieldwork, increasing the efficiency and scope of the activity (which may be financially onerous). We intend to make the system compatible with mobile technologies to be used indoors and outdoors, transferring the information directly to a virtual and integrative database. These open data collection protocols could be freely applied in other communities, e.g., public research institutions, researchers' fieldwork, and citizen science projects. We want our framework to be FAIR, making our data more Findable, Accessible, Interoperable, and Re-usable, and we will integrate Internet of Things (IoT), Artificial Intelligence (AI), and Location Intelligence concepts in our projects of long-term biodiversity and environmental field monitoring (Fig. 1). HTML XML PDF
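The modular selection described above can be pictured as filtering a master term catalogue by survey type and emitting a ready-to-fill datasheet. The sketch below shows that idea only; the catalogue, module names and term lists are invented and are not the institution's actual protocol.

```python
# Sketch of the modular idea described above: pick the terms relevant to a survey type
# from a master catalogue and emit a ready-to-fill CSV header.
# The catalogue, module names and terms are invented for illustration.
import csv

TERM_CATALOGUE = {
    "core": ["eventDate", "recordedBy", "decimalLatitude", "decimalLongitude"],
    "flora": ["scientificName", "phenology", "coverPercent"],
    "fauna": ["scientificName", "individualCount", "lifeStage"],
    "interactions": ["relatedScientificName", "interactionType"],   # custom extension terms
}

def build_datasheet(modules, recommended=("interactions",), path="survey_datasheet.csv"):
    """Export a CSV header with the core terms, the requested modules, and recommended extras."""
    columns = list(dict.fromkeys(                  # keep order, drop duplicate terms
        term for m in ("core", *modules, *recommended) for term in TERM_CATALOGUE[m]
    ))
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerow(columns)
    return columns

# A floral survey gets flora terms plus recommended interaction terms, but no fauna terms:
print(build_datasheet(modules=["flora"]))
```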
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Facing the Future Together: Anchoring informatics progress in community at
           NMNH

    • Abstract: Biodiversity Information Science and Standards 5: e75648
      DOI : 10.3897/biss.5.75648
      Authors : Rebecca Snyder, Holly Little : A 2020 external review of science at the Smithsonian National Museum of Natural History (NMNH) noted that increased investment in informatics was a key element for becoming a modern knowledge institution. This review charged NMNH with developing and implementing a comprehensive strategy for the future of the museum’s informatics program, including reorganization, innovation, and new approaches to staffing, to address urgent needs in data science and informatics capacity. After completing assessments of current capacity and needs and the role of NMNH informatics within the global biodiversity informatics landscape, the informatics task force found that robust community building, both internally and externally, would be critical to an expanded vision of informatics at NMNH. Approaches for local and global community strategies across an organization, like NMNH and its people, go hand in hand. Solidifying a strong foundation locally is often necessary for enabling robust, coordinated participation and resource sharing at the global level. Although the task force's primary focus has been internal community building to support the increasing need for local informatics capacity, much of that internal work is closely aligned with and often driven by external participation and networks. It is also clear that many organizations are contending with similar challenges, highlighting the importance of sharing strategies and lessons learned through peer-to-peer discussions and knowledge sharing. Based on the results of the task force’s surveys, interviews and research, the new NMNH model will be anchored in the development of a community of practice. This model extends knowledge and strengthens communication and coordination with departments, programs, and collaborators both within the Smithsonian and globally. It focuses on expanding capacity through improved knowledge sharing, cross training and more strategic application of resources and tasking, hopefully resulting in a robust, innovative environment. Here we open discussions on the importance of community for increasing capacity in support of the expanding natural history informatics landscape and strategies for the future at many levels. We highlight findings from efforts of the NMNH task force to explore what successful, supported informatics capacity looks like and initial proposed plans for revitalizing the NMNH informatics program. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Improving the Adoption and Evolution of Data Standards for Fossil
           Specimens

    • Abstract: Biodiversity Information Science and Standards 5: e75646
      DOI : 10.3897/biss.5.75646
      Authors : Holly Little, Talia Karim, Erica Krimmel : As we atomize and expand the digital representation of specimen information through data standards, it is critical to evaluate the implementation of these developments, including how well they serve discipline-specific needs. In particular, fossil specimens often present challenges because they require information to be captured that is seemingly parallel to, but not entirely aligned with, that of their extant counterparts. Previous work to evaluate data sharing practices of paleontology collections has shown an imbalance in the use of Darwin Core (DwC) (Wieczorek et al. 2012) terms and many instances of underutilized terms (Little 2018). To expand upon that broad assessment and encourage better adoption of evolving standards and data practices by fossil collections, a more in-depth review of term usage is necessary. Here we review specific DwC terms that are underutilized or that present challenges for fossil occurrence records, and we examine the subsequent impact on data discovery of paleo specimens. We conclude by sharing options for improving standards implementation within a paleo context. We see key patterns and challenges in current implementation of DwC in paleo collections, as evidenced by evaluations of the typical mappings found in occurrence records for fossil specimens, data flags applied by aggregators, and discussions within the paleo collections community. These can be organized into three broad groupings. Group 1: Some DwC terms (or classes of terms) are clear to implement, but are underutilized due to issues that are also found within the neontological community. Example: Location. In the case of terms related to the Location class, paleontology has a need for a way to deal with sensitive locality information. The sensitivity here typically relates to laws restricting the sharing of locality information to protect fossil sites, versus neontological requirements to protect threatened, rare, or endangered species. The end goal, fuzzing locality information without making the specimen record completely undiscoverable or unusable, is the same. There is a need for better education at the paleo data provider level related to standards for recording and sharing information in this category, which could be based on existing neontological community standards. Group 2: A second group of DwC terms often seem clear to implement, but the terminology used to describe and define them might be unfamiliar to paleontologists or read as unnecessary for fossil occurrences. This uncertainty about the applicability of a term to paleo data can often result in data not being mapped or fully shared. Example: recordedBy (= collector). In these cases, a simple translation of what the definition means in verbiage that is familiar to paleontologists, or the inclusion of paleo-oriented examples in the DwC documentation, can make implementation clear. Group 3: A third group of issues relates to DwC terms, classes, and/or extensions that are more complicated in the context of fossil vs. neontological data. In some cases use of these terms is complicated for neontological data as well, but perhaps for different reasons. The terms impacted by these challenges can sometimes have the same general use, but due to the nature of fossil preservation, or because a term has a different meaning within the discipline of paleontology, additional layers of uncertainty or ambiguity are present. Examples: Resource Relationship/Interactions, Individual count, Preparations, Taxon. Review of these terms and their related classes and/or the extensions they are part of has revealed that they might require qualification, further explanation, additional vocabulary terms, or even the need for special handling instructions when data are ingested and normalized at the aggregator level. This group of issues is more complicated to resolve, but the problems are not intractable and can progress toward solutions through further discussion within the community, active participation in the standards development and review process, and development of clear guidelines. Strategically assessing these terms and generating discipline-specific guidelines to be used by the paleo community can improve the mobilization and discovery of fossil occurrence data. Documenting these paleo data practices not only helps data providers, it also increases the utility of these data within the broader research community by clearly outlining how the terms were used. Overall, this discipline-focused approach to understanding the implementation of data standards like DwC at the term level helps to increase knowledge sharing across the paleo community, improves data quality and standards adoption, and moves these datasets towards alignment with best practices like the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. HTML XML PDF
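To make the term discussion concrete, the sketch below expresses one fossil occurrence with several of the Darwin Core terms mentioned above, including a simple locality-fuzzing step. The term names are real DwC terms; the record values and the rounding rule are invented examples, not community guidance.

```python
# Illustrative fossil occurrence using the Darwin Core terms discussed above.
# Term names are real DwC terms; the values and the fuzzing rule are invented examples.
def generalize_coordinates(lat, lon, decimals=1):
    """Coarsen coordinate precision for a sensitive fossil locality."""
    return round(lat, decimals), round(lon, decimals)

lat, lon = generalize_coordinates(39.73915, -104.99025)

fossil_occurrence = {
    "basisOfRecord": "FossilSpecimen",
    "scientificName": "Tyrannosaurus rex",
    "recordedBy": "Field crew, 1992 expedition",       # recordedBy = collector
    "individualCount": 1,
    "preparations": "plaster jacket; partial skeleton",
    "decimalLatitude": lat,
    "decimalLongitude": lon,
    "coordinateUncertaintyInMeters": 10000,            # make the generalization explicit
    "informationWithheld": "exact locality withheld to protect the site",
}
print(fossil_occurrence)
```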
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • South Africa's Initiative Towards an Integrated Biodiversity Data Portal

    • Abstract: Biodiversity Information Science and Standards 5: e75638
      DOI : 10.3897/biss.5.75638
      Authors : Kiara Ricketts, Brenda Daly, Fhatani Ranwashe, Carol Lefakane : Biodiversity Advisor, developed by the South African National Biodiversity Institute (SANBI), is a system that will provide integrated biodiversity information to a wide range of users who will have access to geospatial data, plant and animal species distribution data, ecosystem-level data, literature, images and metadata. It aims to deliver a centralized location with open access to information to enable research, assessment and monitoring; to support policy development; and to foster collaboration and advance governance. Data are aggregated from multiple, diverse data partners across South Africa, including CapeNature, the FitzPatrick Institute of African Ornithology, the Iziko South African Museum, the National Herbarium of South Africa and the South African Institute for Aquatic Biodiversity. This newly developed and integrated system promotes a shift from tactically based information systems, aimed at delivering products for individual project initiatives, to a strategic system that promotes the building of capacity within organisations and networks. It has been developed by integrating SANBI’s existing authoring layers through a service-orientated architecture approach, which enables seamless cross-platform integration. Some of the key authoring layers that will be integrated are the Botanical Database of Southern Africa (BODATSA), the Zoological Database of Southern Africa (ZODATSA), the Biodiversity Geographic Information System (BGIS) and SANBI's institutional repository (Opus). Biodiversity Advisor will provide users, policy and decision makers, environmental impact practitioners and associated organizations with free access to view, query and download any of South Africa's biodiversity data available on the system, providing them with everything needed to make decisions around conservation and biodiversity planning in South Africa. All sensitive species data, that is, data on species that are vulnerable to collecting, over-exploitation, or commercial and/or medicinal use, will be redacted, with access granted only upon application. Biodiversity Advisor will encourage more effective management of data within SANBI, but also encourage the sharing of data by the biodiversity community to provide integrated products and services, which are needed to address complex environmental issues. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Creating a National Biodiversity Database in Gabon and the Challenges of
           Mobilizing Natural History Data for Francophone Countries

    • Abstract: Biodiversity Information Science and Standards 5: e75643
      DOI : 10.3897/biss.5.75643
      Authors : Elie Tobi, Geovanne Aymar Nziengui Djiembi, Anna Feistner, Donald Midoko Iponga, Jean Felicien Liwouwou, Charlie Mabala, Jacques Mavoungou, Gauthier Moussavou, Linda Priscilla Omouendze, Edward Gilbert, Gregory Jongsma : Language is a major barrier for researchers wanting to digitize and publish collection data in Africa. Despite French being the fifth most spoken language on Earth and the second most common in Africa, resources in French about digitization, data management, and publishing are lacking. Furthermore, French-speaking regions of Africa (primarily Central/West Africa and Madagascar) host some of the highest biodiversity on the continent and therefore are of great importance to scientists and decision-makers. Without having representation in online portals like the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio), these important collections are effectively invisible. Producing relevant and applicable resources about digitization in French will help shine a light on these valuable natural history records and allow the data-holders in Africa to retain the autonomy of their collections. Awarded a GBIF-BID (Biodiversity Information for Development) grant in 2021, an international, multilingual network of partners has undertaken the important task of digitizing and mobilizing Gabon’s vertebrate collections. There are an estimated 13,500 vertebrate specimens housed in five institutions in different parts of Gabon. To date, the group has mobilized >4,600 vertebrate records to our recently launched Gabon Biodiversity Portal (https://gabonbiota.org/). The portal also hosts French guides for using Symbiota-based portals to manage, georeference, and publish natural history databases. These resources can provide much-needed guidance for other Francophone countries, in Africa and beyond, working to maximize the accessibility and value of their biodiversity collections. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Synospecies, an application to reflect changes in taxonomic names based on
           a triple store based on taxonomic data liberated from publication 

    • Abstract: Biodiversity Information Science and Standards 5: e75641
      DOI : 10.3897/biss.5.75641
      Authors : Reto Gmür, Donat Agosti : Taxonomic treatments, sections of publications documenting the features or distribution of a related group of organisms (called a “taxon”, plural “taxa”) in ways adhering to highly formalized conventions, and published in scientific journals, shape our understanding of global biodiversity (Catapano 2019). Treatments are the building blocks of the evolving scientific consensus on taxonomic entities. The semantics of these treatments and their relationships are highly structured: taxa are introduced, merged, made obsolete, split, renamed, associated with specimens and so on. Plazi makes this content available in machine-readable form using the Resource Description Framework (RDF). RDF is the standard model for Linked Data and the Semantic Web. RDF can be exchanged in different formats (also known as concrete syntaxes) such as RDF/XML or Turtle. The data model describes graph structures and relies on Internationalized Resource Identifiers (IRIs); ontologies such as the Darwin Core basic vocabulary are used to assign meaning to the identifiers. For Synospecies, we unite all treatments into one large knowledge graph, modelling taxonomic knowledge and its evolution with complete references to quotable treatments. However, this knowledge graph expresses much more than any individual treatment could convey, because every referenced entity is linked to every other relevant treatment. On synospecies.plazi.org, we provide a user-friendly interface to find the names and treatments related to a taxon. An advanced mode allows execution of queries using the SPARQL query language. HTML XML PDF
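For readers unfamiliar with SPARQL, the sketch below shows what querying such a knowledge graph from Python can look like, using the SPARQLWrapper library. Both the endpoint URL and the predicate used in the query are assumptions made for illustration; consult synospecies.plazi.org for the actual graph vocabulary and query interface.

```python
# Hedged sketch: querying a treatment knowledge graph with SPARQL from Python.
# The endpoint URL and the predicate in the query are assumptions for illustration,
# not the actual Synospecies/Plazi vocabulary.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/plazi/sparql"   # placeholder endpoint

QUERY = """
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
SELECT ?treatment ?name WHERE {
  ?treatment dwc:scientificName ?name .         # illustrative predicate choice
  FILTER(CONTAINS(LCASE(STR(?name)), "bulbophyllum"))
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.queryAndConvert()
for row in results["results"]["bindings"]:
    print(row["treatment"]["value"], row["name"]["value"])
```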
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • ecocomDP: A data design pattern and R package to facilitate FAIR
           biodiversity data for ecological synthesis

    • Abstract: Biodiversity Information Science and Standards 5: e75640
      DOI : 10.3897/biss.5.75640
      Authors : Eric Sokol : Two programs that provide high-quality long-term ecological data, the Environmental Data Initiative (EDI) and the National Ecological Observatory Network (NEON), have recently teamed up with data users interested in synthesizing biodiversity data, such as ecological synthesis working groups supported by the US Long Term Ecological Research (LTER) Network Office, to make their data more Findable, Accessible, Interoperable, and Reusable (FAIR). To this end: we have developed a flexible intermediate data design pattern for ecological community data (L1 formatted data in Fig. 1, see Fig. 2 for design details) called "ecocomDP" (O'Brien et al. 2021), and we provide tools to work with data packages in which this design pattern has been implemented. The ecocomDP format provides a data pattern commonly used for reporting community level data, such as repeated observations of species-level measures of biomass, abundance, percent cover, or density across multiple locations. The ecocomDP library for R includes tools to search for data packages, download or import data packages into an R (programming language) session in a standard format, and visualization tools for data exploration steps that are recommended for data users prior to any cross-study synthesis work. To date, EDI has created 70 ecocomDP data packages derived from their holdings, which include data from the US Long Term Ecological Research (US LTER) program, the Long Term Research in Environmental Biology (LTREB) program, and other projects, and which are now discoverable and accessible using the ecocomDP library. Similarly, NEON data products for 12 taxonomic groups are discoverable using the ecocomDP search tool. Input from data users provided guidance for the ecocomDP developers in mapping the NEON data products to the ecocomDP format to facilitate interoperability with the ecocomDP data packages available from the EDI repository. The standardized data design pattern allows common data visualizations across data packages, and has the potential to facilitate the development of new tools and workflows for biodiversity synthesis. The broader impacts of this collaboration are intended to lower the barriers for researchers in ecology and the environmental sciences to access and work with long-term biodiversity data, and to provide a hub around which data providers and data users can develop best practices that will build a diverse and inclusive community of practice. HTML XML PDF
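The ecocomDP tooling itself is written in R; as a language-neutral illustration of the kind of long-format design pattern described above, the sketch below builds simplified observation, taxon and location tables and joins them. The table and column names are simplified stand-ins, not the normative ecocomDP specification.

```python
# Language-neutral illustration of a long-format community-data design pattern:
# separate observation, taxon and location tables joined by keys.
# Table and column names are simplified stand-ins, not the normative ecocomDP spec.
import pandas as pd

location = pd.DataFrame(
    {"location_id": ["L1", "L2"], "latitude": [40.01, 40.05], "longitude": [-105.27, -105.30]}
)
taxon = pd.DataFrame(
    {"taxon_id": ["T1", "T2"], "taxon_name": ["Daphnia pulex", "Bosmina longirostris"]}
)
observation = pd.DataFrame(
    {
        "observation_id": [1, 2, 3],
        "event_date": ["2020-07-01", "2020-07-01", "2020-08-01"],
        "location_id": ["L1", "L2", "L1"],
        "taxon_id": ["T1", "T1", "T2"],
        "variable_name": ["density"] * 3,
        "value": [12.0, 7.5, 3.2],
        "unit": ["individualsPerLiter"] * 3,
    }
)

# Cross-study synthesis then becomes routine joins and aggregation:
tidy = observation.merge(taxon, on="taxon_id").merge(location, on="location_id")
print(tidy.groupby("taxon_name")["value"].mean())
```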
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Plant-pollinator Vocabulary - a Contribution to Interaction Data
           Standardization

    • Abstract: Biodiversity Information Science and Standards 5: e75636
      DOI : 10.3897/biss.5.75636
      Authors : José Augusto Salim, Paula Zermoglio, Debora Drucker, Filipi Soares, Antonio Saraiva, Kayna Agostini, Leandro Freitas, Marina Wolowski, André Rech, Marcia Maués, Isabela Varassin : Human demands on resources such as food and energy are increasing through time, while global challenges such as climate change and biodiversity loss are becoming more complex to overcome, as well as more widely acknowledged by societies and governments. Reports from initiatives like the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) have demanded quick and reliable access to high-quality spatial and temporal data of species occurrences, their interspecific relations and the effects of the environment on biotic interactions. Mapping species interactions is crucial to understanding and conserving ecosystem functioning and all the services it can provide (Tylianakis et al. 2010, Slade et al. 2017). Detailed data have the potential to improve our knowledge about ecological and evolutionary processes guided by interspecific interactions, as well as to assist in planning and decision making for biodiversity conservation and restoration (Menz et al. 2011). Although a great effort has been made to successfully standardize and aggregate species occurrence data, a formal standard to support biotic interaction data sharing and interoperability is still lacking. There are different biological interactions that can be studied, such as predator-prey, host-parasite and pollinator-plant, and there is a variety of data practices and data representation procedures that can be used. Plant-pollinator interactions are recognized in many sources from the scientific literature (Abrol 2012, Ollerton 2021) for their importance to ecosystem functioning and sustainable agriculture. Primary data about pollination are becoming increasingly available online and can be accessed from a great number of data repositories. While a vast quantity of data on interactions, and on pollination in particular, is available, data are not integrated among sources, largely because of a lack of appropriate standards. We present a vocabulary of terms for sharing plant-pollinator interactions using one of the existing extensions to the Darwin Core standard (Wieczorek et al. 2012). In particular, the vocabulary is meant to be used for the term measurementType of the Extended Measurement Or Fact extension. The vocabulary was developed by a community of specialists in pollination biology and information science, including members of the TDWG Biological Interaction Data Interest Group, during almost four years of collaborative work. The vocabulary introduces 40 new terms, comprising many aspects of plant-pollinator interactions, and can be used to capture information produced by studies with different approaches and scales. The plant-pollinator interactions vocabulary is mainly a set of terms that can be both understood by people and interpreted by machines. It is composed of a set of term definitions and descriptive documents explaining how the vocabulary is to be used. The terms in the vocabulary are divided into six categories: Animal, Plants, Flower, Interaction, Reproductive Success and Nectar Dynamics. The categories are not formally part of the vocabulary; they are used only to organize the vocabulary and to facilitate understanding by humans. We expect that the plant-pollinator vocabulary will contribute to data aggregation from a variety of sources worldwide at higher levels than we have experienced, significantly amplify plant-pollinator data availability for global synthesis, and contribute to knowledge in conservation and sustainable use of biodiversity. HTML XML PDF
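To illustrate where the vocabulary plugs in, the sketch below shows a single Extended Measurement Or Fact row attached to an occurrence record. The eMoF column names are the standard ones; the particular measurementType value and the other values are invented examples, not quoted terms from the published vocabulary.

```python
# Hedged sketch: one Extended Measurement Or Fact (eMoF) row linked to an occurrence.
# The eMoF column names are standard; the measurementType value and other values
# are invented examples, not quoted terms from the published vocabulary.
emof_row = {
    "occurrenceID": "urn:example:occurrence:12345",
    "measurementType": "flower visitation rate",          # would come from the vocabulary
    "measurementValue": "4.2",
    "measurementUnit": "visits per flower per hour",
    "measurementDeterminedDate": "2020-11-03",
    "measurementRemarks": "10-minute focal observation of Bombus terrestris on Salvia sp.",
}
print(emof_row)
```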
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Machine Learning as a Service for DiSSCo’s Digital  Specimen
           Architecture

    • Abstract: Biodiversity Information Science and Standards 5: e75634
      DOI : 10.3897/biss.5.75634
      Authors : Jonas Grieb, Claus Weiland, Alex Hardisty, Wouter Addink, Sharif Islam, Sohaib Younis, Marco Schmidt : International mass digitization efforts through infrastructures like the European Distributed System of Scientific Collections (DiSSCo), the US resource for Digitization of Biodiversity Collections (iDigBio), the National Specimen Information Infrastructure (NSII) of China, and Australia’s digitization of National Research Collections (NRCA Digital) make geo- and biodiversity specimen data freely, fully and directly accessible. Complementing these, overarching infrastructure initiatives like the European Open Science Cloud (EOSC) were established to enable mutual integration, interoperability and reusability of multidisciplinary data streams including biodiversity, Earth system and life sciences (De Smedt et al. 2020). Natural Science Collections (NSC) are of particular importance for such multidisciplinary and internationally linked infrastructures, since they provide hard scientific evidence by allowing direct traceability of derived data (e.g., images, sequences, measurements) to physical specimens and material samples in NSC. To open up the large amounts of trait and habitat data and to link these data to digital resources like sequence databases (e.g., ENA), taxonomic infrastructures (e.g., GBIF) or environmental repositories (e.g., PANGAEA), proper annotation of specimen data with rich (meta)data early in the digitization process is required, next to bridging technologies to facilitate the reuse of these data. This was addressed in recent studies (Younis et al. 2018, Younis et al. 2020), where we employed computational image processing and artificial intelligence technologies (Deep Learning) for the classification and extraction of features like organs and morphological traits from digitized collection data (with a focus on herbarium sheets). However, such applications of artificial intelligence, whether (sub-symbolic) machine learning or (symbolic) ontology-based annotation, are rarely integrated in the workflows of NSC management systems, which are the essential repositories for the aforementioned integration of data streams. This was the motivation for the development of a Deep Learning-based trait extraction and coherent Digital Specimen (DS) annotation service providing “Machine learning as a Service” (MLaaS) with a special focus on interoperability with the core services of DiSSCo, notably the DS Repository (nsidr.org) and the Specimen Data Refinery (Walton et al. 2020), as well as reusability within the data fabric of EOSC. Taking up the use case to detect and classify regions of interest (ROI) on herbarium scans, we demonstrate an MLaaS prototype for DiSSCo involving the digital object framework Cordra for the management of DS, as well as instant annotation of digital objects with extracted trait features (and ROIs) based on the DS specification openDS (Islam et al. 2020). Source code is available at: https://github.com/jgrieb/plant-detection-service HTML XML PDF
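As a rough picture of what "instant annotation" of a Digital Specimen with detected regions of interest could involve, here is a purely illustrative sketch; the field names, identifiers and payload structure are invented and are not the openDS specification or the service's actual API.

```python
# Purely illustrative: packaging detected regions of interest (ROIs) from a herbarium
# scan as an annotation on a Digital Specimen. Field names, identifiers and structure
# are invented; this is not the openDS specification or the service's actual API.
detected_rois = [
    {"label": "leaf",   "bbox": [134, 220, 410, 655], "confidence": 0.93},
    {"label": "flower", "bbox": [520, 180, 640, 300], "confidence": 0.81},
]

annotation = {
    "targetDigitalSpecimen": "https://example.org/ds/XYZ789",   # placeholder identifier
    "annotationType": "trait-extraction",
    "generator": "plant-detection-service",
    "body": detected_rois,
}

# A service would submit this to the Digital Specimen repository; here we only print it.
print(annotation)
```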
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • LifeWebs: A (global) database of bipartite ecological interaction networks

    • Abstract: Biodiversity Information Science and Standards 5: e75626
      DOI : 10.3897/biss.5.75626
      Authors : Philip Butterill, Leonardo Jorge, Shuang Xing, Tom Fayle : The structure and dynamics of ecological interactions are nowadays recognized as a crucial challenge for comprehending the assembly, functioning and maintenance of ecological communities, their processes and the services they provide. Nevertheless, while standards and databases for information on species occurrences, traits and phylogenies have been established, interaction networks have lagged behind in the development of such standards. Here, we discuss the challenges and our experiences in developing a global database of bipartite interaction networks. LifeWebs is an effort to compile community-level interaction networks from both published and unpublished sources. We focus on bipartite networks that comprise one specific type of interaction between two groups of species (e.g., plants and herbivores, hosts and parasites, mammals and their microbiota), which are usually presented in a co-occurrence matrix format. However, with LifeWebs, we attempt to go beyond simple matrices by integrating relevant metadata from the studies, especially sampling effort, explicit species information (traits and taxonomy/phylogeny), and environmental/geographic information on the communities. Specifically, we explore: 1) the unique aspects of community-level interaction networks when compared to data on single inter-specific interactions, occurrence data, and other biodiversity data, and how to integrate these different data types; 2) the trade-off between user friendliness in data input/output vs. machine-readable formats, especially important when data contributors need to provide large amounts of data usually compiled in a non-machine-readable format; and 3) how to have a single framework that is general enough to include disparate interaction types while retaining all the meaningful information. We envision LifeWebs to be in a good position to test a general standard for interaction network data, with a large variety of already compiled networks that encompass different types of interactions. We provide a framework for integration with other types of data, and for formalization of the data necessary to represent networks in established biodiversity standards. HTML XML PDF
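A minimal sketch of the bipartite, matrix-plus-metadata structure described above follows, using networkx; the interaction counts, species names and metadata fields are invented for illustration and do not reflect LifeWebs' actual data model.

```python
# Minimal sketch of a bipartite interaction network (co-occurrence matrix plus metadata).
# Species names, counts and metadata fields are invented; not LifeWebs' actual data model.
import networkx as nx

interaction_matrix = {      # (plant, herbivore) -> number of observed interactions
    ("Ficus sp. 1", "Herbivore A"): 12,
    ("Ficus sp. 1", "Herbivore B"): 3,
    ("Piper sp. 2", "Herbivore B"): 7,
}

G = nx.Graph(study="Example rainforest plot", sampling_effort_hours=120)
for (plant, herbivore), weight in interaction_matrix.items():
    G.add_node(plant, group="plants")
    G.add_node(herbivore, group="herbivores")
    G.add_edge(plant, herbivore, weight=weight)

plants = [n for n, d in G.nodes(data=True) if d["group"] == "plants"]
print(f"{len(plants)} plants, {G.number_of_edges()} interactions, metadata: {G.graph}")
```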
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Modelling the Heterogeneity within Citizen Science Data for Biodiversity
           Research

    • Abstract: Biodiversity Information Science and Standards 5: e75623
      DOI : 10.3897/biss.5.75623
      Authors : Diana Bowler, Nick Isaac, Aletta Bonn : Large amounts of species occurrence data are compiled by platforms such as the Global Biodiversity Information Facility (GBIF) but these data are collected by a diversity of methods and people. Statistical tools, such as occupancy-detection models, have been developed and tested as a way to analyze these heterogeneous data and extract information on species’ population trends. However, these models make many assumptions that might not always be met. More detailed metadata associated with occurrence records would help better describe the observation/detection submodel within occupancy models and improve the accuracy/precision of species’ trend estimates. Here, we present examples of occupancy-detection models applied to citizen science datasets, including dragonfly data in Germany, and typical approaches to account for variation in sampling effort and species detectability, including visit covariates, such as list length. Using results from a recent questionnaire in Germany asking citizen scientists about why and how they collect species occurrence data, we also characterize the different approaches that citizen scientists take to sample and report species observations. We use our findings to highlight examples of key metadata that are often missing (e.g., length of time spent searching, complete checklist or not) in data sharing platforms but would greatly aid modelling attempts of heterogeneous species occurrence data. HTML XML PDF
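For readers unfamiliar with the model structure mentioned above, the sketch below simulates detection/non-detection data with an occupancy-detection structure in which per-visit detection probability depends on a visit covariate such as list length. It is purely illustrative; the parameter values are invented and this is not the authors' model code.

```python
# Illustrative simulation of an occupancy-detection structure:
# true occupancy   z_i ~ Bernoulli(psi)
# detection        y_ij ~ Bernoulli(z_i * p_ij),  logit(p_ij) = a + b * list_length_ij
# Parameter values are invented; this is not the authors' model code.
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_visits = 200, 5
psi, a, b = 0.4, -1.0, 0.3

z = rng.binomial(1, psi, size=n_sites)                      # true occupancy per site
list_length = rng.poisson(5, size=(n_sites, n_visits))      # visit covariate (checklist length)
p = 1 / (1 + np.exp(-(a + b * list_length)))                # per-visit detection probability
y = rng.binomial(1, z[:, None] * p)                         # detection/non-detection records

naive = (y.sum(axis=1) > 0).mean()   # share of sites with at least one detection
print(f"true occupancy psi = {psi}, naive estimate from raw detections = {naive:.2f}")
```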
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Biodiversity Heritage Library and Global Names: Successes, opportunities
           and the challenges for the future collaboration

    • Abstract: Biodiversity Information Science and Standards 5: e75620
      DOI : 10.3897/biss.5.75620
      Authors : Dmitry Mozzherin : The Biodiversity Heritage Library (BHL) is a major aggregator of biodiversity literature with more than 200,000 volumes. The Global Names Architecture (GNA) strives to develop and provide tools for finding, parsing and verifying scientific names. GNA and BHL have enjoyed 10 years of collaboration in the creation of a scientific names index for BHL. Such an index provides researchers with a means for finding data about more than a million species. Recently, BHL and GNA developed a workflow for the creation of an index that covers more than 50 million pages of BHL, and finds and verifies scientific names in less than a day. The unprecedented speed of the index creation opens an opportunity to dramatically increase its quality and reach. The following challenges can now be addressed.
      1. Abbreviated names reconciliation. From 20% to 25% of all scientific names in BHL are abbreviated. It is much harder to reconcile and verify abbreviated names, because their specific epithets are not unique. We plan to reconcile the vast majority of such names via a statistical approach.
      2. Linking of biodiversity publication titles with actual pages in BHL. Scientific names are closely connected to publications of original description, taxonomic treatments, and other usages. We plan to build algorithms for finding out how different lexical variants of the same publication reference can be disambiguated and connected to corresponding BHL pages.
      3. Using taxonomic intelligence for finding information about species. According to our estimation, on average, there are three scientific names (historical and current) per taxon. Names of species often change over time as a result of misspellings, and homotypic or heterotypic synonymy. We plan to link outdated names with currently accepted names of taxa. This functionality provides all information about a taxon in BHL, no matter what names were used to reference the taxon at the time of publication.
      4. Finding information about original descriptions of genera and species. For every species there is a publication with the original description. We want to create an index of species that are described in the publications aggregated by BHL.
      5. Detection of species names in spite of "incorrect" capitalization. Previously, or in horticultural sources, specific epithets were often capitalized (e.g., Bulbophyllum Nocturnum), or for patronyms in which the species was named in honor of someone (e.g., Notiospathius Johnlennoni). We plan to detect names with non-standard capitalization of this sort.
      6. Removal of false positives. Texts in the Latin language, names of people, and geographical entities often create false positives that look like scientific names. Using machine learning techniques will allow us to detect and remove most of these errors from the names index.
      7. Detection of the names of biodiversity scientists and geographical entities in texts. Finding names of biologists and geographical places in addition to scientific names would allow us to draw connections between these data and to create metadata demonstrating these links. We plan to add tools and algorithms for indexing person names and geographical names.
      In this talk I will present plans for a dramatic quality increase in the scientific name-finding algorithms, as well as an introduction of other elements that would enhance usability of BHL for its patrons. HTML XML PDF
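As a toy illustration of the abbreviated-name problem in point 1 above, the snippet below expands an abbreviation against full genus names co-occurring in the same text. The simple frequency-based scoring is a stand-in invented for illustration, not the statistical approach the project plans to use.

```python
# Toy illustration of the abbreviated-name problem (point 1): "B. nocturnum" is ambiguous
# on its own and is expanded against full names seen nearby in the same text.
# The frequency-based scoring is an invented stand-in, not the planned statistical approach.
from collections import Counter

def expand_abbreviation(abbrev, nearby_full_names):
    genus_initial, epithet = abbrev.replace(".", "").split()
    candidates = Counter(
        name.split()[0]
        for name in nearby_full_names
        if name.split()[0].startswith(genus_initial) and name.split()[1] == epithet
    )
    return f"{candidates.most_common(1)[0][0]} {epithet}" if candidates else None

context = ["Bulbophyllum nocturnum", "Bulbophyllum nocturnum", "Brachypeza nocturnum"]
print(expand_abbreviation("B. nocturnum", context))   # -> Bulbophyllum nocturnum
```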
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Connecting Taxonomic Backbones using Global Names Tools

    • Abstract: Biodiversity Information Science and Standards 5: e75619
      DOI : 10.3897/biss.5.75619
      Authors : Dmitry Mozzherin : Biodiversity taxonomy provides a means to organize information about living organisms into maintainable tree- or graph-like structures (taxonomic backbones). Taxonomy is tightly bound to biodiversity nomenclature—a collection of recommendations, rules and conventions for naming living organisms. Species are often considered to be the most important unit of taxonomy structures. Keeping scientific names of species and other taxa accurate and up to date is a major challenge during the creation and maintenance of large taxonomic backbones. Global Names Architecture (Global Names) is an initiative that developed tools and databases for detecting, parsing, and verifying scientific names. Verification tools also provide information about which taxonomic and nomenclatural resources contain information for a given scientific name. Taxonomic intelligence provided by resources aggregated by Global Names allows resolution of taxon names from different backbones, even if their "current" scientific names vary. Parsing of scientific names with GNparser allows for normalization of names, making them comparable. Fast name matching (reconciliation) and discovery of a taxonomic meaning (resolution) by GNverifier connects information from various resources. The most recently developed tools by Global Names provide name verification and taxon matching on an unprecedented scale. During this presentation we are going to describe Global Names tools and show how they can be used to reconcile lexical variants of scientific names, extract authorship metadata, verify and resolve names, and connect data to a variety of biodiversity resources. HTML XML PDF
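      The verification workflow described above can also be exercised programmatically. A minimal sketch follows, assuming the public GNverifier web service accepts a JSON payload of name strings at the endpoint shown; the URL, payload shape and response fields are assumptions to be checked against the current Global Names API documentation.

        import json
        import urllib.request

        # Verify a small batch of name strings against the assumed GNverifier endpoint.
        payload = json.dumps({"nameStrings": ["Pomatomus saltatrix", "Homo sapien"]}).encode()
        request = urllib.request.Request(
            "https://verifier.globalnames.org/api/v1/verifications",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            results = json.load(response)

        # Print whichever match information the service returns for each name.
        for name_result in results.get("names", []):
            print(name_result.get("name"), "->", name_result.get("matchType"))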
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • GBIF Integration of Open Data 

    • Abstract: Biodiversity Information Science and Standards 5: e75606
      DOI : 10.3897/biss.5.75606
      Authors : Tim Robertson, Federico Mendez, Matthew Blissett, Morten Høfft, Thomas Stjernegaard Jeppesen, Nikolay Volik, Marcos Gonzalez, Mikhail Podolskiy, Markus Döring : The Global Biodiversity Information Facility (GBIF) runs a global data infrastructure that integrates data from more than 1700 institutions. Combining data at this scale has been achieved by deploying open Application Programming Interfaces (API) that adhere to the open data standards provided by Biodiversity Information Standards (TDWG). In this presentation, we will provide an overview of the GBIF infrastructure and APIs and provide insight into lessons learned while operating and evolving the systems, such as long-term API stability, ease of use, and efficiency. This will include the following topics:
      The registry component provides RESTful APIs for managing the organizations, repositories and datasets that comprise the network and control access permissions. Stability and ease of use have been critical to this being embedded in many systems.
      Changes within the registry trigger data crawling processes, which connect to external systems through their APIs and deposit datasets into GBIF's central data warehouse. One challenge here relates to the consistency of data across a distributed network.
      Once a dataset is crawled, the data processing infrastructure organizes and enriches data using reference catalogues accessed through open APIs, such as the vocabulary server and the taxonomic backbone. Being able to process data quickly as source data and reference catalogues change is a challenge for this component.
      The data access APIs provide search and download services. Asynchronous APIs are required for some of these aspects, and long-term stability is a requirement for widespread adoption. Here we will talk about policies for schema evolution to avoid incompatible changes, which would cause failures in client systems.
      The APIs that drive the user interface have specific needs such as efficient use of the network bandwidth. We will present how we approached this, and how we are currently adopting GraphQL as the next generation of these APIs.
      There are several APIs that we believe are of use for the data publishing community. These include APIs that will help with data quality aspects, and with new data of interest produced by the data clustering algorithms GBIF deploys. HTML XML PDF
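      As a concrete illustration of the data access APIs mentioned above, the sketch below matches a name against the GBIF backbone and then retrieves a first page of occurrence records. It uses the publicly documented api.gbif.org species-match and occurrence-search endpoints, but should be read as an illustration rather than a complete client.

        import json
        import urllib.request

        def get_json(url: str) -> dict:
            # Fetch a URL and decode the JSON response.
            with urllib.request.urlopen(url) as response:
                return json.load(response)

        # Match a verbatim name to the GBIF taxonomic backbone to obtain a taxon key.
        match = get_json("https://api.gbif.org/v1/species/match?name=Puma%20concolor")
        taxon_key = match["usageKey"]

        # Retrieve a first page of occurrence records for that taxon.
        occurrences = get_json(
            f"https://api.gbif.org/v1/occurrence/search?taxonKey={taxon_key}&limit=5"
        )
        for record in occurrences["results"]:
            print(record.get("scientificName"), record.get("country"))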
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • The BHL-Plazi Partnership: Getting data from the 1800s directly into 21st
           century, reused digital accessible knowledge

    • Abstract: Biodiversity Information Science and Standards 5: e75604
      DOI : 10.3897/biss.5.75604
      Authors : Diego Alvares, Marcus Guidoti, Felipe Simoes, Carolina Sokolowicz, Donat Agosti : Plazi is a Swiss non-governmental organization dedicated to the liberation of data imprisoned in flat, dead-end formats such as PDFs. In the process, the data therein is annotated and exported in various formats, following field-specific standards, facilitating free access and reutilization by several other service providers and end-users. This data mining and enhancement process allows for the rediscovery of the known biodiversity, since the knowledge on known taxa is published in an ever-growing corpus of papers, chapters and books, inaccessible to state-of-the-art service providers such as the Global Biodiversity Information Facility (GBIF). The data liberated by Plazi focuses on taxonomic treatments, which carry the unit of knowledge about a taxon concept in a given publication and can be considered the building block of taxonomic science. Although these extracted taxonomic treatments can be found in Plazi’s TreatmentBank and the Biodiversity Literature Repository (BLR), hosted in the European Organization for Nuclear Research (CERN) digital repository Zenodo, data included in treatments (e.g., material citations and treatment citations) can also be found in other applications, such as Plazi’s Synospecies, Zenodeo, and GBIF. Plazi’s efforts result in more Findable, Accessible, Interoperable, and Reusable (FAIR) biodiversity literature, improving, enhancing and enabling access to the data included therein as digital accessible data, otherwise almost unreachable. The Biodiversity Heritage Library (BHL), on the other hand, provides a pivotal service by digitizing heritage literature and current literature for which BHL negotiates permission, and provides free access to otherwise inaccessible sources. In 2021, BHL and Plazi signed a Statement of Collaboration, aiming to combine the efforts of both institutions to contribute even further to FAIR-ifying biodiversity literature and data. In a collaborative demonstration project, we selected the earliest volumes and issues of the Revue Suisse de Zoologie in order to conduct a pilot study that combines the efforts of both BHL and Plazi. The corpus is composed of eight volumes (tomes), 24 issues (numbers) and 98 papers, including a total of over 5000 pages and 200 images. To process this material, BHL assigned CrossRef Digital Object Identifiers (DOIs) to these already digitally accessible publications. Plazi created a template to be used in GoldenGate-Imagine, indicating key parameters used for tailored data mining of these articles, and customized to the journal’s graphic layout characteristics at that time. Then, we proceeded with quality control steps to provide fit-for-use data for BLR and GBIF by ensuring that the data was correctly annotated and eliminating potential data transit blockages at Plazi’s built-in data gatekeeper. The data was then reused by GBIF. Finally, we present here a summary of the obtained results, highlighting the counts of the key publication attributes mentioned above (pages, images), but also including a drill-down into the different taxonomic groups, countries and collections of origin of the studied material, and more. All the data is available via the Plazi statistics, the Biodiversity Literature Repository Website and community at Zenodo, the Zenodeo APIs and GBIF, where the data is being reused. HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Biological Data Standards – A primer for data managers

    • Abstract: Biodiversity Information Science and Standards 5: e75593
      DOI : 10.3897/biss.5.75593
      Authors : Abigail Benson, Diana LaScala-Gruenewald, Robert McGuinn, Erin Satterthwaite : While a bevy of standards exist for managers of biological data to use, biological science departments or projects could benefit from an easy-to-digest primer about biological data standards and the value they confer. Moreover, a quick visual breakdown comparing standards could help data managers choose those that best serve their needs. The Earth Science Information Partners (ESIP) is a nonprofit that enables and supports high-quality virtual and in-person collaborations between cross-domain data professionals on common data challenges and opportunities, and is supported by the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA) and the United States Geological Survey (USGS). The ESIP Biological Data Standards Cluster has been developing a primer on existing biological data standards for managers of biological data who may be unaware of existing standards but need to improve the management, analysis, and use of biological observation data. The goal of this primer is to spread awareness about existing standards in a simple, aesthetically pleasing way. Our hope is that this primer, shared online and at conferences, will help increase the adoption of existing biological standards and help make data more Findable, Accessible, Interoperable, and Reusable (FAIR). HTML XML PDF
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • Wanted: Standards for FAIR taxonomic concept representations and
           relationships 

    • Abstract: Biodiversity Information Science and Standards 5: e75587
      DOI : 10.3897/biss.5.75587
      Authors : Beckett Sterner, Nathan Upham, Prashant Gupta, Caleb Powell, Nico Franz : Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise descriptions of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and to include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse—Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts—e.g., range maps, reference sequences, and diagnostic traits—that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017). We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020). HTML XML PDF
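      A minimal sketch of how a machine-actionable concept alignment could be used to resolve the kind of ambiguity described above; the alignment table, the use of a coarse region attribute as the discriminating evidence, and all record fields are invented for illustration and do not reflect any published checklist or alignment.

        # Hypothetical alignment of a broader, pre-split concept to narrower concepts,
        # keyed here on a coarse region attribute purely for illustration.
        alignments = {
            ("Peromyscus maniculatus", "east_of_mississippi"): "Peromyscus maniculatus (narrow sense)",
            ("Peromyscus maniculatus", "west_of_mississippi"): "ambiguous: split taxon, needs review",
        }

        records = [
            {"scientificName": "Peromyscus maniculatus", "region": "east_of_mississippi"},
            {"scientificName": "Peromyscus maniculatus", "region": "west_of_mississippi"},
        ]

        # Resolve each name-based record to a concept, or flag it when the alignment
        # cannot decide from the available evidence.
        for record in records:
            key = (record["scientificName"], record["region"])
            record["resolvedConcept"] = alignments.get(key, record["scientificName"])
            print(record["region"], "->", record["resolvedConcept"])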
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • MIDS Level 1: Specification, conformance checklist, mapping template
           and instructions for use

    • Abstract: Biodiversity Information Science and Standards 5: e75574
      DOI : 10.3897/biss.5.75574
      Authors : Alex Hardisty, Elspeth Haston : Approved formally as a TDWG Task Group (TG) in September 2020, TG MIDS is working to harmonise a framework for "Minimum Information about a Digital Specimen (MIDS)". MIDS clarifies what is meant by different levels of digitization (MIDS levels) and specifies the minimum information to be captured at each level. Capturing and presenting data in standard formats in future digitization efforts is essential so that data can be more easily understood, compared, analysed and communicated via the Internet. Adopting MIDS and working to achieve specific MIDS levels in digitization ensures that enough data are captured, curated and published such that they are useful for the widest possible range of future research, teaching and learning purposes. Adopting MIDS makes it easier to consistently measure the extent of digitization achieved over time and to set priorities for the remaining work. In the year since MIDS was first introduced at TDWG 2020, the TG has focussed on the details of MIDS level 1, representing the basic minimum level of information to be expected and captured in basic digitization activities such as creating a catalogue record and (optionally) making photographic or other digital images of specimens. To help the community adopt and embed MIDS conformance as a core part of digitization and data publishing/management pipelines, the MIDS specification consists of definitions of the expected information elements, a template for mapping terms/fields in institutional collection management systems and other data management schemas to those information elements, a conformance proforma allowing declaration of how a digitization or data publishing event conforms to MIDS, and instructions for use. HTML XML PDF
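      A minimal sketch of what an automated conformance check against a MIDS level could look like in a data publishing pipeline; the element names below are placeholders standing in for the Level 1 information elements, not the authoritative list from the MIDS specification.

        # Placeholder element set standing in for MIDS Level 1; consult the published
        # specification and the TG's mapping template for the authoritative elements.
        LEVEL_1_ELEMENTS = {"physicalSpecimenId", "institutionCode", "name", "license"}

        def mids_level_1_check(record: dict) -> tuple[bool, set]:
            # Report whether all required elements are present and which are missing.
            missing = {element for element in LEVEL_1_ELEMENTS if not record.get(element)}
            return (not missing, missing)

        catalogue_record = {
            "physicalSpecimenId": "ABC:12345",
            "institutionCode": "ABC",
            "name": "Quercus robur",
        }
        conformant, missing = mids_level_1_check(catalogue_record)
        print("MIDS level 1 conformant:", conformant, "| missing elements:", missing)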
      PubDate: Thu, 23 Sep 2021 11:00:00 +030
       
  • The Biodiversity of Ecological Interactions: Challenges for recording and
           documenting the Web of Life

    • Abstract: Biodiversity Information Science and Standards 5: e75564
      DOI : 10.3897/biss.5.75564
      Authors : Pedro Jordano : Biodiversity is more than a collection of individual species. It is the combination of biological entities and processes supporting life on Earth: no single species persists without interacting with other species. A full account of biodiversity on Earth needs to document the essential ecological interactions that support Earth’s system through their functional outcomes. Quantifying biodiversity’s interactome (the whole suite of interactions among biotic organisms) is challenging not just because of the daunting task of describing ecosystem complexity; it is also limited by the need to define and establish a proper grammar to record and catalog species interactions. Actually, a record of a pairwise interaction between two species can be identified as a "tetranomial species", with just a concatenation of the two Latin binomials. Thus sampling interactions requires solving exactly the same constraints and problems we face when sampling biodiversity. In real interaction webs, the number of actual pairwise interactions among species in local assemblages scales exponentially with species richness. I discuss the main components of these interactions and those that are key to properly sample and document them. Interactions take the form of predation, competition, commensalism, amensalism, mutualism, symbiosis, and parasitism and, in all cases, involve reciprocal effects for the interacting species and build into highly complex networks (Fig. 1). The type of metadata required to document ecological interactions between partner species depends on interaction type; yet a fraction of these metadata is shared with those of the partner species. The interaction type sets limits to between-species encounters (actually, encounters between individuals of the partner species) and, more importantly, sets the type of outcome emerging from the interactions. There is a broad range of information that can eventually be acquired when recording an ecological interaction: from its simple presence (the interaction exists, it has just been recorded) to an estimate of its frequency, to obtaining data about its outcome or per-interaction effect (e.g., number of flowers pollinated in a visit by a pollinator to a plant). In addition, the types of interaction data can be quite diverse, reflecting the variety of sampling methods: interaction records from direct observation in the field; camera-traps; DNA-barcoding; bibliographic sources; surveys of image databases, etc. Interaction biodiversity inventories may require merging information coming from these distinct data sources. All these components need to be properly defined in order to build informative metadata and to document ecological interaction records. We are just starting to delineate the main components needed to catalog and inventory ecological interactions as a part of biodiversity inventories. HTML XML PDF
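      The "tetranomial" idea above amounts to treating each pairwise interaction as a record keyed by the concatenated binomials of the two partners, together with the interaction type, evidence and optional outcome. A minimal sketch with invented field names, not a proposed standard:

        from dataclasses import dataclass

        @dataclass
        class InteractionRecord:
            # A pairwise species interaction keyed by the two Latin binomials.
            species_a: str
            species_b: str
            interaction_type: str   # e.g., mutualism, predation, parasitism
            evidence: str           # e.g., field observation, camera trap, DNA barcoding
            outcome: str = ""       # optional per-interaction effect

            @property
            def tetranomial(self) -> str:
                return f"{self.species_a} x {self.species_b}"

        record = InteractionRecord(
            species_a="Prunus mahaleb",
            species_b="Turdus merula",
            interaction_type="mutualism (seed dispersal)",
            evidence="field observation",
        )
        print(record.tetranomial, "-", record.interaction_type)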
      PubDate: Tue, 21 Sep 2021 13:30:00 +030
       
  • Tackling Data Quality Challenges in the Finnish Biodiversity Information
           Facility (FinBIF)

    • Abstract: Biodiversity Information Science and Standards 5: e75559
      DOI : 10.3897/biss.5.75559
      Authors : Kari Lahti, Mikko Heikkinen, Aino Juslén, Leif Schulman : The Finnish Biodiversity Information Facility (FinBIF) Research Infrastructure (Schulman et al. 2021) is a national service with a broad coverage of the components of biodiversity informatics (Bingham et al. 2017). Data flows are managed under a single information technology (IT) architecture. Services are available in a single, branded on-line portal. Data are collated from all relevant sources, e.g., research institutes, scientific collections, public authorities and citizen science projects, whose data represent a major contribution. The challenge is to analyse, classify and share good-quality data in a way that the user understands its utility.
      Need for quality data: The philosophy of FinBIF is that all observation records are important, and that all data are assessed for quality and able to be annotated. The challenge is that, in practice, many users desire data with 100% reliability. In our experience, most user concerns about data quality are related to citizen science data. Researchers are usually able to manage raw data to serve their purposes. However, decision-making authorities often have less capacity to analyse the data and thus require data that can be used instantly. Therefore, we need tools to provide users the data that are the most relevant and reliable for their specific use. For all users, standardized metadata (information about datasets) are key when the user has doubts about the fitness-for-use of a particular dataset. There is also a need to provide data in different formats to serve various users. Finally, the service has to be machine-actionable (using an application programming interface (API) and R-package) as well as human-accessible for viewing and downloading data.
      Quality assignment: FinBIF data accuracy varies significantly within and between datasets and observers. Two quality-based classifications suitable for filtering are therefore applied. The dataset origin filter is based on the quality of a whole dataset (e.g., a citizen science project) and includes three broad classes assigned with an appropriate quality label: Datasets by Professionals, by Specialists and by Citizen Scientists. The observation reliability filter is based on a single observation and on annotations by FinBIF users. This classification includes Expert verified, Community verified, Unassessed (default for all records), Uncertain, and Erroneous. The dataset origin does not necessarily determine the quality of the individual records in it. Observations made by citizen scientists are often accurate, while there may be errors in the professionally collected data. Records are frequently subject to annotation, which raises their quality over time (e.g., iNaturalist). Naturally, evidence (e.g., media, detailed descriptions, specimens) is needed for reliable identification.
      Annotating data: When observations are compiled at FinBIF’s portal (Laji.fi), they are initially “Unassessed” (unless they have otherwise been assessed at the original source). When annotating occurrences, volunteers can make various entries using the tools provided. The aim of the commentary is to improve the quality of the observation data. Annotators are divided into two categories with two different roles: as a basic user, anyone who has logged in at Laji.fi can make comments or tag observations for review by experts; users defined as experts have wider rights than basic users and their comments carry more weight. The most desired actions of expert users are to classify observations into confidence levels or to give them new or refined identifications. Information about new comments passes to the observer if the observation is recorded by using the FinBIF Observation Management System “Notebook”. However, comments cannot yet be automatically forwarded, e.g., to the primary data management systems at the original source. Annotations add extra indications of quality. They do not replace or delete the original information. Nevertheless, annotations can change a record’s taxonomic identification, and by default, a record will be handled based on its latest identification.
      R-package for researchers and Public Authority Portal (PAP) for decision makers: FinBIF has produced an R programming language interface to its API, which makes the publicly available data in FinBIF accessible from within R. For authorities, the PAP offers direct access to all available species information to authorised users, including sensitive and restricted-use data. HTML XML PDF
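      A minimal sketch of filtering records with the two quality classifications described above (dataset origin and observation reliability); the record structure and the filtering rule are illustrative assumptions, not the FinBIF API, the Laji.fi portal, or the finbif R package.

        # Classes taken from the abstract; which combinations count as "fit for use"
        # is an assumption made for this illustration.
        TRUSTED_ORIGINS = {"Datasets by Professionals", "Datasets by Specialists"}
        TRUSTED_RELIABILITY = {"Expert verified", "Community verified"}

        records = [
            {"taxon": "Lynx lynx", "origin": "Datasets by Citizen Scientists", "reliability": "Expert verified"},
            {"taxon": "Lynx lynx", "origin": "Datasets by Professionals", "reliability": "Unassessed"},
            {"taxon": "Lynx lynx", "origin": "Datasets by Citizen Scientists", "reliability": "Uncertain"},
        ]

        def fit_for_instant_use(record: dict) -> bool:
            # Keep records from trusted dataset origins or carrying verified annotations.
            return record["origin"] in TRUSTED_ORIGINS or record["reliability"] in TRUSTED_RELIABILITY

        for record in records:
            print(record["origin"], "|", record["reliability"], "->", fit_for_instant_use(record))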
      PubDate: Tue, 21 Sep 2021 13:30:00 +030
       
  • TaxonWorks: Character state matrices and identification tools

    • Abstract: Biodiversity Information Science and Standards 5: e75554
      DOI : 10.3897/biss.5.75554
      Authors : Dmitry Dmitriev, Matthew Yoder : TaxonWorks is an integrated web-based application for practicing taxonomists and biodiversity specialists. It is focused on promoting collaboration between researchers and developers. TaxonWorks has a modular structure that enables various components of the application to target specific needs and requirements of different groups of users. Specific areas of interest may include nomenclature-related tasks (Yoder and Dmitriev 2021) designed to help assemble and validate scientific name checklists of a target group of organisms; and collection management tasks, including interfaces to create, filter, and edit collecting events, collection objects, and loans. This presentation focuses on matrix-related tools integrated into TaxonWorks. A matrix, which could either be used for phylogenetic analysis or to build an identification key, is structured as a table where columns represent numerous characters that could be used to describe a set of entities, taxa or specimens (presented as rows of the table). Each cell of the table may contain observations for specific character/entity combinations. TaxonWorks does not generate a table for each particular matrix—all observations are stored as graphs. This structure allows building a matrix of unlimited size as well as reusing individual observations in multiple matrices. For matrix columns, TaxonWorks supports a variety of different kinds of characters or descriptors: qualitative, presence/absence, quantitative, sample, gene, free text, and media. Each character may have specific properties, for example a qualitative descriptor may have numerous character states, and a quantitative descriptor may have a measurement unit defined. For an entity in a matrix row, TaxonWorks supports either collection objects (specimens) or taxa as Operational Taxonomic Units (OTU). OTUs could either be linked to nomenclature or be standalone entities (e.g., representing undescribed species). The matrix, once built, could serve several purposes. A matrix based on qualitative and quantitative characters could be used to build an interactive key (Fig. 1), construct standardized natural language descriptions for each entity, and determine a diagnosis (a minimal set of characters that separates one entity from all others). It could also be exported and used for phylogenetic analysis or to build an interactive key in an external application. TaxonWorks supports export files in several formats, including Nexus, TNT, NeXML. Application Programming Interfaces (API) are also available. A matrix based on media descriptors could be used as a pictorial identification tool (Fig. 2). HTML XML PDF
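      A minimal sketch of the idea that observations are stored individually (as links between an entity and a descriptor) and a matrix is only a view assembled from them; the data structures and field names are invented for illustration and are not the TaxonWorks data model.

        from collections import defaultdict

        # Each observation links one entity (an OTU or specimen) to one descriptor value.
        observations = [
            ("OTU:Aus bus", "wing color", "brown"),
            ("OTU:Aus bus", "body length (mm)", 4.2),
            ("OTU:Aus cus", "wing color", "black"),
            ("OTU:Aus cus", "body length (mm)", 5.1),
        ]

        def build_matrix(rows, columns, obs):
            # Assemble a row-by-column table from whichever observations are selected.
            cells = defaultdict(dict)
            for entity, descriptor, value in obs:
                if entity in rows and descriptor in columns:
                    cells[entity][descriptor] = value
            return {row: [cells[row].get(col) for col in columns] for row in rows}

        matrix = build_matrix(
            rows=["OTU:Aus bus", "OTU:Aus cus"],
            columns=["wing color", "body length (mm)"],
            obs=observations,
        )
        for row, values in matrix.items():
            print(row, values)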
      PubDate: Tue, 21 Sep 2021 13:30:00 +030
       
  • Linking and the Role of the Material Citation

    • Abstract: Biodiversity Information Science and Standards 5: e75543
      DOI : 10.3897/biss.5.75543
      Authors : Jeremy Miller, Donat Agosti, Marcus Guidoti, Francisco Andres Rivera Quiroz : Citing the specimens used to describe new species or augment existing taxa is integral to the scholarship of taxonomic and related biodiversity-oriented publications. These so-called material citations (Darwin Core term MaterialCitation), linked to the natural history collections in which they are archived, are the mechanism by which readers may return to the source material upon which reported observations are based. This is integral to the scientific nature of the project of documenting global biodiversity. Material citation records typically contain such information as the location and date associated with the collection of a specimen, the taxonomic identification, and other data. Thus, material citations are a key line of evidence for biodiversity informatics, along with other evidence classes such as database records of specimens archived in natural history collections, human observations not linked to specimens, and DNA sequences that may or may not be linked to a specimen. Natural history collections are not completely databased and records of some occurrences are only available as material citations. In other cases, material citations can be linked to the record of the physical specimen in a collections database. Taxonomic treatments, sections of publications documenting the features or distribution of a related group of organisms (Catapano 2019), may contain citations of DNA sequences, which can be linked to database records. There is potential for bidirectional linking that could contribute data elements or entire records to collections and DNA databases, based on content found in material citations. We compare material citations data to other major sources of biodiversity records (preserved specimens, human observations, and material samples). We present pilot project data that reconcile material citations with their database records, and track all material citations across the taxonomic history of a species. HTML XML PDF
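      A minimal sketch of a material citation expressed with Darwin Core-style terms and linked back to a databased specimen record, illustrating the bidirectional linking described above; the identifiers, the collection lookup table, and the linking function are all invented for illustration.

        # A material citation as it might be extracted from a taxonomic treatment,
        # using Darwin Core-style term names; all identifiers below are invented.
        material_citation = {
            "basisOfRecord": "MaterialCitation",
            "scientificName": "Aus bus Smith, 1900",
            "locality": "Mount Example, 1200 m",
            "eventDate": "1899-03-14",
            "institutionCode": "NHMX",
            "catalogNumber": "NHMX-12345",
        }

        # Records of physical specimens already databased by the holding collection.
        collection_records = {
            ("NHMX", "NHMX-12345"): {"occurrenceID": "urn:catalog:NHMX:12345"},
        }

        def link_to_specimen(citation: dict):
            # Return the occurrenceID of the matching specimen record, if one exists.
            key = (citation.get("institutionCode"), citation.get("catalogNumber"))
            match = collection_records.get(key)
            return match["occurrenceID"] if match else None

        print(link_to_specimen(material_citation))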
      PubDate: Tue, 21 Sep 2021 13:30:00 +030
       
  • The Verification of Ecological Citizen Science Data: Current approaches
           and future possibilities

    • Abstract: Biodiversity Information Science and Standards 5: e75506
      DOI : 10.3897/biss.5.75506
      Authors : Emily Baker, Jonathan Drury, Johanna Judge, David Roy, Graham Smith, Philip Stephens : Citizen science schemes (projects) enable ecological data collection over very large spatial and temporal scales, producing datasets of high value for both pure and applied research. However, the accuracy of citizen science data is often questioned, owing to issues surrounding data quality and verification, the process by which records are checked after submission for correctness. Verification is a critical process for ensuring data quality and for increasing trust in such datasets, but verification approaches vary considerably among schemes. Here, we systematically review approaches to verification across ecological citizen science schemes that feature in published research, aiming to identify the options available for verification, and to examine factors that influence the approaches used (Baker et al. 2021). We reviewed 259 schemes and were able to locate verification information for 142 of those. Expert verification was most widely used, especially among longer-running schemes. Community consensus was the second most common verification approach, used by schemes such as Snapshot Serengeti (Swanson et al. 2016) and MammalWeb (Hsing et al. 2018). It was more common among schemes with a larger number of participants and where photos or video had to be submitted with each record. Automated verification was not widely used among the schemes reviewed. Schemes that used automation, such as eBird (Kelling et al. 2011) and Project FeederWatch (Bonter and Cooper 2012), did so in conjunction with other methods such as expert verification. Expert verification has been the default approach for schemes in the past, but as the volume of data collected through citizen science schemes grows and the potential of automated approaches develops, many schemes might be able to implement approaches that verify data more efficiently. We propose a hierarchical approach in which the bulk of records are verified by automation or community consensus, and any flagged records can then undergo additional levels of verification by experts. We present an idealised system for data verification, identifying schemes where this hierarchical system could be applied and the requirements for implementation. HTML XML PDF
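      A minimal sketch of the tiered verification pipeline proposed above (automated checks first, then community consensus, with the remainder flagged for experts); the thresholds, record fields, and routing rules are illustrative assumptions, not the rules of any published scheme.

        def verify(record: dict) -> str:
            # Tier 1: automated plausibility check, e.g., species known from the region.
            if record["species"] not in record["regional_species_list"]:
                return "flagged for expert review (failed automated check)"
            # Tier 2: community consensus on submitted photos or videos.
            votes = record.get("community_votes", {})
            total = sum(votes.values())
            if total >= 5 and max(votes.values()) / total >= 0.8:
                return "accepted by community consensus"
            # Tier 3: everything else is queued for an expert verifier.
            return "queued for expert verification"

        record = {
            "species": "Meles meles",
            "regional_species_list": {"Meles meles", "Vulpes vulpes"},
            "community_votes": {"Meles meles": 9, "Vulpes vulpes": 1},
        }
        print(verify(record))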
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • (Re)Discovering Known Biodiversity: Introduction

    • Abstract: Biodiversity Information Science and Standards 5: e75491
      DOI : 10.3897/biss.5.75491
      Authors : Donat Agosti : Biodiversity sciences, including taxonomy, are empirical sciences where all results are published in scholarly publications as part of the research life cycle. This creates a corpus of an estimated 500 million printed pages (Kalfatovic 2010) including billions of facts such as traits, biotic interactions, observations characterizing all the estimated 1.9 million known species (Costello et al. 2013). This library is continually reused, cited and extended, for example with more than an estimated 15,000–20,000 new species annually (Polaszek 2005). All of these figures are estimates because we neither know how many species have been discovered, nor how many are being discovered every day, let alone what we know about them. Following standard scientific practice, previous publications, specimens, gene sequences, or taxonomic treatments (Catapano 2019) are cited more or less explicitly. In the pre-digital age, these links were meant to be understood by the human reader. For example, "L. 1758" is an established reference and links to both Carolus Linnaeus and Linnaeus 1758, understandable at least by an expert human, and in the digital age, provides access to the respective digital representation. These data within the hundreds of millions of printed and now increasingly digitally published pages form a seamless, albeit implicit knowledge graph. Unfortunately, most of these publications are in print—the Biodiversity Heritage Library digitized about 50 million pages (Kalfatovic 2010)—or in many cases, closed access publications, and thus this knowledge is not readily accessible in the digital age. However, in today's digital age, each of these kinds of implicit links is an expensive stumbling block to access and reuse of the referenced data, its parent publications and the cited referenced data therein. Inadequate formats, language and access to taxonomic information were already recognized in 1992 at the Rio Summit (Taxonomic Impediment). The consequences of these impediments are only now obvious with the realization of the daunting amount of human resources needed to digitally catalogue and index this unknown (not discoverable and inaccessible) known knowledge, let alone making the data itself findable, accessible, interoperable and reusable (FAIR). This is a formidable and complex scientific challenge. Plazi is taking on this challenge. Its vision is to promote and enable the discovery and liberation of data to transform the unknown known data into digitally accessible knowledge, i.e., to build a digital knowledge base aimed at discovering all the species (and other taxa) we know, and what we know about them. Taxonomic publications with their highly standardized taxonomic names, taxonomic treatments, treatment citations, material citations and illustrations are well suited to machine extraction. Together they include the entire catalogue of life with all the discovered species and their synonyms, often tens to hundreds of treatments, and figures that depict the myriad forms that comprise the world’s biodiversity. Once these data are FAIR, they allow bidirectional linking, for example of taxonomic names to the referenced taxonomic treatment and to other digital resources such as gene sequences or digital specimens. At the same time, each datum is an entry point to the wealth of information that can be followed by the human user by clicking the links, but more importantly, analysed by machines.
      Here, digitally accessible knowledge will be defined in the context of discovering known biodiversity, including strategies of how to approach the challenge, which then will be detailed in subsequent talks in this symposium. This symposium is based on Plazi’s ongoing data liberation and discovery supported by the European Union (e.g. Biodiversity Community Integrated Knowledge Library BiCIKL), United States (e.g. NIH) and Swiss research funding (e.g. e-BioDiv and the Arcadia Fund), collaboration with publishers (e.g. Pensoft, Muséum national d'Histoire naturelle, Consortium of European Taxonomic Facilities Publications, the Zenodo repository, Biodiversity Heritage Library), and data reusers like the Global Biodiversity Information Facility, Ocellus, Synospecies and openBiodiv. Currently, over 500,000 taxonomic treatments and 300,000 illustrations have been liberated and are accessible through TreatmentBank and the Biodiversity Literature Repository. HTML XML PDF
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • ITIS and the Global Taxonomic Backbone

    • Abstract: Biodiversity Information Science and Standards 5: e75471
      DOI : 10.3897/biss.5.75471
      Authors : David Mitchell, Thomas Orrell : The Integrated Taxonomic Information System (ITIS) provides a regularly updated, global database that currently contains over 868,000 scientific names and their hierarchy. The program exists to communicate a comprehensive taxonomy of global species across 7 kingdoms that enables biodiversity information to be discovered, indexed, and connected across all human endeavors. ITIS partners with taxonomists and experts across the world to assemble scientific names and their taxonomic relationships, and then distributes that data through publicly available software. A single taxon may be represented by multiple scientific names, so ITIS makes it a priority to provide synonymy. Linking valid or accepted names with their subjective and objective synonyms is a key component of name translation and increases the precision of searches and organization of information. ITIS and its partner Species 2000 create the Catalogue of Life (CoL) checklist that provides quality scientific name data for over 2.2 million species. The CoL is the Global Biodiversity Information Facility (GBIF) taxonomic backbone. Providing automated open access to complete, current, literature-referenced, and expert-validated taxonomic information enables biological data management systems, and is elemental to enhancing the utility of the amassed scientific data across the world. Fully leveraging this information for the public good is crucial for empowering the global digital society to confront the most pressing social and environmental challenges. HTML XML PDF
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • HydroClim Data Portal: Cyberinfrastructure for providing high-resolution
           GIS modeled streamflow and water temperature data to researchers 

    • Abstract: Biodiversity Information Science and Standards 5: e75269
      DOI : 10.3897/biss.5.75269
      Authors : Xiaojun Wang, Jason Knoft, Darren Ficklin, Nelson Rios, Henry Bart : Freshwater ecosystems play a key role in sustaining aquatic biodiversity. However, human alterations to watersheds and climate change are reducing critical habitat and the viability of populations of many aquatic species. The environmental changes have also had significant adverse impacts on water temperatures and streamflow. The changes in temperature and precipitation forecast over the next century are expected to affect the freshwater ecosystems and their biodiversity to an even greater extent than in the past. The aims of the HydroClim project are to provide openly accessible data on two key measures of stream conditions in the United States (US) and Canada for use in research, to increase public understanding of issues involving water resources, and to provide training opportunities for scientists who will be responsible for the conservation of freshwater biodiversity in the future. The project has used contemporary air temperature and precipitation data and future climate data from multiple Global Climate Model scenarios to generate high-resolution, spatially explicit, monthly streamflow and water temperature data for all watersheds across the US and Canada from 1950–2099 through multiple Soil and Water Assessment Tool (SWAT) hydrologic models. This presentation describes a cyberinfrastructure we developed for hosting the HydroClim data, consisting of a relational database and a web-based data portal that allows scientists to query and download the data. We have imported almost 1.9 billion HydroClim data records into the system. At the time of this submission, 1.3 billion records of historical data and predicted streamflow and water temperature model data are available in the HydroClim data portal for 26 watersheds in the United States. The HydroClim data are also being integrated with fish occurrence data from Fishnet 2, via the Fishnet 2 API (Application Programming Interface), which provides occurrence data records for over 4.1 million species lots representing over 40 million specimens in ichthyological research collections. Our plan is to extract environmental data from the HydroClim API and merge it with georeferenced fish occurrences from the Fishnet 2 API, display the integrated data on web-based interactive hydrological maps for different time series, and provide a tool for visualizing ecosystem diversity. The combined HydroClim and Fishnet 2 data can be used for ecological niche modeling applications, such as predicting the future distribution of threatened and endangered freshwater fish species. I will describe the cyberinfrastructure of the HydroClim data portal and some of the ways the data can be used in biodiversity research in the future. HTML XML PDF
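      A minimal sketch of the planned merge of occurrence records with modeled stream conditions, joined here by watershed and month; the response shapes, field names, and join key are hypothetical stand-ins, not the documented HydroClim or Fishnet 2 interfaces.

        # Hypothetical, already-fetched API responses; real calls would go to the
        # HydroClim and Fishnet 2 services, whose actual endpoints and fields may differ.
        hydroclim = {
            ("HUC-0101", "2005-07"): {"streamflow_cms": 12.4, "water_temp_c": 18.2},
            ("HUC-0102", "2005-07"): {"streamflow_cms": 3.1, "water_temp_c": 21.5},
        }

        occurrences = [
            {"species": "Etheostoma caeruleum", "watershed": "HUC-0101", "month": "2005-07"},
            {"species": "Etheostoma caeruleum", "watershed": "HUC-0102", "month": "2005-07"},
        ]

        # Attach modeled stream conditions to each occurrence record, e.g., as
        # environmental predictors for niche modeling.
        for occurrence in occurrences:
            conditions = hydroclim.get((occurrence["watershed"], occurrence["month"]), {})
            occurrence.update(conditions)
            print(occurrence)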
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • The Reptile Database: Curating the biodiversity literature without
           funding

    • Abstract: Biodiversity Information Science and Standards 5: e75448
      DOI : 10.3897/biss.5.75448
      Authors : Peter Uetz : The Reptile Database (RDB) curates the literature and taxonomy for about 14,000 species and subspecies of reptiles (Uetz et al. 2021). Together with a few other databases, the RDB curates the literature for about 70,000 species of fish, amphibians, reptiles, birds and mammals. While it acts as a current name list for extant reptile taxa, including synonymies, it also collects images (currently ~18,000, representing half of all species), type information, diagnoses and descriptions, and a bibliography of 62,000 references, most of which are linked to online sources. The database is also extensively cross-referenced to citizen science projects (iNaturalist), the NCBI taxonomy, the IUCN Red List, and several others, and serves as data provider (for reptiles) for the Catalogue of Life. A major challenge for the Reptile Database is the consistent curation of the literature, which requires the addition of about 2000 papers a year, including about 200 new species descriptions and numerous taxonomic changes. For instance, during the past five years, almost 1000 species changed their names, in addition to the ~900 species that were newly described, i.e., almost 20% of all reptile species were described or changed their name within just a half decade! While the database can keep track of name changes, it remains a largely unsolved problem how these name changes can or should be translated into related databases such as the National Center for Biotechnology Information (NCBI), which keeps track of the literature independently (but exchanges data with the RDB). Some sites use the web services of the RDB to update their taxonomy, such as Calphotos or iNaturalist, but many do not or have not been able to implement automated name tracking. The RDB also works with the Global Assessment of Reptile Distributions (GARD Initiative) to keep track of range changes. After GARD published a collection of ~10,000 range maps for reptiles in 2017, more than half of these maps have changed in area size by more than 5% since the initial release. The database has developed several avenues for streamlining and optimizing curation of the literature, e.g., (semi-) automated requests for publications, species descriptions, and photos from authors, but the process is far from fully automated. Questions remain: how can taxonomic databases develop, share, and exchange better tools for curation? Can we standardize data collection and processing? How can we automatically exchange data with other data sources? How can we optimize the process of scientific publication to streamline databasing and automated information extraction? HTML XML PDF
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • Nomenclature over 5 years in TaxonWorks: Approach, implementation,
           limitations and outcomes

    • Abstract: Biodiversity Information Science and Standards 5: e75441
      DOI : 10.3897/biss.5.75441
      Authors : Matthew Yoder, Dmitry Dmitriev : We are now over four decades into digitally managing the names of Earth's species. As the number of federating (i.e., software that brings together previously disparate projects under a common infrastructure, for example TaxonWorks) and aggregating (e.g., International Plant Name Index, Catalog of Life (CoL)) efforts increases, there remains an unmet need for both the migration forward of old data, and for the production of new, precise and comprehensive nomenclatural catalogs. Given this context, we provide an overview of how TaxonWorks seeks to contribute to this effort, and where it might evolve in the future. In TaxonWorks, when we talk about governed names and relationships, we mean it in the sense of existing international codes of nomenclature (e.g., the International Code of Zoological Nomenclature (ICZN)). More technically, nomenclature is defined as a set of objective assertions that describe the relationships between the names given to biological taxa and the rules that determine how those names are governed. It is critical to note that this is not the same thing as the relationship between a name and a biological entity, but rather nomenclature in TaxonWorks represents the details of the (governed) relationships between names. Rather than thinking of nomenclature as changing (a verb commonly used to express frustration with biological nomenclature), it is useful to think of nomenclature as a set of data points, which grows over time. For example, when synonymy happens, we do not erase the past, but rather record a new context for the name(s) in question. The biological concept changes, but the nomenclature (names) simply keeps adding up. Behind the scenes, nomenclature in TaxonWorks is represented by a set of nodes and edges, i.e., a mathematical graph, or network (e.g., Fig. 1). Most names (i.e., nodes in the network) are what TaxonWorks calls "protonyms," monomial epithets that are used to construct, for example, binomial names (not to be confused with "protonym" sensu the ICZN). Protonyms are linked to other protonyms via relationships defined in NOMEN, an ontology that encodes governed rules of nomenclature. Within the system, all data, nodes and edges, can be cited, i.e., linked to a source and therefore anchored in time and tied to authorship, and annotated with a variety of annotation types (e.g., notes, confidence levels, tags). The actual building of the graphs is greatly simplified by multiple user-interfaces that allow scientists to review (e.g. Fig. 2), create, filter, and add to (again, not "change") the nomenclatural history. As in any complex knowledge-representation model, there are outlying scenarios, or edge cases that emerge, making certain human tasks more complex than others. TaxonWorks is no exception; it has limitations in terms of what and how some things can be represented. While many complex representations are hidden by simplified user-interfaces, some, for example the handling of the ICZN's Family-group name, batch-loading of invalid relationships, and comparative syncing against external resources, need more work to simplify the processes presently required to meet catalogers' needs. The depth at which TaxonWorks can capture nomenclature is only really valuable if it can be used by others. This is facilitated by the application programming interface (API) serving its data (https://api.taxonworks.org), by serving text files, and by exports to standards like the emerging Catalog of Life Data Package.
      With reference to real-world problems, we illustrate different ways in which the API can be used, for example integrated into spreadsheets, through the use of command-line scripts, and in the generation of public-facing websites. Behind all this effort is an increasing number of people recording help videos, developing documentation, and troubleshooting software and technical issues. Major contributions have come from developers at many skill levels, from high school to senior software engineers, illustrating that TaxonWorks leads in enabling both technical and domain-based contributions. The health and growth of this community is a key factor in TaxonWorks' potential long-term impact in the effort to unify the names of Earth's species. HTML XML PDF
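      A minimal sketch of the nodes-and-edges view of nomenclature described above, in which protonyms are nodes and governed relationships are cited edges that accumulate rather than change; the relationship labels and citation strings are illustrative, not terms from the NOMEN ontology or the TaxonWorks data model.

        # Protonyms as nodes; governed relationships as cited edges that only accumulate.
        protonyms = {"Aus", "bus", "cus"}

        relationships = [
            # (subject, relationship, object, source citation)
            ("bus", "original combination in genus", "Aus", "Smith 1900"),
            ("cus", "junior synonym of", "bus", "Jones 1950"),
        ]

        def history_for(name: str):
            # Every recorded nomenclatural assertion involving a given protonym.
            return [edge for edge in relationships if name in (edge[0], edge[2])]

        for subject, relation, obj, source in history_for("bus"):
            print(f"{subject} --{relation}--> {obj}  (per {source})")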
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • Does TDWG Need an API Design Guideline?

    • Abstract: Biodiversity Information Science and Standards 5: e75372
      DOI : 10.3897/biss.5.75372
      Authors : Ian Engelbrecht, Hester Steyn : RESTful APIs (REpresentational State Transfer Application Programming Interfaces) are the most commonly used mechanism for biodiversity informatics databases to provide open access to their content. In its simplest form, an API provides an interface based on the HTTP protocol whereby any client can perform an action on a data resource identified by a URL using an HTTP verb (GET, POST, PUT, DELETE) to specify the intended action. For example, a GET request to a particular URL (informally called an endpoint) will return data to the client, typically in JSON format, which the client converts to the format it needs. A client can either be custom-written software or commonly used programs for data analysis such as R (programming language), Microsoft Excel (everybody’s favorite data management tool), OpenRefine, or business intelligence software. APIs are therefore a valuable mechanism for making biodiversity data FAIR (findable, accessible, interoperable, reusable). There is currently no standard specifying how RESTful APIs should be designed, resulting in a variety of URL and response data formats for different APIs. This presents a challenge for API users who are not technically proficient or familiar with programming if they have to work with many different and inconsistent data sources. We undertook a brief review of eight existing APIs that provide data about taxa to assess consistency and the extent to which the Darwin Core standard (Wieczorek et al. 2021) for data exchange is applied. We assessed each API based on aspects of URL construction and the format of the response data (Fig. 1). While only cursory and limited in scope, our survey suggests that consistency across APIs is low. For example, some APIs use nouns for their endpoints (e.g. ‘taxon’ or ‘species’), emphasising their content, whereas others use verbs (e.g. ‘search’), emphasising their functionality. Response data seldom use Darwin Core terms (two out of eight examples) and a wide range of terms can be used to represent the same concept (e.g. six different terms are used for dwc:scientificNameAuthorship). Terms that can be considered metadata for a response, such as pagination details, also vary considerably. Interestingly, the public interfaces for the majority of APIs assessed do not provide POST, PUT or DELETE endpoints that modify the database. POST is only used for providing more detailed request bodies to retrieve data than is possible with GET. This indicates the primary use of APIs by biodiversity informatics platforms for data sharing. An API design guideline is a document that provides a set of rules or recommendations for how APIs should be designed in order to improve their consistency and usability. API design guidelines are typically created by particular organizations to standardize API development within the organization, or as a guideline for programmers using an organization’s software to build APIs (e.g., Microsoft and Google). The API Stylebook is an online resource that provides access to a wide range of existing design guidelines, and there is an abundance of other resources available online. This presentation will cover some of the general concepts of API design, demonstrate some examples of how existing APIs vary, and discuss potential options to encourage standardization. We hope our analysis, the available body of knowledge on API design, and the collective experience of the biodiversity informatics community working with APIs may help answer the question “Does TDWG need an API design guideline?” HTML XML PDF
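      A minimal sketch of the consistency problem the survey describes: two hypothetical APIs return the same concept under different keys, and a client normalizes both to Darwin Core terms. The example payloads and the term mapping are invented for illustration.

        # Two hypothetical API responses exposing the same concept under different keys.
        api_a = {"species": "Aus bus", "author": "Smith, 1900"}
        api_b = {"scientific_name": "Aus bus", "name_authorship": "Smith, 1900"}

        # Client-side mapping of provider-specific keys to Darwin Core terms.
        TO_DWC = {
            "species": "dwc:scientificName",
            "scientific_name": "dwc:scientificName",
            "author": "dwc:scientificNameAuthorship",
            "name_authorship": "dwc:scientificNameAuthorship",
        }

        def normalize(payload: dict) -> dict:
            # Rename provider-specific keys to Darwin Core terms where a mapping exists.
            return {TO_DWC.get(key, key): value for key, value in payload.items()}

        print(normalize(api_a) == normalize(api_b))  # True once both are normalized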
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • The Natural Science Collections Facility: Building community with South
           African museums and herbaria

    • Abstract: Biodiversity Information Science and Standards 5: e75373
      DOI : 10.3897/biss.5.75373
      Authors : Ian Engelbrecht, Audrey Ndaba, Shanelle Ribeiro, Fulufhelo Tambani, Michelle Hamer : The Natural Science Collections Facility (NSCF) is a South African initiative to develop a network of museum and herbarium institutions, to secure the collections under their care, and to ensure collections are used for research that is relevant to local and global priorities. A key component of the NSCF work is community building. In the past, institutions have operated independently and fall under a range of national, provincial, and local governance structures. Many collections are understaffed and resources for collections care are limited. Differing organizational cultures, histories, and sometimes individual personalities present significant obstacles to transformation, network building and progress. In 2018, shortly after its inauguration, transformation consultants joined the NSCF to facilitate the community building process. Emphasis shifted away from a purely business-oriented focus to a ‘softer’, people-centric orientation. The consultants introduced a wide range of social technologies such as meeting check-ins, active listening, rich pictures and a range of tools from the Liberating Structures framework, as well as new leadership paradigms and a new theory of organizational change to catalyze transformation from the bottom up. Importantly, the change paradigm is non-deterministic, and removes any requirements for goals, objectives, and timelines. Instead it prioritizes change that arises organically and spontaneously as a result of greater interpersonal engagement, breaking down barriers imposed by traditional hierarchy, bringing previously unheard voices to the fore, and engaging deeply with difficult, uncomfortable problems. Given South Africa’s socio-political history, race is one such difficult problem and remains a central topic in this transformation journey. Race-related problems are still deeply entrenched within South African institutions and often reflect historical privilege of whites over people of colour despite extensive equity target-based efforts to address them. The NSCF transformation process has highlighted the importance of confronting these issues, which are often opaque to historically privileged groups within organizations. Given the sensitivity of such issues, processes of engaging with them are deep, must be appropriately contained, and focus on uncovering new insights and illuminating new information or points of view. The community building process has in many ways been successful. A sense of community and camaraderie has developed amongst a significant portion of staff members in participating institutions and has been evidenced through the support provided to each other during the COVID-19 pandemic, the extent of engagement in NSCF social media platforms, and the general tone and atmosphere of meetings and workshops. There are still challenges, though, and some members and sectors in the community remain disengaged. Key lessons have been the importance of having professional consultants to lead community building and transformation initiatives of this scale; the need for people to open their minds and hearts to a transformation process and to expect to have their assumptions, prejudices and standpoints challenged despite the discomfort they may feel; and new understanding of the time and investment needed to truly improve and integrate existing communities and networks.
A key component of the NSCF process has been engaging on matters relating to specimen data, and particularly data sharing. Perspectives within the community vary widely, from those who believe data should be published openly as a matter of obligation, to those who believe that protecting their data is necessary to maintain their competitive advantage within the wider scientific community. It is apparent that many of the concerns around data sharing relate to the lack of skills among museum staff to clean, prepare, and publish their data. Since open data publication is one of the central goals of the NSCF, these are key topics that we return to repeatedly during the transformation process. This presentation will showcase some of the methodologies, highlights, and challenges of the NSCF community building journey thus far, focusing on issues around data sharing and building the skills needed within the community to manage and publish specimen data effectively. The NSCF initiative is part of the National Research Infrastructure Roadmap funded by the Department of Science and Innovation. HTML XML PDF
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • Delivering Fit-for-Use Data: Quality control

    • Abstract: Biodiversity Information Science and Standards 5: e75432
      DOI : 10.3897/biss.5.75432
      Authors : Felipe Simoes, Donat Agosti, Marcus Guidoti : Automatic data mining is not an easy task and its success in the biodiversity world is deeply tied to the standardization and consistency of scientific journals' layout structure. The various formatting styles found in the over 500 million pages of published biodiversity information (Kalfatovich 2010) pose a remarkable challenge towards the goal of automating the liberation of data currently trapped on the printed page. Regular expressions and other pattern-recognition strategies invariably fail to cope with this diverse landscape of academic publishing. Challenges such as incomplete data and taxonomic uncertainty add several additional layers of complexity. However, in the era of big data, the liberation of all the different facts contained in biodiversity literature is of crucial importance. Plazi tackles this daunting task by providing workflows and technology to automatically process biodiversity publications and annotate the information therein, all within the principles of FAIR (findable, accessible, interoperable, and reusable) data usage (Agosti and Egloff 2009). It uses the concept of taxonomic treatments (Catapano 2019) as the most fundamental unit in biodiversity literature to provide a framework that reflects the reality of taxonomic data for linking the different pieces of information contained in these taxonomic treatments. Treatment citations, composed of a taxonomic name and a bibliographic reference, and material citations, carrying all specimen-related information, are additional conceptual cornerstones for this framework. The resulting enhanced data are added to TreatmentBank. Figures and treatments are made Findable, Accessible, Interoperable and Reusable (FAIR) by depositing them, together with specific metadata, in the Biodiversity Literature Repository community (BLR) at the European Organization for Nuclear Research (CERN) repository Zenodo, and are pushed to GBIF. The automation, however, is error prone due to the constraints explained above. In order to cope with this remarkable task without compromising data quality, Plazi has established a quality control process based on logical rules that check the components of the extracted document, raising errors at four different levels of severity. These errors are also used in a data transit control mechanism, “the gatekeeper”, which blocks certain data transits to create deposits (e.g., BLR) or reuse of data (e.g., GBIF) in the presence of specific errors. Finally, a set of automatic notifications was included in the plazi/community GitHub repository, in order to provide a channel that empowers external users to report data issues directly to a dedicated team of data miners, who will in turn, and in a timely manner, fix these issues, improving data quality on demand. In this talk, we aim to explain Plazi’s internal quality control process and phases, the data transits that are potentially affected, as well as statistics on the most common issues raised by this automated endeavor and how we use the generated data to continuously improve this important step in Plazi's workflow. HTML XML PDF
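      The following is a minimal Python sketch, not Plazi's actual implementation, of the quality control idea described above: logical rules inspect an extracted treatment and raise issues at four severity levels, and a "gatekeeper" blocks selected data transits (e.g., deposit to BLR, reuse by GBIF) when blocking issues are present. All rule names, field names, and severity labels are illustrative assumptions.

        from dataclasses import dataclass

        @dataclass
        class Issue:
            rule: str
            severity: str   # one of four illustrative levels: "note", "warning", "error", "blocker"
            message: str

        def run_checks(treatment: dict) -> list[Issue]:
            """Apply simple logical rules to an extracted treatment and collect issues."""
            issues = []
            if not treatment.get("taxonomicName"):
                issues.append(Issue("missing-name", "blocker", "treatment has no taxonomic name"))
            if not treatment.get("bibliographicReference"):
                issues.append(Issue("missing-reference", "error", "treatment citation lacks a reference"))
            for citation in treatment.get("materialCitations", []):
                if not citation.get("country"):
                    issues.append(Issue("material-no-country", "warning", "material citation lacks a country"))
            return issues

        def gatekeeper(issues: list[Issue], transit: str) -> bool:
            """Return True if the given data transit (e.g., 'BLR', 'GBIF') may proceed."""
            blocking = {"BLR": {"blocker"}, "GBIF": {"blocker", "error"}}
            return not any(issue.severity in blocking.get(transit, set()) for issue in issues)

        issues = run_checks({"taxonomicName": "Aus bus", "materialCitations": [{}]})
        print(gatekeeper(issues, "GBIF"), [issue.rule for issue in issues])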
      PubDate: Mon, 20 Sep 2021 17:00:00 +030
       
  • A Case Study of Publishing Internal APIs to External Users

    • Abstract: Biodiversity Information Science and Standards 5: e75386
      DOI : 10.3897/biss.5.75386
      Authors : Max Patiiuk : External service integration and adherence to industry standards have become ever more important for collections data management platforms. External APIs (Application Programming Interfaces) allow for the development of bi-directional data flows critical to service integration. In contrast to service-oriented backend APIs, public APIs must have continually up-to-date, comprehensive documentation that covers common use cases, on-the-fly request validation, and meaningful error messages. OpenAPI (OpenAPI Initiative 2021), a machine-readable API documentation specification, can help significantly with testing and maintenance, and libraries can be used to automate common maintenance tasks. Specify 7 is a biological collections data management platform developed by the Specify Collections Consortium (Specify Software Consortium 2021). This presentation summarizes the challenges and lessons learned in publishing the existing backend Specify 7 API as a public-facing external API. Each Specify 7 API is composed of 200 resources. A standard set of CRUD (Create, Read, Update, Delete) operations is provided for each resource for client interaction, along with a group of service-based endpoints for bulk operations such as file uploads, file-based data imports, and attachment manipulation. To support the migration, we developed a custom library to enhance request validation. Parameter validation is extended through a real-time comparison against the existing schema and data. The library is available to the community under an MIT license on GitHub (https://github.com/specify/open_api_tools/). In this presentation, we will close with an overview of the next steps for the Specify 7 public API. These include: an update to the latest OpenAPI specification, version 3.1, which aims to increase compatibility with the JavaScript Object Notation (JSON) Schema specification and thus would allow us to use JSON Schema (IETF Trust 2021) validation frameworks; and an in-depth evaluation of GraphQL for its ability to force all endpoints to be strongly typed and to validate request parameters and response objects automatically. HTML XML PDF
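      As a generic illustration of the request-validation idea described above (not the Specify open_api_tools library itself), the following Python sketch checks incoming query parameters against an OpenAPI-style parameter description before a request is handled; the schema fragment, parameter names, and error wording are assumptions for illustration.

        PARAMETER_SPEC = {
            "limit":  {"type": "integer", "minimum": 1, "maximum": 1000, "required": False},
            "domain": {"type": "string", "enum": ["collectionobject", "taxon", "agent"], "required": True},
        }

        def validate_params(params: dict) -> list[str]:
            """Return human-readable error messages; an empty list means the request is valid."""
            errors = []
            for name, spec in PARAMETER_SPEC.items():
                if name not in params:
                    if spec.get("required"):
                        errors.append(f"missing required parameter '{name}'")
                    continue
                value = params[name]
                if spec["type"] == "integer":
                    try:
                        value = int(value)
                    except ValueError:
                        errors.append(f"'{name}' must be an integer")
                        continue
                    if not spec["minimum"] <= value <= spec["maximum"]:
                        errors.append(f"'{name}' must be between {spec['minimum']} and {spec['maximum']}")
                elif "enum" in spec and value not in spec["enum"]:
                    errors.append(f"'{name}' must be one of {spec['enum']}")
            for name in params:
                if name not in PARAMETER_SPEC:
                    errors.append(f"unknown parameter '{name}'")
            return errors

        print(validate_params({"domain": "taxon", "limit": "20"}))   # []
        print(validate_params({"limit": "0"}))                       # missing 'domain', 'limit' out of range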
      PubDate: Fri, 17 Sep 2021 11:30:00 +030
       
  • A Collective Effort to Update the Legume Checklist

    • Abstract: Biodiversity Information Science and Standards 5: e75377
      DOI : 10.3897/biss.5.75377
      Authors : Marianne Le Roux, Markus Döring, Anne Bruneau, Joe Miller, Rafaël Govaerts, Nick Black, Gwilym Lewis, Carole Sinou : Taxonomic names are critical to the communication of biodiversity—they link data together whether it be distribution data, traits or phylogeny. Large taxonomic groups, such as many plant families, are globally distributed as is the taxonomic expertise of the family. A growing knowledge base requires collaboration to develop an up-to-date checklist as a research foundation. The legume (Fabaceae) community has a strong history of collaboration including the International Legume Database and Information Service (ILDIS), which curated the names, but ILDIS is no longer up to date. In 2020, under the umbrella of the Legume Phylogeny Working Group (LPWG), a group of taxonomists began updating the legume taxonomy as part of a larger collaboration around a legume data portal. Currently the World Checklist of Vascular Plants (WCVP) is the most up-to-date reference and was used as the starting point for the project. The workflow begins with over 80 volunteer taxonomic experts updating the checklist in their specialty area. These lists are collated manually to create a central consensus taxonomy with synonyms. Any taxonomic conflicts are adjudicated within the group. The checklist then undergoes a comprehensive nomenclature assessment at Royal Botanic Gardens, Kew and becomes part of the WCVP. This checklist was submitted to the Catalogue of Life Checklist Bank and is integrated as the preferred legume checklist in the GBIF taxonomic backbone. After one round of taxonomic curation, 38% of the legume names in GBIF (Global Biodiversity Information Facility), which were previously unmatched to WCVP, are now connected to GBIF names, therefore also improving the occurrence records of those species. The GBIF taxonomic backbone contains names found on herbarium specimens and in the literature, which are not currently part of the legume expert community checklist or WCVP. This list of unresolved names will be forwarded to the legume community for curation, thereby developing a cycle of data improvement. It is anticipated that after a few rounds of expert curation, the WCVP and GBIF taxonomies will converge. At each cycle, a snapshot of GBIF occurrences is taken and the improvement of the occurrences is quantified to measure the value of the expert taxonomic work. The current checklist is also available via Catalogue of Life and soon via the World Flora Online to support research. In this talk, we describe the workflow and impact of the expert curated legume taxonomy. HTML XML PDF
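      As a small, hedged sketch of the name-matching step described above, the Python snippet below asks the public GBIF species-match web service how a legume name resolves against the GBIF backbone; the endpoint is part of the GBIF API, but the exact parameters and response fields used here should be verified against the current API documentation.

        import requests

        GBIF_MATCH = "https://api.gbif.org/v1/species/match"

        def match_name(name: str) -> dict:
            """Ask the GBIF backbone for its best match to a scientific name."""
            response = requests.get(GBIF_MATCH, params={"name": name, "family": "Fabaceae"}, timeout=30)
            response.raise_for_status()
            return response.json()

        for name in ["Acacia karroo Hayne", "Vachellia karroo (Hayne) Banfi & Galasso"]:
            match = match_name(name)
            print(name, "->", match.get("matchType"), match.get("scientificName"), match.get("status"))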
      PubDate: Fri, 17 Sep 2021 11:30:00 +030
       
  • Building a Global Open Extensible Biodiversity Commons Network

    • Abstract: Biodiversity Information Science and Standards 5: e75376
      DOI : 10.3897/biss.5.75376
      Authors : Deborah Paul, Joe Miller, Michael Webster : Recent global events reinforce the need for local to global coalitions to address a variety of socio-environmental challenges such as the current COVID-19 pandemic (Cook et al. 2020) and biodiversity loss in general. Scientists reviewing data and their fitness for current and future use note urgently needed changes in data collection, specimen collection and preservation, infrastructure, human capacity, and standards-of-practice (Raven and Miller 2020, Morrison et al. 2017, Cook et al. 2020). Multi-faceted research questions often require cross-disciplinary collaboration. A recent paper analyzed conservation and disease mitigation research author networks and discovered that certain disciplines do not work together unless the research has outcomes that serve all groups involved (Kading and Kingston 2020). This research reinforces the finding that common goals offer a powerful way to build effective cross-disciplinary networks, speed up collaboration, and more effectively take on complex research. To move toward a Digital Extended Specimen (DES), the alliance for biodiversity knowledge is engaging in community building. The above summary, coupled with conversations from our alliance-led online consultations, reinforces known threads and reveals some emerging themes about partnerships and collaborations. Our group continues to work on defining what a Digital Specimen is (or is not) and then communicating that succinctly to the worldwide community. At the same time as we recognize the need for an extensible digital specimen object, we note the need for an extensible network. We note that groups need to, and are motivated to, solve local issues (for their town, country, or continent). So, looking for and selecting common threads across these regional scales will be key to realizing and motivating effective partnerships and networks. Foremost, this includes expanding participation beyond Europe and North America. We recognize the need to form new partnerships to expand our network and learn from our new partners. For example, the Digital Humanities community would like to talk about the intersection of the humanities, social sciences, biology, and collections, and how these fields can help each other to do better research. With this talk, and through participation in TDWG2021, we seek to share information and insights gathered so far about next steps and about building and sustaining the network we need to realize a biodiversity data commons, and to get input from those who participate in our session. HTML XML PDF
      PubDate: Fri, 17 Sep 2021 11:30:00 +030
       
  • Matching Species Names Across Biodiversity Databases: Sources, tools,
           pitfalls and best practices for taxonomic harmonization

    • Abstract: Biodiversity Information Science and Standards 5: e75359
      DOI : 10.3897/biss.5.75359
      Authors : Matthias Grenié, Emilio Berti, Juan Carvajal-Quintero, Marten Winter, Alban Sagouis : The quantity and quality of ecological data have rapidly increased in the last decades, bringing ecology into the realm of big data. Frequently, multiple databases with different origins and data characteristics are combined to address new research questions. Taxonomic name harmonization, i.e., the process of standardizing taxon names according to common sources such as taxonomic databases (TDs), is necessary to properly combine multiple databases using species names. In order to be able to develop proper data matching workflows, TDs and the tools using them need to be clearly and comprehensively described. But this is rarely the case. Common problems users have to deal with are: poorly described taxonomic concepts behind biological databases, lack of information on when TDs are actively updated, and missing details regarding where the primary source of taxonomic information comes from (e.g., secondary TDs taking information from primary TDs). In addition, software to access these TDs is not always advertised, is partly redundant, or is developed with incompatible standards, creating additional challenges for users. As a result, taxonomic name harmonization has become a major difficulty in ecological studies. Researchers face a jungle of primary and secondary TDs, with a diversity of tools to access them and no clear workflow on how to practically proceed. As a consequence, it is hard for users to know which TD, tool and workflow will fit the task at hand and lead to the most robust results when combining different biological datasets. Here, we present an overview of major TDs as well as an extensive review of R packages to access TDs and to harmonize taxon names. We developed an R Shiny web application summarizing metadata and linkages among TDs and R packages (Figs 1, 2), which users can explore to learn about general features of TDs and tools and how they are linked to one another. This is particularly helpful to assist users when deciding on the TDs and tools that best fit the tasks and data at hand and to develop more informed workflows for taxonomic name harmonization. Finally, from our review and using the Shiny app, we were able to provide general best-practice principles to harmonize taxonomic names and avoid common pitfalls. To our knowledge, this study represents the most exhaustive review of TDs and R tools for taxonomic name harmonization. Our intuitive Shiny app can help make practical decisions when harmonizing taxonomic names across multiple datasets. Finally, our proposed workflows, based on conservative guideline principles (e.g., making sure incompatible taxonomic hypotheses are not combined), provide a hands-on approach for taxonomic harmonization, which focuses on the quality of the end results while maximizing the number of species correctly matched. HTML XML PDF
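      The sketch below illustrates, in Python, one conservative step of the kind of workflow discussed above: normalize names to a canonical form (dropping authorship), match exactly against a reference checklist, and flag anything unmatched for expert review instead of guessing. The regular expression and the tiny reference list are illustrative assumptions, not part of the study.

        import re

        REFERENCE = {  # canonical name -> accepted name in the chosen taxonomic database
            "Vachellia karroo": "Vachellia karroo",
            "Acacia karroo": "Vachellia karroo",   # synonym mapped to its accepted name
        }

        def canonical(name: str) -> str:
            """Keep genus plus lower-case epithets; drop authorship such as '(Hayne) Banfi & Galasso'."""
            tokens = re.findall(r"[A-Za-z-]+", name)
            return " ".join([tokens[0].capitalize()] + [t for t in tokens[1:3] if t.islower()])

        def harmonize(names: list[str]) -> dict:
            """Map each input name to an accepted name, or to None if it needs expert review."""
            return {raw: REFERENCE.get(canonical(raw)) for raw in names}

        print(harmonize(["Acacia karroo Hayne", "Acacia sp. nov."]))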
      PubDate: Fri, 17 Sep 2021 11:30:00 +030
       
  • Challenges, Solutions, and Workflows Developed for the Taxonomic Backbone
           of the World Flora Online.

    • Abstract: Biodiversity Information Science and Standards 5: e75343
      DOI : 10.3897/biss.5.75343
      Authors : William Ulate, Sunitha Katabathuni, Alan Elliott : The World Flora Online (WFO) is the collaborative, international initiative to achieve Target 1 of the Global Strategy for Plant Conservation (GSPC): "An online flora of all known plants." WFO provides an open-access, web-based compendium of the world’s plant species, which builds upon existing knowledge and published floras, checklists and revisions but will also require the collection and generation of new information on poorly known groups and unexplored regions (Borsch et al. 2020). The construction of the WFO Taxonomic Backbone is central to the entire WFO as it determines the accessibility of additional content data and, at the same time, represents a taxonomic opinion on the circumscription of those taxa. The Plant List v.1.1 (TPL 2013) was the starting point for the backbone, as this was the most comprehensive available resource covering all plants. We have since curated the higher taxonomy of the backbone, based on the following published community-derived classifications: the Angiosperm Phylogeny Group (APG IV 2016), the Pteridophyte Phylogeny Group (PPG I 2016), Bryophytes (Buck et al. 2008), and Hornworts & Liverworts (Söderström et al. 2016). The WFO presents a community-supported consensus classification with the aim of being the authoritative global source of information on the world's plant diversity. The backbone is actively curated by our Taxonomic Expert Networks (TENs), consisting of specialists in taxonomic groups, ideally at the family or order level. There are currently 37 approved TENs, involving more than 280 specialists, working with the WFO. There are small TENs like the Begonia Resource Center and the Meconopsis Group (with five specialists), medium TENs like the Ericaceae and Zingiberaceae Resource Centers or SolanaceaSource.org (around 15 experts), and larger TENs like Caryophyllales.org and the Legume Phylogeny Working Group, with more than 80 specialists involved. When we do not have taxonomic oversight, the World Checklist of Vascular Plants (WCVP 2019) has been used to update those families from the TPL 2013 classification. Full credit and acknowledgement to the original sources is a key requirement of this collaborative project, allowing users to refer to the primary data. For example, an association with the original content is kept through the local identifiers used by the taxonomic content providers as a link to their own resources. A key requirement for the WFO Taxonomic Backbone is that every name should have a globally unique identifier that is maintained, ideally forever. After considering several options, the WFO Technology Working Group recommended that the WFO Council establish a WFO Identifier (WFO-ID), a 10-digit number with a “wfo-” prefix, aimed at establishing a resolvable identifier for all existing plant names, which will not only be used in the context of WFO but can be universally used to reference plant names. Management of the WFO Taxonomic Backbone has been a challenge as TPL v1.1 was derived from multiple taxonomic datasets, which led to duplication of records. For that reason, names can be excluded from the public portal by the WFO Taxonomic Working Group or the TENs, but not deleted. A WFO-ID is not deleted nor reused after it has been excluded from the WFO Taxonomic Backbone. Keeping these allows for better matching when assigning WFO-IDs to data derived from content providers. Nevertheless, this implies certain considerations for new names and duplications. New names are added to the WFO Taxonomic Backbone via nomenclators like the International Plant Names Index (IPNI, The Royal Botanic Gardens, Kew et al. 2021) for Angiosperms, and Tropicos (Missouri Botanical Garden 2021) for Bryophytes, as well as by harvesting endemic and infraspecific names from Flora providers when they provide descriptive content. New names are passed to the TEN to make a judgement on their taxonomic status. When TENs provide a new authoritative taxonomic list for their group, we first produce a Name Matching report to ensure no names are missed. Several issues come from managing and maintaining taxonomic lists, but the process of curating an ever-growing integrated resource leads to an increase in the challenges we face with homonyms, non-standard author abbreviations, orthographic variants and duplicate names when Name Matching. The eMonocot database application, provided by Royal Botanic Gardens, Kew (Santarsiero et al. 2013), and subsequently adapted by the Missouri Botanical Garden to provide the underlying functionality for WFO's current toolset, has also proven itself to be a challenging component to update. In this presentation, we will share our hands-on experience, technical solutions and workflows for creating and maintaining the WFO Taxonomic Backbone. HTML XML PDF
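      A short Python sketch of the identifier scheme described above follows: a "wfo-" prefix and a 10-digit number. The zero-padding and the example sequence numbers are assumptions for illustration, not the WFO's minting rules.

        import re

        WFO_ID_PATTERN = re.compile(r"^wfo-\d{10}$")

        def is_valid_wfo_id(identifier: str) -> bool:
            """Check that an identifier has the 'wfo-' prefix followed by exactly 10 digits."""
            return bool(WFO_ID_PATTERN.match(identifier))

        def mint_wfo_id(sequence_number: int) -> str:
            """Format a running sequence number as a 10-digit, zero-padded WFO-ID."""
            if not 0 <= sequence_number <= 9_999_999_999:
                raise ValueError("sequence number does not fit in 10 digits")
            return f"wfo-{sequence_number:010d}"

        print(mint_wfo_id(1000003))                                                # wfo-0001000003
        print(is_valid_wfo_id("wfo-0000546987"), is_valid_wfo_id("WFO:546987"))    # True False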
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • An Implementation Approach for the Humboldt Extension to
           Darwin Core

    • Abstract: Biodiversity Information Science and Standards 5: e75350
      DOI : 10.3897/biss.5.75350
      Authors : Peter Brenton : The Humboldt extension to the Darwin Core standard's Event Core has been proposed in order to provide a standard framework to capture important information about the context in which biodiversity occurrence observations and samples are recorded. This information includes methods and effort, which are critical for determining species abundance and other measures of population dynamics, as well as completeness of survey coverage. As this set of terms is being developed, we are using real-world use cases to ensure that these terms can address all known situations. We are also considering approaches to implementation of the new standard to maximise opportunities for uptake and adoption. In this presentation I provide an example of how the Humboldt extension will be implemented in the Atlas of Living Australia’s (ALA) BioCollect application. BioCollect is a cloud-based multi-project platform for all types of biodiversity and ecological field data collection and is particularly suited for capturing fully described, complex, protocol-based systematic surveys. For example, BioCollect supports a wide array of customised survey event-based data schemas, which can be configured for different kinds of stratified (and other) sampling protocols. These schemas can record sampling effort at the event level, and event effort can be aggregated across a dataset to provide a calculated measure of effort for the whole dataset. Such data-driven approaches to providing useful dataset-level metadata can also be applied to measures of taxonomic completeness as well as spatial and temporal coverage. In addition, BioCollect automatically parses biodiversity occurrence records from event records for harvest by the ALA. In this process, the semantic relationship between the occurrence records and their respective event records is also preserved, and linkages between them enable cross-navigation for improved contextual interpretation. The BioCollect application demonstrates one approach to a practical implementation of the Humboldt extension. HTML XML PDF
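      The following Python sketch illustrates the aggregation idea described above: event-level sampling effort rolled up into a dataset-level figure, together with simple temporal coverage. The field names loosely echo Darwin Core/Humboldt-style terms but are assumptions for illustration, not the extension's final term names.

        from collections import defaultdict

        events = [
            {"eventID": "e1", "year": 2019, "samplingProtocol": "pitfall trap", "trapNights": 40},
            {"eventID": "e2", "year": 2019, "samplingProtocol": "pitfall trap", "trapNights": 25},
            {"eventID": "e3", "year": 2020, "samplingProtocol": "pitfall trap", "trapNights": 60},
        ]

        def dataset_effort(events: list[dict]) -> dict:
            """Aggregate event-level effort into dataset-level metadata."""
            by_year = defaultdict(int)
            for event in events:
                by_year[event["year"]] += event["trapNights"]
            return {
                "totalEffort_trapNights": sum(by_year.values()),
                "effortByYear": dict(by_year),
                "temporalCoverage": (min(e["year"] for e in events), max(e["year"] for e in events)),
            }

        print(dataset_effort(events))   # {'totalEffort_trapNights': 125, 'effortByYear': {2019: 65, 2020: 60}, ...}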
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • Cloud-based Software Platforms for Citizen Science: Implications and
           opportunities for biodiversity standards

    • Abstract: Biodiversity Information Science and Standards 5: e74374
      DOI : 10.3897/biss.5.74374
      Authors : Peter Brenton : Whether community created and driven, or developed and run by researchers, most citizen science projects operate on minimalistic budgets; their capacity to invest in fully featured bespoke software and databases is usually very limited. Further, the increasing number of applications and citizen science options available for public participation creates a confusing situation to navigate. Cloud-based platforms such as BioCollect, iNaturalist, eBird, CitSci.org, and Zooniverse provide an opportunity for citizen science projects to leverage highly featured, functional software capabilities at a fraction of the cost of developing their own, as well as a common channel through which the public can find and access projects. These platforms are also excellent vehicles to facilitate the implementation of data and metadata standards, which streamline interoperability and data sharing. Such services can also embed measures in their design that uplift the descriptions and quality of data outputs, significantly amplifying their usability and value. In this presentation I outline the experiences of the Atlas of Living Australia on these issues and demonstrate how we are tackling them with the BioCollect and iNaturalist platforms. We also consider the differences and similarities of these two platforms with respect to standards and data structures, and their suitability for different use cases. You are invited to join a discussion on approaches being adopted and offer insights for improved outcomes. HTML XML PDF
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • A Set of Simple Tools For Assembling, Annotating, Versioning and
           Publishing Taxonomies

    • Abstract: Biodiversity Information Science and Standards 5: e75344
      DOI : 10.3897/biss.5.75344
      Authors : Laura Rocha Prado : Biodiversity data publishers rely on virtually assembled taxonomic hierarchies to structure their data, with operational units involving scientific names, nomenclatural acts and taxonomic trees. The main goal for the majority of biodiversity aggregators, databases, and software developed specifically for managing scientific names, biological samples and other occurrences has been to establish a single, unified biological classification, to serve as their structural "taxonomic backbone." Resources to produce and publish biological classifications digitally are thus typically restricted to those generating unified taxonomic backbones, leaving individual researchers and decentralized communities with few options to assemble, visualize, version and disseminate multiple taxonomies online. To aid the creation of a culture of assembling, annotating, versioning, and publishing taxonomies online, and to help users interested in taxonomic classifications that lack digital communities, the development of a set of modular and independent tools is proposed, based on the following complementary components: a web application to serve as the taxonomy curator (referred to as the Curator), and a web application to serve as the optional taxonomic database and information provider (referred to as the Aggregator). These tools are being designed and built following modern software development standards, in a modular architecture consisting of front-end clients, databases, and back-end applications, with the provision for a public Application Programming Interface (API) that will make data available for any interested parties and can potentially be integrated into large-scale projects like the Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio), Symbiota (Gries et al. 2014), and Plazi (Agosti and Egloff 2009). The Curator tool will be a publicly accessible front-end web application, with which users can assemble, curate, and export taxonomies. The primary focus is to support user-preferred taxonomy generation, with manual inputs and optional annotations of the resulting product. Users can pick between three modes of taxonomy assembly: manual mode with assisted taxon search; automated generation from an online source; and automated generation from a file upload. Taxonomies can be edited and annotated as necessary. Once a user is satisfied with their taxonomy, they can save it in one or all of the available formats for exporting and external usage (common formats include, among others, JSON (JavaScript Object Notation), CSV (comma-separated values), and XML). Logged-in users can also opt to save the taxonomy in the Aggregator database, which will make the taxonomy publicly available. Ideally, all fields in the Curator forms should correspond to terms included in the Darwin Core standard (Wieczorek et al. 2012) or Plazi’s TaxonX schema (Agosti and Egloff 2009) (for hierarchies available in published treatments). The Aggregator tool will communicate with the database and will provide users with a number of functionalities, such as: storing and publishing versioned taxonomies generated with the Curator; API endpoints for automation (JSON/XML formats, CSV download); optional unique identifier/DOI generation for published taxonomies; and a search engine with a user-friendly interface as well as an API endpoint for querying the database. The possibility of making taxonomies available as an API endpoint, as well as exporting taxonomies in different formats, will ensure that this tool behaves as a taxonomic source that can be used by virtually any interested party or application. The tools are being modelled as a decentralized community resource that can be used for any or all taxonomic groups and, as such, their scale and impact will be driven by bottom-up community use. The goal is not to provide extensive coverage of all biological organisms, but rather to provide an open digital toolkit and space for biodiversity researchers and projects that lack access to open, structured, online taxonomic publication venues and dedicated tools. Practical examples of usage for these tools include: a user generates multiple taxonomic concepts for organisms they are studying, which can then be queried and analyzed by scripts that make taxonomic alignments to compare different scientific hypotheses throughout time; an institution wants to publish a regional Symbiota portal to manage specimens in a particular collection, so they establish an annotated working taxonomic backbone with the Curator that Symbiota will then be able to ingest before samples can be imported into the portal; and a researcher wants to export a biodiversity portal taxonomy at a given moment and wants to annotate and publish this version in an upcoming paper to establish scientific baselines for proper taxonomic communication. HTML XML PDF
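      As a compact sketch of the export step described above, the Python snippet below holds a tiny taxonomy as parent/child records and writes it out both as JSON and as a flat CSV with Darwin Core-style column names; the structure, version label, and column choices are illustrative assumptions.

        import csv, io, json

        taxa = [
            {"taxonID": "t1", "scientificName": "Fabaceae",         "taxonRank": "family",  "parentNameUsageID": None},
            {"taxonID": "t2", "scientificName": "Vachellia",        "taxonRank": "genus",   "parentNameUsageID": "t1"},
            {"taxonID": "t3", "scientificName": "Vachellia karroo", "taxonRank": "species", "parentNameUsageID": "t2"},
        ]

        def to_json(taxa: list[dict]) -> str:
            """Export the taxonomy as a versioned JSON document."""
            return json.dumps({"version": "2021-09-01", "taxa": taxa}, indent=2)

        def to_csv(taxa: list[dict]) -> str:
            """Export the same taxonomy as a flat CSV table."""
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=list(taxa[0].keys()))
            writer.writeheader()
            writer.writerows(taxa)
            return buffer.getvalue()

        print(to_json(taxa))
        print(to_csv(taxa))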
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • ELViS is in the Building: The European Loans and Visits System and first
           experiences with Transnational and Virtual Access

    • Abstract: Biodiversity Information Science and Standards 5: e75312
      DOI : 10.3897/biss.5.75312
      Authors : Sharif Islam, Helen Hardy, Scott Wilson : The European Loans and Visits System (ELViS), a DiSSCo e-service in development, will be a one-stop shop for global scientific users to access the Natural Science Collections in Europe. This talk provides a summary of important milestones: the release of ELViS version 1.0 on March 18, 2021, and an analysis of the feedback received from access providers and scientific users (over 500 submissions were received). ELViS 1.0 was used to facilitate the 3rd Transnational Access call (to fund short-term research visits to consortium institutions) and the 2nd Virtual Access call (to fund digitisation-on-demand requests) for SYNTHESYS+ (a European Commission-funded project to develop European collections infrastructure). This milestone is the culmination of activities in SYNTHESYS+ with partners consisting of researchers and staff members of several museums and herbaria across Europe and a commercial partner, Picturae (a Dutch company specialising in collections digitisation and preservation services for the cultural heritage and archival sectors). The talk starts with a brief summary of the activities and behind-the-scenes planning processes that went into ensuring a smooth transition from the existing SYNTHESYS+ transnational access portal to the new ELViS system. These activities included weekly meetings, testing, bug fixing, coordinating with transnational and virtual access coordinators from different institutions, and wireframe design. The talk also focuses on specific aspects of the data elements that enabled the call and the application process, with examples of using persistent identifiers for people, institutions and facilities. The concepts behind these data elements and identifiers were based on the blueprint of the DiSSCo architecture. The talk concludes with lessons learned, issues discovered, and a brief look at future plans and upcoming milestones for ELViS. HTML XML PDF
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • Digitizing Primary Data on Biodiversity to Protect Natural History
           Collections Against Catastrophic Events: The type material of dragonflies
           (Insecta: Odonata) from Museu Nacional of Brazil

    • Abstract: Biodiversity Information Science and Standards 5: e75284
      DOI : 10.3897/biss.5.75284
      Authors : Marcus De Almeida, Ângelo Pinto, Alcimar Carvalho : Natural history collections (NHC) are guardians of biodiversity (Lane 1996) and essential for understanding the natural world and its evolutionary processes. They hold samples of the morphological and genetic heritages of living and extinct biotas, helping to reconstruct the timeline of life over the centuries (Gardner 2014). Primary data from specimens in NHC are crucial elements for research in many areas of the biological sciences, considered the “bricks” of systematics and therefore one of the pillars for evolutionary studies (Troudet 2018). For this reason, studies carried out in NHC are essential for the development of scientific knowledge and are pivotal for the scientific-technological progress of a nation (Camargo 2015). The digitization and availability of primary data on biodiversity from NHC represent an inexpensive, practical and secure means of exchanging information, allowing collaboration between institutions and researchers. In this sense, initiatives such as the Sistema de Informação sobre a Biodiversidade Brasileira (SiBBr), a country-level branch of the Global Biodiversity Information Facility (GBIF) platform, aim to encourage and establish ways for the informatization of biological collections and their type specimens. Known for housing one of the largest and oldest collections of insects in the world focused on Neotropical fauna, the Entomological Collection of the Museu Nacional of the Federal University of Rio de Janeiro (MNRJ) had more than 3,000 primary types and approximately 12,005,000 specimens, of which about 96% were lost in the tragic fire that occurred at the institution on September 2, 2018. The SiBBr project was active in that collection from 2016 to 2019 and enabled the digitization and preservation of data from the type material of many insect orders, including the charismatic dragonflies (order Odonata). Due to the end of the agreement between SiBBr and the Museu Nacional, most of the obtained primary data are pending full curation and, therefore, are not yet available to the public and researchers. The MNRJ housed the largest and most important collection of dragonflies among all Central and South American institutions. It assembled most of the physical records of the neotropical dragonfly fauna gathered over the last 80 years, many of which are of undescribed taxa. Unfortunately, almost all material was permanently lost. This study aims to gather, analyze and publicize primary data on the type material of dragonflies housed in the MNRJ, ensuring the preservation of its history, as well as providing data on the taxonomy and diversity of this marvelous group of insects. A total of 11 families, 50 genera and 131 species were recorded, belonging to the suborders Anisoptera and Zygoptera, with distributional records widespread in South America. The MNRJ housed 105 holotypes of dragonfly nomina, representing 11.7% of the richness of the Brazilian Odonata fauna (901 spp.); Brazil is the country with the highest number of species in the biosphere. The impact of the loss of this collection on studies of these insects is unprecedented, since some enigmatic and monotypic genera such as Brasiliogomphus, Fluminagrion and Roppaneura lost 100% of their type series, while more diverse genera such as Lauromacromia, Oxyagrion and Neocordulia lost 50%, 35% and 31% of their holotypes, respectively. Therefore, through the registration and preservation of primary biodiversity data, this work reiterates the importance of curating and digitizing biological scientific collections. Furthermore, it is extremely relevant for permanently preserving information on existing biodiversity and providing support for future research. Digitizing and interconnecting digital extended specimen data prove to be among the main and most effective ways to protect NHC heritage and their primary data against catastrophic events. HTML XML PDF
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • APIs: A Common Interface for the Global Biodiversity Informatics Community

    • Abstract: Biodiversity Information Science and Standards 5: e75267
      DOI : 10.3897/biss.5.75267
      Authors : Ben Norton : Web APIs (Application Programming Interfaces) facilitate the exchange of resources (data) between two functionally independent entities across a common programmatic interface. In more general terms, Web APIs can connect almost anything to the world wide web. Unlike traditional software, APIs are not compiled, installed, or run. Instead, data are read (or consumed in API speak) through a web-based transaction, where a client makes a request and a server responds. Web APIs can be loosely grouped into two categories within the scope of biodiversity informatics, based on purpose. First, Product APIs deliver data products to end-users. Examples include the Global Biodiversity Information Facility (GBIF) and iNaturalist APIs. Designed and built to solve specific problems, web-based Service APIs are the second type and the focus of this presentation (referred to as Service APIs). Their primary function is to provide on-demand support to existing programmatic processes. Examples of this type include the Elasticsearch Suggester API and geolocation, a service that delivers geographic locations from spatial input (latitude and longitude coordinates) (Pejic et al. 2010). Many challenges lie ahead for biodiversity informatics and the sharing of global biodiversity data (e.g., Blair et al. 2020). Service-driven, standardized web-based Service APIs that adhere to best practices within the scope of biodiversity informatics can provide the transformational change needed to address many of these issues. This presentation will highlight several critical areas of interest in the biodiversity data community, describing how Service APIs can address each individually. The main topics include: standardized vocabularies; interoperability of heterogeneous data sources; and data quality assessment and remediation. Fundamentally, the value of any innovative technical solution can be measured by the extent of community adoption. In the context of Service APIs, adoption takes two primary forms: financial and temporal investment in the construction of clients that utilize Service APIs, and willingness of the community to integrate Service APIs into their own systems and workflows. To achieve this, Service APIs must be simple, easy to use, pragmatic, and designed with all major stakeholder groups in mind, including users, providers, aggregators, and architects (Anderson et al. 2020; this study). Unfortunately, many innovative and promising technical solutions have fallen short not because of an inability to solve problems (Verner et al. 2008), but rather because they were difficult to use, built in isolation, and/or designed without effective communication with stakeholders. Fortunately, projects such as Darwin Core (Wieczorek et al. 2012), the Integrated Publishing Toolkit (Robertson et al. 2014), and Megadetector (Microsoft 2021) provide the blueprint for successful community adoption of a technological solution within the biodiversity community. The final section of this presentation will examine the often overlooked non-technical aspects of this technical endeavor, specifically how following these models can broaden community engagement and bridge the knowledge gap between the major stakeholders, resulting in the successful implementation of Service APIs. HTML XML PDF
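      The Python sketch below shows a minimal web-based Service API in the sense described above: a small JSON endpoint that checks a submitted value against a controlled vocabulary on demand. The vocabulary, path, and port are illustrative assumptions; a production service would add OpenAPI documentation, fuller validation, and richer error handling.

        import json
        from http.server import BaseHTTPRequestHandler, HTTPServer
        from urllib.parse import urlparse, parse_qs

        BASIS_OF_RECORD = {"PreservedSpecimen", "HumanObservation", "MaterialSample", "FossilSpecimen"}

        class VocabularyService(BaseHTTPRequestHandler):
            def do_GET(self):
                url = urlparse(self.path)
                if url.path != "/check/basisOfRecord":
                    self.send_error(404, "unknown service")
                    return
                value = parse_qs(url.query).get("value", [""])[0]
                body = json.dumps({"value": value, "valid": value in BASIS_OF_RECORD,
                                   "allowed": sorted(BASIS_OF_RECORD)}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)

        if __name__ == "__main__":
            # e.g., GET http://localhost:8080/check/basisOfRecord?value=PreservedSpecimen
            HTTPServer(("localhost", 8080), VocabularyService).serve_forever()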
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • Experiences from the Danish Fungal Atlas: Linking mushrooming, nature
           conservation and primary biodiversity research 

    • Abstract: Biodiversity Information Science and Standards 5: e75265
      DOI : 10.3897/biss.5.75265
      Authors : Jacob Heilmann-Clausen, Tobias Frøslev, Jens Petersen, Thomas Læssøe, Thomas Jeppesen : The Danish Fungal Atlas is a citizen science project launched in 2009 as a collaboration among the University of Copenhagen, Mycokey and the Danish Mycological Society. The associated database now holds almost 1 million fungal records, contributed by more than 3000 recorders. The records represent more than 8000 fungal species, of which several hundred have been recorded as new to Denmark during the project. In addition, several species have been described as new to science. Data are synchronized with the Global Biodiversity Information Facility (GBIF) on a weekly basis and are hence freely available for research and nature conservation. Data have been used for systematic conservation planning in Denmark, and several research papers have used data to explore subjects such as host selection in wood-inhabiting fungi (Heilmann-Clausen et al. 2016), recording bias in citizen science (Geldmann et al. 2016), fungal traits (Krah et al. 2019), biodiversity patterns (e.g. Andrew et al. 2018), and species discovery (Heilmann-Clausen et al. 2019). The project database is designed to facilitate direct interactions and communication among volunteers. The validation of submitted records is interactive and combines species-specific smart filters, user credibility, and expert tools to secure the highest possible data credibility. In 2019, an AI (artificial intelligence)-trained species identification tool was launched along with a new mobile app, enabling users to identify and record species directly in the field (Sulc et al. 2020). At the same time, DNA sequencing was tested as an option to check difficult identifications, and in 2021 a high-throughput sequencing facility was developed to allow DNA sequencing of hundreds of fungal collections at a low cost. The presentation will give details on data validation, data use, and how we have worked on cultivating volunteers to provide a truly coherent model for collaboration on mushroom citizen science. HTML XML PDF
      PubDate: Thu, 16 Sep 2021 11:00:00 +030
       
  • Robust Integration of Biodiversity Data by Process- and State-based
           Representation of Object Histories and Modular Application Architecture

    • Abstract: Biodiversity Information Science and Standards 5: e75178
      DOI : 10.3897/biss.5.75178
      Authors : Christian Bölling, Satpal Bilkhu, Christian Gendreau, Falko Glöckler, James Macklin, David Shorthouse : Biodiversity data is obtained by a variety of methodological approaches—including observation surveys, environmental sampling and biological object collection—employing diverse sample processing protocols and data transformations. While complete and accurate accounts of these data-generating processes are important to enable integration and informed reuse of data, the structure and content of published biodiversity data currently are often shaped by specific application goals. For example, data publishers that export specimen-based data from collection management systems for inclusion in aggregations like those in the Global Biodiversity Information Facility (GBIF) must frequently relax their internal models and produce unnatural joins to fit GBIF’s occurrence-based data structure. Third-party assertions over these aggregated data therefore assume the risk of irreproducibility or concept drift. Here we introduce process- and state-based representation of object histories as the main organizing principle for data about specimens and samples in collection management software compliant with the Digital Information System for Natural History Data (DINA, Glöckler et al. 2020) (Fig. 1). Specimens, samples and objects in general are subjected to a variety of processes, including planned actions involving the object, e.g., collecting, preparing, subsampling, loaning. Object states are any particular mode of being of an object at a certain point in time. For example, any one intermediate step in preparing a collected specimen for long-term conservation in a collection would constitute an individual object state. An object’s history is the entire chain of these interrelated processes and states. We argue that using object histories as the main conceptual modeling paradigm in DINA offers the generality required to accommodate a diverse, open set of use cases in biodiversity data representation, yet also offers the versatility to serve as a basis for use-case-specific data aggregation and presentation. Specifically, a representation based on object histories provides: a coherent structure for documenting individual processes and states for any given object and for linking this documentation (e.g., textual descriptions or images pertaining to a given process or state); a natural representational structure for the real-world sequence of processes an object participates in and for the data generated in these processes (e.g., a DNA-extraction procedure and the sequence information generated on its basis); and a straightforward structure to link data about related objects (e.g., tissue samples, the biological specimen a bone is derived from) in a network of connected object histories. The approach is designed to be embedded in DINA’s modular application architecture, so that information on object histories can be accessed via corresponding APIs either through its own interfaces (Fig. 2) or by integration with external web services (Fig. 3). Viewing collection management tasks as part of object histories also informs delineation of modules to support these tasks with specialized functions and interfaces. It also admits the use of persistent, dereferenceable identifiers for individual processes and states in object histories and for linking their representations to elements in ontologies and controlled vocabularies. In this contribution to the symposium, DINA's object histories as a main organizing principle for collection object data will be discussed, and the utility of using this principle in the context of modular application architecture, data federation, and data integration in projects like BiCIKL will be illustrated. HTML XML PDF
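      A brief Python sketch, not DINA's actual schema, of the object-history idea described above follows: each process in an object's history produces a new state, and the history is the ordered chain of those processes. Class and field names are illustrative assumptions.

        from dataclasses import dataclass, field
        from datetime import date

        @dataclass
        class ObjectState:
            description: str            # e.g., "tissue sample stored at -80 C"
            as_of: date

        @dataclass
        class Process:
            kind: str                   # e.g., "collecting", "preparation", "subsampling", "loan"
            performed_on: date
            agent: str
            results_in: ObjectState

        @dataclass
        class ObjectHistory:
            object_id: str
            events: list = field(default_factory=list)

            def add(self, process: Process) -> None:
                self.events.append(process)

            def current_state(self):
                return self.events[-1].results_in if self.events else None

        history = ObjectHistory("specimen-0001")
        history.add(Process("collecting", date(2019, 3, 2), "J. Smith",
                            ObjectState("whole organism preserved in ethanol", date(2019, 3, 2))))
        history.add(Process("subsampling", date(2020, 6, 10), "Lab A",
                            ObjectState("tissue sample taken for DNA extraction", date(2020, 6, 10))))
        print(history.current_state())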
      PubDate: Tue, 14 Sep 2021 15:30:00 +030
       
  • Participative Decision Making and the Sharing of Benefits: Laws, ethics,
           and data protection for building extended global communities

    • Abstract: Biodiversity Information Science and Standards 5: e75168
      DOI : 10.3897/biss.5.75168
      Authors : Jutta Buschbom, Breda Zimkus, Andrew Bentley, Mariko Kageyama, Christopher Lyal, Dirk Neumann, Andra Waagmeester, Alex Hardisty : Transdisciplinary and cross-cultural cooperation and collaboration are needed to build extended, densely interconnected information resources. These are the prerequisites for the successful implementation and execution of, for example, an ambitious monitoring framework accompanying the post-2020 Global Biodiversity Framework (GBF) of the Convention on Biological Diversity (CBD; SCBD 2021). Data infrastructures that meet the requirements and preferences of concerned communities can focus and attract community involvement, thereby promoting participatory decision making and the sharing of benefits. Community acceptance, in turn, drives the development of the data resources and data use. Earlier this year, the alliance for biodiversity knowledge (2021a) conducted forum-based consultations seeking community input on designing the next generation of digital specimen representations and consequently enhanced infrastructures. The multitudes of connections that arise from extending the digital specimen representations through linkages in all “directions” will form a powerful network of information for research and application. Yet, with the power of an extended, accessible data network comes the responsibility to protect sensitive information (e.g., the locations of threatened populations, culturally context-sensitive traditional knowledge, or businesses’ fundamental data and infrastructure assets). In addition, existing legislation regulates access and the fair and equitable sharing of benefits. Current negotiations on ‘Digital Sequence Information’ under the CBD suggest such obligations might increase and become more complex in the context of extensible information networks. For example, in the case of data and resources funded by taxpayers in the EU, such access should follow the general principle of being “as open as possible; as closed as is legally necessary” (cp. EC 2016). At the same time, the international regulations of the CBD Nagoya Protocol (SCBD 2011) need to be taken into account. Summarizing main outcomes from the consultation discussions in the forum thread “Meeting legal/regulatory, ethical and sensitive data obligations” (alliance for biodiversity knowledge 2021b), we propose a framework of ten guidelines and functionalities to achieve community building and drive application: (1) substantially contribute to the conservation and protection of biodiversity (cp. EC 2020); (2) use language that is CBD conformant; (3) show the importance of the digital and extensible specimen infrastructure for the continuing design and implementation of the post-2020 GBF, as well as the mobilisation and aggregation of data for its monitoring elements and indicators; (4) strive to openly publish as much data and metadata as possible online; (5) establish a powerful and well-thought-out layer of user and data access management, ensuring security of ‘sensitive data’; (6) encrypt data and metadata where necessary at the level of an individual specimen or digital object, and provide access via digital cryptographic keys; (7) link obligations, rights and cultural information regarding use to the digital key (e.g. CARE principles (Carroll et al. 2020), Local Contexts labels (Local Contexts 2021), licenses, permits, use and loan agreements, etc.); (8) implement a transactional system that records every transaction; (9) amplify workforce capacity across the digital realm, its work areas and workflows; and (10) do no harm (EC 2020): reduce the social and ecological footprint of the implementation, aiming for a long-term sustainable infrastructure across its life-cycle, including development, implementation and management stages. Balancing the needs for open access, as well as protection, accountability and sustainability, the framework is designed to function as a robust interface between the (research) infrastructure implementing the extensible network of digital specimen representations, and the myriad of applications and operations in the real world. With the legal, ethical and data protection layers of the framework in place, the infrastructure will provide legal clarity and security for data providers and users, specifically in the context of access and benefit sharing under the CBD and its Nagoya Protocol. Forming layers of protection, the characteristics and functionalities of the framework are envisioned to be flexible and finely grained, adjustable to fulfill the needs and preferences of a wide range of stakeholders and communities, while remaining focused on the protection and rights of the natural world. Respecting different value systems and national policies, the framework is expected to allow a divergence of views to coexist and balance differing interests. Thus, the infrastructure of the digital extensible specimen network is fair and equitable to many providers and users. This foundation has the capacity and potential to bring together the diverse global communities using, managing and protecting biodiversity. HTML XML PDF
      PubDate: Tue, 14 Sep 2021 15:30:00 +030
       
  • Practical Considerations for Implementing Species Distribution Essential
           Biodiversity Variables

    • Abstract: Biodiversity Information Science and Standards 5: e75156
      DOI : 10.3897/biss.5.75156
      Authors : Robin Boyd, Nick Isaac, Robert Cooke, Francesca Mancini, Tom August, Gary Powney, Mark Logie, David Roy : Species Distribution Essential Biodiversity Variables (SD EBVs; Pereira et al. 2013, Kissling et al. 2017, Jetz et al. 2019) are defined as measurements or estimates of species’ occupancy along the axes of space, time and taxonomy. In the “ideal” case, additional stipulations have been proposed: occupancy should be characterized contiguously along each axis at grain sizes relevant to policy and process (i.e., fine scale); and the SD EBV should be global in extent, or at least span the entirety of the focal taxa’s geographical range (Jetz et al. 2019). These stipulations set the bar very high and, unsurprisingly, most operational SD EBVs fall short of these ideal criteria. In this presentation, I will discuss the major challenges associated with developing the idealized SD EBV. I will demonstrate these challenges using an operational SD EBV spanning ~6000 species in the United Kingdom (UK) over the period 1970 to 2019 as a case study (Outhwaite et al. 2019). In short, this data product comprises annual estimates of occupancy for each species in all sampled 1 km cells across the UK; these are derived from opportunistically-collected species occurrence data using occupancy-detection models (Kéry et al. 2010). Having discussed which of the “ideal” criteria the case study satisfies, I will then touch on what are, in my view, two underappreciated challenges when constructing SD EBVs: dealing with sampling biases in the underlying data and the difficulty in evaluating the extent to which they bias the final product. These challenges should be addressed as a matter of urgency, as SD EBVs are increasingly applied in important settings such as underpinning national and international biodiversity indicators (see e.g., https://geobon.org/ebvs/indicators/). HTML XML PDF
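      The simplified Python sketch below builds one slice of such a data product: naive annual occupancy per species, computed as the proportion of sampled 1 km grid cells with at least one detection. It deliberately ignores imperfect detection, which the occupancy-detection models cited above are designed to handle, and only illustrates the space/time/taxonomy structure; the records are invented examples.

        from collections import defaultdict

        records = [  # opportunistic occurrence records: (species, year, 1 km grid cell)
            ("Bombus terrestris", 2018, "SP5010"),
            ("Bombus terrestris", 2018, "SP5011"),
            ("Bombus lucorum",    2018, "SP5010"),
            ("Bombus terrestris", 2019, "SP5010"),
        ]

        def naive_occupancy(records):
            """Proportion of cells sampled in a year in which each species was detected."""
            sampled = defaultdict(set)     # year -> cells with any record at all
            detected = defaultdict(set)    # (species, year) -> cells with a detection
            for species, year, cell in records:
                sampled[year].add(cell)
                detected[(species, year)].add(cell)
            return {key: len(cells) / len(sampled[key[1]]) for key, cells in detected.items()}

        print(naive_occupancy(records))   # e.g., ('Bombus lucorum', 2018) -> 0.5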
      PubDate: Tue, 14 Sep 2021 15:30:00 +030
       
  • Biodiversity Literature Repository: Building the customized FAIR
           repository by using custom metadata

    • Abstract: Biodiversity Information Science and Standards 5: e75147
      DOI : 10.3897/biss.5.75147
      Authors : Alexandros Ioannidis-Pantopikos, Donat Agosti : In the landscape of general-purpose repositories, Zenodo was built at the European Laboratory for Particle Physics' (CERN) data center to facilitate the sharing and preservation of the long tail of research across all disciplines and scientific domains. Given Zenodo’s long tradition of making research artifacts FAIR (Findable, Accessible, Interoperable, and Reusable), there are still challenges in applying these principles effectively when serving the needs of specific research domains. Plazi’s biodiversity taxonomic literature processing pipeline liberates data from publications, making it FAIR via extensive metadata, the minting of a DataCite Digital Object Identifier (DOI), a licence, and both human- and machine-readable output provided by Zenodo, and accessible via the Biodiversity Literature Repository community at Zenodo. The deposits (e.g., taxonomic treatments, figures) are an example of how local networks of information can be formally linked to explicit resources in a broader context of other platforms like GBIF (Global Biodiversity Information Facility). In the context of biodiversity taxonomic literature data workflows, a general-purpose repository’s traditional submission approach is not enough to preserve rich metadata and to capture highly interlinked objects, such as taxonomic treatments and digital specimens. As a prerequisite to serve these use cases and ensure that the artifacts remain FAIR, Zenodo introduced the concept of custom metadata, which allows enhancing submissions such as figures or taxonomic treatments (see as an example the treatment of Eurygyrus peloponnesius) with custom keywords, based on terms from common biodiversity vocabularies like Darwin Core and Audubon Core and with an explicit link to the respective vocabulary term. The aforementioned pipelines and features are designed to be served first and foremost using public Representational State Transfer Application Programming Interfaces (REST APIs) and open web technologies like webhooks. This approach allows researchers and platforms to integrate existing and new automated workflows into Zenodo and thus empowers research communities to create self-sustained cross-platform ecosystems. The BiCIKL project (Biodiversity Community Integrated Knowledge Library) exemplifies how repositories and tools can become building blocks for broader adoption of the FAIR principles. Starting with the above literature processing pipeline, the underlying concepts and the resulting FAIR data, with a focus on the custom metadata used to enhance the deposits, will be explained. HTML XML PDF
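      The hedged Python sketch below outlines creating a deposition and attaching biodiversity keywords through Zenodo's REST deposition API. The deposition endpoints are documented parts of the Zenodo API, but the publication type and the exact shape of the "custom" metadata block with Darwin Core-style keys are assumptions to verify against the current documentation; the access token and creator are placeholders.

        import requests

        ZENODO_API = "https://zenodo.org/api/deposit/depositions"
        ACCESS_TOKEN = "replace-with-a-personal-access-token"   # placeholder

        metadata = {
            "metadata": {
                "title": "Treatment of Eurygyrus peloponnesius",
                "upload_type": "publication",
                "publication_type": "taxonomictreatment",        # assumption: check allowed values
                "description": "Taxonomic treatment extracted from the original publication.",
                "creators": [{"name": "Example, Author"}],
                "keywords": ["Biodiversity", "Taxonomy"],
                "custom": {"dwc:genus": ["Eurygyrus"], "dwc:kingdom": ["Animalia"]},   # assumed field names
            }
        }

        def create_deposit() -> dict:
            """Create an empty deposition, then attach the metadata shown above."""
            created = requests.post(ZENODO_API, params={"access_token": ACCESS_TOKEN}, json={}, timeout=30)
            created.raise_for_status()
            deposit = created.json()
            updated = requests.put(deposit["links"]["self"], params={"access_token": ACCESS_TOKEN},
                                   json=metadata, timeout=30)
            updated.raise_for_status()
            return updated.json()   # files would be uploaded and the record published in later calls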
      PubDate: Tue, 14 Sep 2021 15:30:00 +030
       
  • Classification of Biological Interactions: Challenges in the field and in
           analysis

    • Abstract: Biodiversity Information Science and Standards 5: e74375
      DOI : 10.3897/biss.5.74375
      Authors : Rafael Pinheiro, Leonardo Jorge, Thomas Lewinsohn : Within biological communities, species interact in a wide variety of ways. Species interactions have always been noted and classified by naturalists in describing living organisms and their ways. Moreover, they are essential to characterize ecological communities as functioning entities. Biodiversity databases, as a rule, consist of species records from certain localities and times. Many, if not most, originated as databases of museum specimens and/or published records. As such, they provide data on species occurrences and distribution, with little functional information. Currently, online databases for species interaction data are being formed or proposed. Usually, these databases set out to compile data from actual field studies, and their design reflects the singularities of the particular studies that seed their development. In two online databases, the Web of Life (2021) and the Interaction Web DataBase (IWDB, 2020), the categories of interactions are quite heterogeneous (Table 1). For instance, they may refer explicitly to certain taxonomic groups (e.g., anemone-fish), or do so implicitly (host-parasitoid; parasitoids are all holometabolous insects with arthropod hosts); conversely, they may encompass almost any taxon (food webs). In another example, the Global Biotic Interactions database (GloBI; Poelen et al. 2014) offers a choice of relational attributes when entering data, ranging from undefined to quite restricted (Table 2). Here we intend to contribute to the development of interaction databases from two different points of view. First, what categories can be effectively applied to field observations of biotic interactions? Second, what theoretical and applied questions do we expect to address with interaction databases? These should be equally applicable to comparisons of studies of the same kind or mode of interaction, and to contrasts between interactions in multimodal studies. HTML XML PDF
      PubDate: Tue, 14 Sep 2021 15:30:00 +0300
       
  • Data Standards for the Phenology of Plant Specimens

    • Abstract: Biodiversity Information Science and Standards 5: e74372
      DOI : 10.3897/biss.5.74372
      Authors : Katelin Pearson, Libby Ellwood, Edward Gilbert, Rob Guralnick, James Macklin, Gil Nelson, Patrick Sweeney, Brian Stucky, John Wieczorek, Jenn Yost : Phenological data (i.e., data on growth and reproductive events of organisms) are increasingly being used to study the effects of climate change, and biodiversity specimens have emerged as important sources of phenological data. However, phenological data are not expressly treated by the Darwin Core standard (Wieczorek et al. 2012), and specimen-based phenological data have been codified and stored in various Darwin Core fields using different vocabularies, making phenological data difficult to access, aggregate, and therefore analyze at scale across data sources. The California Phenology Network, an herbarium digitization collaboration launched in 2018, has harvested phenological data from over 1.4 million angiosperm specimens from California herbaria (Yost et al. 2020). We developed interim standards by which to score and store these data, but further development is needed for adoption of ideal phenological data standards into the Darwin Core. To this end, we are forming a Plant Specimen Phenology Task Group to develop a phenology extension for the Darwin Core standard. We will create fields into which phenological data can be entered and recommend a standardized vocabulary for use in these fields using the Plant Phenology Ontology (Stucky et al. 2018, Brenskelle et al. 2019). We invite all interested parties to become part of this Task Group and thereby contribute to the accessibility and use of these valuable data. In this talk, we will describe the need for plant phenological data standards, discuss current challenges to developing such standards, and outline the next steps of the Task Group toward providing this valuable resource to the data user community. HTML XML PDF
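      As an illustration of the interim situation described above, the sketch below shows how a specimen's phenological state is commonly captured today with existing Darwin Core terms, pending the Task Group's extension. The field names reproductiveCondition and dynamicProperties are real Darwin Core terms; the scoring values and the ontology IRI are placeholders, not ratified extension terms.

      # Hypothetical occurrence record using current Darwin Core terms for phenology
      occurrence = {
          "occurrenceID": "urn:catalog:EXAMPLE:12345",            # placeholder identifier
          "scientificName": "Eschscholzia californica",
          # Free-text phenology, as often found in legacy records:
          "reproductiveCondition": "flowering",
          # Semi-structured phenology, one interim pattern for richer scores:
          "dynamicProperties": '{"phenologyScore": "open flowers present", '
                               '"phenologyOntologyTerm": "http://purl.obolibrary.org/obo/PPO_EXAMPLE"}',
      }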
      PubDate: Tue, 14 Sep 2021 15:30:00 +0300
       
  • The Paleo Data Working Group: A model for developing and sustaining a
           community of practice

    • Abstract: Biodiversity Information Science and Standards 5: e74370
      DOI : 10.3897/biss.5.74370
      Authors : Erica Krimmel, Talia Karim, Holly Little, Lindsay Walker, Roger Burkhalter, Christina Byrd, Amanda Millhouse, Jessica Utrup : The Paleo Data Working Group was launched in May 2020 as a driving force for broader conversations about paleontologic data standards. Here, we present an overview of the “community of practice” model used by this group to evaluate and implement data standards such as those stewarded by Biodiversity Information Standards (TDWG). A community of practice is defined by regular and ongoing interaction among individual members who find enough value in participating that the group achieves a self-sustaining level of activity (Wenger 1998, Wenger and Snyder 2000, Wenger et al. 2002). Communities of practice are not a new phenomenon in biodiversity science, and were recommended by the recent United States National Academies report on biological collections (National Academies of Sciences, Engineering, and Medicine 2020) as a way to support workforce training, data-driven discoveries, and transdisciplinary collaboration. Our collective aim to digitize specimens and mobilize the data presents new opportunities to foster communities of practice that are circumscribed not by research agendas but rather by the need for better data management practices to facilitate research. Paleontology collections professionals in the United States have been meeting to discuss digitization semi-consistently in both virtual and in-person spaces for nearly a decade, largely thanks to support from the iDigBio Paleo Digitization Working Group. The need for a community of practice within this group focused on data management in paleo collections became apparent at the biodiversity_next Conference in October 2019, where we realized that work being done in the biodiversity standards community was not being informed by or filtering back to digitization and data mobilization efforts occurring in the paleo collections community. A virtual workshop focused on georeferencing for paleo collections in April 2020 was conceived as an initial pathway to bridge these two communities and provided a concrete example of how useful it can be to interweave practical digitization experience with conceptual data standards. In May 2020, the Paleo Data Working Group began meeting biweekly on Zoom, with discussion topics collaboratively developed, presented, and discussed by members and supplemented with invited speakers when appropriate. Topics centered on implementation of data standards (e.g., Darwin Core) by collections staff, and how standards can evolve to better represent data. An associated Slack channel facilitated continuing conversations asynchronously. Engaging domain experts (e.g., paleo collections staff) in the conceptualization of information throughout the data lifecycle helped to pinpoint issues and gaps within the existing standards and revealed opportunities for increasing accessibility. Additionally, when domain experts gained a better understanding of the information science framework underlying the data standards, they were better able to apply them to their own data. This critical step of standards implementation at the collections level has often been slow to follow standards development, except in the few collections that have the funds and/or expertise to do so.
Overall, we found the Paleo Data Working Group model of knowledge sharing to be mutually beneficial for standards developers and collections professionals, and it has led to a community of practice where informatics and paleo domain expertise intersect, with a low barrier to entry for new members of both groups. Serving as a loosely organized voice for the needs of the paleo collections community, the Paleo Data Working Group has contributed to several initiatives in the broader biodiversity community. For example, during the 2021 public review of Darwin Core maintenance proposals, the Paleo Data Working Group shared the workload of evaluating and commenting on issues among its members. Not only was this efficient for us, but it was also effective for the TDWG review process, which sought to engage a broad audience while also reaching consensus. The Paleo Data Working Group has also served as a coordinated point of contact for adjacent and intersecting activities related to both data standards (e.g., those led by the TDWG Earth Sciences and Paleobiology Interest Group and the TDWG Collections Description Interest Group) and paleontological research (e.g., those led by the Paleobiology Database and the Integrative Paleobotany Portal project). Sustaining activities like those of the Paleo Data Working Group requires consideration and regular attention. Support staff at iDigBio and collections staff focusing on digitization or data projects at their own institutions, as well as a consistent pool of drop-in and occasional participants, have been instrumental in maintaining momentum for the community of practice. Socializing can also help build the personal relationships necessary for maintaining momentum. To this end, the Paleo Data Working Group Slack encourages friendly banter (e.g., the #pets-of-paleo channel), more general collections-related conversations (e.g., the #physical-space channel), and space for those with sub-interests to connect (e.g., the #morphology channel). While the focus of the group is on data, our members find it useful to network on a wide variety of topics at an individual level, and this usefulness is critical to sustaining the community of practice. As we look forward to Digital Extended Specimen concepts and exciting developments in cyberinfrastructure for biodiversity data, communities of practice like that exemplified by the Paleo Data Working Group are essential for success. Creating FAIR (Findable, Accessible, Interoperable and Reusable) data requires buy-in from data ...
      PubDate: Tue, 14 Sep 2021 15:30:00 +0300
       
  • Discovering Known Biodiversity: Digital accessible knowledge —
           Getting the community involved

    • Abstract: Biodiversity Information Science and Standards 5: e74369
      DOI : 10.3897/biss.5.74369
      Authors : Carolina Sokolowicz, Marcus Guidoti, Donat Agosti : Plazi is a non-profit organization focused on the liberation of data from taxonomic publications. In line with Plazi’s goal of promoting the accessibility of taxonomic data, our team has developed different ways of getting the outside community involved. The Plazi community on GitHub encourages the scientific community and other contributors to post GGI-related (Golden Gate Imagine document editor) questions, requirements, ideas, and/or suggestions, including bug reports and feature requests. One can contact us via this GitHub community by creating either an Issue (to report problems on our data or related systems) or a Discussion (to post questions, ideas, or suggestions). We use GitHub's built-in label system to actively curate the content posted in this repository in order to facilitate further interaction, including filtering and searching before creating new entries. In the plazi/community repository, there is a Q&A (question & answer) section with selected questions and answers that might help solve the problems encountered. Aiming at increasing external participation in the task of liberating taxonomic data, we are developing training courses with independent learning modules that can be combined in different ways to target different audiences (e.g., undergraduates, researchers, developers) in various formats. This material will include text, screenshots, slides, screencasts, and, eventually, to a minor extent, online teaching. Each topic within a module will have one or more 'inline tests', which will be HTML form-based with hard-coded answers to directly assess progress regarding the subject covered in that particular topic. At the end of each module, we will have a capstone (a form-based test asking questions about the topics covered in the respective module) which the user can access whenever needed. As examples of our independent learning modules we can cite Modules I, II and III and their respective topics. Module I (Biodiversity Taxonomy Basis) includes introductory topics (e.g., Topic I — Why do we classify living things; Topic II — Linnaean binomial; Topic III — How is taxonomic information displayed in the literature) aimed at those who don't have a biology/taxonomy background. Module II (The Plazi way) topics (Topic I — Plazi mission; Topic II — Taxonomic treatments; Topic III — FAIR taxonomic treatments) are designed in a way that course takers can learn about Plazi processes. Module III (The Golden Gate Imagine) includes topics (Topic I — Introduction to GGI; Topic II — Other User Interface-based alternatives to annotate documents) about the document editor for marking up documents in XML. Other modules include subjects such as individual extractions, material and treatment citations, data quality control, and others. On completion of a module, the user will be awarded a certificate. The combination of these certificates will grant badges that will translate into server permissions, allowing the user, for instance, to upload newly liberated taxonomic treatments and edit treatments already in the system. Taxonomic treatments are any piece of information about a given taxon concept that involves, includes, or results from an interpretation of the concept of that taxon. Additionally, the Plazi TreatmentBank APIs (Application Programming Interfaces) are currently being expanded and redesigned, and the documentation for these long-awaited endpoints will be displayed, for the first time, in this talk.
HTML XML PDF
      PubDate: Tue, 14 Sep 2021 15:30:00 +0300
       
  • Connecting West and Central African Herbaria Data: A new Living Atlases
           regional data platform

    • Abstract: Biodiversity Information Science and Standards 5: e74362
      DOI : 10.3897/biss.5.74362
      Authors : Sylvain Morin, Alice Ainsa, Raoufou Radji, Anne-Sophie Archambeau, Hervé Chevillotte, Eric Chenin, Sophie Pamerlon : The label transcription and imaging of specimens in key African herbaria has been ongoing since the early 2000s. Many collections in Benin, Cameroon, Côte d’Ivoire, Gabon, Guinea Conakry, and Togo are now fully transcribed and partially digitized. More than 200 000 transcribed specimens are available, with the following distribution:
      Benin: 45 000
      Cameroon: 70 000
      Côte d’Ivoire: 18 000
      Gabon: 70 000
      Guinea Conakry: 5 000
      Togo: 15 000
      In April 2021, a BID project was started to deliver a regional data platform of West and Central African herbaria. Biodiversity Information for Development (BID) is a multi-year programme funded by the European Union and led by GBIF with the aim of enhancing capacity for effective mobilization and use of biodiversity data in research and policy in the 'ACP' nations of sub-Saharan Africa, the Caribbean and the Pacific. Our project's funding runs from April 2021 to April 2023. At this stage of the project, we are working on defining the information technology (IT) architecture (Fig. 1) and selecting the tools that we will be using to achieve our goals. In the talk, we will present our conclusions through architecture schemas and tool demonstrations. Each of the 6 countries will have its own PostgreSQL database storing its data. They will also have access to the RIHA data management platform (Réseau Informatique des Herbiers d'Afrique / Digital Network of African Herbaria). This is a web application, developed in PHP, allowing full management of the data by herbarium administrators (Fig. 2). An Integrated Publishing Toolkit (IPT) will fetch these herbaria data from the databases, create the Darwin Core archives, and connect these data automatically to gbif.org on a periodic basis (Fig. 3). On the databases, we will use a PostgreSQL view to ease conversion from the RIHA data model to the Darwin Core model. On the IPT, we will create one dataset per country, linked to each PostgreSQL view. The SQL query will be configured to only fetch validated data, depending on the herbarium administrator's validation in the RIHA platform. The automatic and periodic data transmission to gbif.org is a feature available in the IPT, recently improved by the GBIF France team, which contributes to IPT development. Another part of the automatic data workflow will be to feed a Living Atlases portal for the West and Central African herbaria. This web application will allow public users to search, display and download herbaria data from West and Central Africa (Fig. 4). Internally, this Living Atlases application will reuse open source modules developed by the Atlas of Living Australia (ALA). The application is mainly written in Java, uses JQuery/Bootstrap for the interface and relies on SolR and Spark in the backend. It has been developed to be easily reusable, by only modifying configuration and doing web customization (HTML / CSS), hiding most of the backend technological complexity. The automatic data workflow will transfer datasets generated by the IPT, in Darwin Core Archive format, to the Living Atlases portal backend. A technical task orchestrator, yet to be selected, will implement this feature. Living Atlases subportals, limited to the data of one participating country, could easily be set up by leveraging the existing backend resources (Fig. 5). One of the benefits of the Living Atlases portal is that we can easily deploy additional front-end applications with limited data, configured by a filter (here, a filter on the data owner country). Only configuration and web customization (HTML / CSS) are required. All the backend modules, especially the ones storing data, are shared by the multiple front-ends, limiting hardware consumption and data administration effort. The full automation of the workflow will allow this platform to run at a very low maintenance cost for IT administrators. Moreover, adding a new herbarium member from West and Central Africa will be quite easy thanks to the architecture of the Integrated Publishing Toolkit and Living Atlases tools (Fig. 6). HTML XML PDF
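      A minimal sketch of the view-based mapping described above, with hypothetical RIHA table and column names: a PostgreSQL view renames RIHA fields to Darwin Core terms and keeps only records validated by the herbarium administrator, and the IPT dataset is then pointed at this view. The connection string is a placeholder; psycopg2 is used purely for illustration.

      import psycopg2

      DDL = """
      CREATE OR REPLACE VIEW dwc_occurrence AS
      SELECT s.barcode           AS "occurrenceID",
             s.full_name         AS "scientificName",
             s.collect_date      AS "eventDate",
             s.country_code      AS "countryCode",
             'PreservedSpecimen' AS "basisOfRecord"
      FROM   riha_specimen s
      WHERE  s.validated = TRUE;   -- only administrator-validated records are published
      """

      with psycopg2.connect("dbname=riha_togo user=ipt") as conn:   # placeholder DSN
          with conn.cursor() as cur:
              cur.execute(DDL)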
      PubDate: Mon, 13 Sep 2021 16:00:00 +0300
       
  • BioBlitz is More than a Bit of Fun

    • Abstract: Biodiversity Information Science and Standards 5: e74361
      DOI : 10.3897/biss.5.74361
      Authors : Sofie Meeus, Iolanda Silva-Rocha, Tim Adriaens, Peter Brown, Niki Chartosia, Bernat Claramunt-López, Angeliki Martinou, Michael Pocock, Cristina Preda, Helen Roy, Elena Tricarico, Quentin Groom : Emerging in the 1990s, bioblitzes have become flagship events for biodiversity assessments. Although the format varies, a bioblitz is generally an intensive, short-term survey in a specific area. Bioblitzes collect biodiversity data and can therefore play a role in research, discovery of new species at a site, and monitoring. They may also promote public engagement, community building, and education and outreach. However, the question remains: how effective are bioblitzes at achieving these goals? To evaluate the value of bioblitzes for these multiple goals, we conducted two meta-analyses, one on sixty published bioblitzes and the other on 1860 bioblitzes conducted using iNaturalist. Furthermore, we made an in-depth analysis of the data collected during a bioblitz we organized ourselves. From these analyses we found bioblitzes are effective at gathering data—collecting on average more than 300 species records—despite limitations of bias, which many types of biodiversity surveys suffer from, such as preferences for charismatic taxa and uneven sampling effort in time and space. However, because the survey intensity, duration and extent are more controlled, a bioblitz is more repeatable than some other forms of survey. We also found that bioblitzes were highly effective at engaging people in sustained recording activity after the event. A bioblitz may therefore act as a trigger for participation in biological recording, which is supported by the use of technology, particularly smartphone apps. Another important aspect is the involvement of both citizen scientists and professional biologists, creating learning opportunities in both directions. Indeed, it was clear that many bioblitzes acted as brokerage events between individuals and organizations, and between professionals who work in biodiversity research and conservation. Such community building is important for communication and building trust between organizations and citizens, to the benefit of biodiversity research and conservation. From the impartial perspective of hypothesis-driven science, bioblitzes may seem like a lot of work with limited scientific gain. However, this largely overlooks how important people, communities and their organizations are in gathering data, and in conserving biodiversity. HTML XML PDF
      PubDate: Mon, 13 Sep 2021 16:00:00 +0300
       
  • Research Infrastructure Contact Zones: A method to visualise and align
           the activities of major biodiversity informatics initiatives

    • Abstract: Biodiversity Information Science and Standards 5: e74359
      DOI : 10.3897/biss.5.74359
      Authors : Vincent Smith, Aino Juslén, Ana Casino, Francois Dusoulier, Lisa French, Dimitrios Koureas, Patricia Mergen, Joe Miller, Leif Schulman, Matt Woodburn : In an effort to characterise the various dimensions of activity within the biodiversity informatics landscape, we developed a framework to survey these dimensions for ten major organisations*1 relative to both their current activities and long-term strategic ambitions. This survey assessed the contact between these infrastructure organisations by capturing the breadth of activities for each infrastructure across five categories (data, standards, software, hardware and policy), for nine types of data (specimens, collection descriptions, opportunistic observations, systematic observations, taxonomies, traits, geological data, molecular data, and literature), and for seven phases of activity (creation, aggregation, access, annotation, interlinkage, analysis, and synthesis). This generated a dataset of 6,300 verified observations, which have been scored and validated by leading members of each infrastructure organisation. In this analysis of the resulting data, we address a set of high-level questions about the overall biodiversity informatics landscape, looking at the greatest gaps, overlap and possible rate-limiting steps. Across the infrastructure organisations, we also explore how far each has progressed towards its ambitions and the extent of its niche relative to other organisations. Our results show that, when viewed by scope, most infrastructures occupy a relatively narrow niche in the overall landscape of activity, with the notable exception of the Global Biodiversity Information Facility (GBIF) and possibly LifeWatch. Niches associated with molecular data and biological taxonomy are very well filled, suggesting there is still considerable room for growth in other areas. The Distributed System of Scientific Collections (DiSSCo) and the Integrated European Long-Term Ecosystem Research Infrastructure (eLTER RI) show the highest levels of difference between their current activities and stated ambitions, potentially reflecting the relative youth of these organisations. iNaturalist, the Biodiversity Heritage Library and Catalogue of Life all occupy narrow and tightly circumscribed niches. These organisations are also amongst the closest to achieving their stated ambitions within their respective areas of activity. The largest gaps in infrastructure activity relate to the development of hardware and standards, with many gaps set to be addressed if the stated ambitions of those surveyed come to fruition. Nevertheless, some gaps persist, outlining a potential role for this survey as a planning tool to help coordinate and align investment in future biodiversity informatics activities. GBIF and LifeWatch are the two infrastructures where there is the most similarity in ambition with DiSSCo, with the greatest overlap concentrated on activities related to data/content, specimen data and their shared ambition to interlink information. While overlap appears intense, the analysis is limited by the resolution of the survey framework and ignores existing collaborations between infrastructures. In addition to presenting the results of this survey, we outline our plans to publish this work and a proposal to develop the methodology as an interactive web-based tool.
This would allow other projects and infrastructures to self-score their activities and visualise their niche within the current landscape, encouraging better global alignment of activities. For example, our results should make it easier for initiatives to strengthen collaboration and differentiate work when their activities overlap. Likewise, this approach would be useful for funding agencies when targeting gaps in the informatics landscape or increasing the technical maturity of certain critical activities, e.g., to improve immature data standards. While no framework is perfect, we hope to encourage a dialogue on the potential of taking an algorithmic approach to community alignment, and see this as a means of strengthening cooperation when addressing problems that require global coordination. HTML XML PDF
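      As a consistency check on the dataset size quoted above, the survey design implies one observation per combination of category, data type, activity phase and organisation, scored separately for current activity and for ambition (an assumption consistent with the figures given):

      $5 \times 9 \times 7 \times 10 \times 2 = 6300$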
      PubDate: Mon, 13 Sep 2021 16:00:00 +0300
       
  • GRSciColl: Registry of Scientific Collections maintained by the community
           for the community

    • Abstract: Biodiversity Information Science and Standards 5: e74354
      DOI : 10.3897/biss.5.74354
      Authors : Marie Grosjean, Morten Høfft, Marcos Gonzalez, Tim Robertson, Andrea Hahn : GRSciColl, the Registry of Scientific Collections, is a comprehensive, community-curated clearinghouse of collections information originally developed by the Consortium for the Barcode of Life (CBOL) and hosted by the Smithsonian Institution until 2019. It is now hosted and maintained in the Global Biodiversity Information Facility (GBIF) registry. GRSciColl aims to improve access to information about institutions and the scientific collections they hold, and to facilitate contact with the staff members who manage them. Anyone can use GRSciColl to search for collections based on their attributes (country, preservation type, etc.) as well as their codes and identifiers. These users will find information on what the collections contain, where they are located, who manages them and how to get in contact. Furthermore, institutions can use GRSciColl to be more visible and advertise their collections, both digitized and undigitized. In addition, the ability to get an overview of institutions and collections by country can help guide data mobilisation efforts by national organizations. Finally, GRSciColl is a reference for institution and collection codes and identifiers, which can enable links from other systems (as exemplified in GBIF.org) and make the information more easily available. Engaging the community is crucial in maintaining that information. After the migration to GBIF, the first phase of development focused on data consolidation, integration with external systems, and on providing the necessary functionality and safeguards to move from a centrally maintained to a community-curated system. With all these in place, the focus is now shifting to expanding the community of data editors and to understanding how best to serve user needs. It can be difficult for institutions to maintain information in the various available data repositories. This is why we aim to synchronize GRSciColl with as many reliable sources as possible. In 2020, we set up weekly synchronization with Index Herbariorum, and we will be exploring synchronization with other sources such as the Consortium of European Taxonomic Facilities (CETAF) registry and the National Center for Biotechnology Information (NCBI) BioCollections database. In addition, we worked with the team at Integrated Digitized Biocollections (iDigBio) to import their collection information into GRSciColl. The data are now maintained in the GBIF registry and displayed on the iDigBio portal via the GRSciColl Application Programming Interface (API). GRSciColl's new permission model aims to facilitate community curation. Anyone can suggest updates, and those changes can be applied or discarded by the appropriate reviewers: institution editors, country mediators, or administrators. With these changes in place, in 2021, we reached out to the GBIF Network to increase our pool of editors. Many GBIF Node managers are now involved in the curation of GRSciColl, and we are planning to likewise include applicants for the GBIF-managed funding programs such as “Biodiversity Information for Development” (BID) and “Biodiversity Information Fund for Asia” (BIFA).
We also work with external collaborators, such as the Biodiversity Crisis Response Committee of the Society for the Preservation of Natural History Collections (SPNHC), to reach outside of the GBIF community. Alongside the support for data integration and curation, a second important aspect is the support for data use. The information available needs to be both accessible and relevant to the community. Specimen-related occurrences published on GBIF.org are cross-linked to GRSciColl entries whenever possible. As these links make use of collection and institution identifiers within individual specimen records, rather than relying on dataset entities, this procedure allows aggregation of specimen-related occurrences under their GRSciColl-registered collections and institutions, regardless of the way they were published on GBIF. This can help users and institutions get an overview of collection digitization progress, whether from their own initiatives or from datasets contributed by other data publishers. The Collections API is under ongoing development to provide better ways to access the GRSciColl information: more filters, a way to download the results of a search, and an API (Application Programming Interface) lookup service to find institutions and collections associated with a given code or identifier. The latter was designed to improve database interoperability. By working together with the community, we want to ensure GRSciColl becomes and remains a tool the community can rely on. There are many ways to get involved with GRSciColl:
      - Anyone can check their institution and collection entries and suggest updates or additions via the suggestion buttons in the GRSciColl interface.
      - You can become a registry editor on behalf of your institution or collection.
      - If you work with a national registry and are interested in sharing the data on GRSciColl, please contact us at scientific-collections@gbif.org.
      - Tell us how you would like to use the registry and GRSciColl, either by email (scientific-collections@gbif.org) or via our GitHub repository.
      - You can become a volunteer translator to make the GRSciColl forms accessible in more languages.
      - You can follow our 2021 Roadmap and log your feedback and ideas via the GBIF feedback system or directly on our GitHub repository.
      HTML XML PDF
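      The registry search and the code/identifier lookup service mentioned above are exposed through the public GBIF API. The sketch below illustrates read-only use; the endpoint paths and parameter names follow the GBIF API documentation at the time of writing and should be verified before relying on them, and the example codes are arbitrary.

      import requests

      BASE = "https://api.gbif.org/v1/grscicoll"

      # Free-text search for institutions
      institutions = requests.get(f"{BASE}/institution",
                                  params={"q": "herbarium", "limit": 5}).json()
      for inst in institutions.get("results", []):
          print(inst.get("code"), "-", inst.get("name"))

      # Lookup service: resolve an institution/collection code pair to GRSciColl entries
      match = requests.get(f"{BASE}/lookup",
                           params={"institutionCode": "K", "collectionCode": "K"}).json()
      print(match)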
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • A Nano(publication) Approach Towards Big Data in Biodiversity

    • Abstract: Biodiversity Information Science and Standards 5: e74351
      DOI : 10.3897/biss.5.74351
      Authors : Mariya Dimitrova, Teodor Georgiev, Lyubomir Penev : One of the major challenges in biodiversity informatics is the generation of machine-readable data that are interoperable between different biodiversity-related data infrastructures. Producers of such data have to comply with existing standards and to be resourceful enough to enable efficient data generation, management and availability. Nanopublications offer a decentralised approach (Kuhn et al. 2016) towards achieving data interoperability in a robust and standardised way. A nanopublication is a named RDF graph, which serves to communicate a single fact and its original source (provenance) through the use of identifiers and linked data (Groth et al. 2010). It is composed of three constituent graphs (assertion, provenance, and publication info), which are linked to one another in the nanopublication header (Kuhn et al. 2016). For instance, a nanopublication has been published to assert a species interaction in which a hairy woodpecker (Picoides villosus) ate a beetle (genus Ips), along with the license and related bibliographic citation*1. In biodiversity, nanopublications can be used to exchange information between infrastructures in a standardised way (Fig. 1) and to enable curation and correction of knowledge. They can be implemented within different workflows to formalise biodiversity knowledge in self-contained graphs. We have developed several nanopublication models*2 for different biodiversity use cases: species occurrences, new species descriptions, biotic interactions, and links between taxonomy, sequences and institutions. Nanopublications can be generated by various means:
      - semi-automatic extraction from published literature, with subsequent human curation and publication;
      - generation during the publication process by the authors, via a dedicated formalisation tool, and publication together with the article;
      - de novo generation through decentralised networks such as Nanobench (Kuhn et al. 2021).
      One of the possible uses of nanopublications in biodiversity is communicating new information in a standardised way so that it can be accessed and interpreted by multiple infrastructures that have a common agreement on how information is expressed through the use of particular ontologies, vocabularies and sets of identifiers. In addition, we envision nanopublications to be useful for curation or peer review of published knowledge by enabling any researcher to publish a nanopublication containing a comment on an assertion made in a previously published nanopublication. With this talk, we aim to showcase several nanopublication formats for biodiversity and to discuss the possible applications of nanopublications in the biodiversity domain. HTML XML PDF
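      To make the three-graph structure concrete, the sketch below builds a toy nanopublication (head, assertion, provenance, publication info) with rdflib and serializes it as TriG. It is an illustration only: the example.org IRIs, the "eats" predicate and the DOI are placeholders, whereas a real nanopublication would use resolvable identifiers and ontology terms.

      from rdflib import Dataset, Literal, Namespace, URIRef
      from rdflib.namespace import DCTERMS, PROV, RDF, XSD

      NP = Namespace("http://www.nanopub.org/nschema#")
      EX = Namespace("http://example.org/np1#")          # placeholder base IRI

      ds = Dataset()
      head, assertion, provenance, pubinfo = (ds.graph(EX[g]) for g in
                                              ("head", "assertion", "provenance", "pubinfo"))

      # Head graph: links the nanopublication to its three constituent graphs
      head.add((EX.np, RDF.type, NP.Nanopublication))
      head.add((EX.np, NP.hasAssertion, EX.assertion))
      head.add((EX.np, NP.hasProvenance, EX.provenance))
      head.add((EX.np, NP.hasPublicationInfo, EX.pubinfo))

      # Assertion graph: the single fact being communicated (placeholder predicate)
      assertion.add((EX.woodpeckerOccurrence, EX.eats, EX.beetleOccurrence))

      # Provenance graph: where the assertion comes from (placeholder DOI)
      provenance.add((EX.assertion, PROV.wasDerivedFrom, URIRef("https://doi.org/10.3897/EXAMPLE")))

      # Publication info graph: licence and creation date of the nanopublication itself
      pubinfo.add((EX.np, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
      pubinfo.add((EX.np, DCTERMS.created, Literal("2021-09-13", datatype=XSD.date)))

      print(ds.serialize(format="trig"))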
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Implementing GBIF Pipelines in the Atlas of Living Australia: The first
           step towards alignment and further collaboration

    • Abstract: Biodiversity Information Science and Standards 5: e74335
      DOI : 10.3897/biss.5.74335
      Authors : Javier Molina, Peggy Newman, David Martin, Vicente Ruiz Jurado : The Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA) are two leading infrastructures serving the biodiversity community. In 2020, the ALA’s occurrence records management systems reached end of life after more than 10 years of operation, and the ALA embarked on a project to replace them. Significant overlap exists in the function of the ALA and GBIF data ingestion pipeline systems. Instead of the ALA developing new systems from scratch, we initiated a project to better align the two infrastructures. The collaboration brings benefits such as the improved reuse of modules and an overall reduction in development and operation costs. The ALA recently replaced its occurrence ingestion system with GBIF pipelines infrastructure and shared code. This is the first milestone of the ALA's broader Core Infrastructure Project; among its benefits is a more reliable, performant and scalable system, demonstrated by the ability to ingest more and larger datasets while reducing infrastructure operational costs by more than 40% compared to the previous system. The new system is a key building block for an improved ingestion framework that is being developed within the ALA. The collaboration between the ALA and GBIF development teams will result in more consistent outputs from their respective processing pipelines. It will also allow the broader collective expertise of both infrastructure communities to inform future development and direction. The ALA’s adoption of GBIF pipelines will pave the way for the Living Atlases community to adopt GBIF systems and also contribute to them. In this talk, we will introduce the project, share insights on how the GBIF and ALA teams worked together, and delve into the details of the technical implementation and its benefits. HTML XML PDF
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Expert Group on Antarctic Biodiversity Informatics: Coordinating the
           state-of-the-art internationally for biodiversity informatics in
           Antarctica

    • Abstract: Biodiversity Information Science and Standards 5: e74332
      DOI : 10.3897/biss.5.74332
      Authors : Thomas Chen : Biodiversity informatics has emerged as a key asset in wildlife and ecological conservation around the world. This is especially true in Antarctica, where climate change continues to threaten marine and terrestrial species. It is well documented that the polar regions experience the most drastic rate of climate change compared to the rest of the world (IPCC 2021). Research approaches within the scope of polar biodiversity informatics consist of computational architectures and systems, analysis and modelling methods, and human-computer interfaces, ranging from more traditional statistical techniques to more recent machine learning and artificial intelligence-based imaging techniques. Ongoing discussions include making datasets findable, accessible, interoperable and reusable (FAIR) (Wilkinson et al. 2016). The deployment of biodiversity informatics systems and coordination of standards around their utilization in the Antarctic are important areas of consideration. To bring together scientists and practitioners working at the nexus of informatics and Antarctic biodiversity, the Expert Group on Antarctic Biodiversity Informatics (EG-ABI) was formed under the Scientific Committee on Antarctic Research (SCAR). EG-ABI was created during the SCAR Life Sciences Standing Scientific Group meeting at the SCAR Open Science Conference in Portland, Oregon, in July 2012, to advance work at this intersection by coordinating and participating in a range of projects across the SCAR biodiversity science portfolio. SCAR, meanwhile, is a thematic organisation of the International Science Council (ISC) and the primary entity tasked with coordinating high-quality scientific research on all aspects of Antarctic sciences and humanities, including the Southern Ocean and the interplay between Antarctica and the other six continents. The expert group is led by an international steering committee of roughly ten members, who take an active role in leading related initiatives. Currently, researchers from Australia, Belgium, the United Kingdom, Chile, Germany, France, and the United States are represented on the committee. The current steering committee comprises a diverse range of scientists, including early-career researchers and scientists whose primary focuses span both the computational and the ecological aspects of Antarctic biodiversity informatics. Current projects being coordinated or co-coordinated by EG-ABI include the SCAR/rOpenSci initiative, a collaboration with the rOpenSci community to improve resources for users of the R software package in Antarctic and Southern Ocean science. Additionally, EG-ABI has contributed to the POLA3R project (Polar Omics Linkages Antarctic Arctic and Alpine Regions), an information system dedicated to aiding the access and discovery of molecular microbial diversity data generated by Antarctic scientists. Furthermore, EG-ABI has provided training and helped collate additional species trait information, such as feeding and diet, development, mobility and importance to society, documented through Vulnerable Marine Ecosystem (VME) indicator taxa, in the Register of Antarctic Species (http://ras.biodiversity.aq/), the comprehensive inventory of Antarctic and Southern Ocean organisms, which is also a component of the World Register of Marine Species (https://marinespecies.org/).
The efforts highlighted are only some of the projects that the expert group has contributed to. In our presentation, we discuss the previous accomplishments of EG-ABI from the perspective of a currently serving steering committee member and outline its current state, including collaborations and coordinated activities. We also highlight opportunities for engagement and the benefits for various stakeholders of interacting with EG-ABI on multiple levels, within the SCAR ecosystem and elsewhere. Developing consistent and practical standards for data use in Antarctic ecology, in addition to fostering interdisciplinary and cross-sectoral collaborations for the successful deployment of conservation mechanisms, is key to a sustainable and biodiverse Antarctica, and EG-ABI is one of the premier organizations working towards these aims. HTML XML PDF
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Hacking Infrastructures Together: Towards better interoperability of
           infrastructures

    • Abstract: Biodiversity Information Science and Standards 5: e74325
      DOI : 10.3897/biss.5.74325
      Authors : Sofie Meeus, Wouter Addink, Donat Agosti, Christos Arvanitidis, Mariya Dimitrova, Juan Miguel González-Aranda, Jörg Holetschek, Sharif Islam, Thomas Jeppesen, Daniel Mietchen, Tim Robertson, Francisco Sanchez Cano, Maarten Trekels, Quentin Groom : The BiCIKL Project was born from a vision that biodiversity data are most useful if they are viewed as a network of data that can be integrated and viewed from different starting points. BiCIKL’s goal is to realize that vision by linking biodiversity data infrastructures, particularly for literature, molecular sequences, specimens, nomenclature and analytics. BiCIKL is an Open Science project creating Open FAIR data and services for the whole research community. BiCIKL intends to inspire novel, innovative research and to build services that can produce new and valuable knowledge, necessary for the protection of biodiversity and of our environment. BiCIKL will develop methods and workflows to harvest, link and access data extracted from literature. Yet, as the project gets underway, we need to better understand the existing infrastructures, their limitations, the nature of the data they hold, the services they provide and, particularly, how they can interoperate. To do this we organised a week-long hackathon where small teams worked on a number of pilot projects (Table 1) that were chosen to test the existing linkages between infrastructures and to extract novel ones. We will present our experience of running a hackathon and our evaluation of how successfully it achieved its aims. We will also give examples of the projects we conducted and how successful they were. Finally, we will give our preliminary evaluation of what we learned about the interoperability of infrastructures and what recommendations we can give to improve their interoperability, whether that is improvements to the data standards used, the means to access and analyse the data, or even the physical bandwidth and computational restrictions that limit the potential for research. HTML XML PDF
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Mobilizing Animal Movement Data: API use and the Movebank platform

    • Abstract: Biodiversity Information Science and Standards 5: e74312
      DOI : 10.3897/biss.5.74312
      Authors : Sarah Davidson, Gil Bohrer, Andrea Kölzsch, Candace Vinciguerra, Roland Kays : Movebank, a global platform for animal tracking and other animal-borne sensor data, is used by over 3,000 researchers globally to harmonize, archive and share nearly 3 billion animal occurrence records and more than 3 billion other animal-borne sensor measurements that document the movements and behavior of over 1,000 species. Movebank’s publicly described data model (Kranstauber et al. 2011), vocabulary and application programming interfaces (APIs) provide services for users to automate data import and retrieval. Near-live data feeds are maintained in cooperation with over 20 manufacturers of animal-borne sensors, who provide data in agreed-upon formats for accurate data import. Data acquisition by API complies with public or controlled-access sharing settings, defined within the database by data owners. The Environmental Data Automated Track Annotation System (EnvDATA, Dodge et al. 2013) allows users to link animal tracking data with hundreds of environmental parameters from remote sensing and weather reanalysis products through the Movebank website, and offers an API for advanced users to automate the submission of annotation requests. Movebank's mobile apps, the Animal Tracker and Animal Tagger, use APIs to support reporting and monitoring while in the field, as well as communication with citizen scientists. The recently launched MoveApps platform connects with Movebank data using an API to allow users to build, execute and share repeatable workflows for data exploration and analysis through a user-friendly interface. A new API, currently under development, will allow calls to retrieve data from Movebank reduced according to criteria defined by "reduction profiles", which can greatly reduce the volume of data transferred for many use cases. In addition to making this core set of Movebank services possible, Movebank's APIs enable the development of external applications, including the widely used R programming packages 'move' (Kranstauber et al. 2012) and 'ctmm' (Calabrese et al. 2016), and user-specific workflows to efficiently execute collaborative analyses and automate tasks such as syncing with local organizational and governmental websites and archives. The APIs also support large-scale data acquisition, including for projects under development to visualize, map and analyze bird migrations led by the British Trust for Ornithology, the coordinating organisation for European bird ringing (banding) schemes (EURING), Georgetown University, National Audubon Society, Smithsonian Institution and United Nations Convention on Migratory Species. Our API development is constrained by a lack of standardization in data reporting across animal-borne sensors and a need to ensure adequate communication with data users (e.g., how to properly interpret data; expectations for use and attribution) and data owners (e.g., who is using publicly available data and how) when allowing automated data access. As interest in data linking, harvesting, mirroring and integration grows, we recognize the need to coordinate API development across animal tracking and biodiversity databases, and to develop a shared system for unique organism identifiers.
Such a system would allow linking of information about individual animals within and across repositories and publications in order to recognize data for the same individuals across platforms, retain provenance and attribution information, and ensure beneficial and biologically meaningful data use. HTML XML PDF
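      A brief sketch of automated retrieval through the public Movebank interface described above. The base URL and parameter names follow the openly published Movebank API documentation at the time of writing and should be verified; the credentials and study identifier are placeholders, and access is subject to each study's sharing settings.

      import csv
      import io
      import requests

      BASE = "https://www.movebank.org/movebank/service/direct-read"
      auth = ("my_movebank_user", "my_password")          # placeholders

      # List studies visible to this account (returned as CSV)
      studies = requests.get(BASE, params={"entity_type": "study"}, auth=auth)
      studies.raise_for_status()
      for row in list(csv.DictReader(io.StringIO(studies.text)))[:5]:
          print(row.get("id"), row.get("name"))

      # Download tracking events for one study (placeholder study id)
      events = requests.get(BASE, params={"entity_type": "event", "study_id": 123456}, auth=auth)
      events.raise_for_status()
      print(events.text[:200])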
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Having Your Cake and Eating It Too: JSON-LD as an RDF serialization format

    • Abstract: Biodiversity Information Science and Standards 5: e74266
      DOI : 10.3897/biss.5.74266
      Authors : Steven J Baskauf : One impediment to the uptake of linked data technology is developers’ unfamiliarity with typical Resource Description Framework (RDF) serializations like Turtle and RDF/XML. JSON for Linking Data (JSON-LD) is designed to bypass this problem by expressing linked data in the well-known JavaScript Object Notation (JSON) format that is popular with developers. JSON-LD is now Google’s preferred format for exposing Schema.org structured data in web pages for search optimization, leading to its widespread use by web developers. Another successful use of JSON-LD is by the International Image Interoperability Framework (IIIF), which limits its use to a narrow design pattern that is readily consumed by a variety of applications. This presentation will show how a similar design pattern has been used in Audubon Core and with Biodiversity Information Standards (TDWG) controlled vocabularies to serialize data in a manner that is easily consumed by conventional applications but can also be seamlessly loaded as RDF into triplestores or other linked data applications. The presentation will also suggest how JSON-LD might be used in other contexts within TDWG vocabularies, including with the Darwin Core Resource Relationship terms. HTML XML PDF
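      The design pattern described above can be illustrated with a tiny example: a plain JSON document that an ordinary application reads as-is, while a linked-data application loads the same bytes as RDF via the @context. The context mapping and identifiers below are placeholders chosen for illustration (the scientificName IRI is a real Darwin Core term); TDWG vocabularies publish their own contexts. The sketch assumes rdflib version 6 or later, which bundles a JSON-LD parser.

      import json
      from rdflib import Graph

      doc = {
          "@context": {
              "name": "http://rs.tdwg.org/dwc/terms/scientificName",
              "creator": "http://purl.org/dc/terms/creator",
              "@base": "http://example.org/"
          },
          "@id": "image/123",
          "name": "Eurygyrus peloponnesius",
          "creator": "Jane Doe"
      }

      # A conventional consumer simply treats it as JSON...
      print(doc["name"])

      # ...while a linked data application parses the identical document as RDF.
      g = Graph().parse(data=json.dumps(doc), format="json-ld")
      for triple in g:
          print(triple)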
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Unlocking Inventory Data Capture, Sharing and Reuse: The Humboldt
           Extension to Darwin Core

    • Abstract: Biodiversity Information Science and Standards 5: e74275
      DOI : 10.3897/biss.5.74275
      Authors : Yanina Sica, Paula Zermoglio : Biodiversity inventories, i.e., recording multiple species at a specific place and time, are routinely performed and offer high-quality data for characterizing biodiversity and its change. Digitization, sharing and reuse of incidental point records (i.e., records that are not readily associated with systematic sampling or monitoring, typically museum specimens and many observations from citizen science projects) have been the focus for many years in the biodiversity data community. Only more recently has attention been directed towards mobilizing data from both new and longstanding inventories and monitoring efforts. These kinds of studies provide very rich data that can enable inferences about species absence, but their reliability depends on the methodology implemented, the survey effort and completeness. The information about these elements has often been regarded as metadata and captured in an unstructured manner, making its full use very challenging. Unlocking and integrating inventory data requires data standards that can facilitate capture and sharing of data with the appropriate depth. The Darwin Core standard (Wieczorek et al. 2012) currently enables reporting some of the information contained in inventories, particularly using Darwin Core Event terms such as samplingProtocol, sampleSizeValue, sampleSizeUnit, and samplingEffort. However, it is limited in its ability to accommodate spatial, temporal, and taxonomic scopes, and other key aspects of the inventory sampling process, such as direct or inferred measures of sampling effort and completeness. The lack of a standardized way to share inventory data has hindered their mobilization, integration, and broad reuse. In an effort to overcome these limitations, a framework was developed to standardize inventory data reporting: Humboldt Core (Guralnick et al. 2018). Humboldt Core identified three types of inventories (single, elementary, and summary inventories) and proposed a series of terms to report their content. These terms were organized in six categories: dataset and identification; geospatial and habitat scope; temporal scope; taxonomic scope; methodology description; and completeness and effort. Although Humboldt Core was originally planned as a new TDWG standard and is currently implemented in Map of Life (https://mol.org/humboldtcore/), ratification was not pursued at the time, limiting broader community adoption. In 2021, the TDWG Humboldt Core Task Group was established to review how best to integrate the terms proposed in the original publication with existing standards and implementation schemas. The first goal of the task group was to determine whether a new, separate standard was needed or whether an extension to Darwin Core could accommodate the terms necessary to describe the relevant information elements. Since the different types of inventories can be thought of as Events with different nesting levels (events within events, e.g., plots within sites), and after an initial mapping to existing Darwin Core terms, it was deemed appropriate to start from a Darwin Core Event Core and build an extension to include Humboldt Core terms. The task group members are currently revising all original Humboldt Core terms, reformulating definitions, comments, and examples, and discarding or adding new terms where needed.
We are also gathering real datasets to test the use of the extension once an initial list of revised terms is ready, before undergoing a public review period as established by the TDWG process. Through the ratification of Humboldt Core as a TDWG extension, we expect to provide the community with a solution to share and use inventory data, which improves biodiversity data discoverability, interoperability and reuse while lowering the reporting burden at different levels (data collection, integration and sharing). HTML XML PDF
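      To make the nesting idea concrete, the sketch below expresses a small inventory as Darwin Core Events, with a plot-level event nested in a site-level event via parentEventID. The Event terms used are existing Darwin Core terms named in the abstract; the two fields flagged as hypothetical only indicate the kind of scope and effort terms the proposed extension would add and are not ratified term names.

      # Site-level sampling event
      site_event = {
          "eventID": "site-01",
          "samplingProtocol": "timed transect walk",
          "samplingEffort": "4 observer-hours",
      }

      # Plot-level sub-event, nested within the site via parentEventID
      plot_event = {
          "eventID": "site-01-plot-A",
          "parentEventID": "site-01",
          "sampleSizeValue": 100,
          "sampleSizeUnit": "square metre",
          # Hypothetical extension-style fields (names are placeholders, not ratified terms):
          "taxonomicScope": "Aves",
          "samplingEffortReported": True,
      }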
      PubDate: Mon, 13 Sep 2021 11:45:00 +0300
       
  • Semantic Search in Legacy Biodiversity Literature: Integrating data from
           different data infrastructures

    • Abstract: Biodiversity Information Science and Standards 5: e74251
      DOI : 10.3897/biss.5.74251
      Authors : Adrian Pachzelt, Gerwin Kasperek, Andy Lücking, Giuseppe Abrami, Christine Driller : Nowadays, obtaining information by entering queries into a web search engine is routine behaviour. With its search portal, the Specialised Information Service Biodiversity Research (BIOfid) adapts the exploration of legacy biodiversity literature and data extraction to current standards (Driller et al. 2020). In this presentation, we introduce the BIOfid search portal and its functionalities in a short how-to guide. To this end, we adapted a knowledge graph representation of our thematic focus: Central European, primarily German-language, biodiversity literature of the 19th and 20th centuries. Users can now search our text-mined corpus, which to date contains more than 8,700 full-text articles from 68 journals and focuses particularly on birds, lepidopterans and vascular plants. The texts are automatically preprocessed by the Natural Language Processing provider TextImager (Hemati et al. 2016) and will be linked to various databases such as Wikidata, Wikipedia, the Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EoL), Geonames, the Integrated Authority File (GND) and WordNet. For data retrieval, users can filter search results and download the article metadata as well as text annotations and database links in JavaScript Object Notation (JSON) format. For example, literature that mentions taxa from certain decades or co-occurrences of species can be searched. Our search engine recognises scientific and vernacular taxon names based on the GBIF Backbone Taxonomy and offers search suggestions to support the user. The semantic network of the BIOfid search portal is also enriched with data from the EoL trait bank, so that trait data can be included in search queries. Thus, scientists can enhance their own data sets with the search results and feed them into the relevant biodiversity data repositories to sustainably expand the corresponding knowledge graphs with reliable data. Since BIOfid applies standard ontology terms, all data mobilized from literature can be combined with data on natural history collection objects or data from current research projects in order to generate more comprehensive knowledge. Furthermore, taxonomy, ecology and trait ontologies that have been built or extended within this project will be made available through appropriate platforms such as The Open Biological and Biomedical Ontology (OBO) Foundry and the Terminology Service of the German Federation for Biological Data (GFBio). HTML XML PDF
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • The 8 Years of Existence of Xper3: State of the art and future
           developments of the platform

    • Abstract: Biodiversity Information Science and Standards 5: e74250
      DOI : 10.3897/biss.5.74250
      Authors : Adeline Kerner, Sylvain Bouquin, Rémy Portier, Régine Vignes Lebbe : The Xper3 platform was launched in November 2013 (Saucède et al. 2020). Xper3 is a free web platform that manages descriptive data and provides interactive identification keys. It is a follow-up to Xper (Forget et al. 1986) and Xper2 (Ung et al. 2010). Xper3 is used via web browsers. It offers a collaborative, multi-user interface without local installation. It is compatible with TDWG’s Structured Descriptive Data (SDD) format. Xper3 and its previous version, Xper2, have already been used for various taxonomic groups. In June 2021, 4743 users had created accounts and edited 5756 knowledge bases. Each knowledge base is autonomous and can be published as a free access key link, as a data paper, in publications or on websites. The risk of this autonomy, and of the lack of visibility of already existing knowledge bases, is duplicated content and overlapping effort. Increasingly, users have asked for a public overview of the existing content. A first version of a search tool is now available online. Explorer lists the databases whose creators have filled in the extended metadata and have accepted the referencing. The user can search by language, taxonomic group, fossil or current, geography, habitat, and key words. New developments of Xper3 are in progress. Some have a first version online, others are in production and the last ones are future projects. We will present an overview of the different projects in progress and planned for the future. Calculated descriptors are a distinctive feature of Xper3 (Kerner and Vignes Lebbe 2019). These descriptors are automatically computed from other descriptors by using logical (Boolean) operators. The use of calculated descriptors remains rare, and they need to be promoted to encourage more feedback and improve them. The link between Xper3 and Annotate continues to improve (Hays and Kerner 2020). Annotate offers the possibility of tagging images with controlled vocabularies structured in Xper3. An export from Annotate to Xper3 then allows automatic filling in of the Xper3 knowledge base with the descriptions (annotations and numerical measures) of virtual specimens, and then comparing specimens to construct species descriptions, etc. Future developments in progress will modify the Xper3 architecture in order to have the same functionalities in both local and online versions and to allow various user interfaces on top of the same knowledge bases. Xper2-specific features, such as merging states, adding notes, adding definitions and/or illustrations in the description tab, and having different ways of sorting and filtering the descriptors during an identification (by groups, identification power, alphabetic order, specialist’s choice), have to be added to Xper3. A new tab in Xper3’s interface is being implemented to give access to various analysis tools, via an API (Application Programming Interface) or R code: MINSET (minimum list of descriptors sufficient to discriminate all items); MINDESCR (minimum set of descriptors to discriminate an item); DESCRXP (generating a description in natural language); MERGEMOD (proposing to merge states without loss of discriminating power); DISTINXP and DISTVAXP (computing similarities between items or descriptors). One last project that we would like to implement is interoperability between Xper3, platforms with biodiversity data (e.g., Global Biodiversity Information Facility, GBIF) and bio-ontologies.
An ID field already exists to add Universally Unique IDentifiers (UUIDs) for taxa. ID fields have to be added for descriptors and states to link them with ontologies, e.g., the Phenotypic Quality Ontology (PATO) or the Plant Ontology (PO). We are interested in discussing future developments to further improve the user interface and develop new tools for the analysis of knowledge bases. HTML XML PDF
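      The analysis tools listed above operate on the taxon-by-descriptor matrix of a knowledge base; MINSET, for instance, looks for a smallest set of descriptors that still separates every pair of items. The Python sketch below illustrates that idea with a greedy selection over a toy knowledge base; it is an illustration of the concept only, not the Xper3 or R implementation, and all names and data in it are invented.

```python
# Hypothetical illustration of a MINSET-style computation: greedily pick a
# small set of descriptors that separates every pair of items.
# This is NOT the Xper3 code, only a sketch of the underlying idea.
from itertools import combinations

# Toy knowledge base: item -> {descriptor: state}
items = {
    "sp. A": {"spore shape": "oval",  "cap colour": "brown", "gills": "free"},
    "sp. B": {"spore shape": "round", "cap colour": "brown", "gills": "free"},
    "sp. C": {"spore shape": "oval",  "cap colour": "white", "gills": "attached"},
}

def separated_pairs(descriptor):
    """Pairs of items that this descriptor alone can tell apart."""
    return {
        (a, b)
        for a, b in combinations(items, 2)
        if items[a][descriptor] != items[b][descriptor]
    }

def minset():
    all_pairs = set(combinations(items, 2))
    chosen, covered = [], set()
    descriptors = set(next(iter(items.values())))
    while covered != all_pairs and descriptors:
        # Pick the descriptor that separates the most still-unseparated pairs.
        best = max(descriptors, key=lambda d: len(separated_pairs(d) - covered))
        descriptors.remove(best)
        chosen.append(best)
        covered |= separated_pairs(best)
    return chosen

print(minset())  # e.g. ['spore shape', 'cap colour']
```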
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • Third-party Annotations: Linking PlutoF platform and the ELIXIR Contextual
           Data ClearingHouse for the reporting of source material annotation gaps
           and inaccuracies

    • Abstract: Biodiversity Information Science and Standards 5: e74249
      DOI : 10.3897/biss.5.74249
      Authors : Kessy Abarenkov, Allan Zirk, Kadri Põldmaa, Timo Piirmann, Raivo Pöhönen, Filipp Ivanov, Kristjan Adojaan, Urmas Kõljalg : Third-party annotations are a valuable resource to improve the quality of public DNA sequences. For example, sequences in the International Nucleotide Sequence Database Collaboration (INSDC) often lack important features like taxon interactions, species-level identification, and information associated with habitat, locality, country, coordinates, etc. Therefore, initiatives to mine additional information from publications and link it to the public DNA sequences have become common practice (e.g. Tedersoo et al. 2011, Nilsson et al. 2014, Groom et al. 2021). However, third-party annotations have their own specific challenges. For example, annotations can be inaccurate and therefore must be open for permanent data management. Further, every DNA sequence (except sequences from type material) can carry different species names, which must be databased as equal scientific hypotheses. The PlutoF platform provides such data management services for third-party annotations. PlutoF is an online data management platform and computing service provider for biology and related disciplines. Registered users can enter and manage a wide range of data, e.g., taxon occurrences, metabarcoding data, taxon classifications, traits, and lab data. It also features an annotation module where third-party annotations (on material source, geolocation and habitat, taxonomic identifications, interacting taxa, etc.) can be added to any collection specimen, living culture or DNA sequence record. The UNITE Community is using these services to annotate and improve the quality of INSDC rDNA Internal Transcribed Spacer (ITS) sequence datasets. The National Center for Biotechnology Information (NCBI) is linking its ITS sequences with their annotations in PlutoF. However, an automated solution for linking annotations in PlutoF with any sequence or sample record stored in INSDC databases is still missing. One of the ambitions of the BiCIKL Project is to solve this by operating the ELIXIR Contextual Data ClearingHouse (CDCH). CDCH offers a light and simple RESTful Application Programming Interface (API) to enable extension, correction and improvement of publicly available annotations on sample and sequence records available in ELIXIR data resources. It facilitates feeding improved or corrected annotations from databases (such as secondary databases, e.g., PlutoF, which consume and curate data from repositories) back to primary repositories (the databases of the three INSDC collaborative partners). In the Biodiversity Community Integrated Knowledge Library (BiCIKL) Project, the University of Tartu Natural History Museum is leading the task of linking the two components (the web interface provided by the PlutoF platform and the CDCH API) to allow user-friendly and effortless reporting of errors and gaps in sequenced material source annotations. The API and web interface will be promoted to the communities with the appropriate knowledge and tools (such as taxonomists, those abstracting data from the literature, and those already using the community-curated data), who will be encouraged to report their enhanced annotations back to primary repositories. HTML XML PDF
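      The CDCH is described above only as a light RESTful API for pushing corrected annotations back towards primary repositories. The sketch below shows what such a round trip could look like from a curator's script; the endpoint URL, payload fields, accession and token are invented placeholders for illustration and do not reflect the documented CDCH or PlutoF interfaces.

```python
# Hypothetical sketch of submitting a third-party annotation to a
# clearinghouse-style REST API. Endpoint, payload shape and auth are
# assumptions for illustration only, not the documented CDCH/PlutoF API.
import json
import urllib.request

ENDPOINT = "https://clearinghouse.example.org/api/annotations"  # placeholder URL
TOKEN = "..."  # obtained out of band

annotation = {
    "target_record": "INSDC:AB123456",           # placeholder sequence accession
    "attribute": "country",
    "reported_value": None,                       # missing in the source record
    "suggested_value": "Estonia",
    "evidence": "Tedersoo et al. 2011, Table 2",  # source of the correction
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(annotation).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": f"Bearer {TOKEN}"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode("utf-8"))
```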
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • Towards Interlinked FAIR Biodiversity Knowledge: The BiCIKL perspective

    • Abstract: Biodiversity Information Science and Standards 5: e74233
      DOI : 10.3897/biss.5.74233
      Authors : Lyubomir Penev, Dimitrios Koureas, Quentin Groom, Jerry Lanfear, Donat Agosti, Ana Casino, Joe Miller, Christos Arvanitidis, Guy Cochrane, Boris Barov, Donald Hobern, Olaf Banki, Wouter Addink, Urmas Kõljalg, Patrick Ruch, Kyle Copas, Patricia Mergen, Anton Güntsch, Laurence Benichou, Jose Benito Gonzalez Lopez : The Horizon 2020 project Biodiversity Community Integrated Knowledge Library (BiCIKL) (started 1st of May 2021, duration 3 years) will build a new European community of key research infrastructures, researchers, citizen scientists and other stakeholders in biodiversity and life sciences. Together, the 14 BiCIKL partners will solidify open science practices by providing access to data, tools and services at each stage of, and along, the entire biodiversity research and data life cycle (specimens, sequences, taxon names, analytics, publications, biodiversity knowledge graph) (Fig. 1, see also the BiCIKL kick-off presentation in Suppl. material 1), in compliance with the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. The existing services provided by the participating infrastructures will expand through development and adoption of shared, common or interoperable domain standards, resulting in liberated and enhanced flows of data and knowledge across these domains. BiCIKL puts a special focus on the biodiversity literature. Over the span of the project, BiCIKL will develop new methods and workflows for semantic publishing and for the harvesting, liberation, linking, re-use of, and integrated access to sub-article-level data extracted from literature (i.e., specimens, material citations, sequences, taxonomic names, taxonomic treatments, figures, tables). Data linkages may be realised with different technologies (e.g., data warehousing, linking between FAIR Data Objects, Linked Open Data) and can be bi-lateral (between two data infrastructures) or multi-lateral (among multiple data infrastructures). The main challenge of BiCIKL is to design, develop and implement a FAIR Data Place (FDP), a central tool for search, discovery and management of interlinked FAIR data across different domains. The key final output of BiCIKL will be the future Biodiversity Knowledge Hub (BKH), a one-stop portal providing access to the BiCIKL services, tools and workflows beyond the lifetime of the project. HTML XML PDF
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • Crowdsourcing Fungal Biodiversity: Approaches and standards used by an
           all-volunteer community science project

    • Abstract: Biodiversity Information Science and Standards 5: e74225
      DOI : 10.3897/biss.5.74225
      Authors : Bill Sheehan, Rob Stevenson, Joanne Schwartz : Fungal Diversity Survey (FunDiS) is an all-volunteer community science organization that documents the diversity and distribution of macrofungi (visible with the naked eye) across North America. FunDiS addresses a key gap in biodiversity conservation: fungi, one of life’s major kingdoms, have been largely neglected in conservation efforts. Fungi are hyperdiverse: it is estimated that only 5% of fungal species have been described (Willis 2018), while support for professional taxonomists has been declining for decades. Therefore, FunDiS engages legions of amateur mycologists to document fungal diversity. Our participation model has four levels for crowdsourcing fungal biodiversity. It consists of a pyramid of participants and skills, continually drawing more people in at the base (simplest tasks) and encouraging them to move up to the next level. Level 1. Field observations: Community scientists document fungi in the field with georeferenced color photos and post observations on public, databased platforms; FunDiS uses iNaturalist and Mushroom Observer. FunDiS established a curated iNaturalist project called the FunDiS Diversity Database, inspired by FungiMap in Australia. Mushroom enthusiasts add observations, with the incentive that they will be reviewed by a team of expert identifiers. Another team of triagers goes through new observations, rejects those that do not follow FunDiS quality standards, and writes encouraging notes to posters on how to make observations more scientifically valuable. As of August 2021, there were almost 50,000 verifiable observations, of which 30,465 (including 3,204 species) were research grade and uploaded by iNaturalist onto the website of the Global Biodiversity Information Facility (GBIF). Another FunDiS initiative, Rare Fungi Challenges, enlists amateurs to search for rare or threatened fungi. Level 2. Sequence: FunDiS built a program for amateurs to submit tissue for DNA sequencing and provided help interpreting results. Barcoding is especially needed to identify fungi because morphological characteristics and images are often insufficient. Participants register projects, post observations to iNaturalist or Mushroom Observer, and apply to FunDiS for sequencing grants or pay out-of-pocket for sequencing. More than 200 local projects have been registered from Alaska to Puerto Rico, and Iceland to Hawaii. Some 7,000 specimens were sequenced by June 2021. Data are deposited in GenBank. Level 3. Voucher: FunDiS supports preserving well-documented, dried specimens in curated fungaria. To date, this participation level has developed slowly because of limitations of personnel and capacity of those institutions. Level 4. Super User: These are advanced observers with extensive field knowledge who have learned DNA technology; can teach others how to analyze DNA results and create phylogenies; and even describe new species. There are perhaps several dozen super users in the North American fungal science community. Challenges and lessons: Feedback - Feedback to and from participants is critical to the success of community science projects. We have learned that it takes time and personnel to inspire rich interaction with participants in real time and that relying on volunteers with insufficient capacity for coordination, consistency and continuity often disappoints participants. Similarly, DNA sequencing is intimidating to most amateurs.
We found that guidance was needed for many participants just to correctly document, dry and submit tissue samples for sequencing. An even bigger challenge is making sense of the data that is generated, e.g., knowing if the sequence is of a described species or should be identified as a new species. Deep knowledge is needed for this kind of decision-making. In the past year we were fortunate to have the volunteer services of two professional mycologists and a doctoral student to analyze sequence data.Linking data - Linking data between field observations, genetic sequences and specimens is a major challenge. Our initial goal was to automate both external and internal data flows, but success has been limited with volunteer programmers. They managed to automate uploading iNaturalist and Mushroom Observer observations to our sequencing facility (Barcode of Life), but most other linkages have been tracked by volunteers on static spreadsheets.Paid staff - In retrospect, it was optimistic to attempt a project of such ambitious scope using only volunteer management and labor. The vast majority of community science projects are institution-based, with paid staff to manage and funds for outreach (Pocock et al. 2017). To continue at the present scale, we believe a core of paid staff is essential to leverage the large community we have been building. HTML XML PDF
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • Integrating Taxonomic Names and Concepts from Paper and Digital Sources
           for a New Flora of Alaska

    • Abstract: Biodiversity Information Science and Standards 5: e74184
      DOI : 10.3897/biss.5.74184
      Authors : Campbell Webb, Stefanie Ickert-Bond, Kimberly Cook : The taxonomic foundation of a new regional flora or monograph is the reconciliation of pre-existing names and taxonomic concepts (i.e., variation in usage of those names). This reconciliation is traditionally done manually, but the availability of taxonomic resources online and of text manipulation software means that some of the work can now be automated, speeding up the development of new taxonomic products. As a contribution to developing a new Flora of Alaska (floraofalaska.org), we have digitized the main pre-existing flora (Hultén 1968) and combined it with key online taxonomic name sources (Panarctic Flora, Flora of North America, International Plant Names Index - IPNI, Tropicos, Kew’s World Checklist of Selected Plant Families), to build a canonical list of names anchored to external Globally Unique Identifiers (GUIDs) (e.g., IPNI URLs). We developed taxonomically-aware fuzzy-matching software (matchnames, Webb 2020) to identify cognates in different lists. The taxa for which there are variations between different sources in accepted names and synonyms are then flagged for review by taxonomic experts. However, even though names may be consistent across previous monographs and floras, the taxonomic concept (or circumscription) of a name may differ among authors, meaning that the way an accepted name in the flora is applied may be unfamiliar to the users of previous floras. We therefore have begun to manually align taxonomic concepts across five existing floras: Panarctic Flora, Flora of North America, Cody’s Flora of the Yukon (Cody 2000), Welsh’s Flora (Welsh 1974) and Hultén’s Flora (Hultén 1968), analysing usage and recording the Region Connection Calculus (RCC-5) relationships between taxonomic concepts common to each source. So far, we have mapped taxa in 13 genera, containing 557 taxonomic concepts and 482 taxonomic concept relationships. To facilitate this alignment process we developed software (tcm, Webb 2021) to record publications, names, taxonomic concepts and relationships, and to visualize the taxonomic concept relationships as graphs. These relationship graphs have proved to be accessible and valuable in discussing the frequently complex shifts in circumscription with the taxonomic experts who have reviewed the work. The taxonomic concept data are being integrated into the larger dataset to permit users of the new flora to instantly see both the chain of synonymy and concept map for any name. We have also worked with the developer of the Arctos Collection Management Solution (a database used for the majority of Alaskan collections) on new data tables for storage and display of taxonomic concept data. In this presentation, we will describe some of the ideas and workflows that may be of value to others working to connect across taxonomic resources. HTML XML PDF
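      The name reconciliation step described above depends on taxonomically-aware fuzzy matching of scientific names across checklists. The Python sketch below illustrates the general idea (comparing genus and epithet separately, with an edit-distance threshold); it is a generic illustration only, not the matchnames algorithm of Webb (2020), and the checklist entries are arbitrary examples.

```python
# Minimal sketch of taxonomically-aware fuzzy matching between two name lists.
# Genus and specific epithet are compared separately so that a small spelling
# difference in the epithet does not hide a genus mismatch.
# Illustrates the general idea only; it is not the `matchnames` algorithm.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_name(query: str, reference: list[str], threshold: float = 0.9):
    q_genus, q_epithet = query.split()[:2]
    best, best_score = None, 0.0
    for candidate in reference:
        c_genus, c_epithet = candidate.split()[:2]
        # Require a near-identical genus, then score on the epithet.
        if similarity(q_genus, c_genus) < threshold:
            continue
        score = similarity(q_epithet, c_epithet)
        if score > best_score:
            best, best_score = candidate, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

checklist = ["Salix alaxensis", "Salix arbusculoides", "Betula neoalaskana"]
print(match_name("Salix alaxsensis", checklist))  # -> ('Salix alaxensis', ~0.95)
```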
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • The Fungal Literature-based Occurrence Database in Southern West
           Siberia (Russia)

    • Abstract: Biodiversity Information Science and Standards 5: e74178
      DOI : 10.3897/biss.5.74178
      Authors : Nina Filippova, Dmitry Ageev, Sergey Bolshakov, Olga Vayshlya, Anastasia Vlasenko, Vyacheslav Vlasenko, Sergei Gashkov, Irina Gorbunova, Eugene Davydov, Elena Zvyagina, Nadezhda Kudashova, Maria Tomoshevich, Aleksandra Filippova, Natalia Shabanova, Lidia Yakovchenko, Irina Vorob'eva, Ludmila Kalinina, Ekaterina Palomozhnykh : This abstract presents an initiative to develop the Fungal Literature-based Occurrence Database for Southern West Siberia (FuSWS), which mobilizes occurrences of fungi from published literature (literature-based occurrences, Darwin Core MaterialCitation). The FuSWS database includes 28 fields describing species name, publication source, herbarium number (if one exists), date of sampling or observation, locality information, vegetation, substrate, and others. The initiative on digitization of literature-based occurrence data started in the northern part of Western Siberia two years ago (Filippova et al. 2021a). The present project extends the initiative to the south and includes eight administrative regions (Sverdlovsk, Omsk, Kurgan, Tomsk, Novosibirsk, Kemerovo, Altay, and Gorny Altay). The area occupies the central to southern part of the West Siberian Plain. It extends about 1,500 km from west to east, from the eastern slopes of the Ural Mountains to the Yenisey River, and about 1,300 km from north to south. The total area equals about 1.2 million km². Currently, the project is growing actively in spatial coverage, collaboration and data accumulation. A working group of about 30 mycologists from 16 organizations, dedicated to the digitization initiative, was created as part of the Siberian Mycological Society (an informal organization since 2019). They have created the most complete bibliographic list of mycology-related papers for Southern West Siberia, including over 800 publications from the last two centuries (the earliest dated 1800). At abstract submission, the database had been populated with a total of about 10,000 records from about 100 sources. The dataset is uploaded to GBIF, where it is available for online search of species occurrences and/or download (Filippova et al. 2021b, Fig. 1). The project's page with the introduction, templates, bibliography list, video presentations and written instructions is available on the website of the Siberian Mycological Society (https://sibmyco.org/literaturedatabase). The following protocol describes the digitization workflow in detail: The bibliography of related publications is compiled using the Zotero bibliographic manager. Only published works (peer-reviewed papers, conference proceedings, PhD theses, monographs or book chapters) are selected. If possible, the sources are digitized and added to the library as PDF files. The template of the FuSWS database is made with Google Sheets, which allows simultaneous use by several specialists in a common data format. A simple Microsoft Excel template is also available for offline databasing. The Darwin Core standard is applied to the database field structure to accommodate the relevant information extracted from the publications. From the available bibliography of publications related to the region, only works with species occurrences are selected for databasing. The main source of occurrences is annotated species lists with exact localities of the records. However, other sorts of species citations are also extracted, provided that they can be connected to a geographic location.
All occurrences are georeferenced, either from the coordinates provided in the paper or from the verbatim description of the field-work locality. Georeferencing of the verbatim descriptions is done using Yandex or Google map services. Depending on the quality of the georeference provided in a publication, the uncertainty is estimated as follows: 1) coordinates of a fruiting structure or a plot give an uncertainty of about 3-30 meters; 2) coordinates of the field-work locality give an uncertainty of about 500 m to 5 km; 3) a report of the species' presence in a particular region is georeferenced to the centroid of the area, with an uncertainty radius large enough to include its borders. Locality names reported in Russian are translated to English and written in the «locality» field; the Russian descriptions are preserved in the «verbatimLocality» field for accuracy. When possible, the «eventDate» is extracted from the annotation data. Whenever this information is absent, the date of the publication is used instead, with a remark in the «verbatimEventDate» field. Ecological features, habitat and substrate preferences are written in the «habitat» field and preserved in Russian. The original scientific names reported in publications are entered in the «originalNameUsage» field. Correction of spelling errors is made using the GBIF Species Matching tool. This tool is also used to create the additional fields of the taxonomic hierarchy from species to kingdom, to fill in the «taxonRank» field and to synonymize according to the GBIF Backbone Taxonomy. To track the digitization process, a worksheet is maintained. Each bibliographic record has a series of fields describing the digitization process and its results: the total number of extracted occurrence records, a general description of occurrence quality, presence of the observation date, details of georeferencing and the name of the person responsible for the digitization. HTML XML
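      The protocol above maps each literature record onto Darwin Core terms and attaches a coordinate uncertainty that depends on the precision reported in the source. Below is a small Python sketch of how one such row might be assembled before upload; the field names follow the Darwin Core terms cited in the abstract, while the helper function, values and uncertainty figures are invented for illustration.

```python
# Illustrative construction of a single literature-based occurrence record
# using Darwin Core terms mentioned in the abstract. Values are invented
# examples; the uncertainty helper loosely mirrors the three-level scheme
# described above (the region radius is an arbitrary placeholder).
import csv

def uncertainty_in_meters(precision: str) -> int:
    """Rough mapping of georeference precision to an uncertainty radius."""
    return {
        "plot_coordinates": 30,        # coordinates of a plot or fruiting body
        "locality_coordinates": 5000,  # coordinates of the field-work locality
        "region_centroid": 150000,     # centroid of an administrative region
    }[precision]

record = {
    "originalNameUsage": "Boletus edulis Bull.",
    "scientificName": "Boletus edulis",          # after GBIF Species Matching
    "taxonRank": "species",
    "basisOfRecord": "MaterialCitation",
    "eventDate": "1987-08",
    "verbatimEventDate": "August 1987",
    "locality": "Pine forest near Tomsk",
    "verbatimLocality": "сосновый лес близ Томска",   # original Russian wording
    "decimalLatitude": 56.49,
    "decimalLongitude": 84.95,
    "coordinateUncertaintyInMeters": uncertainty_in_meters("locality_coordinates"),
    "habitat": "on soil in Pinus sylvestris forest",
    "associatedReferences": "Hypothetical source publication",
}

with open("occurrences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
```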
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • An Image is Worth a Thousand Species: Scaling high-resolution plant
           biodiversity prediction to biome-level using citizen science data and
           remote sensing imagery

    • Abstract: Biodiversity Information Science and Standards 5: e74052
      DOI : 10.3897/biss.5.74052
      Authors : Lauren Gillespie, Megan Ruffley, Moisés Expósito-Alonso : Accurately mapping biodiversity at high resolution across ecosystems has been a historically difficult task. One major hurdle to accurate biodiversity modeling is that there is a power-law relationship between the abundances of different species in an environment, with a few species being relatively abundant while many species are rare. This “commonness of rarity,” confounded with differential detectability of species, can lead to misestimations of where a species lives. To overcome these confounding factors, many biodiversity models employ species distribution models (SDMs) to predict the full extent of where a species lives, using observations of where the species has been found correlated with environmental variables. Most SDMs use bioclimatic variables as predictors of a species’ range, but these approaches often rely on biased pseudo-absence generation methods and model species using coarse-grained bioclimatic variables with a useful resolution floor of about 1 km per pixel. Here, we pair iNaturalist citizen science plant observations from the Global Biodiversity Information Facility with RGB-infrared aerial imagery from the National Agriculture Imagery Program to develop a deep convolutional neural network model that can predict the presence of nearly 2,500 plant species across California. We utilize a state-of-the-art multilabel image recognition model from the computer vision community, paired with a cutting-edge multilabel classification loss, which leads to comparable or better accuracy than traditional SDMs, but at a resolution of 250 m (Ben-Baruch et al. 2020, Ridnik et al. 2020). Furthermore, this deep convolutional model is able to predict species presence across multiple biomes of California with good accuracy and can be used to build a plant biodiversity map across California with unparalleled accuracy. Given the widespread availability of citizen science observations and remote sensing imagery across the globe, this deep learning-enabled method could be deployed to automatically map biodiversity at large scales. HTML XML PDF
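      The modeling setup described pairs remote-sensing image patches with multilabel species-presence targets. The sketch below shows the generic pattern in PyTorch (a small convolutional network over four-band imagery trained with a multilabel binary cross-entropy loss); it is a stand-in for illustration, not the authors' architecture or the asymmetric loss they cite, and the shapes and data are toy values.

```python
# Simplified sketch of a multilabel species-presence classifier on image
# patches: small CNN + BCE-with-logits loss. Generic pattern only, not the
# authors' network or their multilabel loss; all data here are random toys.
import torch
from torch import nn

NUM_SPECIES = 2500  # approximate label count from the abstract

class PatchClassifier(nn.Module):
    def __init__(self, num_labels: int = NUM_SPECIES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),  # 4 bands: RGB + infrared
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_labels)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # raw logits per species

model = PatchClassifier()
loss_fn = nn.BCEWithLogitsLoss()          # independent presence/absence per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random data in place of real imagery and labels.
images = torch.randn(8, 4, 64, 64)                       # batch of 4-band patches
targets = torch.randint(0, 2, (8, NUM_SPECIES)).float()  # multi-hot presence
loss = loss_fn(model(images), targets)
loss.backward()
optimizer.step()
print(float(loss))
```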
      PubDate: Fri, 10 Sep 2021 15:30:00 +0300
       
  • #RetroPIDs: The missing link to the foundation of biodiversity
           knowledge

    • Abstract: Biodiversity Information Science and Standards 5: e74141
      DOI : 10.3897/biss.5.74141
      Authors : Nicole Kearney, Colleen Funkhouser, Mike Lichtenberg, Bess Missell, Roderic Page, Joel Richard, Diane Rielinger, Susan Lynch : The Biodiversity Heritage Library (BHL) will soon upload its 60 millionth page of open access biodiversity literature onto the BHL website and the BHL's Internet Archive Collection. The BHL’s massive repository of free knowledge includes content that is available nowhere else online, as well as accessible versions of content that are locked behind paywalls elsewhere. If we are to continue to expand our understanding of life on Earth, we must ensure that the foundation of biodiversity knowledge provided by BHL is discoverable by the tools we rely on to navigate the ever-expanding internet. These tools – search engines and their algorithms – preferentially deliver (and rank) content with good metadata and persistent identifiers (PIDs). In modern online publishing, PID assignment and linking happen at the point of publication: DOIs (Digital Object Identifiers) for publications, ORCIDs (Open Researcher and Contributor IDs) for people, and RORs (Research Organization Registry IDs) for organisations. The DOI system provided by Crossref (the DOI registration agency for scholarly content) delivers reciprocal citations, enabling convenient clicking from article to article, and citation tracking, enabling authors and institutions to track the impact and reach of their research output. Publications that lack PIDs, which include the vast majority of legacy literature, are hard to find and sit outside the linked network of scholarly research. This makes it nearly impossible to determine whether they are being cited, let alone viewed, mentioned, shared or liked. At TDWG 2020 (Page 2020, Kearney 2020, Richard 2020), 2019 (Page 2019a, Page 2019b, Kearney 2019a, Kearney 2019b) and 2018 (Kearney 2018), we emphasised the need to bring the historic biodiversity literature into the modern linked network of scholarly research. In October 2020, BHL launched a new working group to do exactly this. The BHL Persistent Identifier Working Group (Team #RetroPID) brings together expertise from across BHL’s global community. Over the past year, we have worked tirelessly to make it easier to find, cite, link, share and track the content on BHL, adding article-level metadata to journals and retrospectively assigning DOIs (#RetroPIDs). Most importantly, we have developed the tools and documentation that will enable the entire BHL community to take contributed content from “just” accessible to persistently discoverable. This paper will detail our efforts to retrofit the historic literature (a square peg) into the modern PID system (a round hole) and will present both the achievements and the challenges of this important work. HTML XML PDF
      PubDate: Wed, 8 Sep 2021 17:30:00 +0300
       
  • Genomes on a Tree (GoaT): A centralized resource for eukaryotic
           genome sequencing initiatives

    • Abstract: Biodiversity Information Science and Standards 5: e74138
      DOI : 10.3897/biss.5.74138
      Authors : Cibele Sotero-Caio, Richard Challis, Sujai Kumar, Mark Blaxter : Genomic data are transforming our understanding of biodiversity and, under the umbrella of the Earth BioGenome Project (EBP - https://www.earthbiogenome.org), many initiatives seek to generate large numbers of reference genome sequences. The distributed nature of this work makes coordination essential to ensure optimal synergy between projects and to prevent duplication of effort. While public sequence databases hold data describing completed projects, there is currently no global source of information about projects in progress or planned. In addition, the scoping and delivery of sequencing projects benefits from prior estimates of genome size and karyotype, but existing data are scattered in the literature. To address these issues, the Tree of Life programme (https://www.sanger.ac.uk/programme/tree-of-life/) has developed Genomes on a Tree (GoaT), an ElasticSearch-powered, taxon-centred database that collates observed and estimated genome-relevant metadata, including genome sizes and karyotypes, for eukaryotic species. Missing values for individual species are estimated from phylogenetic comparison. GoaT also holds declarations of actual and planned activity, from priority lists and in-progress status to submissions to the International Nucleotide Sequence Database Collaboration (INSDC, https://www.insdc.org/), across genome sequencing consortia. GoaT can be queried through a mature API (application programming interface), and we have developed a web front-end that includes data summary visualisations (see https://goat.genomehubs.org/). We are currently transitioning this service into the Tree of Life production pipeline. GoaT currently reports priority lists from the Darwin Tree of Life project (focussed on the biodiversity of Britain and Ireland). We are actively soliciting additional data concerning progress and intent from other projects so that GoaT displays a real-time summary of the state of play in reference genome sequencing, and thus facilitates collaboration and cooperation among projects. We are developing standard formats and procedures so that any project can make explicit its intent and progress. Cross-referencing to other data systems, such as the INSDC sequence databases, the BOLD DNA barcode resource, and Global Biodiversity Information Facility- and Open Tree of Life-related taxonomic and distribution databases, will further enhance the system’s utility. We also seek to incorporate additional kinds of metadata, such as sex chromosome systems, to augment the utility of GoaT in supporting the global genome sequencing effort. HTML XML PDF
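      The abstract notes that GoaT estimates missing genome-relevant values from phylogenetic comparison with relatives. The Python sketch below shows one very simple form of that idea (filling a missing genome size with the mean over the nearest enclosing taxon that has observed values); it illustrates the concept only and is not GoaT's actual estimation method, and the species values are rough toy numbers.

```python
# Toy illustration of filling a missing genome-size estimate from relatives:
# walk up the taxonomic lineage and use the mean of the closest group with
# observed values. Not GoaT's actual method; a sketch of the idea only.
from statistics import mean

# species -> (lineage from genus upward, observed genome size in Mb or None)
records = {
    "Pieris napi":            (["Pieris", "Pieridae", "Lepidoptera"], 350.0),
    "Pieris rapae":           (["Pieris", "Pieridae", "Lepidoptera"], 320.0),
    "Anthocharis cardamines": (["Anthocharis", "Pieridae", "Lepidoptera"], 380.0),
    "Colias croceus":         (["Colias", "Pieridae", "Lepidoptera"], None),
}

def estimate_genome_size(species: str) -> tuple[float, str]:
    lineage, observed = records[species]
    if observed is not None:
        return observed, "direct"
    for rank in lineage:  # genus first, then family, then order
        values = [
            size for other, (lin, size) in records.items()
            if other != species and rank in lin and size is not None
        ]
        if values:
            return mean(values), f"estimated from {rank}"
    raise ValueError("no relatives with data")

print(estimate_genome_size("Colias croceus"))  # (350.0, 'estimated from Pieridae')
```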
      PubDate: Wed, 8 Sep 2021 17:30:00 +0300
       
  • Using the Taxonomic Backbone(s): The challenge of selecting a taxonomic
           resource and integrating it with a collection management solution

    • Abstract: Biodiversity Information Science and Standards 5: e74115
      DOI : 10.3897/biss.5.74115
      Authors : Teresa Mayfield-Meyer, Phyllis Sharp, Dusty McDonald : The reality is that there is no single “taxonomic backbone”, there are many: the Global Biodiversity Information Facility (GBIF) Backbone Taxonomy, the World Register of Marine Species (WoRMS) and MolluscaBase, to name a few. We could view each one of these as a vertebra on the taxonomic backbone, but even that isn’t quite correct as some of these are nested within others (MolluscaBase contributes to WoRMS, which contributes to Catalogue of Life, which contributes to the GBIF Backbone Taxonomy). How is a collection manager without expertise in a given set of taxa and a limited amount of time devoted to finding the “most current” taxonomy supposed to maintain research grade identifications when there are so many seemingly authoritative taxonomic resources? And once a resource is chosen, how can they seamlessly use the information in that resource? This presentation will document how the Arctos community’s use of the taxon name matching service Global Names Architecture (GNA) led one volunteer team leader in a marine invertebrate collection to attempt to make use of WoRMS taxonomy and how her persistence brought better identifications and classifications to a community of collections. It will also provide insight into some of the technical and curatorial challenges involved in using an outside resource as well as the ongoing struggle to keep up with changes as they occur in the curated resource. HTML XML PDF
      PubDate: Wed, 8 Sep 2021 17:30:00 +0300
       
  • Algorithms for connecting scientific names with literature in the
           Biodiversity Heritage Library via the Global Names Project and Catalogue
           of Life

    • Abstract: Biodiversity Information Science and Standards 5: e74114
      DOI : 10.3897/biss.5.74114
      Authors : Geoffrey Ower, Dmitry Mozzherin : Being able to quickly find and access original species descriptions is essential for efficiently conducting taxonomic research. Linking scientific name queries to the original species description is challenging and requires taxonomic intelligence, because on average there are an estimated three scientific names associated with each currently accepted species, and many historical scientific names have fallen into disuse from being synonymized or forgotten. Additionally, non-standard usage of journal abbreviations can make it difficult to automatically disambiguate bibliographic citations and ascribe them to the correct publication. The largest open access resource for biodiversity literature is the Biodiversity Heritage Library (BHL), which was built by a consortium of natural history institutions and contains over 200,000 digitized volumes of natural history publications spanning hundreds of years of biological research. Catalogue of Life (CoL) is the largest aggregator of scientific names globally, publishing an annual checklist of currently accepted scientific names and their historical synonyms. TaxonWorks is an integrative web-based workbench that facilitates collaboration on biodiversity informatics research between scientists and developers. The Global Names project has been collaborating with BHL, TaxonWorks, and CoL to develop a Global Names Index that links all of these services together by finding scientific names in BHL and using the taxonomic intelligence provided by CoL to conveniently link directly to the page referenced in BHL. The Global Names Index is continuously updated as metadata are improved and digitization technologies advance to provide more accurate optical character recognition (OCR) of scanned texts. We developed an open source tool, “BHLnames,” and launched a RESTful application programming interface (API) service with a freely available JavaScript widget that can be embedded on any website to link scientific names to literature citations in BHL. If no bibliographic citation is provided, the widget will link to the oldest name usage in BHL, which often is the original species description. The BHLnames widget can also be used to browse all mentions of a scientific name and its synonyms in BHL, which could make the tool more broadly useful for studying the natural history of any species. HTML XML PDF
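      The linking behaviour described above (falling back to the oldest recorded usage of a name, after expanding the query with synonyms) can be pictured with a few lines of Python. The index, synonym table and page identifiers below are invented placeholders; the real BHLnames service works against the full BHL index and Catalogue of Life synonymy.

```python
# Toy sketch of "link a name to its oldest usage": expand the query with known
# synonyms and return the earliest indexed page. The names, years, synonym
# table and page identifiers are all invented placeholders.
usages = [
    {"name": "Aus albus", "year": 1905, "page": "bhl-page-0002"},
    {"name": "Aus albus", "year": 1871, "page": "bhl-page-0001"},
    {"name": "Xus albus", "year": 1838, "page": "bhl-page-0003"},
    {"name": "Aus niger", "year": 1920, "page": "bhl-page-0004"},
]
synonyms = {"Aus albus": {"Xus albus"}}  # accepted name -> earlier names

def oldest_usage(name: str):
    query_names = {name} | synonyms.get(name, set())
    hits = [u for u in usages if u["name"] in query_names]
    return min(hits, key=lambda u: u["year"]) if hits else None

print(oldest_usage("Aus albus"))
# -> {'name': 'Xus albus', 'year': 1838, 'page': 'bhl-page-0003'}
```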
      PubDate: Wed, 8 Sep 2021 17:30:00 +0300
       
  • BHL and the Pandemic: An accelerator of digital advances and
           transformation

    • Abstract: Biodiversity Information Science and Standards 5: e74061
      DOI : 10.3897/biss.5.74061
      Authors : Alice Lemaire : Committed to the Biodiversity Heritage Library (BHL) since 2016, the National Natural History Museum (MNHN) library encountered opportunities and new challenges during the COVID-19 pandemic. The origins of the MNHN date back to 1635, with the foundation of a royal garden for medicinal and teaching purposes by King Louis XIII. It became the National Natural History Museum in 1793, during the French Revolution. The MNHN collections today include about seventy million specimens. These collections constitute a global archive and a major research infrastructure. Being a very important center of research and teaching, the institution groups together several entities at thirteen different locations. It is deeply committed to preserving biodiversity and to sharing knowledge with the public through its galleries, botanical gardens, zoos and libraries. The library, consisting of a main library and several specialized libraries, is one of the world’s largest natural history libraries. The collection contains more than two million documents of all kinds: printed and electronic books and periodicals; manuscripts and archives; maps, drawings, photographs and art collections. The library takes part in the French higher education libraries network and is associated with the French national library, which offers many opportunities for collaboration at a national level. The MNHN library launched its first digitization program twenty years ago, beginning with the academic publications the MNHN has been releasing since 1802 and including the publications of the related learned societies. A second program devoted to taxonomic documentation began in 2014. It is a research-driven digitization program built in collaboration with the MNHN researchers. A third program shares the treasures of the library, e.g., precious books, manuscripts and archives, iconography (such as the famous vellum collection), scientific objects and artworks. The MNHN digital library is harvested by Gallica, the digital library of the French national library. After participating in the BHL-Europe project from 2009 to 2012, the MNHN library became a BHL Member in 2016 and started uploading content in September 2017. The complete collection of MNHN academic publications from 1802 to 2000 is now available on BHL. The publications of the learned societies related to the MNHN are to be the library’s next contribution. During the first lockdown, from March to May 2020, librarians in charge of uploading content to BHL were able to pursue this task full-time, which increased production. The last BHL-Europe files were loaded during this period. More than 100,000 pages were added in 2020. As production increased, so did the museum's outreach in 2020, by more than 70%, both in number of visitors and in number of pages viewed. It seems that the MNHN library is now better identified as the French access point to BHL, both by learned societies and by researchers who ask for information or for help. But beyond increased production and better outreach, the pandemic also provided new tasks for remote workers. The first lockdown was a very difficult time, especially for people who had no remote work and felt deprived of their professional identity. So, progressively, new tasks were established for people for whom no remote tasks had yet been defined. Among these new activities, a workflow for the creation of article-level metadata was set up with the help of Roderic Page (University of Glasgow, Scotland).
Thanks to this work, users can easily search and browse individual articles within several MNHN publications, such as Adansonia. The pandemic turned out to be an accelerator of digital awareness and transformation, not only at the management level, but more widely for the whole library staff as well. By providing new remote tasks, BHL reduced inequalities within the library team and offered new opportunities. This greater involvement also strengthened the sense of belonging to BHL, which is definitely not only a resource but also a community, helping us get through this difficult period. Our goal now is to sustain these projects. The MNHN library also intends to capitalize on all this work in its own digital library by assigning digital object identifiers (DOIs). This work on articles is indeed a driver for the evolution of the information systems. The Museum is currently redesigning its whole IT infrastructure for collections, helping the library be part of a larger movement. The objectives of this new system are to better connect library collections and naturalist collections and to face the challenge of interoperability in the European and international ecosystem in which the MNHN and BHL participate. HTML XML PDF
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Species Detection and Segmentation of Multi-specimen Historical
           Herbaria 

    • Abstract: Biodiversity Information Science and Standards 5: e74060
      DOI : 10.3897/biss.5.74060
      Authors : Krishna Kumar Thirukokaranam Chandrasekar, Kenzo Milleville, Steven Verstockt : Historically, herbarium specimens have provided users with documented occurrences of plants in specific locations over time. Herbarium collections have therefore been the basis of systematic botany for centuries (Younis et al. 2020). According to the latest summary report based on the data from Index Herbariorum, there are around 3400 active herbaria in the world containing 397 million specimens spread across 182 countries (Thiers 2021). Exponential growth in high-quality image-capturing devices, driven by the enormous amount of uncovered collections, has further led to rising interest in large-scale digitization initiatives across the world (Le Bras et al. 2017). As herbarium specimens become increasingly digitised and accessible in online repositories, an important need has also emerged to develop automated tools to process and enrich these collections to facilitate better access to the preserved archives. This rising number of digitised herbarium sheets provides an opportunity to employ computer-based image processing techniques, such as deep learning, to automatically identify species and higher taxa (Carranza-Rojas and Joly 2018, Carranza-Rojas et al. 2017, Younis et al. 2020) or to extract other useful information from the herbarium sheets, such as detecting handwritten text, color bars, scales and barcodes. The species identification task works well for herbarium sheets that have only one species on a page. However, there are many herbarium books that have multiple species on the same page (as shown in Fig. 1), for which the complexity of the identification problem increases tremendously. It also involves a great deal of time and effort if they are to be enriched manually. In this work, we propose a pipeline that can automatically detect, identify, and enrich plant species in multi-specimen herbaria. The core idea of the pipeline is to detect unique plant species and handwritten text around the plant species and map the text to the correct plant species. As shown in Fig. 2, the proposed pipeline begins with the pre-processing of the images. The images are rotated and aligned such that the longest edge is maintained as the height. In the case of herbarium books, the pages are detected and morphological transformations are performed to reduce occlusions (Thirukokaranam Chandrasekar and Verstockt 2020). A YOLOv3 (You Only Look Once version 3) object detection model (Zhao and Li 2020) is trained from scratch to detect plants and text. The model was trained on a dataset of single-species herbarium sheets with a mosaic augmentation technique to extend the plant model to detect multiple species. The first training results are impressive, although they could be further improved with more labelled data. We also plan to train an object segmentation model and contrast its performance with the plant detection model for multi-specimen herbarium sheets. After detecting both the plants and the text, the text will be recognized with a state-of-the-art handwritten text recognition (HTR) model. The recognized text can then be matched with a database of specimens, to identify each detected specimen. Furthermore, additional textual metadata (e.g. date, locality, collector's name, institution) visible on the sheet will be recognized and used to enrich the collection. HTML XML PDF
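      The mapping step at the heart of the pipeline (attach each detected handwritten-text region to the correct specimen) can be pictured with a simple geometric rule. The Python sketch below assigns each text box to the plant box with the nearest centre; the rule and the toy boxes are illustrative assumptions, not necessarily the exact criterion used in the authors' pipeline, and real boxes would come from the trained detector.

```python
# Minimal sketch of the "map text to the correct specimen" step on a
# multi-specimen sheet: assign each detected text box to the plant box whose
# centre is nearest. Boxes are (x_min, y_min, x_max, y_max) in pixels and
# would normally come from the object detector; the nearest-centre rule is
# an illustrative assumption only.
import math

def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def assign_text_to_plants(plant_boxes, text_boxes):
    assignments = {}
    for t_idx, t_box in enumerate(text_boxes):
        tx, ty = centre(t_box)
        distances = [math.dist((tx, ty), centre(p_box)) for p_box in plant_boxes]
        assignments[t_idx] = distances.index(min(distances))
    return assignments  # text index -> plant index

plants = [(100, 100, 600, 900), (700, 120, 1200, 950)]  # two specimens
texts = [(150, 920, 400, 980), (750, 960, 1000, 1010)]  # two handwritten labels
print(assign_text_to_plants(plants, texts))  # {0: 0, 1: 1}
```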
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Reducing Manual Supervision Required for Biodiversity Monitoring with
           Self-Supervised Learning

    • Abstract: Biodiversity Information Science and Standards 5: e74047
      DOI : 10.3897/biss.5.74047
      Authors : Omiros Pantazis, Gabriel Brostow, Kate Jones, Oisin Mac Aodha : Recent years have ushered in a vast array of different types of low-cost and reliable sensors that are capable of capturing large quantities of audio and visual information from the natural world. In the case of biodiversity monitoring, camera traps (i.e., remote cameras that take images when movement is detected (Kays et al. 2009)) have shown themselves to be particularly effective tools for the automated monitoring of the presence and activity of different animal species. However, this ease of deployment comes at a cost, as even a small-scale camera trapping project can result in hundreds of thousands of images that need to be reviewed. Until recently, this review process was an extremely time-consuming endeavor. It required domain experts to manually inspect each image to determine if it contained a species of interest and to identify, where possible, which species was present. Fortunately, in the last five years, advances in machine learning have resulted in a new suite of algorithms that are capable of automatically performing image classification tasks like species classification. The effectiveness of deep neural networks (Norouzzadeh et al. 2018), coupled with transfer learning (tuning a model that is pretrained on a larger dataset) (Willi et al. 2018), has resulted in high levels of accuracy on camera trap images. However, camera trap images exhibit unique challenges that are typically not present in standard benchmark datasets used in computer vision. For example, objects of interest are often heavily occluded, the appearance of a scene can change dramatically over time due to changes in weather and lighting, and while the overall number of images can be large, the variation in locations is often limited (Schneider et al. 2020). These challenges combined mean that in order to reach high performance on species classification it is necessary to collect a large amount of annotated data to train the deep models. This again takes a significant amount of time for each project, and this time could be better spent addressing the ecological or conservation questions of interest. Self-supervised learning is a paradigm in machine learning that attempts to forgo the need for manual supervision by instead learning informative representations from images directly, e.g., transforming an image in two different ways without impacting the semantics of the included object, and learning by imposing similarity between the two transformations. This is a tantalizing proposition for camera trap data, as it has the potential to drastically reduce the amount of time required to annotate data. The current performance of these methods on standard computer vision benchmarks is encouraging, as it suggests that self-supervised models have begun to reach the accuracy of their fully supervised counterparts for tasks like classifying everyday objects in images (Chen et al. 2020). However, existing self-supervised methods can struggle when applied to tasks that contain highly similar, i.e., fine-grained, object categories such as different species of plants and animals (Van Horn et al. 2021). To this end, we explore the effectiveness of self-supervised learning when applied to camera trap imagery. We show that these methods can be used to train image classifiers with a significant reduction in manual supervision.
Furthermore, we extend this analysis by showing, with some careful design considerations, that off-the-shelf self-supervised methods can be made to learn even more effective image representations for automated species classification. We show that exploiting cues at training time related to where and when a given image was captured can result in further improvements in classification performance. We demonstrate, across several different camera trapping datasets, that it is possible to achieve similar, and sometimes even superior, accuracy to fully supervised transfer learning-based methods while using ten times less manual supervision. Finally, we discuss some of the limitations of the outlined approaches and their implications for automated species classification from images. HTML XML PDF
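      Self-supervised pretraining of the kind described learns from unlabelled images by making two augmented views of the same image map to nearby representations. The sketch below shows the generic SimCLR-style recipe (two random views, a shared encoder, and an NT-Xent contrastive loss), following the approach the abstract cites (Chen et al. 2020); the tiny encoder, augmentations and random data are placeholders, not the authors' configuration.

```python
# Compressed sketch of contrastive self-supervised pretraining on unlabelled
# camera-trap images: two augmented views per image, a shared encoder, and an
# NT-Xent loss. General SimCLR-style recipe, not the authors' exact setup.
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandomHorizontalFlip(),
])

encoder = nn.Sequential(                      # stand-in for a real backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128),
)

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss: matching views attract, all other pairs repel."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)              # 2N x d
    sim = z @ z.t() / temperature                             # pairwise similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

images = torch.rand(16, 3, 128, 128)          # unlabelled batch
view1 = torch.stack([augment(img) for img in images])
view2 = torch.stack([augment(img) for img in images])
loss = nt_xent(encoder(view1), encoder(view2))
loss.backward()
print(float(loss))
```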
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Announcing Big-Bee: An initiative to promote understanding of bees through
           image and trait digitization

    • Abstract: Biodiversity Information Science and Standards 5: e74037
      DOI : 10.3897/biss.5.74037
      Authors : Katja Seltmann, Julie Allen, Brian Brown, Adrian Carper, Michael Engel, Nico Franz, Edward Gilbert, Chris Grinter, Victor Gonzalez, Pam Horsley, Sangmi Lee, Crystal Maier, Istvan Miko, Paul Morris, Peter Oboyski, Naomi Pierce, Jorrit Poelen, Virginia Scott, Mark Smith, Elijah Talamas, Neil Tsutsui, Erika Tucker : While bees are critical to sustaining a large proportion of global food production, as well as pollinating both wild and cultivated plants, they are decreasing in both numbers and diversity. Our understanding of the factors driving these declines is limited, in part, because we lack sufficient data on the distribution of bee species to predict changes in their geographic range under climate change scenarios. Additionally lacking is adequate data on the behavioral and anatomical traits that may make bees either vulnerable or resilient to human-induced environmental changes, such as habitat loss and climate change. Fortunately, a wealth of associated attributes can be extracted from the specimens deposited in natural history collections for over 100 years. Extending Anthophila Research Through Image and Trait Digitization (Big-Bee) is a newly funded US National Science Foundation Advancing Digitization of Biodiversity Collections project. Over the course of three years, we will create over one million high-resolution 2D and 3D images of bee specimens (Fig. 1), representing over 5,000 worldwide bee species, including most of the major pollinating species. We will also develop tools to measure bee traits from images and generate comprehensive bee trait and image datasets to measure changes through time. The Big-Bee network of participating institutions includes 13 US institutions (Fig. 2) and partnerships with US government agencies. We will develop novel mechanisms for sharing image datasets and datasets of bee traits that will be available through an open, Symbiota-Light (Gilbert et al. 2020) data portal called the Bee Library. In addition, biotic interaction and species association data will be shared via Global Biotic Interactions (Poelen et al. 2014). The Big-Bee project will engage the public in research through community science via crowdsourcing trait measurements and data transcription from images using Notes from Nature (Hill et al. 2012). Training and professional development for natural history collection staff, researchers, and university students in data science will be provided through the creation and implementation of workshops focusing on bee traits and species identification. We are also planning a short, artistic college radio segment called "the Buzz" to get people excited about bees, biodiversity, and the wonders of our natural world. HTML XML PDF
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Estimating the Completeness of Preserved Collections in Representing
           Global Biodiversity

    • Abstract: Biodiversity Information Science and Standards 5: e74032
      DOI : 10.3897/biss.5.74032
      Authors : Pieter Huybrechts, Maarten Trekels, Quentin Groom : There are an estimated 8.7 million eukaryotic species globally, and knowledge of those organisms is organised around their scientific names and the specimens we have of those species (Sweetlove 2011, Mora et al. 2011). Likewise, there are between 1.2 and 2.1 billion (10⁹) specimens held in biodiversity collections globally (Ariño 2010). These collections constitute an infrastructure and scientific tool to understand, catalogue and study biodiversity. Yet we find it hard to answer the simple question: how many species are in a collection? This is not trivial to answer: collections are not completely inventoried, do not use the same taxonomy, and the volume of data is vast (Samy et al. 2013, Ariño 2010). We have developed a method that allows us to take a list of collections and estimate the species richness contained within them. By doing this we will have a deeper insight into the scientific value of the world's biodiversity collections. Dealing with non-homogeneous, non-random and incomplete sampling of sites is a common issue that occurs in many ecological studies (Magurran and McGill 2011, Colwell et al. 2012, Gotelli and Colwell 2001). By using techniques and toolboxes such as iNEXT (Chao et al. 2014b) and vegan (Oksanen et al. 2020), we can estimate species richness under these conditions. In the case of collections, we consider not only the digitized and published proportion of preserved collections, but also make extrapolations to the specimens that have not yet made their way to the Global Biodiversity Information Facility (GBIF). Nevertheless, to calculate on such large datasets we need to employ innovative Big Data analytic tools. GBIF contains 1.8 billion observations that amount to 120 GB of compressed data. These can be interrogated in the cloud or locally using tools such as Galaxy, which has made it possible to process large numbers of records in a single batch. We can now evaluate the biodiversity within collections, divide the result by taxon and geographical region, and compare collections to one another. Ultimately, this work will allow individual collections and consortia to evaluate their coverage of biodiversity and help them better target their collecting strategies. HTML XML PDF
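      Richness estimation from incomplete sampling, as done with packages such as iNEXT and vegan, often rests on abundance-based estimators like Chao1, which corrects observed richness using the counts of species recorded exactly once and exactly twice. The small Python sketch below works through the classical Chao1 formula on toy data; it illustrates the principle only, not the project's full Galaxy-based workflow.

```python
# Worked sketch of the classical Chao1 richness estimator:
#   S_Chao1 = S_obs + f1^2 / (2 * f2)
# where f1 and f2 are the numbers of species recorded exactly once and twice.
# Illustrates the principle used by packages such as iNEXT and vegan; not the
# project's full workflow.
from collections import Counter

def chao1(abundances: list[int]) -> float:
    observed = sum(1 for n in abundances if n > 0)
    counts = Counter(abundances)
    f1, f2 = counts[1], counts[2]
    if f2 == 0:
        # Bias-corrected form used when no doubletons are present.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 ** 2 / (2 * f2)

# Toy collection: specimen counts per species in a digitised dataset.
specimens_per_species = [12, 7, 5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(specimens_per_species))  # 10 observed + 4^2/(2*2) = 14.0
```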
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Extracting Data at Scale: Machine learning at the Natural History Museum

    • Abstract: Biodiversity Information Science and Standards 5: e74031
      DOI : 10.3897/biss.5.74031
      Authors : Ben Scott, Laurence Livermore : The Natural History Museum holds over 80 million specimens and 300 million pages of scientific text. This information is a vital research tool to help solve the most important challenge humans face over the coming years – mapping a sustainable future for ourselves and the ecosystems on which we depend. Digitising these collections and providing the data in a structured, computable form is a mammoth challenge. As of 2020, less than 15% of available specimen information currently residing on specimen labels or physical registers is digitised and publicly available (Walton et al. 2020). Machine learning applications can deliver a step-change in our activities’ scope, scale, and speed (Borsch et al. 2020). As part of SYNTHESYS+, the Natural History Museum is leading the development of a cloud-based workflow platform for natural science specimens, the Specimen Data Refinery (SDR) (Smith et al. 2019). The SDR will provide a series of Machine Learning (ML) models, ranging from semantic segmentation to identify regions of interest on labels, to natural language processing to extract locality and taxonomic text entities from the labels, and image analysis to identify specimen traits and collection quality metrics. Each ML task is atomic, with users of the SDR selecting which model would best extract data from their digitised specimen images, allowing the workflows to be used in different institutions worldwide. It also solves one of the key problems in developing ML-based applications: the rapidity with which models become obsolete. New ML models can be introduced into the workflow, with incremental changes to improve processing, without interruption or refactoring of the pipeline. Alongside specimens, digitised images of pages of scientific literature provide another vital source of data. Functional traits mediate the interactions between plant species and their environment and play roles in determining species’ range size and threatened status. Such information is contained within the taxonomic descriptions of species, and a natural language processing library has been developed to locate and extract plant functional traits from these texts (Hoehndorf et al. 2016). The ML models allow complex interrelationships between taxa and trait entities to be inferred based on the grammatical structure of sentences, improving the accuracy and extent of data point extraction. These two projects, like many other applications of ML in natural history collections, are focused on the extraction of visible information, for example, a piece of text or a measurable trait. Given the image of the specimen or page, a person would be able to extract the self-same information. However, ML excels in pattern matching and inferring unknown characters from an entire corpus. At the museum, we have started exploring this space with our voyagerAI project for identifying specimens collected on historical expeditions of scientific discovery (e.g., the voyages of the Beagle and Challenger). This process fills in the gaps in specimen provenance and identifies 'lost' specimens collected by some of the most famous names in biodiversity history. Developing new applications of ML to uncover scientific meaning and tell the narratives of our collections will be at the forefront of our scientific innovation in the coming years. This presentation will give an overview of these projects, and our future plans for using ML to extract data at scale within the Natural History Museum. HTML XML PDF
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
       
  • Machine Learning for Species Identification: The Hebeloma Project from
           database to website

    • Abstract: Biodiversity Information Science and Standards 5: e73972
      DOI : 10.3897/biss.5.73972
      Authors : Peter Bartlett, Ursula Eberhardt, Nicole Schütz, Henry Beker : Attempts to use machine learning (ML) for species identification of macrofungi have usually involved the use of image recognition to deduce the species from photographs, sometimes combining this with collection metadata. Our approach is different: we use a set of quantified morphological characters (for example, the average length of the spores) and locality (GPS coordinates). Using this data alone, the machine can learn to differentiate between species. Our case study is the genus Hebeloma, fungi within the order Agaricales, where species determination is renowned as a difficult problem. Whether this is a result of recent speciation, the plasticity of the species, hybridization or stasis is a difficult question to answer. What is sure is that this has led to difficulties with species delimitation and consequently a controversial taxonomy. The Hebeloma Project, our attempt to solve this problem by rigorously understanding the genus, has been evolving for over 20 years. We began organizing collections in a database in 2003. The database now has over 10,000 collections, from around the world, with not only metadata but also morphological descriptions and photographs, both macroscopic and microscopic, as well as molecular data including at least an internal transcribed spacer (ITS) sequence (generally, but not universally, accepted as a DNA barcode marker for fungi (Schoch et al. 2012)), and in many cases sequences of several loci. Included within this set of collections are almost all type specimens worldwide. The collections in the database have been analysed and compared. The analysis uses both the morphological and molecular data as well as information about habitat and location. In this way, almost all collections are assigned to a species. This development has been enabled and assisted by citizen scientists from around the globe, collecting and recording information about their finds as well as preserving material. From this database, we have built a website, which updates as the database updates. The website (hebeloma.org) is currently undergoing beta testing prior to a public launch. It includes up-to-date species descriptions, which are generated by amalgamating the data from the collections of each species in the database. Additional tools allow the user to explore those species with similar habitat preferences, or those from a particular biogeographic area. The user is also able to compare a range of characters of different species via an interactive plotter. The ML-based species identifier is featured on the website. The standardised storage of the collection data in the database forms the backbone for the identifier. A portion of the collections in the database are (almost) randomly selected as a training set for the learning phase of the algorithm. The learning is “supervised” in the sense that collections in the training set have been pre-assigned to a species by expert analysis. With the learning phase complete, the remainder of the database collections may then be used for testing.
To use the species identifier on the website, a user inputs the same small number of morphological characters used to train the tool and it promptly returns the most likely species represented, ranked in order of probability.As well as describing the neural network behind the species identifier tool, we will demonstrate it in action on the website, present the successful results it has had in testing to date and discuss its current limitations and possible generalizations. HTML XML PDF
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
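      A minimal sketch (in Python, using scikit-learn) of the kind of supervised identifier described above: a small neural network trained on quantified morphological characters plus coordinates, returning species ranked by probability. The feature set, network architecture, and data below are invented for illustration; the abstract does not specify them.

      import numpy as np
      from sklearn.neural_network import MLPClassifier

      # Hypothetical feature layout: [mean spore length (µm), mean spore width (µm),
      # latitude, longitude]. Synthetic data stands in for real collection records.
      rng = np.random.default_rng(0)
      X_train = rng.normal(loc=[10.0, 6.0, 50.0, 10.0],
                           scale=[1.0, 0.5, 10.0, 10.0], size=(300, 4))
      y_train = rng.integers(0, 3, size=300)  # placeholder labels for three species

      # "Supervised" learning: training collections are pre-assigned to species.
      clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
      clf.fit(X_train, y_train)

      # A user-style query with the same small set of characters, returning the
      # most likely species ranked in order of probability.
      query = np.array([[10.4, 6.1, 52.0, 9.5]])
      probs = clf.predict_proba(query)[0]
      for species_id, p in sorted(zip(clf.classes_, probs), key=lambda s: s[1], reverse=True):
          print(f"species {species_id}: probability {p:.2f}")

      Collections held back from training would then serve as the test set, as the abstract describes.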
       
  • Use of Target Species in Citizen Science Fungi Recording Schemes

    • Abstract: Biodiversity Information Science and Standards 5: e73960
      DOI : 10.3897/biss.5.73960
      Authors : Tom May : Observational records of fungi by citizen scientists have mushroomed over the last three decades, especially those submitted via on-line platforms, increasingly accompanied by images. For example, Research Grade observations of Fungi in iNaturalist have increased from just over 5,000 for 2010 to more than 400,000 for 2020, with annual rates of increase of more than 60% in recent years.

      A feature of fungi records on platforms such as iNaturalist and Mushroom Observer is that the identification of numerous images remains unconfirmed. Of the more than 4 million observations of fungi in iNaturalist, more than 70% are not confirmed as Research Grade, either because the identification is not to species, or because the minimum number of confirming identifications has not been reached.

      Images remain unidentified for several reasons, including that characters necessary for identification are not visible. This aside, many field images are of species of fungi whose identification is challenging, due to subtle macroscopic distinguishing features or because microscopic or DNA characters are required for accurate identification. Even among identified records, misidentifications are common in both observational and herbarium records, due to misapplication of names from one geographic area to another and to numerous undescribed species (coupled with the tendency for naive observers to over-identify their observations).

      One strategy to deal with high under- or mis-identification rates is the use of target species: species selected and presented as readily identifiable. Given that citizen science platforms have wide appeal, and many users do not have expert knowledge of fungi, target species make initial engagement more satisfying by facilitating the identification of at least some observations, by both the observer and subsequent identifiers.

      Target species selection can be based on a range of factors. From the observer's point of view, species that are common and widespread provide the advantage that the observer has a reasonable chance of encountering some species on any excursion. Selection can be further stratified by habits, hosts and substrates. Diversity of morphological and trophic groups among targets introduces recorders to major groups and educates them about the way fungi interact with their environment and other organisms.

      The most important aspect of target species is identifiability. Expert knowledge of the species that could be encountered must be used to select them. Monographs of fungi tend to focus on differentiation from taxonomically related species, often using microscopic characters. In providing information on target species, it is vital to provide comparisons to look-alike (macroscopically similar) species, whether related or not and whether formally described or not.

      In Australia, Fungimap commenced in 1995 as a fungi mapping scheme. Initially eight target species were selected, growing to 200 species. Key elements in the success of the scheme included: (1) a regular Fungimap Newsletter; (2) an illustrated guide to the first 100 target species (Fungi Down Under, published in 2005), in which the inclusion of maps for all species was a spur for observers to fill and extend distributions, which at that stage were often patchy; (3) a small team of identifiers, who checked incoming records; and (4) training opportunities via workshops and forays.

      Fungimap records were initially handled in-house in a purpose-built database that lacked a web interface but could handle input from spreadsheets. Records are regularly supplied to the Atlas of Living Australia and thence to the Global Biodiversity Information Facility. Observers are now encouraged to use the Fungimap Australia project in iNaturalist.

      Use of target species significantly increased the number and geographic spread of records. For example, prior to 1990, the highly distinctive Pixie's Parasol (Mycena interrupta) was known from few specimens (17 unique databased specimens). Inclusion as a target species has yielded more than 2,300 observation records contributed specifically to Fungimap. There are now more than 3,400 observations of the species, of which 99% were contributed since 1990. These data allow presentation of mature distribution maps in contexts such as the Australian State of the Environment report for 2016.

      In relation to conservation threat assessments, data on target species can support apparent rarity by comparing records of rare species against those of more common species on the same list. The assessment of Tea-tree Fingers (Hypocreopsis amplectens) as Critically Endangered on the IUCN Red List of Threatened Species was supported by the fact that this species had been a Fungimap target since 1999, yet at the time of the assessment in 2019 was known from only four sites.

      Challenges in the use of target species include: (1) adjusting lists to incorporate new taxonomies without confusing recorders; (2) dealing with species that are not formally described, such as those with “field” names; (3) communicating to recorders not engaged with local networks that species belong to target sets; and (4) growing target species lists to maintain engagement. Nevertheless, target species are useful for observers and identifiers, and expert categorisation of the “identifiability” of species could be a useful feature to add to aggregated data sets, for use as a potential filter (a small sketch of such a filter follows this entry).
      PubDate: Tue, 7 Sep 2021 11:00:00 +0300
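      A small sketch (in Python) of the filtering idea raised in the closing sentence: records from an aggregated data set are kept or dropped according to an expert-assigned “identifiability” category for each species. The categories and field names below are invented for illustration.

      from dataclasses import dataclass

      # Hypothetical expert-assigned identifiability categories per species.
      IDENTIFIABILITY = {
          "Mycena interrupta": "field-identifiable",
          "Hypocreopsis amplectens": "field-identifiable",
          "Hebeloma crustuliniforme": "microscopy-required",
      }

      @dataclass
      class Occurrence:
          species: str
          recorded_by: str
          year: int

      def filter_by_identifiability(records, allowed=frozenset({"field-identifiable"})):
          """Keep only records of species whose assigned category is in `allowed`."""
          return [r for r in records if IDENTIFIABILITY.get(r.species) in allowed]

      records = [
          Occurrence("Mycena interrupta", "observer_1", 2016),
          Occurrence("Hebeloma crustuliniforme", "observer_2", 2018),
      ]
      print(filter_by_identifiability(records))  # only the Mycena interrupta record remains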
       
 