Biodiversity Information Science and Standards
  This is an Open Access journal
   ISSN (Online) 2535-0897
   Published by Pensoft
  • Everything happens somewhere, multiple times

    • Abstract: Biodiversity Information Science and Standards 1: e21383
      DOI : 10.3897/tdwgproceedings.1.21383
      Authors : Javier de la Torre : Everything happens somewhere, and many of these things get recorded, with many different standards. The geospatial community, through the Open Geospatial Consortium, has been one of the most prolific communities, after TDWG, in creating standards. These standards have helped in many ways to open the industry and foster innovation, but in some cases they have produced the opposite effect. Standards that are created external to the development process of the applications that use them are often difficult to implement, and ultimately superseded by the de facto standards that are driven by a specific community. Thus too much standardization architecture has somehow produced a disconnect between the creation of standards and their actual usage. If standards are too hard to follow they can stop innovation on the implementation side, and alternative standards are created. Over the past few years, a set of innovative companies and open source projects have been revolutionizing the way maps and location data are managed, used and shared. They have done this while circumventing a lot of standards—in fact creating a set of de facto standards that now are being widely adopted. In this talk, I will go over some of the lessons learnt while founding Vizzuality and CARTO, connections to TDWG, and the broader goal of connecting Biodiversity Informatics with the current state of the wider Location Intelligence world. HTML XML PDF
      PubDate: Mon, 2 Oct 2017 13:44:46 +0300
       
  • Standardizing Citizen Science

    • Abstract: Biodiversity Information Science and Standards 1: e21123
      DOI : 10.3897/tdwgproceedings.1.21123
      Authors : Anne Bowser : Citizen science engages members of the public in collecting and mobilizing information for research and decision-making. While citizen science is well known for supporting biodiversity research and monitoring at national and global scales, many projects also engage the public in areas including local environmental monitoring and participatory health research. Beyond data collection, volunteers increasingly participate in all stages of the scientific research process, including data analysis and project or protocol design. The use of standards can help scientists and volunteers collect, exchange, and understand information within and beyond the initial data collection context. But the state of data standardization and interoperability in citizen science is currently limited. Not all projects wish to collect standardized data or make their data open for re-use. Further, diverse stakeholders, including community members, regulatory agencies, and research scientists, may disagree about what types of information are relevant and what formats information should take (Gobel et al. 2017). And while citizen science is valued as a method that can help address complex global problems, integrating a range of interdisciplinary data from diverse sources, as required by projects like Global Mosquito Alert (Bowser et al. 2017), can be a formidable technical challenge. In 2015, a consortium of individuals dedicated to understanding and addressing these challenges was formalized through the Data and Metadata Working Group of the Citizen Science Association (CSA). The Working Group is making progress on a number of challenges, including developing an evolving data and metadata standard and ontology called Public Participation in Scientific Research CORE, or PPSR_CORE (Cavalier et al. 2015). To be effective and impactful, PPSR_CORE will need to be compatible with existing standards like Darwin Core, and also develop a shared vocabulary for documenting important aspects of participation like data quality and fitness for purpose. This talk explores the social and technical challenges of standardizing citizen science, shares the efforts already underway, and encourages the TDWG community to come together and contribute to a common agenda for research and capacity building. HTML XML PDF
      PubDate: Mon, 25 Sep 2017 23:18:44 +0300
       
  • Interoperability, Attribution, and Value in the Web of Natural History
           Museum Data

    • Abstract: Biodiversity Information Science and Standards 1: e21095
      DOI : 10.3897/tdwgproceedings.1.21095
      Authors : Andrew Bentley : Collections, aggregators, collaborative digitization projects, publishers, researchers, and external users are actors in a complex web of biological specimen data interactions, workflows, and pipelines. The software that mediates interactions among these diverse players enables the creation and delivery of species occurrence data from specimens to a growing set of research data consumers. Informaticists have made great strides in developing the individual services, standards and functions; researchers can now almost effortlessly discover and access massive amounts of museum data to address important, integrative science questions. We need to continue to refine individual tools and capabilities that are part of collection data pipelines, and more emphasis is needed on better integration to ensure the automatic transfer of data, enabling museum data pipelines to work with little or no manual intervention. Also, in order for community systems to benefit all parties, specimen data resources not only need to be efficiently aggregated downstream, but 'value adds' need to flow in the reverse direction, upstream to collections, for their benefit, to recognize their role and facilitate their sustainability. There are valuable, unrealized benefits that collections could be accruing from their participation in aggregation architectures and from the subsequent use of their data by researchers. The Biodiversity Collections Network (BCoN), a US NSF-funded Research Coordination Network project, is planning a series of workshops in collaboration with other collections community groups to identify gaps and one-way traps in the collections data pipeline. The meetings will explore pathways for more effective integration and value distribution in the chain that connects collections, aggregators and data consumers. This talk will highlight relevant examples and outline BCoN’s vision in this area. HTML XML PDF
      PubDate: Wed, 20 Sep 2017 21:35:37 +0300
       
  • Linking Heterogeneous Data in Biodiversity Research

    • Abstract: Biodiversity Information Science and Standards 1: e21113
      DOI : 10.3897/tdwgproceedings.1.21113
      Authors : Pamela Soltis : Emerging cyberinfrastructure and new data sources provide unparalleled opportunities for mobilizing and integrating massive amounts of information from organismal biology, ecology, genomics, climatology, and other disciplines. Key among these data sources is the rapidly growing volume of digitized specimen records from natural history collections. With millions of specimen records currently available online, these data provide excellent information on species distributions, changes in distributions over time, phenology, and a host of traits. Particularly powerful is the integration of phylogenies with specimen data, enabling analyses of phylogenetic diversity in a spatio-temporal context, the evolution of niche space, and more. However, a major impediment is the heterogeneous nature of complex data, and new methods are needed to link these divergent data types. Challenges involve assembly, management, and sharing of data, taxonomic name resolution, the patchy nature of data availability, varying scales of data collection, and data integration. Through case studies that link and analyze specimen data and related heterogeneous data sources to address a range of evolutionary and ecological problems, we will explore the specific challenges encountered and how these challenges may be overcome. Although many specific hypotheses may be addressed through integrated analyses of linked biodiversity and environmental data, the additional value of such data-enabled science lies in the unanticipated patterns that emerge. HTML XML PDF
      PubDate: Wed, 20 Sep 2017 16:27:43 +0300
       
  • Development of a National Repository for Aquatic Biodiversity in Bhutan

    • Abstract: Biodiversity Information Science and Standards 1: e20809
      DOI : 10.3897/tdwgproceedings.1.20809
      Authors : Sangay Dema, Choki Gyeltshen, Thomas Vattakaven, Prabhakar Rajagopal : In response to a request from the Royal Government of Bhutan, the World Bank commissioned a study on the sustainable development of hydropower in Bhutan. The study identified loss and decline of aquatic biodiversity as one of the major potential environmental impacts of hydropower development in Bhutan. Access to information on aquatic biodiversity is of utmost importance in planning and designing of new hydropower projects in Bhutan. This data is essential for planners to avoid, minimize and effectively mitigate potential adverse impacts on aquatic biodiversity. However, access to this information is not easy. With the objective of making aquatic biodiversity data accessible, key stakeholders within Bhutan have taken the initiative to create and maintain a national data repository for aquatic biodiversity within the country. An inventory and gap analysis of aquatic biodiversity data in Bhutan was done to summarize the available data and information on aquatic biodiversity; stakeholder meetings were held to obtain feedback for the repository; and a plan of action has been formulated for creating the repository. Bhutan already maintains a rich biodiversity information repository - the Bhutan Biodiversity Portal (BBP, http://biodiversity.bt/), under the aegis of the National Biodiversity Centre (NBC), Ministry of Agriculture and Forests, Serbithang, Thimphu. The platform is powered by the open source Biodiversity Informatics Platform that also powers the India Biodiversity Portal (IBP) and a portal on Weed Identification and Knowledge in the Western Indian Ocean (WIKWIO). Additional data fields and functionality were identified to extend the functionality of BBP to cover the needs of the stakeholders. Primarily, interfaces will be built to upload already available datasets on organisms that have been surveyed and identified, as well as newer aquatic biodiversity data that will be generated by surveys and monitoring in future. The portal will facilitate upload of data that captures observed characteristics, e.g., life stage, body size, reproductive state; environmental variables of the locality of occurrence; and other sampling data such as sampling gear and mode of observation. It will also enhance species knowledge by adding the ability to link existing species pages with international databases such as IUCN and FishBase, as well as fields to store voucher specimen details and ecological status. All tabular data that is added will be synchronised to either standard observation fields, custom observation fields that are relevant to aquatic biodiversity, or to species traits. Data that cannot be categorised under any of the above will be stored as key-value pairs. The data upload module will have metadata marked up to the Ecological Metadata Language (EML) specification and data will be available for exchange using the Darwin Core (DwC) standards. The platform will be enabled with an enhanced search and serve function through easy-to-use query panels. Uploaded data will be aggregated and visualised on the portal along spatial, temporal and taxonomic axes. Furthermore, it will be available for stakeholders to download under Creative Commons licences for further processing and planning. The creation of the repository will be complemented by training the stakeholders in data curation and developing a campaign to build awareness of the portal within the community of stakeholders. 
The establishment of this repository will provide a guide to conserve aquatic biodiversity, maintain ecosystem functioning, and protect livelihoods and food security dependent upon aquatic biodiversity. It will also contribute to the open source biodiversity informatics platform and be available to all other instances of the portal. This will help in enriching the functions of the open source platform and provide value to conservation of biodiversity in other areas of the world. HTML XML PDF
      PubDate: Tue, 5 Sep 2017 19:24:07 +0300
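The routing of uploaded tabular data described above (standard observation fields, custom fields, or residual key-value pairs) can be pictured with a small sketch. The Python fragment below is illustrative only: the column names and mapping are hypothetical, not the Bhutan Biodiversity Portal's actual schema.

```python
# Minimal sketch (not BBP code): route columns of an uploaded aquatic survey row either
# to Darwin Core terms or, when no mapping exists, to residual key-value pairs.
# The source column names and the mapping below are hypothetical.

DWC_MAPPING = {            # source column -> Darwin Core term
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
    "life_stage": "lifeStage",
    "sampling_gear": "samplingProtocol",
}

def standardise_row(row):
    """Split one tabular record into Darwin Core fields and residual key-value pairs."""
    dwc_record, extra_pairs = {}, {}
    for column, value in row.items():
        if column in DWC_MAPPING:
            dwc_record[DWC_MAPPING[column]] = value
        else:
            extra_pairs[column] = value      # kept as key-value pairs
    return dwc_record, extra_pairs

row = {"species": "Schizothorax richardsonii", "lat": 27.47, "lon": 89.64,
       "date": "2017-05-12", "sampling_gear": "cast net", "water_temp_c": 14.2}
print(standardise_row(row))
```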
       
  • IndexMEED case studies using "Omics" data with graph theory

    • Abstract: Biodiversity Information Science and Standards 1: e20740
      DOI : 10.3897/tdwgproceedings.1.20740
      Authors : Romain David, Jean-Pierre Féral, Anne-Sophie Archambeau, Fanny Arnaud, David Auber, Nicolas Bailly, Loup Bernard, Laure Berti-Equille, Cyrille Blanpain, Vincent Breton, Anne Chenuil-Maurel, Anna Cohen Nabeiro, Alrick Dias, Aurélie Delavaud, Robin Goffaud, Sophie Gachet, Karina Gibert, Manuel Herrera Fernandez, Luc Hogie, Dino Ienco, Romain Julliard, Yvan Le Bras, Julien Lecubin, Yannick Legre, Michelle Leydet, Grégoire Lois, Bénédicte Madon, François Marchal, Victor Mendez Munoz, Jean-Charles Meunier, Jean-Baptiste Mihoub, Isabelle Mougenot, Sophie Pamerlon, Eric Peletier, Geneviève Romier, Dad Roux-Michollet, Alison Specht, Christian Surace, Jean-Claude Raynal, Thierry Tatoni : Data produced within marine and terrestrial biodiversity research projects that evaluate and monitor Good Environmental Status have a high potential for use by stakeholders involved in environmental management. However, environmental data, especially in ecology, are not readily accessible to various users. The specific scientific goals and the logics of project organization and information gathering lead to a decentralized data distribution. In such a heterogeneous system across different organizations and data formats, it is difficult to efficiently harmonize the outputs. Few tools are available to assist. For instance, standards and specific protocols can be applied to interconnect databases. Such semantic approaches greatly increase data interoperability. This communication presents recent results and the activity of the IndexMEED consortium (Indexing for Mining Ecological and Environmental Data), which aims to build new approaches to investigate complex research questions, and support the emergence of new scientific hypotheses based on graph theory (Auber et al. 2014). Current developments in data mining based on graphs, as well as the potential for relevant contributions to environmental research, particularly about strategic decision-making, and new ways of organizing data will be presented (David et al. 2015). In particular, the consortium makes decisions on how i) to analyze heterogeneous distributed data spread throughout different databases combining molecular and habitat characteristics data [3], ii) to create matches and incorporate some approximations, iii) to identify statistical relationships between observed data and the emergence of contextual patterns using a calculation library and distributed calculation center at the European level, iv) to encourage openness and data sharing while complying with the general principles of FAIR (Findable, Accessible, Interoperable, Re-usable and citable) in order to enhance data value and their utilization. IndexMEED participants are now exploring the ability of two scientific communities (ecology sensu lato and computer sciences) to work together, using several case studies. The ECOSCOPE project aims to meet the need to access structured and complementary omics-datasets to better understand biodiversity state and its dynamics. Indeed, the ECOSCOPE case study aims to visualize, through the graph approach, links between datasets and databases from genetics to ecosystems. Another case study, displaying anthropology fossils and omics on the same graph, will also be presented. 
The DEVOTES (DEVelopment Of innovative Tools for understanding marine biodiversity and assessing good Environmental Status) and CIGESMED (Coralligenous based Indicators to evaluate and monitor the "Good Environmental Status" of the MEDiterranean coastal water) European projects, conducted by IMBE, are focused on photo quadrats, cartography and omics data of the marine hard bottom in order to discover context patterns helpful for building decision support systems. The French project "65 Millions d'observateurs" is a case study testing AskOmics to provide a graph-based querying interface using RDF (Resource Description Framework) and SPARQL technologies. Scientific questions can be resolved by the new data mining approaches that offer new ways to investigate heterogeneous environmental data with graph mining (Muñoz et al. 2017). The uses of data from biodiversity research demonstrate the prototype functionalities (David et al. 2016) and introduce new perspectives to analyze environmental and societal responses, including decision-making at large scale, at the information system, observing system, and observed system levels. HTML XML PDF
      PubDate: Fri, 1 Sep 2017 21:55:10 +0300
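A toy illustration of the graph-based linking idea described above, assuming the networkx Python library; the node identifiers and relations are invented and do not come from IndexMEED.

```python
# Illustrative sketch only: datasets, taxa and habitats become nodes of one graph,
# so records from heterogeneous databases can be connected by simple traversal.
import networkx as nx

G = nx.Graph()
# Hypothetical records from two databases sharing a taxon and a habitat type
G.add_edge("molecular_db:seq_0421", "taxon:Paracentrotus lividus", relation="identified_as")
G.add_edge("habitat_db:site_12", "taxon:Paracentrotus lividus", relation="observed_at")
G.add_edge("habitat_db:site_12", "habitat:coralligenous", relation="classified_as")

# Which molecular records are connected, within three hops, to coralligenous habitat?
reachable = nx.single_source_shortest_path_length(G, "habitat:coralligenous", cutoff=3)
print([node for node in reachable if node.startswith("molecular_db:")])
```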
       
  • Pipedream or pipeline: delivering regular, reliable, up-to-date
           information on biodiversity through repeatable workflows

    • Abstract: Biodiversity Information Science and Standards 1: e20749
      DOI : 10.3897/tdwgproceedings.1.20749
      Authors : Quentin Groom, Tim Adriaens, Diederik Strubbe, Sonia Vanderhoeven, Peter Desmet : The current paradigm for studies on biodiversity change is the single study, of finite duration, with a single published output. Yet the results of such a workflow become out-of-date quickly, particularly as the speed of environmental change increases. If new environmental policies are implemented it is important to monitor their effects, which implies having results from before and after the policy implementation. Furthermore, given the difficulty of influencing policy, the results of analysis need to be reliable, have clearly communicated uncertainty, and should be open to scrutiny. The timely provision of such information could be possible by using open data, shared standards, and automation. The TrIAS (Tracking Invasive Alien Species) project in Belgium is attempting to build a workflow from raw biodiversity data to policy advice, specifically to provide useful information on alien species and their associated risks. We are developing scripts (R, Python) to simplify the repeated Darwin Core standardization of species checklists and observations from a wide range of sources and their publication to the Global Biodiversity Information Facility. We also aim to propose controlled vocabularies for alien species related Darwin Core terms where these data are needed for downstream analysis. Challenges include entrenched non-standard working methods, heterogeneity of data availability, and the sheer complexity of the biosphere itself. We will discuss our plans, the obstacles and potential solutions. Furthermore, we look to the future for what we might be able to achieve if we are successful. HTML XML PDF
      PubDate: Fri, 1 Sep 2017 8:36:43 +0300
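One small, repeatable standardization step of the kind described above is mapping verbatim source values to a controlled vocabulary for a Darwin Core term before publication. A hedged Python sketch follows; the vocabulary and source values are hypothetical, not TrIAS's actual mapping.

```python
# Sketch only: normalise verbatim source values to one controlled vocabulary for the
# Darwin Core term establishmentMeans before publishing a checklist. Illustrative values.

ESTABLISHMENT_MEANS = {
    "exoot": "introduced",        # Dutch source value
    "alien": "introduced",
    "indigenous": "native",
    "inheems": "native",
}

def standardise_establishment_means(verbatim):
    value = verbatim.strip().lower()
    if value not in ESTABLISHMENT_MEANS:
        raise ValueError(f"unmapped value: {verbatim!r}")   # flag for manual review
    return ESTABLISHMENT_MEANS[value]

checklist_row = {"scientificName": "Psittacula krameri", "establishmentMeans": "Exoot"}
checklist_row["establishmentMeans"] = standardise_establishment_means(
    checklist_row["establishmentMeans"])
print(checklist_row)
```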
       
  • Using MIxS: An Implementation Report from Two Metagenomic Information
           Systems

    • Abstract: Biodiversity Information Science and Standards 1: e20637
      DOI : 10.3897/tdwgproceedings.1.20637
      Authors : Joel Sachs, Luke Thompson, Nazir El-Kayssi, Satpal Bilkhu : MIxS (Minimum Information about any Sequence) (Yilmaz et al. 2011) is a metadata standard of the Genomics Standards Consortium (GSC), designed to make sequence data findable, accessible, and interoperable. It contains fields for recording physical and chemical characteristics of the sampling environment, geographical and habitat information, and other metadata about the sample and its provenance, which are critical for downstream interpretation of data derived from the sample. We will present our experience implementing MIxS in two metagenomic information systems – the Earth Microbiome Project (EMP) and the Government of Canada (GoC) Ecobiomics project. The EMP (Gilbert et al. 2014) is an ongoing effort to crowdsource environmental microbiome samples from around Earth, then sequence and analyze them using a standardized workflow. The EMP has aggregated and sequenced over 50,000 samples, which are queryable using a publicly available catalogue. A meta-analysis of the first 25,000 samples is currently in review. MIxS and the Environment Ontology (ENVO) (Buttigieg et al. 2016) have been useful in structuring environmental metadata from EMP studies. For the particular application of the EMP meta-analysis, however, several issues were encountered: often there are multiple possible 'correct' assignments to the biome, feature, and material fields; the fields are not hierarchical, limiting logical organization; and the primary ecological factors differentiating microbial communities are not captured. In response to these challenges, the EMP team worked with the ENVO team to devise a new hierarchical structure, the EMP ontology (EMPO), that captures the primary axes along which microbial communities tend to be structured (host-associated or not, saline or not). EMPO is an application ontology, with a formally defined W3C Web Ontology Language (OWL) document mapping to existing ontologies, enabling reuse by the microbial ecology community. Ecobiomics is a joint project of multiple GoC departments and involves the complete workflow, from sampling in a variety of aquatic, soil, and benthic environments, through sample prep, DNA extraction, library prep, sequencing, and analysis. In contrast to the EMP—where some of the samples and metadata had been collected before the establishment of the MIxS standards—the Ecobiomics project has been able to create metadata profiles for each sub-project to conform to, extend, and build upon the existing MIxS standards. Despite these two different contexts, EMP and Ecobiomics encountered a number of common issues that prevented a complete implementation of MIxS. These issues include ambiguous term names and definitions; inconsistencies amongst the environmental packages; non-standard ways of dealing with units; and a number of issues surrounding ENVO (the Environment Ontology), which is required for filling out the mandatory MIxS fields "Environmental material", "Biome", and "Environmental feature". We will describe these issues, and, more generally, the successes and challenges of our implementations. HTML XML PDF
      PubDate: Mon, 28 Aug 2017 14:41:55 +0300
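For readers unfamiliar with MIxS, a minimal sample-metadata record might look like the sketch below. Field short names follow common usage but vary between MIxS versions and environmental packages, and the ENVO identifiers are placeholders, not looked-up terms.

```python
# Minimal sketch of a MIxS-style metadata record for one environmental sample.
# Field names are indicative only and the ENVO identifiers are placeholders.

sample_metadata = {
    "sample_name": "lake_soil_0042",
    "collection_date": "2016-07-19",
    "lat_lon": "45.52 -73.57",
    "geo_loc_name": "Canada: Quebec",
    "env_biome": "freshwater lake biome [ENVO:placeholder]",
    "env_feature": "lake [ENVO:placeholder]",
    "env_material": "sediment [ENVO:placeholder]",
    "seq_meth": "Illumina MiSeq",
}

# A project profile, as in the Ecobiomics approach, can extend the core checklist with
# its own mandatory fields and validate submissions against them.
REQUIRED = {"sample_name", "collection_date", "env_biome", "env_feature", "env_material"}
missing = REQUIRED - sample_metadata.keys()
assert not missing, f"missing mandatory fields: {missing}"
```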
       
  • SeqDB: Biological Collection Management with Integrated DNA Sequence
           Tracking 

    • Abstract: Biodiversity Information Science and Standards 1: e20608
      DOI : 10.3897/tdwgproceedings.1.20608
      Authors : Satpal Bilkhu, Nazir El-Kayssi, Matthew Poff, Anthony Bushara, Michael Oh, Joseph Giustizia, Iyad Kandalaft, Christine Lowe, Oksana Korol, Joel Sachs, Keith Newton, James Macklin : Agriculture and Agri-Food Canada (AAFC) is home to a world-class taxonomy program based on Canada’s national agricultural collections for Botany, Mycology and Entomology. These collections contain valuable resources, such as type specimens for authoritative identification using approaches that include phenotyping, DNA barcoding, and whole genome sequencing. These authoritative references allow for accurate identification of the taxonomic biodiversity found in environmental samples in fields such as metagenomics. AAFC’s internally developed web application, termed SeqDB, tracks the complete workflow and provenance chain from source specimen information through DNA extractions, PCR reactions, and sequencing leading to binary DNA sequence files. In the context of Next Generation Sequencing (NGS) of environmental samples, SeqDB tracks sampling metadata, DNA extractions, and library preparation workflow leading to demultiplexed sequence files. SeqDB implements the Taxonomic Databases Working Group (TDWG) Darwin Core standard (Wieczorek et al. 2012) for Biodiversity Occurrence Data, as well as the Genome Standards Consortium (GSC) Minimum Information about any (X) Sequences (MIxS) specification (Yilmaz et al. 2011). When coupled with the built-in data standards validation system, this has led to the ability to search consistent metadata across multiple studies. Furthermore, the application enables tracking the physical storage of the aforementioned specimens and their derivative molecular extracts using an integrated barcode printing and reading system. All the information is presented using a graphical user interface that features intuitive molecular workflows as well as a RESTful API that facilitates integration with external applications and programmatic access of the data. The success of SeqDB has been due to the close collaboration with scientists and technicians undertaking molecular research involving the national collection, and the centralization of their data sets in an access controlled relational database implementing internationally recognized standards. We will describe the overall system, and some of our lessons learned in building it. HTML XML PDF
      PubDate: Sat, 26 Aug 2017 23:42:33 +0300
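The abstract notes that SeqDB exposes a RESTful API for programmatic access. The sketch below is purely hypothetical: the endpoint paths and response handling are invented for illustration and are not SeqDB's documented API.

```python
# Hypothetical sketch only -- invented endpoints, not SeqDB's actual API.
import requests

BASE = "https://seqdb.example.org/api/v1"   # placeholder host

def sequences_for_specimen(specimen_id, token):
    """Fetch a specimen record and the DNA extracts/sequences derived from it."""
    headers = {"Authorization": f"Bearer {token}"}
    specimen = requests.get(f"{BASE}/specimens/{specimen_id}", headers=headers).json()
    sequences = requests.get(f"{BASE}/specimens/{specimen_id}/sequences",
                             headers=headers).json()
    return specimen, sequences
```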
       
  • Phenology atlas use cases: a new map of plant phenology across North
           America and beyond

    • Abstract: Biodiversity Information Science and Standards 1: e20582
      DOI : 10.3897/tdwgproceedings.1.20582
      Authors : Zoe Panchen, Jonathan Davies : The goal of the phenology atlas workshop is to explore the development of a platform that would provide capabilities for analysing and visualising phenology data from multiple sources. The atlas would incorporate species-based, location-based and phenophase-based views. Here we provide an overview of potential phenology atlas use cases and present a conceptual framework that could be developed to construct generalizable models of plant phenology. Different species respond to different environmental cues; however, by co-opting statistical tools from the species distribution modelling (SDM) literature, it may be possible to construct flexible models that can be applied across species to capture timing of green-up or first flower across North America (and beyond). This approach would allow us to generate a probability map of observing a particular species’ phenological event in a particular location given climate and date. As an illustration, we present a simple model where phenology observations are a binary variable, and day of year and monthly climate data are predictors of observing the event. With such models, it could then be possible to tap into projected climate scenarios from General Circulation Models (GCMs), to construct future phenology scenarios. Linked with locality data, it might also be possible to make projections of when and which species will be flowering where (given a date in the future). This information might be interesting to researchers exploring novel species interactions and potential for phenological mismatches under future climate change. HTML XML PDF
      PubDate: Fri, 25 Aug 2017 20:00:05 +0300
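The simple model described above (a binary phenology observation predicted from day of year and monthly climate) can be fitted with off-the-shelf logistic regression; a toy Python example with fabricated data follows.

```python
# Toy illustration of the simple binary model described in the abstract; all data
# values are fabricated for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: day of year, mean monthly temperature (deg C)
X = np.array([[95, 4.1], [110, 6.3], [120, 8.0], [130, 9.5], [150, 12.2],
              [100, 5.0], [140, 11.0], [160, 13.5], [90, 3.0], [125, 8.8]])
y = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])   # 1 = species observed in flower

model = LogisticRegression().fit(X, y)

# probability of observing flowering on a given date under a given climate value
print(model.predict_proba([[115, 7.5]])[0, 1])
```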
       
  • Phenological sensitivity to temperature at broad scales: opportunities
           and challenges of natural history collections

    • Abstract: Biodiversity Information Science and Standards 1: e20587
      DOI : 10.3897/tdwgproceedings.1.20587
      Authors : Heather Kharouba, Mark Vellend : The seasonal timing of biological events (i.e. phenology) has been frequently observed to shift in response to recent climate change. While many of these events now occur earlier due to warmer temperatures, there is considerable variation in the direction and magnitude of these shifts across species. This variation could have consequences for species interactions and ecological communities, especially when the relative timing of key life cycle events among species is disrupted. As a first step to better understand the causes and consequences of variation in species’ phenological responses to climate change, we used natural history collections to quantify and compare broad-scale patterns in phenology-temperature relationships for Canadian butterflies and their nectar food plants over the past century. The phenology of both groups advanced in response to warmer temperatures - both across years and sites. Across butterfly-plant associations, flowering time was significantly more sensitive to temperature than the timing of butterfly flight. However, the sensitivities were not correlated across associations. The findings we will present indicate that warming-driven shifts in the timing of species interactions are likely to be prevalent. The opportunities and challenges associated with using natural history collections for detecting and linking phenological responses to climate change will also be discussed. HTML XML PDF
      PubDate: Fri, 25 Aug 2017 3:44:26 +0300
       
  • Rewards and Challenges of eDNA Sequencing with Multiple Genetic Markers
           for Marine Observation Programs

    • Abstract: Biodiversity Information Science and Standards 1: e20548
      DOI : 10.3897/tdwgproceedings.1.20548
      Authors : Kathleen Pitz, Collin Closek, Anni Djurhuus, Reiko Michisaki, Kristine Walz, Alexandria Boehm, Mya Breitbart, Ryan Kelly, Francisco Chavez : Metabarcoding of environmental DNA (eDNA) samples holds new promise to increase our ability to measure changes in biodiversity and community composition over time. It can allow the characterization of large groups of organisms where traditional sampling may be impractical or not cost-effective. However, it is still unclear how best to compare and combine this information with morphological counts in order to inform policies and biodiversity metrics that are based on traditional sampling results. Under the Marine Biodiversity Observation Network (MBON) initiative, multiple taxonomic marker genes (16S rRNA, 18S rRNA, mitochondrial cytochrome c oxidase subunit I (COI), and 12S rRNA) have been used concurrently to examine the phylogenetic diversity of samples across trophic levels from microbes to vertebrates. Marker genes and their amplification primers target a different (and sometimes overlapping) group of organisms. Just as with traditional sampling methods, each has biases towards detecting certain organisms over others. Though eDNA metabarcoding often detects many more species than can be identified through microscopic or macroscopic net tow counts, processing and relating sequence data to traditional counts and biodiversity measures is an ongoing challenge. For samples collected within the MBON project, an analysis pipeline has been adapted to standardize sequence analysis of each marker gene. The pipeline processes reads from quality control and trimming through clustering of sequences into Operational Taxonomic Units (OTUs). Taxonomic identification of OTUs uses publicly available sequence databases. Finally, the results of the analysis pipeline are combined into a Biological Observation Matrix (BIOM) file with metadata pertaining to the biological sample, PCR processing, and bioinformatic analysis. BIOM files can be used in downstream analysis to analyze biodiversity patterns within the samples. Monterey Bay in California, USA, is a hot spot of biodiversity and productivity fed by nutrient-rich upwelling water along the coast. A local time-series of samples has been collected by the Monterey Bay Aquarium Research Institute at coastal stations within the bay, providing several decades of contextual environmental data. Samples taken from this time series are ideal for testing the ability of eDNA sequencing to show variability in taxonomic groups over time. For metabarcoding analysis, samples were chosen representing different seasons corresponding to spring (early) upwelling, summer (late) upwelling, fall oceanic regime, and a winter (Davidson) regime from the years 2013-2016. Samples were analyzed across four taxonomic marker genes: two small-subunit ribosomal RNA genes targeting prokaryotic (16S rRNA) and eukaryotic (18S rRNA) organisms and two mitochondrial genes targeting eukaryotes (cytochrome c oxidase subunit I gene (COI)) and vertebrates (mitochondrial small-subunit ribosomal RNA gene (12S)). In order to combine data from multiple markers, species occupancy modeling was used to determine the probability that an OTU is truly present in a sample (as described in Kelly et al. 2017 and Lahoz-Monfort et al. 2015). Many taxonomic groups show seasonal trends in species abundance and diversity in Monterey Bay. 
Together this work illustrates the rewards and challenges of applying multiple genetic markers to eDNA sequencing analysis of an environmental time series. HTML XML PDF
      PubDate: Thu, 24 Aug 2017 23:22:56 +0300
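A minimal sketch of combining per-marker OTU detections into one sample-by-OTU table, analogous to the BIOM-style output described above; it assumes pandas, and the OTU identifiers and read counts are fabricated.

```python
# Sketch only: merge detections from several marker genes into a single
# sample-by-OTU matrix for downstream biodiversity analysis.
import pandas as pd

detections = pd.DataFrame([
    {"sample": "MB_2014_spring", "marker": "18S", "otu": "18S_OTU_17", "reads": 412},
    {"sample": "MB_2014_spring", "marker": "COI", "otu": "COI_OTU_03", "reads": 88},
    {"sample": "MB_2015_fall",   "marker": "18S", "otu": "18S_OTU_17", "reads": 35},
    {"sample": "MB_2015_fall",   "marker": "12S", "otu": "12S_OTU_09", "reads": 7},
])

otu_table = detections.pivot_table(index="otu", columns="sample",
                                   values="reads", fill_value=0)
print(otu_table)
```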
       
  • A workbench for species identification based on images and deep learning
           techniques

    • Abstract: Biodiversity Information Science and Standards 1: e20569
      DOI : 10.3897/tdwgproceedings.1.20569
      Authors : Ignacio Heredia, Lara Lloret, Jesús Marco, Francisco Pando : We are currently studying the feasibility of applying deep learning techniques to natural sciences. In this contribution we will show our recent advances with an easy plug-and-play framework (that uses the Lasagne module built on top of Theano), which we have successfully trained for plant identification. Subsequent trials have been carried out on cone snails (Conus spp.) with minimum overhead and without writing any new code. The fact that these applications share a common API makes it very easy to create new applications (e.g., on Android, as we are currently testing) and to apply them to new species groups. The code for the framework can be found at: https://github.com/IgnacioHeredia/plant_classification. This kind of application makes taxonomic expertise directly accessible to members of the general public interested in nature and the diversity of living organisms. These applications have a clear educational impact, and may also be used to enhance conservation actions. Deployment and use of the current framework is supported by the recently begun EU-funded project "DEEP Hybrid Datacloud". In particular, the project will support the extensive training of the system needed to develop new applications, and will provide the necessary computational resources to the users. HTML XML PDF
      PubDate: Thu, 24 Aug 2017 12:56:43 +0300
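The plug-and-play idea (one prediction interface reused across species groups) can be pictured with the schematic below. This is not the published framework, which is built on Theano/Lasagne (see the repository above); `load_trained_model` is a stand-in name, not part of that code base.

```python
# Schematic sketch only: a single prediction function reused for plants, cone snails,
# or any new species group by swapping the trained model and label list.
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    """Load an image and convert it to a normalised array ready for a classifier."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype="float32") / 255.0

def predict_species(path, model, labels, top_k=5):
    """Return the top-k (label, probability) pairs from any model exposing .predict()."""
    probs = model.predict(preprocess(path)[np.newaxis, ...])[0]
    best = np.argsort(probs)[::-1][:top_k]
    return [(labels[i], float(probs[i])) for i in best]

# model = load_trained_model("plants")   # hypothetical loader; "conus" would use the
#                                        # same interface with different weights
```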
       
  • How species interactions are managed in Plinian Core: Status and questions

    • Abstract: Biodiversity Information Science and Standards 1: e20556
      DOI : 10.3897/tdwgproceedings.1.20556
      Authors : Francisco Pando : Plinian Core is a set of vocabulary terms that can be used to describe all kinds of properties related to taxa (https://github.com/tdwg/PlinianCore). "Interactions" is a class of properties included in the "Natural History" class. In its current state, the class comprises the elements taken from the Darwin Core class "ResourceRelationship"; these are: resourceRelationshipID, resourceID, relatedResourceID, relationshipOfResource, relationshipAccordingTo, relationshipEstablishedDate, relationshipRemarks. These terms are complemented with a Plinian Core native element: InteractionSpeciesType (see https://github.com/tdwg/PlinianCore/wiki/InteractionAtomizedClass), intended to group all possible interactions into logical categories. As a generic standard, Plinian Core recommends but does not impose the use of controlled vocabularies to specify interactions and their types. However, the community would benefit enormously from having some consensus vocabularies to recommend for interactions and categories of interactions. That is, as we see it, the frontier in managing species interaction information. We will review some of the investigations on this subject within the scope of Plinian Core. HTML XML PDF
      PubDate: Thu, 24 Aug 2017 8:52:26 +0300
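An illustrative interaction record using the terms listed above; the identifiers, the interaction type and its category are invented, and the controlled vocabulary shown is hypothetical.

```python
# Illustrative only: one interaction expressed with the ResourceRelationship terms
# plus the Plinian Core native element InteractionSpeciesType. All values are made up.
interaction = {
    "resourceRelationshipID": "urn:example:interaction:0001",
    "resourceID": "urn:example:taxon:Apis mellifera",
    "relatedResourceID": "urn:example:taxon:Lavandula latifolia",
    "relationshipOfResource": "pollinates",            # hypothetical controlled value
    "InteractionSpeciesType": "mutualism",             # hypothetical category value
    "relationshipAccordingTo": "example reference",
    "relationshipEstablishedDate": "1987",
    "relationshipRemarks": "recorded in SE Spain",
}
```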
       
  • Quantifying quality: the "Apparent Quality Index", a measure of data
           quality for occurrence datasets

    • Abstract: Biodiversity Information Science and Standards 1: e20533
      DOI : 10.3897/tdwgproceedings.1.20533
      Authors : Francisco Pando : When making an initial assessment of a dataset originating from an unfamiliar source, a user typically relies on the visible properties of the dataset as a whole, such as the title, the publisher, and the size of the dataset. Aspects of data quality are usually out of view, beyond some intuitions and hard-to-compare assertions. In 2007 at GBIF Spain we tried to correct that by developing an index that enables a user to assess the quality of Darwin Core datasets published by GBIF-Spain, and to track improvements in quality over time. Our goal was to create an index that is explicit, easy to understand, and easy to obtain. We dubbed that index "ICA" (GBIF Spain 2010), for its name in Spanish "Índice de Calidad Aparente" (Apparent Quality Index). We say ICA measures "apparent quality", because, although unlikely, a dataset can have a high ICA, while its records are actually a poor reflection of the reality to which they refer. ICA summarizes data quality on the three primary dimensions of biodiversity data: taxonomic, geospatial and temporal. In this contribution we will present the rationale behind the ICA, how it is calculated, how it works within the Darwin Test tool (Ortega-Maqueda and Pando 2008), how it is integrated in the data publication processes of GBIF Spain, and some discussion and results about its utility and potential. We also compare ICA to the emerging framework for data quality assessment (TDWG Data Quality Interest Group 2016). HTML XML PDF
      PubDate: Wed, 23 Aug 2017 14:40:54 +0300
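To make the idea of an "apparent quality" summary concrete, here is a toy Python aggregate over the three dimensions mentioned. It is not the actual ICA formula, which is defined in the cited references; it only illustrates turning per-record checks into a single dataset-level score.

```python
# Toy aggregate only -- NOT the ICA formula. It summarises simple per-record checks on
# the taxonomic, geospatial and temporal dimensions into one score between 0 and 1.

def record_checks(rec):
    return {
        "taxonomic": rec.get("scientificName") not in (None, ""),
        "geospatial": rec.get("decimalLatitude") is not None
                      and rec.get("decimalLongitude") is not None,
        "temporal": bool(rec.get("eventDate")),
    }

def apparent_quality(records):
    """Mean proportion of passed checks across all records."""
    scores = [sum(record_checks(r).values()) / 3 for r in records]
    return sum(scores) / len(scores) if scores else 0.0

records = [
    {"scientificName": "Quercus ilex", "decimalLatitude": 40.1,
     "decimalLongitude": -3.7, "eventDate": "2006-05-02"},
    {"scientificName": "", "decimalLatitude": None, "decimalLongitude": None,
     "eventDate": "2006-06-11"},
]
print(apparent_quality(records))
```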
       
  • World Flora Online Project: An online flora of all known plants

    • Abstract: Biodiversity Information Science and Standards 1: e20529
      DOI : 10.3897/tdwgproceedings.1.20529
      Authors : Chuck Miller, William Ulate : In its decision X/17, the Convention on Biological Diversity (CBD) adopted a consolidated update of the Global Strategy for Plant Conservation (GSPC) for the decade 2011–2020 at its 10th Conference of the Parties held in Nagoya, Japan in October 2010. The updated GSPC includes five objectives and 16 targets to be achieved by 2020. Target 1 aims to complete the ambitious target of “an online flora of all known plants” by 2020. A widely accessible Flora of all known plant species is a fundamental requirement for plant conservation and provides a baseline for the achievement and monitoring of other targets of the Strategy. The previous (GSPC 2010) target 1 aimed to develop “a widely accessible working list of known plant species as a step towards a complete world flora,” and this target was achieved at the end of 2010, as The Plant List (http://www.theplantlist.org). Drawing from the knowledge gained in producing The Plant List, a project to create an online world Flora of all known plant species was initiated in 2012. A World Flora Online (WFO) Council has been formed with thirty-six participating institutions world-wide who are diligently working to achieve the 2020 Target. The WFO portal is hosted on the Google Cloud and is online at http://www.worldfloraonline.org. WFO utilizes a taxonomic backbone of all vascular plants and bryophytes from orders to subspecies. Rapid progress is now being made toward incorporation of descriptive data, distributions and images. This poster will describe the vision, technical approach, progress to date and plans for this significant global project. HTML XML PDF
      PubDate: Wed, 23 Aug 2017 9:27:48 +0300
       
  • Historical flowering phenology across a broad range of Pacific Northwest
           plants

    • Abstract: Biodiversity Information Science and Standards 1: e20528
      DOI : 10.3897/tdwgproceedings.1.20528
      Authors : Christopher Kopp, Linda Jennings (Lipsen), Barbara Neto-Bradley, Jas Sandhar, Siena Smith, Louisa Hsu : For many species, understanding how climate influences the timing of seasonal life history events (phenology) is limited by the availability of long-term data. Further, long-term studies of plant phenology are often local in scale. Recent efforts to digitize herbarium collections make it possible to examine large numbers of specimens from multiple species over broad geographic regions. In the Pacific Northwest (PNW), understory plant species found in old-growth forests may be buffered against climate warming (Frey et al. 2016). Using 8,500 specimens of 40 plant species housed in 25 herbaria collected over more than 100 years in the PNW, we analyzed whether these species have experienced shifts in flowering phenology corresponding to long-term climate warming. Our findings were mixed, with some species experiencing earlier flowering phenology over time while others have not shifted their flowering phenology since the early-1900s. Responses were dependent on life-history, including habitat preference and timing of flowering. These results demonstrate that herbarium collections are an important tool for examining long-term flowering phenology over broad geographic areas and habitat types. Further, flowering phenology does not uniformly shift in correlation with climate warming. HTML XML PDF
      PubDate: Wed, 23 Aug 2017 9:21:35 +0300
       
  • VicFlora: a dynamic, service-based Flora

    • Abstract: Biodiversity Information Science and Standards 1: e20525
      DOI : 10.3897/tdwgproceedings.1.20525
      Authors : Niels Klazenga : VicFlora (https://vicflora.rbg.vic.gov.au), launched in September 2016, combines the Flora of Victoria and the Census of Vascular Plants of Victoria. VicFlora is used by government agencies, conservation agencies and consultancies: analytics indicate that on weekdays VicFlora is used by between 200 and 250 people. Content of VicFlora can be updated inside the system, so changes go live as soon as they are made; the keys come from KeyBase and the maps use data from the Australasian Virtual Herbarium (AVH) and the Victorian Biodiversity Atlas (VBA), both obtained through Atlas of Living Australia (ALA) web services. Corrections and value-adding to the maps are done through annotations. There have been requests from government and consultancies for the ability to use content from VicFlora in their own applications. Therefore, web services have been created that give access to all content. New versions of VicFlora will use these same services. It is the intention that names and taxon concepts for VicFlora will be managed in the National Species Lists (NSL) and accessed through NSL web services. HTML XML PDF
      PubDate: Wed, 23 Aug 2017 3:57:59 +0300
       
  • Integrating Marine Omics into the Marine Biodiversity Observation Network
           (MBON) in Support of the UN Sustainable Development Goals (SDG) and Agenda
           2030

    • Abstract: Biodiversity Information Science and Standards 1: e20521
      DOI : 10.3897/tdwgproceedings.1.20521
      Authors : Kelly D. Goodwin, Frank Muller-Karger, Gabrielle Canonico : Life on Earth, including humanity, is tightly and inextricably intertwined with the environment. In a concerted effort to promote the well-being and dignity of humanity, while conserving and protecting the environment, the United Nations developed a series of targets in what is officially known as Transforming our world: the 2030 Agenda for Sustainable Development (UN Resolution A/RES/70/1 of 25 September 2015). This agenda lays out 17 ambitious "Global Goals" with a total of 169 targets that promote capacity building, eradication of poverty, and management practices that sustain the growing need for resources. There are specific targets outlined for marine resource conservation and use in Sustainable Development Goal 14 (SDG 14). These goals seek to guarantee benefits that humans derive from the variety, abundance, and biomass of marine species, and from the diverse interactions between these organisms and the marine environment. Sustainability goals are reflected in ocean policies that mandate integrated, ecosystem-based approaches to marine monitoring. This, in turn, drives a global need for efficient, low-cost bioindicators of marine ecological quality. However, most traditional assessment methods rely on specialized expertise to visually identify a limited set of organisms – a process that is labor-intensive, slow, and can provide a narrow view of ecological status. Advances in gene-sequencing technology offer an opportunity for improvement. Molecular-based assessments of biodiversity and ecosystem function offer advantages over traditional methods and are increasingly being generated for a suite of taxa – ranging from the microbiome to fish and marine mammals. ‘Omic approaches are being implemented in the Marine Biodiversity Observation Network (MBON). MBON is under the umbrella of the Group on Earth Observations Biodiversity Observation Network (GEO BON). The objective of the MBON is to define practical indices that can be deployed in an operational manner to track changes in the number of marine species, the abundance and biomass of marine organisms, the diverse interactions between organisms and the environment, and the variability and change of specific habitats of interest. The goal is to characterize diversity of life at the genetic, species, and ecosystem levels using a broad array of in situ and remote sensing observations. A goal of MBON is to advance practical and wide use of environmental DNA (eDNA) applications to address the need to evaluate status and trends of life in coastal and pelagic environments. MBON activities address a number of SDG 14 targets, including those related to conservation and sustainable management of marine and coastal ecosystems, impacts of ocean acidification, sustainable use of resources, and capacity building. MBON groups are actively engaged at the national and international level to enable the widespread observation of marine life using standardized methods and data management protocols, building on existing capacity and infrastructure. Specifically, this includes collaborations with programs such as the Global Ocean Observing System (GOOS), OBIS, the Global Ocean Acidification Observation Network (GOA ON), and various regional Biodiversity Observation Networks. The MBON projects seek to establish a community of practice built around common tools and goals. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 19:40:53 +0300
       
  • Use of Online Species Occurrence Databases in Published Research since
           2010

    • Abstract: Biodiversity Information Science and Standards 1: e20518
      DOI : 10.3897/tdwgproceedings.1.20518
      Authors : Joan Ball-Damerow, Laura Brenskelle, Narayani Barve, Pam Soltis, Raphael LaFrance, Arturo Ariño, Robert Guralnick : Museums and funding agencies have invested considerable resources in recent years to digitize information from natural history specimens and contribute to online species occurrence databases. Such efforts are necessary to reap the full benefits of irreplaceable historical data by making them openly accessible and allowing the integration of collections data with other datasets. However, recent estimates suggest that still only 10% of biocollections are available in digital form. The biocollections community must therefore continue to justify and promote digitization efforts, particularly for high-diversity groups with large numbers of specimens, such as invertebrates. Our overarching goal is to determine how uses of biodiversity databases have developed in recent years, as more data has come online. To this end, we present a bibliometric analysis of published research to characterize uses of online species occurrence databases since 2010. Relevant papers for this analysis include those that use online and openly accessible primary occurrence records, or those that add data to an online database. Google Scholar (GS) provides full-text indexing, which was important to identify data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS. We drew a list of relevant search terms and downloaded all records returned by each search (or the first 500 if there were more) into a Zotero reference management database. About one third of the 2500 papers in the final dataset were relevant. Three of the authors with specialized knowledge of the field characterized relevant papers using a standardized tagging protocol based on a series of key topics of interest. We developed a list of potential tags and descriptions for each topic, including: database(s) used, database accessibility, scale of study, region of study, taxa addressed, general use of data, other data types linked to species occurrence data, data quality issues addressed, authors, institutions, and funding sources. Each tagged paper was thoroughly checked by a second tagger. The final dataset of tagged papers allows us to quantify general areas of research made possible by the expansion of online species occurrence databases, and trends over time. For example, preliminary results on a subset of the papers indicate that the most common uses of online species occurrence databases have been: (a) to determine trends in species richness or distribution; (b) to describe a new database; and (c) to assist in developing species checklists or taxonomic studies. Studies addressing plants have generally been more prevalent than those concerning both vertebrates and invertebrates. However, while the number of plant and vertebrate studies has remained relatively constant in recent years, invertebrate studies are increasing. We also address the importance of both proper citation of databases and use of approaches to improve data quality issues involving errors and biases. The most common aspects of data quality addressed were checking for currently valid names and spatial errors, and excluding certain unsuitable records. Finally, we identify more integrative studies that incorporate multiple data types, and determine whether these uses are enabled by collaborations. 
Overall, our presentation demonstrates initial trend results for over 100 specific tags associated with 13 topics of interest, and network analyses of authors and institutions for relevant papers. We also outline the downstream utility of our dense tagging approach for understanding domain-wide trends, and the potential for developing machine-learning approaches to more efficiently characterize certain aspects of published research. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 17:43:01 +0300
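A sketch of the kind of summary such a tagging protocol enables, assuming pandas; the tag values and table layout are hypothetical, not the authors' actual schema.

```python
# Illustrative only: count tagged papers per general use per year as the basis for
# trend plots. Tag names and rows are invented.
import pandas as pd

papers = pd.DataFrame([
    {"year": 2012, "use": "species richness/distribution", "taxa": "plants"},
    {"year": 2014, "use": "new database description",      "taxa": "invertebrates"},
    {"year": 2015, "use": "species richness/distribution", "taxa": "vertebrates"},
    {"year": 2016, "use": "checklist/taxonomy",            "taxa": "invertebrates"},
])

trends = papers.groupby(["year", "use"]).size().unstack(fill_value=0)
print(trends)
```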
       
  • Expanding the Ocean Biogeographic Information System (OBIS) beyond species
           occurrences

    • Abstract: Biodiversity Information Science and Standards 1: e20515
      DOI : 10.3897/tdwgproceedings.1.20515
      Authors : Pieter Provoost, Daphnis De Pooter, Ward Appeltans, Nicolas Bailly, Sky Bristol, Klaas Deneudt, Menashè Eliezer, Ei Fujioka, Alessandra Giorgetti, Philip Goldstein, Mirtha Lewis, Marina Lipizer, Kevin Mackay, Maria Marin, Gwenaëlle Moncoiffé, Stamatina Nikolopoulou, Shannon Rauch, Andres Roubicek, Carlos Torres, Anton van de Putte, Leen Vandepitte, Bart Vanhoorne, Matteo Vinci, Nina Wambiji, Dave Watts, Eduardo Klein, Francisco Hernandez : Data providers in the Ocean Biogeographic Information System (OBIS) network are not just recording species occurrences; they are also recording sampling methodology details and measuring environmental and biotic variables. In order to make OBIS an effective data sharing platform, it needs to be able to store and exchange these data in such a way that they can easily be interpreted by end users, as well as by the tools which will be created to search, analyze and visualize the integrated data. OBIS makes use of Darwin Core Archives (DwC-A) for exchanging data between data providers, regional and thematic nodes, and the central OBIS database. However, due to limitations of the DwC-A schema, this data format is currently not suitable for storing sampling event details, sample-related measurements, or biotic measurements. In order to overcome this problem, OBIS has created a new extension type based on the existing MeasurementOrFacts extension (De Pooter et al. 2017). This ExtendedMeasurementOrFacts extension adds an occurrenceID field, which allows linking biotic measurements to occurrences, even if the archive contains an event table and sample level measurements or facts. In addition, identifiers for measurement types, values and units can now be added in the new measurementTypeID, measurementValueID and measurementUnitID fields. These identifiers link to vocabularies such as the BODC NERC Vocabulary, and greatly improve the interoperability and reusability of the OBIS datasets. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 17:08:32 +0300
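An illustrative set of rows showing how the new occurrenceID field ties a biotic measurement to an occurrence within an event; the identifiers and the NERC vocabulary URIs are placeholders rather than resolved terms.

```python
# Illustrative rows only: one event, one occurrence, and one biotic measurement linked
# to the occurrence through the occurrenceID field of the ExtendedMeasurementOrFact
# extension. Vocabulary URIs are placeholders.

event = {"eventID": "cruise01:station05", "samplingProtocol": "bongo net"}

occurrence = {
    "eventID": "cruise01:station05",
    "occurrenceID": "cruise01:station05:occ:0007",
    "scientificName": "Calanus finmarchicus",
}

biotic_measurement = {
    "eventID": "cruise01:station05",
    "occurrenceID": "cruise01:station05:occ:0007",   # links measurement to occurrence
    "measurementType": "wet weight biomass",
    "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P01/current/placeholder/",
    "measurementValue": "12.4",
    "measurementUnit": "mg/m3",
    "measurementUnitID": "http://vocab.nerc.ac.uk/collection/P06/current/placeholder/",
}
```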
       
  • The Genomic Observatories Metadatabase

    • Abstract: Biodiversity Information Science and Standards 1: e20508
      DOI : 10.3897/tdwgproceedings.1.20508
      Authors : John Deck, Michelle Gaither, Rodney Ewing, Christopher Bird, Neil Davies, Christopher Meyer, Cynthia Riginos, Robert Toonen, Eric Crandall : The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/) is an open access repository for geographic and ecological metadata associated with biosamples and genetic data. It contributes to the informatics stack – Biocode Commons – of the Genomic Observatories Network (https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-2). While public databases have served as vital repositories for nucleotide sequences, they do not accession all the metadata required for ecological or evolutionary analyses. These metadata are especially important for process oriented, time-series research, as, for example, at Long Term Ecological Research (LTER) sites or longitudinal public health studies. GeOMe fills this need, providing a user-friendly, web-based interface for both data contributors and data recipients. The interface allows data contributors to create a customized yet standards-compliant spreadsheet that captures the temporal and geospatial context of each biosample. These metadata are then validated and permanently linked to archived genetic data stored in the National Center for Biotechnology Information's (NCBI's) Sequence Read Archive (SRA) via unique persistent identifiers. By linking ecologically and evolutionarily relevant metadata with publicly archived genetic sequence data in a structured manner, GeOMe provides an important linchpin across all levels of biodiversity. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 16:12:56 +0300
       
  • Linking molecular and morphological biodiversity evidence by building a
           single name space

    • Abstract: Biodiversity Information Science and Standards 1: e20503
      DOI : 10.3897/tdwgproceedings.1.20503
      Authors : Dmitry Schigel, Markus Döring, Roderic Page, Urmas Kõljalg, Paul Hebert : GBIF is working on a solution to represent molecular (DNA) evidence of species presence in time and space alongside the currently prevailing morphological evidence. Among the many benefits of this approach are filling geographic and taxonomic gaps and adequately representing functionally important organism groups. An experimental modification of the GBIF backbone includes provisional non-Linnaean names from the UNITE and BOLD systems, enabling indexing of georeferenced sequences alongside other records. Life on Earth does not depend on the language we use to name taxa; a single index and access point to global biodiversity data is essential for good science and adequate decision making. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 10:59:15 +0300
       
  • Enhancing Monitoring and Control of the Fall Armyworm (Spodoptera
           frugiperda) in the Democratic Republic of the Congo (DR Congo) by Citizen
           Science.

    • Abstract: Biodiversity Information Science and Standards 1: e20499
      DOI : 10.3897/tdwgproceedings.1.20499
      Authors : Papy Miankeba : The Fall Armyworm (FAW) (Spodoptera frugiperda - Lepidoptera) is an insect that feeds on more than 80 plant species and causes major damage to economically important crops including maize, rice or sugarcane. While in cooler climates development slows down to one or a few generations per year because frost kills the insect, in Africa, the FAW moths can travel hundreds of kilometers per night and reproduce every 1–2 months, which helped the pest spread rapidly. Environmental and climatic analyses of Africa show that the FAW is likely to build permanent and significant populations in West and Southern Africa, spreading to other regions when weather or temperatures are favourable. Located in Central Africa, the DR Congo is now facing this challenge. Since 2016, the FAW has caused significant yield losses (up to 80% in 4 regions) where maize is cultivated. Almost 50 of the 147 administrative territories have been affected. The damage resulted in a surge in commodity prices (the price of a 25 kg bag of corn rose from $10 to $30). Pesticides and genetically modified (GM) crops could be the main methods of control but many farmers in the DR Congo do not yet plant GM crops. Biopesticides (including virus-based and Bacillus thuringiensis), mass rearing and release of parasitoids and predators are low-risk options, but remain prohibitive to many small-scale farmers; subsidies or government-funded interventions are unavailable. In all cases, further research is needed across the country, through national and regional institutes, to understand the insect’s lifecycle stages and feeding habits. A widespread communications programme is necessary to teach farmers how to monitor and identify the pest. There is a clear need for information resources about FAW, which can help inform and keep all interested parties up-to-date on the latest news regarding spread, management research, diagnostic protocols for monitoring and early detection techniques of FAW. However, with political will shifted to other priorities, no formal program to collect data to characterize this pest has been initiated, and studies that provided current data are facing the following main challenges: (i) the national territory is too vast (more than 2,345,000 km2) and impossible to cover by these types of studies; (ii) insects are known under several common names (complicated by >240 languages) or only one name is used to identify taxonomically distant species; (iii) historical data are very hard to find (the oral tradition being preferred to writing); (iv) the absence of reference collections and the lack of specialists; (v) the difficulty of finding geo-referenced data (with the risks associated with its collection). An effective alternative to circumvent these difficulties and to gather data is promoting citizen science. Such research programs, involving scientists and the participation of amateurs or interested volunteer citizens within local populations, would build a body of data across the country and over a longer period than has been achieved so far. 
With the development of free software and mobile applications, non-specialists could, based on standards and protocols validated by scientists, be involved in the digitization of observed specimens or the identification of insect species through graphical user interfaces; help clarify the correspondence between vernacular names and scientific names; participate in habitat monitoring of insect species; or help collect geo-referenced data via mobile phone. These field-based research activities can be conducted without great expense and will offer professionals and non-professionals a collaborative ground on which to contribute together toward advances in the monitoring and knowledge of FAW (and many other pests) in the DR Congo. HTML XML PDF
      PubDate: Tue, 22 Aug 2017 6:22:52 +0300
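To make the envisaged mobile reporting concrete, here is a minimal sketch of how a single citizen-submitted FAW sighting could be expressed as a Darwin Core occurrence record before publication. The column names are standard Darwin Core terms; the helper function and all example values are hypothetical and do not belong to any existing DR Congo programme.

```python
import csv

# Standard Darwin Core terms; the values used below are invented for illustration only.
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName", "vernacularName",
    "eventDate", "decimalLatitude", "decimalLongitude",
    "geodeticDatum", "coordinateUncertaintyInMeters", "recordedBy", "country",
]

def make_record(occurrence_id, lat, lon, event_date, recorded_by, vernacular_name=""):
    """Build one Darwin Core-style occurrence row for a citizen-reported FAW sighting."""
    return {
        "occurrenceID": occurrence_id,
        "basisOfRecord": "HumanObservation",
        "scientificName": "Spodoptera frugiperda",
        "vernacularName": vernacular_name,    # local common name as reported
        "eventDate": event_date,              # ISO 8601, e.g. "2017-08-22"
        "decimalLatitude": lat,
        "decimalLongitude": lon,
        "geodeticDatum": "WGS84",
        "coordinateUncertaintyInMeters": 30,  # typical phone GPS accuracy (assumed)
        "recordedBy": recorded_by,
        "country": "Democratic Republic of the Congo",
    }

if __name__ == "__main__":
    rows = [make_record("faw-0001", -4.325, 15.322, "2017-08-22", "volunteer 17", "chenille légionnaire")]
    with open("faw_occurrences.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=DWC_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```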
       
  • An Update on the Plant Phenology Ontology and Plant Phenology Data
           Integration

    • Abstract: Biodiversity Information Science and Standards 1: e20487
      DOI : 10.3897/tdwgproceedings.1.20487
      Authors : Brian Stucky, Ramona Walls, John Deck, Ellen Denny, Kjell Bolmgren, Robert Guralnick : The study of plant phenology is concerned with the timing of plant life-cycle events, such as leafing out, flowering, and fruiting. Today, thanks to data digitization and aggregation initiatives, phenology monitoring networks, and the efforts of citizen scientists, more phenologically relevant plant data is available than ever before.  Until recently, combining these data in large-scale analyses was prohibitively difficult because no standardized plant phenology terms and concepts were available to facilitate data interoperability.  We have recently completed the first public release of The Plant Phenology Ontology (PPO), the result of a collaborative effort to develop the terminology, definitions, and term relationships that are needed for large-scale data integration and machine reasoning.  We are currently using the PPO to join disparate plant phenology datasets into a single data resource.  In this talk, I will give an introduction to the PPO, including the design of the ontology and examples with real phenological data, and I will present preliminary results of our initial experiments with integrating plant phenology data. HTML XML PDF
      PubDate: Mon, 21 Aug 2017 16:38:37 +0300
       
  • Darwin Cloud: Mapping real-world data to Darwin Core

    • Abstract: Biodiversity Information Science and Standards 1: e20486
      DOI : 10.3897/tdwgproceedings.1.20486
      Authors : John Wieczorek, Paul J. Morris, James Hanken, David Lowery, Bertram Ludäscher, James Macklin, Timothy McPhillips, Robert Morris, Qian Zhang : Since its ratification as a TDWG standard in 2009, data publishers have had to struggle with the essential step of mapping fields in working databases to the terms in Darwin Core (Wieczorek et al. 2012) in order to publish and share data using that standard. Doing so requires a good understanding of both the data set and Darwin Core. The accumulated knowledge about these mappings constitutes what we call the "Darwin Cloud." We will explore the nature of data mapping challenges and the potential for semi-automated solutions to them. Specifically, we will look at the "Darwinizer" actor and its usage in related workflows within the Kurator data quality framework, and the implications for community-managed vocabularies. HTML XML PDF
      PubDate: Mon, 21 Aug 2017 16:02:21 +0300
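Purely as an illustration of the kind of lookup that a "Darwin Cloud" of accumulated mappings enables, the sketch below maps messy source column headers to Darwin Core terms via a normalised alias table. The alias list and normalisation rules are invented for this example and do not reproduce the actual Kurator "Darwinizer" vocabulary or code.

```python
import re

# A tiny, invented alias table: normalised source header -> Darwin Core term.
# A real community-curated "Darwin Cloud" vocabulary would be far larger.
ALIASES = {
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "lon": "decimalLongitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
    "sciname": "scientificName",
    "scientificname": "scientificName",
    "species": "scientificName",
    "collector": "recordedBy",
    "collectiondate": "eventDate",
    "datecollected": "eventDate",
    "catalognumber": "catalogNumber",
}

def normalise(header: str) -> str:
    """Lower-case a header and strip spaces, underscores and punctuation."""
    return re.sub(r"[^a-z0-9]", "", header.lower())

def darwinize(headers):
    """Map each source header to a Darwin Core term, or None if unknown."""
    return {h: ALIASES.get(normalise(h)) for h in headers}

if __name__ == "__main__":
    print(darwinize(["Sci_Name", "LATITUDE", "Longitude", "Collection Date", "plot_id"]))
    # Unmapped headers (here "plot_id" -> None) are flagged for human review,
    # and confirmed mappings can be fed back into the shared alias table.
```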
       
  • Environmental samples, eDNA and HTS libraries – data standard proposals
           from the Global Genome Biodiversity Network (GGBN)

    • Abstract: Biodiversity Information Science and Standards 1: e20483
      DOI : 10.3897/tdwgproceedings.1.20483
      Authors : Gabriele Droege, Jonas Zimmermann, Tim Fulcher, Sietse Van der Linde, Walter Berendsohn : The GGBN Data Portal (http://www.ggbn.org, Droege et al. 2014) has established standardised data flows for genomic DNA samples, including voucher specimens, tissue samples and DNA samples, as well as resulting sequences and publications. Dealing with different types of DNA (aDNA, gDNA, eDNA) is essential and closely related to user-friendly search and display functionalities. GGBN aims both to preserve voucher specimens for all kinds of DNA and to make these important data accessible on the Internet. In addition to genomic DNA, the development and use of high-throughput/next-generation sequencing (HTS, formerly designated NGS) has outstripped current plans of SYNTHESYS and GGBN to join natural history collection data with DNA and tissue collection data. HTS libraries can be considered a preparation of the genetic material of a single organism or of multiple organisms (e.g., from a mixed environmental sample). From that point of view, they are the actual physical molecular representation of a specimen or sample. However, these libraries come with specific adaptors that limit their transferability to other sequencing systems. The libraries are prepared at great expense, but frequently are only used for a single project, not making use of additional useful information that could potentially be generated. To increase the potential of HTS libraries to be used for multiple projects, they have to be discoverable via published metadata. Optimally, HTS library metadata will include specific standardized keywords (e.g., organism, HTS method, etc.). Here we present our ideas and a prototype for eDNA samples and HTS libraries based on the GGBN Data Standard (Droege et al. 2016). A use case collection from animals, diatoms, fungi and plants has been developed and is available in the GGBN Sandbox (see more details at http://wiki.ggbn.org/ggbn/Use_Case_Collection). These examples will be improved upon and kept on-line at least until 2020, so progress can be observed. HTML XML PDF
      PubDate: Mon, 21 Aug 2017 15:17:34 +0300
       
  • BioBeacon: an Online Field Guide to Digital Biodiversity Information
           Resources

    • Abstract: Biodiversity Information Science and Standards 1: e20472
      DOI : 10.3897/tdwgproceedings.1.20472
      Authors : Jarrett Blair, Andrew Borrelli, Michelle Hotchkiss, Candace Park, Gleannan Perrett, Robert Hanner : Managing the biodiversity crisis requires access to credible information on species, as well as their changing abundance and spatio-temporal distributions, among other variables. Technological advances are expanding both the variety and volume of data available, resulting in the emergence of biodiversity informatics as a rapidly growing research paradigm. Many online resources exist, such as GBIF's resources and tools page (Anonymous 2017); however, the lack of fundamental categorization inhibits efficient location and use of relevant data for biological research, conservation, education and industrial application. BioBeacon is a student-driven collaboration between the Biodiversity major at the University of Guelph and the Biodiversity Institute of Ontario. Its purpose is to shine a light on biodiversity information resources and characterize them according to objective criteria that simplify their navigation and increase accessibility. Criteria will include several categories such as data type, source, region of focus, and current status, as well as many tags for more refined searches. The refined search feature and categorization of databases will be the primary distinguishing characteristics that separate BioBeacon from previous biodiversity database indexing efforts. We envision that BioBeacon will be cooperatively managed by its creators and steering committee, while inviting input from stakeholders and other parties of interest. Ideally, BioBeacon will grow to incorporate relevant biodiversity information resources that bear diverse types of data from locations around the world. HTML XML PDF
      PubDate: Mon, 21 Aug 2017 6:24:25 +0300
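A refined search over categorized resources could be as simple as filtering a catalogue of records on the criteria listed above. The sketch below is a hypothetical illustration only; the resource entries, category names and tags are invented and do not come from BioBeacon.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """One catalogued biodiversity information resource (fields are illustrative)."""
    name: str
    data_type: str        # e.g. "occurrence", "sequence", "literature"
    source: str           # e.g. "aggregator", "museum", "citizen science"
    region: str           # e.g. "global", "North America"
    status: str           # e.g. "active", "inactive"
    tags: set = field(default_factory=set)

def refined_search(catalogue, tags=None, **criteria):
    """Return resources matching all given category criteria and all given tags."""
    tags = set(tags or [])
    hits = []
    for r in catalogue:
        if all(getattr(r, key) == value for key, value in criteria.items()) and tags <= r.tags:
            hits.append(r)
    return hits

if __name__ == "__main__":
    catalogue = [
        Resource("GBIF", "occurrence", "aggregator", "global", "active", {"open access", "API"}),
        Resource("BOLD", "sequence", "aggregator", "global", "active", {"barcoding", "API"}),
    ]
    for r in refined_search(catalogue, data_type="occurrence", status="active", tags=["API"]):
        print(r.name)
```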
       
  • The developing Canadian Integrated Ocean Observing System (CIOOS)

    • Abstract: Biodiversity Information Science and Standards 1: e20432
      DOI : 10.3897/tdwgproceedings.1.20432
      Authors : Lenore Bajona : Canada’s ocean science community, which includes the federal government, academia, small businesses, not-for-profit organizations, and other research partners, collects and synthesizes physical, chemical and biological ocean observations. This information is used for discovery research, to model ocean changes, to provide environmental assessment advice, to support resource management decision-making, and to establish baseline data for long-term monitoring. Canada’s ocean community collects large amounts of data but, aside from building comprehensive ocean observatories (Fisheries and Oceans Canada (DFO) et al. 2010), there is no easy mechanism to integrate data from various sources to allow the exploration of interrelationships among variables, and no coordination and collaboration mechanism for the ocean community as a whole to generate an efficient system (Ocean Science and Technology Partnership (OSTP) and Fisheries and Oceans Canada (DFO) 2011). Consequently, we observe fragmented and isolated data – which may never be used outside of a specific project because it is not discoverable by other potential end users. Canada’s ocean science community (Wallace et al. 2014), led and supported by Fisheries and Oceans Canada (DFO), is advancing the development of a Canadian Integrated Ocean Observing System (CIOOS) that brings together and leverages existing Canadian and international ocean observation data into a federated data system which will generate value for users. This integrated ocean observing system (Wilson et al. 2016) will improve coordination and collaboration among diverse data producers, improve access to information for decision making, and enable discovery of and access to data to support a wide variety of applied and theoretical research efforts to better understand, monitor, and manage activities in Canada’s oceans. Conceptual discussions on CIOOS have taken place with Environment and Climate Change Canada, Natural Resources Canada, the Department of National Defence, DFO, and the academic and NGO sectors. Work is underway on four closely-linked projects to move CIOOS from the concept stage to the design stage, covering key areas required to develop a robust and integrated observing system: Governance; Data and observations; Cyber Infrastructure; and Visualization tools. The project teams are evaluating the current ocean observing landscape in Canada (what exists, who has it, and what state it is in), the standards followed, and the gaps, limits or barriers to setting up an integrated ocean observing system. From this they will develop a list of recommendations to support the implementation of CIOOS, which will include which standards to use, the resources required (FTE, capital investment, capabilities), and the best practices to follow. HTML XML PDF
      PubDate: Sun, 20 Aug 2017 21:06:18 +0300
       
  • Semantically Defining Populations for 'Omics Research

    • Abstract: Biodiversity Information Science and Standards 1: e20435
      DOI : 10.3897/tdwgproceedings.1.20435
      Authors : Ramona Walls, Pier Luigi Buttigieg : The study of populations is central to ‘omics research, whether sequencing environmental samples, controlling for population structure when looking for genetic variation within a species, or studying the evolution of large clades. Researchers use different operational definitions of populations and communities, via the highly varied creation of operational taxonomic units (OTUs) and, in some cases, the use of unclustered sequences. The use of different methods, even within one study type (Swarm, UCLUST, CD-HIT, etc.), creates very different OTUs, possibly affecting interpretation and leading to questionable reproducibility. The Population and Community Ontology (PCO) offers the semantics to clarify exactly which collection of organisms (i.e., ecological community or population) was used in an investigation. When combined with methods for standardizing observational data from the Biological Collections Ontology (BCO), protocol classes from the Ontology for Biomedical Investigations (OBI), and characterization of environments from the Environment Ontology (ENVO), PCO can fully describe the methods used to derive organismal or species-based (i.e., taxonomic) OTUs used for biodiversity analysis and monitoring. PCO is not well suited to describe “OTUs” based on sequence variants that may or may not map to population- or individual-level variation (e.g., the output of some clustering algorithms). In this case, the Sequence Ontology (SO) may be more appropriate. This presentation will describe the key ontology design patterns used in the PCO and provide examples of how and when PCO and related ontologies should be used in omics research, with a focus on environmental/metagenomic sequencing applications. HTML XML PDF
      PubDate: Sun, 20 Aug 2017 20:10:41 +0300
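To illustrate why the choice of clustering method and threshold matters, the toy example below greedily clusters a few short sequences into OTUs at two different identity thresholds and shows that the resulting "populations" differ. It is a deliberately naive sketch (exact pairwise identity, greedy centroids), not an implementation of Swarm, UCLUST or CD-HIT.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_otus(seqs, threshold):
    """Greedy centroid clustering: join a sequence to the first OTU whose
    centroid it matches at or above the identity threshold."""
    otus = []  # list of (centroid, members)
    for s in seqs:
        for centroid, members in otus:
            if identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            otus.append((s, [s]))
    return otus

if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGTTCGTAA", "TTGTACGAAC"]
    for t in (0.97, 0.80):
        clusters = greedy_otus(reads, t)
        print(f"threshold {t}: {len(clusters)} OTUs")
    # The same reads collapse into different numbers of OTUs at different
    # thresholds, which is exactly the ambiguity PCO aims to make explicit.
```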
       
  • How Agricultural Researchers Share their Data: a Landscape Inventory

    • Abstract: Biodiversity Information Science and Standards 1: e20434
      DOI : 10.3897/tdwgproceedings.1.20434
      Authors : Cynthia Parr, Erin Antognoli, Jonathan Sears : The United States Agricultural Research Service (ARS) recently declared a grand challenge: Transform agriculture to deliver a 20% increase in quality*1 food availability with 20% lower environmental impact by 2025. Addressing this challenge requires a sea change in how it conducts agricultural research. Not only will teams need to be multi-disciplinary; as they begin to pursue big data and data-intensive approaches, they will also need to find effective ways to share their diverse kinds of data with each other, with other research teams, with members of farming and business communities, and with policy-makers. Biodiversity is a key component of food production (crop and livestock species, for example, and the pollinators and microbes they depend on) and of the impact that food production (including reduction of pest and pathogen species) has on the environment (species richness, invasive species, and ecosystem services, for example). It is currently unclear how much biodiversity data relevant to agriculture is being made available and, if so, where it is. These questions are part of a general need to understand how our pilot platform for USDA-funded data cataloging and publication, the Ag Data Commons (https://data.nal.usda.gov), can best support grand challenge research. It will also help agricultural librarians assist their researchers in data management and publication. Therefore, we conducted an extensive inventory of the options available to researchers both for finding data and for sharing data related to the broader areas of agricultural research. We present the general results for agriculture overall, then explore the agrobiodiversity sector specifically. We found 230 active and publicly available agriculture-specific databases and repositories, only 16.6% of which accept submissions from outside their institution, consortium or projects, and most of which are not using, or are not relevant to, TDWG standards such as Darwin Core. The use of taxonomic identifiers is also not standardized. While 73 more general repositories (including the Global Biodiversity Information Facility, GBIF) have easily discoverable agricultural data, in many cases the amounts are currently much smaller than one might expect given the vast investments in agricultural research. We reviewed the total number of datasets returned by seven agriculture-related search terms, as well as the percent of the total repository each term represented. Only twenty-five (34.2%) of the general repositories returned over 500 results from at least one agricultural search term. Only ten repositories (13.7%) returned 5% or more of their collection with any of these agricultural search terms. Of the top 50 journals where USDA researchers published in 2016, 40 (80%) host supplemental datasets, and most state that supplemental material is published as submitted and will not be edited. Thirty (60%) either require or strongly encourage authors to deposit supporting data in public repositories, with 21 (42%) recommending discipline-specific repositories (four journals name GBIF, for example). Only one journal recommended metadata standards according to type of data. Future work should include an assessment of how many of these databases and repositories have machine-readable data dictionaries, which could be used to more effectively discover agriculturally-relevant data and to foster meaningful data integration. 
Future work should also explore how mining the Biodiversity Heritage Library and other sources can increase the availability of machine-readable legacy agrobiodiversity data. HTML XML PDF
      PubDate: Sun, 20 Aug 2017 19:49:32 +0300
       
  • The Arctos Community Model for Sustaining and Enriching Access to
           Biodiversity Data

    • Abstract: Biodiversity Information Science and Standards 1: e20466
      DOI : 10.3897/tdwgproceedings.1.20466
      Authors : Carla Cicero, Joseph Cook, Mariel Campbell, Kyndall Hildebrandt, Teresa Mayfield, John Wieczorek : Arctos (http://arctosdb.org) is a leader in providing museums with collaborative solutions to managing information in their collections. As both a community and a collection management database platform, Arctos is a consortium of museums that collaborate to serve secure and rich data on over 3 million records from natural and cultural history collections through a partnership with the Texas Advanced Computing Center (TACC). An additional 2 million records are in MCZBase, a separate instance at the Museum of Comparative Zoology, Harvard University. Our community collaboratively guides the development of Arctos, shares and develops data vocabulary and standards, and curates and improves the quality of shared data (e.g., agents, geography, and taxonomy). Due to the loss of a permanent staff line at the University of Alaska in 2015, Arctos transitioned to a subscription-based business model in 2016 that funds daily maintenance of the database and future development. Arctos participants pay a tiered subscription fee based on the number of specimens served from their collection(s), plus a $0.02 per specimen cost. Data migration support is available at an additional cost, which depends on the level of assistance required. Collections may apply for a fee reduction with the understanding that they will work to get funds to cover the cost, or they may offset fees with in-kind support through staff time (e.g., assistance with documentation) or expertise. Collections staff also contribute collectively to Arctos through their participation in the Arctos Working Group and Steering Committee. Additional funding support is provided by institutional or collaborative grants that drive specific developments and benefit the Arctos community as a whole. The Arctos community and data platform are sustained by participating collections’ personnel who contribute actively to governance, documentation, development, and funding efforts, as well as by subscription fees. The addition of new collections to the Arctos community provides infusions of new data and expertise, but also contributes to the distribution of platform costs. The change in funding model has created new opportunities to enrich Arctos by adding collections of different sizes and disciplines, and has increased synergy among the curators and data managers who use Arctos as their collection management system. HTML XML PDF
      PubDate: Sun, 20 Aug 2017 18:15:29 +0300
       
  • Talking beyond presence: 04 Symposium: Advances in data accessibility and
           data management for marine species occurrence data: Discussion Panel 1

    • Abstract: Biodiversity Information Science and Standards 1: e20407
      DOI : 10.3897/tdwgproceedings.1.20407
      Authors : Andrew Sherin, Mary Kennedy : Talking beyond presence will be a panel discussion on vocabularies. Panelists will include the presenters in the first session of the symposium and invited guest panelists (TBD). Questions for discussion: What controlled vocabularies are required for data types related to species occurrence and associated measurements? Are there existing and/or developing vocabularies for these data types? Are there data types for which vocabularies need to be developed? If so, who are the best authorities to develop and/or manage these vocabularies? HTML XML PDF
      PubDate: Fri, 18 Aug 2017 20:57:36 +0300
       
  • Future pathways for sharing and integrating data: 04 Symposium: Advances
           in data accessibility and data management for marine species occurrence
           data: Discussion Panel 2

    • Abstract: Biodiversity Information Science and Standards 1: e20406
      DOI : 10.3897/tdwgproceedings.1.20406
      Authors : Andrew Sherin, Mary Kennedy : Future Pathways for sharing and integrating data is a discussion panel following the second session of the symposium Advances in data accessibility and data management for marine species occurrence data. Panelists will include presenters from the session and invited guest panelists. Questions for discussion will be: In your view, how advanced is the marine science research community in data discovery and accessibility? What tools exist to facilitate integration of species occurrence information, associated measurements and environmental data? What tools exist to facilitate visualization of geospatial layers? What are the two things you would recommend be addressed to advance discovery, accessibility, integration and visualization in the next five years? HTML XML PDF
      PubDate: Fri, 18 Aug 2017 20:56:52 +0300
       
  • Cookbooks and Curriculum

    • Abstract: Biodiversity Information Science and Standards 1: e20405
      DOI : 10.3897/tdwgproceedings.1.20405
      Authors : Mary Kennedy, Andrew Sherin : Since 2014, the Coastal and Ocean Information Network Atlantic (COINAtlantic), in collaboration with the Canadian node of the Ocean Biogeographic Information System (OBIS) and other academic, government and non-governmental organizations in Atlantic Canada, has been rescuing species occurrence data in primary and grey literature and processing it to standards for publication through OBIS. The project has been funded in part by the Atlantic Ecosystem Initiative of Environment and Climate Change Canada and by Fisheries and Oceans Canada. The project was awarded Honourable Mention in the 2016 International Data Rescue Award in the Geosciences by Elsevier and the Interdisciplinary Earth Data Alliance. COINAtlantic and OBIS share the common goals of promoting and facilitating free and open access to data required for coastal and ocean management. The sharing of data and integration of datasets require the adoption of standards and the use of common vocabularies. Manuals, guidelines, and cookbooks can facilitate the process. One of the deliverables of the data rescue project was the release of the first version of “Guidelines for marine species occurrence data rescue – The OBIS Canada Cookbook” in April 2017. This document includes ten recipes ranging from the initial identification of sources of data to final project wrap-up and lessons learned. A second deliverable was the development of a curriculum for training sessions for custodians of marine species occurrence data. Training is required at all levels in our community. Not only should data be accessible for reuse, but so should training information and lecture material. This course curriculum, based on the OBIS Canada Cookbook, reused some content already on-line and was tested in a workshop at the 2017 conference of the Atlantic Canada Coastal and Estuarine Research Society (see Fig. 1). Our curriculum, as presently designed, is an intensive, single-day, hands-on course with a focus on graduate students and early career researchers. The course has nine (9) modules which address the following topics: why we share research data, including a general description of, and the need for, data policies, data management plans and data repositories; an introduction to OBIS and the standards used by OBIS; how to map data sets to Darwin Core terms and how to clean and reformat the data; how to standardize species lists; how to georeference observations and use gazetteers to standardize location place names; and how to compose standardized discovery metadata. The last module is devoted to the processing of participants’ data sets under the guidance of the instructor. Future activities will include promotion of the use of the cookbook and revision of the recipes according to users’ feedback. The curriculum will be tested again with a new set of participants on an opportunistic basis and modified according to participants’ comments. A staged and edited video of the course is under consideration - the objective is to provide on-line training material. These products will augment the growing number of lesson plans and lecture material made accessible by the OBIS/GBIF community. These resources need to be promoted and their reuse encouraged. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 20:23:31 +0300
       
  • Unifying Biology Through Informatics (UBTI) a new programme of the
           International Union of Biological Sciences

    • Abstract: Biodiversity Information Science and Standards 1: e20431
      DOI : 10.3897/tdwgproceedings.1.20431
      Authors : David Shorthouse, David Patterson, Nils Stenseth : Biological information, both new and old, is increasingly available in thousands of on-line sites. Biological information relates to elements as small as subatomic particles and as large as the entire biosphere, and to processes that last from less than a femtosecond to many billions of years. Our economy, supply of food and materials, our health, and our individual and collective well-being are set within the context of our natural world. Our world is under pressure from the demands of a growing population. Biologists need new ways to access, organize and analyze biological information to make new styles of research possible, to better predict the nature of future change, and to inform decision makers. TDWG has promoted standardization that facilitates communication, but there is no overall architecture that interconnects the digital resources to better serve a wider community. The research funding paradigm leads to imaginative but competitive and short-lived enterprises. The Unifying Biology Through Informatics (UBTI) programme will explore the challenges that have to be overcome to add a cyberinfrastructure agenda to the current research agenda. An emerging infrastructure will require technical improvements, inclusive of agreed metadata and ontologies, to improve, extend and integrate the informatics tools, processes and skills needed to manage digital information across the full spectrum of biological phenomena. The new tools must penetrate deep into a large number of small information sources. It will require the co-operation of existing enterprises, a target of service rather than discovery, institutional support to guarantee long-term funding of the emerging infrastructure, and a considerable financial investment. 'Unifying Biology Through Informatics' will articulate, in increasing detail, the nature of an indexing infrastructure capable of evolving in response to the needs of producers and users of data; the hosting, funding and sustainability of the infrastructure; and the political and social changes needed to secure the adoption, longevity, and use of the infrastructure. The programme welcomes input on selected use-cases that can be used to identify the most pressing challenges and opportunities. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 18:10:44 +0300
       
  • Documenting Reproductive Phenology using Herbarium Specimens: Experiences
           from the New England Vascular Plants Project

    • Abstract: Biodiversity Information Science and Standards 1: e20430
      DOI : 10.3897/tdwgproceedings.1.20430
      Authors : Patrick Sweeney, Edward Gilbert : Herbarium specimens and associated label data are valuable sources of phenological data. They can provide information about the phenological state of the specimen and about how phenology varies in space and time. In an effort to leverage this tremendous phenological resource, the New England Vascular Plants project (NEVP) has worked over the past few years to create a data set tailored to the study of the effects of climate change in New England. This project has focused on capturing images, specimen occurrence data, and reproductive phenology from New England specimens housed in 17 herbaria in northeastern North America. Flowering and fruiting states were scored from images of specimens or derived from pre-existing occurrence records. Data were captured through crowdsourcing efforts and by paid staff. To help standardize the scoring process, a controlled reproductive phenology vocabulary was developed with input from the community. This vocabulary prioritized simplicity of use and broad applicability. In this talk, I will give an overview of our efforts, describing the digitization products, controlled vocabulary, scoring process, and method for sharing the scorings with data users. I will also discuss the challenges involved in utilizing uncontrolled phenology data in pre-existing occurrence records. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 17:19:47 +0300
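As an illustration of how a simple controlled vocabulary keeps crowdsourced and staff-entered scores consistent, the sketch below validates reproductive-phenology scores against a small set of allowed values. The vocabulary terms and catalogue numbers shown are placeholders, not the actual NEVP vocabulary or data.

```python
# Placeholder controlled vocabulary; the real NEVP vocabulary was developed
# with community input and may use different terms.
REPRODUCTIVE_STATES = {"flowering", "fruiting", "flowering and fruiting", "not reproductive"}

def validate_scores(scores):
    """Split specimen scorings into accepted records and records needing review."""
    accepted, needs_review = [], []
    for specimen_id, state in scores:
        cleaned = state.strip().lower()
        if cleaned in REPRODUCTIVE_STATES:
            accepted.append((specimen_id, cleaned))
        else:
            needs_review.append((specimen_id, state))
    return accepted, needs_review

if __name__ == "__main__":
    ok, review = validate_scores([
        ("SPECIMEN-000123", "Flowering"),
        ("SPECIMEN-000124", "in bud"),    # not in the vocabulary -> flagged for review
    ])
    print(ok, review)
```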
       
  • Data Infrastructure for Scientific and Collections Data in Medium Size
           Institution

    • Abstract: Biodiversity Information Science and Standards 1: e20429
      DOI : 10.3897/tdwgproceedings.1.20429
      Authors : Jiri Frank, Jakub Belka : We at the National Museum in Prague were recently faced with a difficult decision. The question was how to build an optimal and economical infrastructure to store and back up our scientific and collections data, which currently require approximately 150 TB of storage and are growing constantly. In addition to the infrastructure, it was also important to consider a potential LTP (Long Term Preservation) solution. We would like to share our experience and perhaps inspire others with our model. The infrastructure model can be defined by four main elements: visualisation of the data infrastructure, a virtual platform, back-up and LTP. The visualisation is done by constantly updating a data schema that shows the data stores and their connections with virtual platforms. Every data store has a defined data structure. For example, the data storage for collections data reflects their physical structure, location and distribution, so it creates a virtual collections depository divided into collections and sub-collections on various levels. For our virtualisation platform we chose the solution by VMware. This platform creates a data space from high-speed local data stores. This space is used for various database systems in the museum, e.g. for collections management. Those database systems are connected with large-capacity, lower-speed data stores. The infrastructure is designed to guarantee fast access to the databases and metadata with lower requirements for storage capacity. Access to the digitised master files (images etc.) is somewhat slower due to the lower-speed, large-capacity storage volumes. The back-up strategy has two parts. For the virtualisation platform and virtual machines we use the VEEAM back-up system, which works on the basis of reverse incremental backups. The images of virtual machines are backed up on external data storage in a data centre (hosted by a third party). The back-up of the large-capacity, lower-speed data stores is done by incremental back-up to directly connected external data stores. The external data storage in the data centre is replicated in two separate geographical locations. For the future we are planning an LTP strategy for data and also metadata. The best technology at the moment with high capacity, a reasonable price, and long preservation (more than 100 years) is an Optical Disc Archive (ODA). One of the advantages of this technology is the lack of special requirements for temperature, humidity, etc., as well as economical space requirements. The whole system with the LTP solution and technical descriptions will be described as a schema on the poster. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 17:03:29 +0300
       
  • Low Cost Environment Monitoring Server – Wireshield

    • Abstract: Biodiversity Information Science and Standards 1: e20427
      DOI : 10.3897/tdwgproceedings.1.20427
      Authors : Jiri Frank, Lukas Belka : Just as good quality infrastructure is necessary for scientific data and collections, it is important to provide data processing and storage equipment with stable environmental conditions. In these situations, it is important to monitor conditions such as temperature, humidity, etc. In our case, a collections institution where data and especially collections are stored in many locations, we were looking for a reliable and economical solution. After an analysis of available commercial systems we decided to develop our own low-cost environmental monitoring server called Wireshield. This solution has proven to have great potential for use in a wide variety of environmental monitoring situations. The Wireshield temperature monitoring prototype can be simply defined as a single-board Raspberry Pi computer connected with an ATmega microchip, temperature sensors, and software for data interpretation. Data are collected by DS18B20 sensors connected via a 1-Wire bus. Audio cables with jack connectors are used for the data transfer between the sensors and the microchip. The audio cable can be replaced by a UTP (Unshielded Twisted Pair) cable for longer distances. The ATmega microchip is connected with the Raspberry Pi by UART (Universal Asynchronous Receiver/Transmitter) communication. The role of the Raspberry Pi is to interpret the data by using a web server. Wireshield also includes a TFT (thin-film transistor) display showing the actual temperature on each sensor. The display is connected to the ATmega microchip by an SPI (Serial Peripheral Interface) bus. It is possible to configure the colour in which temperatures are displayed and to define limits. When a temperature limit is crossed, the Raspberry Pi immediately sends an email notification to the user. Our prototype was built to monitor temperature, but it can be configured to monitor other conditions (e.g. humidity) depending on sensors and configuration. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 17:00:03 +0300
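For readers who want a feel for the software side, here is a minimal sketch of reading DS18B20 sensors through the standard Linux 1-Wire interface on a Raspberry Pi and emailing an alert when a limit is crossed. The threshold, mail server and addresses are placeholder assumptions; the actual Wireshield prototype reads the sensors through the ATmega over UART and serves the readings via a web server, which this sketch does not reproduce.

```python
import glob
import smtplib
from email.message import EmailMessage

LIMIT_C = 24.0                       # assumed alarm threshold
SMTP_HOST = "smtp.example.org"       # placeholder mail server
ALERT_TO = "collections@example.org" # placeholder recipient

def read_ds18b20():
    """Read all DS18B20 sensors exposed by the Linux w1-gpio/w1-therm drivers."""
    readings = {}
    for path in glob.glob("/sys/bus/w1/devices/28-*/w1_slave"):
        with open(path) as fh:
            raw = fh.read()
        if "YES" in raw:                              # CRC check passed
            millidegrees = int(raw.rsplit("t=", 1)[1])
            sensor_id = path.split("/")[-2]
            readings[sensor_id] = millidegrees / 1000.0
    return readings

def send_alert(sensor_id, temp_c):
    """Email a simple over-limit notification."""
    msg = EmailMessage()
    msg["Subject"] = f"Temperature alert: {sensor_id} at {temp_c:.1f} C"
    msg["From"] = "wireshield@example.org"
    msg["To"] = ALERT_TO
    msg.set_content(f"Sensor {sensor_id} exceeded {LIMIT_C} C (reading: {temp_c:.1f} C).")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for sensor, temp in read_ds18b20().items():
        print(sensor, temp)
        if temp > LIMIT_C:
            send_alert(sensor, temp)
```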
       
  • Using a Deep Convolutional Neural Network for Extracting Morphological
           Traits from Herbarium Images

    • Abstract: Biodiversity Information Science and Standards 1: e20400
      DOI : 10.3897/tdwgproceedings.1.20400
      Authors : Yue Zhu, Thibaut Durand, Eric Chenin, Marc Pignal, Patrick Gallinari, Régine Vignes-Lebbe : Natural history collection data are now accessible through databases and web portals. However, ecological or morphological traits describing specimens are rarely recorded while gathering data. This gap limits queries and analyses. Manual tagging of millions of specimens will be a huge task even with the help of citizen science projects such as “les herbonautes” (http://lesherbonautes.mnhn.fr). On the other hand, deep learning methods that use convolutional neural networks (CNNs) have demonstrated their efficiency in various domains, such as computer vision (Krizhevsky et al. 2012, Azizpour et al. 2016), speech recognition (Abdel-Hamid et al. 2014) and face identification (Li et al. 2015, Freytag et al. 2016). We aim to use deep learning to provide a visual representation of the words used to describe plants (e.g. simple or compound leaf), and to associate those words with specimens in the Paris herbarium. This will provide a semantic description of each of the 7 million images of the fully digitized collection of the Paris herbarium in the Muséum National d'Histoire Naturelle (MNHN, Paris, France). In a proof of concept project, we have used a CNN - pre-trained on the image database ImageNet (http://www.image-net.org) - in order to identify 4 morphological traits, using 103,000 herbarium images from 11 different taxa: margin (entire / dentate), leaf attachment (opposite / alternate), leaf organization (simple / compound), and plant (woody / non-woody) (see Fig. 1). Seventy percent of the images are used to train the weights of the model (in a supervised learning process that uses a training set already tagged for the 4 characteristics), 10% are used as a validation set to tune the hyper-parameters of the model and to avoid overfitting, and 20% are used as a test set to evaluate the generalization performance of the final neural network. The first results are encouraging, with over 80% success on the test set. In a second step, we test whether the neural network is overfitting the training examples or can generalize to new taxa. If we restrict the training set to a small number of taxa (4 taxa containing 76% of the images), the success rate on the 7 other taxa (unseen during training) decreases drastically. A good sampling of the taxonomic diversity of plants appears crucial to train the neural network. A second method visualizes the area of each image that was detected by the CNN as the most important for morphological character recognition (Durand et al. 2017). This method provides an explanatory view of the automatic recognition process. In this poster we describe the methods and results on botanical images for the different taxa. We discuss perspectives on image tagging of the Paris herbarium, and how to combine the citizen science project with automatic CNN image description in order to annotate images. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 15:36:30 +0300
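As a rough sketch of the kind of transfer-learning setup described (an ImageNet-pretrained CNN re-headed to predict four binary traits), the code below adapts a torchvision ResNet for multi-label classification. It is an illustrative outline under stated assumptions, not the authors' actual architecture or training pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TRAITS = 4  # margin, leaf attachment, leaf organization, woody/non-woody

def build_trait_model():
    """ImageNet-pretrained ResNet-50 with a new 4-output head, one logit per trait."""
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_TRAITS)
    return model

def train_step(model, images, trait_labels, optimizer):
    """One optimisation step with a multi-label (one sigmoid per trait) loss."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    logits = model(images)                  # shape: (batch, 4)
    loss = criterion(logits, trait_labels)  # labels are 0/1 floats, shape (batch, 4)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = build_trait_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Dummy batch standing in for preprocessed herbarium sheet crops.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 2, (8, NUM_TRAITS)).float()
    print(train_step(model, images, labels, optimizer))
```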
       
  • Metadata Standards for Genomic Sequence Data: Past and Future of MIxS
           Standards Family

    • Abstract: Biodiversity Information Science and Standards 1: e20423
      DOI : 10.3897/tdwgproceedings.1.20423
      Authors : Pelin Yilmaz : Data-intensive microbial (ecology) research requires approaches to organize information in logical formats to enable comparative studies, replication of experiments, and the development of informatics tools. As the quantity and diversity of genomic data continue to increase at an exponential rate, it is imperative that these data are findable, accessible, interoperable, and reusable through the use of a standard format. Since 2005, the Genomic Standards Consortium (GSC) has operated as an international, open-membership working body of over 500 researchers from 15 countries promoting community-driven efforts for the reuse and analysis of contextual metadata describing the collected sample, the environment and/or the host, and sequencing methodologies and technologies. The GSC’s development and implementation of the Minimum Information Standards (MIxS) established a community-based mechanism for sharing genomic data through a common framework. The GSC’s suite of genomic standards (Field et al. 2008, Yilmaz et al. 2011) has been supported for over a decade by the International Nucleotide Sequence Database Collaboration (INSDC) databases, thus allowing for a complete environmental and experimental description of sequenced samples (Barrett et al. 2012, Cochrane et al. 2015). While broadly accepted and used by the microbial (ecology) research community, MIxS has several shortcomings, as well as areas that require further development. The GSC is committed to engaging domain experts in order to: (i) expand coverage and breadth to accommodate new data types and emerging technologies, (ii) maximize usability, (iii) expedite further evolution according to community needs, and (iv) automate updates of MIxS. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 14:34:55 +0300
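As a purely illustrative sketch of what "minimum information" checking looks like in practice, the snippet below verifies that a sample record carries a handful of contextual fields of the kind MIxS checklists require. The field names are drawn from commonly cited MIxS terms, but the exact mandatory set depends on the checklist and environmental package, so treat this list as an assumption rather than the standard itself.

```python
# A few contextual fields commonly associated with MIxS checklists; the exact
# required set depends on the checklist/package, so this list is illustrative.
MINIMUM_FIELDS = [
    "project_name",
    "collection_date",
    "geo_loc_name",
    "lat_lon",
    "env_broad_scale",
    "env_local_scale",
    "env_medium",
    "seq_meth",
]

def missing_fields(record: dict) -> list:
    """Return the minimum-information fields absent or empty in a sample record."""
    return [f for f in MINIMUM_FIELDS if not str(record.get(f, "")).strip()]

if __name__ == "__main__":
    sample = {
        "project_name": "Example soil survey",
        "collection_date": "2017-06-01",
        "geo_loc_name": "Germany: Bremen",
        "lat_lon": "53.11 N 8.85 E",
        "env_medium": "soil",
        "seq_meth": "Illumina MiSeq",
    }
    print("missing:", missing_fields(sample))   # -> ['env_broad_scale', 'env_local_scale']
```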
       
  • Setting Collections Data Free with the Power of the Crowd: challenges,
           opportunities and a vision for the future

    • Abstract: Biodiversity Information Science and Standards 1: e20422
      DOI : 10.3897/tdwgproceedings.1.20422
      Authors : Margaret Gold : The Natural History Museum in London has embarked on an epic journey to digitise 80 million specimens from one of the world’s most important natural history collections. Publishing these data to our open Data Portal will give the global scientific community access to unrivalled historical, geographic and taxonomic specimen data gathered over the last 250 years. We have been involving the public in this large task by asking digital volunteers to help transcribe information written on specimen labels, via an online crowdsourcing interface, such that anyone in the world can participate. In this talk we will share our experience with these projects over the past year. Within the Digital Collections Programme at the Natural History Museum London, we have so far digitally imaged approximately 200,000 of our total holdings of around two million microscope slides, and from these have selected two discrete collections for crowdsourcing the transcription effort. Miniature Lives Magnified (now complete) contained 6,285 digitised Chalcidoidea slides from our Hymenoptera collection, and Miniature Fossils Magnified (not yet complete at the time of writing) contains 2,000 Foraminifera slides from our Micropalaeontology collection. Both projects are hosted on the Notes from Nature platform, which has been built on the Zooniverse Panoptes open-project platform. As the lead partner of the Crowdsourcing task within SYNTHESYS3, an EC-funded project creating an integrated European infrastructure for natural history collections, we have partnered with our fellow consortium members to help them design and launch crowdsourcing projects of their own. These have included an Amaranthaceae collection with 444 specimens and a Primulaceae collection with 3,093 specimens hosted on Notes from Nature, and a Brachiopod collection with 1,810 specimens built directly on Panoptes. In this talk we will share the key insights gained through practical experience with this wide range of specimens, specimen data, and label styles. In particular, we have gained insights into the design of the workflow and interface, such as the considerable reduction in human error when drop-down menu options are introduced where possible, rather than free-text data entry fields. These five projects provided us with a unique opportunity to compare the dedicated Notes from Nature platform, which has significant advantages due to the size and engagement level of the existing community, with the open project-building Panoptes platform, which has storytelling advantages in terms of the capacity to provide more information about a specific collection, its subject, and its underlying scientific importance. A crucial element of running successful crowdsourcing projects is building an engaged community of digital volunteers. We compared the use of social media channels with more traditional Museum communication channels (such as the e-newsletter and website), and found that the latter had the most reach in terms of raising awareness of the projects, but that the former enabled more frequent and varied engagement with a potential volunteer audience. However, when examining which metrics are the most important to track in assessing the success of various initiatives, we found that the highest impact on the ultimate volume of transcription came from in-house volunteering days run in person, rather than online. 
In reaching out and engaging with a diverse range of volunteer audiences, we found evidence of the major sources of motivation that are described in the existing citizen science literature, but also more nuanced insight into behaviours such as pursuing independent learning, the desire to enter all of the information even when not requested, and preferring tasks that can be performed by rote. Our efforts to support and nurture the existing Notes from Nature community confirmed the importance of the principle of ‘giving-back’, and gave us insight into how to do this when research results emerge over a longer timeline than is typical of field-based citizen science projects. And finally, we will share our experience with the behind-the-scenes elements of crowdsourcing - the parts the ‘crowd’ doesn’t see - such as data quality assessment, data ingestion, data publication, and the flow of data between internal systems. In conclusion we will propose some visions of the future, such as moving towards a global platform for specimen label transcription with a shared underlying database infrastructure, how to deepen the engagement of digital volunteers from transcription tasks to scientific observations, and ways to bring online crowdsourcing and field-based citizen science together in a more streamlined way. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 12:31:00 +0300
       
  • The state of the data in citizen science

    • Abstract: Biodiversity Information Science and Standards 1: e20370
      DOI : 10.3897/tdwgproceedings.1.20370
      Authors : Anne Bowser, Caren Cooper : Citizen science has contributed to biodiversity research and monitoring for hundreds of years. Still, the recent increase in the scale, scope, diversity and number of citizen science projects highlights the challenge of designing and implementing good practices around data collection and data curation. The Committee on Data for Science and Technology of the International Council for Science (ICSU-CODATA) and the World Data System (WDS) recently founded a joint Task Group to understand and support good practices for citizen science data validation, data cleaning and curation, and short- and long-term data management. Research projects conducted by the ICSU-CODATA-WDS Task Group include the development of an initial typology of citizen science data generating tasks, and an exploratory landscape analysis of the state of the data in citizen science. The landscape analysis found that citizen science projects use a wide range of strategies for data validation at numerous stages of the scientific research process. In comparison, practices for data documentation, curation, and long-term management are less advanced. This may limit data discovery and re-use. This work complements the planned and ongoing efforts of the TDWG Citizen Science Interest Group to advance biodiversity informatics for citizen science. Presenting research on the state of the data in citizen science can promote cross-pollination between the ICSU-CODATA-WDS Task Group and the biodiversity community, and encourage researchers and practitioners to work together to advance citizen science data quality, standards, and interoperability. HTML XML PDF
      PubDate: Fri, 18 Aug 2017 2:17:47 +0300
       
  • Three years of Xper3 assessment: towards sharing semantic taxonomic
           content of identification keys

    • Abstract: Biodiversity Information Science and Standards 1: e20382
      DOI : 10.3897/tdwgproceedings.1.20382
      Authors : Amélie Pinel, Sylvain Bouquin, Estelle Bourdon, Adeline Kerner, Régine Vignes-Lebbe : Xper3 is a collaborative system that manages structured descriptive data on taxa or specimens. It is available online and linked to web services, including two services for identification: a free (multiple) access key (Vignes Lebbe et al. 2015) and a single-access key (Burguiere et al. 2013). These web services use the TDWG Structured Descriptive Data format (SDD) (Hagedorn et al. 2005). The Xper3 platform was launched in November 2013. Three years later, 1990 users had created accounts and edited 2499 knowledge bases (KBs). Unfortunately, there exists no public overview of the existing content. Each KB is autonomous and can be published as a free access key (e.g., http://cochenilles.bio-agri.org/mkey.html). KB owners are free to publicize their keys in publications (Padovan and Magenta 2015; Engel et al. 2016) or on websites (http://acrinwafrica.mnhn.fr/SiteAcri/Xper.html) (http://herbaria.plants.ox.ac.uk/bol/caricaceae/Key). This has two consequences: there may be duplicate content or overlapping effort (e.g., several keys on orchids); and characters and states, documented by texts and images, cannot be used for building another key without making copies. In order to address the first problem, we analyse Xper3 metadata (e.g., name of KB, owner, number of contributors, date of creation, date of last modification) and provide an overview of the existing content. Firstly, 48% of KBs are empty or inactive, with extremely limited content (fewer than 3 taxa or characters) that has not been accessed for a long time. We discard these KBs in our analysis and only consider the 1300 active KBs. We also discard “test” KBs and duplicate KBs. Surprisingly, we discovered 15 medical KBs for the diagnosis of various diseases and 34 non-taxonomic KBs (e.g., wine, fashion, computing equipment). For the taxonomic KBs, we present their taxonomic and geographic distribution (angiosperms and arthropods are the prevailing taxa) and compare it with the number of known species. The second point concerns semantic data sharing. We compute the rate of terms (characters, character states) duplicated across several KBs in the same taxonomic groups, in order to evaluate the interest of sharing resources between Xper3 KBs. Then we look for existing ontologies in the BioPortal (https://bioportal.bioontology.org/ontologies). Although Xper3 can manage structured data, its data model does not use an ontology language like RDF*1 or OWL*2 (ontology languages and tools offer unique identifiers and reasoning mechanisms), and each KB has its own vocabulary. We will be contacting KB owners to obtain more detailed metadata and to facilitate the automatic publishing of authorized KBs on the Xper3 website. We also plan to implement an easy link between Xper3 and external ontologies to help in editing new KBs and to export KB data models to existing ontologies. HTML XML PDF
      PubDate: Thu, 17 Aug 2017 21:01:19 +0300
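The duplication-rate computation mentioned above can be pictured as a simple overlap count across knowledge bases within a taxonomic group. The sketch below is a toy version working on plain term strings with invented knowledge bases; the real analysis operates on Xper3's SDD exports and metadata.

```python
from collections import Counter
from itertools import combinations

def duplication_rate(kb_terms: dict) -> float:
    """Fraction of distinct terms (characters/states) that occur in more than one KB."""
    counts = Counter(term for terms in kb_terms.values() for term in set(terms))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c > 1) / len(counts)

def pairwise_overlap(kb_terms: dict):
    """Shared-term counts for every pair of KBs, useful for spotting near-duplicates."""
    return {
        (a, b): len(set(kb_terms[a]) & set(kb_terms[b]))
        for a, b in combinations(kb_terms, 2)
    }

if __name__ == "__main__":
    kbs = {   # invented example: three small orchid keys
        "orchids_A": ["lip colour", "leaf shape", "flower count"],
        "orchids_B": ["lip colour", "leaf shape", "spur length"],
        "orchids_C": ["leaf shape", "pseudobulb presence"],
    }
    print(duplication_rate(kbs))     # 2 of 5 distinct terms are shared -> 0.4
    print(pairwise_overlap(kbs))
```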
       
  • Catalogue of Life, China and Taxonomic Tree Tool

    • Abstract: Biodiversity Information Science and Standards 1: e20394
      DOI : 10.3897/tdwgproceedings.1.20394
      Authors : Liqiang Ji : Since 2008, the Species 2000 China Node, with the support of the Chinese Academy of Sciences' Biodiversity Committee, has organized scientists to compile and release the Catalogue of Life, China (CoL China) each year. It follows the Standard Data Set of Species 2000’s global Catalogue of Life to collect and release Chinese species data. To meet local requirements, a formal Chinese name and its Pinyin (a romanized form of the name) are appended to species records. The data items include the accepted scientific name for the species, Chinese name, synonyms, common names, latest taxonomic scrutiny, source database, family, classification above family and highest taxon in the database, distribution, and references. A dynamic distribution map can be shown for each species in the checklist. The CoL China 2017 Annual Checklist was released in July 2017 by the Chinese Academy of Sciences and the Ministry of Environment Protection in Beijing. The groups of species in the 2017 Annual Checklist and their numbers of accepted species names include Animalia (38,631), Bacteria (469), Chromista (2,239), Fungi (4,273), Plantae (44,041), Protozoa (1,843) and viruses (805). Users may access CoL China data via the website (http://www.sp2000.org.cn) or download data via an API (Application Program Interface). We developed a platform for species data collection and the on-line Taxonomic Tree Tool (TTT, http://ttt.biodinfo.org/) for data analysis, which integrates animal data with plant and microbial data into the annual checklists and maintains the CoL China database system. TTT is a web-based platform for managing and comparing taxonomic trees. It allows users to create their own taxonomic trees in four ways - manual input, uploading XML (Extensible Markup Language) files, manually selecting taxa from template trees provided by TTT, or automatically selecting taxa from template trees according to a species list. Users can share their trees with registered users and compare them with the public trees. TTT provides a tool for comparing different trees to highlight the spots where more attention should be paid by taxonomists or informatics scientists. The comparison tool explores the taxonomic relationships from two trees and classifies the differences into contrasting types of relationships. The tool helps find the differences between the taxonomic positions of taxon A and taxon B and highlights these explicitly. Furthermore, it calculates the similarity of branches from the two compared trees to help taxonomists judge whether the taxon groups chosen are the same or whether it is necessary to continue drilling down the taxonomic trees to explore more differences. TTT can extract the common or different parts of two compared trees, and the result can be exported for further tree integration research. HTML XML PDF
      PubDate: Thu, 17 Aug 2017 6:23:19 +0300
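To give a flavour of the kind of comparison TTT performs, the sketch below represents two taxonomic trees as child-to-parent mappings and reports taxa missing from one tree or placed under different parents, plus a simple branch-similarity score. It is a toy illustration with invented example trees, not TTT's actual algorithm or data model.

```python
def compare_trees(tree_a: dict, tree_b: dict):
    """Compare two taxonomies given as {taxon: parent} mappings.

    Returns taxa only in A, taxa only in B, and taxa placed under different parents.
    """
    only_a = sorted(set(tree_a) - set(tree_b))
    only_b = sorted(set(tree_b) - set(tree_a))
    moved = sorted(t for t in set(tree_a) & set(tree_b) if tree_a[t] != tree_b[t])
    return only_a, only_b, moved

def branch_similarity(tree_a: dict, tree_b: dict) -> float:
    """Jaccard similarity of the (taxon, parent) edges of the two trees."""
    edges_a, edges_b = set(tree_a.items()), set(tree_b.items())
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

if __name__ == "__main__":
    checklist_2016 = {"Panthera": "Felidae", "Felis": "Felidae", "Canis": "Canidae"}
    checklist_2017 = {"Panthera": "Felidae", "Felis": "Felidae", "Lynx": "Felidae", "Canis": "Caninae"}
    print(compare_trees(checklist_2016, checklist_2017))   # ([], ['Lynx'], ['Canis'])
    print(round(branch_similarity(checklist_2016, checklist_2017), 2))
```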
       
  • BHL: A Source for Big Data Analysis

    • Abstract: Biodiversity Information Science and Standards 1: e20339
      DOI : 10.3897/tdwgproceedings.1.20339
      Authors : Mike Lichtenberg : The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize taxonomic literature and to make that literature available to a global audience for open access and responsible use as a part of a global “biodiversity commons”. In partnership with the Internet Archive and through local digitization efforts, BHL has digitized more than 200,000 volumes of taxonomic literature.  Using the Global Names Recognition and Discovery (GNRD) service, BHL has identified over 177 million instances of species names (including more than 29 million unique names) within the text. This content, which includes over 52 million pages of text, provides a rich unstructured source of biodiversity big data that is associated with taxonomic and bibliographic metadata. BHL allows users to search the collection, read the texts online, and download select pages or entire volumes as PDF files. More importantly, BHL makes the source data available for reuse and Big Data analysis via a number of different services. These services include direct downloads of data files and machine interfaces. This talk will describe the downloads and machine interfaces through which BHL source data is available. Each service will be introduced and described.  Where feasible, the usage of each service will be demonstrated. Theoretical examples of how each service could be used to facilitate Big Data analysis will be provided. HTML XML PDF
      PubDate: Wed, 16 Aug 2017 23:32:45 +0300
       
  • Documenting Marine Species Traits in the World Register of Marine Species
           (WoRMS): Current status, Future Plans and Encountered Challenges

    • Abstract: Biodiversity Information Science and Standards 1: e20337
      DOI : 10.3897/tdwgproceedings.1.20337
      Authors : Leen Vandepitte, Simon Claus, Stefanie Dekeyzer, Sofie Vranken, Wim Decock, Bart Vanhoorne, Francisco Hernandez : Describing species patterns and the underlying processes that explain these patterns is essential to assessing the status and future evolution of marine ecosystems. This requires biological information on functional and structural species traits such as feeding ecology, body size, reproduction, life history, etc. To accommodate this need, the World Register of Marine Species (WoRMS) (WoRMS Editorial Board 2017) is expanding its content with trait information (Costello et al. 2015), subdivided into 3 main categories: (1) taxonomy-related traits, e.g. paraphyletic groups; (2) biological and ecological traits, i.e. specific characteristics of a taxon, e.g. body size or feeding type; and (3) human-defined traits, e.g. the legal protection status of species, or whether a species is introduced, harmful, or used as an ecological indicator. Initially, priority was given to the inclusion of traits that could be applied to the majority of marine taxa and for which the information was easily available. The main driver for this approach was that the inclusion of these traits should result in new research, which in turn would drive improvements in the quality and quantity of trait information. Pilot projects were carried out for different species groups, allowing a thorough documentation of a selection of traits. In parallel, a standard vocabulary was put together (http://www.marinespecies.org/traits/wiki/), based on already existing resources, to cover all marine life. All documented traits needed to be compliant with this vocabulary, in order to make the data as widely usable as possible across groups. Defining a trait across all marine life is not trivial, as scientists can use terms in different ways between groups. This stresses how important it is for users to be aware of these differences in terminology before they analyse a trait across all taxa. Some traits were thought to be quite straightforward to document, although practice proved otherwise. One such trait is body size, where the aim was to document the numerical value of the ‘maximum body size in length’. In reality, a lot of variation is possible (e.g. for fish: fork length versus standard length) and maximum size is not always considered relevant from an ecological point of view. On the other hand, documenting numerical body size for each marine species is quite time consuming. Therefore, a complementary size trait will be documented, indicating whether taxa are considered micro, meio, macro or mega. Whereas the initial approach was to complete the register for each tackled trait relevant for all marine species, we now complement this by (1) documenting several traits within a specific group, regardless of whether these traits are also present in other taxon groups, and (2) documenting one specific trait covering a variety – but not all – of taxonomic groups, e.g. the composition of the skeleton for calcareous animals. Where possible, we aim to document a trait at a higher taxonomic level to allow the work to progress more rapidly. As the database allows top-down inheritance of traits, exceptions can easily be documented. In addition, collaborations are sought with already running initiatives such as the Encyclopedia of Life. Very soon, all the documented traits will be searchable through the Marine Species Traits Portal. 
The human-defined traits are already accessible through the EMODnet Biology Portal (http://www.emodnet-biology.eu/toolbox), in combination with distribution information from the European Ocean Biogeographic Information System (EurOBIS; www.eurobis.org; Vandepitte et al. 2011; Vandepitte et al. 2015) and taxonomy from WoRMS (www.marinespecies.org). Through the LifeWatch Taxonomic Backbone (LW-TaxBB) (http://www.lifewatch.be/data-services/), services are offered to access these traits, combined with data and information from other resources such as WoRMS and (Eur)OBIS. We would like to acknowledge the EMODnet Biology and LifeWatch projects; within LifeWatch, the Flanders Marine Institute (VLIZ) – host institute of WoRMS – is responsible for the development of the LW-TaxBB. Both projects provide funding for the documentation of trait data and the development of services allowing researchers to easily access the available data, in combination with data from other sources. HTML XML PDF
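      As an illustration of how the documented traits can be retrieved programmatically, the following is a minimal Python sketch against the public WoRMS REST services; the endpoint names and response field names are assumptions based on the WoRMS REST documentation and should be verified there before use.

      import requests

      BASE = "https://www.marinespecies.org/rest"  # assumed base URL of the WoRMS REST services

      def aphia_id_for(name):
          # Resolve a scientific name to its WoRMS AphiaID.
          r = requests.get(f"{BASE}/AphiaIDByName/{name}", params={"marine_only": "true"}, timeout=30)
          r.raise_for_status()
          return r.json()

      def attributes_for(aphia_id):
          # Fetch the documented trait ("attribute") records for a taxon.
          r = requests.get(f"{BASE}/AphiaAttributesByAphiaID/{aphia_id}", timeout=30)
          r.raise_for_status()
          return r.json() if r.text else []   # an empty (204) response means no traits documented

      if __name__ == "__main__":
          taxon = "Mytilus edulis"   # example taxon, chosen only for illustration
          for attribute in attributes_for(aphia_id_for(taxon)):
              # each attribute record carries a measurement type and value drawn from the standard vocabulary
              print(attribute.get("measurementType"), attribute.get("measurementValue"))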
      PubDate: Wed, 16 Aug 2017 23:00:30 +0300
       
  • Automated Herbarium Specimen Identification using Deep Learning

    • Abstract: Biodiversity Information Science and Standards 1: e20302
      DOI : 10.3897/tdwgproceedings.1.20302
      Authors : Jose Carranza-Rojas, Alexis Joly, Pierre Bonnet, Hervé Goëau, Erick Mata-Montero : Hundreds of herbarium collections have accumulated a valuable heritage and knowledge of plants over several centuries (Page et al. 2015). Recent initiatives, such as iDigBio (https://www.idigbio.org), aggregate data and images of vouchered herbarium sheets (and other biocollections) and make this information available to botanists and the general public worldwide through web portals. These ambitious plans to transform these historical biodiversity data into digital format and preserve them are supported by the United States National Science Foundation (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, and the digitization is done by the Thematic Collections Networks (TCNs) funded under that program. However, thousands of herbarium sheets are still unidentified at the species level, while numerous sheets should be reviewed and updated following more recent taxonomic knowledge. These annotations and revisions require an unrealistic amount of work for botanists to carry out in a reasonable time (Bebber et al. 2010). Computer vision and machine learning approaches applied to herbarium sheets are promising (Wijesingha and Marikar 2012) but are still not well studied compared to automated species identification from leaf scans or pictures of plants taken in the field. In a recent study, we evaluated the accuracy with which herbarium images can be exploited for species identification with deep learning technology (Carranza-Rojas et al. 2017), particularly Convolutional Neural Networks (CNNs) (Szegedy et al. 2015). This type of network allows automatic learning of the most prominent visual patterns in the images, since such networks are trainable end-to-end (and thus differentiable), as opposed to previous approaches that use custom, hand-made feature extractors. A first challenge is to use herbarium sheet images alone to automatically identify the species of the plants mounted on them. Secondly, we study whether combining herbarium sheet images with photos of plants in the field (Joly et al. 2015, Carranza-Rojas and Mata-Montero 2016) is a viable way to train models that provide accurate identifications. Finally, we explore whether herbarium images from one region with a specific flora can be used, via transfer learning (a deep learning technique in which a model trained on one dataset provides the starting weights for training another model), to identify species in another region, for example one under-represented in terms of collected data. Our evaluation shows that the accuracy of species identification with deep learning technology, based on herbarium images, reaches 90.3% on a dataset of more than 1,200 European plant species. This could potentially lead to the creation of a semi- or even fully automated system to help taxonomists and experts with their annotation, classification, and revision work. In this paper, we take a closer look at the accuracy levels achieved with respect to the first two challenges. We evaluate the accuracy levels for each species included in the dataset, which encompasses 253,733 images of 1,204 species. HTML XML PDF
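      For readers unfamiliar with the technique, the following is a minimal, illustrative Keras sketch of transfer learning on herbarium sheet images; it is not the authors' training pipeline, and the directory path, image size and hyperparameters are placeholders.

      import tensorflow as tf

      IMG_SIZE = (299, 299)
      train_ds = tf.keras.utils.image_dataset_from_directory(
          "herbarium_sheets/train", image_size=IMG_SIZE, batch_size=32)   # hypothetical folder of sheets per species
      num_species = len(train_ds.class_names)

      # Pre-trained convolutional base with a new classification head for the herbarium species.
      base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                               pooling="avg", input_shape=IMG_SIZE + (3,))
      base.trainable = False   # freeze the transferred weights for the first training phase
      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),   # InceptionV3 expects inputs in [-1, 1]
          base,
          tf.keras.layers.Dense(num_species, activation="softmax"),
      ])
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
      model.fit(train_ds, epochs=5)
      # A second phase would typically unfreeze some top layers of the base network and continue
      # training with a lower learning rate, adapting the transferred features to the new flora.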
      PubDate: Wed, 16 Aug 2017 20:23:44 +0300
       
  • Using YesWorkflow hybrid queries to reveal data lineage from data curation
           activities

    • Abstract: Biodiversity Information Science and Standards 1: e20380
      DOI : 10.3897/tdwgproceedings.1.20380
      Authors : Qian Zhang, Paul J. Morris, Timothy McPhillips, James Hanken, David Lowery, Bertram Ludäscher, James Macklin, Robert Morris, John Wieczorek : The YesWorkflow toolkit (McPhillips et al. 2015b, McPhillips et al. 2015a) was designed to annotate data curation workflows in conventional scripts (e.g., Python, R, Java), but it can also be used to annotate YAML-based Kurator workflow configuration files. From an annotated file alone, YesWorkflow is able to render a top-level graphical view of the workflow structure (prospective provenance), including system inputs and outputs, actors, connections among those actors, and the data expected to be passed on those connections. YesWorkflow also supports dynamic analysis and reporting on the results of the workflow (retrospective provenance) at various levels of granularity (e.g., at the actor level, script level, data level, record level, file level, function level), provided that it has been configured at each level. YesWorkflow includes an @Log annotation, which describes the semantic structure of a log message within some actor in the workflow and allows the log message to be linked to the actor within which it was created, and parts of that log message to be linked to the data passed between actors. YesWorkflow can be used to analyze the log messages after a run of the workflow and construct a store of facts, which can be queried and reasoned upon to make statements about the evolving paths taken by particular data elements through the workflow and the assertions made about those data elements within the workflow. Provenance, like other metadata, appears to be rarely actionable or immediately useful for those who are expected to provide it. However, by refactoring and integrating runtime observables generated from retrospective provenance and context information from prospective provenance analysis into hybrid queries, we show how both elements can yield hybrid visualizations that reveal “the plot” of the whole execution. In this way, a comprehensive workflow graph and a customizable data lineage report are made actionable for a workflow run with meaningful provenance artifacts. Queries run on a set of facts extracted from log messages by YesWorkflow after a workflow run, in combination with the facts extracted from the annotated workflow itself, allow for powerful visualizations of the retrospective provenance of a workflow run and of particular data records within a branching workflow. HTML XML PDF
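      A minimal sketch of what such annotations look like when embedded in an ordinary Python script follows; the @log template syntax shown is an approximation of the annotation described above and should be checked against the YesWorkflow documentation.

      # @begin clean_dates
      # @param input_file
      # @in raw_records @uri file:{input_file}
      # @out cleaned_records @uri file:cleaned.csv
      def clean_dates(input_file):
          import csv, logging
          with open(input_file, newline="") as src, open("cleaned.csv", "w", newline="") as dst:
              reader = csv.DictReader(src)
              writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
              writer.writeheader()
              for row in reader:
                  # @log standardized eventDate {eventDate} in record {occurrenceID}
                  logging.info("standardized eventDate %s in record %s",
                               row.get("eventDate"), row.get("occurrenceID"))
                  writer.writerow(row)
      # @end clean_dates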
      PubDate: Wed, 16 Aug 2017 18:07:18 +0300
       
  • Fitness-for-Use-Framework-aware Data Quality workflows in Kurator

    • Abstract: Biodiversity Information Science and Standards 1: e20379
      DOI : 10.3897/tdwgproceedings.1.20379
      Authors : Paul J. Morris, James Hanken, David Lowery, Bertram Ludäscher, James Macklin, Timothy McPhillips, Robert Morris, John Wieczorek, Qian Zhang : In the Kurator project, we are developing libraries of small modules, each designed to address a particular data quality test. These libraries, which can be run on single computers or scalable architecture, can be incorporated into data management processes in the form of customizable data quality scripts. A script composed of these modules can be incorporated into other software, run as a command-line program, or provided as a suite of “canned” workflows through a web interface. In some of these modules, we have implemented a subset of the standard tests under development by Task Group 2 (TG2) of the Data Quality Interest Group. We have also been exploring use of the fitness-for-use framework (Veiga et al. 2017) produced by Task Group 1 (TG1) of the Data Quality Interest Group. Our goals have been to explore use of the framework to describe the capabilities of atomic modules of code, how we can use concepts in the framework to produce data quality reports, and what lessons can be learned from implementing data quality control code in the context of the framework. We have focused on the Data Quality Reports level of the framework; in particular, the representation of Data Quality Measures (measurements on some data quality dimension), Validations (tests for compliance with quality needs), and Amendments (proposals to improve data quality). At the implementation level, we have developed a set of Java annotations to mark methods as providing specific tests from the test suite. In terms of the framework, the annotations can also be used to mark methods as providing Measures, Validations, or Amendments and to associate method parameters with Information Elements by linking them to the Darwin Core terms that were either "acted upon" or "consulted" (Lowery et al. 2016). These annotations can be used by a consumer to identify and run Measures and Validations in two phases: a Pre-Amendment phase, before the Amendments are run; and a Post-Amendment phase, after the changes proposed by the Amendments have been applied. Capturing the test results across both stages allows us to report on how much accepting the amendments would improve the quality of the dataset as a whole, of data in some quality dimension, or of data for some specific purpose. We have found it important to be able to render data quality reports that identify which Darwin Core terms are the Valuable Information Elements involved in a specific test, and, further, to identify which terms are acted upon and which are consulted. Identifying this information allows us, for example, to render tabular reports highlighting cells where amendments have proposed a change. We have also found reporting of error and failure conditions to be important, and have been working on implementing the TG1 suggestion that report elements consist of a result (containing only appropriate values), a status (containing a controlled vocabulary term such as completed, or data_prerequisites_not_met), and a human readable message (metadata explaining why the conclusion was drawn, or error messages). 
We have developed a stake-in-the-ground vocabulary for status values to describe failure conditions, including the following concepts: Ambiguous (there was a result, but it has ambiguity, e.g., an event date inferred from the verbatim event date 04/05/1954), Internal Prerequisites Not Met (the test could not be run on the data provided, e.g., day was not an integer), and External Prerequisites Not Met (some external resource that the test consults was unavailable at runtime). In implementing tests in the context of the framework, we have seen the value of identifying Measures, Validations, and Amendments in forcing us to develop small, focused tests, and in allowing us to group assertions within data quality reports based on data quality needs. HTML XML PDF
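      As an illustrative analogue (the Kurator implementation itself uses Java annotations), the report elements described above can be sketched in Python as follows; the status vocabulary mirrors the terms named in this abstract.

      from dataclasses import dataclass
      from enum import Enum

      class Status(Enum):
          COMPLETED = "completed"
          AMBIGUOUS = "ambiguous"
          INTERNAL_PREREQUISITES_NOT_MET = "internal_prerequisites_not_met"
          EXTERNAL_PREREQUISITES_NOT_MET = "external_prerequisites_not_met"

      @dataclass
      class ValidationResult:
          result: object    # only populated with appropriate values (e.g. "COMPLIANT" / "NOT_COMPLIANT")
          status: Status    # controlled vocabulary term describing how the test ran
          message: str      # human readable explanation of why the conclusion was drawn

      def validate_day(record):
          # Validation acting upon dwc:day: is the value an integer between 1 and 31?
          value = record.get("day")
          try:
              day = int(value)
          except (TypeError, ValueError):
              return ValidationResult(None, Status.INTERNAL_PREREQUISITES_NOT_MET,
                                      f"day '{value}' is not an integer")
          compliant = 1 <= day <= 31
          return ValidationResult("COMPLIANT" if compliant else "NOT_COMPLIANT",
                                  Status.COMPLETED, f"day={day} checked against the range 1-31")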
      PubDate: Wed, 16 Aug 2017 17:47:05 +0300
       
  • The EDIT Platform for Cybertaxonomy, a Brief Overview

    • Abstract: Biodiversity Information Science and Standards 1: e20368
      DOI : 10.3897/tdwgproceedings.1.20368
      Authors : Andreas Kohlbecker, Andreas Müller, Walter Berendsohn, Anton Güntsch, Katja Luther, Patrick Plitzner : The Platform for Cybertaxonomy (FUB, BGBM 2011) is an open-source software framework covering the full breadth of the taxonomic workflow, from fieldwork to data publication. It provides a set of tools for editing and management of taxonomic data (individually or collaboratively), fully customizable on-line access to that data, and other means of data publication and data exchange (Ciardelli et al. 2009). The EDIT Platform was originally devised for the EU-funded project "European Distributed Institute of Taxonomy" (hence the EDIT name); its development started in 2006, and early developments are summarised in Berendsohn (2010). At the core of the platform is the EDIT Common Data Model (CDM), which offers a comprehensive information model covering all data domains that are relevant in this context: names and classifications, descriptive data, media, geographic information, literature, specimens, and persons (see Müller & al., this session). Apart from its role as a software suite supporting the taxonomic workflow, the Platform can be seen as a powerful information broker for a broad range of taxonomic data, providing solid and open interfaces, including a Java programmer’s library and a REST service layer. A number of CDM-based applications have been developed, the most important of which are the Taxonomic Editor (TaxEditor), the CDM Data Portal, and the Platform Web Services. The Eclipse-based TaxEditor allows accessing and editing all details of the data in the CDM in form-based windows, but also provides innovative features such as parsing of nomenclatural data (names, authors, references) from free-text entry or paste, and automatically calculated "cache" fields that may be protected to allow preliminary, non-atomised data entry. The TaxEditor as well as the underlying CDM database instance are highly configurable, so they can be adapted to the project at hand. The taxonomic tree can be displayed and used for navigation and for restructuring by drag and drop. Apart from the core taxonomic name and classification functionality, there are editors for multimedia object metadata, for the features (descriptive data items) used, for identification keys, and for the specimen hierarchy, and even an alignment tool for DNA sequences (Plitzner & al., this session). A “power user interface” presents the data in spreadsheet-like fashion and allows bulk editing and data cleaning. The CDM Data Portal is a Drupal website enhanced with full access to the Platform Web Services. It provides all the configuration options of the Drupal content management system in addition to full and highly configurable access to CDM content. Geographic distribution maps (both area and point maps) use the services provided by the EDIT partner at the Royal Museum for Central Africa in Tervuren (Roca et al. 2009). The EDIT Platform Web Services are further detailed in Güntsch & al. (this session). A fair number of tools exist that allow the editing of descriptive information and identification keys, so it was decided to couple the platform with the Xper2 software developed in Paris (Venin et al. 2010). The respective data structures are present in the CDM, but up to now no direct editing of atomised descriptive information is possible with CDM-based tools. 
The EDIT Platform for Cybertaxonomy is used by numerous biodiversity research initiatives, from regional and monographic floristic and faunistic projects to international checklists and biodiversity information portals (see FUB, BGBM 2016 for a list of publicly accessible Data Portals). HTML XML PDF
      PubDate: Wed, 16 Aug 2017 14:34:23 +0300
       
  • A Comprehensive and Standards-Aware Common Data Model (CDM) for Taxonomic
           Research

    • Abstract: Biodiversity Information Science and Standards 1: e20367
      DOI : 10.3897/tdwgproceedings.1.20367
      Authors : Andreas Müller, Walter Berendsohn, Andreas Kohlbecker, Anton Güntsch, Patrick Plitzner, Katja Luther : The EDIT Common Data Model (CDM) (FUB, BGBM 2008) is the centrepiece of the EDIT Platform for Cybertaxonomy (FUB, BGBM 2011, Ciardelli et al. 2009). Building on modelling efforts reaching back to the 1990s, it aims to combine existing standards relevant to the taxonomic domain (but often designed for data exchange) with the requirements of modern taxonomic tools. Modelled in the Unified Modelling Language (UML) (Booch et al. 2005), it offers an object-oriented view of the information domain managed by expert taxonomists, implemented independently of the operating system and database management system (DBMS) used. Being used in various national and international research projects with diverse foci over the past decade, the model evolved and became the common base of a variety of taxonomic projects, such as floras, faunas and checklists (see FUB, BGBM 2016 for a number of data portals created and made publicly available by different projects). The CDM is strictly oriented towards the needs of the taxonomic expert community. Where requirements are complex, it tries to reflect them reasonably rather than introducing ambiguity or reduced functionality via (over-)simplification. Where simplification is possible, it tries to stay or become simple. Simplification on the model level is achieved by implementing business rules via constraints rather than via typification and subclassing. Simplification on the user interface level is achieved by numerous options for customisation. Being used as a generic model for a variety of application types and use cases, it is adaptable and extendable by users and developers. It uses a combination of static and dynamic typification to allow both efficient handling of complex but well-defined data domains, such as taxonomic classifications and nomenclature, and of less well-defined, flexible domains like factual and descriptive data. Additionally, it allows the creation of more than 30 types of user-defined vocabularies, such as those for taxonomic rank, nomenclatural status, name-to-name relationships, geographic area, presence status, etc. A strong focus is placed on good scientific practice by making the source of almost all data citable in detail and offering data lineage to trace data back to its roots. It is also easy to reflect multiple opinions in parallel, e.g. differing taxonomic concepts (Berendsohn 1995, Berendsohn & al., this session) or several descriptive treatments obtained from different regional floras or faunas. The CDM attempts to comprehensively cover the data used in the taxonomic domain: nomenclature, taxonomy (including concepts), taxon distribution data, descriptive data of all kinds, including morphological data referring to taxa and/or specimens, images and multimedia data of various kinds, and a complex system covering specimens and specimen derivatives down to DNA samples and sequences (Kilian et al. 2015, Stöver and Müller 2015) that mirrors the complexity of knowledge accumulation in the taxonomic research process. In the context of the EDIT Platform, several applications have been developed based on the CDM and on the library that provides the API and web service interfaces for the CDM (see Kohlbecker & al. and Güntsch & al., this session). In some areas the CDM is still evolving; although the basic structures are present, questions of application development feed back into modelling decisions. 
Admittedly, a "no-shortcuts" approach to modelling has variously delayed application development in the past, but it now pays off: the Platform can rapidly adapt to changing requirements from different projects and taxonomic specialists. HTML XML PDF
      PubDate: Wed, 16 Aug 2017 14:31:49 +0300
       
  • EDIT Platform Web Services in the Biodiversity Infrastructure Landscape

    • Abstract: Biodiversity Information Science and Standards 1: e20363
      DOI : 10.3897/tdwgproceedings.1.20363
      Authors : Anton Güntsch, Andreas Kohlbecker, Andreas Müller, Walter Berendsohn, Katja Luther, Patrick Plitzner : The EDIT Platform for Cybertaxonomy is a standards-based suite of software components supporting the taxonomic research workflow from fieldwork to publication in journals and dynamic web portals (FUB, BGBM 2011). The underlying Common Data Model (CDM) covers the main biodiversity informatics foci, such as names, classifications, descriptions, literature and multimedia, as well as specimens and observations and their derived objects. Today, more than 30 instances of the platform are serving data to the international biodiversity research communities. An often overlooked feature of the platform is its well-defined web service layer, which provides powerful functions for machine access and integration into the growing service-based biodiversity informatics landscape (FUB, BGBM 2010). All platform instances have a pre-installed and open service layer serving three different use cases. The CDM REST API provides a platform-independent RESTful (read-only) interface to all resources represented in the CDM. In addition, a set of portal services has been designed to meet the special functional requirements of CDM data portals and their advanced navigation capabilities. While the "raw" REST API already has all functions for searching and browsing the entire information space spanned by the CDM, the integration of CDM services into external infrastructures and workflows requires an additional set of streamlined service endpoints with a special focus on documentation and version stability. To this end, the platform provides a set of "catalogue services" with optimized functions for (fuzzy) name, taxon, and occurrence data searches (FUB, BGBM 2013, FUB, BGBM 2014). A good example of the integration of EDIT Platform catalogue services into broader workflows is the "Taxonomic Data Refinement Workflow" implemented in the context of the EU 7th Framework Programme project BioVeL (Hardisty et al. 2016). The workflow uses the service layer of an EDIT Platform-based instance of the Catalogue of Life (CoL) for resolving taxonomic discrepancies between specimen datasets (Mathew et al. 2014). The same service is also part of the Unified Taxonomic Information Service (UTIS), which provides an easy-to-use interface for running simultaneous searches across multiple taxonomic checklists (FUB, BGBM 2016). HTML XML PDF
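      A sketch of how an external script might call such a catalogue service follows; the base URL and parameter names are hypothetical placeholders for illustration only, since each platform instance publishes its own documented endpoints.

      import requests

      BASE = "https://example.org/cdmserver/my_instance"   # hypothetical platform instance

      def fuzzy_name_search(query):
          # hypothetical catalogue endpoint for fuzzy scientific-name matching
          response = requests.get(f"{BASE}/name_catalogue/fuzzy", params={"query": query}, timeout=30)
          response.raise_for_status()
          return response.json()

      print(fuzzy_name_search("Abies alba"))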
      PubDate: Wed, 16 Aug 2017 14:25:44 +0300
       
  • The CDM Applied: Unit-Derivation, from Field Observations to DNA Sequences

    • Abstract: Biodiversity Information Science and Standards 1: e20366
      DOI : 10.3897/tdwgproceedings.1.20366
      Authors : Patrick Plitzner, Andreas Müller, Anton Güntsch, Walter Berendsohn, Andreas Kohlbecker, Norbert Kilian, Tilo Henning, Ben Stöver : Specimens form the falsifiable evidence used in plant systematics. Derivatives of specimens (including the specimen as the organism in the field), such as tissue and DNA samples, play an increasing role in research. The EDIT Platform for Cybertaxonomy is a specialist’s tool for documenting and sustainably storing all data that are used in the taxonomic work process, from field data to DNA sequences. The types of data stored can be very heterogeneous, consisting of specimens, images, text data, primary data files, taxon assignments, etc. The EDIT Platform organizes the linking between such data by using a generic data model for representing the research process. Each step in the process is regarded as a derivation step and generates a derivative of the previous step. This could be a field unit having a specimen as its derivative, or a specimen having a tissue sample as its derivative. Each derivation step also produces metadata recording who carried out the derivation, when, and how. The Platform's Common Data Model (CDM) and the applications built on the CDM library thus represent the first comprehensive implementation of the largely theoretical models developed in the late 1990s (Berendsohn et al. 1999). In a pilot project, research data about the genus Campanula (Kilian et al. 2015, FUB, BGBM 2012) were gathered and used to create a hierarchy of derivatives reaching from field data to DNA sequences. Additionally, LibrAlign, an open-source library for multiple sequence alignments (Stöver and Müller 2015), was used to integrate an alignment editor into the EDIT Platform that allows consensus sequences to be generated as derivatives of DNA sequences. The persistent storage of each link in the derivation process, and the degree of detail in which the data and metadata are stored, will speed up the research process, ease the reproducibility of research results and enhance the sustainability of collections. HTML XML PDF
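      The derivation idea can be made concrete with a minimal sketch (not the CDM itself) in which each step records who carried it out, when and how:

      from dataclasses import dataclass, field
      from typing import List, Optional

      @dataclass
      class DerivationEvent:
          actor: str     # who carried out the derivation
          date: str      # when it was done
          method: str    # how it was done (e.g. "tissue sampling", "DNA extraction")

      @dataclass
      class DerivedUnit:
          unit_type: str   # e.g. "FieldUnit", "Specimen", "TissueSample", "DnaSample", "Sequence"
          derived_from: Optional["DerivedUnit"] = None
          derivation: Optional[DerivationEvent] = None
          derivatives: List["DerivedUnit"] = field(default_factory=list)

          def derive(self, unit_type, actor, date, method):
              child = DerivedUnit(unit_type, derived_from=self,
                                  derivation=DerivationEvent(actor, date, method))
              self.derivatives.append(child)
              return child

      # Example chain from field observation to DNA sequence (all names and dates are illustrative only).
      field_unit = DerivedUnit("FieldUnit")
      specimen = field_unit.derive("Specimen", "collector A", "2014-06-02", "pressing and mounting")
      tissue = specimen.derive("TissueSample", "technician B", "2015-01-15", "tissue sampling")
      dna = tissue.derive("DnaSample", "technician B", "2015-01-20", "DNA extraction")
      sequence = dna.derive("Sequence", "lab C", "2015-02-10", "Sanger sequencing")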
      PubDate: Wed, 16 Aug 2017 14:20:10 +0300
       
  • EDIT Platform Projects: What’s Next

    • Abstract: Biodiversity Information Science and Standards 1: e20365
      DOI : 10.3897/tdwgproceedings.1.20365
      Authors : Anton Güntsch, Walter Berendsohn, Andreas Müller, Andreas Kohlbecker : The EDIT Platform for Cybertaxonomy has come a long way towards providing a complete, standards-based and reliable set of tools and services supporting the taxonomic workflow (Ciardelli et al. 2009). The Platform is firmly grounded in the organisational structure of the BGBM, with several positions directly dedicated to maintenance and further development of the Platform, complemented by numerous projects that are carried out in international and national cooperation. Furthermore, we continue to count on the collaboration of the teams in Paris and Tervuren, for descriptive data and geographic mapping functionalities, respectively. However, there are a number of areas where further research and development is needed, and of course the speed of development is often limited by the available resources. Among the topics under discussion are, inter alia: the integration of regional treatments (e.g. floras or faunas) with global monographic ones; efficient ways towards an integrated management of literature references; alternative approaches towards generating morphological descriptions and identification tools; the integration of data quality indicators; the handling of taxon concepts in a (present or absent) phylogenetic context; how concept relations (Berendsohn & al., this session) can be handled efficiently in user interfaces; the assignment of stable identifiers to core objects such as scientific names and taxa (which needs community agreement both as to the technical implementation and as to the rules applicable to identifiers when changes of objects occur); and the synchronisation of taxonomic datasets in multiple instances of the EDIT Platform with overlapping information areas. HTML XML PDF
      PubDate: Wed, 16 Aug 2017 13:48:52 +0300
       
  • The CDM Applied: Handling of Names, Taxa and Concepts in a Conservation
           Context

    • Abstract: Biodiversity Information Science and Standards 1: e20364
      DOI : 10.3897/tdwgproceedings.1.20364
      Authors : Walter Berendsohn, Andreas Müller, Andreas Kohlbecker, Anton Güntsch, Katja Luther, Patrick Plitzner : One of the major design features of the Common Data Model (CDM) is the ability to store and handle taxonomic concepts (a.k.a. “potential taxa” (Berendsohn 1995), “taxonyme” (Koperski et al. 2000), "assertions" (Pyle 2004), "taxonomic entities" (Kennedy et al. 2005), “taxon circumscriptions”, etc.). A major driver of the critical appreciation of the concept problem in databases has been the conservation community. Progress in taxonomy may rapidly erode the validity of taxon-name-based species conservation information; for example, in the context of the periodic publication of Red Lists, changes in circumscription need to be traced, as they may directly impact the conservation status of a group of organisms. So it is not a coincidence that the German Federal Agency for Nature Conservation (BfN) has been an important funder of projects aimed at further investigating and solving this problem (Koperski et al. 2000, Berendsohn et al. 2003, Baumann et al. 2012). The president of the agency stated this as follows: "Information systems on plant or animal biodiversity are basic tools for effective nature conservation. ... Factual information about plants or animals are linked to their scientific name. ... when merging taxon-related information from a lot of sources we not only need to know how to handle synonymies, but also the different taxonomic concepts related to these names and the rules for transmitting factual information from one taxonomic concept to the other" (Vogtmann 2003). The problem is particularly evident when dealing with Red Lists of organisms. Since 1971 the BfN has regularly published Red Lists, with the aim of publishing them at 10-year intervals. These are lists of taxa (normally species) with data on their conservation status - including the assigned category of threat (from extinct to unproblematic), further specification of risk factors for threatened species, distribution information, Germany's responsibility for the conservation of the taxon, etc. (Binot-Hafke et al. 2009). A particularity of the German lists is that they aim to list all organisms, including those not (currently) threatened. The lists contain an expert assessment of trends (e.g. in population sizes) that may indicate future changes in conservation status (Ludwig et al. 2009), but successive editions themselves also allow trends to be computed over time - provided that the taxon concept denoted by the name is stable, or that we know how the concepts in both lists relate to each other. In the context of the "Red Lists 2020" project (2011-15), the German Red Lists held by the BfN have been imported into the EDIT Platform for Cybertaxonomy. The data are held in three Platform instances (databases), one for animals, one for plants and one for fungi (including lichens). Tools developed by BfN staff (G. Ludwig, pers. comm.) made it possible to establish concept relations between the different editions - for example, the concepts from 8 publications (including floras) covering plants are included and inter-linked in the respective database. The BfN and the newly established German Red List Centre have decided to use the EDIT Platform to manage the taxonomy of Red Lists in Germany. 
A new project ("Kooperation Checklisten") will start to develop the tools for handling new editions of the checklists, among them a simplified checklist editor, a distribution data editor, and a concept-relation editor (including a wizard-like interface). These tools will be fully browser-based in order to allow wider participation in the editing process. Since conservation is legally a responsibility of the German states, an important issue is to trace and document not only state-level taxon distribution, but also concept differences between the checklists used by the state governments and the federal list. Joint management of the taxonomy, allowing differing concepts (and legal applications of names), is seen as a means to further develop consensus about the classification of German organisms, including the necessary updates brought about by new knowledge. HTML XML PDF
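      A minimal sketch of how concept relations between checklist editions might be recorded and consulted before computing trends follows; the relation vocabulary shown (congruent, includes, included in, overlaps, excludes) is a commonly used illustrative set, not the CDM's internal terminology.

      from dataclasses import dataclass

      @dataclass(frozen=True)
      class TaxonConcept:
          name: str            # the scientific name used
          according_to: str    # the treatment (sec. reference) that circumscribes it

      @dataclass(frozen=True)
      class ConceptRelation:
          subject: TaxonConcept
          relation: str        # "congruent", "includes", "included_in", "overlaps", "excludes"
          object: TaxonConcept

      # Hypothetical example: the newer concept is narrower than the one used in the earlier edition.
      old_concept = TaxonConcept("Genus species", "Red List edition 1")
      new_concept = TaxonConcept("Genus species", "Red List edition 2")
      relations = [ConceptRelation(new_concept, "included_in", old_concept)]

      def comparable(a, b, relations):
          # Trend computation is only safe where the two concepts are documented as congruent.
          return any(r.relation == "congruent" and {r.subject, r.object} == {a, b} for r in relations)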
      PubDate: Wed, 16 Aug 2017 13:42:12 +0300
       
  • Addressing the proposal for new Darwin Core terms for interaction data

    • Abstract: Biodiversity Information Science and Standards 1: e20350
      DOI : 10.3897/tdwgproceedings.1.20350
      Authors : Willem Coetzer : A proposal has been made to create new terms in the Darwin Core Standard to represent biological interaction/species interaction concepts in raw data. The motivation is to consistently perform vertical integration of raw data e.g. to facilitate discovery of larger, more representative datasets. The proposition may be problematic because the word ‘interaction’ is ambiguous, e.g. being used to refer to a ‘behavioural interaction between individual organisms’ or an ‘ecological interaction between populations’. In addition, both of these concepts are high-level abstractions from raw (e.g. ecological) data, which are often incomplete or uncertain, and require an objective causal inference to be made by a machine before they can be instantiated (rather than a subjective human interpretation). In contrast, the Darwin Core terms describe low-level knowledge of the data record itself, e.g. what taxon is represented by the record, and where and when the organism was observed. The potential uses of the broad and heterogeneous class of ‘interaction data’ therefore need to be better understood. Semantic mediation and semantic enrichment (e.g. by causal inferencing), for particular purposes, can then take place. In other words, a specialized knowledge-based system/expert system needs to be designed to ensure that data that potentially represent ecological interactions can be objectively interpreted using appropriate knowledge models (e.g. ontologies). The implications of this, for the proposal to extend the Darwin Core vocabulary with terms describing interactions, are discussed. HTML XML PDF
      PubDate: Wed, 16 Aug 2017 8:02:23 +0300
       
  • Turning Flickr into a useful Citizen Science Resource

    • Abstract: Biodiversity Information Science and Standards 1: e20348
      DOI : 10.3897/tdwgproceedings.1.20348
      Authors : Arthur Chapman : Flickr is a photo and web hosting social media site created by Ludicorp in 2004 and acquired by Yahoo in 2005. It is largely used for hosting social media type photos, but through the judicious use of tagging (including machine tags), APIs, grouping, tag indexing and georeferencing it has been turned into a powerful citizen science resource. By 2015, it was estimated that there were over 10 billion images on Flickr, and millions of biodiversity-related images have been posted. Flickr provides for the creation of ‘Groups’; the ‘Bird Photo Group’ alone, for example, has over 2.7 million photographs. Of course, not all biodiversity images are named and identified, or georeferenced (although many are), but many of the Flickr Groups have established rules that allow easy identification of images with names, through tagging and machine tagging. Flickr also includes an in-house georeferencing tool, and georeferences can also be added through machine tagging. Data quality is always an issue with any citizen science initiative, and identification from images can often be problematic. Many Flickr Groups, however, have scientists as members, and these will often interact with the photographer to help identify species. At other times, through the image comments, mis-identifications are queried and suggested corrections made. Thirdly, several Flickr Groups have been established where photographers can place images they are unsure of and ask for identification advice. In this paper, I will give several examples of Flickr Groups that are using innovative techniques for extracting and using images – for example for loading to the Encyclopedia of Life (https://www.flickr.com/groups/encyclopedia_of_life) and the Atlas of Living Australia (https://biocache.ala.org.au/occurrences/8cb21942-0b13-493f-ae1d-ee6fd158f758), as well as by citizen science groups developing comprehensive field guides, such as one for Insects of Australia that uses tags to group images into orders and families (https://www.flickr.com/groups/oz_insects) and one for Australian Rainforest Trees that uses complex tagging for species identification (https://www.flickr.com/groups/australianrainforestplants/discuss/72157604373025570), and through developing small specialist groups such as those for Australian Native Plants (https://www.flickr.com/groups/australian_native_plants), Pseudoscorpions (https://www.flickr.com/groups/719103@N20), Polypores (https://www.flickr.com/groups/390204@N21), Invasive Species (https://www.flickr.com/groups/18983462@N00), Weeds of Australia (https://www.flickr.com/groups/1553999@N21), etc. I will also explore some options for longer-term archiving of resources such as images on Flickr and other web-hosting sites. HTML XML PDF
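      As an illustration, machine-tagged and georeferenced photos can be harvested through the Flickr API (flickr.photos.search with the machine_tags parameter); the taxonomy:binomial namespace shown follows the tagging convention used by several of these groups, the API key is a placeholder, and parameter details should be checked against the Flickr API documentation.

      import requests

      API = "https://api.flickr.com/services/rest/"
      API_KEY = "YOUR_FLICKR_API_KEY"   # placeholder

      def search_machine_tagged(binomial, per_page=50):
          params = {
              "method": "flickr.photos.search",
              "api_key": API_KEY,
              "machine_tags": f'taxonomy:binomial="{binomial}"',
              "has_geo": 1,                              # only georeferenced photos
              "extras": "geo,date_taken,owner_name",
              "format": "json",
              "nojsoncallback": 1,
              "per_page": per_page,
          }
          r = requests.get(API, params=params, timeout=30)
          r.raise_for_status()
          return r.json()["photos"]["photo"]

      for photo in search_machine_tagged("Acacia dealbata"):
          print(photo.get("title"), photo.get("latitude"), photo.get("longitude"))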
      PubDate: Wed, 16 Aug 2017 7:04:00 +0300
       
  • Using ontologies to explore floral evolution in a non-model plant clade

    • Abstract: Biodiversity Information Science and Standards 1: e20347
      DOI : 10.3897/tdwgproceedings.1.20347
      Authors : Annika Smith : The ability to successfully address the complex, multidimensional process of plant character evolution requires approaches that integrate across domains: genetics, evolution, development, and ecology. Additionally, in order to understand the patterns of plant character evolution across a broad phylogenetic scale, we must continue to extend beyond current model organisms and identify new candidate genes implicated in phenotypic evolution. I will explore the potential of ontologies to link the phenotypes and developmental processes of non-model plant clades to underlying candidate genes identified from the model plant Arabidopsis, with the overall goal of generating candidate gene hypotheses in non-model plants. This presentation will explore the process of building an ontology specific for a non-model clade which can be integrated with existing ontologies and repositories for model plants, using the genus Tropaeolum, commonly known as nasturtiums. As well as highlighting the resources and pipelines that facilitate the development of an ontology in a non-model clade, I will also discuss the broader challenges and the potential inherent in an ontological approach. HTML XML PDF
      PubDate: Wed, 16 Aug 2017 6:11:37 +0300
       
  • Towards Insect Digital Collections and Data Publishing: A journey for the
           GBIF-funded African Insect Atlas Collaborative Project

    • Abstract: Biodiversity Information Science and Standards 1: e20274
      DOI : 10.3897/tdwgproceedings.1.20274
      Authors : Boikhutso Lerato Rapalai, Kudzai Mafuwe, Laban Njoroge, Balsama Rajemison : Museums from six African countries (Botswana, Kenya, Zimbabwe, South Africa, Madagascar, Mozambique), with support from the California Academy of Sciences, are currently collaborating on the GBIF-funded project African Insect Atlas: unleashing the potential of insects in conservation and sustainability research in Africa (BID-AF2015-0134-REG). This project was initiated to move biodiversity knowledge out of insect collections and into the hands of a new generation of global biodiversity researchers interested in direct outcomes. The project acknowledges that insects are the glue that holds ecosystems together and are ideal organisms for climate change biology, conservation planning, mapping local and regional patterns of diversity, and monitoring threats to ecosystem services and natural capital, thereby addressing Sustainable Development Goal #15, 'Life on Land' (http://www.undp.org/content/undp/en/home/sustainable-development-goals.html). The consortium partners have, since June 2016, embarked on a journey to learn digitization techniques and have reached 50% of the project's digitization goals. The targeted insect orders include Coleoptera, Odonata, Ephemeroptera, Plecoptera, Trichoptera and Hymenoptera. The data being mobilized include specimen and species data, habitat information, as well as the identification of possible threats such as deforestation. These are being captured into a standardized platform using Darwin Core. Elaborate data cleaning is being carried out using tools in OpenRefine (http://openrefine.org) and Microsoft Excel 2010. The captured data are also being georeferenced using appropriate software such as GEOLocate (http://www.museum.tulane.edu/geolocate) and GEO-Calculator (http://manisnet.org/gci2.html). The specimen occurrence records will be made available on the GBIF platform and will continuously be updated as more information becomes available. Any specimen images taken will also be linked to the database (Specify and Microsoft Excel). Assessments will be carried out to establish which species are native and endemic, as well as to establish their conservation status. Simplified image catalogues, checklists, distribution and habitat maps in suitable formats will also be produced to help scientists and other users identify these species during their research and in the field. HTML XML PDF
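      A minimal sketch of writing digitized label data into a Darwin Core-style occurrence CSV of the kind described above follows; the column set is a small illustrative subset of Darwin Core terms and the record content is invented.

      import csv

      fields = ["occurrenceID", "basisOfRecord", "scientificName", "order", "eventDate",
                "country", "locality", "decimalLatitude", "decimalLongitude", "georeferenceProtocol"]

      record = {
          "occurrenceID": "NMB:ENT:0000001",            # hypothetical catalogue identifier
          "basisOfRecord": "PreservedSpecimen",
          "scientificName": "Orthetrum chrysostigma",
          "order": "Odonata",
          "eventDate": "2016-11-03",
          "country": "Botswana",
          "locality": "Okavango Delta, near Maun",
          "decimalLatitude": -19.98,
          "decimalLongitude": 23.42,
          "georeferenceProtocol": "GEOLocate",
      }

      with open("occurrences_dwc.csv", "w", newline="") as f:
          writer = csv.DictWriter(f, fieldnames=fields)
          writer.writeheader()
          writer.writerow(record)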
      PubDate: Tue, 15 Aug 2017 22:10:23 +0300
       
  • A Pipeline for Processing Specimen Images in iDigBio - Applying and
           Generalizing an Examination of Mercury Use in Preparing Herbarium
           Specimens

    • Abstract: Biodiversity Information Science and Standards 1: e20326
      DOI : 10.3897/tdwgproceedings.1.20326
      Authors : Gaurav Yeole, Saniya Sahdev, Matthew Collins, Alex Thompson, Rebecca Dikow, Paul Frandsen, Sylvia Orli, Renato Figueiredo : iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable. HTML XML PDF
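      A sketch of the kind of user-defined Spark job this pipeline enables, scoring each referenced image with a pre-trained classifier, is shown below; paths, the model file and the record schema are placeholders, and the actual GUODA/ACIS deployment differs in detail.

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import udf
      from pyspark.sql.types import FloatType

      spark = SparkSession.builder.appName("mercury-screen").getOrCreate()

      def mercury_score(media_path):
          # Load one image and return the classifier's contamination score. Loading the model inside
          # the function is illustrative only; in practice a per-executor singleton or broadcast
          # variable would be used so the network is not reloaded for every row.
          from tensorflow.keras.models import load_model
          from tensorflow.keras.utils import load_img, img_to_array
          model = load_model("/models/mercury_cnn.h5")                      # hypothetical model file
          image = img_to_array(load_img(media_path, target_size=(224, 224))) / 255.0
          return float(model.predict(image.reshape(1, 224, 224, 3), verbose=0)[0][0])

      score_udf = udf(mercury_score, FloatType())

      records = spark.read.parquet("/guoda/idigbio/herbarium_media.parquet")   # hypothetical record set
      scored = records.withColumn("mercury_score", score_udf(records["local_media_path"]))
      scored.write.parquet("/guoda/output/mercury_scores.parquet")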
      PubDate: Tue, 15 Aug 2017 18:16:18 +0300
       
  • GoMexSI:  Using Open Platforms such as Github, Wordpress, and GloBI to
           Manage, and Share Species Interaction Data 

    • Abstract: Biodiversity Information Science and Standards 1: e20325
      DOI : 10.3897/tdwgproceedings.1.20325
      Authors : James Simons, Jorrit Poelen : Biodiversity data and databases are usually taxon-specific (e.g. HerpNet, FishNet2, etc.), although there are cases of regional, non-taxon-specific biodiversity databases. Some biodiversity databases are also oriented toward a functional category, such as invasive species. While it is critical to know of the existence and taxonomy of the many biological species of the world, a logical next step is to catalogue the linkages, or interactions, between and amongst the species. These types of data occur in an ecosystem context, drawing a subset of many species from many taxa to form the species assemblages and communities, and the resulting interactions between them make up the species interaction network. Toward this end, the Gulf of Mexico Species Interactions (GoMexSI) database, which is an application of GloBI, is endeavoring to assemble, extract, upload, and serve all of the recorded species interaction data for the Gulf of Mexico. To do this we are dependent on the interoperability of various biodiversity databases such as EOL (Encyclopedia of Life), NCBI (National Center for Biotechnology Information), WoRMS (World Register of Marine Species), etc., to provide name resolution for the detection of invalid species names. Using these data, GoMexSI takes advantage of the existing infrastructure of GloBI to integrate, link, and disseminate these data using various formats and methods. In addition, the relationship with GloBI negates the need to hire informatics staff, thus reducing costs. Data from GoMexSI are shared with scientists and educators through GoMexSI’s WordPress-based webpage. While GloBI is solely dependent on contributed datasets from scientists willing to share their data, GoMexSI expends a lot of effort harvesting species interaction data from published and unpublished resources, although contributed databases and datasets are accepted. The data extraction and editing process is very time-consuming and costly, and funding sources for data extraction and editing are limited, making it difficult to maintain the effort. Much of the data in GoMexSI comes from theses and dissertations (25% of references), while other sources include peer-reviewed literature, government technical reports, and conference proceedings. The GoMexSI project has focused on cataloguing predator/prey interactions of the Actinopterygii and Chondrichthyes, but recently began adding predator/prey interaction data on marine mammals, sea and shore birds, sea turtles, molluscs, and crustaceans to the database. Much time and effort has been devoted to developing standards for biodiversity data in order to record these data in a consistent way (e.g. Darwin Core). One of the key goals of the GoMexSI project from the outset has been to provide data standards for species interaction data where none existed previously. As we continue to work through the predator/prey interaction data of different taxa, we are constantly confronted with new problems and issues in recording the data in a standard way. These include the description of predator and prey life history stages, the description of prey parts, methods of length measurement, conversion of common names to scientific names, designation of locations, the basis of prey identification, diet analysis methods, and others. Currently GoMexSI has 89,209 lines of data representing 2,146 unique interactors, gleaned from 172 references. 
Future plans for GoMexSI call for the addition of host/parasite, commensal, amensal, and mutualistic interaction data. In addition, we plan to include stable isotope data for Gulf species, as they serve as an integrated record of past interactions. We have shared our data collection methods and spreadsheets with the US Marine Mammal Commission in their effort to create a diet database for marine mammals. We are currently assisting the Centro Interdisciplinario de Ciencias Marinas (CICIMAR) in La Paz, Mexico, to construct a species interaction database similar to GoMexSI for Baja California (Gulf of California). HTML XML PDF
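      As an illustration of how interaction records contributed to GloBI can be retrieved by other researchers, a minimal Python sketch follows; the endpoint pattern and response format are assumptions to be verified against the GloBI API documentation (https://api.globalbioticinteractions.org).

      import requests
      from urllib.parse import quote

      def prey_of(predator):
          # assumed endpoint pattern: /taxon/{source taxon}/preysOn
          url = f"https://api.globalbioticinteractions.org/taxon/{quote(predator)}/preysOn"
          r = requests.get(url, timeout=30)
          r.raise_for_status()
          return r.json()

      print(prey_of("Ariopsis felis"))   # hardhead catfish, a Gulf of Mexico predator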
      PubDate: Tue, 15 Aug 2017 17:18:16 +0300
       
  • The use of avian museum specimen data in phenology studies: prospects and
           challenges

    • Abstract: Biodiversity Information Science and Standards 1: e20324
      DOI : 10.3897/tdwgproceedings.1.20324
      Authors : Keith Barker, Michael Wells, Dakota Rowsey : Museum specimens offer a rich source of data on both long-term averages and temporal trends in organismal phenology. To date much of this work has focused on plants, but animal specimens are also useful in this regard. In particular, bird specimens may include data on age, gonad size and development, and molt, all of which are relevant to estimating breeding phenology. In addition, bird collections include nests and eggs, which are direct records of breeding. These data are some of the richest available for any vertebrate group, and the potential for integration with citizen-science-based data (e.g. eBird) is extremely promising. However, there are a number of challenges associated with use of these data, including biased geographic sampling, incomplete digitization, and non-uniform data standards, that currently limit their utility. In addition, methods for analysis of these data have not been well developed. Here, we outline informatics challenges associated with developing a database of avian gonad size data. In addition, we present a novel analytical framework for estimating phenology from such data (Fig. 1) that is broadly applicable to seasonal phenotype measurements. HTML XML PDF
      PubDate: Tue, 15 Aug 2017 17:15:36 +0300
       
  • TDWG 2017 Collections Description Interest Group Meeting

    • Abstract: Biodiversity Information Science and Standards 1: e20322
      DOI : 10.3897/tdwgproceedings.1.20322
      Authors : Alexander Thompson, Deborah Paul : The Collections Description Interest Group is dedicated to developing and supporting the Natural Collections Description (NCD) data standard for describing entire collections of natural history materials. Examples include collections of specimens, observation data, original artwork, photographs, and materials from the many voyages of discovery that have been conducted. The standard was brought up to the draft stage in 2008, and we are re-forming the interest group to attempt to finish it. Collection description records contain information about the collection, access and usage of the collection and where to get more detailed information. The meeting this year will focus on the formation of the Task Group to work on the standard itself, with the goal of leaving the meeting with a working draft of the task group charter, including an outline of goals and deadlines. HTML XML PDF
      PubDate: Tue, 15 Aug 2017 16:52:41 +0300
       
  • Occurrences: Data resources and Biocache-hub

    • Abstract: Biodiversity Information Science and Standards 1: e20321
      DOI : 10.3897/tdwgproceedings.1.20321
      Authors : Canadensys Network, Anne Bruneau, Carole Sinou, Jeremy Goimard : The Atlas of Living Australia (ALA) [*1] framework is an open-source infrastructure used to share biodiversity data through several modules. Adding datasets in ALA is an important step that gives access to occurrences, and parameters need to be set accurately in order to view occurrences correctly. Biocache-hub [*2] is an interface for searching the occurrences ingested by Biocache-store [*3]; it is an advanced data explorer with filters. This training will be divided into two parts. The first part will provide tools and techniques for adding datasets, from a local CSV resource to a GBIF Darwin Core archive file, within the administration interface of the Collectory module [*4]. It will also present the important steps for linking occurrences with datasets and for updating a dataset. The second part, from the user's perspective, will present access to occurrences and the options available, from a simple search to a spatial search. HTML XML PDF
      PubDate: Tue, 15 Aug 2017 16:30:11 +0300
       
  • A tool for collections-specific searches in genetic databases

    • Abstract: Biodiversity Information Science and Standards 1: e20320
      DOI : 10.3897/tdwgproceedings.1.20320
      Authors : Michael Trizna : It is becoming increasingly important for museums and other scientific collections to quantify the amount of genetic resources being derived from their holdings. Records in genetic databases, such as GenBank and the Barcode of Life Data System (BOLD), have an optional field for indicating the specimen that a sequence was derived from, and, on the other side, records in specimen databases, such as GBIF (gbif.org) and iDigBio (idigbio.org), have an optional field for indicating the sequence records that were derived from a specimen. Making connections between the two types of records should be easy, but unfortunately it is made difficult by inconsistent standards. For example, GenBank has a catch-all "country" term that holds all geographic locality data for a specimen, whereas in Darwin Core (DwC) there are 12 atomized levels of locality names. The software tool described here was originally created for Smithsonian data managers to search genetic databases in a targeted manner for DNA sequences generated from Smithsonian specimens. It is being made open source to be utilized by other scientific institutions to quantify and document the genetic impact of their collections. Other potential uses include checking for data inconsistencies between sequence records and specimen records, and enforcing specimen loan agreements. HTML XML PDF
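      A sketch of the kind of targeted GenBank search such a tool performs, using Biopython's Entrez interface, is shown below; the institution code and the search strategy are illustrative placeholders.

      from Bio import Entrez, SeqIO

      Entrez.email = "data.manager@example.org"   # NCBI requires a contact address

      def voucher_records(institution_code, retmax=100):
          # Broad search on the institution code, then confirm the match in each record's
          # specimen_voucher qualifier and report the catch-all "country" locality field.
          handle = Entrez.esearch(db="nucleotide", term=f"{institution_code}[All Fields]", retmax=retmax)
          ids = Entrez.read(handle)["IdList"]
          if not ids:
              return
          fetch = Entrez.efetch(db="nucleotide", id=",".join(ids), rettype="gb", retmode="text")
          for record in SeqIO.parse(fetch, "genbank"):
              source = next((f for f in record.features if f.type == "source"), None)
              if source is None:
                  continue
              voucher = source.qualifiers.get("specimen_voucher", [""])[0]
              if institution_code in voucher:
                  yield record.id, voucher, source.qualifiers.get("country", [""])[0]
          fetch.close()

      for accession, voucher, country in voucher_records("USNM"):
          print(accession, voucher, country)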
      PubDate: Tue, 15 Aug 2017 15:45:42 +0300
       
  • Biodiversity Information Serving Our Nation (BISON) now with more than 1/3
           billion species occurrences

    • Abstract: Biodiversity Information Science and Standards 1: e20285
      DOI : 10.3897/tdwgproceedings.1.20285
      Authors : Annie Simpson, Elizabeth Martín : Biodiversity Information Serving Our Nation (BISON) is a web-based resource (https://bison.usgs.gov) for finding and accessing occurrence records of species found in the United States (US), its Territories and marine Exclusive Economic Zones. BISON serves as a data aggregator that compiles and standardizes species occurrence data from multiple data providers, and now contains more than 1/3 billion species occurrences. BISON uses the Integrated Taxonomic Information System (ITIS) to standardize species names in searches. It is also the US hub of the Global Biodiversity Information Facility (GBIF) and obtains much of its data from that resource. BISON also enables access to numerous federal datasets such as the US Forest Service's Forest Inventory and Analysis and the US Geological Survey's Bird Banding Lab. BISON accepts all US species occurrence datasets that are Darwin Core-compatible, but especially seeks to mobilize pollinator and invasive species occurrence datasets. Data from BISON can also be accessed via various map services, and the US National Park Service's Species Checklists application, currently available in a development environment, is an example of the use of BISON web services. HTML XML PDF
      PubDate: Tue, 15 Aug 2017 15:24:14 +0300
       
  • Integrating AnnoSys into your specimen data portal

    • Abstract: Biodiversity Information Science and Standards 1: e20313
      DOI : 10.3897/tdwgproceedings.1.20313
      Authors : Lutz Suhrbier, Okka Tschöpe, Walter Berendsohn, Anton Güntsch : Access to AnnoSys from your portal makes it possible to (1) annotate and (2) show existing annotations for specimen data records. To this end, weblinks from the page displaying the individual specimen record to AnnoSys are incorporated into your website. In the current (XML-based) system, the portal should provide a link called "annotate" or similar which redirects users to AnnoSys in order to create annotations based on the record data actually shown in your portal. Of course, to enable annotation, an access point for the record to be annotated has to be transmitted with the request. AnnoSys can then download the referred specimen data record. After the data have been transferred successfully, the user is first redirected to the AnnoSys Annotation Editor and, if they start editing an annotation, subsequently to the AnnoSys user login/registration dialog. At present, AnnoSys supports ABCD 2.06, ABCD 2.1 or Simple Darwin Core in XML format. If you already use a BioCASe provider to deliver ABCD data from your collection database, then this is sufficient, but you can also provide any other URL that provides the record in one of the supported formats, and AnnoSys will download the data from that URL. In order to show existing annotations, you query the AnnoSys repository for the ID of the specimen. Currently, the triple ID in use in the GBIF and BioCASE networks is used to identify a specimen. An information request directed to AnnoSys for a certain triple ID will return a JSON response with: hasAnnotation (true/false, indicating whether any annotations are available); size (the number of currently available annotations); and annotations (a list of relevant annotation metadata you may want to show in your portal, such as annotator (the name of the annotator), time (creation time in ms since 01.01.1970), motivation (the type of the annotation), repositoryURI (a link to the RDF data of the annotation) and recordURIs (a list of URIs to the original record data (XML) the annotation is based on)). You can either directly link to the AnnoSys interface to display the existing annotation, using the repositoryURI and recordURIs element values from the JSON response, or you can display selected metadata about already available annotations directly in your portal. For full documentation including examples see the AnnoSys wiki. HTML XML PDF
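      A minimal Python sketch of this second integration step, reading the documented response fields, follows; the request URL is a placeholder (see the AnnoSys wiki for the actual endpoint), while the field names follow the description above.

      import requests

      ANNOSYS_QUERY = "https://example.org/AnnoSys/api/annotations"   # placeholder endpoint

      def annotation_info(triple_id):
          response = requests.get(ANNOSYS_QUERY, params={"id": triple_id}, timeout=30)
          response.raise_for_status()
          info = response.json()
          if info.get("hasAnnotation"):
              print(f'{info.get("size")} annotation(s) available')
              for a in info.get("annotations", []):
                  # annotator, time (ms since 1970-01-01), motivation, repositoryURI, recordURIs
                  print(a.get("annotator"), a.get("time"), a.get("motivation"), a.get("repositoryURI"))
          return info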
      PubDate: Tue, 15 Aug 2017 13:34:18 +0300
       
  • AnnoSys – an online tool for sharing annotations to enhance data
           quality

    • Abstract: Biodiversity Information Science and Standards 1: e20315
      DOI : 10.3897/tdwgproceedings.1.20315
      Authors : Okka Tschöpe, Lutz Suhrbier, Anton Güntsch, Walter Berendsohn : AnnoSys is a web-based open-source information system that enables users to correct and enrich specimen data published in data portals, thus enhancing data quality and documenting research developments over time. This brings the traditional annotation workflows for specimens to the Internet, as annotations become visible to researchers who subsequently observe the annotated specimen. During its first phase, the AnnoSys project developed a fully functional prototype of an annotation data repository for complex and cross-linked XML-standardized data in the ABCD (Access to Biological Collection Data; Berendsohn 2007) and Darwin Core (DwC; Wieczorek et al. 2012) standards, including back-end server functionality, web services and an online user interface (Tschoepe et al. 2013). Annotation data are stored using the Open Annotation Data Model (Sanderson et al. 2013) and an RDF database (Suhrbier et al. 2017). Public access to the annotations and the corresponding copy of the original record is provided via Linked Data, REST and SPARQL web services. AnnoSys can easily be integrated into portals providing specimen data (see Suhrbier et al., this session). As a result, the individual specimen page then includes two links, one providing access to existing annotations stored in the AnnoSys repository, the other linking to the AnnoSys Annotation Editor for annotation input. AnnoSys is now integrated into a dozen specimen portals, including the Global Biodiversity Information Facility (GBIF) and the Global Genome Biodiversity Network (GGBN). In contrast to conventional, site-based annotation systems, annotations regarding a specimen are accessible from all portals providing access to the specimen's data, independent of which portal was originally used as the starting point for the annotation. In addition, users can query the data in the AnnoSys portal or create a subscription to be notified about annotations, using criteria referring to the data record. For example, a specialist for a certain family of organisms, working on a flora or fauna of a certain country, may subscribe to that family name and the country. The subscriber is notified by email about any annotations that fulfil these criteria. Other possible subscription and filter criteria include the name of the collector, identifier or annotator, catalogue or accession numbers, and collection name or code. For curators, a special curatorial workflow supports the handling of annotations, for example confirming a correction according to the annotation in the underlying primary database. User feedback on the currently available system has led to a significantly simplified version of the user interface, which is currently undergoing testing and final implementation. Moreover, the current, second project phase aims at extending the generic qualities of AnnoSys to allow processing of additional data formats, including RDF data with machine-readable semantic concepts, thus opening up the data gathered through AnnoSys for the Semantic Web. We developed semantic-concept-driven annotation management, including the specification of a selector concept for RDF data and a repository for original records extended to RDF and other formats. Based on DwC RDF terms and the ABCD ontology, which deconstructs the ABCD XML schema into individually addressable RDF resources, we built an “AnnoSys ontology”. The AnnoSys-2 system is currently in the testing phase and will be released in 2018.
In future research (see Suhrbier, this volume), we will examine the use of AnnoSys for taxon-level data as well as its integration with image annotation systems. BGBM Berlin is committed to sustaining AnnoSys beyond the financed project phase. HTML XML PDF
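      As a sketch of the kind of subscription and filter criteria described above, the following TypeScript type and value are purely illustrative; the field names are assumptions and do not reflect the actual AnnoSys subscription schema.

```typescript
// Illustrative only: a possible shape for a subscription's filter criteria.
// Field names are assumptions, not the AnnoSys data model.
interface SubscriptionCriteria {
  family?: string;          // taxonomic family of interest
  country?: string;         // country covered by the flora/fauna project
  collector?: string;       // name of the collector
  identifier?: string;      // name of the identifier
  annotator?: string;       // name of the annotator
  catalogNumber?: string;   // catalogue or accession number
  collectionCode?: string;  // collection name or code
  notifyByEmail: boolean;   // send email when a matching annotation arrives
}

const exampleSubscription: SubscriptionCriteria = {
  family: "Asteraceae",     // hypothetical example values
  country: "Chile",
  notifyByEmail: true,
};
```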
      PubDate: Tue, 15 Aug 2017 13:33:55 +0300
       
  • AnnoSys - future developments

    • Abstract: Biodiversity Information Science and Standards 1: e20317
      DOI : 10.3897/tdwgproceedings.1.20317
      Authors : Lutz Suhrbier, Okka Tschöpe, Anton Güntsch, Walter G. Berendsohn : AnnoSys (Tschöpe et al. 2013, Suhrbier et al. 2017) is a web-based open-source system for correcting and enriching biodiversity data in publicly available data portals. Users are enabled to annotate specimen data, and these annotations become visible to researchers who subsequently observe the annotated specimen. The AnnoSys search and subscription capabilities make it possible to access, or receive notification of, annotations of records and even of records of duplicate specimens accessed in different portals. In the current second project phase, the project's technical infrastructure is evolving from a mixture of structured specimen data based on XML*1 and semantic information (annotations based on W3C Open Annotation*2) into a purely semantic, linked data (Heath and Bizer 2011) oriented service backend. To this end, we are implementing an AnnoSys ontology prototype providing semantic information about supported data elements, their mappings and semantic relationships with data elements from an extensible catalog of relevant biodiversity standards (e.g. ABCD*3, Darwin Core*4), as well as their annotation-workflow-oriented collection and organisation within so-called annotation types. Furthermore, the linked-data-oriented service backend enables importing, exporting and transforming annotation-related information into a variety of data formats and sources. Ultimately, AnnoSys will be upgraded to the new W3C Web Annotation*5 standard for representing annotations in RDF*6. These new facilities permit AnnoSys to hook into a number of annotation workflows in a way that could not have been realised before. Examples include the automatic generation of annotations from the output of data quality control services, the reporting of update or edit processes at provider databases to the AnnoSys service backend, or recording the changes made in large(r) datasets by analysing differences between download and (corrected) upload. The extension of the data domain from specimen data to taxonomic data (i.e. annotation of checklists) is another envisioned development, as is support for the annotation of multimedia elements (e.g. the images that are increasingly linked to specimen data records). In our presentation, we will sketch out some of these use cases to foster the discussion of further workflow scenarios for biodiversity-related annotations. HTML XML PDF
      PubDate: Tue, 15 Aug 2017 13:33:37 +0300
       
  • DNAqua-Net: Advancing Methods, Connecting Communities and Envisaging
           Standards

    • Abstract: Biodiversity Information Science and Standards 1: e20310
      DOI : 10.3897/tdwgproceedings.1.20310
      Authors : Alexander Weigand, Jonas Zimmermann, Agnès Bouchez, Florian Leese : Water covers over 70% of our planet’s surface and is a key resource for the survival of all organisms, among them humans. Unfortunately, water resources face increasing pressures due to the exponential expansion of the human population and its exploitation of resources. The consequences for aquatic ecosystems are hallmarks of the Anthropocene: chemical pollution, warming, scarcity of clean drinking water, ocean acidification and a dramatic loss of biodiversity. As a consequence, the direct and indirect benefits humanity obtains from these ecosystems as low-cost services, such as clean water, biomass production, climate regulation and matter fluxes, are increasingly at risk. Therefore, it is of utmost importance to assess the ecological state of aquatic ecosystems and to protect and manage them in a sustainable way. In order to assess the ecological status of a given water body, aquatic biodiversity data are collected by morphological identification of bioindicator species and a comparison of site-specific species lists to those of fairly natural reference water bodies. Quantifying the differences between the lists guides subsequent management actions. Examples of European standard bioassessments (so far morphologically based) are the Marine Strategy Framework Directive (2008/56/EC) and the Water Framework Directive (2000/60/EC). While the implementation of biomonitoring programs is already a great success, there is room for improvement. In the field of molecular genetics, revolutionary high-throughput DNA-based analyses have been developed. These can be applied to assess lists of hundreds to many thousands of taxa at once and greatly improve the speed and accuracy of assessments. However, while these novel genetic tools have attracted a lot of interest, they are not implemented in any of the regular legal biomonitoring programs. In order to change this, the European Co-Operation in Science and Technology (COST) program's Action CA15219 'DNAqua-Net' was launched in November 2016 (Leese et al. 2016). The Action aims to gather existing knowledge and complement the standard procedures by developing and implementing novel genomic, DNA-based approaches for biomonitoring and bioassessment. The Action comprises five working groups (WGs): WG1: DNA Barcode References; WG2: Biotic Indices & Metrics; WG3: Lab & Field Protocols; WG4: Data Analysis & Storage; and WG5: Implementation Strategies & Legal Issues. Central to this effort is the standardisation of the various protocols, methods and biotic indices, and the integration of DNA-based datasets (e.g. resulting from DNA metabarcoding, mito- and metagenomics) with existing data standards. Here, the TDWG community will be of central importance and its participation in DNAqua-Net is highly desirable. Moreover, the innovative open access journal Metabarcoding & Metagenomics (MBMG) has recently been launched to promote open science and enhance data exchange in this field, as well as to connect the diverse actors and communities (Leese et al. 2017). HTML XML PDF
      PubDate: Tue, 15 Aug 2017 10:26:07 +0300
       
  • Towards a comprehensive workflow for biodiversity data in R

    • Abstract: Biodiversity Information Science and Standards 1: e20311
      DOI : 10.3897/tdwgproceedings.1.20311
      Authors : Tomer Gueta, Vijay Barve, Ashwin Agrawal, Thiloshon Nagarajah, Yohay Carmel : An increasing number of scientists are using R for their data analyses; however, the proficiency required to manage biodiversity data in R is considerably rarer. Because users need to retrieve, manage and assess high-volume data with an inherently complex structure (the Darwin Core standard, DwC), various R packages dealing with biodiversity data, and specifically with data cleaning, have been published. Though numerous new procedures are now available, implementing them requires users to spend a great deal of effort exploring and learning each R package. For the typical user, this task can be daunting. In order to truly facilitate data cleaning using R, there is an urgent need for a package that fully integrates the functionality of existing packages, enhances that functionality, and simplifies its use. Furthermore, it is also necessary to identify and develop missing crucial functionalities. We are attempting to address these issues by developing two projects under Google Summer of Code (GSoC), an international annual program that matches students with open source organizations to develop code during their summer break. The first project deals with the integration challenge by developing a taxonomic cleaning workflow; standardizing various spatial and temporal data quality checks; and enhancing different data retrieval and data management techniques. The second project aims at advancing new and exciting features, such as establishing a flagging system (HashMap-like) in R, an innovative set of DwC summary tables, and developing new techniques for outlier analysis. The products of these projects lay down crucial infrastructure for data quality assessment in R. This is a work in progress and needs further input. By developing a comprehensive framework for handling biodiversity data, we can fully harness the synergistic qualities of R and, hopefully, supply more holistic and agile solutions for the user. HTML XML PDF
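      The kind of spatial and temporal quality check being standardized here can be sketched as follows. The sketch is written in TypeScript purely for illustration (the work described above is implemented as R packages), the occurrence fields follow Darwin Core, and the flag names are invented for this example.

```typescript
// Illustration only: a record-level flagging check of the sort the R
// packages standardize. Field names follow Darwin Core; flag names are
// invented for this sketch.
interface DwcOccurrence {
  scientificName?: string;
  decimalLatitude?: number;
  decimalLongitude?: number;
  eventDate?: string; // ISO 8601
}

function flagOccurrence(rec: DwcOccurrence): string[] {
  const flags: string[] = [];
  if (!rec.scientificName) flags.push("MISSING_NAME");
  if (rec.decimalLatitude === undefined || rec.decimalLongitude === undefined) {
    flags.push("MISSING_COORDINATES");
  } else if (Math.abs(rec.decimalLatitude) > 90 || Math.abs(rec.decimalLongitude) > 180) {
    flags.push("COORDINATES_OUT_OF_RANGE");
  }
  if (rec.eventDate && Number.isNaN(Date.parse(rec.eventDate))) {
    flags.push("UNPARSABLE_DATE");
  }
  return flags;
}

console.log(flagOccurrence({ scientificName: "Puma concolor", decimalLatitude: 95, decimalLongitude: 10 }));
// -> ["COORDINATES_OUT_OF_RANGE"]
```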
      PubDate: Tue, 15 Aug 2017 10:24:38 +0300
       
  • Automated Generation of Lists of Unique Values from iDigBio Data Fields
           to Facilitate Data Quality Improvements

    • Abstract: Biodiversity Information Science and Standards 1: e20306
      DOI : 10.3897/tdwgproceedings.1.20306
      Authors : Saniya Sahdev, Deborah Paul, Matthew Collins, Jose Fortes : iDigBio currently has over 100 million records with up to 260 fields per record [Matsunaga et al. 2013]. Many of these fields are mapped to the Darwin Core (DwC) and Audubon Core standards. How well do the data in those fields meet the term definitions of those standards? Amassing biodiversity collections data into very large aggregated datasets offers never-before-possible ways to use the existing data to enhance current data and improve future data. While most data providers attempt to adhere to the recommended standards, looking inside the data entered for a given field across aggregated datasets has revealed significant data quality issues. Among other issues, data might be the wrong data type, mapped incorrectly, use old terminology, be formatted incorrectly, or use a non-standard controlled vocabulary. The Darwin Core Hour webinar initiative [Zermoglio et al. 2017] started in January of 2017 to improve DwC implementation and documentation, as well as community engagement with, and understanding of, the DwC standard and standards process. As part of that process, it was recognized that while aggregators with informatics skills can easily see the above data issues, it is not simple for most data providers or downstream users to visualize large datasets. In fact, it is often difficult for data providers to visualize issues in their own local datasets. One place to start improving data quality is with the 23 fields in the DwC standard that recommend the use of a controlled vocabulary. A call went out to large aggregators to share comma-separated values (CSV) files containing a list of the distinct values found in each of these 23 fields, along with a count. The responses from iDigBio, the Global Biodiversity Information Facility (GBIF), and VertNet are stored in the TDWG Darwin Core Q&A GitHub repository [Paul 2017]. Based on this community need for more insight into controlled vocabulary data, as well as experience with iDigBio’s existing data cleaning approaches, we have constructed an automated process to generate lists of unique values in iDigBio fields. We used the data available from dumps of the entire iDigBio dataset, which are written out weekly and stored on the GUODA (Global Unified Open Data Access) infrastructure [Collins et al. 2017], the distributed processing engine Apache Spark, and the job management software Jenkins. The resulting CSV files are archived and automatically made publicly available once a week through the web on iDigBio’s Ceph object store. Dynamically generating these distinct-value data is a first step in understanding the current vocabularies in use by data providers. Using summarization and clustering algorithms, data in the fields can be easily visualized and analyzed. With these data, not only can patterns beyond typos and counts be seen by anyone, but metrics can be put in place. As discipline-specific communities are able to easily see what is in a given field, they can work together to synthesize recommended vocabularies to improve future data. As the data are improved, the number of distinct clusters would be expected to decrease, as would the number of values found in a given cluster. Without these kinds of automated tools that build data products from aggregated data, it would be much harder to tackle many data quality issues. HTML XML PDF
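      The distinct-value lists described above are, in essence, per-field tallies over aggregated records. The real pipeline uses Apache Spark over iDigBio dumps on the GUODA infrastructure; the core idea can be sketched in a few lines of TypeScript, with an in-memory sample standing in for the dump.

```typescript
// Sketch of the core aggregation: count distinct values of one DwC field.
// The real pipeline runs on Apache Spark over iDigBio dumps; this is an
// in-memory illustration with a hypothetical sample.
function distinctValueCounts(records: Record<string, string>[], field: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const rec of records) {
    const value = (rec[field] ?? "").trim();
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return counts;
}

// Example: tally the controlled-vocabulary field dwc:basisOfRecord.
const sample = [
  { basisOfRecord: "PreservedSpecimen" },
  { basisOfRecord: "preserved specimen" }, // non-standard spelling shows up in the tally
  { basisOfRecord: "FossilSpecimen" },
];
console.log([...distinctValueCounts(sample, "basisOfRecord")]);
// -> [["PreservedSpecimen", 1], ["preserved specimen", 1], ["FossilSpecimen", 1]]
```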
      PubDate: Tue, 15 Aug 2017 4:37:06 +0300
       
  • Toward a Biodiversity Data Fitness for Use Backbone (FFUB): A Node.js
           module prototype

    • Abstract: Biodiversity Information Science and Standards 1: e20300
      DOI : 10.3897/tdwgproceedings.1.20300
      Authors : Allan Veiga, Antonio Saraiva : Introduction: The biodiversity informatics community has made important achievements in digitizing, integrating and publishing standardized data about global biodiversity. However, the assessment of the quality of such data and the determination of the fitness for use of those data in different contexts remain a challenge. To tackle this problem using a common approach and conceptual base, the TDWG Biodiversity Data Quality Interest Group - BDQ-IG (https://github.com/tdwg/bdq) has proposed a conceptual framework to define the necessary components to describe Data Quality (DQ) needs, DQ solutions, and DQ reports. It supports a consistent description of the meaning of DQ in specific contexts and of how to assess and manage DQ in a global and collaborative environment (Veiga 2016, Veiga et al. 2017). Based on the common ground provided by this conceptual framework, we implemented a prototype of a Fitness for Use Backbone (FFUB) as a Node.js module (https://nodejs.org/api/modules.html) for registering and retrieving instances of the framework concepts. Material and methods: This prototype was built using Node.js, an asynchronous event-driven JavaScript runtime, which uses a non-blocking I/O model that makes it lightweight and efficient for building scalable network applications (https://nodejs.org). In order to facilitate the reusability of the module, we registered it in the NPM package manager (https://www.npmjs.com). To foster collaboration on the development of the module, the source code was made available in the GitHub (https://github.com) version control system. To test the module, we have developed a simple mechanism for measuring, validating and amending the quality of datasets and records, called BDQ-Toolkit. The source code of the FFUB module can be found at https://github.com/BioComp-USP/ffub. Installing and using the module requires Node.js version 6 or higher. Instructions for installing and using the FFUB module can be found at https://www.npmjs.com/package/ffub. Results: The implemented prototype is organized into three main types of functions: registry, retrieve and print. Registry functions enable the creation of instances of concepts of the conceptual framework, as illustrated in Fig. 1, such as use cases, information elements, dimensions, criteria, enhancements, specifications, mechanisms, assertions (measure, validation, and amendment) and DQ profiles. As a prototype, these instances are not persisted but are stored in an in-memory JSON object. Retrieve functions are used to get instances of the framework concepts, such as DQ reports, based on the in-memory JSON object. Print functions are used to write the concepts stored in the in-memory JSON object to the console in a formatted way. Inside the FFUB module, we implemented a test which registers a set of instances of the framework concepts, including a simple DQ profile, specifications and mechanisms, and a set of assertions applied to a sample dataset and its records. Based on these registrations, it is possible to retrieve and print DQ reports, presenting the current status of the DQ of the sample dataset and its records according to the defined DQ profile. Final remarks: This module provides a practical interface to the proposed conceptual framework. It allows the input of instances of concepts and generates, as output, information that supports DQ assessment and management.
Future work includes creating a RESTful API, based on the functions developed in this prototype, using more sophisticated methods of data retrieval based on NoSQL databases. HTML XML PDF
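      A minimal sketch of how such a registry/retrieve/print module might be used follows; the function, type and field names are assumptions chosen for illustration and do not reproduce the actual FFUB API (see the npm and GitHub links above for that).

```typescript
// Illustrative sketch only: the shape of a registry/retrieve/print workflow
// over an in-memory JSON object, as described for the FFUB prototype.
// Function and field names are assumptions, not the published ffub API.

type Concept = { type: string; id: string; [key: string]: unknown };

const registry: Concept[] = []; // in-memory store standing in for the JSON object

function register(concept: Concept): void {
  registry.push(concept);
}

function retrieve(type: string): Concept[] {
  return registry.filter((c) => c.type === type);
}

function print(type: string): void {
  for (const c of retrieve(type)) console.log(`${c.type}: ${c.id}`);
}

// Example: register a use case and a validation assertion, then print them.
register({ type: "UseCase", id: "niche-modelling" });
register({ type: "Assertion", id: "coordinates-complete", result: "COMPLIANT" });
print("Assertion"); // -> "Assertion: coordinates-complete"
```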
      PubDate: Mon, 14 Aug 2017 21:31:24 +0300
       
  • Brazilian Plant-Pollinator Interactions Network: definition of a data
           standard for digitization, sharing, and aggregation of plant-pollinator
           interaction data

    • Abstract: Biodiversity Information Science and Standards 1: e20298
      DOI : 10.3897/tdwgproceedings.1.20298
      Authors : Antonio Saraiva, José Salim, Kayna Agostini, Marina Wolowski, Juliana Silva, Allan Veiga, Bruno Albertini : Pollination is considered one of the most important processes for biodiversity conservation (Kremen 2005). Recently, the global community, by means of the Intergovernmental Platform on Biodiversity and Ecosystem Services (IPBES 2016) and the Convention on Biological Diversity (CBD 2002), recognized the importance of plant-pollinator interactions for ecosystem functioning and sustainable agriculture. The conservation of pollination depends on information about plant-pollinator interactions covering a great diversity of functional and taxonomic groups. Studies show that successful pollination can improve the amount and the quality of plant fecundation and fruit production (Kevan and Imperatriz-Fonseca 2002). However, the success of these actions depends on knowledge of pollinators, their conservation and their interactions with plants and the environment. In order to conserve and manage pollination, more information needs to be captured about plant-pollinator interactions. Primary data about pollinators are becoming increasingly available online and can be accessed at a number of websites and portals. Many initiatives have also been created to facilitate and stimulate the dissemination of pollination data; examples are the Inter-American Biodiversity Information Network - Pollinators Thematic Network - IABIN-PTN (www.biocomp.org.br/iabinptn) and WebBee (www.webbee.org.br) (Saraiva et al. 2003). One important aspect of this trend is the strong reliance on standardized data schemas and protocols (e.g. Darwin Core - DwC and the TDWG Access Protocol for Information Retrieval - TAPIR, respectively) that allow us to share and aggregate biological data, among which pollinator data are included. Although plant-pollinator interaction data are critically important to our understanding of the role, importance and effectiveness of (potential) pollinators, they cannot be adequately represented by the current standards for occurrence data (such as DwC). The ways that interaction data are recorded and stored worldwide, as well as their intended uses, are very diverse. The lack of a common protocol and data schema that would allow us to aggregate these data in web portals, and eventually use them to build decision support systems for conservation and sustainable use in agriculture, needs to be addressed. The IABIN-PTN adopted a simple solution to characterize and digitize plant-pollinator interaction data based on DwC (Cartolano Júnior et al. 2007), allowing the digitization of many Latin American collections. Following that work, the Food and Agriculture Organization of the United Nations (FAO) produced a detailed survey of potential descriptors of plant-pollinator interactions. Although the ultimate goal of that work was to propose a data standard, that standard did not materialize (Cavalheiro et al. 2016). The FAO Global Pollination Project adopted in Brazil the same simplified model used by IABIN to digitize plant-pollinator interaction data (Saraiva et al. 2010). Recently, many Brazilian scientists gathered around the Brazilian Plant-Pollinator Interactions Network (REBIPP - www.rebipp.org.br) with the aim of developing scientific and teaching activities in the field.
The main goals of the network are to: generate a diagnosis of plant-pollinator interactions in Brazil; integrate knowledge of pollination in natural, agricultural, urban and restored areas; identify knowledge gaps; support public policy guidelines aimed at the conservation of biodiversity and ecosystem services for pollination and food production; and encourage collaborative studies among REBIPP participants. To achieve these goals the group has resumed the previous work done under the auspices of the IABIN and FAO projects, and a data standard is being discussed. The ultimate goal is to adopt a standard and develop a database of plant-pollinator data in Brazil to be used by the national community. This proposal of a data standard (depicted in Fig. 1) can serve as a starting point for the definition of a global data standard for plant-pollinator interactions under the TDWG umbrella. HTML XML PDF
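      To make the data problem concrete, the sketch below pairs two Darwin Core style occurrence records with the interaction-specific fields an extended standard would need to carry. The interaction-level field names are hypothetical illustrations and do not anticipate the REBIPP proposal.

```typescript
// Hypothetical illustration of a plant-pollinator interaction record.
// Occurrence fields follow Darwin Core; the interaction-level fields
// (interactionType, floralVisitDurationSec, pollenLoadObserved) are invented
// here to show what an occurrence-only standard cannot express.
interface Occurrence {
  scientificName: string;
  eventDate: string;       // ISO 8601
  decimalLatitude: number;
  decimalLongitude: number;
}

interface PlantPollinatorInteraction {
  plant: Occurrence;
  pollinator: Occurrence;
  interactionType: string;          // e.g. "flower visit", "nectar robbing"
  floralVisitDurationSec?: number;
  pollenLoadObserved?: boolean;
}

const example: PlantPollinatorInteraction = {
  plant: { scientificName: "Passiflora edulis", eventDate: "2016-11-03", decimalLatitude: -22.0, decimalLongitude: -47.9 },
  pollinator: { scientificName: "Xylocopa frontalis", eventDate: "2016-11-03", decimalLatitude: -22.0, decimalLongitude: -47.9 },
  interactionType: "flower visit",
  floralVisitDurationSec: 12,
  pollenLoadObserved: true,
};
```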
      PubDate: Mon, 14 Aug 2017 21:12:30 +0300
       
  • Building community-specific standards and vocabularies: prospects and
           challenges for linking to the broader community - The SINP Case

    • Abstract: Biodiversity Information Science and Standards 1: e20297
      DOI : 10.3897/tdwgproceedings.1.20297
      Authors : Remy Jomier, Paula Zermoglio, John Wieczorek : Biodiversity data may come from myriad sources. From data capture in the field through digitization processes, each source may choose distinctive ways to capture data. When it comes to sharing data more broadly at national or regional levels, it is imperative that data are presented in ways that encourage understanding by both humans and machines, allowing aggregation and serving the data back to the community. This implies two levels of agreement, one at a structural level, where data are organized under certain terms or fields, and another related to the actual values contained in such fields. Since its ratification in 2009, the Darwin Core standard (Wieczorek et al. 2012) has been increasingly used across the community to respond to the first need, providing a relatively simple means to organize shared data. Nonetheless, despite its broad acceptance, efforts to develop different standards to answer the same problems are not uncommon among some stakeholders, and may introduce yet another issue: reconciling the data shared under different standards. The second level of agreement, at the value level, constitutes a much more complex issue, partly given the nature of biodiversity data and partly due to social constraints. Many potential, partial solutions involving the development of dictionaries and controlled vocabularies are found scattered across the community. As the lack of homogeneity renders data less discoverable (Zermoglio et al. 2016), and therefore less usable for research and decision making, there is a growing need for unifying such efforts. As part of the Biodiversity Information System on Nature and Landscapes (SINP), the French National Museum of Natural History was appointed to develop biodiversity data exchange standards, with the goal of sharing French marine and terrestrial data at the national level, meeting national and European requirements (e.g., the European INSPIRE Directive, European Commission 2017). The French data providers include a broad range of people with diverse backgrounds. While some stakeholders can provide data under very specific constraints and formats, others lack the capabilities or resources to do so. The variability in the data provided therefore extends through both the structure and the value levels. In order to integrate the data in a coherent national system, a dedicated working group was assembled, mobilizing a range of biodiversity stakeholders and experts. Existing standards were compared, existing vocabularies were gathered and compiled for review by experts, and the results were presented to the working group. As a result, a set of terms and associated controlled vocabularies was established. Finally, the set was released to the public for testing and amended as needed. The results of the French initiative proved useful for compiling and sharing data at the national level, bringing together data providers that would otherwise have been excluded. However, at a global scale, it faces some challenges that still need to be fully addressed. For instance, the standards created do not have an exact correspondence with Darwin Core, so a complex mapping is required in order to integrate the data with that of the rest of the community. A serious mapping effort is being carried out as the national standards progress and has already yielded good results (Jomier and Pamerlon 2016). Regardless of the problems that remain to be solved, some lessons can be learnt from this effort.
Getting actively involved in the broader, global community can help identify available tools, resources and expertise, and avoid repeated efforts that can be costly and time-consuming. Furthermore, re-using elements that have already been proven to work prevents the need for reconciliation and makes data integration easier. With the ultimate goal of making biodiversity data readily available, these lessons should be kept in mind for future initiatives. HTML XML PDF
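      The mapping problem described above can be pictured as a simple term-to-term crosswalk. In the sketch below the source field names on the left are invented placeholders standing in for national exchange-standard fields, while the target terms on the right are real Darwin Core terms.

```typescript
// Illustration of a crosswalk between a national exchange standard and
// Darwin Core. The source field names (left) are hypothetical placeholders;
// the Darwin Core terms (right) are real.
const nationalToDarwinCore: Record<string, string> = {
  nomScientifique: "scientificName",
  dateObservation: "eventDate",
  latitude: "decimalLatitude",
  longitude: "decimalLongitude",
  observateur: "recordedBy",
};

function toDarwinCore(record: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [source, value] of Object.entries(record)) {
    const target = nationalToDarwinCore[source];
    if (target) out[target] = value; // unmapped fields would need expert review
  }
  return out;
}

console.log(toDarwinCore({ nomScientifique: "Lutra lutra", dateObservation: "2016-05-12" }));
// -> { scientificName: "Lutra lutra", eventDate: "2016-05-12" }
```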
      PubDate: Mon, 14 Aug 2017 20:39:02 +0300
       
  • Traits as Essential Biodiversity Variables

    • Abstract: Biodiversity Information Science and Standards 1: e20295
      DOI : 10.3897/tdwgproceedings.1.20295
      Authors : Robert Guralnick : Essential Biodiversity Variables (EBVs) are harmonized biodiversity variables and their associated measurements needed for developing indicators of global biodiversity change. EBVs can serve the important purpose of aligning biodiversity monitoring efforts, much as Essential Climatic Variables (ECVs) help align allied efforts in climate science. One of six initially proposed EBV classes is devoted to species’ traits, since traits form the crucial link between the evolutionary history of organisms, their assembly into communities, and the nature and dynamic functioning of ecosystems. Despite their importance, prevalence, and scientific promise, the biodiversity community is still developing the conceptual, informatics, technical, and legal frameworks required for large-scale implementation and uptake. As part of an international consortium called GLOBIS-B, and in coordination with the Group on Earth Observations Biodiversity Observation Network (GEO BON; geobon.org), we report on recent efforts to synthesize work on trait data collection and trait datasets, computational workflows, ways to standardize data and metadata, and assessments of the openness and accessibility of existing species trait datasets. Members of the GLOBIS-B (www.globis-b.eu/) consortium also produced a set of candidate EBVs within the broader trait class ('Phenology', 'Organism morphology', 'Reproduction', 'Physiology' and 'Movement'). In this presentation, we begin by introducing the concept of EBVs, the current working definition of traits in the context of the EBV process, workflows that have been developed for other EBV classes ('Species Populations'), and the importance of standardizing EBV classes, the trait class in particular. Building on this introduction, we discuss how the EBV concept is operationalized, focusing on workflows for trait integration and the importance of data and metadata standards, following work from Kissling et al. 2017 (http://onlinelibrary.wiley.com/doi/10.1111/brv.12359/full). On the legal front, we suggest that the Creative Commons (CC) framework provides effective tools for designating legally interoperable and open data, especially when trait data are in the public domain (CC0, CC PDM) or assigned a CC BY license, and metadata citation and other forms of attribution are available in both human- and machine-readable form. We also suggest how EBVs can inform policy at national and global scales. Moving forward, renewed efforts at repeated trait data collection, as well as standardised protocols for data and metadata collection, are needed to improve the empirical basis of species traits EBVs. Moreover, open data as well as computational workflows are required for comprehensively assessing progress towards conservation policy targets and sustainable development goals. We conclude with a call to action for the TDWG community to consider their role in further developing, implementing, and scaling biodiversity monitoring under the EBV framework. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 20:07:05 +0300
       
  • Building semantics in the domain of trait data: an OBO Library approach

    • Abstract: Biodiversity Information Science and Standards 1: e20293
      DOI : 10.3897/tdwgproceedings.1.20293
      Authors : Pier Luigi Buttigieg : As the volume and diversity of digitised trait data grows with ever-increasing speed, there is a clear need to capture the knowledge which contextualises it. Many researchers are addressing similar challenges by using ontology-based approaches to represent knowledge and use it to better structure data across resources; however, there is immense variation in how and for what purpose these ontologies are built. While some approaches emphasise quick and lightweight deployment for specific projects, others spend considerable effort in creating "heavy duty", finely specified semantics for a wide user base. Effectively ontologising trait data collections is likely to require a hybrid of these strategies and must also consider how to meld emerging efforts with those that have matured into well-adopted, production-oriented systems. This contribution will provide an overview of existing ontologies linked to traits, as well as the best practices used to create and develop them within the Open Biological and Biomedical Ontologies (OBO) Foundry and Library (Smith et al. 2007). Specifically, it will outline a collaborative model for future, open development, based on the domain semantics of the Ontology of Biological Attributes (OBA), the Environment Ontology (ENVO; Buttigieg et al. 2013, Buttigieg et al. 2016b), the Population and Community Ontology (PCO; Walls et al. 2014), and recent work on bridging phenotypes and environments (e.g. Thessen et al. 2015). Finally, perspectives on linking trait semantics, and hence trait data, to societal goals via OBO-aligned efforts to represent the semantics of the United Nations' Sustainable Development Agenda for 2030 (e.g. Buttigieg et al. 2016a) will be offered as a means to bridge scientific data with global socio-ecological goals. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 20:03:25 +0300
       
  • Species Information pages: how are the data discovered,
           consolidated and presented?

    • Abstract: Biodiversity Information Science and Standards 1: e20294
      DOI : 10.3897/tdwgproceedings.1.20294
      Authors : Jeff Gerbracht : A number of different projects consolidate species information from widely disparate datasets and compile them into a single resource. These projects vary in several dimensions, including taxonomic coverage, depth of information and audience, such as humans or machines. Some focus on life history information, others focus on observations and specimens or taxonomies and phylogenies. Encyclopedia of Life (eol.org) was one of the early projects and in 2007 took on the challenge of creating a web page for every species in the world, from bacteria to birds. Other projects focus on specific taxonomic groups or regions, such as FishBase (fishbase.org) and the Atlas of Living Australia (ALA). Efforts such as the Global Biodiversity Information Facility (GBIF) consolidate observational data globally. At least 5 projects focus solely on the life histories of birds, including Birds of North America, Neotropical Birds, Handbook of the Birds of the World Alive (HBW) and others. The species data included can range from genomic sequences to studies on demography and behavior, from photos and sound recordings to museum specimens. All these various resources are scattered around the globe, and discovering the data of interest and accurately resolving the data to the correct 'species' is an ongoing and significant challenge. Publishing taxonomic concepts is still in its infancy, yet is key to discovering and resolving these types of data. Additionally, biological and environmental trait data are often consolidated within a species account, yet the discovery of these data is frequently a difficult and labor-intensive process. In this talk, we will review Jaguar, a content management system (CMS) used by the Cornell Lab of Ornithology to manage species account projects focused on birds, which currently include Birds of North America, Neotropical Birds, Merlin and All About Birds. This custom CMS was designed with taxonomic concepts at its foundation; utilizing these taxonomic concepts, species accounts are automatically extended with observation maps, multimedia and results from various big-data analysis projects. A set of common trait data associated with species is managed using controlled vocabularies and displayed within these species accounts. We have defined a set of traits, focused on birds, that are generally known and that are most useful to a broad ornithological audience. We will discuss challenges we have faced in managing these species accounts and future opportunities to extend and enhance these accounts, especially as taxonomic concepts are published and adopted and trait ontologies are defined and, most importantly, applied. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 20:02:23 +0300
       
  • Sustaining Software for Biological Collections Computing

    • Abstract: Biodiversity Information Science and Standards 1: e20254
      DOI : 10.3897/tdwgproceedings.1.20254
      Authors : James Beach : Specify is a biological collections data management platform for the digitization, curation, and dissemination of museum specimen information. The Specify Software Project and its predecessor, the MUSE Project, have been funded by the US National Science Foundation for 30 years. Specify 6, a native desktop app, is used in about 500 biological collections for specimen data processing. The latest generation, Specify 7, is a web platform for hosting collections data in the Specify Cloud service or on an institutional server. During 2017 and 2018, with encouragement from its long-time funder, the Specify Project is 'transitioning to sustainability', in a campaign to identify an organizational structure and sources of revenue to support the Project's software engineering, help desk and data management services. A museum consortium dedicated to maintaining and evolving Specify Software is a probable outcome. Such a non-profit consortium would formally engage institution directors, collections researchers, and biodiversity informaticists in the governance of Specify. Each group would play a significant role in determining the direction and capabilities of consortium software products and services. In this paper, we will summarize our approach to the Specify transition. The Specify Project's transition to sustainability is not unique within the research museum community. In the United States, groups that produce Symbiota, Arctos, and GeoLocate, as examples, face a similar quest. As legacy funders look to data communities to underwrite more of the ongoing expense of research cyberinfrastructure, projects seeded with grant funds that contribute to the overall suite of collections digitization and management software solutions will, sooner or later, face this existential challenge. Significant questions of research community economics present themselves. How much can collections institutions, of all sizes and budgets, afford to pay for specimen data software platforms and services? Will the significant cost of collections software development and support lead to sustained, collaborative, community-wide efforts, similar to the way that members of professional societies pool resources (including endowments) to provide journal and annual meeting 'infrastructure'? Or will the high cost of software, security and systems management, etc., drive wealthier museums to outsource software development by licensing commercial products, and relegate collections unable to afford software licenses to bespoke, simple, or no software solutions? This session will include project case studies on the economics of collections research cyberinfrastructure and present perspectives and paths to long-term sustainability. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 19:39:33 +0300
       
  • Living Atlases Community

    • Abstract: Biodiversity Information Science and Standards 1: e20290
      DOI : 10.3897/tdwgproceedings.1.20290
      Authors : Marie-Elise Lecoq, Fabien Cavière, Christian Gendreau, Jeremy Goimard, Santiago Martinez de la Riva, Manash Shah, David Martin : Since 2010, the Atlas of Living Australia (ALA) has provided information on all the known species in Australia and contributed to the Global Biodiversity Information Facility (GBIF). As a national open source platform with an open and modular architecture, ALA enables the re-use of its tools by other countries and regions. Over the years, thanks to the ALA team and GBIF, the community has grown in different ways, from production deployments to training courses. Firstly, data portals based on ALA but residing outside Australia have been launched in several institutions, such as INBIO in Costa Rica and Canadensys in Canada, and in the GBIF network via at least six nodes presently operating national ALA-based portals (e.g., Spain, Portugal, France, Sweden, Argentina, United Kingdom). Others will follow in the coming years (e.g., Colombia, Peru). Other countries, such as Andorra and Benin, have also begun to develop their own installations with the aid of partners in the Living Atlases Community. Secondly, we are now able to set up workshops geared to different levels of expertise; at TDWG 2017 we will offer both beginner and advanced workshops. Thirdly, the experience gained by installing and customizing their own data portals has allowed many advanced participants to share their expertise in subjects like internationalization, data management, and customization with others during workshops. In addition, because the software is open source, developers contribute to the community by implementing new functionality and improving translations into several languages for users of the software. Today, some modules are fully translated into Spanish, French, and Portuguese. In this poster, we will show the human aspect of the project by introducing the Living Atlases, an international community created around the ALA framework, highlighting how re-using existing software can be motivating and stimulating. We will also present the new official website that we will launch through the GBIF Capacity Enhancement Support Programme (CESP)*1 around the time of the next advanced workshop, and future projects planned to increase the durability of our community. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 19:31:22 +0300
       
  • Traits in a graph

    • Abstract: Biodiversity Information Science and Standards 1: e20289
      DOI : 10.3897/tdwgproceedings.1.20289
      Authors : Jennifer Hammock, Katja Schulz : Biodiversity data are well-indexed by taxonomic names. While name reconciliation remains a challenge, there has been tremendous progress in recent years, and integration with available phylogenetic information can support sophisticated analyses of evolutionary questions. However, organisms are also linked to each other by relationships of ecology, geographic proximity, shared habitat, management categories, and other attributes, not yet recorded in a well-structured way. These data are best modeled as a graph, which makes these relationships explicit and available for reasoning across, just like taxonomic relationships. This would support broad analyses of life on Earth not only from an evolutionary perspective but also across many other axes. This case study will describe how several categories of data are being modeled in the Encyclopedia of Life (EOL) v3 using ontology terms. It will focus on several areas where we anticipate sufficient taxonomic coverage to underlie significant search and analytical power: habitat, distribution, body size and metabolism, and provenance. Habitat and distribution terms are good examples of data terms in well-structured hierarchies that could support powerful search. Habitat terms are available from, and hierarchically organized in, the Environment Ontology (ENVO). Geographic distribution knowledge can often be structured by geographic terms based on verbatim locality text when geocoordinates are not available. Geographic terms are available from several providers, notably GeoNames (geonames.org), Marineregions.org and Wikidata. Both habitat and distribution terms can also be connected to simpler and less formalized but commonly used hierarchies like the World Wildlife Fund (WWF) Ecoregions. The hierarchy information made available for habitat and geography by the semantic structure of these ontologies supports searches like "wetland plants of South America," which requires the intersection of taxonomic, geographic, and habitat hierarchies. Body size and metabolism traits interact in a particular use case, illustrating the importance of precise categorical data terms for informing calculations of quantitative traits. The use case EOL is currently working to support is the parameterization of food web interactions in ecological modelling software. Default or starting values are needed for the energy (or carbon) content of an organism, and the rate of loss thereof through metabolism. This, plus assimilation efficiency, allows the modeling of carbon flow through the food web. Traits available for estimating carbon content and metabolic rate include various measures of body size, for which conversion factors and formulae are available. For phytoplankton, for instance, size may be reported as cell dimensions, cell volume, cell wet mass, cell dry mass, and/or carbon biomass. For an automated tool to derive fit-for-use parameters from these, the different types of data must all be findable, but the measurement types must be distinguished from one another so that the correct conversions are performed for each, all in a machine-readable way, so the process can be automated. The need for semantically structured data terms in this case is different, but just as critical to the success of the use case. Future work: Other important structured connections can be made through provenance metadata. These connect taxa and specimens to literature, authors, collectors, wildlife observers and other agents.
The social media of biodiversity data, rendered explicit, could increase connectivity and communication in the global community, particularly benefitting young researchers in isolated regions without the benefit of professional travel or literature subscriptions. To accomplish this, we must leverage human identifiers such as those made available by Open Researcher and Contributor ID (ORCID) and Wikidata. HTML XML PDF
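      The need to distinguish measurement types before converting can be sketched as below. The conversion constants are placeholders rather than published factors, and the measurement-type labels are assumptions; the point is only that the type must be machine-readable before an automated tool can choose a conversion.

```typescript
// Illustration only: deriving a carbon-content parameter from differently
// typed body-size measurements. Conversion constants below are placeholders,
// NOT published factors; the measurement type must be machine-readable
// before any conversion can be chosen automatically.
type SizeMeasurement =
  | { kind: "carbonBiomass_pg"; value: number }   // already the target quantity
  | { kind: "cellVolume_um3"; value: number }
  | { kind: "dryMass_pg"; value: number };

function estimateCarbonPg(m: SizeMeasurement): number {
  switch (m.kind) {
    case "carbonBiomass_pg":
      return m.value;
    case "cellVolume_um3":
      return m.value * 0.1; // placeholder volume-to-carbon factor
    case "dryMass_pg":
      return m.value * 0.5; // placeholder dry-mass-to-carbon fraction
  }
}

console.log(estimateCarbonPg({ kind: "cellVolume_um3", value: 250 })); // -> 25 (illustrative)
```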
      PubDate: Mon, 14 Aug 2017 19:28:40 +0300
       
  • TaxonWorks

    • Abstract: Biodiversity Information Science and Standards 1: e20279
      DOI : 10.3897/tdwgproceedings.1.20279
      Authors : Matthew Yoder, Dmitry Dmitriev : One of the best ways to test the adequacy of taxonomic database standards is to build tools on top of them. TaxonWorks (http://taxonworks.org) is an ambitious, open source effort to provide an all-in-one wrapper around the taxonomist’s workflow. Its development is supported by an endowment that facilitates and encourages its community to address long-term questions of sustainability and project evolution. At its core the project seeks to serve both its users (scientists), by providing an ever evolving and improving workbench, and, equally important, its developers, by providing a modern, open, extensible and deployable codebase. The data model includes coverage for nearly all classes of data recorded in modern taxonomic treatments or biodiversity informatics studies. Technical features include a unit-tested codebase, a JSON-serving application programming interface (API), web-accessible interfaces, batch loading and export capabilities, and application containerization and Kubernetes-based deployment. Mechanisms for contributing at many different levels, not just application code (e.g. providing use cases, interface mockups, critical assessments, help documents, data modelling and standardization), are possible. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 19:20:13 +0300
       
  • Defining a Data Quality (DQ) profile and DQ report using a prototype of
           Node.js module of the Fitness for Use Backbone (FFUB)

    • Abstract: Biodiversity Information Science and Standards 1: e20275
      DOI : 10.3897/tdwgproceedings.1.20275
      Authors : Allan Veiga, Antonio Saraiva : Despite the increasing availability of biodiversity data, determining the quality of data and informing would-be data consumers and users remains a significant issue. In order for data users and data owners to perform a satisfactory assessment and management of data fitness for use, they require a Data Quality (DQ) report, which presents a set of relevant DQ measures, validations, and amendments assigned to data. Determining the meaning of "fitness for use" is essential to best manage and assess DQ. To tackle the problem, the TDWG Biodiversity Data Quality (BDQ) Interest Group (IG) (https://github.com/tdwg/bdq) has proposed a conceptual framework that defines the necessary components to describe DQ needs, DQ solutions, and DQ reports (Fig. 1). It supports, in a global and collaborative environment, a consistent description of: (1) the meaning of data fitness for use in specific contexts, using the concept of a DQ profile; (2) DQ solutions, using the concepts of specifications and mechanisms; and (3) the status of the quality of data according to a DQ profile, using the concept of a DQ report (Veiga 2016, Veiga et al. 2017). Based on this conceptual framework, we implemented a prototype of a Fitness for Use Backbone (FFUB) as a Node.js module (https://nodejs.org/api/modules.html) for registering and retrieving instances of the framework concepts. This prototype was built using Node.js, an asynchronous event-driven JavaScript runtime, which uses a non-blocking I/O model that makes it lightweight and efficient for building scalable network applications (https://nodejs.org). We registered our module in the npm package manager (https://www.npmjs.com) in order to facilitate its reuse, and we made our source code available in GitHub (https://github.com) in order to foster collaborative development. To test the module, we developed a simple mechanism for measuring, validating and amending the quality of datasets and records, called BDQ-Toolkit, available in the FFUB module. The source code of the FFUB module can be found at https://github.com/BioComp-USP/ffub. Installing and using the module requires Node.js version 6 or higher. Instructions for installing and using the FFUB module can be found at https://www.npmjs.com/package/ffub (Veiga and Saraiva 2017). Using the FFUB module we defined a simple DQ profile describing the meaning of data fitness for use in a specific context by registering a hypothetical use case. Then, we registered a set of valuable information elements for the context of the use case. For measuring the quality of each valuable information element, we registered a set of DQ dimensions. To validate whether the DQ measures are good enough, a set of DQ criteria was defined and registered. Lastly, a set of DQ enhancements for amending the quality in the use case context was also defined and registered. In order to describe the DQ solution used to meet those DQ needs, we registered the BDQ-Toolkit mechanism and all the specifications implemented by it. Using these specifications and mechanism, we generated and assigned to a dataset and its records a set of DQ assertions, according to the DQ dimensions, criteria and enhancements defined in the DQ profile. Based on those assertions we can build DQ reports by composing all the assertions assigned to the dataset or to a specific record. Such a DQ report describes the status of the DQ of a dataset or record according to the context of the DQ profile.
This module provides an interface to the proposed conceptual framework, which allows others to register instances of its concepts. Future work will include creating a RESTful API using more sophisticated methods of data retrieval. HTML XML PDF
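      Read as a sequence, the registration steps above amount to something like the following sketch; the names and structure are assumptions chosen to mirror the framework concepts (use case, information elements, dimensions, criteria, enhancements, assertions), not the FFUB module's actual API.

```typescript
// Sketch of defining a DQ profile and assembling a DQ report, mirroring the
// framework concepts. All names and shapes are illustrative assumptions.
interface DQProfile {
  useCase: string;
  informationElements: string[];
  dimensions: string[];
  criteria: string[];
  enhancements: string[];
}

interface Assertion {
  target: string;                                   // dataset or record identifier
  type: "MEASURE" | "VALIDATION" | "AMENDMENT";
  result: string;
}

const profile: DQProfile = {
  useCase: "species distribution modelling",        // hypothetical use case
  informationElements: ["coordinates", "eventDate"],
  dimensions: ["completeness", "precision"],
  criteria: ["coordinates are complete", "eventDate is interpretable"],
  enhancements: ["fill coordinates from verbatim locality"],
};

const assertions: Assertion[] = [
  { target: "record-001", type: "VALIDATION", result: "COMPLIANT" },
  { target: "record-002", type: "AMENDMENT", result: "eventDate standardized to ISO 8601" },
];

// A DQ report is then the set of assertions composed under the profile's context.
const report = { profile, assertions };
console.log(JSON.stringify(report, null, 2));
```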
      PubDate: Mon, 14 Aug 2017 19:20:05 +0300
       
  • Introduction of the Living Atlases workshop

    • Abstract: Biodiversity Information Science and Standards 1: e20253
      DOI : 10.3897/tdwgproceedings.1.20253
      Authors : Marie-Elise Lecoq, Fabien Cavière, Régine Vignes-Lebbe : Since 2010 the Atlas of Living Australia (ALA) has provided information on all the known species in Australia and contributed to the Global Biodiversity Information Facility (GBIF). To provide a suitable framework for national needs, ALA and GBIF worked together and created a technical community around the platform developed by ALA, called Living Atlases. Through this community, members can receive help from the ALA technical team or other participants during the installation and configuration of modules based on this architecture. Since 2013, several beginner and advanced workshops have been held, focusing on various ALA tools. As the expertise of community members has increased, some former trainees have become trainers. In addition to these technical workshops, we have presented and demonstrated our work at international conferences, always with the main goal of showing how re-using existing software, and being part of a living community, can be motivating and stimulating. Indeed, ten member institutions of Living Atlases have launched their national data portals. In the future, we expect more GBIF nodes to put the software into production (Andorra, Germany, Benin, Luxembourg, etc.). During this introduction, we will present the community by showing past events as well as presentations given at previous conferences. Then, we will describe the human aspect of the project by presenting the international community and listing some key features of the software, such as internationalization and customization. We will finish by explaining the future of the Living Atlases community and how you can contribute and be part of it. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 19:06:45 +0300
       
  • Living Atlases: tips and advice

    • Abstract: Biodiversity Information Science and Standards 1: e20252
      DOI : 10.3897/tdwgproceedings.1.20252
      Authors : Marie-Elise Lecoq, Fabien Cavière, Régine Vignes-Lebbe : The Atlas of Living Australia (ALA), the Australian node of the Global Biodiversity Information Facility (GBIF), provides information on all the known species in Australia. Since 2010, ALA has developed an open source framework providing different tools to help users from various sectors. The ALA technical team, with the help of GBIF, has reorganized the architecture of the ALA tools into several modules to help other institutions re-use the code. The first three sessions of this workshop focus on data indexing, the occurrence search engine and the metadata registry. The last session delves into more technical subjects with a hands-on demonstration of installing an instance of the portal (also known as ALA-demo*1) using Ansible scripts. In addition, experience regarding the customization of data portals and the building of a name indexer using national checklists will be shared. Finally, we will present other modules such as ALA4R*2, a package allowing R users to access data and resources hosted by a platform based on ALA, and the Image Service*3, a tiling and image repository tool. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 19:03:17 +0300
       
  • Building a community of practice through capacity enhancement mentoring

    • Abstract: Biodiversity Information Science and Standards 1: e20234
      DOI : 10.3897/tdwgproceedings.1.20234
      Authors : Laura Russell, Anabela Plos, Manuel Vargas, Paula Zermoglio, Sharon Grant : In 2015, GBIF—the Global Biodiversity Information Facility—officially started Biodiversity Information for Development, or BID, a multi-year, €3.9 million programme funded by the European Union with the aim of increasing the amount of biodiversity information available for use in scientific research and policymaking in the nations of sub-Saharan Africa, the Caribbean and the Pacific. Components of this programme include building a community of practice by offering capacity enhancement workshops in the areas of data mobilization and data application; developing activities and materials to strengthen a base of mentors and trainers; establishing helpdesk support and technical assistance for funded projects; and matchmaking to provide mentoring support to the funded projects. The GBIF Secretariat recruited its mentors through an open call as a first step toward developing a community willing to share their knowledge. The main role of the mentors, either through online or on-site support, is to serve as a bridge between the researchers and collections professionals working on projects funded by BID and the broader biodiversity data community. The mentors receive recognition for their contributions through an open badging system and are listed in an online community directory.  These mentors represent a wide range of backgrounds and experience in leadership, planning, digitization, data quality, data publishing, data application, and informatics support. The GBIF Secretariat seeks to expand and develop the mentor programme and to sustain it beyond the BID-funded programme. We will explore the existing mentor programme, challenges associated with bridging cultures, challenges with providing language support, and ideas for improvement and expansion. We welcome discussion from the wider biodiversity community on how capacity enhancement mentoring can become a recognized activity to narrow knowledge gaps between various groups of biodiversity professionals. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 18:50:02 +0300
       
  • Design and use of NOMEN, an ontology defining the rules of biological
           nomenclature

    • Abstract: Biodiversity Information Science and Standards 1: e20284
      DOI : 10.3897/tdwgproceedings.1.20284
      Authors : Matthew Yoder, Dmitry Dmitriev, José Luis Pereira, Maria Marta Cigliano : The most complex nomenclatural databases are developed not from community-based efforts but by individuals who have encoded their understanding of the rules of nomenclature into bespoke knowledge-bases. In efforts spanning decades, well over 75 types of “status” may be defined for a single database. Reconciliation of these status types into new, federated systems is nearly always the most difficult aspect of their migration. Nomenclatural data are often recorded in a logically inconsistent manner, for example mixing governed rules and curator annotations. NOMEN (https://github.com/SpeciesFileGroup/nomen) is a Web Ontology Language (OWL) ontology that seeks to address these issues, providing standardized URIs for classes of nomenclatural annotations on taxonomic names (not taxonomic concepts). It includes assertions for the animal (ICZN), plant (ICN), and bacterial (ICNB) codes. NOMEN-based assertions can be encoded in a simple graph format, as illustrated in its implementation in TaxonWorks (http://taxonworks.org). We illustrate its application within the migration process of four very large taxonomic databases. HTML XML PDF
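      A minimal sketch of what a NOMEN-style assertion in a simple graph format might look like, written with Python and rdflib. The namespace and class name used here are illustrative assumptions, not terms copied from the ontology.

        # Sketch only: the NOMEN base URI and class name below are placeholders.
        from rdflib import Graph, Namespace, Literal, RDF, RDFS

        NOMEN = Namespace("http://purl.obolibrary.org/obo/nomen/")  # assumed base URI
        EX = Namespace("http://example.org/names/")

        g = Graph()
        name = EX["Aus_bus_Smith_1900"]
        g.add((name, RDF.type, NOMEN["ICZN_available_name"]))   # hypothetical NOMEN class
        g.add((name, RDFS.label, Literal("Aus bus Smith, 1900")))
        print(g.serialize(format="turtle"))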
      PubDate: Mon, 14 Aug 2017 17:23:50 +0300
       
  • Sustainability in Biodiversity Software Development: More financing or
            better practices?

    • Abstract: Biodiversity Information Science and Standards 1: e20283
      DOI : 10.3897/tdwgproceedings.1.20283
      Authors : Matthew Yoder, Michael Twidale, Andrea Thomer : The Species File Group is a small, endowed team that seeks to provide software tools and related technical resources to communities dependent on biodiversity informatics. We will present an overview of the technical products and services the group provides, including software development, data migration, data hosting and data mobilization services. Given the modest size of the Species File Group, we have identified our limitations and future capability needs. These include scaling user and technical support services, increasing capability to migrate bespoke legacy data, more rapidly developing rich user interfaces, increasing the speed and scope of our data processing, and providing archival services. Our vision leads us to conclude that different types of applications must be put into production and sustained for varying durations of time; plotting the various kinds of software tools needed on a temporal axis creates a picture of how software ecosystems might evolve, and further helps us to identify where we might focus increased funding. Currently, however, there are important issues with software sustainability that, if addressed now, could multiply the utility and value of existing community software investments. These include: (1) architecting a software ecosystem that lowers barriers to entry for those willing to contribute to all aspects of a project, (2) improving the modularity of software components, (3) educating scientists in "carpentry"-related concepts, (4) increasing the engagement of experts from fields such as human-computer interaction, user-interface (UI) and user-experience (UX) design, and (5) narrowing the iteration time between new data standards and the new software that employs them in order to more rapidly improve both. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 17:18:49 +0300
       
  • Standards in Action: The Darwin Core Hour

    • Abstract: Biodiversity Information Science and Standards 1: e20280
      DOI : 10.3897/tdwgproceedings.1.20280
      Authors : Deborah Paul, Paula Zermoglio, John Wieczorek, Gary Motz, Erica Krimmel : Darwin Core (Wieczorek et al. 2012) has become broadly used for biodiversity data sharing since its ratification as a standard in 2009. Despite its popularity, or perhaps because of it, questions about Darwin Core, its definitions, and its applications continue to arise. However, no easy mechanism previously existed for the users of the standard to ask their questions and to have them answered and documented in a strategic and timely way. In order to close this gap, a double initiative was developed: the Darwin Core Hour (DCH) (Darwin Core Hour Team 2017a) and the Darwin Core Questions & Answers Site (Darwin Core Hour Team 2017b). The Darwin Core Hour (Zermoglio et al. 2017) is a webinar series in which particular topics concerning the Darwin Core standard and its use are presented by more experienced and vested community members for the benefit of and discussion amongst the community as a whole. All webinars are recorded for broader accessibility. The Darwin Core Questions & Answers Site is a GitHub repository where questions from the community are submitted as issues, then discussed and answered in the repository as a means of building up documentation. These two instances are tightly linked and feed each other (Fig. 1). Questions from the community, some arising during the webinars, turn into issues and are then answered and shaped into documentation, while some questions give birth to new webinar topics for further discussion. So far, this double-initiative model has proved useful in bringing together communities from different geographic locales, levels of expertise, and degrees of involvement in open dialogue for the collaborative evolution of the standard. At the time of this presentation, the group has produced nine webinar sessions and provided a corpus of documentation on several topics. We will discuss its current status, the origins and potential of the double-initiative model, community feedback, and future directions, in addition to inviting the TDWG community to join efforts to keep the Darwin Core standard "in action". HTML XML PDF
      PubDate: Mon, 14 Aug 2017 16:17:34 +0300
       
  • Maintenance and development of Symbiota2, a platform for data sharing and
           visualization

    • Abstract: Biodiversity Information Science and Standards 1: e20220
      DOI : 10.3897/tdwgproceedings.1.20220
      Authors : Mary Barkworth, Andrew Miller, Curtis Dyreson, Benjamin Brandt, William Pearse : Symbiota is a database management system for aggregating and displaying record-based biodiversity information from collections of widely varying sizes and integrating them with images of living organisms and image-based records. It is currently used by over 230 collections that collectively provide access to records of over 20 million specimens. Its popularity is attributable to the low financial and learning barriers to participation in a Symbiota network and the wide array of tools it offers for creating resources needed by different user groups. It has been developed through grants, contracts, and pro bono contributions, but it suffers from a limited pool of developers (for a project of its size), lack of a coordinated training program, and absence of a structure for creating an expanded funding base for maintenance of the program and the networks that adopt it. Symbiota2 is designed to address these issues, but development of a business plan is in its preliminary stages. It requires articulation of an overall vision for development of the program, knowledge of its current and potential user base and critical funding needs, and identification of those it benefits, particularly those capable of providing financial support. This presentation was developed in response to an invitation to speak at the Symposium. It outlines our preliminary thoughts on the subject. One of the first needs is to establish an entity, provisionally the Symbiota2 Foundation (S2F), to speak for Symbiota2. Its primary task would be fundraising. S2F would have an advisory board comprising individuals willing to help support Symbiota2 development, both personally and by promoting it to their contacts, plus representatives of Symbiota2’s developers, networks, and network users. The first goal of the Foundation would be to raise the funds required to ensure that any established networks remain active and secure by maintaining the currency of network installations, responding to bug reports, informing network managers of changes to the program, and making minor enhancements as requested by users. Grant funding will still be needed to add significant new features; enhancements requiring major changes in the underlying program or development of completely new apps would still require grant or contract support. Before launching S2F, we need to build more effective communication with network and collection managers; create flyers, booklets, and presentations explaining the power of the program and its benefits to many different user groups; develop case studies illustrating benefits derived from using the program; and create at least one monetized app. These resources will make clearer the benefits that the biodiversity sciences, and particularly Symbiota2, provide for the public because, ultimately, it is public support that is needed. We must use all the tools at our disposal to encourage people’s interest in the organisms around them if we are to maintain support for biodiversity science and the collections on which it is based. Symbiota2 can play an important role in this regard. It is, however, unrealistic to expect that Symbiota2 will be able to develop the basic funds needed solely from monetized apps; it is also unrealistic to think it can be supported by charging data providers, because many of them are, at present, unfunded. Consequently, there is no choice: we need to initiate a fundraising program. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 15:04:03 +0300
       
  • TAXREF-LD: A Reference Thesaurus for Biodiversity on the Web of
           Linked Data

    • Abstract: Biodiversity Information Science and Standards 1: e20232
      DOI : 10.3897/tdwgproceedings.1.20232
      Authors : Franck Michel, Catherine Faron-Zucker, Sandrine Tercerie, Gargominy Olivier : Started in the early 2000s, the Web of Data has now become a reality [Bizer 2009]. It keeps on growing through the relentless publication and interlinking of data sets spanning various domains of knowledge. Building upon the Resource Description Framework (RDF), this new layer of the Web implements the Linked Data paradigm [Heath and Bizer 2011] to connect and share pieces of data from disparate data sets. Thereby, it enables the integration of distributed and heterogeneous data sets, spawning an unprecedented worldwide knowledge base. Taxonomic registers are key tools to help us comprehend the diversity of nature. They are the backbone for integrating independent data sources, and help figure out strategies regarding biodiversity and natural heritage conservation. As such, they naturally stand out as potential contributors to the Web of Data. Several international initiatives on taxonomic thesauri, such as the NCBI Organismal Classification [Federhen 2012], the AGROVOC Multilingual agricultural thesaurus [Caracciolo et al. 2013] or the Encyclopedia of Life [Blaustein 2009], have already made this move towards the Web of Data. In this talk, we will present on-going work related to TAXREF [Gargominy et al. 2016], the taxonomic register for fauna, flora and fungi, maintained and distributed by the National Museum of Natural History of Paris (France). TAXREF registers all species inventoried in metropolitan France and overseas territories, in a controlled hierarchy of over 500,000 scientific names. Our goal is to publish TAXREF on the Web of Data, denoted TAXREF-LD, while adhering to standards and best practices for the publication of Linked Open Data (LOD) [Farias Lóscio et al. 2017]. The publication of TAXREF-LD as LOD required tackling several challenges. Far beyond a sheer automatic translation of the TAXREF database into LOD standards, the key point of the reported endeavor was the design of a model able to account for the two coexisting yet distinct realities underlying TAXREF, namely the nomenclature and the taxonomy. At the nomenclatural level, each scientific name is represented by a concept, expressed in the Simple Knowledge Organization System (SKOS) vocabulary [Miles and Bechhofer 2009], along with an authority and a taxonomic rank. At the taxonomic level, a species is represented by a class in the Web Ontology Language (OWL) [Schneider et al. 2012] whose properties are the species traits (habitat, biogeographical status, conservation status...). Both levels are connected by the links between a species and associated names (the valid name and existing synonyms). Note that the modelling applies not only to species but also to any other taxonomic rank (genus, family, etc.). This model has several key advantages. First, it is relevant to biologists as well as computer scientists. Indeed, it agrees with three centuries of thinking on nomenclatural codes [Ride et al. 1999, McNeill et al. 2012] while, at the same time, it fits in with the philosophy underpinning SKOS and OWL: the nomenclatural level allows circulating through a hierarchy of concepts representing scientific names, and at the taxonomic level, the OWL classes represent the sets of individuals sharing common traits. Second, the model enables drawing links with other data sources published on the Web of Data that may represent either nomenclatural or taxonomic information. 
Third, the taxonomy evolves frequently along with newly discovered species and changes in the scientific consensus. Typically, a name may alternatively be considered as the valid name of a species or a synonym. The distinction between the nomenclatural and taxonomic levels, alongside an appropriate Uniform Resource Identifier (URI) naming scheme for names and taxa, makes the model flexible enough to accommodate such changes. Furthermore, our goal in this talk is not only to present the work achieved, but more importantly to engage in a discussion with the stakeholders of the community, whether they are data consumers or producers of sibling classifications concerned with the publication of LOD, about data integration scenarios that may arise from the availability of such a large, distributed knowledge base. HTML XML PDF
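      The two-level model can be sketched in a handful of triples. The Python/rdflib fragment below illustrates the nomenclatural level (a SKOS concept for the name), the taxonomic level (an OWL class for the taxon) and the link between them; the URIs and the linking property are assumptions for the example, not the actual TAXREF-LD naming scheme.

        # Sketch only: placeholder URIs and a hypothetical linking property.
        from rdflib import Graph, Namespace, Literal, RDF
        from rdflib.namespace import SKOS, OWL

        TAXREF = Namespace("http://example.org/taxref-ld/")  # placeholder base URI

        g = Graph()
        name = TAXREF["name/60585"]
        taxon = TAXREF["taxon/60585"]

        g.add((name, RDF.type, SKOS.Concept))                         # nomenclatural level
        g.add((name, SKOS.prefLabel, Literal("Vulpes vulpes (Linnaeus, 1758)")))
        g.add((taxon, RDF.type, OWL.Class))                           # taxonomic level
        g.add((taxon, TAXREF["hasValidName"], name))                  # hypothetical link property
        print(g.serialize(format="turtle"))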
      PubDate: Mon, 14 Aug 2017 14:21:13 +0300
       
  • OpenBiodiv Poster: an Implementation of a Semantic System Running on top
           of the Biodiversity Knowledge Graph

    • Abstract: Biodiversity Information Science and Standards 1: e20246
      DOI : 10.3897/tdwgproceedings.1.20246
      Authors : Viktor Senderov, Teodor Georgiev, Donat Agosti, Terry Catapano, Guido Sautter, Éamonn Ó Tuama, Nico Franz, Kiril Simov, Pavel Stoev, Lyubomir Penev : We present OpenBiodiv - an implementation of the Open Biodiversity Knowledge Management System. The need for an integrated information system serving the biodiversity community can be dated at least as far back as the sanctioning of the Bouchout declaration in 2007. The Bouchout declaration proposes to make biodiversity knowledge freely available as Linked Open Data (LOD)*1. At TDWG 2016 (Fig. 1) we presented the prototype of the system - then called Open Biodiversity Knowledge Management System (OBKMS) (Senderov et al. 2016). The specification and design of OpenBiodiv was then outlined in more detail by Senderov and Penev (2016). In this poster, we describe the pilot implementation. We believe OpenBiodiv is possibly the first pilot-stage implementation of a semantic system running on top of a biodiversity knowledge graph. OpenBiodiv has several components: OpenBiodiv ontology: A general data model supporting the extraction of biodiversity knowledge from taxonomic articles or from databases such as GBIF. The ontology (in preparation, Journal of Biomedical Semantics, available on GitHub) incorporates several pre-existing models: Darwin-SW (Baskauf and Webb 2016), SPAR (Peroni 2014), the Treatment Ontology, and several others. It defines classes, properties, and rules supporting the interlinking of these disparate ontologies to create a LOD biodiversity knowledge graph. A new addition is the Taxonomic Name Usage class, accompanied by a Vocabulary of Taxonomic Statuses (created via an analysis of 4,000 Pensoft articles) enabling the automated inference of the taxonomic status of Latinized scientific names. The ontology supports multiple backbone taxonomies via the introduction of a Taxon Concept class (equivalent to DarwinCore Taxon) and Taxon Concept Labels as a subclass of biological name. The Biodiversity Knowledge Graph: A LOD dataset of information extracted from taxonomic literature and databases. To date, this resource has realized part of what was proposed during the pro-iBiosphere project and later discussed by Page (2016). Its main resources are articles, sub-article components (tables, figures, treatments, references), author names, institution names, geographical locations, biological names, taxon concepts, and occurrences.
      Authors have been disambiguated via their affiliation with the use of fuzzy logic based on the GraphDB Lucene connector. The graph interlinks: (1) Prospectively published literature via Pensoft Publishers. (2) Legacy literature via Plazi. (3) Well-known resources such as geographical places or institutions via DBPedia. (4) GBIF's backbone taxonomy as a default but not the preferential hierarchy of taxon concepts. (5) OpenBiodiv IDs with nomenclator IDs (e.g. ZooBank) whenever possible. Names form two networks in the graph: (1) A directed acyclic graph (DAG) of supersedence that can be followed to the corresponding sinks to infer the currently applicable scientific name for a given taxon. (2) A network of bi-directional relations indicating the relatedness of names. These names may be compared to the related names inferred on the basis of distributional semantics (Nguyen et al. 2017). ropenbio: An R package for RDF*2-ization of biodiversity information resources according to the OpenBiodiv ontology. We intend to submit this to the rOpenSci project. While many of its high-level functions are specific to OpenBiodiv, the low-level functions and its RDF-ization framework can be used for any R-based RDF-ization effort. OpenBiodiv.net: A front-end of the system allowing users to run low-level SPARQL queries as well as to use an extensible set of semantic apps running on top of a biodiversity knowledge graph. HTML XML PDF
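      As an illustration of the kind of low-level SPARQL access the front-end exposes, the sketch below sends a query from Python with SPARQLWrapper. The endpoint URL and the query pattern are assumptions chosen for the example, not the system's documented API.

        # Sketch only: endpoint URL and query vocabulary are placeholders.
        from SPARQLWrapper import SPARQLWrapper, JSON

        endpoint = SPARQLWrapper("http://graph.openbiodiv.net/repositories/obkms")  # assumed URL
        endpoint.setReturnFormat(JSON)
        endpoint.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?resource ?label WHERE {
                ?resource rdfs:label ?label .
            } LIMIT 10
        """)

        # Print the first few labelled resources in the knowledge graph.
        for row in endpoint.query().convert()["results"]["bindings"]:
            print(row["resource"]["value"], "-", row["label"]["value"])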
      PubDate: Mon, 14 Aug 2017 12:37:42 +0300
       
  • Angling for data: making biodiversity metadata more FAIR

    • Abstract: Biodiversity Information Science and Standards 1: e20267
      DOI : 10.3897/tdwgproceedings.1.20267
      Authors : Joakim Philipson : The FAIR guiding principles for making research data more Findable, Accessible, Interoperable and Re-usable, first launched in 2014, have not yet been widely implemented for biodiversity data. This may partly be because the FAIR principles themselves are not yet fully operational and easy to interpret. Work is in progress by different task groups to remedy this, and different attempts have already been made. In this paper I will give some concrete tips aimed at implementing the FAIR principles for biodiversity research data, focusing on the metadata, in order to enhance the quality of data by making them more findable, accessible, interoperable and reusable. Among the steps that could be taken to make biodiversity database records more findable and accessible is, for example, to add schema.org markup to the HTML source code of the corresponding web pages, as has been successfully employed in the UniProt database. Recently, biocaddie.org has mapped the metadata format DATS (Data Tag Suite) to schema.org, and there is also the ongoing adaptation effort of bioschemas.org. In addition, there is the highly commendable work done by the former biosharing.org, which has now become the more general fairsharing.org and which aims to enhance findability, promote the adoption of metadata standards by policy makers and interlink metadata standards among themselves and with repositories (Sansone 2017). Further, to make biodiversity records more interoperable and reusable, it is essential to provide metadata export to a selection of general standards and formats. In doing this, promises should be kept, meaning that exported metadata records should also validate against the schemas of the chosen format standard. By validating against the schemas of both the preferred metadata standard and the export formats, biodiversity data records also stand a better chance of achieving what has been defined by GBIF and VertNet as Fitness-for-use, encompassing e.g. accessibility, content, completeness, dataset-level or record-level checks, error correction, etc. (Russell 2011). That is, of course, provided the relevant metadata standards have validation schemas or online tools, such as the Darwin Core Archive/EML validator, that are sufficiently precise to check for these properties. If not, there is always the possibility of creating tailor-made validation schemas serving the data quality needs of a specialized biodiversity data repository, e.g. using Schematron or JSON Schema. HTML XML PDF
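      A minimal sketch of the "promises should be kept" point: validating an exported metadata record against a tailor-made JSON Schema in Python. The schema and field names are illustrative assumptions, not an existing biodiversity export format.

        # Sketch only: schema fields and the example record are placeholders.
        from jsonschema import validate, ValidationError

        dataset_schema = {
            "type": "object",
            "required": ["title", "identifier", "license"],
            "properties": {
                "title": {"type": "string", "minLength": 1},
                "identifier": {"type": "string", "pattern": "^https?://"},
                "license": {"type": "string"},
            },
        }

        record = {"title": "Fish occurrences of the Baltic Sea",
                  "identifier": "https://example.org/dataset/123",
                  "license": "CC-BY 4.0"}

        try:
            validate(instance=record, schema=dataset_schema)
            print("Record validates against the export schema")
        except ValidationError as err:
            print("Validation failed:", err.message)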
      PubDate: Mon, 14 Aug 2017 11:41:17 +0300
       
  • Annotating out the Way to the Linked Biodiversity Data Web

    • Abstract: Biodiversity Information Science and Standards 1: e20270
      DOI : 10.3897/tdwgproceedings.1.20270
      Authors : Guan Shuo Mai, Fu Chun Yang, Mao-Ning Tuanmu : Image annotation is a common approach for biodiversity detection, labeling features of interest in images. However, annotation tools and data structures are usually developed and combined as platforms for specific purposes. This makes the tools hard to adopt across domains and hinders the interoperability of potentially related data from multiple sources. Following linked data principles and ontology design patterns, we proposed a platform-independent framework and implemented a web-based prototype for semantically annotating images with persistent HTTP Uniform Resource Identifiers (URIs). Our framework is designed to break down data silos, i.e. scattered information annotated from active or legacy biodiversity databases, personal observation blogs, or albums can be queried and interoperated together. The prototype can be used without installation and easily integrated into other platforms. It pulls image links from a page and lets people select features of interest (e.g. flowers, birds, or patterns) as tokens with bounding boxes in an image. Tokens can then be populated with properties or traits (e.g. colors, behaviors) derived from domain ontologies, which are treated as choosable profiles. Meanwhile, tokens can be described with measurement data in certain dimensions such as body weight or wing length. Relations can be created between any two tokens from arbitrary hosts. Tokens, properties, measurements and relations are assembled through framework ontologies such as the Extensible Observation Ontology (OBOE). Each token is given a hash URI composed of an image URI and a Universally Unique Identifier (UUID). With URIs, relations can be explicitly kept as structured data instead of literal descriptions, and the data location can be easily resolved. Annotation data are modeled as a graph, shared and aggregated by URIs, and thus the meta-information can be extended as far as needed, exactly as linked data allows. We made a simple visualization to show the interlinking data graph (Fig. 1). In general, audio can also be annotated on spectrograms, with a simple translation from x, y bounding box coordinates to the time and frequency domains to align with the real annotated target. Because non-experts find it difficult to describe contents with precise words, an ontology bridging amateurs to professionals should be introduced. Data quality is controlled not only by expert validation but also by peer review from experienced observers, with revisions. Diverse applications, such as voucher-based biota, trait databases, species recognition, visualized dynamic identification keys, phenology monitoring and species interaction data building (e.g. food webs, parasitism), can be run in a crowdsourcing approach by communities in different domains, while all their efforts at ground-truth development are integrated, ready for further discovery and reuse under our framework. HTML XML PDF
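      A minimal sketch, in Python, of how such a token might be minted: a hash URI built from the image URI and a UUID, carrying a bounding box and a label. The key names are illustrative assumptions; the prototype's actual data model is defined by its framework ontologies.

        # Sketch only: key names are placeholders, not the prototype's schema.
        import json
        import uuid

        def make_token(image_uri, bbox, label):
            """Create an annotation token for a region of interest in an image."""
            token_uri = f"{image_uri}#{uuid.uuid4()}"   # image URI + UUID hash fragment
            return {
                "@id": token_uri,
                "onImage": image_uri,
                "boundingBox": {"x": bbox[0], "y": bbox[1], "w": bbox[2], "h": bbox[3]},
                "label": label,
            }

        token = make_token("https://example.org/photos/heron.jpg", (120, 80, 64, 48), "bird")
        print(json.dumps(token, indent=2))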
      PubDate: Mon, 14 Aug 2017 11:20:44 +0300
       
  • Integrating data-cleaning with data analysis to enhance usability of
           biodiversity big-data

    • Abstract: Biodiversity Information Science and Standards 1: e20244
      DOI : 10.3897/tdwgproceedings.1.20244
      Authors : Tomer Gueta, Yohay Carmel : Biodiversity big-data (BBD) has the potential to provide answers to some unresolved questions – at spatial and taxonomic swathes that were previously inaccessible. However, BBDs contain serious error and bias. Therefore, any study that uses BBD should ask whether data quality is sufficient to provide a reliable answer to the research question. We propose that the question of data quality and the research question could be addressed simultaneously, by binding data-cleaning to data analysis. The change in signal between the pre- and post-cleaning phases, in addition to the signal itself, can be used to evaluate the findings, their implications, and their robustness. This approach includes five steps: (1) downloading raw occurrence data from a BBD; (2) data analysis, statistical and/or simulation modeling to answer the research question, using the raw data after the necessary basic cleaning (this part is similar to common practice); (3) comprehensive data-cleaning; (4) repeated data analysis using the cleaned data; and (5) comparing the results of steps 2 and 4 (i.e., before and after data-cleaning). This comparison addresses the issue of data quality, as well as answering the research question itself. The results of step 2 alone may be misleading, due to the error and bias in the data. Even the results of step 4 may not be trustworthy, since data-cleaning is never complete, and some of the error and much of the bias remain in the data. However, the changes in the results before and after cleaning are important keys to answering the research question. If cleaned data reveal a stronger and clearer signal than raw data, then the signal is most likely trustworthy, and the respective hypothesis is confirmed. Conversely, if the cleaned data show a weaker signal than obtained from the raw data, then the respective hypothesis, even if confirmed by the original data, needs to be rejected. Lastly, if there is a mixed trend, whereby in some cases the signal is stronger and in others it is weaker, the data are probably inadequate and findings cannot be considered conclusive. Thus, we propose that data-cleaning and data analysis should be conducted jointly. We present a case study on the effects of environmental factors on species distribution, using GBIF data for all Australian mammals. We used the performance of a species distribution model (SDM) as a proxy for the strength of environmental factors in determining gradients of species richness. We implemented three different SDM algorithms for 190 species in several different grid cells that vary in their species richness. We examined the correlations between species richness and 10 different SDM performance indices. Species-environment affinity was weaker in species-rich areas, across all SDM algorithms. The results support the notion that the impact of environmental factors on species distribution at a continental scale decreases with increasing species richness. Seemingly, the results also support the continuum hypothesis, namely that in species-poor areas, species have strong affinities to particular niches, but this structure breaks down in species-rich communities. Furthermore, a much stronger signal was revealed after data-cleaning. Thus, a joint study of a research question and data-cleaning provides a more reliable means of using BBDs. HTML XML PDF
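      The five-step workflow can be reduced to a small harness that runs the same analysis before and after cleaning and interprets the change in signal. In the Python sketch below, the clean and fit_and_score callables are placeholders for the study-specific cleaning and SDM steps; the toy example at the end only demonstrates the control flow.

        # Sketch only: clean() and fit_and_score() stand in for study-specific steps.
        def compare_signals(raw_records, clean, fit_and_score):
            """Run the same analysis on raw and cleaned data and interpret the change."""
            raw_signal = fit_and_score(raw_records)          # step 2: analyse raw data
            cleaned_records = clean(raw_records)             # step 3: comprehensive cleaning
            cleaned_signal = fit_and_score(cleaned_records)  # step 4: repeat the analysis
            # step 5: compare before and after cleaning
            if cleaned_signal > raw_signal:
                return "stronger after cleaning: signal likely trustworthy"
            if cleaned_signal < raw_signal:
                return "weaker after cleaning: hypothesis should be rejected"
            return "mixed or unchanged: findings inconclusive"

        # Toy demonstration with placeholder cleaning and scoring functions.
        example = [{"decimalLatitude": -33.9}, {"decimalLatitude": None}]
        print(compare_signals(
            example,
            clean=lambda recs: [r for r in recs if r["decimalLatitude"] is not None],
            fit_and_score=lambda recs: len(recs)))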
      PubDate: Mon, 14 Aug 2017 10:51:33 +0300
       
  • Invasive Organisms Information: A proposed TDWG Task Group

    • Abstract: Biodiversity Information Science and Standards 1: e20266
      DOI : 10.3897/tdwgproceedings.1.20266
      Authors : Quentin Groom, Steven Baskauf, Peter Desmet, Melodie McGeoch, Shyama Pagad, Dmitry Schigel, Ramona Walls, John Wilson, Paula Zermoglio : Invasive species are a global problem for conservation, economics and health. Information on their distribution, spread and impact is essential to inform national and international policy on biodiversity. Furthermore, demand for these data is only likely to increase as recent environmental change results in the widespread reconfiguring of species distributions. Researchers and managers of invasive species require certain elements of data from observations and inventories of species, such as how the organism was brought to the location, how well established it is, and whether it is considered alien to that location. However, Darwin Core either lacks terms sufficient for these purposes or does not have a suitable controlled vocabulary on existing terms to express these concepts clearly and to harmonize data collection. We are proposing a TDWG task group to make recommendations to improve Darwin Core for invasive species research and management. Some of the specific terms we will look at are dwc:establishmentMeans and dwc:occurrenceStatus. However, we may also recommend new terms and controlled vocabularies, including how to express the degree of establishment of an organism at a location. We will look at current frameworks for alien species data and analyse how these are used both by invasive species specialists and by the broader community collecting biodiversity observations. We will aim to make a proposal that is sufficiently flexible to be of use to the whole community, while providing sufficient resolution to be of use to specialists in invasion biology. HTML XML PDF
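      A minimal sketch of the kind of controlled-vocabulary check such terms would enable, written in Python. The vocabulary values listed here are illustrative placeholders, not the task group's recommendation.

        # Sketch only: the vocabulary below is a placeholder, not a ratified list.
        ESTABLISHMENT_MEANS_VOCAB = {"native", "introduced", "naturalised", "invasive", "managed", "uncertain"}

        def check_establishment_means(record):
            """Flag occurrence records whose establishmentMeans is missing or uncontrolled."""
            value = (record.get("establishmentMeans") or "").strip().lower()
            if not value:
                return "MISSING"
            return "OK" if value in ESTABLISHMENT_MEANS_VOCAB else f"UNCONTROLLED: {value}"

        print(check_establishment_means({"establishmentMeans": "Introduced"}))   # OK
        print(check_establishment_means({"establishmentMeans": "escaped pet"}))  # UNCONTROLLED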
      PubDate: Mon, 14 Aug 2017 10:11:17 +0300
       
  • SPNHC 2017 Natural History Collections Biodiversity Informatics 101 Short
           Course Insights

    • Abstract: Biodiversity Information Science and Standards 1: e20263
      DOI : 10.3897/tdwgproceedings.1.20263
      Authors : Holly Little, Deborah Paul, Jennifer Strotman : Many digitization-focused talks at recent Society for the Preservation of Natural History Collections (SPNHC, spnhc.org) meetings illustrate the impact of biodiversity informatics tools, standards, and resources on collections management tasks. Collections and data managers need up-to-date data skills and knowledge to manage and curate their growing digital collections. The National Science Foundation (NSF) Program, Advancing the Digitization of Biological Collections (ADBC), highlights the need for collections informatics literacy in support of creating and sustaining digital resources. The vision for a Network Integrated Biocollections Alliance (NIBA; Beach et al. 2010) conceptualized the necessity for a program such as the ADBC. In their subsequent NIBA Implementation Plan (Beach et al. 2012), goal three is "Enhance the training of existing collections staff and create the next generation of biodiversity information managers." In the rapidly changing digital collections landscape, collections staff find their roles evolving. Technical discussions in meeting symposia, and some demos showcasing informatics and technological advances, often assume expertise and background knowledge that the audience lacks. In addition, where there are challenges in informatics literacy within the collections community, one can also observe a breakdown in understanding and communication of overlapping informatics-related goals and efforts between collections staff and their informatics counterparts and administration. In order to facilitate overall understanding of the biodiversity informatics landscape, we developed the Natural History Collections Biodiversity Informatics 101 short course, offered just before the start of the SPNHC 2017 meeting. The aim of this one-day short course was to provide introductory materials on a wide range of biodiversity informatics topics. Topics covered included the basics of natural history collection data and digital object lifecycle management, including digital archiving as well as mobilizing collections data and participating in global initiatives. The course was led by museum and informatics professionals with various expertise in natural history collections digitization and informatics. The presentations were distributed across three main themes: 1) what is natural history biodiversity informatics? 2) what are some of the current relevant projects? and 3) how to get involved. We were able to gather feedback during the course day, throughout the SPNHC 2017 meeting, and in a post-short-course survey. Here we share an assessment of the course, the feedback received as it relates to skills and workflow assessments and needs, possible future iterations of this short course, and a better understanding of the informatics literacy skills landscape of the SPNHC community. We seek your input on this topic for the 2018 TDWG and SPNHC joint meeting and for TDWG's future role in this capacity-building effort. HTML XML PDF
      PubDate: Mon, 14 Aug 2017 8:58:14 +0300
       
  • Fitness for Use: The BDQIG aims for improved Stability and Consistency

    • Abstract: Biodiversity Information Science and Standards 1: e20240
      DOI : 10.3897/tdwgproceedings.1.20240
      Authors : Arthur Chapman, Antonio Saraiva, Lee Belbin, Allan Veiga, Miles Nicholls, Paula Zermoglio, Paul Morris, Dmitry Schigel, Alexander Thompson : The process of choosing data for a project and then determining what subset of records are suitable for use has become one of the most important concerns for biodiversity researchers in the 21st century. The rise of large data aggregators such as GBIF (Global Biodiversity Information Facility), iDigBio (Integrated Digitized Biocollections), the ALA (Atlas of Living Australia) and its many clones, OBIS (Ocean Biogeographic Information System), SIBBr (Sistema de Informação sobre a Biodiversidade Brasileira), CRIA (Centro de Referência em Informação Ambiental) and many others has made access to large volumes of data easier, but choosing which data are fit for use remains a more difficult task. There has been no consistency between the various aggregators on how best to clean data and document its quality – how tests are run, or how annotations are stored and reported. Feedback to data custodians on possible errors has been minimal and inconsistent, and adherence to recommendations and controlled vocabularies (where they exist) has been haphazard, to say the least. The TDWG Data Quality Interest Group is addressing these issues, either alone or in conjunction with other Interest Groups (Annotations, Darwin Core, Invasive Species, Citizen Science and Vocabulary Maintenance), to develop a framework, tests and assertions, use cases and controlled vocabularies. The Interest Group is also working closely with the data aggregators toward consistent implementations. The practical work is being done through five Task Groups. A published framework is leading to a user-friendly Fitness for Use Backbone (FFUB) and data quality profiles by which users can document the quality they need for a project. A standard set of core tests and assertions has been developed around the Darwin Core standard and is currently being tested and integrated into several aggregators. A use case library has been compiled and these cases will lead to themed data quality profiles as part of the FFUB. Two new Task Groups are being established to develop controlled vocabularies to address the inconsistencies in values of at least 40 Darwin Core terms. These inconsistencies make the evaluation of fitness for use far more difficult than it would be if controlled vocabularies were used. The first TG is looking at vocabularies generally, while the second is looking at those just pertaining to Invasive Species. It is not just the aggregators, though, that are stakeholders in this work. The data custodians and even the collectors have a vested interest in ensuring their data and metadata are of the highest quality and therefore in seeing their data used widely. It is only after aggregation that many uses of the data become apparent, and most collectors aren’t aware of these uses at the time of collecting. Issues of data quality at the time of collection can restrict the range of later uses of the data. Feeding back information to the data custodians from users and aggregators on suspect records is essential, and this is where annotations and reporting back on the results of tests conducted by aggregators are important. The project is also generating standard code and test data for the tests and assertions so that data custodians can readily integrate them into their own procedures. It is far cheaper to correct errors at the source than to try to rectify them further down the line. 
A lot of progress has been made, but we still have a long way to go – join us in making biodiversity data quality a product of which we can all be proud. HTML XML PDF
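      As a rough illustration of what one core test with a standard assertion might look like, the Python sketch below checks coordinate values on a Darwin Core occurrence record. The response labels are assumptions in the spirit of the framework, not its exact vocabulary.

        # Sketch only: result labels are illustrative, not the ratified response vocabulary.
        def validate_coordinates(record):
            """Return (status, comment) for a Darwin Core occurrence record."""
            try:
                lat = float(record["decimalLatitude"])
                lon = float(record["decimalLongitude"])
            except (KeyError, TypeError, ValueError):
                return ("INTERNAL_PREREQUISITES_NOT_MET", "coordinates missing or not numeric")
            if -90 <= lat <= 90 and -180 <= lon <= 180:
                return ("COMPLIANT", "coordinates are within valid ranges")
            return ("NOT_COMPLIANT", "coordinates out of range")

        print(validate_coordinates({"decimalLatitude": "-33.86", "decimalLongitude": "151.21"}))
        print(validate_coordinates({"decimalLatitude": "95.0", "decimalLongitude": "151.21"}))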
      PubDate: Mon, 14 Aug 2017 06:24:34 +0300
       
  • What’s Missing From All the Portals?

    • Abstract: Biodiversity Information Science and Standards 1: e20236
      DOI : 10.3897/tdwgproceedings.1.20236
      Authors : Sharon Grant, Kate Webbink, Janeen Jones, Pete Herbst, Robert Zschernitz, Rusty Russell : At the time of writing, there are over 784 million occurrence records in the Global Biodiversity Information Facility (GBIF) portal (gbif.org), 106 million on the iDigBio site (idigbio.org), 68 million in the Atlas of Living Australia (ala.org.au) and 20 million in VertNet (vertnet.org). The list of biodiversity aggregators and portals that boast occurrence counts in the millions continues to increase. Combined with sites that gather their data from outside the GBIF domain, such as The Paleobiology Database, there is compelling evidence that global digitization is starting to illuminate the black hole of biodiversity data held in collections across the world. The visibility of and demands on our collective natural history heritage have never been higher, and they are increasingly in the spotlight with both internal and external audiences. Funding sources have moved away from massive "digitization for the sake of digitization" projects and demand much more focused proposals. To compete in this arena, collections staff and researchers must collaborate and mine collections for their strengths and use those to justify efforts. To do this, however, they must have access to information about the non-digitized occurrence-level records in the world’s holdings. We discuss the potential use of current TDWG standards to allow the capture of existing institutional data about undigitized collections, and also those whose records have been marked as environmentally, culturally, or politically sensitive and so must remain digitally dark, so that portals like GBIF can use them in a way comparable to existing occurrence records. Can Darwin Core (with its extensions), together with the Natural Collections Description (draft standard), be used to describe accessions, inventory-level information, and backlog estimates in an efficient and effective way and provide even greater visibility of those undigitized occurrences? In addition, can these data also serve as a means to further refine existing digitized records? HTML XML PDF
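      A minimal sketch of the idea raised in the abstract: summarising an undigitized or digitally dark holding at the dataset level with Darwin Core-style and NCD-inspired fields, written in Python. Every field name beyond collectionCode and basisOfRecord is a hypothetical placeholder, not a term from either standard.

        # Sketch only: field names marked below are illustrative assumptions.
        undigitized_holding = {
            "collectionCode": "EXAMPLE-Herps",          # Darwin Core term
            "basisOfRecord": "PreservedSpecimen",       # Darwin Core term
            "estimatedSpecimenCount": 45000,            # hypothetical backlog estimate field
            "taxonomicCoverage": "Squamata",            # hypothetical NCD-style descriptor
            "geographicCoverage": "Neotropics",         # hypothetical NCD-style descriptor
            "digitizationStatus": "not digitized",      # hypothetical flag for aggregators
        }

        print(f"{undigitized_holding['collectionCode']}: "
              f"~{undigitized_holding['estimatedSpecimenCount']} undigitized specimens "
              f"({undigitized_holding['taxonomicCoverage']}, {undigitized_holding['geographicCoverage']})")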
      PubDate: Mon, 14 Aug 2017 0:35:44 +0300
       
 
 