
Publisher: Oxford University Press   (Total: 370 journals)


Showing 1 - 200 of 370 Journals sorted alphabetically
Acta Biochimica et Biophysica Sinica     Hybrid Journal   (Followers: 6, SJR: 0.881, h-index: 38)
Adaptation     Hybrid Journal   (Followers: 9, SJR: 0.111, h-index: 4)
Aesthetic Surgery J.     Hybrid Journal   (Followers: 6, SJR: 1.538, h-index: 35)
African Affairs     Hybrid Journal   (Followers: 59, SJR: 1.512, h-index: 46)
Age and Ageing     Hybrid Journal   (Followers: 85, SJR: 1.611, h-index: 107)
Alcohol and Alcoholism     Hybrid Journal   (Followers: 17, SJR: 0.935, h-index: 80)
American Entomologist     Full-text available via subscription   (Followers: 6)
American Historical Review     Hybrid Journal   (Followers: 149, SJR: 0.652, h-index: 43)
American J. of Agricultural Economics     Hybrid Journal   (Followers: 39, SJR: 1.441, h-index: 77)
American J. of Epidemiology     Hybrid Journal   (Followers: 172, SJR: 3.047, h-index: 201)
American J. of Hypertension     Hybrid Journal   (Followers: 25, SJR: 1.397, h-index: 111)
American J. of Jurisprudence     Hybrid Journal   (Followers: 18)
American J. of Legal History     Full-text available via subscription   (Followers: 6, SJR: 0.151, h-index: 7)
American Law and Economics Review     Hybrid Journal   (Followers: 27, SJR: 0.824, h-index: 23)
American Literary History     Hybrid Journal   (Followers: 12, SJR: 0.185, h-index: 22)
Analysis     Hybrid Journal   (Followers: 23)
Annals of Botany     Hybrid Journal   (Followers: 35, SJR: 1.912, h-index: 124)
Annals of Occupational Hygiene     Hybrid Journal   (Followers: 28, SJR: 0.837, h-index: 57)
Annals of Oncology     Hybrid Journal   (Followers: 48, SJR: 4.362, h-index: 173)
Annals of the Entomological Society of America     Full-text available via subscription   (Followers: 8, SJR: 0.642, h-index: 53)
Annals of Work Exposures and Health     Hybrid Journal  
AoB Plants     Open Access   (Followers: 4, SJR: 0.78, h-index: 10)
Applied Economic Perspectives and Policy     Hybrid Journal   (Followers: 19, SJR: 0.884, h-index: 31)
Applied Linguistics     Hybrid Journal   (Followers: 52, SJR: 1.749, h-index: 63)
Applied Mathematics Research eXpress     Hybrid Journal   (Followers: 1, SJR: 0.779, h-index: 11)
Arbitration Intl.     Full-text available via subscription   (Followers: 20)
Arbitration Law Reports and Review     Hybrid Journal   (Followers: 13)
Archives of Clinical Neuropsychology     Hybrid Journal   (Followers: 27, SJR: 0.96, h-index: 71)
Aristotelian Society Supplementary Volume     Hybrid Journal   (Followers: 2, SJR: 0.102, h-index: 20)
Arthropod Management Tests     Hybrid Journal   (Followers: 2)
Astronomy & Geophysics     Hybrid Journal   (Followers: 45, SJR: 0.144, h-index: 15)
Behavioral Ecology     Hybrid Journal   (Followers: 51, SJR: 1.698, h-index: 92)
Bioinformatics     Hybrid Journal   (Followers: 271, SJR: 4.643, h-index: 271)
Biology Methods and Protocols     Hybrid Journal  
Biology of Reproduction     Full-text available via subscription   (Followers: 9, SJR: 1.646, h-index: 149)
Biometrika     Hybrid Journal   (Followers: 19, SJR: 2.801, h-index: 90)
BioScience     Hybrid Journal   (Followers: 30, SJR: 2.374, h-index: 154)
Bioscience Horizons : The National Undergraduate Research J.     Open Access   (Followers: 1, SJR: 0.213, h-index: 9)
Biostatistics     Hybrid Journal   (Followers: 16, SJR: 1.955, h-index: 55)
BJA : British J. of Anaesthesia     Hybrid Journal   (Followers: 156, SJR: 2.314, h-index: 133)
BJA Education     Hybrid Journal   (Followers: 65, SJR: 0.272, h-index: 20)
Brain     Hybrid Journal   (Followers: 63, SJR: 6.097, h-index: 264)
Briefings in Bioinformatics     Hybrid Journal   (Followers: 46, SJR: 4.086, h-index: 73)
Briefings in Functional Genomics     Hybrid Journal   (Followers: 4, SJR: 1.771, h-index: 50)
British J. for the Philosophy of Science     Hybrid Journal   (Followers: 35, SJR: 1.267, h-index: 38)
British J. of Aesthetics     Hybrid Journal   (Followers: 27, SJR: 0.217, h-index: 18)
British J. of Criminology     Hybrid Journal   (Followers: 548, SJR: 1.373, h-index: 62)
British J. of Social Work     Hybrid Journal   (Followers: 85, SJR: 0.771, h-index: 53)
British Medical Bulletin     Hybrid Journal   (Followers: 7, SJR: 1.391, h-index: 84)
British Yearbook of Intl. Law     Hybrid Journal   (Followers: 27)
Bulletin of the London Mathematical Society     Hybrid Journal   (Followers: 3, SJR: 1.474, h-index: 31)
Cambridge J. of Economics     Hybrid Journal   (Followers: 59, SJR: 0.957, h-index: 59)
Cambridge J. of Regions, Economy and Society     Hybrid Journal   (Followers: 10, SJR: 1.067, h-index: 22)
Cambridge Quarterly     Hybrid Journal   (Followers: 11, SJR: 0.1, h-index: 7)
Capital Markets Law J.     Hybrid Journal   (Followers: 1)
Carcinogenesis     Hybrid Journal   (Followers: 2, SJR: 2.439, h-index: 167)
Cardiovascular Research     Hybrid Journal   (Followers: 12, SJR: 2.897, h-index: 175)
Cerebral Cortex     Hybrid Journal   (Followers: 43, SJR: 4.827, h-index: 192)
CESifo Economic Studies     Hybrid Journal   (Followers: 17, SJR: 0.501, h-index: 19)
Chemical Senses     Hybrid Journal   (Followers: 1, SJR: 1.436, h-index: 76)
Children and Schools     Hybrid Journal   (Followers: 6, SJR: 0.211, h-index: 18)
Chinese J. of Comparative Law     Hybrid Journal   (Followers: 3)
Chinese J. of Intl. Law     Hybrid Journal   (Followers: 21, SJR: 0.737, h-index: 11)
Chinese J. of Intl. Politics     Hybrid Journal   (Followers: 9, SJR: 1.238, h-index: 15)
Christian Bioethics: Non-Ecumenical Studies in Medical Morality     Hybrid Journal   (Followers: 11, SJR: 0.191, h-index: 8)
Classical Receptions J.     Hybrid Journal   (Followers: 24, SJR: 0.1, h-index: 3)
Clinical Infectious Diseases     Hybrid Journal   (Followers: 60, SJR: 4.742, h-index: 261)
Clinical Kidney J.     Open Access   (Followers: 4, SJR: 0.338, h-index: 19)
Community Development J.     Hybrid Journal   (Followers: 24, SJR: 0.47, h-index: 28)
Computer J.     Hybrid Journal   (Followers: 8, SJR: 0.371, h-index: 47)
Conservation Physiology     Open Access   (Followers: 2)
Contemporary Women's Writing     Hybrid Journal   (Followers: 11, SJR: 0.111, h-index: 3)
Contributions to Political Economy     Hybrid Journal   (Followers: 5, SJR: 0.313, h-index: 10)
Critical Values     Full-text available via subscription  
Current Legal Problems     Hybrid Journal   (Followers: 26)
Current Zoology     Full-text available via subscription   (Followers: 1, SJR: 0.999, h-index: 20)
Database : The J. of Biological Databases and Curation     Open Access   (Followers: 11, SJR: 1.068, h-index: 24)
Digital Scholarship in the Humanities     Hybrid Journal   (Followers: 13)
Diplomatic History     Hybrid Journal   (Followers: 20, SJR: 0.296, h-index: 22)
DNA Research     Open Access   (Followers: 4, SJR: 2.42, h-index: 77)
Dynamics and Statistics of the Climate System     Open Access   (Followers: 3)
Early Music     Hybrid Journal   (Followers: 15, SJR: 0.124, h-index: 11)
Economic Policy     Hybrid Journal   (Followers: 37, SJR: 2.052, h-index: 52)
ELT J.     Hybrid Journal   (Followers: 25, SJR: 1.26, h-index: 23)
English Historical Review     Hybrid Journal   (Followers: 51, SJR: 0.311, h-index: 10)
English: J. of the English Association     Hybrid Journal   (Followers: 13, SJR: 0.144, h-index: 3)
Environmental Entomology     Full-text available via subscription   (Followers: 11, SJR: 0.791, h-index: 66)
Environmental Epigenetics     Open Access   (Followers: 1)
Environmental History     Hybrid Journal   (Followers: 28, SJR: 0.197, h-index: 25)
EP-Europace     Hybrid Journal   (Followers: 2, SJR: 2.201, h-index: 71)
Epidemiologic Reviews     Hybrid Journal   (Followers: 10, SJR: 3.917, h-index: 81)
ESHRE Monographs     Hybrid Journal  
Essays in Criticism     Hybrid Journal   (Followers: 16, SJR: 0.1, h-index: 6)
European Heart J.     Hybrid Journal   (Followers: 50, SJR: 6.997, h-index: 227)
European Heart J. - Cardiovascular Imaging     Hybrid Journal   (Followers: 8, SJR: 2.044, h-index: 58)
European Heart J. - Cardiovascular Pharmacotherapy     Full-text available via subscription   (Followers: 1)
European Heart J. - Quality of Care and Clinical Outcomes     Hybrid Journal  
European Heart J. Supplements     Hybrid Journal   (Followers: 7, SJR: 0.152, h-index: 31)
European J. of Cardio-Thoracic Surgery     Hybrid Journal   (Followers: 8, SJR: 1.568, h-index: 104)
European J. of Intl. Law     Hybrid Journal   (Followers: 169, SJR: 0.722, h-index: 38)
European J. of Orthodontics     Hybrid Journal   (Followers: 4, SJR: 1.09, h-index: 60)
European J. of Public Health     Hybrid Journal   (Followers: 23, SJR: 1.284, h-index: 64)
European Review of Agricultural Economics     Hybrid Journal   (Followers: 11, SJR: 1.549, h-index: 42)
European Review of Economic History     Hybrid Journal   (Followers: 28, SJR: 0.628, h-index: 24)
European Sociological Review     Hybrid Journal   (Followers: 41, SJR: 2.061, h-index: 53)
Evolution, Medicine, and Public Health     Open Access   (Followers: 11)
Family Practice     Hybrid Journal   (Followers: 12, SJR: 1.048, h-index: 77)
Fems Microbiology Ecology     Hybrid Journal   (Followers: 9, SJR: 1.687, h-index: 115)
Fems Microbiology Letters     Hybrid Journal   (Followers: 21, SJR: 1.126, h-index: 118)
Fems Microbiology Reviews     Hybrid Journal   (Followers: 26, SJR: 7.587, h-index: 150)
Fems Yeast Research     Hybrid Journal   (Followers: 13, SJR: 1.213, h-index: 66)
Foreign Policy Analysis     Hybrid Journal   (Followers: 22, SJR: 0.859, h-index: 10)
Forestry: An Intl. J. of Forest Research     Hybrid Journal   (Followers: 16, SJR: 0.903, h-index: 44)
Forum for Modern Language Studies     Hybrid Journal   (Followers: 6, SJR: 0.108, h-index: 6)
French History     Hybrid Journal   (Followers: 32, SJR: 0.123, h-index: 10)
French Studies     Hybrid Journal   (Followers: 20, SJR: 0.119, h-index: 7)
French Studies Bulletin     Hybrid Journal   (Followers: 10, SJR: 0.102, h-index: 3)
Gastroenterology Report     Open Access   (Followers: 2)
Genome Biology and Evolution     Open Access   (Followers: 12, SJR: 3.22, h-index: 39)
Geophysical J. Intl.     Hybrid Journal   (Followers: 34, SJR: 1.839, h-index: 119)
German History     Hybrid Journal   (Followers: 26, SJR: 0.437, h-index: 13)
GigaScience     Open Access   (Followers: 3)
Global Summitry     Hybrid Journal  
Glycobiology     Hybrid Journal   (Followers: 14, SJR: 1.692, h-index: 101)
Health and Social Work     Hybrid Journal   (Followers: 51, SJR: 0.505, h-index: 40)
Health Education Research     Hybrid Journal   (Followers: 13, SJR: 0.814, h-index: 80)
Health Policy and Planning     Hybrid Journal   (Followers: 21, SJR: 1.628, h-index: 66)
Health Promotion Intl.     Hybrid Journal   (Followers: 21, SJR: 0.664, h-index: 60)
History Workshop J.     Hybrid Journal   (Followers: 27, SJR: 0.313, h-index: 20)
Holocaust and Genocide Studies     Hybrid Journal   (Followers: 26, SJR: 0.115, h-index: 13)
Human Molecular Genetics     Hybrid Journal   (Followers: 9, SJR: 4.288, h-index: 233)
Human Reproduction     Hybrid Journal   (Followers: 79, SJR: 2.271, h-index: 179)
Human Reproduction Update     Hybrid Journal   (Followers: 17, SJR: 4.678, h-index: 128)
Human Rights Law Review     Hybrid Journal   (Followers: 61, SJR: 0.7, h-index: 21)
ICES J. of Marine Science: J. du Conseil     Hybrid Journal   (Followers: 54, SJR: 1.233, h-index: 88)
ICSID Review     Hybrid Journal   (Followers: 11)
ILAR J.     Hybrid Journal   (Followers: 1, SJR: 1.099, h-index: 51)
IMA J. of Applied Mathematics     Hybrid Journal   (SJR: 0.329, h-index: 26)
IMA J. of Management Mathematics     Hybrid Journal   (Followers: 1, SJR: 0.351, h-index: 20)
IMA J. of Mathematical Control and Information     Hybrid Journal   (Followers: 2, SJR: 0.661, h-index: 28)
IMA J. of Numerical Analysis - advance access     Hybrid Journal   (SJR: 2.032, h-index: 44)
Industrial and Corporate Change     Hybrid Journal   (Followers: 7, SJR: 1.37, h-index: 81)
Industrial Law J.     Hybrid Journal   (Followers: 32, SJR: 0.184, h-index: 15)
Information and Inference     Free  
Integrative and Comparative Biology     Hybrid Journal   (Followers: 8, SJR: 1.911, h-index: 90)
Interacting with Computers     Hybrid Journal   (Followers: 10, SJR: 0.529, h-index: 59)
Interactive CardioVascular and Thoracic Surgery     Hybrid Journal   (Followers: 5, SJR: 0.743, h-index: 35)
Intl. Affairs     Hybrid Journal   (Followers: 52, SJR: 1.264, h-index: 53)
Intl. Data Privacy Law     Hybrid Journal   (Followers: 30)
Intl. Health     Hybrid Journal   (Followers: 5, SJR: 0.835, h-index: 15)
Intl. Immunology     Hybrid Journal   (Followers: 3, SJR: 1.613, h-index: 111)
Intl. J. for Quality in Health Care     Hybrid Journal   (Followers: 34, SJR: 1.593, h-index: 69)
Intl. J. of Constitutional Law     Hybrid Journal   (Followers: 60, SJR: 0.613, h-index: 19)
Intl. J. of Epidemiology     Hybrid Journal   (Followers: 149, SJR: 4.381, h-index: 145)
Intl. J. of Law and Information Technology     Hybrid Journal   (Followers: 4, SJR: 0.247, h-index: 8)
Intl. J. of Law, Policy and the Family     Hybrid Journal   (Followers: 29, SJR: 0.307, h-index: 15)
Intl. J. of Lexicography     Hybrid Journal   (Followers: 8, SJR: 0.404, h-index: 18)
Intl. J. of Low-Carbon Technologies     Open Access   (Followers: 1, SJR: 0.457, h-index: 12)
Intl. J. of Neuropsychopharmacology     Open Access   (Followers: 3, SJR: 1.69, h-index: 79)
Intl. J. of Public Opinion Research     Hybrid Journal   (Followers: 9, SJR: 0.906, h-index: 33)
Intl. J. of Refugee Law     Hybrid Journal   (Followers: 34, SJR: 0.231, h-index: 21)
Intl. J. of Transitional Justice     Hybrid Journal   (Followers: 13, SJR: 0.833, h-index: 12)
Intl. Mathematics Research Notices     Hybrid Journal   (Followers: 1, SJR: 2.052, h-index: 42)
Intl. Political Sociology     Hybrid Journal   (Followers: 31, SJR: 1.339, h-index: 19)
Intl. Relations of the Asia-Pacific     Hybrid Journal   (Followers: 18, SJR: 0.539, h-index: 17)
Intl. Studies Perspectives     Hybrid Journal   (Followers: 7, SJR: 0.998, h-index: 28)
Intl. Studies Quarterly     Hybrid Journal   (Followers: 40, SJR: 2.184, h-index: 68)
Intl. Studies Review     Hybrid Journal   (Followers: 18, SJR: 0.783, h-index: 38)
ISLE: Interdisciplinary Studies in Literature and Environment     Hybrid Journal   (Followers: 1, SJR: 0.155, h-index: 4)
ITNOW     Hybrid Journal   (Followers: 2, SJR: 0.102, h-index: 4)
J. of African Economies     Hybrid Journal   (Followers: 15, SJR: 0.647, h-index: 30)
J. of American History     Hybrid Journal   (Followers: 44, SJR: 0.286, h-index: 34)
J. of Analytical Toxicology     Hybrid Journal   (Followers: 13, SJR: 1.038, h-index: 60)
J. of Antimicrobial Chemotherapy     Hybrid Journal   (Followers: 13, SJR: 2.157, h-index: 149)
J. of Antitrust Enforcement     Hybrid Journal   (Followers: 1)
J. of Applied Poultry Research     Hybrid Journal   (Followers: 3, SJR: 0.563, h-index: 43)
J. of Biochemistry     Hybrid Journal   (Followers: 42, SJR: 1.341, h-index: 96)
J. of Chromatographic Science     Hybrid Journal   (Followers: 17, SJR: 0.448, h-index: 42)
J. of Church and State     Hybrid Journal   (Followers: 11, SJR: 0.167, h-index: 11)
J. of Competition Law and Economics     Hybrid Journal   (Followers: 36, SJR: 0.442, h-index: 16)
J. of Complex Networks     Hybrid Journal   (Followers: 1, SJR: 1.165, h-index: 5)
J. of Conflict and Security Law     Hybrid Journal   (Followers: 13, SJR: 0.196, h-index: 15)
J. of Consumer Research     Full-text available via subscription   (Followers: 43, SJR: 4.896, h-index: 121)
J. of Crohn's and Colitis     Hybrid Journal   (Followers: 10, SJR: 1.543, h-index: 37)
J. of Cybersecurity     Hybrid Journal   (Followers: 3)
J. of Deaf Studies and Deaf Education     Hybrid Journal   (Followers: 9, SJR: 0.69, h-index: 36)
J. of Design History     Hybrid Journal   (Followers: 16, SJR: 0.166, h-index: 14)
J. of Economic Entomology     Full-text available via subscription   (Followers: 6, SJR: 0.894, h-index: 76)
J. of Economic Geography     Hybrid Journal   (Followers: 24, SJR: 2.909, h-index: 69)
J. of Environmental Law     Hybrid Journal   (Followers: 24, SJR: 0.457, h-index: 20)
J. of European Competition Law & Practice     Hybrid Journal   (Followers: 20)
J. of Experimental Botany     Hybrid Journal   (Followers: 14, SJR: 2.798, h-index: 163)
J. of Financial Econometrics     Hybrid Journal   (Followers: 22, SJR: 1.314, h-index: 27)
J. of Global Security Studies     Hybrid Journal   (Followers: 4)
J. of Heredity     Hybrid Journal   (Followers: 4, SJR: 1.024, h-index: 76)
J. of Hindu Studies     Hybrid Journal   (Followers: 7, SJR: 0.186, h-index: 3)
J. of Hip Preservation Surgery     Open Access  
J. of Human Rights Practice     Hybrid Journal   (Followers: 20, SJR: 0.399, h-index: 10)
J. of Infectious Diseases     Hybrid Journal   (Followers: 39, SJR: 4, h-index: 209)
J. of Insect Science     Open Access   (Followers: 9, SJR: 0.388, h-index: 31)


Database : The Journal of Biological Databases and Curation
  [SJR: 1.068]   [h-index: 24]   [11 followers]
  This is an Open Access journal
   ISSN (Online) 1758-0463
   Published by Oxford University Press  [370 journals]
  • Extension modules for storage, visualization and querying of genomic,
           genetic and breeding data in Tripal databases

    • Authors: Jung S; Lee T, Cheng C, et al.
      Abstract: Tripal is an open-source database platform primarily used for development of genomic, genetic and breeding databases. We report here on the release of the Chado Loader, Chado Data Display and Chado Search modules to extend the functionality of the core Tripal modules. These new extension modules provide additional tools for (1) data loading, (2) customized visualization and (3) advanced search functions for supported data types such as organism, marker, QTL/Mendelian Trait Loci, germplasm, map, project, phenotype, genotype and their respective metadata. The Chado Loader module provides data collection templates in Excel with defined metadata and data loaders with front end forms. The Chado Data Display module contains tools to visualize each data type and its metadata, which can be used as is or customized as desired. The Chado Search module provides search and download functionality for the supported data types. Tools to visualize map and species summaries are also included. The use of materialized views in the Chado Search module enables better performance as well as flexibility of data modeling in Chado, allowing existing Tripal databases with different metadata types to utilize the module. These Tripal Extension modules are implemented in the Genome Database for Rosaceae, CottonGen, the Citrus Genome Database, the Genome Database for Vaccinium and the Cool Season Food Legume Database.
      PubDate: Sat, 09 Dec 2017 00:00:00 GMT
  • PlantCircNet: a database for plant circRNA–miRNA–mRNA
           regulatory networks

    • Authors: Zhang P; Meng X, Chen H, et al.
      Abstract: Circular RNAs (circRNAs) are a novel type of endogenous noncoding RNA with covalently closed loop structures; they are widely expressed in various tissues and have functional implications in cellular processes. Acting as competing endogenous RNAs (ceRNAs), circRNAs are important regulators of miRNA activities. The identification of these circRNAs underlines the increasing complexity of ncRNA-mediated regulatory networks. However, more biological evidence is required to infer direct circRNA–miRNA associations, while little attention has been paid to circRNAs in plants as compared to the abundant research in mammals. PlantCircNet is presented as an integrated database that provides visualized plant circRNA–miRNA–mRNA regulatory networks containing identified circRNAs in eight model plants. The bioinformatics integration of data from multiple sources reveals circRNA–miRNA–mRNA regulatory networks and helps identify mechanisms underlying metabolic effects of circRNAs. An enrichment analysis tool was implemented to detect significantly overrepresented Gene Ontology categories of miRNA targets. The genomic annotations, sequences and isoforms of circRNAs were also investigated. PlantCircNet provides a user-friendly interface for querying detailed information on specific plant circRNAs. The database may serve as a resource to facilitate plant circRNA research. Several circRNAs were identified as playing potential regulatory roles in flower development and response to environmental stress, from regulatory networks related to miR156a and AT5G59720, respectively. The present research indicates that circRNAs could be involved in diverse biological processes.
      PubDate: Sat, 09 Dec 2017 00:00:00 GMT
  • Biomarker identification of hepatocellular carcinoma using a methodical
           literature mining strategy

    • Authors: Chang N; Dai H, Shih Y, et al.
      Abstract: Hepatocellular carcinoma (HCC), one of the most common causes of cancer-related deaths, carries a 5-year survival rate of 18%, underscoring the need for robust biomarkers. In spite of the increased availability of HCC-related literature, many of the promising biomarkers reported have not been validated for clinical use. To narrow down the wide range of possible biomarkers for further clinical validation, bioinformaticians need to sort them out using information provided in published works. Biomedical text mining is an automated way to obtain information of interest within the massive collection of biomedical knowledge, thus enabling extraction of data for biomarkers associated with certain diseases. This method can significantly reduce both the time and effort spent on studying important maladies such as liver diseases. Herein, we report a text mining-aided curation pipeline to identify potential biomarkers for liver cancer. The curation pipeline integrates PubMed E-Utilities to collect abstracts from PubMed and recognizes several types of named entities by machine learning-based and pattern-based methods. Genes/proteins from evidential sentences were classified as candidate biomarkers using a convolutional neural network. Lastly, extracted biomarkers were ranked depending on several criteria, such as the frequency of keywords and articles and the journal impact factor, and then integrated into a meaningful list for bioinformaticians. Based on the developed pipeline, we constructed MarkerHub, which contains 2128 candidate biomarkers extracted from PubMed publications from 2008 to 2017.
      PubDate: Fri, 08 Dec 2017 00:00:00 GMT
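The abstract above mentions collecting PubMed abstracts through the E-Utilities. As a minimal sketch of what the first step of such a collection might look like, the snippet below only builds an ESearch request URL; it makes no network call. The search term, the helper name `build_esearch_url` and the `retmax` value are illustrative assumptions, not details taken from the MarkerHub pipeline.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, mindate, maxdate, retmax=100):
    """Build an ESearch URL that would list PubMed IDs matching `term`.

    Hypothetical helper: the query wording and limits are assumptions
    chosen for illustration only.
    """
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # free-text query
        "datetype": "pdat",    # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,      # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Publication window taken from the abstract (2008 to 2017).
url = build_esearch_url("hepatocellular carcinoma biomarker", "2008", "2017")
print(url)
```

The returned IDs would then typically be passed to EFetch to retrieve the abstracts themselves; that step, and the entity recognition and ranking stages, are outside this sketch.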
  • Collaborative relation annotation and quality analysis in Markyt

    • Authors: Pérez-Pérez M; Pérez-Rodríguez G, Fdez-Riverola F, et al.
      Abstract: Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Markyt also supports the description of entity and relation annotation guidelines on a project basis, and it accommodates partial word tagging and overlapping annotations. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as a means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use.
      PubDate: Tue, 05 Dec 2017 00:00:00 GMT
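The abstract above refers to inter-annotator agreement without naming a metric. One widely used measure for two annotators is Cohen's kappa, sketched below purely as an illustration; it is an assumption that a statistic of this kind is relevant here, and the labels in the example are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: two annotators tagging four mentions.
annotator_1 = ["GENE", "GENE", "DRUG", "DRUG"]
annotator_2 = ["GENE", "DRUG", "DRUG", "DRUG"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5: observed agreement 0.75, chance agreement 0.5
```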
  • A semantic-based workflow for biomedical literature annotation

    • Authors: Sernadela P; Oliveira J.
      Abstract: Computational annotation of textual information has taken on an important role in knowledge extraction from the biomedical literature, since most of the relevant information from scientific findings is still maintained in text format. In this endeavour, annotation tools can assist in the identification of biomedical concepts and their relationships, providing faster reading and curation processes, with reduced costs. However, the separate usage of distinct annotation systems results in highly heterogeneous data, as it is difficult to efficiently combine and exchange this valuable asset. Moreover, despite the existence of several annotation formats, there is no unified way to integrate miscellaneous annotation outcomes into a reusable, sharable and searchable structure. Taking up this challenge, we present a modular architecture for textual information integration using semantic web features and services. The solution described allows the migration of curation data into a common model, providing a suitable transition process in which multiple annotation data can be integrated and enriched, with the possibility of being shared, compared and reused across semantic knowledge bases.
      PubDate: 2017-11-15
  • Update notifications for the BioCyc collection of databases

    • Authors: Paley S; Karp P.
      Abstract: We describe the BioCyc update notifications service, a new mechanism to keep researchers informed of the latest developments in their areas of interest. BioCyc combines databases for 9,300 sequenced organisms that integrate genome, metabolic pathway, and regulatory information with extensive bioinformatics tools. Users of the BioCyc website can register their specific areas of interest online by specifying a set of genes, pathways and/or Gene Ontology terms. Then, when significant new information becomes available in a BioCyc database in a user’s interest areas (usually due to curation), an email notification is sent to the user. The BioCyc ontology is leveraged to identify changes that are both relevant to a user’s specified interests and worthy of notification. Every effort is made to ensure that the resulting email text is both concise and informative, with links to relevant BioCyc pages.
      PubDate: 2017-11-14
  • eGenPub, a text mining system for extending computationally mapped
           bibliography for UniProt Knowledgebase by capturing centrality

    • Authors: Ding R; Boutet E, Lieberherr D, et al.
      Abstract: UniProt Knowledgebase (UniProtKB) is a publicly available database with access to a vast amount of protein sequence and functional information. To widen the scope of the publications associated with a protein entry, UniProt has introduced the computationally mapped additional bibliography section, which includes literature collected from external sources. In this article, we describe a text mining system, eGenPub, which selects articles that are ‘about’ specific proteins and allows automatic identification of additional bibliography for given UniProt protein entries. Focusing on plant proteins initially, eGenPub utilizes a gene normalization tool called pGenN, and a trained support vector machine model, which achieves a precision of 95.3%, to predict whether an article, based on its abstract, should be linked to a given UniProt entry. We have conducted a full-scale PubMed processing using eGenPub for eight common plant species. Altogether, 9025 articles are identified as relevant bibliography for 4752 UniProt entries, among which 5252 are additional papers not in the existing publication section. This newly computationally mapped additional bibliography from eGenPub is being integrated into the UniProt production pipeline and can be accessed via the UniProtKB protein entry publication view.
      PubDate: 2017-11-13
  • Database of resistance related metabolites in Wheat Fusarium head blight
           Disease (MWFD)

    • Authors: Surendra A; Cuperlovic-Culf M.
      Abstract: Fungal diseases are an increasing threat to worldwide food security. Fusarium head blight (FHB), primarily caused by Fusarium graminearum, is a devastating disease of Triticum aestivum (bread wheat). Partial resistance to FHB of several wheat and barley cultivars includes specific metabolic responses to inoculation. Investigation of metabolic changes in plants following pathogen infection provides valuable data for understanding the role of metabolites and metabolism in plant-pathogen interaction and resistance. Determination of functions of metabolites in resistance can also inspire the development of antifungals. Metabolic changes induced by FHB in resistant and susceptible plants have been previously investigated. However, the functionality of the majority of these investigated metabolites remains unknown. The ‘Metabolites in the Wheat Fusarium head blight Disease’ (MWFD) database was compiled in order to determine possible targets and roles of these molecules in resistance to FHB and aid in the development of related synthetic antifungals. The MWFD database allows for the quick retrieval of known resistance-related metabolites, associated target proteins and their sequence analogues in wheat and Fusarium genomes. The database can be searched for compounds, MeSH terms, as well as protein targets. This comprehensive, manually curated collection of resistance-related metabolites is publicly available.
      PubDate: 2017-11-06
  • Improving average ranking precision in user searches for biomedical
           research datasets

    • Authors: Teodoro D; Mottin L, Gobeill J, et al.
      Abstract: Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participants’ best submissions. Overall, it ranks in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system’s performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results.
      PubDate: 2017-11-06
  • YGMD: a repository for yeast cooperative transcription factor sets and
           their target gene modules

    • Authors: Wu W; Chen P, Chen T, et al.
      Abstract: By organizing the genome into gene modules (GMs), a living cell coordinates the activities of a set of genes to properly respond to environmental changes. The transcriptional regulation of the expression of a GM is usually carried out by a cooperative transcription factor set (CoopTFS) consisting of several cooperative transcription factors (TFs). Therefore, a database which provides CoopTFSs and their target GMs is useful for studying the cellular responses to internal or external stimuli. To address this need, we constructed YGMD (Yeast Gene Module Database) to provide 34120 CoopTFSs, each of which consists of two to five cooperative TFs, and their target GMs. The cooperativity between TFs in a CoopTFS is suggested by physical/genetic interaction evidence and/or predicted by existing algorithms. The target GM regulated by a CoopTFS is defined as the common target genes of all the TFs in that CoopTFS. The regulatory association between any TF in a CoopTFS and any gene in the target GM is supported by experimental evidence in the literature. In YGMD, users can (i) search for the GM regulated by a specific CoopTFS of interest or (ii) search for all possible CoopTFSs whose target GMs contain a specific gene of interest. The biological relevance of YGMD is shown by a case study which demonstrates that YGMD can provide a GM enriched with genes known to be regulated by the query CoopTFS (Cbf1-Met4-Met32). We believe that YGMD provides a valuable resource for yeast biologists to study the transcriptional regulation of GMs. Database URL:, or
      PubDate: 2017-11-06
  • FishTrace: a genetic catalogue of European fishes

    • Authors: Zanzi A; Martinsohn J.
      Abstract: FishTrace is a genetic catalogue for species identification associated with reference collections of taxonomically identified vouchers from more than 200 commercial marine fish species. The main purpose of the genetic catalogue is to enable reliable species identification for research purposes as well as in support of traceability schemes under the remit of food and feed laws. A major asset of FishTrace is that all genetic data are linked to biological collections of vouchers; that is, the fish specimens that have been identified genetically have also been identified by taxonomists and are stored and curated by natural history museums. This opens the potential for future applications related to fish species authenticity tests, also in a legal context, and associated biological research. The genetic catalogue, which contains molecular data together with detailed information on sampling and geographical origin, is publicly accessible on the web site of the project. Database URL
      PubDate: 2017-10-31
  • Proficiency of data interpretation: identification of signaling
           SNPs/specific loci for coronary artery disease

    • Authors: Cheema A; Rosenthal S, Ilyas Kamboh M.
      Abstract: Coronary artery disease (CAD) is a complex disorder involving both genetic and non-genetic factors. Genome-wide association studies (GWAS) have identified hundreds of single-nucleotide polymorphisms (SNPs) tagging > 40 CAD risk loci. We hypothesized that some non-coding variants might directly regulate gene expression rather than tagging a nearby locus. We used RegulomeDB to examine regulatory functions of 58 SNPs identified in two GWAS and those SNPs in linkage disequilibrium (LD) (r² ≥ 0.80) with the GWAS SNPs. Of the tested 1200 SNPs, 858 returned scores of 1–6 by RegulomeDB. Of these 858 SNPs, 97 were predicted to have regulatory functions with a RegulomeDB score of < 3. Notably, only 8 of the 97 predicted regulatory variants were genome-wide significant SNPs (LIPA/rs2246833, RegulomeDB score = 1b; ZC3HC1/rs11556924, CYP17A1-CNNM2-NT5C2/rs12413409, APOE-APOC1/rs2075650 and UBE2Z/rs46522, each with a RegulomeDB score = 1f; ZNF259-APOA5-APOA1/rs964184, SMG6/rs2281727 and COL4A1-COL4A2/rs4773144, each with a RegulomeDB score = 2b). The remaining 89 functional SNPs were in linkage disequilibrium with GWAS SNPs. This study supports the hypothesis that some of the non-coding variants are true signals via regulation of gene expression at the transcription level. Our study indicates that RegulomeDB is a useful database to examine the putative functions of a large number of genetic variants and it may help to identify a true association among multiple tagged SNPs in a complex disease, such as CAD. Database URLs;
      PubDate: 2017-10-31
  • DLREFD: a database providing associations of long non-coding RNAs,
           environmental factors and phenotypes

    • Authors: Sun Y; Zhang D, Ming Z, et al.
      Abstract: The development of many common complex diseases depends on the interactions between genetic factors (GF) and environmental factors (EF). Non-coding RNAs have been identified as major players in regulation of gene expression responding to environmental cues. In recent years, many studies have reported that the dysfunctions of long non-coding RNAs (lncRNAs), EFs and their interactions have strong effects on phenotypes. However, compared with protein coding genes and microRNAs, there is a paucity of bioinformatics resource platforms for understanding the disease mechanism at the level of lncRNA-EF interactions. In this study, we constructed the Disease Related LncRNA-EF Interaction Database (DLREFD), which contains a comprehensive collection and curation of experimentally supported interactions among lncRNAs, EFs and phenotypes. It integrated 835 entries, 475 lncRNAs, 153 EFs and 124 phenotypes. The names of lncRNAs, phenotypes, EFs, conditions of EFs, samples, species, evidence and references were further annotated. We hope DLREFD will be a useful resource for research on lncRNAs, EFs and diseases. Database URL:
      PubDate: 2017-10-25
  • Edaphostat: interactive ecological analysis of soil organism occurrences
           and preferences from the Edaphobase data warehouse

    • Authors: Hausen J; Scholz-Starke B, Burkhardt U, et al.
      Abstract: The Edaphostat web application allows interactive and dynamic analyses of soil organism data stored in the Edaphobase data warehouse. It is part of the Edaphobase web application and can be accessed by any modern browser. The tool combines data from different sources (publications, field studies and museum collections) and allows species preferences along various environmental gradients (i.e. C/N ratio and pH) and classification systems (habitat type and soil type) to be analyzed. Database URL: Edaphostat is part of the Edaphobase Web Application available at
      PubDate: 2017-10-24
  • EUCANEXT: an integrated database for the exploration of genomic and
           transcriptomic data from Eucalyptus species

    • Authors: Nascimento L; Salazar M, Lepikson-Neto J, et al.
      Abstract: Tree species of the genus Eucalyptus are the most valuable and widely planted hardwoods in the world. Given the economic importance of Eucalyptus trees, much effort has been made towards the generation of specimens with superior forestry properties that can deliver high-quality feedstocks, customized to the industry’s needs for both cellulosic (paper) and lignocellulosic biomass production. In line with these efforts, large sets of molecular data have been generated by several scientific groups, providing invaluable information that can be applied in the development of improved specimens. In order to fully explore the potential of available datasets, the development of a public database that provides integrated access to genomic and transcriptomic data from Eucalyptus is needed. EUCANEXT is a database that analyses and integrates publicly available Eucalyptus molecular data, such as the E. grandis genome assembly and predicted genes, ESTs from several species and digital gene expression from 26 RNA-Seq libraries. The database has been implemented on a Fedora Linux machine running MySQL and Apache, while Perl CGI was used for the web interfaces. EUCANEXT provides a user-friendly web interface for easy access and analysis of publicly available molecular data from Eucalyptus species. This integrated database allows for complex searches by gene name, keyword or sequence similarity and is publicly accessible at Through EUCANEXT, users can perform complex analyses to identify genes related to traits of interest using RNA-Seq libraries and tools for differential expression analysis. Moreover, the entire bioinformatics pipeline described here, including the database schema and Perl scripts, is readily available and can be applied to any genomic and transcriptomic project, regardless of the organism. Database URL:
      PubDate: 2017-10-24
  • Query expansion using MeSH terms for dataset retrieval: OHSU at the
           bioCADDIE 2016 dataset retrieval challenge

    • Authors: Wright T; Ball D, Hersh W.
      Abstract: Scientific data are being generated at an ever-increasing rate. The Biomedical and Healthcare Data Discovery Index Ecosystem (bioCADDIE) is an NIH-funded Data Discovery Index that aims to provide a platform for researchers to locate, retrieve, and share research datasets. The bioCADDIE 2016 Dataset Retrieval Challenge was held to identify the most effective dataset retrieval methods. We aimed to assess the value of Medical Subject Heading (MeSH) term-based query expansion to improve retrieval. Our system, based on the open-source search engine Elasticsearch, expands queries by identifying synonyms from the MeSH vocabulary and adding these to the original query. The number and relative weighting of MeSH terms are variable. The top 1000 search results for the 15 challenge queries were submitted for evaluation. After the challenge, we performed additional runs to determine the optimal number of MeSH terms and weighting. Our best overall score used five MeSH terms with a 1:5 terms:words weighting ratio, achieving an inferred normalized discounted cumulative gain (infNDCG) of 0.445, which was the third highest score among the 10 research groups who participated in the challenge. Further testing revealed our initial combination of MeSH terms and weighting yielded the best overall performance. Scores varied considerably between queries as well as with different variations of MeSH terms and weights. Query expansion using MeSH terms can enhance search relevance of biomedical datasets. High variability between queries and system variables suggests room for improvement and directions for further research. Database URL:
      PubDate: 2017-10-20
  • Improving taxonomic accuracy for fungi in public sequence databases:
           applying ‘one name one species’ in well-defined genera with
           Trichoderma/Hypocrea as a test case

    • Authors: Robbertse B; Strope P, Chaverri P, et al.
      Abstract: The ITS (nuclear ribosomal internal transcribed spacer) RefSeq database at the National Center for Biotechnology Information (NCBI) is dedicated to the clear association between name, specimen and sequence data. This database is focused on sequences obtained from type material stored in public collections. While the initial ITS sequence curation effort together with numerous fungal taxonomy experts attempted to cover as many orders as possible, we extended our latest focus to the family and genus ranks. We focused on Trichoderma for several reasons, mainly because the asexual and sexual synonyms were well documented, and a list of proposed names and type material was recently published. In this case study the recent taxonomic information was applied to do a complete taxonomic audit for the genus Trichoderma in the NCBI Taxonomy database. A name status report is available here: As a result, the ITS RefSeq Targeted Loci database at NCBI has been augmented with more sequences from type and verified material from Trichoderma species. Additionally, to aid in the cross referencing of data from single loci and genomes we have collected a list of quality records of the RPB2 gene obtained from type material in GenBank that could help validate future submissions. During the process of curation misidentified genomes were discovered, and sequence records from type material were found hidden under previous classifications. Source metadata curation, although more cumbersome, proved to be useful as confirmation of the type material designation. Database URL:
      PubDate: 2017-10-13
  • In-Cardiome: integrated knowledgebase for coronary artery disease enabling
           translational research

    • Authors: Sharma A; Deshpande V, Ghatge M, et al.
      Abstract: Coronary artery disease (CAD) is a leading cause of death worldwide. Prevention, diagnosis and clinical interventions are dependent on the conventional risk factors like hypertension, diabetes and obesity. However, these conventional risk factors do not completely identify high risk individuals. One major hurdle in the improvement of diagnosis and treatment for CAD is the lack of integration of knowledge from different areas of research like molecular, clinical and drug development. In order to provide comprehensive information from hitherto dispersed data, we developed an integrative knowledgebase called “In-Cardiome or Integrated Cardiome” for all the stakeholders in healthcare such as scientists, clinicians and pharmaceutical companies. It is created by integrating 16 different data sources, 995 curated genes classified into 12 different functional categories associated with disease, 1204 completed clinical trials, 12 therapy or drug classifications with 62 approved drugs and drug target networks. This knowledgebase gives the most needed opportunity to understand the disease process and therapeutic impact along with gene expression data from both animal models and patients. The data is classified into three different search categories: functional groups, risk factors and therapy/drug-based classes. One more unique aspect of In-Cardiome is the integration of clinical data on 10,217 subjects from our ongoing Indian Atherosclerosis Research Study (IARS) (6357 unaffected and 3860 CAD affected). IARS data showing demographics and associations of individual and combinations of risk factors in the Indian population along with molecular information will enable better translational and drug development research. Database
      PubDate: 2017-10-10
  • KiPho: malaria parasite kinome and phosphatome portal

    • Authors: Pandey R; Kumar P, Gupta D.
      Abstract: The Plasmodium kinases and phosphatases play an essential role in the regulation of substrate reversible-phosphorylation and overall cellular homeostasis. Reversible phosphorylation is one of the key post-translational modifications (PTMs) essential for parasite survival. Thus, complete and comprehensive information on malarial kinases and phosphatases as a single web resource will not only aid in systematic and better understanding of the PTMs, but also facilitate efforts to look for novel drug targets for malaria. In the current work, we have developed KiPho, a comprehensive and one-step web-based information resource for Plasmodium kinases and phosphatases. To develop KiPho, we have made use of search methods to retrieve, consolidate and integrate predicted as well as annotated information from several publicly available web repositories. Additionally, we have incorporated relevant and manually curated data, which will be updated from time to time with the availability of new information. The KiPho (Malaria Parasite Kinome-Phosphatome) resource is freely available at
      PubDate: 2017-10-10
  • An online analytical processing multi-dimensional data warehouse for
           malaria data

    • Authors: Arifin S; Madey G, Vyushkov A, et al.
      Abstract: Malaria is a vector-borne disease that contributes substantially to the global burden of morbidity and mortality. The management of malaria-related data from heterogeneous, autonomous, and distributed data sources poses unique challenges and requirements. Although online data storage systems exist that address specific malaria-related issues, a globally integrated online resource to address different aspects of the disease does not exist. In this article, we describe the design, implementation, and applications of a multi-dimensional, online analytical processing data warehouse, named the VecNet Data Warehouse (VecNet-DW). It is the first online, globally-integrated platform that provides efficient search, retrieval and visualization of historical, predictive, and static malaria-related data, organized in data marts. Historical and static data are modelled using star schemas, while predictive data are modelled using a snowflake schema. The major goals, characteristics, and components of the DW are described along with its data taxonomy and ontology, the external data storage systems and the logical modelling and physical design phases. Results are presented as screenshots of a Dimensional Data browser, a Lookup Tables browser, and a Results Viewer interface. The power of the DW emerges from integrated querying of the different data marts and structuring those queries to the desired dimensions, enabling users to search, view, analyse, and store large volumes of aggregated data, and responding better to the increasing demands of users. Database URL
      PubDate: 2017-10-07
  • Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset
           retrieval challenge

    • Authors: Roberts K; Gururaj AE, Chen X, et al.
      Abstract: The focus of the 2016 bioCADDIE Dataset Retrieval Challenge was the evaluation of information retrieval techniques for identifying relevant biomedical datasets. Participants were provided with a corpus of ∼795 thousand datasets from 20 biomedical data repositories and their retrieval systems were evaluated with 15 test queries. There were 10 participants in the Challenge, submitting a total of 45 runs. The top inferred normalized discounted cumulative gain score was 0.513, while the top precision at 10 score was 0.827. The systems utilized a range of retrieval approaches, from advanced query processing to learning-to-rank frameworks. The results of the task demonstrate the potential for advanced retrieval methods in finding relevant biomedical datasets.
      PubDate: 2017-09-26
  • First steps in automatic summarization of transcription factor properties
           for RegulonDB: classification of sentences about structural domains and
           regulated processes

    • Authors: Méndez-Cruz C; Gama-Castro S, Mejía-Almonte C, et al.
      Abstract: The RegulonDB team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality. Database URL: RegulonDB,
      PubDate: 2017-09-26
  • SFMetaDB: a comprehensive annotation of mouse RNA splicing factor RNA-Seq

    • Authors: Li J; Tseng C, Federico A, et al.
      Abstract: Although the number of RNA-Seq datasets deposited publicly has increased over the past few years, incomplete annotation of the associated metadata limits their potential use. Because of the importance of RNA splicing in diseases and biological processes, we constructed a database called SFMetaDB by curating datasets related to RNA splicing factors. Our effort focused on the RNA-Seq datasets in which splicing factors were knocked-down, knocked-out or over-expressed, leading to 75 datasets corresponding to 56 splicing factors. These datasets can be used in differential alternative splicing analysis for the identification of the potential targets of these splicing factors and other functional studies. Surprisingly, only ∼15% of all the splicing factors have been studied by loss- or gain-of-function experiments using RNA-Seq. In particular, splicing factors with domains from a few dominant Pfam domain families have not been studied. This suggests a significant gap that needs to be addressed to fully elucidate the splicing regulatory landscape. Indeed, there are already mouse models available for ∼20 of the unstudied splicing factors, and it can be a fruitful research direction to study these splicing factors in vitro and in vivo using RNA-Seq. Database URL:
      PubDate: 2017-09-19
  • AllerBase: a comprehensive allergen knowledgebase

    • Authors: Kadam K; Karbhal R, Jayaraman VK, et al.
      Abstract: Allergic diseases represent a major health concern worldwide due to a steady rise in their prevalence leading to increased disease burden. The field of allergy research has witnessed significant progress and the focus of studies has shifted to the molecular level. Vast amounts of data for allergens are archived in allergen databases, which cover diverse aspects of allergens and allergenicity with varying degrees of completeness. Users are required to refer to multiple databases including general purpose immunological databases to obtain relevant data of allergens. AllerBase, a relational database, has been developed with the objective of integrating protein allergens and related data from prevailing bioinformatics resources and published literature. AllerBase is a manually curated comprehensive knowledgebase of experimentally validated allergens where various attributes of allergenicity are made available on a single platform. Links to sequences, structures and immunological data are provided, where available, for all allergens. Data for specific features such as IgE-binding epitopes, IgE cross-reactivity and IgE antibodies are curated and archived. Bibliographic data reporting assays used for experimental characterization of allergens are compiled and processed using a text mining approach. AllerBase thus provides enhanced coverage of data with high granularity. The database can be browsed by category of allergens, epitopes or antibodies. It can also be queried using various attributes such as allergen name, isoallergens/variants, taxonomic level, source organism, sequences, structures, epitopes, antibodies and cross-reactive allergens. AllerBase also provides an interface for sequence-based analyses and visualization of structure/epitope which can be used as a framework for translational research. A Completeness Index has been devised to indicate the availability of data on nine attributes for every allergen. This index will also serve as a pointer to identify allergen-specific areas of further research. AllerBase can be used to understand various aspects of allergy and allergenicity at the molecular level and to design anti-allergy therapeutics. Database URL:
      PubDate: 2017-09-12
  • Echinobase: an expanding resource for echinoderm genomic information

    • Authors: Kudtarkar P; Cameron R.
      Abstract: Echinobase, a web accessible information system of diverse genomics and biological data for the echinoderm clade, grew out of SpBase, the first echinoderm genome project for sea urchin, Strongylocentrotus purpuratus. Sea urchins and their relatives are utilitarian research models in fields ranging from marine biology to developmental biology and gene regulatory systems. Echinobase is a user-friendly web interface that links an array of biological data that would otherwise have been tedious and frustrating for researchers to extract and organize. The system hosts a powerful gene search engine, genomics browser and other bioinformatics tools to investigate genomics and high throughput data. The Echinobase information system now serves genomic information for eight echinoderm species: S. purpuratus, Strongylocentrotus franciscanus, Allocentrotus fragilis, Lytechinus variegatus, Patiria miniata, Parastichopus parvimensis, Ophiothrix spiculata and Eucidaris tribuloides. Herein lies a description of the web information system, genomics data types and content hosted by The goal of Echinobase is to connect genomic information to various experimental data and accelerate research in the fields of molecular biology, developmental processes, gene regulatory networks and, more recently, engineering biological systems. Database URL:
      PubDate: 2017-09-12
  • IsoPlot: a database for comparison of mRNA isoforms in fruit fly and

    • Authors: Ng I; Huang J, Tsai S, et al.
      Abstract: Alternative splicing (AS), a mechanism by which different forms of mature messenger RNAs (mRNAs) are generated from the same gene, occurs widely in metazoan genomes. Knowledge about isoform variants and abundance is crucial for understanding the functional context in the molecular diversity of the species. With increasing transcriptome data of model and non-model species, a database for visualization and comparison of AS events with up-to-date information is needed for further research. IsoPlot is a publicly available database with visualization tools for exploration of AS events, including three major species of mosquitoes, Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus, and fruit fly Drosophila melanogaster, the model insect species. IsoPlot includes not only 88,663 annotated transcripts but also 17,037 newly predicted transcripts from massive transcriptome data at different developmental stages of mosquitoes. The web interface enables users to explore the patterns and abundance of isoforms in different experimental conditions as well as cross-species sequence comparison of orthologous transcripts. IsoPlot provides a platform for researchers to access comprehensive information about AS events in mosquitoes and fruit fly. Our database is available on the web via an interactive user interface with an intuitive graphical design, which is applicable for the comparison of complex isoforms within or between species. Database URL:
      PubDate: 2017-09-12
  • Multi-field query expansion is effective for biomedical dataset retrieval

    • Authors: Bouadjenek M; Verspoor K.
      Abstract: In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical data sets with heterogeneous schemas through query reformulation. In particular, the method proposed transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
      PubDate: 2017-09-07
  • RampDB: a web application and database for the exploration and prediction
           of receptor activity modifying protein interactions

    • Authors: Topaz N; Mojib N, Chande AT, et al.
      Abstract: Receptor Activity Modifying Proteins (RAMPs) serve as accessory proteins that modulate the signaling activities of G-Protein Coupled Receptors (GPCRs). RAMPs function by interacting with the N-termini and transmembrane domains of GPCRs, and the receptor phenotypes of the resulting complexes are determined by the specific isoform of the interacting RAMPs. RAMPs were discovered in 1998, and since that time the number of known RAMP–GPCR interactions has steadily increased; RAMPs are now known to interact with nearly every member of the class ‘B’, Secretin receptor family of peptide-binding GPCRs as well as some members of the class ‘A’ and ‘C’ peptide-binding GPCRs. Given the steadily increasing number of known RAMP–GPCR interactions, phenotypes and functions, there is a pressing need for a central resource dedicated to their storage, prediction and dissemination. We have developed a web application and database, RampDB, with the goal of addressing this need. RampDB consists of a custom RAMP–GPCR–ligand database integrated with a search utility, which together facilitate the exploration and analysis of RAMP interactions. The RampDB search utility allows users to explore known RAMP interactions, or to predict novel interactions, via either protein sequence (bioinformatic) or ligand (chemoinformatic) queries. The underlying architecture of RampDB was designed using best database practices in order to enable rapid retrieval of search results, automated updates and the seamless incorporation of additional features. Database URL:
      PubDate: 2017-09-06
  • Impact of translation on named-entity recognition in radiology texts

    • Authors: Campos L; Pedro V, Couto F.
      Abstract: Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar the terms extracted using other translations were to this standard. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL:
      PubDate: 2017-08-28
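The comparison against a gold standard described in the abstract above can be illustrated with a small sketch. It assumes the extracted terms are compared as plain sets, which the abstract does not confirm; the function name `f_score` and the example terms are hypothetical and are not taken from MRRAD or RadLex.

```python
def f_score(gold_terms, candidate_terms):
    """Return (precision, recall, F1) of candidate terms against gold terms.

    Assumption: terms are compared as exact-match sets; the paper's own
    matching procedure may differ.
    """
    gold = set(gold_terms)
    candidate = set(candidate_terms)
    true_positives = len(gold & candidate)
    precision = true_positives / len(candidate) if candidate else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical term sets, invented for illustration only.
gold = {"lung", "nodule", "pleural effusion", "thorax"}
machine = {"lung", "nodule", "thorax", "chest"}

precision, recall, f1 = f_score(gold, machine)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here three of the four machine-side terms also appear in the gold set, so precision, recall and F1 all come out at 0.75.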
  • Elsevier’s approach to the bioCADDIE 2016 Dataset Retrieval

    • Authors: Scerri A; Kuriakose J, Deshmane A, et al.
      Abstract: We developed a two-stream, Apache Solr-based information retrieval system in response to the bioCADDIE 2016 Dataset Retrieval Challenge. One stream was based on the principle of word embeddings, the other was rooted in ontology-based indexing. Despite encountering several issues in the data, the evaluation procedure and the technologies used, the system performed quite well. We provide some pointers towards future work: in particular, we suggest that more work in query expansion could benefit future biomedical search engines.
      PubDate: 2017-08-21
  • A publicly available benchmark for biomedical dataset retrieval: the
           reference standard for the 2016 bioCADDIE dataset retrieval challenge

    • Authors: Cohen T; Roberts K, Gururaj AE, et al.
      Abstract: The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval.Database URL:
      PubDate: 2017-08-18
  • A RESTful application programming interface for the PubMLST molecular
           typing and genome databases

    • Authors: Jolley KA; Bray JE, Maiden MJ.
      Abstract: Molecular typing is used to differentiate microorganisms at the subspecies or strain level for epidemiological investigations, infection control, public health and environmental sampling. DNA sequence-based typing methods require authoritative databases that link sequence variants to nomenclature in order to facilitate communication and comparison of identified types in national or global settings. The PubMLST website fulfils this role for over a hundred microorganisms for which it hosts curated molecular sequence typing data, providing sequence and allelic profile definitions for multi-locus sequence typing (MLST) and single-gene typing approaches. In recent years, these have expanded to cover the whole genome with schemes such as core genome MLST (cgMLST) and whole genome MLST (wgMLST) which catalogue the allelic diversity found in hundreds to thousands of genes. These approaches provide a common nomenclature for high-resolution strain characterization and comparison. Molecular typing information is linked to isolate provenance, phenotype, and increasingly genome assemblies, providing a resource for outbreak investigation and research into population structure, gene association, global epidemiology and vaccine coverage. A Representational State Transfer (REST) Application Programming Interface (API) has been developed for the PubMLST website to make these large quantities of structured molecular typing and whole genome sequence data available for programmatic access by any third party application. The API is an integral component of the Bacterial Isolate Genome Sequence Database (BIGSdb) platform that is used to host PubMLST resources, and exposes all public data within the site. In addition to data browsing, searching and download, the API supports authentication and submission of new data to curator queues.
      PubDate: 2017-08-09
  • BioSearch: a semantic search engine for Bio2RDF

    • Authors: Hu W; Qiu H, Huang J, et al.
      Abstract: Biomedical data are growing at an incredible pace and require substantial expertise to organize data in a manner that makes them easily findable, accessible, interoperable and reusable. Massive effort has been devoted to using Semantic Web standards and technologies to create a network of Linked Data for the life sciences, among others. However, while these data are accessible through programmatic means, effective user interfaces for non-experts to SPARQL endpoints are few and far between. Contributing to user frustrations is that data are not necessarily described using common vocabularies, thereby making it difficult to aggregate results, especially when distributed across multiple SPARQL endpoints. We propose BioSearch — a semantic search engine that uses ontologies to enhance federated query construction and organize search results. BioSearch also features a simplified query interface that allows users to optionally filter their keywords according to classes, properties and datasets. User evaluation demonstrated that BioSearch is more effective and usable than two state of the art search and browsing solutions.Database URL:
      PubDate: 2017-08-08
  • CTD 2 Dashboard: a searchable web interface to connect validated results
           from the Cancer Target Discovery and Development Network

    • Authors: Aksoy B; Dančík V, Smith K, et al.
      Abstract: The Cancer Target Discovery and Development (CTD2) Network aims to use functional genomics to accelerate the translation of high-throughput and high-content genomic and small-molecule data towards use in precision oncology. As part of this goal, and to share its conclusions with the research community, the Network developed the ‘CTD2 Dashboard’, which compiles CTD2 Network-generated conclusions, termed ‘observations’, associated with experimental entities, collected by its member groups (‘Centers’). Any researcher interested in learning about a given gene, protein, or compound (a ‘subject’) studied by the Network can come to the CTD2 Dashboard to quickly and easily find, review, and understand Network-generated experimental results. In particular, the Dashboard allows visitors to connect experiments about the same target, biomarker, etc., carried out by multiple Centers in the Network. The Dashboard’s unique knowledge representation allows information to be compiled around a subject, so as to become greater than the sum of the individual contributions. The CTD2 Network has broadly defined levels of validation for evidence (‘Tiers’) pertaining to a particular finding, and the CTD2 Dashboard uses these Tiers to indicate the extent to which results have been validated. Researchers can use the Network’s insights and tools to develop a new hypothesis or confirm existing hypotheses, in turn advancing the findings towards clinical applications.
      PubDate: 2017-08-08
  • Cancer Odor Database (COD): a critical databank for cancer diagnosis

    • Authors: Janfaza S; Banan Nojavani M, Khorsand B, et al.
      Abstract: Here, we present the Cancer Odor Database (COD), a web-based database comprising comprehensive information on volatile organic metabolites of cancer (VOMCs), known as cancer odor, which gives a structured overview of VOMCs that are of critical importance in cancer research. The database contains more than 1300 records with 19 critical features for each record, such as structural and chemical properties (e.g. boiling point, molecular formula and molecular weight) of 450 different VOMCs and their origins, which can be used effectively to identify correlations between VOMCs and various types of cancer. The COD database was constructed from data extracted directly from the literature. COD information can be helpful for cancer research, especially for those who are developing sensors and electronic nose systems for cancer detection. COD is freely available online for non-commercial purposes.
      PubDate: 2017-08-03
  • RegulatorDB: a resource for the analysis of yeast transcriptional

    • Authors: Choi JA; Wyrick JJ.
      Abstract: Mutant expression profiles have been published for nearly all the nonessential regulators in yeast, yet there is a need for improved analysis and visualization tools to analyze these data and integrate it with complementary protein-DNA binding data. The RegulatorDB database contains mutant expression profiles and DNA binding data for more than 900 and 250 yeast regulators, respectively. RegulatorDB provides web-based tools to visualize the effects of each mutant regulator on the expression of individual genes or user-selected gene sets, and identify regulators whose targets are enriched in user-selected gene sets. The database can be queried to search for targets of single or multiple regulators. Regulatory networks can be constructed and visualized that include multiple classes of regulators and multiple regulatory layers, including regulator DNA binding data. In summary, RegulatorDB is a powerful resource for the study of yeast gene regulation, from the level of individual genes up to genome-scale networks.Database URL:
      PubDate: 2017-08-03
  • MiRIAD update: using alternative polyadenylation, protein interaction
           network analysis and additional species to enhance exploration of the role
           of intragenic miRNAs and their host genes

    • Authors: Hinske LC; dos Santos FC, Ohara DT, et al.
      Abstract: MicroRNAs have established their role as potent regulators of the epigenome. Interestingly, most miRNAs are located within protein-coding genes with functional consequences that have yet to be fully investigated. MiRIAD is a database with an interactive and user-friendly online interface that has been facilitating research on intragenic miRNAs. In this article, we present a major update. First, data for five additional species (chimpanzee, rat, dog, cow and frog) were added to support the exploration of evolutionary aspects of the relationship between host genes and intragenic miRNAs. Moreover, we integrated data from two different sources to generate a comprehensive alternative polyadenylation dataset. The miRIAD interface was therefore redesigned and provides a completely new gene model representation, including an interactive visualization of the 3′ untranslated region (UTR) with alternative polyadenylation sites, corresponding signals and potential miRNA binding sites. Furthermore, we expanded on functional host gene network analysis. Although the previous version solely reported protein interactions, the update features a separate network analysis view that can either be accessed through the submission of a list of genes of interest or directly from a gene’s list of protein interactions. In addition to statistical properties of the submitted gene set, the interaction network graph is presented and miRNAs with seed site over- and underrepresentation are identified. In summary, the update of miRIAD provides novel datasets and bioinformatics resources with a significant increase in functionality to facilitate intragenic miRNA research in a user-friendly and interactive way.Database URL:
      PubDate: 2017-08-01
  • New extension software modules to enhance searching and display of
           transcriptome data in Tripal databases

    • Authors: Chen M; Henry N, Almsaeed A, et al.
      Abstract: Tripal is an open source software package for developing biological databases with a focus on genetic and genomic data. It consists of a set of core modules that deliver essential functions for loading and displaying data records and associated attributes including organisms, sequence features and genetic markers. Beyond the core modules, community members are encouraged to contribute extension modules to build on the Tripal core and to customize Tripal for individual community needs. To expand the utility of the Tripal software system, particularly for RNASeq data, we developed two new extension modules. Tripal Elasticsearch enables fast, scalable searching of the entire content of a Tripal site as well as the construction of customized advanced searches of specific data types. We demonstrate the use of this module for searching assembled transcripts by functional annotation. A second module, Tripal Analysis Expression, houses and displays records from gene expression assays such as RNA sequencing. This includes biological source materials (biomaterials), gene expression values and protocols used to generate the data. In the case of an RNASeq experiment, this would reflect the individual organisms and tissues used to produce sequencing libraries, the normalized gene expression values derived from the RNASeq data analysis and a description of the software or code used to generate the expression values. The module will load data from common flat file formats including standard NCBI Biosample XML. Data loading, display options and other configurations can be controlled by authorized users in the Drupal administrative backend. Both modules are open source, include usage documentation, and can be found in the Tripal organization’s GitHub repository.
      PubDate: 2017-07-27
  • The Grass Carp Genome Database (GCGD): an online platform for genome
           features and annotations

    • Authors: Chen Y; Shi M, Zhang W, et al.
      Abstract: As one of the four major Chinese carps of important economic value, the grass carp (Ctenopharyngodon idellus) has attracted increasing attention from the scientific community. Recently, the draft genome has been released as a milestone in research of grass carp. In order to facilitate the utilization of these genome data, we developed the grass carp genome database (GCGD). GCGD provides visual presentation of the grass carp genome along with annotations and amino acid sequences of predicted protein-coding genes. Other related genetic and genomic data available in this database include the genetic linkage maps, microsatellite genetic markers (i.e. Short Sequence Repeats, SSRs), and some selected transcriptomic datasets. A series of tools have been integrated into GCGD for visualization, analysis and retrieval of data, e.g. JBrowse for navigation of genome annotations, BLAST for sequence alignment, EC2KEGG for comparison of metabolic pathways, IDConvert for conversion of terms across databases and ReadContigs for extraction of sequences from the grass carp genome.Database URL:
      PubDate: 2017-07-27
  • NRDTD: a database for clinically or experimentally supported non-coding
           RNAs and drug targets associations

    • Authors: Chen X; Sun Y, Zhang D, et al.
      Abstract: In recent years, more and more non-coding RNAs (ncRNAs) have been identified and increasing evidence has shown that ncRNAs may affect gene expression and disease progression, making them a new class of targets for drug discovery. It thus becomes important to understand the relationship between ncRNAs and drug targets. For this purpose, a database of associations between ncRNAs and drug targets would be extremely beneficial. Here, we developed the ncRNA Drug Targets Database (NRDTD), which collects 165 entries of clinically or experimentally supported ncRNAs as drug targets, including 97 ncRNAs and 96 drugs. Moreover, we annotated ncRNA-drug target associations with drug information from KEGG, PubChem, DrugBank, CTD or Wikipedia, GenBank sequence links, OMIM disease ID, pathway and function annotation for ncRNAs, detailed description of associations between ncRNAs and diseases from HMDD or LncRNADisease and the publication PubMed ID. Additionally, we provided users a link to submit novel disease-ncRNA-drug associations and corresponding supporting evidence into the database. We hope NRDTD will be a useful resource for investigating the roles of ncRNAs in drug target identification, drug discovery and disease treatment.
      PubDate: 2017-07-21
  • RetrogeneDB–a database of plant and animal retrocopies

    • Authors: Rosikiewicz W; Kabza M, Kosiński JG, et al.
      Abstract: For a long time, retrocopies were considered ‘junk DNA’, but numerous studies have shown that retrocopies may gain functionality and become so-called retrogenes. Retrogenes may code fully functional proteins that coexist with parental gene products or may even replace them. Retrocopies may also function as regulatory RNAs and, for example, become a source of small interfering RNAs, act as trans natural antisense transcripts or as alternative targets for miRNAs. Numerous researchers have emphasized that retrogenes play a crucial role in various organisms’ developmental stages and diseases. Despite the ever-growing evidence of the importance of retrocopies, resources dedicated to retroposition are very limited. Here, we report an update of the RetrogeneDB, which, to the best of our knowledge, is the largest database dedicated to retrocopies. It provides annotations of 86 458 retrocopies in 62 animal and 37 plant species. The database contains information about the retrocopies’ localization, open reading frame conservation, expression, RNA Polymerase II activity and the alternative transcription start site studies. Orthologous relationships between retrogenes were also determined, which made retrocopy conservation studies much more valuable. Additionally, based on the RNA-Seq data from the Geuvadis project, the expression levels of retrocopies were estimated in a total of 50 individuals from 5 human populations. The information is now presented in a new, more user-friendly web interface, with easy access to the source data, which may be used for the downstream analysis. RetrogeneDB is freely available.
      PubDate: 2017-07-14
  • Benchmarking distributed data warehouse solutions for storing genomic
           variant information

    • Authors: Wiewiórka MS; Wysakowicz DP, Okoniewski MJ, et al.
      Abstract: Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, the application of large genomic variant databases to this problem has not been sufficiently explored so far in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototypic genomic variant data warehouse, populated with large generated content of genomic variants and phenotypic data. Next, we have benchmarked performance of a number of combinations of distributed storages and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed, analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, improve query performance by several orders of magnitude. Most of the distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto and Parquet with Spark 2 query engines provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees sub-second performance for simple genome range queries returning a small subset of data, where low-latency response is expected, while still offering decent performance for running analytical queries. In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions.
      PubDate: 2017-07-11
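The pre-aggregation idea mentioned in the abstract above can be sketched on a toy scale. This example uses Python's built-in `sqlite3` module on a single machine purely for illustration; it is not one of the distributed engines benchmarked in the paper, and the table names, columns and rows are hypothetical.

```python
import sqlite3

# In-memory toy warehouse; schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, phenotype TEXT);
CREATE TABLE variants (sample_id INTEGER, chrom TEXT, pos INTEGER, alt TEXT);
INSERT INTO samples VALUES (1, 'case'), (2, 'case'), (3, 'control');
INSERT INTO variants VALUES
    (1, 'chr1', 100, 'A'), (2, 'chr1', 100, 'A'),
    (3, 'chr1', 100, 'A'), (1, 'chr1', 250, 'T');
""")

# Normalized form: every analytical query pays for the join.
JOIN_QUERY = """
SELECT v.chrom, v.pos, v.alt, s.phenotype, COUNT(*) AS carriers
FROM variants v JOIN samples s ON s.sample_id = v.sample_id
GROUP BY v.chrom, v.pos, v.alt, s.phenotype
ORDER BY v.chrom, v.pos, v.alt, s.phenotype
"""

# Pre-aggregated, denormalized form: run the join once and store the result.
conn.execute("CREATE TABLE variant_counts AS " + JOIN_QUERY)


def carriers_in_range(chrom, start, end):
    """Genome range query answered from the pre-aggregated table, no join."""
    return conn.execute(
        "SELECT pos, alt, phenotype, carriers FROM variant_counts "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pos, alt, phenotype",
        (chrom, start, end),
    ).fetchall()


rows = carriers_in_range("chr1", 50, 150)
print(rows)
```

The range query reads only `variant_counts`, so the join cost is paid once at load time rather than on every query. In a distributed setting that join would also involve moving data between nodes, which is the cost the abstract says pre-aggregation reduces.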
  • Improved annotation of the insect vector of citrus greening disease:
           biocuration by a diverse genomics community

    • Authors: Saha S; Hosmani PS, Villalobos-Ayala K, et al.
      Abstract: The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models.Database URL:
      PubDate: 2017-06-30
  • MGFD: the maize gene families database

    • Authors: Sheng L; Jiang H, Yan H, et al.
      Abstract: Due to circumstances beyond the journal's control, the database for the above article is no longer accessible. Apologies for the inconvenience.
      PubDate: 2017-06-26
  • Text mining and expert curation to develop a database on psychiatric
           diseases and their genes

    • Authors: Gutiérrez-Sacristán A; Bravo À, Portero-Tresserra M, et al.
      Abstract: Psychiatric disorders constitute one of the main causes of disability worldwide. During the past years, considerable research has been conducted on the genetic architecture of such diseases, although little understanding of their etiology has been achieved. The difficulty of accessing up-to-date, relevant genotype-phenotype information has hampered the application of this wealth of knowledge to translational research and clinical practice in order to improve diagnosis and treatment of psychiatric patients. PsyGeNET has been developed with the aim of supporting research on the genetic architecture of psychiatric diseases, by providing integrated and structured accessibility to their genotype–phenotype association data, together with analysis and visualization tools. In this article, we describe the protocol developed for the sustainable update of this knowledge resource. It includes the recruitment of a team of domain experts in order to perform the curation of the data extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators’ tasks. A curation workflow was designed including a pilot phase and two rounds of curation and analysis phases. Negative evidence from the literature on gene–disease associations (GDAs) was taken into account in the curation process. We report the results of the application of this workflow to the curation of GDAs for PsyGeNET, including the analysis of the inter-annotator agreement and suggest this model as a suitable approach for the sustainable development and update of knowledge resources. Database URL: http://www.psygenet.org
      PubDate: 2017-06-26
  • LeishDB: a database of coding gene annotation and non-coding RNAs in
           Leishmania braziliensis

    • Authors: Torres F; Arias-Carrasco R, Caris-Maldonado JC, et al.
      Abstract: Leishmania braziliensis is the etiological agent of cutaneous leishmaniasis, a disease with high public health importance, affecting 12 million people worldwide. Although its genome sequence was originally published in 2007, the two public reference annotations still classify at least 80% of the genes simply as hypothetical or putative proteins. Furthermore, the absence of non-coding RNA (ncRNA) sequences from Leishmania species in public databases is notable. These poorly annotated coding genes and ncRNAs could be important players for the understanding of this protozoan's biology, the mechanisms behind host-parasite interactions and disease control. Herein, we performed a new prediction and annotation of L. braziliensis protein-coding genes and non-coding RNAs, using recently developed predictive algorithms and updated databases. In summary, we identified 11 491 ORFs, with 5263 (45.80%) of them associated with proteins available in public databases. Moreover, we identified for the first time the repertoire of 11 243 ncRNAs belonging to different classes distributed along the genome. The accuracy of our predictions was verified by transcriptional evidence using RNA-seq, confirming that they are actually generating real transcripts. These data were organized in a public repository named LeishDB, which represents an improvement on the publicly available data related to genomic annotation for L. braziliensis. This updated information can be useful for future genomics, transcriptomics and metabolomics studies, serving as an additional tool for genome annotation pipelines and novel studies associated with the understanding of this protozoan's genome complexity, organization and biology, and with the development of innovative methodologies for disease control and diagnostics.
      PubDate: 2017-06-13
  • PigVar: a database of pig variations and positive selection signatures

    • Authors: Zhou Z; Li A, Otecko N, et al.
      Abstract: Pigs are excellent large-animal models for medical research and a promising organ donor source for transplant patients. Next-generation sequencing technology has yielded a dramatic increase in the volume of genomic data for pigs. However, the limited amount of variation data provided by dbSNP, and non-congruent criteria used for calling variation, present considerable hindrances to the utility of this data. We used a uniform pipeline, based on GATK, to identify non-redundant, high-quality, whole-genome SNPs from 280 pigs and 6 outgroup species. A total of 64.6 million SNPs were identified in 280 pigs and 36.8 million in the outgroups. We then used LUMPY to identify a total of 7 236 813 structural variations (SVs) in 211 pigs. Positively selected loci were identified through five statistical tests of different evolutionary attributes of the SNPs. Combining the non-redundant variations and the evolutionary selective scores, we built the first pig-specific variation database, PigVar, which is a web-based open-access resource. PigVar collects parameters of the variations including summary lists of the locations of the variations within protein-coding and long intergenic non-coding RNA (lincRNA) genes, whether the SNPs are synonymous or non-synonymous, their ancestral and derived states, geographic sampling locations, as well as breed information. The PigVar database will be kept operational and updated to facilitate medical research using the pig as model and agricultural research including pig breeding.
      PubDate: 2017-06-13
  • In silico characterization of tandem repeats in Trichophyton rubrum and
           related dermatophytes provides new insights into their role in

    • Authors: Franco M; Bitencourt T, Marins M, et al.
      Abstract: Trichophyton rubrum is the most common etiological agent of dermatophytoses worldwide, which is able to degrade keratinized tissues. The sequencing of the genome of different dermatophyte species has provided a large amount of data, including tandem repeats that may play a role in genetic variability and in the pathogenesis of these fungi. Tandem repeats are adjacent DNA sequences of 2–200 nucleotides in length, which exert regulatory and adaptive functions. These repetitive DNA sequences are found in different classes of fungal proteins, especially those involved in cell adhesion, a determinant factor for the establishment of fungal infection. The objective of this study was to develop a Dermatophyte Tandem Repeat Database (DTRDB) for the storage and identification of tandem repeats in T. rubrum and six other dermatophyte species. The current version of the database contains 35 577 tandem repeats detected in 16 173 coding sequences. The repeats can be searched using entry parameters such as repeat unit length (nt—nucleotide), repeat number, variability score, and repeat sequence motif. These data were used to study the relative frequency and distribution of repeats in the sequences, as well as their possible functions in dermatophytes. A search of the database revealed that these repeats occur in 22–33% of genes transcribed in dermatophytes where they could be involved in the success of adaptation to the host tissue and establishment of infection. The repeats were detected in transcripts that are mainly related to three biological processes: regulation, adhesion, and metabolism. The database developed enables users to identify and analyse tandem repeat regions in target genes related to pathogenicity and fungal–host interactions in dermatophytes and may contribute to the discovery of new targets for the development of antifungal agents.Database URL:
      PubDate: 2017-06-11
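The kind of search the abstract above describes (by repeat unit length and repeat number) can be illustrated with a minimal sketch. It detects only perfect tandem repeats using a regular expression with a backreference; this is an assumption made for illustration and is not the detection method used to build DTRDB. The function name, default parameters and example sequences are hypothetical.

```python
import re


def find_tandem_repeats(sequence, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in a DNA string.

    Assumption: only exact, uninterrupted repeats are reported; real tools
    usually also tolerate mismatches and indels between copies.
    """
    # Group 1 captures the shortest candidate unit (lazy quantifier);
    # the backreference then requires at least min_copies - 1 further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Invented example sequences, not taken from any dermatophyte genome.
print(find_tandem_repeats("GGATATATATCC"))
print(find_tandem_repeats("ACGACGACGTT"))
```

The first call reports the dinucleotide unit `AT` repeated four times starting at index 2; the second reports the trinucleotide unit `ACG` repeated three times starting at index 0.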
  • decodeRNA— predicting non-coding RNA functions using

    • Authors: Lefever S; Anckaert J, Volders P, et al.
      Abstract: Although the long non-coding RNA (lncRNA) landscape is expanding rapidly, only a small number of lncRNAs have been functionally annotated. Here, we present decodeRNA, a database providing functional contexts for both human lncRNAs and microRNAs in 29 cancer and 12 normal tissue types. With state-of-the-art data mining and visualization options, easy access to results and a straightforward user interface, decodeRNA aims to be a powerful tool for researchers in the ncRNA field.
      PubDate: 2017-06-11
  • MGIS: managing banana (Musa spp.) genetic resources information and
           high-throughput genotyping data

    • Authors: Ruas M; Guignon VV, Sempere GG, et al.
      Abstract: Unraveling the genetic diversity held in genebanks on a large scale is underway, due to advances in Next-generation sequence (NGS) based technologies that produce high-density genetic markers for a large number of samples at low cost. Genebank users should be in a position to identify and select germplasm from the global genepool based on a combination of passport, genotypic and phenotypic data. To facilitate this, a new generation of information systems is being designed to efficiently handle data and link it with other external resources such as genome or breeding databases. The Musa Germplasm Information System (MGIS), the database for global ex situ-held banana genetic resources, has been developed to address those needs in a user-friendly way. In developing MGIS, we selected a generic database schema (Chado), the robust content management system Drupal for the user interface, and Tripal, a set of Drupal modules which links the Chado schema to Drupal. MGIS allows germplasm collection examination, accession browsing, advanced search functions, and germplasm orders. Additionally, we developed unique graphical interfaces to compare accessions and to explore them based on their taxonomic information. Accession-based data has been enriched with publications, genotyping studies and associated genotyping datasets reporting on germplasm use. Finally, an interoperability layer has been implemented to facilitate the link with complementary databases like the Banana Genome Hub and the MusaBase breeding database.Database URL:
      PubDate: 2017-06-11
  • NutriChem 2.0: exploring the effect of plant-based foods on human health
           and drug efficacy

    • Authors: Ni Y; Jensen K, Kouskoumvekaki I, et al.
      Abstract: NutriChem is a database generated by text mining of 21 million MEDLINE abstracts that links plant-based foods with their small molecule components and human health effect. In this new, second release of NutriChem (NutriChem 2.0) we have integrated information on overlapping protein targets between FDA-approved drugs and small compounds in plant-based foods, which may have implications on drug pharmacokinetics and pharmacodynamics. NutriChem 2.0 contains predicted interactions between 428 drugs and 339 foods, supported by 107 jointly targeted proteins. Chemical bioactivity data were integrated, facilitating the comparison of activity concentrations between drugs and phytochemicals. In addition, we have added functionality that allows for user inspection of supporting evidence, the classification of food constituents based on KEGG “Phytochemical Compounds”, phytochemical structure output in SMILES and network output in both static figure and Cytoscape-compatible xgmml format. The current update of NutriChem moves one step further towards a more comprehensive assessment of dietary effects on human health and drug treatment.Database URL:
      PubDate: 2017-06-11
  • TIBLE: a web-based, freely accessible resource for small-molecule binding
           data for mycobacterial species

    • Authors: Malhotra S; Mugumbate G, Blundell TL, et al.
      Abstract: TIBLE is a web-based resource that provides easy access to data on the minimal inhibitory concentrations for small molecules against several mycobacterial species, as well as the target binding and off-target predictions for Mycobacterium tuberculosis. The current version of the database holds the activity data for more than 19 000 distinct small molecules against 39 mycobacterial species, binding data for 106 Mycobacterium tuberculosis target proteins and predictions for their potential off-targets. The resource integrates disparate public data and methods to provide easy access to the minimum inhibitory concentration and binding data, facilitation of data sharing, and identification of small molecules and targets for development of anti-tuberculosis therapeutics.Database URL:
      PubDate: 2017-06-11
  • Triage by ranking to support the curation of protein interactions

    • Authors: Mottin L; Pasche E, Gobeill J, et al.
      Abstract: Today, molecular biology databases are the cornerstone of knowledge sharing for life and health sciences. The curation and maintenance of these resources are labour intensive. Although text mining is gaining impetus among curators, its integration in curation workflows has not yet been widely adopted. The Swiss Institute of Bioinformatics Text Mining and CALIPHO groups joined forces to design a new curation support system named neXtA5. In this report, we explore the integration of novel triage services to support the curation of two types of biological data: protein–protein interactions (PPIs) and post-translational modifications (PTMs). The recognition of PPIs and PTMs poses a special challenge, as it not only requires the identification of biological entities (proteins or residues), but also that of particular relationships (e.g. binding or position). These relationships cannot be described with onto-terminological descriptors such as the Gene Ontology for molecular functions, which makes the triage task more challenging. Prioritizing papers for these tasks thus requires the development of different approaches. In this report, we propose a new method to prioritize articles containing information specific to PPIs and PTMs. The new resources (RESTful APIs, semantically annotated MEDLINE library) enrich the neXtA5 platform. We tuned the article prioritization model on a set of 100 proteins previously annotated by the CALIPHO group. The effectiveness of the triage service was tested with a dataset of 200 annotated proteins. We defined two sets of descriptors to support automatic triage: the first set to enrich for papers with PPI data, and the second for PTMs. All occurrences of these descriptors were marked up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as ‘binds’ or ‘interacts’) to the original query. Compared to PubMed, the search effectiveness of the neXtA5 triage service is improved by 190% for the prioritization of papers with PPI information and by 260% for papers with PTM information. Combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE contents is effective in improving triage in complex curation tasks such as the curation of PPIs and PTMs. Database URL:
      PubDate: 2017-06-11
  • Exploring convolutional neural networks for drug–drug interaction

    • Authors: Suárez-Paniagua V; Segura-Bedmar I, Martínez P.
      Abstract: Drug–drug interaction (DDI), which is a specific type of adverse drug reaction, occurs when a drug influences the level or activity of another drug. Natural language processing techniques can provide health-care professionals with a novel way of reducing the time spent reviewing the literature for potential DDIs. The current state of the art for the extraction of DDIs is based on feature-engineering algorithms (such as support vector machines), which usually require considerable time and effort. One possible alternative to these approaches is deep learning. This technique aims to automatically learn the best feature representation from the input data for a given task. The purpose of this paper is to examine whether a convolutional neural network (CNN), which only uses word embeddings as input features, can be applied successfully to classify DDIs from biomedical texts. We propose a CNN architecture with only one hidden layer, which makes the model more computationally efficient, and we perform detailed experiments in order to determine the best settings of the model. The goal is to determine the best parameters of this basic CNN that should be considered for future research. The experimental results show that the proposed approach is promising because it attained the second position in the 2013 rankings of the DDI extraction challenge. However, it obtained worse results than previous works using neural networks with more complex architectures.
      PubDate: 2017-05-25
  • Biocuration in the structure–function linkage database: the anatomy
           of a superfamily

    • Authors: Holliday GL; Brown SD, Akiva E, et al.
      Abstract: The funding section was mistakenly not included in this paper. The article has now been updated to include this section. The publisher apologises for this error.
      PubDate: 2017-05-24
  • GenomeHubs: simple containerized setup of a custom Ensembl database and
           web server for any species

    • Authors: Challis RJ; Kumar S, Stevens L, et al.
      Abstract: As the generation and use of genomic datasets is becoming increasingly common in all areas of biology, the need for resources to collate, analyse and present data from one or more genome projects is becoming more pressing. The Ensembl platform is a powerful tool to make genome data and cross-species analyses easily accessible through a web interface and a comprehensive application programming interface. Here we introduce GenomeHubs, which provide a containerized environment to facilitate the setup and hosting of custom Ensembl genome browsers. This simplifies mirroring of existing content and import of new genomic data into the Ensembl database schema. GenomeHubs also provide a set of analysis containers to decorate imported genomes with results of standard analyses and functional annotations and support export to flat files, including EMBL format for submission of assemblies and annotations to International Nucleotide Sequence Database Collaboration.Database URL:
      PubDate: 2017-05-15
  • BioM2MetDisease: a manually curated database for associations between
           microRNAs, metabolites, small molecules and metabolic diseases

    • Authors: Xu Y; Yang H, Wu T, et al.
      Abstract: BioM2MetDisease is a manually curated database that aims to provide a comprehensive and experimentally supported resource of associations between metabolic diseases and various biomolecules. Recently, metabolic diseases such as diabetes have become one of the leading threats to people’s health. Metabolic diseases are associated with alterations of multiple types of biomolecules such as miRNAs and metabolites. An integrated and high-quality data source that collects metabolic disease-associated biomolecules is essential for exploring the underlying molecular mechanisms and discovering novel therapeutics. Here, we developed the BioM2MetDisease database, which currently documents 2681 entries of relationships between 1147 biomolecules (miRNAs, metabolites and small molecules/drugs) and 78 metabolic diseases across 14 species. Each entry includes biomolecule category, species, biomolecule name, disease name, dysregulation pattern, experimental technique, a brief description of the metabolic disease-biomolecule relationship, the reference and additional annotation information. BioM2MetDisease provides a user-friendly interface to explore and retrieve all data conveniently. A submission page is also offered for researchers to submit new associations between biomolecules and metabolic diseases. BioM2MetDisease provides a comprehensive resource for studying how biomolecules act in metabolic diseases, and it is helpful for understanding the molecular mechanisms and developing novel therapeutics for metabolic diseases. Database URL:
      PubDate: 2017-05-12
  • AnnoSys—implementation of a generic annotation system for schema-based
           data using the example of biodiversity collection data

    • Authors: Suhrbier LL; Kusber WH, Tschöpe OO, et al.
      Abstract: Several errors were noted in the above paper after publication and have now been corrected.
      PubDate: 2017-05-06
  • CHOmine: an integrated data warehouse for CHO systems biology and modeling

    • Authors: Gerstl MP; Hanscho M, Ruckerbauer DE, et al.
      Abstract: The last decade has seen a surge in published genome-scale information for Chinese hamster ovary (CHO) cells, which are the main production vehicles for therapeutic proteins. While a single access point is available, the primary data is distributed over several databases at different institutions. Currently, research is frequently hampered by a plethora of gene names and IDs that vary between published draft genomes and databases, making systems biology analyses cumbersome and elaborate. Here we present CHOmine, an integrative data warehouse connecting data from various databases and linking to other ones. Furthermore, we introduce CHOmodel, a web-based resource that provides access to recently published CHO cell line-specific metabolic reconstructions. Both resources allow users to query CHO-relevant data and find interconnections between different types of data, and thus provide a simple, standardized entry point to the world of CHO systems biology. Database URL:
      PubDate: 2017-04-22
  • Improving biocuration of microRNAs in diseases: a case study in idiopathic
           pulmonary fibrosis

    • Authors: Balderas-Martínez Y; Rinaldi F, Contreras G, et al.
      Abstract: MicroRNAs (miRNAs) are small and non-coding RNA molecules that inhibit gene expression posttranscriptionally. They play important roles in several biological processes, and in recent years there has been an interest in studying how they are related to the pathogenesis of diseases. Although there are already some databases that contain information for miRNAs and their relation with illnesses, their curation represents a significant challenge due to the amount of information that is being generated every day. In particular, respiratory diseases are poorly documented in databases, despite the fact that they are of increasing concern regarding morbidity, mortality and economic impacts. In this work, we present the results that we obtained in the BioCreative Interactive Track (IAT), using a semiautomatic approach for improving biocuration of miRNAs related to diseases. Our procedures will be useful to complement databases that contain this type of information. We adapted the OntoGene text mining pipeline and the ODIN curation system in a full-text corpus of scientific publications concerning one specific respiratory disease: idiopathic pulmonary fibrosis, the most common and aggressive of the idiopathic interstitial cases of pneumonia. We curated 823 miRNA text snippets and found a total of 246 miRNAs related to this disease based on our semiautomatic approach with the system OntoGene/ODIN. The biocuration throughput improved by a factor of 12 compared with traditional manual biocuration. A significant advantage of our semiautomatic pipeline is that it can be applied to obtain the miRNAs of all the respiratory diseases and offers the possibility to be used for other illnesses.Database URL:
      PubDate: 2017-04-22
  • Strategies towards digital and semi-automated curation in RegulonDB

    • Authors: Rinaldi F; Lithgow O, Gama-Castro S, et al.
      Abstract: Several errors to the authors’ details have now been corrected in the above paper.
      PubDate: 2017-04-17
  • GeneHancer: genome-wide integration of enhancers and target genes in

    • Authors: Fishilevich S; Nudel R, Rappaport N, et al.
      Abstract: A major challenge in understanding gene regulation is the unequivocal identification of enhancer elements and uncovering their connections to genes. We present GeneHancer, a novel database of human enhancers and their inferred target genes, in the framework of GeneCards. First, we integrated a total of 434 000 reported enhancers from four different genome-wide databases: the Encyclopedia of DNA Elements (ENCODE), the Ensembl regulatory build, the functional annotation of the mammalian genome (FANTOM) project and the VISTA Enhancer Browser. Employing an integration algorithm that aims to remove redundancy, GeneHancer portrays 285 000 integrated candidate enhancers (covering 12.4% of the genome), 94 000 of which are derived from more than one source, and each assigned an annotation-derived confidence score. GeneHancer subsequently links enhancers to genes, using: tissue co-expression correlation between genes and enhancer RNAs, as well as enhancer-targeted transcription factor genes; expression quantitative trait loci for variants within enhancers; and capture Hi-C, a promoter-specific genome conformation assay. The individual scores based on each of these four methods, along with gene–enhancer genomic distances, form the basis for GeneHancer’s combinatorial likelihood-based scores for enhancer–gene pairing. Finally, we define ‘elite’ enhancer–gene relations reflecting both a high-likelihood enhancer definition and a strong enhancer–gene association.GeneHancer predictions are fully integrated in the widely used GeneCards Suite, whereby candidate enhancers and their annotations are displayed on every relevant GeneCard. This assists in the mapping of non-coding variants to enhancers, and via the linked genes, forms a basis for variant–phenotype interpretation of whole-genome sequences in health and disease.Database URL:
      PubDate: 2017-04-17
  • Surveying the Maize community for their diversity and pedigree
           visualization needs to prioritize tool development and curation

    • Authors: Sen TZ; Braun BL, Schott DA, et al.
      Abstract: The Maize Genetics and Genomics Database (MaizeGDB) team prepared a survey to identify breeders’ needs for visualizing pedigrees, diversity data and haplotypes in order to prioritize tool development and curation efforts at MaizeGDB. The survey was distributed to the maize research community on behalf of the Maize Genetics Executive Committee in Summer 2015. The survey garnered 48 responses from maize researchers, of which more than half were self-identified as breeders. The survey showed that the maize researchers considered their top priorities for visualization as: (i) displaying single nucleotide polymorphisms in a given region for a given list of lines, (ii) showing haplotypes for a given list of lines and (iii) presenting pedigree relationships visually. The survey also asked which populations would be most useful to display. The following two populations were on top of the list: (i) 3000 publicly available maize inbred lines used in Romay et al. (Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol, 2013;14:R55) and (ii) maize lines with expired Plant Variety Protection Act (ex-PVP) certificates. Driven by this strong stakeholder input, MaizeGDB staff are currently working in four areas to improve its interface and web-based tools: (i) presenting immediate progenies of currently available stocks at the MaizeGDB Stock pages, (ii) displaying the most recent ex-PVP lines described in the Germplasm Resources Information Network (GRIN) on the MaizeGDB Stock pages, (iii) developing network views of pedigree relationships and (iv) visualizing genotypes from SNP-based diversity datasets. These survey results can help other biological databases to direct their efforts according to user preferences as they serve similar types of data sets for their communities.Database URL:
      PubDate: 2017-04-17
  • TriatoKey: a web and mobile tool for biodiversity identification of
           Brazilian triatomine species

    • Authors: Márcia de Oliveira L; Nogueira de Brito R, Anderson Souza Guimarães P, et al.
      Abstract: Triatomines are blood-sucking insects that transmit the causative agent of Chagas disease, Trypanosoma cruzi. Although it is recognized as a difficult task, the correct taxonomic identification of triatomine species is crucial for vector control in Latin America, where the disease is endemic. In this context, we have developed a web and mobile tool based on a PostgreSQL database to help healthcare technicians identify triatomine vectors when technical expertise is missing. The web and mobile versions make use of real pictures of triatomine species and the dichotomous key method to support the identification of potential vectors that occur in Brazil. The tool provides an example-driven user interface with simple language. TriatoKey can also be useful for educational purposes. Database URL:
      PubDate: 2017-04-17
  • Workflow and web application for annotating NCBI BioProject transcriptome

    • Authors: Vera Alvarez R; Medeiros Vidal N, Garzón-Martínez GA, et al.
      Abstract: The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621.Database URL:
      PubDate: 2017-04-17
  • HopBase: a unified resource for Humulus genomics

    • Authors: Hill ST; Sudarsanam R, Henning J, et al.
      Abstract: Hop (Humulus lupulus L. var lupulus) is a dioecious plant of worldwide significance, used primarily for bittering and flavoring in brewing beer. Studies on the medicinal properties of several unique compounds produced by hop have led to additional interest from the pharmacy and healthcare industries, as well as from livestock production, where hop is of interest as a natural antibiotic. Genomic research in hop has resulted in a published draft genome and transcriptome assemblies. As research into the genomics of hop has gained interest, there is a critical need for centralized online genomic resources. To support the growing research community, we report the development of an online resource, HopBase. In addition to providing a gene annotation for the existing Shinsuwase draft genome, HopBase makes available genome assemblies and annotations for both the cultivar “Teamaker” and male hop accession number USDA 21422M. These genome assemblies and gene annotations, along with other common data, coupled with a genome browser and BLAST database, enable the hop community to enter the genomic age. The HopBase genomic resource is accessible online.
      PubDate: 2017-04-06
  • Chemical-induced disease relation extraction via convolutional neural

    • Authors: Gu J; Sun F, Qian L, et al.
      Abstract: This article describes our work on the BioCreative-V chemical–disease relation (CDR) extraction task, which employed a maximum entropy (ME) model and a convolutional neural network model for relation extraction at inter- and intra-sentence level, respectively. In our work, relation extraction between entity concepts in documents was simplified to relation extraction between entity mentions. We first constructed pairs of chemical and disease mentions as relation instances for training and testing stages, then we trained and applied the ME model and the convolutional neural network model for inter- and intra-sentence level, respectively. Finally, we merged the classification results from mention level to document level to acquire the final relations between chemical and disease concepts. The evaluation on the BioCreative-V CDR corpus shows the effectiveness of our proposed approach.Database URL:
      PubDate: 2017-04-02
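
      The final step described in the abstract above, merging mention-level classification results to document level, can be sketched as follows. This is a minimal illustration under a stated assumption: a chemical–disease concept pair is taken to be related in a document if at least one of its mention pairs was classified as positive. The abstract does not state the exact merging rule, and the function and tuple layout are hypothetical.

```python
from collections import defaultdict

def merge_to_document_level(mention_predictions):
    """Collapse mention-level predictions into document-level relations.

    mention_predictions: iterable of
        (doc_id, chemical_id, disease_id, is_related)
    where each tuple is the classifier output for one mention pair.
    Assumption: one positive mention pair is enough to assert the relation.
    """
    relations = defaultdict(set)
    for doc_id, chemical_id, disease_id, is_related in mention_predictions:
        if is_related:
            relations[doc_id].add((chemical_id, disease_id))
    return dict(relations)

# Example with hypothetical identifiers.
predictions = [
    ("doc1", "C1", "D1", False),
    ("doc1", "C1", "D1", True),   # a second mention pair of the same concepts
    ("doc1", "C2", "D1", False),
    ("doc2", "C3", "D2", True),
]
print(merge_to_document_level(predictions))
```

      An any-positive rule favours recall; a stricter rule (for example, a majority vote over mention pairs) would trade recall for precision.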
  • NaviCom: a web application to create interactive molecular network
           portraits using multi-level omics data

    • Authors: Dorel M; Viara E, Barillot E, et al.
      Abstract: Human diseases such as cancer are routinely characterized by high-throughput molecular technologies, and multi-level omics data are accumulating in public databases at an increasing rate. Retrieval and visualization of these data in the context of molecular network maps can provide insights into the pattern of regulation of molecular functions reflected by an omics profile. In order to make this task easy, we developed NaviCom, a Python package and web platform for visualization of multi-level omics data on top of biological network maps. NaviCom bridges the gap between cBioPortal, the most used resource of large-scale cancer omics data, and NaviCell, a data visualization web service that contains several molecular network map collections. NaviCom offers several standardized modes of data display on top of molecular network maps, allowing specific biological questions to be addressed. We illustrate how users can easily create interactive network-based cancer molecular portraits via the NaviCom web interface using the maps of the Atlas of Cancer Signalling Network (ACSN) and other maps. Analysis of these molecular portraits can help in formulating a scientific hypothesis on the molecular mechanisms deregulated in the studied disease. Database URL: NaviCom is available at
      PubDate: 2017-04-02
  • Automated PDF highlighting to support faster curation of literature for
           Parkinson’s and Alzheimer’s disease

    • Authors: Wu H; Oellrich A, Girges C, et al.
      Abstract: Neurodegenerative disorders such as Parkinson’s and Alzheimer’s disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which reveals that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases, the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process.Database URL:
      PubDate: 2017-03-27
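
      The evaluation figures in the abstract above (a macro F1-measure of 0.51, reported as a 132% increase over the best baseline) can be related to each other with a short calculation. The sketch assumes the usual definition of macro F1 as the unweighted mean of per-group F1 scores; the abstract does not say which groups are averaged over, and the counts in the example are invented for illustration only.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts):
    """Unweighted mean of F1 over groups; counts is a list of (tp, fp, fn)."""
    return sum(f1_score(*c) for c in counts) / len(counts)

def relative_increase(new, old):
    """Fractional increase of new over old (1.32 means +132%)."""
    return (new - old) / old

# Invented counts for two groups, only to show the averaging.
print(macro_f1([(5, 5, 5), (0, 1, 1)]))

# A 132% increase to 0.51 implies a baseline of 0.51 / 2.32.
implied_baseline = 0.51 / (1 + 1.32)
print(round(implied_baseline, 2), relative_increase(0.51, implied_baseline))
```

      Read this way, the reported numbers imply a baseline macro F1 of roughly 0.22; this value is derived from the two figures in the abstract and is not stated there.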
  • Effective biomedical document classification for identifying publications
           relevant to the mouse Gene Expression Database (GXD)

    • Authors: Jiang X; Ringwald M, Blake J, et al.
      Abstract: The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.Database
      PubDate: 2017-03-24
  • GrTEdb: the first web-based database of transposable elements in cotton (
           Gossypium raimondii)

    • Authors: Xu Z; Liu J, Ni W, et al.
      Abstract: Although several diploid and tetraploid Gossypium species genomes have been sequenced, a well-annotated web-based transposable element (TE) database is lacking. To better understand the roles of TEs in the structural, functional and evolutionary dynamics of the cotton genome, a comprehensive, specific, and user-friendly web-based database, the Gossypium raimondii transposable elements database (GrTEdb), was constructed. A total of 14 332 TEs were structurally annotated and clearly categorized in the G. raimondii genome, and these elements have been classified into seven distinct superfamilies based on the order of protein-coding domains, structures and/or sequence similarity, including 2929 Copia-like elements, 10 368 Gypsy-like elements, 299 L1, 12 Mutators, 435 PIF-Harbingers, 275 CACTAs and 14 Helitrons. In addition, web-based sequence browsing, searching, downloading and BLAST tools were implemented to help users easily and effectively annotate the TEs or TE fragments in genomic sequences from G. raimondii and other closely related Gossypium species. GrTEdb provides resources and information related to TEs in G. raimondii, and will facilitate gene and genome analyses within or across Gossypium species, evaluation of the impact of TEs on their host genomes, and investigation of the potential interactions between TEs and protein-coding genes in Gossypium species. Database URL:
      PubDate: 2017-03-24
  • TMPL: a database of experimental and theoretical transmembrane protein
           models positioned in the lipid bilayer

    • Authors: Postic G; Ghouzam Y, Etchebest C, et al.
      Abstract: Knowing the position of protein structures within the membrane is crucial for fundamental and applied research in the field of molecular biology. Only a few web resources provide coordinate files of oriented transmembrane proteins, and these exclude predicted structures, although they represent the largest part of the available models. In this article, we present TMPL, a database of transmembrane protein structures (α-helical and β-sheet) positioned in the lipid bilayer. It is the first database to include theoretical models of transmembrane protein structures, making it a large repository with more than 11 000 entries. The TMPL database also contains experimentally solved protein structures, which are available as either atomistic or coarse-grained models. A unique feature of TMPL is the possibility for users to update the database by uploading, through an intuitive web interface, the membrane assignments they can obtain with our recent OREMPRO web server.
      PubDate: 2017-03-24
  • WikiGenomes: an open web application for community consumption and
           curation of gene annotation data in Wikidata

    • Authors: Putman TE; Lelong S, Burgstaller-Muehlbacher S, et al.
      Abstract: With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes, a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.
      PubDate: 2017-03-24
  • ABCMdb reloaded: updates on mutations in ATP binding cassette proteins

    • Authors: Tordai H; Jakab K, Gyimesi G, et al.
      Abstract: ABC (ATP-Binding Cassette) proteins with altered function are responsible for numerous human diseases. To aid the selection of positions and amino acids for ABC structure/function studies we have generated a database, ABCMdb (Gyimesi et al., ABCMdb: a database for the comparative analysis of protein mutations in ABC transporters, and a potential framework for a general application. Hum Mutat 2012; 33:1547–1556.), with interactive tools. The database has been populated with mentions of mutations extracted from full text papers, alignments and structural models. In the new version of the database we aimed to collect the effect of mutations from databases including ClinVar. Because of the small amount of available data, even in the case of the widely studied disease-causing ABC proteins, we also included the possible effects of mutations based on SNAP2 and PROVEAN predictions. To aid the interpretation of variations in non-coding regions, the database was supplemented with related DNA level information. Our results emphasize the importance of in silico predictions because of the sparse information available on variants and suggest that mutations at analogous positions in homologous ABC proteins have a strong predictive power for the effects of mutations. Our improved ABCMdb advances the design of both experimental studies and meta-analyses in order to understand drug interactions of ABC proteins and the effects of mutations on functional expression.
      PubDate: 2017-03-18
  • AnnoSys—implementation of a generic annotation system for schema-based
           data using the example of biodiversity collection data

    • Authors: Suhrbier LL; Kusber WH, Tschöpe OO, et al.
      Abstract: Biological research collections holding billions of specimens world-wide provide the most important baseline information for systematic biodiversity research. Increasingly, specimen data records become available in virtual herbaria and data portals. The traditional (physical) annotation procedure fails here, so that an important pathway of research documentation and data quality control is broken. In order to create an online annotation system, we analysed, modeled and adapted traditional specimen annotation workflows. The AnnoSys system accesses collection data from either conventional web resources or the Biological Collection Access Service (BioCASe) and accepts XML-based data standards like ABCD or DarwinCore. It comprises a searchable annotation data repository, a user interface, and a subscription based message system. We describe the main components of AnnoSys and its current and planned interoperability with biodiversity data portals and networks. Details are given on the underlying architectural model, which implements the W3C OpenAnnotation model and allows the adaptation of AnnoSys to different problem domains. Advantages and disadvantages of different digital annotation and feedback approaches are discussed. For the biodiversity domain, AnnoSys proposes best practice procedures for digital annotations of complex records.
      PubDate: 2017-03-18
  • Better living through ontologies at the Immune Epitope Database

    • Authors: Vita R; Overton JA, Sette A, et al.
      Abstract: The Immune Epitope Database (IEDB) project incorporates independently developed ontologies and controlled vocabularies into its curation and search interface. This simplifies curation practices, improves the user query experience and facilitates interoperability between the IEDB and other resources. While the use of independently developed ontologies has long been recommended as a best practice, there continues to be a significant number of projects that develop their own vocabularies instead, or that do not fully utilize the power of ontologies that they are using. We describe how we use ontologies in the IEDB, providing a concrete example of the benefits of ontologies in practice.
      PubDate: 2017-03-18
  • Biocuration in the structure–function linkage database: the anatomy
           of a superfamily

    • Authors: Holliday GL; Brown SD, Akiva E, et al.
      Abstract: With ever-increasing amounts of sequence data available in both the primary literature and sequence repositories, there is a bottleneck in annotating molecular function to a sequence. This article describes the biocuration process and methods used in the structure-function linkage database (SFLD) to help address some of the challenges. We discuss how the hierarchy within the SFLD allows us to infer detailed functional properties for functionally diverse enzyme superfamilies in which all members are homologous, conserve an aspect of their chemical function and have associated conserved structural features that enable the chemistry. Also presented is the Enzyme Structure-Function Ontology (ESFO), which has been designed to capture the relationships between enzyme sequence, structure and function that underlie the SFLD and is used to guide the biocuration processes within the SFLD.
      PubDate: 2017-03-18
  • Ensembl core software resources: storage and programmatic access for DNA
           sequence and genome annotation

    • Authors: Ruffier M; Kähäri A, Komorowska M, et al.
      Abstract: The Ensembl software resources are a stable infrastructure to store, access and manipulate genome assemblies and their functional annotations. The Ensembl ‘Core’ database and Application Programming Interface (API) was our first major piece of software infrastructure and remains at the centre of all of our genome resources. Since its initial design more than fifteen years ago, the number of publicly available genomic, transcriptomic and proteomic datasets has grown enormously, accelerated by continuous advances in DNA-sequencing technology. Although the framework was initially intended to provide annotation for the reference human genome, we have extended it to support the genomes of all species as well as richer assembly models. Cross-referenced links to other informatics resources facilitate searching our database with a variety of popular identifiers such as UniProt and RefSeq. Our comprehensive and robust framework storing a large diversity of genome annotations in one location serves as a platform for other groups to generate and maintain their own tailored annotation. We welcome reuse and contributions: our databases and APIs are publicly available, all of our source code is released under a permissive Apache v2.0 licence, and we have an active developer mailing list.
      PubDate: 2017-03-18
  • Literature consistency of bioinformatics sequence databases is effective
           for assessing record quality

    • Authors: Bouadjenek M; Verspoor K, Zobel J.
      Abstract: Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.
      PubDate: 2017-03-18
  • MiDAS 2.0: an ecosystem-specific taxonomy and online database for the
           organisms of wastewater treatment systems expanded for anaerobic digester

    • Authors: McIlroy S; Kirkegaard R, McIlroy B, et al.
      Abstract: Wastewater is increasingly viewed as a resource, with anaerobic digester technology being routinely implemented for biogas production. Characterising the microbial communities involved in wastewater treatment facilities and their anaerobic digesters is considered key to their optimal design and operation. Amplicon sequencing of the 16S rRNA gene allows high-throughput monitoring of these systems. The MiDAS field guide is a public resource providing amplicon sequencing protocols and an ecosystem-specific taxonomic database optimized for use with wastewater treatment facility samples. The curated taxonomy endeavours to provide a genus-level classification for abundant phylotypes and the online field guide links this identity to published information regarding their ecology, function and distribution. This article describes the expansion of the database resources to cover the organisms of the anaerobic digester systems fed primary sludge and surplus activated sludge. The updated database includes descriptions of the abundant genus-level taxa in influent wastewater, activated sludge and anaerobic digesters. Abundance information is also included to allow assessment of the role of emigration in the ecology of each phylotype. MiDAS is intended as a collaborative resource for the progression of research into the ecology of wastewater treatment, by providing a public repository for knowledge that is accessible to all interested in these biotechnologically important systems.
      PubDate: 2017-03-18
  • miRnalyze: an interactive database linking tool to unlock intuitive
           microRNA regulation of cell signaling pathways

    • Authors: Subhra Das S; James M, Paul S, et al.
      Abstract: The various pathophysiological processes occurring in living systems are known to be orchestrated by delicate interplays and cross-talks between different genes and their regulators. Among the various regulators of genes, there is a class of small non-coding RNA molecules known as microRNAs (miRNAs). Although the relative simplicity of miRNAs and their ability to modulate cellular processes make them attractive therapeutic candidates, their presence in large numbers makes it challenging for experimental researchers to interpret the intricacies of the molecular processes they regulate. Most of the existing bioinformatic tools fail to address these challenges. Here, we present a new web resource ‘miRnalyze’ that has been specifically designed to directly identify the putative regulation of cell signaling pathways by miRNAs. The tool integrates miRNA-target predictions with signaling cascade members by utilizing the TargetScanHuman 7.1 miRNA-target prediction tool and the KEGG pathway database, and thus provides researchers with in-depth insights into modulation of signal transduction pathways by miRNAs. miRnalyze is capable of identifying common miRNAs targeting more than one gene in the same signaling pathway, a feature that further increases the probability of modulating the pathway and downstream reactions when using miRNA modulators. Additionally, miRnalyze can sort miRNAs according to the seed-match types and TargetScan Context ++ score, thus providing a hierarchical list of most valuable miRNAs. Furthermore, in order to provide users with comprehensive information regarding miRNAs, genes and pathways, miRnalyze also links to expression data of miRNAs (miRmine) and genes (TiGER) and proteome abundance (PaxDb) data. To validate the capability of the tool, we have documented the correlation of miRnalyze’s prediction with experimental confirmation studies.
      PubDate: 2017-03-18
  • Strategies towards digital and semi-automated curation in RegulonDB

    • Authors: Rinaldi F; Lithgow O, Gama-Castro S, et al.
      Abstract: Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage data mining and text analysis is, therefore, a promising avenue to help life science databases to cope with the deluge of novel information. In this article, we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators. Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities which are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated in the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high throughput curation process with the same quality as manual curation.
      PubDate: 2017-03-18
  • The HIV oligonucleotide database (HIVoligoDB)

    • Authors: Carneiro J; Resende A, Pereira F.
      Abstract: The human immunodeficiency virus (HIV) is associated with one of the most widespread infectious diseases, the acquired immunodeficiency syndrome (AIDS). The development of antiretroviral drugs and methods for virus detection requires a comprehensive analysis of the HIV genomic diversity, particularly in the binding sites of oligonucleotides. Here, we describe a versatile online database (HIVoligoDB) with oligonucleotides selected for the diagnosis of HIV and treatment of AIDS. Currently, the database provides an interface for visualization, analysis and download of 380 HIV-1 and 65 HIV-2 oligonucleotides annotated according to curated reference genomes. The database also allows the selection of the most conserved HIV genomic regions for the development of molecular diagnostic assays and sequence-based candidate therapeutics.
      PubDate: 2017-03-18
  • Curated protein information in the Saccharomyces genome database

    • Authors: Hellerstedt ST; Nash RS, Weng S, et al.
      Abstract: Due to recent advancements in the production of experimental proteomic data, the Saccharomyces genome database (SGD) has been expanding our protein curation activities to make new data types available to our users. Because of broad interest in post-translational modifications (PTM) and their importance to protein function and regulation, we have recently started incorporating expertly curated PTM information on individual protein pages. Here we also present the inclusion of new abundance and protein half-life data obtained from high-throughput proteome studies. These new data types have been included with the aim of facilitating cellular biology research.
      PubDate: 2017-03-11
  • The ‘straight mouse’: defining anatomical axes in 3D embryo

    • Authors: Armit C; Hill B, Venkataraman SS, et al.
      Abstract: A primary objective of the eMouseAtlas Project is to enable 3D spatial mapping of whole embryo gene expression data to capture complex 3D patterns for indexing, visualization, cross-comparison and analysis. For this we have developed a spatio-temporal framework based on 3D models of embryos at different stages of development coupled with an anatomical ontology. Here we introduce a method of defining coordinate axes that correspond to the anatomical or biologically relevant anterior–posterior (A–P), dorsal–ventral (D–V) and left–right (L–R) directions. These enable more sophisticated query and analysis of the data with biologically relevant associations, and provide novel data visualizations that can reveal patterns that are otherwise difficult to detect in the standard 3D coordinate space. These anatomical coordinates are defined using the concept of a ‘straight mouse-embryo’ within which the anatomical coordinates are Cartesian. The straight embryo model has been mapped via a complex non-linear transform onto the standard embryo model. We explore the utility of this anatomical coordinate system in elucidating the spatial relationship of spatially mapped embryonic ‘Fibroblast growth factor’ gene expression patterns, and we discuss the importance of this technology in summarizing complex multimodal mouse embryo image data from gene expression and anatomy studies.
      PubDate: 2017-03-11
  • Boechera microsatellite website: an online portal for species
           identification and determination of hybrid parentage

    • Authors: Li F; Rushworth CA, Beck JB, et al.
      Abstract: Boechera (Brassicaceae) has many features to recommend it as a model genus for ecological and evolutionary research, including species richness, ecological diversity, experimental tractability and close phylogenetic proximity to Arabidopsis. However, efforts to realize the full potential of this model system have been thwarted by the frequent inability of researchers to identify their samples and place them in a broader evolutionary context. Here we present the Boechera Microsatellite Website (BMW), a portal that archives over 55 000 microsatellite allele calls from 4471 specimens (including 133 nomenclatural types). The portal includes analytical tools that utilize data from 15 microsatellite loci as a highly effective DNA barcoding system. The BMW facilitates the accurate identification of Boechera samples and the investigation of reticulate evolution among the ±83 sexual diploid taxa in the genus, thereby greatly enhancing Boechera’s potential as a model system.
      PubDate: 2017-02-27
  • Actionable, long-term stable and semantic web compatible identifiers for
           access to biological collection objects

    • Authors: Güntsch A; Hyam R, Hagedorn G, et al.
      Abstract: With biodiversity research activities being increasingly shifted to the web, the need for a system of persistent and stable identifiers for physical collection objects becomes increasingly pressing. The Consortium of European Taxonomic Facilities agreed on a common system of HTTP-URI-based stable identifiers which is now being rolled out to its member organizations. The system follows Linked Open Data principles and implements redirection mechanisms to human-readable and machine-readable representations of specimens, facilitating seamless integration into the growing semantic web. The implementation of stable identifiers across collection organizations is supported with open source provider software scripts, best practices documentation and recommendations for RDF metadata elements, facilitating harmonized access to collection information in web portals.
      PubDate: 2017-02-26
  • BELMiner: adapting a rule-based relation extraction system to extract
           biological expression language statements from bio-medical literature
           evidence sentences

    • Authors: Ravikumar KE; Rastegar-Mojarad M, Liu H.
      Abstract: Extracting meaningful relationships with semantic significance from biomedical literature is often a challenging task. The BioCreative V Track4 challenge has, for the first time, organized a comprehensive shared task to test the robustness of text-mining algorithms in extracting semantically meaningful assertions from evidence statements in biomedical text. In this work, we tested the ability of a rule-based semantic parser to extract Biological Expression Language (BEL) statements from evidence sentences culled out of biomedical literature as part of the BioCreative V Track4 challenge. The system achieved an overall best F-measure of 21.29% in extracting the complete BEL statement. For relation extraction, the system achieved an F-measure of 65.13% on the test data set. Our system achieved the best performance in five of the six criteria that were adopted for evaluation by the task organizers. A lack of ability to derive semantic inferences and limitations in the rule sets that map the textual extractions to BEL functions were some of the reasons for low performance in extracting the complete BEL statement. After the shared task, we also evaluated the impact of different NER components on the ability to extract BEL statements on the test data sets, besides making a single change in the rule sets that translate relation extractions into a BEL statement. There is a marked improvement of over 20% in the overall performance of BELMiner’s capability to extract BEL statements on the test set. The system is available as a REST API.
      PubDate: 2017-02-26
  • Carotenoids Database: structures, chemical fingerprints and distribution
           among organisms

    • Authors: Yabuzaki J.
      Abstract: To promote understanding of how organisms are related via carotenoids, either evolutionarily or symbiotically, or in food chains through natural histories, we built the Carotenoids Database. This provides chemical information on 1117 natural carotenoids with 683 source organisms. For extracting organisms closely related through the biosynthesis of carotenoids, we offer a new similarity search system ‘Search similar carotenoids’ using our original chemical fingerprint ‘Carotenoid DB Chemical Fingerprints’. These Carotenoid DB Chemical Fingerprints describe the chemical substructure and the modification details based upon International Union of Pure and Applied Chemistry (IUPAC) semi-systematic names of the carotenoids. The fingerprints also allow (i) easier prediction of six biological functions of carotenoids: provitamin A, membrane stabilizers, odorous substances, allelochemicals, antiproliferative activity and reverse MDR activity against cancer cells, (ii) easier classification of carotenoid structures, (iii) partial and exact structure searching and (iv) easier extraction of structural isomers and stereoisomers. We believe this to be the first attempt to establish fingerprints using the IUPAC semi-systematic names. For extracting close profiled organisms, we provide a new tool ‘Search similar profiled organisms’. Our current statistics show some insights into natural history: carotenoids seem to have been spread largely by bacteria, as they produce C30, C40, C45 and C50 carotenoids, with the widest range of end groups, and they share a small portion of C40 carotenoids with eukaryotes. Archaea share an even smaller portion with eukaryotes. Eukaryotes then have evolved a considerable variety of C40 carotenoids. Considering carotenoids, eukaryotes seem more closely related to bacteria than to archaea aside from 16S rRNA lineage analysis.
      PubDate: 2017-02-26
  • OCaPPI-Db: an oligonucleotide probe database for pathogen identification
           through hybridization capture

    • Authors: Gasc C; Constantin A, Jaziri F, et al.
      Abstract: The detection and identification of bacterial pathogens involved in acts of bio- and agroterrorism are essential to avoid pathogen dispersal in the environment and propagation within the population. Conventional molecular methods, such as PCR amplification, DNA microarrays or shotgun sequencing, are subject to various limitations when assessing environmental samples, which can lead to inaccurate findings. We developed a hybridization capture strategy that uses a set of oligonucleotide probes to target and enrich biomarkers of interest in environmental samples. Here, we present Oligonucleotide Capture Probes for Pathogen Identification Database (OCaPPI-Db), an online capture probe database containing a set of 1,685 oligonucleotide probes allowing for the detection and identification of 30 biothreat agents up to the species level. This probe set can be used in its entirety as a comprehensive diagnostic tool or can be restricted to a set of probes targeting a specific pathogen or virulence factor according to the user’s needs.
      PubDate: 2017-02-26
  • Outreach and online training services at the Saccharomyces Genome Database

    • Authors: MacPherson KA; Starr B, Wong ED, et al.
      Abstract: The Saccharomyces Genome Database (SGD), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides free public access to expertly curated information about the yeast genome and its gene products. As the central hub for the yeast research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases.
      PubDate: 2017-02-26
  • PCPPI: a comprehensive database for the prediction of Penicillium–crop
           protein–protein interactions

    • Authors: Yue J; Zhang D, Ban R, et al.
      Abstract: Penicillium expansum, the causal agent of blue mold, is one of the most prevalent post-harvest pathogens, infecting a wide range of crops after harvest. In response, crops have evolved various defense systems to protect themselves against this and other pathogens. Penicillium–crop interaction is a multifaceted process mediated by pathogen- and host-derived proteins. Identification and characterization of the inter-species protein–protein interactions (PPIs) are fundamental to elucidating the molecular mechanisms underlying infection processes between P. expansum and plant crops. Here, we have developed PCPPI, the Penicillium–Crop Protein–Protein Interactions database, which is constructed based on the experimentally determined orthologous interactions in pathogen–plant systems and available domain–domain interactions (DDIs) in each PPI. Thus far, it stores information on 9911 proteins, 439 904 interactions and seven host species: apple, kiwifruit, maize, pear, rice, strawberry and tomato. Further analysis through gene ontology (GO) annotation indicated that proteins with more interacting partners tend to execute essential functions. Significantly, semantic statistics of the GO terms also provided strong support for the accuracy of our predicted interactions in PCPPI. We believe that the PCPPI datasets, which are freely available to the research community, will help facilitate the study of pathogen–crop interactions.
      PubDate: 2017-02-26
  • SilkPathDB: a comprehensive resource for the study of silkworm pathogens

    • Authors: Li T; Pan G, Vossbrinck CR, et al.
      Abstract: Silkworm pathogens have heavily impeded the development of the sericultural industry and play important roles in lepidopteran ecology, and some are used as biological insecticides. Rapid advances in studies on the omics of silkworm pathogens have produced a large amount of data, which need to be brought together centrally in a coherent and systematic manner. This will facilitate the reuse of these data for further analysis. We have collected genomic data for 86 silkworm pathogens from 4 taxa (fungi, microsporidia, bacteria and viruses) and from 4 lepidopteran hosts, and developed the open-access Silkworm Pathogen Database (SilkPathDB) to make this information readily available. The implementation of SilkPathDB involves integrating Drupal and GBrowse as a graphic interface for a Chado relational database which houses all of the datasets involved. The genomes have been assembled and annotated for comparative purposes and allow the search and analysis of homologous sequences, transposable elements, protein subcellular locations, including secreted proteins, and gene ontology. We believe that SilkPathDB will aid researchers in identifying silkworm parasites and in understanding the mechanisms of silkworm infections and the developmental ecology of silkworm parasites (gene expression) and their hosts.
      PubDate: 2017-02-26
  • VerSeDa: vertebrate secretome database

    • Authors: Cortazar AR; Oguiza JA, Aransay AM, et al.
      Abstract: With current tools, de novo secretome (full set of proteins secreted by an organism) prediction is a time-consuming bioinformatic task that requires a multifactorial analysis in order to obtain reliable in silico predictions. Hence, to accelerate this process and offer researchers a reliable repository where secretome information can be obtained for vertebrates and model organisms, we have developed VerSeDa (Vertebrate Secretome Database). This freely available database stores information about proteins that are predicted to be secreted through the classical and non-classical mechanisms, for the wide range of vertebrate species deposited at the NCBI, UCSC and ENSEMBL sites. To our knowledge, VerSeDa is the only state-of-the-art database designed to store secretome data from multiple vertebrate genomes, thus saving a considerable amount of the time spent predicting protein features that can be retrieved from this repository directly.
      PubDate: 2017-02-24
  • RAIN: RNA–protein Association and Interaction Networks

    • Abstract: doi: 10.1093/database/baw167
      PubDate: 2017-02-10
  • Automatic query generation using word embeddings for retrieving passages
           describing experimental methods

    • Authors: Aydın F; Hüsünbeyi Z, Özgür A.
      Abstract: Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. Much of the information about PPIs and the experimental methods is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages with experimental methods for physical interactions between proteins as an information retrieval search task. The baseline system is based on query matching, where the queries are generated by utilizing the names (including synonyms) of the experimental methods in the Proteomics Standard Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods, where the baseline queries are expanded by including additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency–relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained by utilizing a large unlabeled full text corpus. The proposed methods are evaluated on the test set consisting of 17 articles. Both methods obtain higher recall scores compared with the baseline, with a loss in precision. Besides higher recall, the word embeddings based approach achieves higher F-measure than the baseline and the tf.rf based methods. We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article. Database URL:
      PubDate: 2017-01-10
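The supervised term selection described in this abstract can be illustrated with a small sketch. It assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c count the positive and negative documents containing a term; the abstract does not state the exact formula or preprocessing used, so the function names, the toy passages and the use of category-wide term counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def rf(term, positive_docs, negative_docs):
    """Relevance frequency, assuming rf = log2(2 + a / max(1, c)).

    a = positive documents containing the term,
    c = negative documents containing the term.
    """
    a = sum(1 for doc in positive_docs if term in doc)
    c = sum(1 for doc in negative_docs if term in doc)
    return math.log2(2 + a / max(1, c))


def top_terms(positive_docs, negative_docs, k=3):
    """Rank terms of the positive category by tf * rf (ties broken alphabetically)."""
    tf = Counter(tok for doc in positive_docs for tok in doc)
    scores = {t: tf[t] * rf(t, positive_docs, negative_docs) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical tokenized passages: positives mention one experimental method.
positive = [["yeast", "two", "hybrid", "assay"], ["two", "hybrid", "screen"]]
negative = [["western", "blot", "assay"], ["assay", "buffer"]]

# Candidate expansion terms for the method's query.
expansion = top_terms(positive, negative, k=2)
```

Terms that are frequent in passages about one method but rare elsewhere (here "hybrid" and "two") rank above generic terms such as "assay", which is the property that makes them useful for query expansion.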
  • blend4php: a PHP API for galaxy

    • Authors: Wytko C; Soto B, Ficklin SP.
      Abstract: Galaxy is a popular framework for execution of complex analytical pipelines, typically for large data sets, and is commonly used for (but not limited to) genomic, genetic and related biological analysis. It provides a web front-end and integrates with high performance computing resources. Here we report the development of the blend4php library that wraps Galaxy’s RESTful API into a PHP-based library. PHP-based web applications can use blend4php to automate execution, monitoring and management of a remote Galaxy server, including its users, workflows, jobs and more. The blend4php library was specifically developed for the integration of Galaxy with Tripal, the open-source toolkit for the creation of online genomic and genetic web sites. However, it was designed as an independent library for use by any application, and is freely available under version 3 of the GNU Lesser General Public License (LGPL v3.0) at URL:
      PubDate: 2017-01-10
  • Duplicates, redundancies and inconsistencies in the primary nucleotide
           databases: a descriptive study

    • Authors: Chen Q; Zobel J, Verspoor K.
      Abstract: GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC, a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases, in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases. Database URL: the merged records are available at
      PubDate: 2017-01-10
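The case study mentioned in this abstract (GC content and melting temperature) can be illustrated with a toy example. The abstract does not say which melting-temperature formula was used; the sketch below substitutes the simple Wallace rule for short oligonucleotides, Tm = 2(A+T) + 4(G+C) in degrees Celsius, and the two sequences are invented stand-ins for a pair of duplicate records that differ by one base.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Melting temperature in degrees Celsius by the Wallace rule.

    Assumption: the rule Tm = 2*(A+T) + 4*(G+C), which is only a rough
    estimate intended for short oligonucleotides.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical duplicate records for the same locus, differing at one position.
record_a = "ATGCGCATTA"
record_b = "ATGCGCGTTA"

# Which record is kept changes the derived statistics.
summary = {
    "a": (gc_content(record_a), wallace_tm(record_a)),
    "b": (gc_content(record_b), wallace_tm(record_b)),
}
```

Even a single-base difference between records treated as the same entry shifts both statistics, which is the kind of inconsistency the abstract refers to.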
  • FARME DB: a functional antibiotic resistance element database

    • Authors: Wallace JC; Port JA, Smith MN, et al.
      Abstract: Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publicly available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL:
      PubDate: 2017-01-10
  • KTCNlncDB—a first platform to investigate lncRNAs expressed in human
           keratoconus and non-keratoconus corneas

    • Authors: Szcześniak MW; Kabza M, Karolak JA, et al.
      Abstract: Keratoconus (KTCN, OMIM 148300) is a degenerative eye disorder characterized by progressive stromal thinning that leads to a conical shape of the cornea, resulting in optical aberrations and even loss of visual function. The biochemical background of the disease is poorly understood, which motivated us to perform an RNA-Seq experiment aimed at better characterizing the KTCN transcriptome and identifying long non-coding RNAs (lncRNAs) that might be involved in KTCN etiology. The in silico functional studies based on predicted lncRNA:RNA base-pairings led us to recognize a number of lncRNAs possibly regulating genes with known or plausible links to KTCN. The lncRNA sequences and data regarding their predicted functions in controlling RNA processing and stability are available for browsing, searching and downloading in KTCNlncDB, the first online platform devoted to the KTCN transcriptome. Database URL:
      PubDate: 2017-01-10
  • MAHMI database: a comprehensive MetaHit-based resource for the study of
           the mechanism of action of the human microbiota

    • Authors: Blanco-Míguez A; Gutiérrez-Jácome A, Fdez-Riverola F, et al.
      Abstract: The Mechanism of Action of the Human Microbiome (MAHMI) database is a unique resource that provides comprehensive information about the sequence of potential immunomodulatory and antiproliferative peptides encrypted in the proteins produced by the human gut microbiota. Currently, the MAHMI database contains over 300 hundred million peptide entries, with detailed information about peptide sequence, sources and potential bioactivity. The reference peptide data section is curated manually by domain experts. The in silico peptide data section is populated automatically through the systematic processing of publicly available exoproteomes of the human microbiome. Bioactivity prediction is based on the global alignment of the automatically processed peptides with experimentally validated immunomodulatory and antiproliferative peptides in the reference section. MAHMI provides researchers with a comparative tool for inspecting the potential immunomodulatory or antiproliferative bioactivity of new amino acid sequences and identifying promising peptides to be further investigated. Moreover, researchers are welcome to submit new experimental evidence on peptide bioactivity, namely, empiric and structural data, as a proactive, expert means to keep the database updated and improve the implemented bioactivity prediction method. Bioactive peptides identified by MAHMI have a huge biotechnological potential, including the manipulation of aberrant immune responses and the design of new functional ingredients/foods based on the genetic sequences of the human microbiome. Hopefully, the resources provided by MAHMI will be useful to those researching gastrointestinal disorders of autoimmune and inflammatory nature, such as Inflammatory Bowel Diseases. The MAHMI database is routinely updated and is available free of charge. Database URL:
      PubDate: 2017-01-10
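The global alignment step that this abstract names as the basis of bioactivity prediction can be sketched with the classic Needleman-Wunsch recurrence. The abstract does not give the scoring scheme, so the unit match/mismatch/gap scores below are assumptions (a real peptide pipeline would more likely use a substitution matrix such as BLOSUM62), and the peptide strings are invented.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def best_reference(query, references):
    """Return the reference peptide with the highest global alignment score."""
    return max(references, key=lambda ref: global_alignment_score(query, ref))


# Hypothetical reference peptides and a query peptide.
references = ["KLVFFAE", "GAVLIMP"]
closest = best_reference("KLVFAE", references)
```

Because the alignment is global, every residue of both peptides contributes to the score, so a query is matched to a reference only when the two are similar over their full length.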
  • RAIN: RNA–protein Association and Interaction Networks

    • Authors: Junge A; Refsgaard JC, Garde C, et al.
      Abstract: Protein association networks can be inferred from a range of resources including experimental data, literature mining and computational predictions. These types of evidence are emerging for non-coding RNAs (ncRNAs) as well. However, integration of ncRNAs into protein association networks is challenging due to data heterogeneity. Here, we present a database of ncRNA–RNA and ncRNA–protein interactions and its integration with the STRING database of protein–protein interactions. These ncRNA associations cover four organisms and have been established from curated examples, experimental data, interaction predictions and automatic literature mining. RAIN uses an integrative scoring scheme to assign a confidence score to each interaction. We demonstrate that RAIN outperforms the underlying microRNA-target predictions in inferring ncRNA interactions. RAIN can be operated through an easily accessible web interface and all interaction data can be downloaded. Database URL:
      PubDate: 2017-01-10
  • The BioC-BioGRID corpus: full text articles annotated for curation of
           protein–protein and genetic interactions

    • Authors: Islamaj Doğan R; Kim S, Chatr-aryamontri A, et al.
      Abstract: A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
      The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL:
      PubDate: 2017-01-10
  • FirebrowseR: an R client to the Broad Institute’s Firehose Pipeline

    • Authors: Deng M; Brägelmann J, Kryukov I, et al.
      Abstract: With its Firebrowse service, the Broad Institute is making large-scale multi-platform omics data analysis results publicly available through a Representational State Transfer (REST) Application Programming Interface (API). Querying this database through an API client from an arbitrary programming environment is an essential task, allowing other developers and researchers to focus on their analysis and avoid data wrangling. Hence, as a first result, we developed a workflow to automatically generate, test and deploy such clients for rapid response to API changes. Its underlying infrastructure, a combination of free and publicly available web services, facilitates the development of API clients. It decouples changes in server software from the client software by reacting to changes in the RESTful service and removing direct dependencies on a specific implementation of an API. As a second result, FirebrowseR, an R client to the Broad Institute’s RESTful Firehose Pipeline, is provided as a working example, which is built by means of the presented workflow. The package’s features are demonstrated by an example analysis of cancer gene expression data. Database URL:
      PubDate: 2017-01-06
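The idea of generating an API client from a description of the service, rather than hand-writing each call, can be sketched as follows. This is not the workflow or the API from the article: the specification dictionary, the base URL, the endpoint name and its parameters are all hypothetical, and the generated functions only build request URLs instead of contacting a server.

```python
from urllib.parse import urlencode

# Hypothetical machine-readable description of a REST service.
API_SPEC = {
    "base_url": "https://api.example.org/v1",
    "endpoints": {
        "expression": {"path": "/expression", "params": ["gene", "cohort", "page"]},
    },
}


def make_client(spec):
    """Generate one URL-building function per endpoint listed in the spec."""

    def make_call(path, allowed):
        def call(**params):
            # Reject parameters the spec does not declare for this endpoint.
            unknown = set(params) - set(allowed)
            if unknown:
                raise ValueError(f"unsupported parameters: {sorted(unknown)}")
            query = urlencode(sorted(params.items()))
            url = f"{spec['base_url']}{path}"
            return f"{url}?{query}" if query else url

        return call

    return {
        name: make_call(endpoint["path"], endpoint["params"])
        for name, endpoint in spec["endpoints"].items()
    }


client = make_client(API_SPEC)
url = client["expression"](gene="TP53", cohort="BRCA")
```

When the service adds or renames an endpoint, only the specification changes and the client is regenerated, which is one way to read the decoupling of server and client software described in the abstract.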