Publisher: Oxford University Press   (Total: 396 journals)

Showing 1 - 200 of 396 Journals sorted alphabetically
ACS Symposium Series     Full-text available via subscription   (SJR: 0.189, CiteScore: 0)
Acta Biochimica et Biophysica Sinica     Hybrid Journal   (Followers: 5, SJR: 0.79, CiteScore: 2)
Adaptation     Hybrid Journal   (Followers: 9, SJR: 0.143, CiteScore: 0)
Advances in Nutrition     Hybrid Journal   (Followers: 50, SJR: 2.196, CiteScore: 5)
Aesthetic Surgery J.     Hybrid Journal   (Followers: 6, SJR: 1.434, CiteScore: 1)
African Affairs     Hybrid Journal   (Followers: 65, SJR: 1.869, CiteScore: 2)
Age and Ageing     Hybrid Journal   (Followers: 90, SJR: 1.989, CiteScore: 4)
Alcohol and Alcoholism     Hybrid Journal   (Followers: 18, SJR: 1.376, CiteScore: 3)
American Entomologist     Full-text available via subscription   (Followers: 7)
American Historical Review     Hybrid Journal   (Followers: 156, SJR: 0.467, CiteScore: 1)
American J. of Agricultural Economics     Hybrid Journal   (Followers: 42, SJR: 2.113, CiteScore: 3)
American J. of Clinical Nutrition     Hybrid Journal   (Followers: 159, SJR: 3.438, CiteScore: 6)
American J. of Epidemiology     Hybrid Journal   (Followers: 181, SJR: 2.713, CiteScore: 3)
American J. of Hypertension     Hybrid Journal   (Followers: 25, SJR: 1.322, CiteScore: 3)
American J. of Jurisprudence     Hybrid Journal   (Followers: 19, SJR: 0.281, CiteScore: 1)
American J. of Legal History     Full-text available via subscription   (Followers: 8, SJR: 0.116, CiteScore: 0)
American Law and Economics Review     Hybrid Journal   (Followers: 27, SJR: 1.053, CiteScore: 1)
American Literary History     Hybrid Journal   (Followers: 16, SJR: 0.391, CiteScore: 0)
Analysis     Hybrid Journal   (Followers: 22, SJR: 1.038, CiteScore: 1)
Animal Frontiers     Hybrid Journal   (Followers: 1)
Annals of Behavioral Medicine     Hybrid Journal   (Followers: 15, SJR: 1.423, CiteScore: 3)
Annals of Botany     Hybrid Journal   (Followers: 36, SJR: 1.721, CiteScore: 4)
Annals of Oncology     Hybrid Journal   (Followers: 45, SJR: 5.599, CiteScore: 9)
Annals of the Entomological Society of America     Full-text available via subscription   (Followers: 10, SJR: 0.722, CiteScore: 1)
Annals of Work Exposures and Health     Hybrid Journal   (Followers: 32, SJR: 0.728, CiteScore: 2)
AoB Plants     Open Access   (Followers: 4, SJR: 1.28, CiteScore: 3)
Applied Economic Perspectives and Policy     Hybrid Journal   (Followers: 18, SJR: 0.858, CiteScore: 2)
Applied Linguistics     Hybrid Journal   (Followers: 56, SJR: 2.987, CiteScore: 3)
Applied Mathematics Research eXpress     Hybrid Journal   (Followers: 1, SJR: 1.241, CiteScore: 1)
Arbitration Intl.     Full-text available via subscription   (Followers: 20)
Arbitration Law Reports and Review     Hybrid Journal   (Followers: 14)
Archives of Clinical Neuropsychology     Hybrid Journal   (Followers: 30, SJR: 0.731, CiteScore: 2)
Aristotelian Society Supplementary Volume     Hybrid Journal   (Followers: 3)
Arthropod Management Tests     Hybrid Journal   (Followers: 2)
Astronomy & Geophysics     Hybrid Journal   (Followers: 43, SJR: 0.146, CiteScore: 0)
Behavioral Ecology     Hybrid Journal   (Followers: 52, SJR: 1.871, CiteScore: 3)
Bioinformatics     Hybrid Journal   (Followers: 311, SJR: 6.14, CiteScore: 8)
Biology Methods and Protocols     Hybrid Journal  
Biology of Reproduction     Full-text available via subscription   (Followers: 9, SJR: 1.446, CiteScore: 3)
Biometrika     Hybrid Journal   (Followers: 20, SJR: 3.485, CiteScore: 2)
BioScience     Hybrid Journal   (Followers: 29, SJR: 2.754, CiteScore: 4)
Bioscience Horizons : The National Undergraduate Research J.     Open Access   (Followers: 1, SJR: 0.146, CiteScore: 0)
Biostatistics     Hybrid Journal   (Followers: 17, SJR: 1.553, CiteScore: 2)
BJA : British J. of Anaesthesia     Hybrid Journal   (Followers: 169, SJR: 2.115, CiteScore: 3)
BJA Education     Hybrid Journal   (Followers: 65)
Brain     Hybrid Journal   (Followers: 68, SJR: 5.858, CiteScore: 7)
Briefings in Bioinformatics     Hybrid Journal   (Followers: 49, SJR: 2.505, CiteScore: 5)
Briefings in Functional Genomics     Hybrid Journal   (Followers: 3, SJR: 2.15, CiteScore: 3)
British J. for the Philosophy of Science     Hybrid Journal   (Followers: 35, SJR: 2.161, CiteScore: 2)
British J. of Aesthetics     Hybrid Journal   (Followers: 25, SJR: 0.508, CiteScore: 1)
British J. of Criminology     Hybrid Journal   (Followers: 586, SJR: 1.828, CiteScore: 3)
British J. of Social Work     Hybrid Journal   (Followers: 87, SJR: 1.019, CiteScore: 2)
British Medical Bulletin     Hybrid Journal   (Followers: 6, SJR: 1.355, CiteScore: 3)
British Yearbook of Intl. Law     Hybrid Journal   (Followers: 32)
Bulletin of the London Mathematical Society     Hybrid Journal   (Followers: 4, SJR: 1.376, CiteScore: 1)
Cambridge J. of Economics     Hybrid Journal   (Followers: 65, SJR: 0.764, CiteScore: 2)
Cambridge J. of Regions, Economy and Society     Hybrid Journal   (Followers: 11, SJR: 2.438, CiteScore: 4)
Cambridge Quarterly     Hybrid Journal   (Followers: 9, SJR: 0.104, CiteScore: 0)
Capital Markets Law J.     Hybrid Journal   (Followers: 2, SJR: 0.222, CiteScore: 0)
Carcinogenesis     Hybrid Journal   (Followers: 2, SJR: 2.135, CiteScore: 5)
Cardiovascular Research     Hybrid Journal   (Followers: 14, SJR: 3.002, CiteScore: 5)
Cerebral Cortex     Hybrid Journal   (Followers: 46, SJR: 3.892, CiteScore: 6)
CESifo Economic Studies     Hybrid Journal   (Followers: 18, SJR: 0.483, CiteScore: 1)
Chemical Senses     Hybrid Journal   (Followers: 1, SJR: 1.42, CiteScore: 3)
Children and Schools     Hybrid Journal   (Followers: 5, SJR: 0.246, CiteScore: 0)
Chinese J. of Comparative Law     Hybrid Journal   (Followers: 4, SJR: 0.412, CiteScore: 0)
Chinese J. of Intl. Law     Hybrid Journal   (Followers: 22, SJR: 0.329, CiteScore: 0)
Chinese J. of Intl. Politics     Hybrid Journal   (Followers: 10, SJR: 1.392, CiteScore: 2)
Christian Bioethics: Non-Ecumenical Studies in Medical Morality     Hybrid Journal   (Followers: 10, SJR: 0.183, CiteScore: 0)
Classical Receptions J.     Hybrid Journal   (Followers: 27, SJR: 0.123, CiteScore: 0)
Clean Energy     Open Access   (Followers: 1)
Clinical Infectious Diseases     Hybrid Journal   (Followers: 67, SJR: 5.051, CiteScore: 5)
Communication Theory     Hybrid Journal   (Followers: 23, SJR: 2.424, CiteScore: 3)
Communication, Culture & Critique     Hybrid Journal   (Followers: 27, SJR: 0.222, CiteScore: 1)
Community Development J.     Hybrid Journal   (Followers: 27, SJR: 0.268, CiteScore: 1)
Computer J.     Hybrid Journal   (Followers: 9, SJR: 0.319, CiteScore: 1)
Conservation Physiology     Open Access   (Followers: 2, SJR: 1.818, CiteScore: 3)
Contemporary Women's Writing     Hybrid Journal   (Followers: 9, SJR: 0.121, CiteScore: 0)
Contributions to Political Economy     Hybrid Journal   (Followers: 5, SJR: 0.906, CiteScore: 1)
Critical Values     Full-text available via subscription  
Current Developments in Nutrition     Open Access   (Followers: 2)
Current Legal Problems     Hybrid Journal   (Followers: 29)
Current Zoology     Full-text available via subscription   (Followers: 3, SJR: 1.164, CiteScore: 2)
Database : The J. of Biological Databases and Curation     Open Access   (Followers: 8, SJR: 1.791, CiteScore: 3)
Digital Scholarship in the Humanities     Hybrid Journal   (Followers: 14, SJR: 0.259, CiteScore: 1)
Diplomatic History     Hybrid Journal   (Followers: 20, SJR: 0.45, CiteScore: 1)
DNA Research     Open Access   (Followers: 5, SJR: 2.866, CiteScore: 6)
Dynamics and Statistics of the Climate System     Open Access   (Followers: 4)
Early Music     Hybrid Journal   (Followers: 16, SJR: 0.139, CiteScore: 0)
Economic Policy     Hybrid Journal   (Followers: 41, SJR: 3.584, CiteScore: 3)
ELT J.     Hybrid Journal   (Followers: 24, SJR: 0.942, CiteScore: 1)
English Historical Review     Hybrid Journal   (Followers: 54, SJR: 0.612, CiteScore: 1)
English: J. of the English Association     Hybrid Journal   (Followers: 14, SJR: 0.1, CiteScore: 0)
Environmental Entomology     Full-text available via subscription   (Followers: 11, SJR: 0.818, CiteScore: 2)
Environmental Epigenetics     Open Access   (Followers: 3)
Environmental History     Hybrid Journal   (Followers: 27, SJR: 0.408, CiteScore: 1)
EP-Europace     Hybrid Journal   (Followers: 2, SJR: 2.748, CiteScore: 4)
Epidemiologic Reviews     Hybrid Journal   (Followers: 9, SJR: 4.505, CiteScore: 8)
ESHRE Monographs     Hybrid Journal  
Essays in Criticism     Hybrid Journal   (Followers: 17, SJR: 0.113, CiteScore: 0)
European Heart J.     Hybrid Journal   (Followers: 57, SJR: 9.315, CiteScore: 9)
European Heart J. - Cardiovascular Imaging     Hybrid Journal   (Followers: 9, SJR: 3.625, CiteScore: 3)
European Heart J. - Cardiovascular Pharmacotherapy     Full-text available via subscription   (Followers: 1)
European Heart J. - Quality of Care and Clinical Outcomes     Hybrid Journal  
European Heart J. : Case Reports     Open Access  
European Heart J. Supplements     Hybrid Journal   (Followers: 8, SJR: 0.223, CiteScore: 0)
European J. of Cardio-Thoracic Surgery     Hybrid Journal   (Followers: 9, SJR: 1.681, CiteScore: 2)
European J. of Intl. Law     Hybrid Journal   (Followers: 187, SJR: 0.694, CiteScore: 1)
European J. of Orthodontics     Hybrid Journal   (Followers: 4, SJR: 1.279, CiteScore: 2)
European J. of Public Health     Hybrid Journal   (Followers: 20, SJR: 1.36, CiteScore: 2)
European Review of Agricultural Economics     Hybrid Journal   (Followers: 10, SJR: 1.172, CiteScore: 2)
European Review of Economic History     Hybrid Journal   (Followers: 30, SJR: 0.702, CiteScore: 1)
European Sociological Review     Hybrid Journal   (Followers: 42, SJR: 2.728, CiteScore: 3)
Evolution, Medicine, and Public Health     Open Access   (Followers: 12)
Family Practice     Hybrid Journal   (Followers: 16, SJR: 1.018, CiteScore: 2)
Fems Microbiology Ecology     Hybrid Journal   (Followers: 13, SJR: 1.492, CiteScore: 4)
Fems Microbiology Letters     Hybrid Journal   (Followers: 27, SJR: 0.79, CiteScore: 2)
Fems Microbiology Reviews     Hybrid Journal   (Followers: 31, SJR: 7.063, CiteScore: 13)
Fems Yeast Research     Hybrid Journal   (Followers: 14, SJR: 1.308, CiteScore: 3)
Food Quality and Safety     Open Access   (Followers: 1)
Foreign Policy Analysis     Hybrid Journal   (Followers: 24, SJR: 1.425, CiteScore: 1)
Forest Science     Hybrid Journal   (Followers: 7, SJR: 0.89, CiteScore: 2)
Forestry: An Intl. J. of Forest Research     Hybrid Journal   (Followers: 16, SJR: 1.133, CiteScore: 3)
Forum for Modern Language Studies     Hybrid Journal   (Followers: 6, SJR: 0.104, CiteScore: 0)
French History     Hybrid Journal   (Followers: 33, SJR: 0.118, CiteScore: 0)
French Studies     Hybrid Journal   (Followers: 20, SJR: 0.148, CiteScore: 0)
French Studies Bulletin     Hybrid Journal   (Followers: 10, SJR: 0.152, CiteScore: 0)
Gastroenterology Report     Open Access   (Followers: 2)
Genome Biology and Evolution     Open Access   (Followers: 13, SJR: 2.578, CiteScore: 4)
Geophysical J. Intl.     Hybrid Journal   (Followers: 35, SJR: 1.506, CiteScore: 3)
German History     Hybrid Journal   (Followers: 23, SJR: 0.161, CiteScore: 0)
GigaScience     Open Access   (Followers: 4, SJR: 5.022, CiteScore: 7)
Global Summitry     Hybrid Journal   (Followers: 1)
Glycobiology     Hybrid Journal   (Followers: 13, SJR: 1.493, CiteScore: 3)
Health and Social Work     Hybrid Journal   (Followers: 56, SJR: 0.388, CiteScore: 1)
Health Education Research     Hybrid Journal   (Followers: 15, SJR: 0.854, CiteScore: 2)
Health Policy and Planning     Hybrid Journal   (Followers: 25, SJR: 1.512, CiteScore: 2)
Health Promotion Intl.     Hybrid Journal   (Followers: 22, SJR: 0.812, CiteScore: 2)
History Workshop J.     Hybrid Journal   (Followers: 31, SJR: 1.278, CiteScore: 1)
Holocaust and Genocide Studies     Hybrid Journal   (Followers: 28, SJR: 0.105, CiteScore: 0)
Human Communication Research     Hybrid Journal   (Followers: 15, SJR: 2.146, CiteScore: 3)
Human Molecular Genetics     Hybrid Journal   (Followers: 8, SJR: 3.555, CiteScore: 5)
Human Reproduction     Hybrid Journal   (Followers: 72, SJR: 2.643, CiteScore: 5)
Human Reproduction Open     Open Access  
Human Reproduction Update     Hybrid Journal   (Followers: 18, SJR: 5.317, CiteScore: 10)
Human Rights Law Review     Hybrid Journal   (Followers: 58, SJR: 0.756, CiteScore: 1)
ICES J. of Marine Science: J. du Conseil     Hybrid Journal   (Followers: 53, SJR: 1.591, CiteScore: 3)
ICSID Review     Hybrid Journal   (Followers: 11)
ILAR J.     Hybrid Journal   (Followers: 2, SJR: 1.732, CiteScore: 4)
IMA J. of Applied Mathematics     Hybrid Journal   (SJR: 0.679, CiteScore: 1)
IMA J. of Management Mathematics     Hybrid Journal   (SJR: 0.538, CiteScore: 1)
IMA J. of Mathematical Control and Information     Hybrid Journal   (Followers: 2, SJR: 0.496, CiteScore: 1)
IMA J. of Numerical Analysis - advance access     Hybrid Journal   (SJR: 1.987, CiteScore: 2)
Industrial and Corporate Change     Hybrid Journal   (Followers: 10, SJR: 1.792, CiteScore: 2)
Industrial Law J.     Hybrid Journal   (Followers: 36, SJR: 0.249, CiteScore: 1)
Inflammatory Bowel Diseases     Hybrid Journal   (Followers: 43, SJR: 2.511, CiteScore: 4)
Information and Inference     Free  
Integrative and Comparative Biology     Hybrid Journal   (Followers: 8, SJR: 1.319, CiteScore: 2)
Interacting with Computers     Hybrid Journal   (Followers: 11, SJR: 0.292, CiteScore: 1)
Interactive CardioVascular and Thoracic Surgery     Hybrid Journal   (Followers: 7, SJR: 0.762, CiteScore: 1)
Intl. Affairs     Hybrid Journal   (Followers: 62, SJR: 1.505, CiteScore: 3)
Intl. Data Privacy Law     Hybrid Journal   (Followers: 25)
Intl. Health     Hybrid Journal   (Followers: 6, SJR: 0.851, CiteScore: 2)
Intl. Immunology     Hybrid Journal   (Followers: 3, SJR: 2.167, CiteScore: 4)
Intl. J. for Quality in Health Care     Hybrid Journal   (Followers: 36, SJR: 1.348, CiteScore: 2)
Intl. J. of Constitutional Law     Hybrid Journal   (Followers: 63, SJR: 0.601, CiteScore: 1)
Intl. J. of Epidemiology     Hybrid Journal   (Followers: 238, SJR: 3.969, CiteScore: 5)
Intl. J. of Law and Information Technology     Hybrid Journal   (Followers: 5, SJR: 0.202, CiteScore: 1)
Intl. J. of Law, Policy and the Family     Hybrid Journal   (Followers: 24, SJR: 0.223, CiteScore: 1)
Intl. J. of Lexicography     Hybrid Journal   (Followers: 10, SJR: 0.285, CiteScore: 1)
Intl. J. of Low-Carbon Technologies     Open Access   (Followers: 1, SJR: 0.403, CiteScore: 1)
Intl. J. of Neuropsychopharmacology     Open Access   (Followers: 3, SJR: 1.808, CiteScore: 4)
Intl. J. of Public Opinion Research     Hybrid Journal   (Followers: 11, SJR: 1.545, CiteScore: 1)
Intl. J. of Refugee Law     Hybrid Journal   (Followers: 38, SJR: 0.389, CiteScore: 1)
Intl. J. of Transitional Justice     Hybrid Journal   (Followers: 11, SJR: 0.724, CiteScore: 2)
Intl. Mathematics Research Notices     Hybrid Journal   (Followers: 1, SJR: 2.168, CiteScore: 1)
Intl. Political Sociology     Hybrid Journal   (Followers: 39, SJR: 1.465, CiteScore: 3)
Intl. Relations of the Asia-Pacific     Hybrid Journal   (Followers: 23, SJR: 0.401, CiteScore: 1)
Intl. Studies Perspectives     Hybrid Journal   (Followers: 9, SJR: 0.983, CiteScore: 1)
Intl. Studies Quarterly     Hybrid Journal   (Followers: 47, SJR: 2.581, CiteScore: 2)
Intl. Studies Review     Hybrid Journal   (Followers: 25, SJR: 1.201, CiteScore: 1)
ISLE: Interdisciplinary Studies in Literature and Environment     Hybrid Journal   (Followers: 2, SJR: 0.15, CiteScore: 0)
ITNOW     Hybrid Journal   (Followers: 1, SJR: 0.103, CiteScore: 0)
J. of African Economies     Hybrid Journal   (Followers: 17, SJR: 0.533, CiteScore: 1)
J. of American History     Hybrid Journal   (Followers: 46, SJR: 0.297, CiteScore: 1)
J. of Analytical Toxicology     Hybrid Journal   (Followers: 14, SJR: 1.065, CiteScore: 2)
J. of Antimicrobial Chemotherapy     Hybrid Journal   (Followers: 15, SJR: 2.419, CiteScore: 4)
J. of Antitrust Enforcement     Hybrid Journal   (Followers: 1)
J. of Applied Poultry Research     Hybrid Journal   (Followers: 5, SJR: 0.585, CiteScore: 1)
J. of Biochemistry     Hybrid Journal   (Followers: 41, SJR: 1.226, CiteScore: 2)
J. of Burn Care & Research     Hybrid Journal   (Followers: 10, SJR: 0.768, CiteScore: 2)
J. of Chromatographic Science     Hybrid Journal   (Followers: 18, SJR: 0.36, CiteScore: 1)
J. of Church and State     Hybrid Journal   (Followers: 11, SJR: 0.139, CiteScore: 0)
J. of Communication     Hybrid Journal   (Followers: 55, SJR: 4.411, CiteScore: 5)
J. of Competition Law and Economics     Hybrid Journal   (Followers: 37, SJR: 0.33, CiteScore: 0)
J. of Complex Networks     Hybrid Journal   (Followers: 2, SJR: 1.05, CiteScore: 4)
J. of Computer-Mediated Communication     Open Access   (Followers: 29, SJR: 2.961, CiteScore: 6)
J. of Conflict and Security Law     Hybrid Journal   (Followers: 12, SJR: 0.402, CiteScore: 0)
J. of Consumer Research     Full-text available via subscription   (Followers: 46, SJR: 5.856, CiteScore: 5)
J. of Crohn's and Colitis     Hybrid Journal   (Followers: 8, SJR: 2.728, CiteScore: 5)

Database : The Journal of Biological Databases and Curation
Journal Prestige (SJR): 1.791
Citation Impact (citeScore): 3
Number of Followers: 8  

  This is an Open Access journal
ISSN (Online) 1758-0463
Published by Oxford University Press  [396 journals]
  • ANCO-GeneDB: annotations and comprehensive analysis of candidate genes for alcohol, nicotine, cocaine and opioid dependence

    • Authors: Hu R; Dai Y, Jia P, et al.
      Abstract: Studies have shown that genetic factors play an important role in the risk of substance addiction and abuse. So far, various genetic and genomic studies have reported the related evidence. These rich, but highly heterogeneous, data provide us with an unprecedented opportunity to systematically collect, curate and assess the genetic and genomic signals from published studies and to perform a comprehensive analysis of their features, functional roles and druggability. Such genetic data resources have been made available for other diseases or phenotypes but not yet for major substance dependence. Here, we report comprehensive data collection and secondary analyses of four phenotypes of dependence: alcohol dependence, nicotine dependence, cocaine dependence and opioid dependence, collectively named Alcohol, Nicotine, Cocaine and Opioid (ANCO) dependence. We built the ANCO-GeneDB, an ANCO-dependence-associated gene resource database. ANCO-GeneDB includes resources from genome-wide association studies and candidate gene-based studies, transcriptomic studies, methylation studies, literature mining and drug-target data, as well as derived data such as spatial–temporal gene expression, promoters, enhancers and expression quantitative trait loci. All associated genes and genetic variants are well annotated using the collected evidence. Based on the collected data, we performed integrative, secondary analyses to prioritize genes, pathways, eQTLs and tissues that are significantly enriched in ANCO-related phenotypes.
      PubDate: Tue, 06 Nov 2018 00:00:00 GMT
      DOI: 10.1093/database/bay121
      Issue No: Vol. 2018, No. 2018 (2018)
  • RTPDB: a database providing associations between genetic variation or expression and cancer prognosis with radiotherapy-based treatment

    • Authors: Zhang C; Yang Y, Chen H, et al.
      Abstract: In recent years, many studies have reported the relationship between genetic variation or expression and cancer prognosis with radiotherapy-based treatment. However, because of limited access to journals and literature databases, inconsistent nomenclature for genetic variations and cancers, and the time required to search and read the literature, a considerable amount of research is hard to find and cite. In this study, we constructed the Radiotherapy Prognosis Database (RTPDB), a comprehensive resource on genes and related cancer prognosis. It includes 775 studies: 275 single nucleotide polymorphism (SNP) studies with 59 765 patients, 261 genes, 708 SNPs, 16 tumors and 16 treatment types, and 500 expression studies with 55 751 patients, 264 genes, 27 tumors and 15 treatment types. The names of genes and their variants were converted and displayed in the form of the official symbol. Detailed information on tumor, treatment and prognosis was classified. We hope RTPDB will be a useful resource with great potential for research on genes, variants and cancer prognosis.
      PubDate: Tue, 30 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay118
      Issue No: Vol. 2018, No. 2018 (2018)
  • Gene ontology concept recognition using named concept: understanding the various presentations of the gene functions in biomedical literature

    • Authors: Yang C; Chiang J.
      Abstract: Objective: A major challenge in precision medicine is the development of patient-specific genetic biomarkers or drug targets. The firsthand information of the genes associated with the pathologic pathways of interest is buried in the ocean of biomedical literature. Gene ontology concept recognition (GOCR) is a biomedical natural language processing task used to extract and normalize the mentions of gene ontology (GO), the controlled vocabulary for gene functions across many species, from biomedical text. The previous GOCR systems, using either rule-based or machine-learning methods, treated GO concepts as separate terms and did not have an efficient way of sharing the common synonyms among the concepts. Materials and Methods: We used the CRAFT corpus in this study. Targeting the compositional structure of the GO, we introduced named concept, the basic conceptual unit which has a conserved name and is used in other complex concepts. Using the named concepts, we separated the GOCR task into dictionary-matching and machine-learning steps. By harvesting the surface names used in the training data, we wildly boosted the synonyms of GO concepts via the connection of the named concepts and then enhanced the capability to recognize more GO concepts in the text. The source code is available at Named concept gene ontology concept recognizer (NCGOCR) achieved 0.804 precision and 0.715 recall by correct recognition of the non-standard mentions of the GO concepts. Discussion: The lack of consensus on GO naming causes diversity in the GO mentions in biomedical manuscripts. The high performance is owed to the stability of the composing GO concepts and the lack of variance in the spelling of named concepts. Conclusion: NCGOCR reduced the arduous work of GO annotation and amended the process of searching for the biomarkers or drug targets, leading to improved biomarker development and greater success in precision medicine.
      PubDate: Mon, 29 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay115
      Issue No: Vol. 2018, No. 2018 (2018)
  • GCDB: a glaucomatous chemogenomics database for in silico drug discovery

    • Authors: Wei Y; Li J, Li B, et al.
      Abstract: Glaucoma is a group of neurodegenerative diseases that can cause irreversible blindness. The current medications, which mainly reduce intraocular pressure to slow the progression of disease, may have local and systemic side effects. Recently, medications with possible neuroprotective effects have attracted much attention. To assist in the identification of new glaucoma drugs, we created a glaucomatous chemogenomics database (GCDB; in which various glaucoma-related chemogenomics data records are assembled, including 275 genes, 105 proteins, 83 approved or clinical trial drugs, 90 206 chemicals associated with 213 093 records of reported bioactivities from 22 324 corresponding bioassays and 5630 references. Moreover, an improved chemical similarity ensemble approach computational algorithm was incorporated in the GCDB to identify new targets and design new drugs. Further, we demonstrated the application of GCDB in a case study screening two chemical libraries, Maybridge and Specs, to identify interactions between small molecules and glaucoma-related proteins. Finally, six and four compounds were selected from the final hits for in vitro human glucocorticoid receptor (hGR) and adenosine A3 receptor (A3AR) inhibitory assays, respectively. Of these compounds, six were shown to have inhibitory activities against hGR, with IC50 values ranging from 2.92 to 28.43 μM, whereas one compound showed inhibitory activity against A3AR, with an IC50 of 6.15 μM. Overall, GCDB will be helpful in target identification and glaucoma chemogenomics data exchange and sharing, and will facilitate drug discovery for glaucoma treatment.
      PubDate: Mon, 29 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay117
      Issue No: Vol. 2018, No. 2018 (2018)
  • YHMI: a web tool to identify histone modifications and histone/chromatin regulators from a gene list in yeast

    • Authors: Wu W; Tu H, Chu Y, et al.
      Abstract: Post-translational modifications of histones (e.g. acetylation, methylation, phosphorylation and ubiquitination) play crucial roles in regulating gene expression by altering chromatin structures and creating docking sites for histone/chromatin regulators. However, the combination patterns of histone modifications, regulatory proteins and their corresponding target genes remain incompletely understood. Therefore, it is advantageous to have a tool for the enrichment/depletion analysis of histone modifications and histone/chromatin regulators from a gene list. Many ChIP-chip/ChIP-seq datasets of histone modifications and histone/chromatin regulators in yeast can be found in the literature. This need, together with the available data, motivated us to develop a web tool, called Yeast Histone Modifications Identifier (YHMI), which can identify the enriched/depleted histone modifications and the enriched histone/chromatin regulators from a list of yeast genes. Both tables and figures are provided to visualize the identification results. Finally, the high quality and biological insight of the identification results are demonstrated by two case studies. We believe that YHMI is a valuable tool for yeast biologists doing epigenetics research.
      PubDate: Mon, 29 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay116
      Issue No: Vol. 2018, No. 2018 (2018)
  • VianniaTopes: a database of predicted immunogenic peptides for Leishmania (Viannia) species

    • Authors: Llanes A; Restrepo C, Lleonart R.
      Abstract: Leishmania is a protozoan parasite causing several disease presentations collectively known as leishmaniasis. Pathogenic species of Leishmania are divided into two subgenera, L. (Leishmania) and L. (Viannia). Species belonging to the Viannia subgenus have only been reported in Central and South America. These species predominantly cause cutaneous leishmaniasis, but in some cases, parasites can migrate to the nasopharyngeal area and cause a highly disfiguring mucocutaneous presentation. Despite intensive efforts, no effective antileishmanial vaccine is available for use in humans, although a few candidates mainly designed for L. (Leishmania) species are now in clinical trials. After sequencing the genome of Leishmania panamensis, we noticed a high degree of sequence divergence among several orthologous proteins from both subgenera. Consequently, some of the previously published candidates may not work properly for species of the Viannia subgenus. To help in vaccine design, we predicted CD4+ and CD8+ T cell epitopes in the theoretical proteomes of four strains belonging to the Viannia subgenus. Prediction was performed with at least two independent bioinformatics tools, using the most frequent human major histocompatibility complex (MHC) class I and class II alleles in the affected geographic area. Although predictions resulted in millions of peptides, relatively few of them were predicted to bind to several MHC alleles and can therefore be considered promiscuous epitopes. Comparison of our results to previous applications to species of the Leishmania subgenus confirmed that approximately half of the reported candidates are not present in Viannia proteins with a threshold of 80% sequence similarity and coverage. However, our prediction methodology was able to predict 70–100% of the candidates that could be found in Viannia. All the prediction data generated in this study are publicly available in an interactive database called VianniaTopes.
      PubDate: Thu, 25 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay111
      Issue No: Vol. 2018 (2018)
  • LPTK: a linguistic pattern-aware dependency tree kernel approach for the BioCreative VI CHEMPROT task

    • Authors: Warikoo N; Chang Y, Hsu W.
      Abstract: Identifying the interactions between chemical compounds and genes from the biomedical literature is one of the frequently discussed topics of text mining in the life science field. In this paper, we describe Linguistic Pattern-Aware Dependency Tree Kernel, a linguistic interaction pattern learning method developed for the CHEMPROT task of BioCreative VI, to capture chemical–protein interaction (CPI) patterns within the biomedical literature. We also introduce a framework to integrate these linguistic patterns with a smooth partial tree kernel to extract the CPIs. This new method of feature representation models aspects of linguistic probability in geometric representation, which not only optimizes the sufficiency of feature dimension for classification, but also defines features as interpretable contexts rather than long vectors of numbers. In order to test the robustness and efficiency of our system in identifying different kinds of biological interactions, we evaluated our framework on three separate data sets, i.e. the CHEMPROT corpus, the Chemical–Disease Relation corpus and the Protein–Protein Interaction corpus. The corresponding experimental results demonstrate that our method is effective and outperforms several compared systems for each data set.
      PubDate: Mon, 22 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay108
      Issue No: Vol. 2018 (2018)
  • MGH: a genome hub for the medicinal plant maca (Lepidium meyenii)

    • Authors: Chen J; Zhang J, Lin M, et al.
      Abstract: Maca (Lepidium meyenii), a Brassicaceae herb originating from the Andean mountains, has attracted wide interest due to its unique health benefits in reproduction and fertility. Because of its adaptation to the harsh high-altitude environment at 4000 m, maca is attracting more and more attention from both crop breeders and basic biologists. After our previous release of the maca genome sequence, there is a growing need to store, query, analyze and integrate various maca resources efficiently. Here, we created Maca Genome Hub (MGH), a genomics and genetics database of maca. Currently, the MGH V1.0 harbors the genome sequence, predicted coding sequences and protein sequences, various annotations, markers and expression data. For the maca research community, we also provided the publications, researchers and related news. MGH is designed to give users easy access to analyze, retrieve and visualize the genomic or genetic information through a series of online tools, including the Basic Local Alignment Search Tool, JBrowse, the query system, the synteny tool and the data downloads. These integrated heterogeneous data, tools and interfaces in MGH allow efficient mining of the latest genomics and genetics data. We hope that MGH will accelerate research and development in maca.
      PubDate: Fri, 19 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay113
      Issue No: Vol. 2018 (2018)
  • GFDP: the gene family database in poplar

    • Authors: Wang H; Yan H, Liu H, et al.
      Abstract: A gene family is formed by duplication of a single original gene. Poplar trees (genus Populus) are important, principally because of their ecological and economic benefits, and are one of the most widely distributed and adaptable trees in the world. Systematic identification and annotation of gene family members are primary steps in studying the function and evolution of poplar genomes. Here, we describe the construction of the Gene Family Database in Poplar (GFDP), which contains information that systematically describes 6551 genes distributed in 145 gene families. GFDP is designed to present important biological information, such as gene structure, protein length, isoelectric point and functional and evolutionary information, using highly visual displays. Data and graphs are visualized by a web-based interface. Users can browse and download data through all the major browsers. GFDP provides a comprehensive platform with a solid foundation for further study of poplar gene families. GFDP is freely available.
      PubDate: Fri, 19 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay107
      Issue No: Vol. 2018 (2018)
  • AutismKB 2.0: a knowledgebase for the genetic evidence of autism spectrum

    • Authors: Yang C; Li J, Wu Q, et al.
      Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder with strong genetic contributions. To provide a comprehensive resource for the genetic evidence of ASD, we have updated the Autism KnowledgeBase (AutismKB) to version 2.0. AutismKB 2.0 integrates multiscale genetic data on 1379 genes, 5420 copy number variations and structural variations, 11 669 single-nucleotide variations or small insertions/deletions (SNVs/indels) and 172 linkage regions. In particular, AutismKB 2.0 highlights 5669 de novo SNVs/indels due to their significant contribution to ASD genetics and includes 789 mosaic variants due to their recently discovered contributions to ASD pathogenesis. The genes and variants are annotated extensively with genetic evidence and clinical evidence. To help users fully understand the functional consequences of SNVs and small indels, we provided comprehensive predictions of pathogenicity with iFish, SIFT, Polyphen etc. To improve user experiences, the new version incorporates multiple query methods, including simple query, advanced query and batch query. It also functionally integrates two analytical tools to help users perform downstream analyses, including a gene ranking tool and an enrichment analysis tool, KOBAS. AutismKB 2.0 is freely available and can be a valuable resource for researchers.
      PubDate: Thu, 18 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay106
      Issue No: Vol. 2018 (2018)
  • Medicinal Materials DNA Barcode Database (MMDBD) version 1.5—one-stop
           solution for storage, BLAST, alignment and primer design

    • Authors: Wong T; But G, Wu H, et al.
      Abstract: Authentication of medicinal materials by deoxyribonucleic acid (DNA) technology is gaining popularity. In 2010, our team created Medicinal Materials DNA Barcode Database (MMDBD) version 1.0 to provide an interactive database for documenting DNA barcode sequences of medicinal materials. This database now contains DNA barcode sequences of medicinal materials listed in the Chinese Pharmacopoeia, Dietary Supplements Compendium and Herbal Medicine Compendium of the US Pharmacopoeia and selected adulterants. The data archive is regularly updated and currently it stores 62 011 DNA sequences of 2111 medicinal materials. Our team has recently completed a major improvement of the interfaces and incorporated essential bioinformatics tools to facilitate the authentication work. MMDBD version 1.5 contains detailed information on each medicinal material including its material names, medical part, pharmacopeia information, biological classification in rank of family and status on the Convention on International Trade in Endangered Species of Wild Fauna and Flora and the International Union for Conservation of Nature’s Red List of Threatened Species, if any. DNA sequences can be retrieved by searching on Latin scientific name, Chinese name, family name, material name, medical part and simplified Chinese character stroke. A ‘BLAST’-based engine for searching DNA sequences is included in MMDBD version 1.5. Since primer design is a key step in DNA barcoding authentication, we have integrated the ‘Clustal Omega alignment tool’ and ‘Primer3’ in the form of a web interface. These new tools facilitate multiple sequence comparison and the design of primers for amplification of a target DNA barcode region, allowing DNA barcoding authentication.
      PubDate: Thu, 18 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay112
      Issue No: Vol. 2018 (2018)
  • ZFLNC: a comprehensive and well-annotated database for zebrafish lncRNA

    • Authors: Hu X; Chen W, Li J, et al.
      Abstract: There is emerging evidence showing that lncRNAs can be involved in various critical biological processes. Zebrafish is a fully developed model system being used in a variety of basic research and biomedical studies. Hence, it is an ideal model organism to study the functions and mechanisms of lncRNAs. Here, we constructed ZFLNC—a comprehensive database of zebrafish lncRNA that is dedicated to providing a zebrafish-based platform for deep exploration of zebrafish lncRNAs and their mammalian counterparts to the relevant academic communities. The main data resources of lncRNAs in this database come from the NCBI, Ensembl, NONCODE, zflncRNApedia and literature. We also obtained lncRNAs as a supplement by analysing RNA-Seq datasets from the SRA database. With these lncRNAs, we further carried out expression profiling, co-expression network prediction, Gene Ontology (GO)/Kyoto Encyclopedia of Genes and Genomes (KEGG)/Online Mendelian Inheritance in Man (OMIM) annotation and conservation analysis. As far as we know, ZFLNC is the most comprehensive and well-annotated database for zebrafish lncRNA.
      PubDate: Thu, 18 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay114
      Issue No: Vol. 2018 (2018)
  • Overview of the BioCreative VI text-mining services for Kinome Curation

    • Authors: Gobeill J; Gaudet P, Dopp D, et al.
      Abstract: The text-mining services for kinome curation track, part of BioCreative VI, proposed a competition to assess the effectiveness of text mining to perform literature triage. The track has exploited an unpublished curated data set from the neXtProt database. This data set contained comprehensive annotations for 300 human protein kinases. For a given protein and a given curation axis [diseases or gene ontology (GO) biological processes], participants’ systems had to identify and rank relevant articles in a collection of 5.2 M MEDLINE citations (task 1) or 530 000 full-text articles (task 2). Explored strategies comprised named-entity recognition and machine-learning frameworks. For that latter approach, participants developed methods to derive a set of negative instances, as the databases typically do not store articles that were judged as irrelevant by curators. The supervised approaches proposed by the participating groups achieved significant improvements compared to the baseline established in a previous study and compared to a basic PubMed search.
      PubDate: Wed, 17 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay104
      Issue No: Vol. 2018 (2018)
  • siAbasic: a comprehensive database for potent siRNA-6Ø sequences
           without off-target effects

    • Authors: Park J; Ahn S, Cho K, et al.
      Abstract: Small interfering RNA (siRNA) is widely used to specifically silence target gene expression, but its microRNA (miRNA)-like function inevitably suppresses hundreds of off-targets. Recently, complete elimination of the off-target repression has been achieved by introducing an abasic nucleotide to the pivot (position 6; siRNA-6Ø), of which impaired base pairing destabilizes transitional nucleation (positions 2–6). However, siRNA-6Ø varied in its conservation of on-target activity (∼80–100%), demanding bioinformatics to discover the principles underlying its on-target efficiency. Analyses of miRNA–target interactions (Ago HITS-CLIP) showed that the stability of transitional nucleation correlated with the target affinity of RNA interference. Furthermore, interrogated analyses of siRNA screening efficiency, experimental data and broadly conserved miRNA sequences showed that the free energy of transitional nucleation (positions 2–5) in siRNA-6Ø required the range of stability for effective on-target activity (−6 ≤ ΔG[2:5] ≤ −3.5 kcal mol−1). Taking into consideration of these features together with locations, guanine-cytosine content (GC content), nucleotide stretches, single nucleotide polymorphisms and repetitive elements, we implemented a database named ‘siAbasic’ that provided the list of potent siRNA-6Ø sequences for most of human and mouse genes (≥ ∼95%), wherein we experimentally validated some of their therapeutic potency. siAbasic will aid to ensure potency of siRNA-6Ø sequences without concerning off-target effects for experimental and clinical purposes.
      PubDate: Fri, 12 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay109
      Issue No: Vol. 2018 (2018)
  • PVsiRNAdb: a database for plant exclusive virus-derived small interfering

    • Authors: Gupta N; Zahra S, Singh A, et al.
      Abstract: Ribonucleic acids (RNA) interference mechanism has been proved to be an important regulator of both transcriptional and post-transcription controls of gene expression during biotic and abiotic stresses in plants. Virus-derived small interfering RNAs (vsiRNAs) are established components of the RNA silencing mechanism for incurring anti-viral resistance in plants. Some databases like siRNAdb, HIVsirDB and VIRsiRNAdb are available online pertaining to siRNAs as well as vsiRNAs generated during viral infection in humans; however, currently there is a lack of repository for plant exclusive vsiRNAs. We have developed ‘PVsiRNAdb’, a manually curated plant-exclusive database harboring information related to vsiRNAs found in different virus-infected plants collected by exhaustive data mining of published literature so far. This database contains a total of 322 214 entries and 282 549 unique sequences of vsiRNAs. In PVsiRNAdb, detailed and comprehensive information is available for each vsiRNA sequence. Apart from the core information consisting of plant, tissue, virus name and vsiRNA sequence, additional information of each vsiRNAs (map position, length, coordinates, strand information and predicted structure) may be of high utility to the user. Different types of search and browse modules with three different tools namely BLAST, Smith–Waterman Align and Mapping are provided at PVsiRNAdb. Thus, this database being one of its kind will surely be of much use to molecular biologists for exploring the complex viral genetics and genomics, viral–host interactions and beneficial to the scientific community and can prove to be very advantageous in the field of agriculture for producing viral resistance transgenic crops.
      PubDate: Thu, 11 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay105
      Issue No: Vol. 2018 (2018)
  • Extracting chemical–protein relations using attention-based neural

    • Authors: Liu S; Shen F, Komandur Elayavilli R, et al.
      Abstract: Relation extraction is an important task in the field of natural language processing. In this paper, we describe our approach for the BioCreative VI Task 5: text mining chemical–protein interactions. We investigate multiple deep neural network (DNN) models, including convolutional neural networks, recurrent neural networks (RNNs) and attention-based (ATT-) RNNs (ATT-RNNs) to extract chemical–protein relations. Our experimental results indicate that ATT-RNN models outperform the same models without using attention and the ATT-gated recurrent unit (ATT-GRU) achieves the best performing micro average F1 score of 0.527 on the test set among the tested DNNs. In addition, the result of word-level attention weights also shows that attention mechanism is effective on selecting the most important trigger words when trained with semantic relation labels without the need of semantic parsing and feature engineering. The source code of this work is available at
      PubDate: Mon, 08 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay102
      Issue No: Vol. 2018 (2018)
  • Toward a service-based workflow for automated information extraction from
           herbarium specimens

    • Authors: Kirchhoff A; Bügel U, Santamaria E, et al.
      Abstract: Over the past years, herbarium collections worldwide have started to digitize millions of specimens on an industrial scale. Although the imaging costs are steadily falling, capturing the accompanying label information is still predominantly done manually and is developing into the principal cost factor. In order to streamline the process of capturing herbarium specimen metadata, we specified a formal extensible workflow integrating a wide range of automated specimen image analysis services. We implemented the workflow on the basis of OpenRefine together with a plugin for handling service calls and responses. The evolving system presently covers the generation of optical character recognition (OCR) from specimen images, the identification of regions of interest in images and the extraction of meaningful information items from OCR. These implementations were developed as part of the Deutsche Forschungsgemeinschaft-funded project ‘A standardised and optimised process for data acquisition from digital images of herbarium specimens’ (StanDAP-Herb).
      PubDate: Mon, 08 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay103
      Issue No: Vol. 2018 (2018)
  • A survey of ontology learning techniques and applications

    • Authors: Asim M; Wasim M, Khan M, et al.
      Abstract: Ontologies have gained a lot of popularity and recognition in the semantic web because of their extensive use in Internet-based applications. Ontologies are often considered a fine source of semantics and interoperability in all artificially smart systems. Exponential increase in unstructured data on the web has made automated acquisition of ontology from unstructured text a most prominent research area. Several methodologies exploiting numerous techniques of various fields (machine learning, text mining, knowledge representation and reasoning, information retrieval and natural language processing) are being proposed to bring some level of automation in the process of ontology acquisition from unstructured text. This paper describes the process of ontology learning and further classification of ontology learning techniques into three classes (linguistics, statistical and logical) and discusses many algorithms under each category. This paper also explores ontology evaluation techniques by highlighting their pros and cons. Moreover, it describes the scope and use of ontology learning in several industries. Finally, the paper discusses challenges of ontology learning along with their corresponding future directions.
      PubDate: Fri, 05 Oct 2018 00:00:00 GMT
      DOI: 10.1093/database/bay101
      Issue No: Vol. 2018 (2018)
  • AGD: Aneurysm Gene Database

    • Authors: Sun R; Cui C, Zhou Y, et al.
      Abstract: An aneurysm is an outward bulge on an arterial wall. Aneurysms are becoming a serious public health concern as the worldwide population ages. Unfortunately, no effective drugs have been developed for aneurysms to date. In addition, aneurysms may be associated with grave prognosis due to conditions such as ruptures and recurrence. Altogether, these factors make earlier aneurysm prevention, diagnosis and intervention strategies even more important. A bioinformatics resource for aneurysm-associated molecules would be helpful for addressing the above issues; however, such a tool is not yet available. In this study, we developed Aneurysm Gene Database (AGD) for the above purpose. AGD contains 1472 aneurysm-gene associations, including 29 types of aneurysms, 967 protein-coding genes, 29 miRNAs, 6 lncRNAs and several other types of molecules. Users can search, browse and download content in AGD. We believe that AGD is a valuable resource that can help us better understand aneurysms and discover novel treatment targets.
      PubDate: Wed, 26 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay100
      Issue No: Vol. 2018 (2018)
  • Large-scale automated machine reading discovers new cancer-driving

    • Authors: Valenzuela-Escárcega M; Babur Ö, Hahn-Powell G, et al.
      Abstract: PubMed, a repository and search engine for biomedical literature, now indexes >1 million articles each year. This exceeds the processing capacity of human domain experts, limiting our ability to truly understand many diseases. We present Reach, a system for automated, large-scale machine reading of biomedical papers that can extract mechanistic descriptions of biological processes with relatively high precision at high throughput. We demonstrate that combining the extracted pathway fragments with existing biological data analysis algorithms that rely on curated models helps identify and explain a large number of previously unidentified mutually exclusive altered signaling pathways in seven different cancer types. This work shows that combining human-curated ‘big mechanisms’ with extracted ‘big data’ can lead to a causal, predictive understanding of cellular processes and unlock important downstream applications.
      PubDate: Wed, 26 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay098
      Issue No: Vol. 2018 (2018)
  • Tripal Developer Toolkit

    • Authors: Condon B; Almsaeed A, Chen M, et al.
      Abstract: Tripal is a community database construction toolkit utilizing the content management system Drupal. Tripal is used to make biological, genetic and genomic data more discoverable, shareable, searchable and standardized. As funding for community-level genomics databases declines, Tripal’s open-source codebase provides a means for sites to be built and maintained with a minimal investment in staff and new development. Tripal is ultimately as strong as the community of sites and developers that use it. We present a set of developer tools that will make building and maintaining Tripal 3 sites easier for new and returning users. These tools break down barriers to entry such as setting up developer and testing environments, acquiring and loading test datasets, working with controlled vocabulary terms and writing new Drupal classes.
      PubDate: Thu, 20 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay099
      Issue No: Vol. 2018 (2018)
  • Document triage for identifying protein–protein interactions affected by
           mutations: a neural network ensemble approach

    • Authors: Luo L; Yang Z, Lin H, et al.
      Abstract: The precision medicine (PM) initiative promises to identify individualized treatment depending on a patient’s genetic profile and their related responses. In order to help health professionals and researchers in the PM endeavor, BioCreative VI organized a PM Track to mine protein–protein interactions (PPI) affected by genetic mutations from the biomedical literature. In this paper, we present a neural network ensemble approach to identify relevant articles describing PPI affected by mutations. In this approach, several neural network models are used for document triage, and the ensemble performs better than any individual model. In the official runs, our best submission achieves an F-score of 69.04% in the BioCreative VI PM document triage task. After post-challenge analysis, to address the problem of the limited size of the training set, a PPI pre-trained module is incorporated into our approach to further improve the performance. Finally, our best ensemble method achieves an F-score of 71.04% on the test set.
      PubDate: Wed, 19 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay097
      Issue No: Vol. 2018 (2018)
  • Wide-scope biomedical named entity recognition and normalization with
           CRFs, fuzzy matching and character level modeling

    • Authors: Kaewphan S; Hakala K, Miekka N, et al.
      Abstract: We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under open license.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay096
      Issue No: Vol. 2018 (2018)
  • IsopiRBank: a research resource for tracking piRNA isoforms

    • Authors: Zhang H; Ali A, Gao J, et al.
      Abstract: PIWI-interacting RNAs (piRNAs) are essential for transcriptional and post-transcriptional regulation of transposons and coding genes in germline. With the development of sequencing technologies, length variations of piRNAs have been identified in several species. However, the extent to which piRNA isoforms exist, and whether these isoforms are functionally distinct from canonical piRNAs, remain uncharacterized. Through data mining from 2154 datasets of small RNA sequencing data from four species (Homo sapiens, Mus musculus, Danio rerio and Drosophila melanogaster), we have identified 8 749 139 piRNA isoforms from 175 454 canonical piRNAs, and classified them on the basis of variations on the 5′ or 3′ end via the alignment of isoforms with the canonical sequence. We thus established a database named IsopiRBank. Each isoform has detailed annotation as follows: normalized expression data, classification, spatiotemporal expression data and genome origin. Users can also select isoforms of interest for further analysis, including target prediction and enrichment analysis. Taken together, IsopiRBank is an interactive database that aims to present the first integrated resource of piRNA isoforms, and broaden the research of piRNA biology. IsopiRBank can be accessed without any registration or login requirement. Database URL:
      PubDate: Tue, 28 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay059
      Issue No: Vol. 2018 (2018)
  • SDADB: a functional annotation database of protein structural domains

    • Authors: Zeng C; Zhan W, Deng L.
      Abstract: Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between gene ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings. The domain-protein mappings are computed using large-scale structure alignment. SDADB contains ontological terms with probabilistic scores for more than 214 000 distinct SCOP domains. It also provides additional features including 3D structure alignment visualization, GO hierarchical tree view, search, browse and download options. Database URL:
      PubDate: Tue, 28 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay064
      Issue No: Vol. 2018 (2018)
  • Hierarchical bi-directional attention-based RNNs for supporting document
           classification on protein–protein interactions affected by genetic

    • Authors: Fergadis A; Baziotis C, Pappas D, et al.
      Abstract: In this paper, we describe a hierarchical bi-directional attention-based Recurrent Neural Network (RNN) as a reusable sequence encoder architecture, which is used as sentence and document encoder for document classification. The sequence encoder is composed of two bi-directional RNNs equipped with an attention mechanism that identifies and captures the most important elements, words or sentences, in a document, followed by a dense layer for the classification task. Our approach utilizes the hierarchical nature of documents, which are composed of sequences of sentences, which in turn are composed of sequences of words. In our model, we use word embeddings to project the words to a low-dimensional vector space. We leverage word embeddings trained on PubMed for initializing the embedding layer of our network. We apply this model to biomedical literature, specifically to paper abstracts published in PubMed. We argue that the title of the paper itself usually contains important information more salient than a typical sentence in the abstract. For this reason, we propose a shortcut connection that integrates the title vector representation directly to the final feature representation of the document. We concatenate the sentence vector that represents the title and the vectors of the abstract to the document feature vector used as input to the task classifier. With this system we participated in the Document Triage Task of the BioCreative VI Precision Medicine Track and we achieved 0.6289 Precision, 0.7656 Recall and 0.6906 F1-score, with the Precision and F1-score ranking first among the participating systems. Database URL:
      PubDate: Tue, 21 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay076
      Issue No: Vol. 2018 (2018)
  • HDncRNA: a comprehensive database of non-coding RNAs associated with heart

    • Authors: Wang W; Wang Y, Hu Y, et al.
      Abstract: Heart diseases (HDs) represent a common group of diseases that involve the heart, a number of which are characterized by high morbidity and lethality. Recently, increasing evidence demonstrates that diverse non-coding RNAs (ncRNAs) play critical roles in HDs. However, a systematic investigation of the association between HDs and ncRNAs is currently lacking. Here, we developed a Heart Disease-related Non-coding RNAs Database (HDncRNA), to curate the HDs-ncRNA associations from 3 different sources including 1904 published articles, 3 existing databases [the Human microRNA Disease Database (HMDD), miR2disease and lncRNAdisease] and 5 RNA-seq datasets. The HDs-ncRNA associations with experimental validations curated from these articles, HMDD, miR2disease and part of the data from lncRNAdisease were defined as ‘direct evidence’. Relationships obtained from high-throughput data in lncRNAdisease and annotated differentially expressed lncRNAs from RNA-seq data were defined as ‘high-throughput associations’. Novel lncRNAs identified from RNA-seq data in HDs had the least credibility and were defined as ‘predicted associations’. Currently, the database contains 2304 HDs-ncRNA associations for 133 HDs in 6 species including human, mouse, rat, pig, calf and dog. The database also has the following features: (i) a user-friendly web interface for browsing and searching the data; (ii) a visualization tool to plot miRNA and lncRNA locations in the human and mouse genomes; (iii) information about neighboring genes of lncRNAs and (iv) links to some mainstream databases including miRbase, Ensemble and Fantom Cat for the annotated lncRNAs and miRNAs. In summary, HDncRNA provides an excellent platform for exploring HD-related ncRNAs. Database URL:
      PubDate: Tue, 24 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay067
      Issue No: Vol. 2018 (2018)
  • mutTCPdb: a comprehensive database for genomic variants of a tropical
           country neglected disease—tropical calcific pancreatitis

    • Authors: Singh G; Bhat B, Jayadev M, et al.
      Abstract: Tropical calcific pancreatitis (TCP) is a juvenile, non-alcoholic form of chronic pancreatitis with its exclusive presence in tropical regions associated with the low economic status. TCP initiates in the childhood itself and then proliferates silently. mutTCPdb is a manually curated and comprehensive disease specific single nucleotide variant (SNV) database. Extensive search strategies were employed to create a repository while SNV information was collected from published articles. Several existing databases such as the dbSNP, Uniprot, miRTarBase2.0, HGNC, PFAM, KEGG, PROSITE, MINT, BIOGRID 3.4 and Ensemble Genome Browser 87 were queried to collect information specific to the gene. mutTCPdb is running on the XAMPP web server with MYSQL database in the backend for data storage and management. Currently, the mutTCPdb enlists 100 variants of all 11 genes identified in TCP, out of which 45 are non-synonymous (missense, nonsense, deletions and insertions), 46 are present in non-coding regions (UTRs, promoter region and introns) and 9 are synonymous variants. The database is highly curated for disease-specific gene variants and provides complete information on function, transcript information, pathways, interactions, miRNAs and PubMed references along with remarks. It is an informative portal for clinicians and researchers for a better understanding of the disease, as it may help in identifying novel targets and diagnostic markers, hence, can be a source to improve the strategies for TCP management. Database URL:
      PubDate: Tue, 24 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay043
      Issue No: Vol. 2018 (2018)
  • ChemDIS 2: an update of chemical-disease inference system

    • Authors: Tung C; Wang S.
      Abstract: Computational inference of affected functions, pathways and diseases for chemicals could largely accelerate the evaluation of potential effects of chemical exposure on human beings. Previously, we have developed a ChemDIS system utilizing information of interacting targets for chemical-disease inference. With the target information, testable hypotheses can be generated for experimental validation. In this work, we present an update of ChemDIS 2 system featured with more updated datasets and several new functions, including (i) custom enrichment analysis function for single omics data; (ii) multi-omics analysis function for joint analysis of multi-omics data; (iii) mixture analysis function for the identification of interaction and overall effects; (iv) web application programming interface (API) for programmed access to ChemDIS 2. The updated ChemDIS 2 system capable of analyzing more than 430 000 chemicals is expected to be useful for both drug development and risk assessment of environmental chemicals. Database URL: ChemDIS 2 is freely accessible via
      PubDate: Mon, 23 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay077
      Issue No: Vol. 2018 (2018)
  • Extracting chemical–protein relations with ensembles of SVM and deep
           learning models

    • Authors: Peng Y; Rios A, Kavuluru R, et al.
      Abstract: Mining relations between chemicals and proteins from the biomedical literature is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect the chemical–protein relations in running text (PubMed abstracts). This work describes our CHEMPROT track entry, which is an ensemble of three systems, including a support vector machine, a convolutional neural network, and a recurrent neural network. Their output is combined using majority voting or stacking for final predictions. Our CHEMPROT system obtained 0.7266 in precision and 0.5735 in recall for an F-score of 0.6410 during the challenge, demonstrating the effectiveness of machine learning-based approaches for automatic relation extraction from biomedical literature and achieving the highest performance in the task during the 2017 challenge. Database URL:
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay073
      Issue No: Vol. 2018 (2018)
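The abstract above combines three classifiers by majority voting and reports precision, recall and F-score. The following is a minimal sketch of those two ideas only, not the authors' system: the function names and the example labels are hypothetical, and the F-score is assumed to be the usual harmonic mean of precision and recall.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among per-model predictions.

    Ties are broken in favour of the label that appears first in the list.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels from an SVM, a CNN and an RNN for one candidate pair.
votes = ["CPR:3", "CPR:3", "NONE"]
print(majority_vote(votes))

# Precision and recall as reported in the abstract.
print(round(f_score(0.7266, 0.5735), 4))
```

With the reported precision and recall, the harmonic mean comes out at about 0.641, consistent with the F-score quoted in the abstract. Stacking, the alternative combination method the abstract mentions, would replace `majority_vote` with a trained meta-classifier and is not shown here.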
  • realDB: a genome and transcriptome resource for the red algae (phylum

    • Authors: Chen F; Zhang J, Chen J, et al.
      Abstract: With over 6000 species in seven classes, red algae (Rhodophyta) have diverse economic, ecological, experimental and evolutionary values. However, red algae are usually absent or rare in comparative analyses because genomic information of this phylum is often under-represented in various comprehensive genome databases. To improve the accessibility to the ome data and omics tools for red algae, we provided 10 genomes and 27 transcriptomes representing all seven classes of Rhodophyta. Three genomes and 18 transcriptomes were de novo assembled and annotated in this project. A user-friendly BLAST suite, JBrowse tools and a search system were developed for online analyses. Detailed introductions to red algae taxonomy and the sequencing status are also provided. In conclusion, realDB provides a platform covering the most genome and transcriptome data for red algae and a suite of tools for online analyses, and will attract both red algal biologists and those working on plant ecology, evolution and development. Database URL:
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay072
      Issue No: Vol. 2018 (2018)
  • RabGTD: a comprehensive database of rabbit genome and transcriptome

    • Authors: Zhou L; Xiao Q, Bi J, et al.
      Abstract: The rabbit is a very important species for both biomedical research and agricultural animal breeding. Rabbits are not only the most-used experimental animals for the production of antibodies, but are also widely used for studying a variety of human diseases. Here we developed RabGTD, the first comprehensive rabbit database containing both genome and transcriptome data generated by next-generation sequencing. Genomic variations from 79 samples were identified and annotated, including 33 samples of wild rabbits and 46 samples of domestic rabbits from diverse populations. Gene expression profiles of 86 tissue samples were compiled, including those from the most commonly used models for hyperlipidemia and atherosclerosis. RabGTD is a web-based and open-access resource, which also provides convenient functions and friendly interfaces for searching, browsing and downloading so that users can explore the big data. Database URL:
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay075
      Issue No: Vol. 2018 (2018)
  • Leveraging prior knowledge for protein–protein interaction
           extraction with memory network

    • Authors: Zhou H; Liu Z, Ning S, et al.
      Abstract: Automatically extracting protein–protein interactions (PPIs) from biomedical literature provides additional support for precision medicine efforts. This paper proposes a novel memory network-based model (MNM) for PPI extraction, which leverages prior knowledge about protein–protein pairs with memory networks. The proposed MNM captures important context clues related to knowledge representations learned from knowledge bases. Both entity embeddings and relation embeddings of prior knowledge are effective in improving the PPI extraction model, leading to a new state-of-the-art performance on the BioCreative VI PPI dataset. The paper also shows that multiple computational layers over an external memory are superior to long short-term memory networks with the local memories.Database URL:
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay071
      Issue No: Vol. 2018 (2018)
  • A Field Sensor: computing the composition and intent of PubMed queries

    • Authors: Yeganova L; Kim W, Comeau D, et al.
      Abstract: PubMed® is a search engine providing access to a collection of over 27 million biomedical bibliographic records as of 2017. PubMed processes millions of queries a day, and understanding these queries is one of the main building blocks for successful information retrieval. In this work, we present Field Sensor, a domain-specific tool for understanding the composition and predicting the user intent of PubMed queries. Given a query, the Field Sensor infers a field for each token or sequence of tokens in a query in multi-step process that includes syntactic chunking, rule-based tagging and probabilistic field prediction. In this work, the fields of interest are those associated with (meta-)data elements of each PubMed record such as article title, abstract, author name(s), journal title, volume, issue, page and date. We evaluate the accuracy of our algorithm on a human-annotated corpus of 10 000 PubMed queries, as well as a new machine-annotated set of 103 000 PubMed queries. The Field Sensor achieves an accuracy of 93 and 91% on the two corresponding corpora and finds that nearly half of all searches are navigational (e.g. author searches, article title searches etc.) and half are informational (e.g. topical searches). The Field Sensor has been integrated into PubMed since June 2017 to detect informational queries for which results sorted by relevance can be suggested as an alternative to those sorted by the default date sort. In addition, the composition of PubMed queries as computed by the Field Sensor proves to be essential for understanding how users query PubMed.
      PubDate: Thu, 12 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay052
      Issue No: Vol. 2018 (2018)
  • Improving the learning of chemical-protein interactions from literature
           using transfer learning and specialized word embeddings

    • Authors: Corbett P; Boyle J.
      Abstract: In this paper, we explore the application of artificial neural network (‘deep learning’) methods to the problem of detecting chemical-protein interactions in PubMed abstracts. We present here a system using multiple Long Short Term Memory layers to analyse candidate interactions, to determine whether there is a relation and which type. A particular feature of our system is the use of unlabelled data, both to pre-train word embeddings and also pre-train LSTM layers in the neural network. On the BioCreative VI CHEMPROT test corpus, our system achieves an F score of 61.51% (56.10% precision, 67.84% recall).
      PubDate: Thu, 12 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay066
      Issue No: Vol. 2018 (2018)
  • A set of domain rules and a deep network for protein coreference

    • Authors: Li C; Rao Z, Zheng Q, et al.
      Abstract: Current research in bio-text mining mainly focuses on event extraction. Biological networks present much richer and more meaningful information to biologists than events. Bio-entity coreference resolution (CR) is a very important method to complete a bio-event’s attributes and interconnect events into bio-networks. Though general CR methods have been studied for a long time, they could not produce a practically useful result when applied to a special domain. Therefore, bio-entity CR needs attention to better assist biological network extraction. In this article, we present two methods for bio-entity CR. The first is a rule-based method, which creates a set of syntactic rules or semantic constraints for CR. It obtains a state-of-the-art performance (an F1-score of 62.0%) on the community supported dataset. We also present a machine learning-based method, which makes use of a recurrent neural network model, a long short-term memory network. It automatically learns global discriminative representations of all kinds of coreferences without hand-crafted features. The model outperforms the previously best machine learning-based method.
      PubDate: Wed, 11 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay065
      Issue No: Vol. 2018 (2018)
  • HRPDviewer: human ribosome profiling data viewer

    • Authors: Wu W; Jiang Y, Chang J, et al.
      Abstract: Translational regulation plays an important role in protein synthesis. Dysregulation of translation causes abnormal cell physiology and leads to diseases such as inflammatory disorders and cancers. An emerging technique, called ribosome profiling (ribo-seq), was developed to capture a snapshot of translation. It is based on deep sequencing of ribosome-protected mRNA fragments. A lot of ribo-seq data have been generated in various studies, so databases are needed for depositing and visualizing the published ribo-seq data. Currently, GWIPS-viz, RPFdb and TranslatomeDB are the three largest databases developed for this purpose. However, two challenges remain to be addressed. First, the GWIPS-viz and RPFdb databases align the published ribo-seq data to the genome. Since ribo-seq data aim to reveal the actively translated mRNA transcripts, there are advantages to aligning ribo-seq data to the transcriptome rather than the genome. Second, TranslatomeDB does not provide any visualization and the other two databases only provide visualization of the ribo-seq data around a specific genomic location, while simultaneous visualization of the ribo-seq data on multiple mRNA transcripts produced from the same gene or different genes is desired. To address these two challenges, we developed the Human Ribosome Profiling Data viewer (HRPDviewer). HRPDviewer (i) contains 610 published human ribo-seq datasets from Gene Expression Omnibus, (ii) aligns the ribo-seq data to the transcriptome and (iii) provides visualization of the ribo-seq data on the selected mRNA transcripts. Using HRPDviewer, researchers can compare the ribosome binding patterns of multiple mRNA transcripts from the same gene or different genes to gain an accurate understanding of protein synthesis in human cells. We believe that HRPDviewer is a useful resource for researchers to study translational regulation in humans. Database URL: or
      PubDate: Wed, 11 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay074
      Issue No: Vol. 2018 (2018)
  • Atlas of Schistosoma mansoni long non-coding RNAs and their expression
           correlation to protein-coding genes

    • Authors: Vasconcelos E; Mesel V, daSilva L, et al.
      Abstract: Long non-coding RNAs (lncRNAs) have been widely discovered in several organisms with the help of high-throughput RNA sequencing. LncRNAs are transcripts over 200 nt long that do not have protein-coding (PC) potential, and have been reported in model organisms to act mainly on the overall control of PC gene expression. Little is known about the functionality of lncRNAs in evolutionarily ancient non-model metazoan organisms, like Schistosoma mansoni, the parasite that causes schistosomiasis, one of the most prevalent infectious-parasitic diseases worldwide. In a recent transcriptomics effort, we identified thousands of S. mansoni lncRNAs predicted to be functional along the course of parasite development. Here, we present an online catalog of each of the S. mansoni lncRNAs whose expression is correlated to PC genes along the parasite life-cycle, which can be conveniently browsed and downloaded through a new web resource. We also now provide navigation of the co-expression networks disclosed in our previous publication, where we correlated the transcriptional patterns of mRNAs and lncRNAs across five life-cycle stages/forms, pinpointing biological processes upon which lncRNAs might act. Database URL:
      PubDate: Mon, 09 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay068
      Issue No: Vol. 2018 (2018)
  • CeleryDB: a genomic database for celery

    • Authors: Feng K; Hou X, Li M, et al.
      Abstract: Celery (Apium graveolens L.) is a plant belonging to the Apiaceae family, and a popular vegetable worldwide because of its abundant nutrients and various medical functions. Although extensive genetic and molecular biological studies have been conducted on celery, its genomic data remain unclear. Given the significance of celery and the growing demand for its genomic data, the whole genome of ‘Q2-JN11’ celery (a highly inbred line obtained by artificial selfing of ‘Jinnan Shiqin’) was sequenced using HiSeq 2000 sequencing technology. To make it convenient for researchers to study celery, an online database of the whole-genome sequences of celery, CeleryDB, was constructed. The sequences of the whole genome, nucleotide sequences of the predicted genes and amino acid sequences of the predicted proteins are available online on CeleryDB. The Home, BLAST, Genome Browser, Transcription Factor and Download interfaces make up the organizational structure of CeleryDB. Users can search the celery genomic data by using two user-friendly query tools: the basic local alignment search tool (BLAST) and Genome Browser. In the future, CeleryDB will be constantly updated to satisfy the needs of celery researchers worldwide. Database URL:
      PubDate: Mon, 09 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay070
      Issue No: Vol. 2018 (2018)
  • KM-express: an integrated online patient survival and gene expression
           analysis tool for the identification and functional characterization of
           prognostic markers in breast and prostate cancers

    • Authors: Chen X; Miao Z, Divate M, et al.
      Abstract: The identification and functional characterization of novel biomarkers in cancer requires survival analysis and gene expression analysis of both patient samples and cell line models. To help facilitate this process, we have developed KM-Express. KM-Express holds an extensive manually curated transcriptomic profile of 45 different datasets for prostate and breast cancer with phenotype and pathoclinical information, spanning from clinical samples to cell lines. KM-Express also contains The Cancer Genome Atlas datasets for 30 other cancer types with matching cell line expression data for 23 of them. We present KM-Express as a hypothesis generation tool for researchers to identify potential new prognostic RNA biomarkers as well as targets for further downstream functional cell-based studies. Specifically, KM-Express allows users to compare the expression level of genes in different groups of patients based on molecular, genetic, clinical and pathological status. Moreover, KM-Express aids the design of biological experiments based on the expression profile of the genes in different cell lines. Thus, KM-Express provides a one-stop analysis from bench work to clinical prospects. We have used this tool to successfully evaluate the prognostic potential of previously published biomarkers for prostate cancer and breast cancer. We believe KM-Express will accelerate the translation of biomedical research from bench to bed.Database URL:
      PubDate: Mon, 09 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay069
      Issue No: Vol. 2018 (2018)
  • Identifying frequent patterns in biochemical reaction networks: a workflow

    • Authors: Lambusch F; Waltemath D, Wolkenhauer O, et al.
      Abstract: Computational models in biology encode molecular and cell biological processes. Many of these models can be represented as biochemical reaction networks. Studying such networks, one is mostly interested in systems that share similar reactions and mechanisms. Typical goals of an investigation thus include understanding of model parts, identification of reoccurring patterns and recognition of biologically relevant motifs. The large number and size of available models, however, require automated methods to support researchers in achieving their goals. Specifically for the problem of finding patterns in large networks only partial solutions exist. We propose a workflow that identifies frequent structural patterns in biochemical reaction networks encoded in the Systems Biology Markup Language. The workflow utilizes a subgraph mining algorithm to detect the network patterns. Once patterns are identified, the textual pattern description can automatically be converted into a graphical representation. Furthermore, information about the distribution of patterns among a selected set of models can be retrieved. The workflow was validated with 575 models from the curated branch of BioModels. In this paper, we highlight interesting and frequent structural patterns. Furthermore, we provide exemplary patterns that incorporate terms from the Systems Biology Ontology. Our workflow can be applied to a custom set of models or to models already existing in our graph database MaSyMoS. The occurrences of frequent patterns may give insight into the encoding of central biological processes, evaluate postulated biological motifs or serve as a similarity measure for models that share common structures.Database URL:
      PubDate: Tue, 03 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay051
      Issue No: Vol. 2018 (2018)
  • A database of wild rice germplasm of Oryza rufipogon species complex from
           different agro-climatic zones of India

    • Authors: Tripathy K; Singh B, Singh N, et al.
      Abstract: Rice is a staple food for the people of Asia that supplies more than 50% of the food energy globally. It is widely accepted that the crop domestication process has left behind substantial useful genetic diversity in their wild progenitor species that has huge potential for developing crop varieties with enhanced resistance to an array of biotic and abiotic stresses. In this context, Oryza rufipogon, Oryza nivara and their intermediate types wild rice germplasm/s collected from diverse agro-climatic regions would provide a rich repository of genes and alleles that could be utilized for rice improvement using genomics-assisted breeding. Here we present a database of detailed information on 614 such diverse wild rice accessions collected from different agro-climatic zones of India, including 46 different morphological descriptors, complete passport data and DNA fingerprints. The information has been stored in a web-based database entitled ‘Indian Wild Rice (IWR) Database’. The information provided in the IWR Database will be useful for the rice geneticists and breeders for improvement of rice cultivars for yield, quality and resilience to climate change.Database URL: 8080/iwrdb/index.jsp
      PubDate: Mon, 02 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay058
      Issue No: Vol. 2018 (2018)
  • Improved biomedical term selection in pseudo relevance feedback

    • Authors: Nabeel Asim M; Wasim M, Usman Ghani Khan M, et al.
      Abstract: Biomedical information retrieval systems are becoming more popular and more complex owing to the massive and ever-growing biomedical literature. Users are often unable to construct a precise and accurate query that clearly represents the intended information need. Therefore, the query is expanded with terms or features that retrieve more relevant information. Selection of appropriate expansion terms plays a key role in improving the performance of the retrieval task. We propose document frequency chi-square, a newer version of chi-square, for term selection in pseudo relevance feedback. The effects of pre-processing on the performance of information retrieval, specifically in the biomedical domain, are also presented. On average, the proposed algorithm outperformed state-of-the-art term selection algorithms by 88% at pre-defined test points. Our experiments also show that stemming causes a decrease in the overall performance of the pseudo relevance feedback-based information retrieval system, particularly in the biomedical domain. Database URL:
      PubDate: Mon, 02 Jul 2018 00:00:00 GMT
      DOI: 10.1093/database/bay056
      Issue No: Vol. 2018 (2018)
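The abstract above concerns chi-square-based selection of query expansion terms in pseudo relevance feedback. As a rough illustration of the general idea, the sketch below ranks candidate terms with the classical 2x2 chi-square statistic. It is not the document frequency chi-square variant the authors propose, which the abstract does not define. The function names, the representation of documents as sets of tokens and the toy data are all assumptions made for this example.

```python
def chi_square(a, b, c, d):
    """Classical 2x2 chi-square statistic for one term.

    a: pseudo-relevant documents containing the term
    b: other documents containing the term
    c: pseudo-relevant documents lacking the term
    d: other documents lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_expansion_terms(feedback_docs, other_docs, k):
    """Rank terms from the feedback set by chi-square and keep the top k.

    Each document is a set of tokens. Only terms positively associated
    with the feedback set (a*d > b*c) are considered.
    """
    vocab = set().union(*feedback_docs) if feedback_docs else set()
    scores = {}
    for term in vocab:
        a = sum(term in doc for doc in feedback_docs)
        b = sum(term in doc for doc in other_docs)
        c = len(feedback_docs) - a
        d = len(other_docs) - b
        if a * d > b * c:
            scores[term] = chi_square(a, b, c, d)
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data: the top-ranked documents of an initial search act as feedback.
feedback = [{"brca1", "cancer", "breast"}, {"brca1", "mutation", "cancer"}]
others = [{"cancer", "lung"}, {"diabetes", "insulin"}, {"cancer", "therapy"}]
print(select_expansion_terms(feedback, others, 2))
```

In this toy collection, "brca1" ranks first because it occurs in every feedback document and in none of the others, whereas "cancer" scores low because it is common to both sets.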
  • PtRFdb: a database for plant transfer RNA-derived fragments

    • Authors: Gupta N; Singh A, Zahra S, et al.
      Abstract: Transfer RNA-derived fragments (tRFs) represent a novel class of small RNAs (sRNAs) generated through endonucleolytic cleavage of both mature and precursor transfer RNAs (tRNAs). These tRFs, 14–28 nt in length, have been extensively studied in the animal kingdom but remain to be explored in plants. In this study, we introduce a database of plant tRFs named PtRFdb for the scientific community. We analyzed a total of 1344 sRNA sequencing datasets from 10 different plant species and identified a total of 5607 unique tRFs (758 tRF-1, 2269 tRF-3 and 2580 tRF-5), represented by 487 765 entries. In PtRFdb, detailed and comprehensive information is available for each tRF entry. Apart from the core information consisting of the tRF type, anticodon, source organism, tissue, sequence and genomic location, additional information such as the PubMed identifier (PMID), sample accession number (GSM), sequence length and frequency relevant to the tRFs may be of high utility to the user. Two different types of search modules (Basic Search and Advanced Search), a sequence similarity search (by BLAST) and a Browse option, with a data download facility for each search, are provided in this database. We believe that PtRFdb is a unique database of its kind and that it will be beneficial in the validation and further characterization of plant tRFs. Database URL:
      PubDate: Fri, 22 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay063
      Issue No: Vol. 2018 (2018)
  • Chemical–gene relation extraction using recursive neural network

    • Authors: Lim S; Kang J.
      Abstract: In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research on the named entity recognition of genes and drugs has been conducted, there is limited research on extracting relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element in pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text mining systems that can be used to automatically extract relationships between chemical compounds and genes. We tested three recursive neural network approaches to improve the performance of relation extraction. In the BioCreative VI challenge, we developed a tree-Long Short-Term Memory networks (tree-LSTM) model with several additional features including a position feature and a subtree containment feature, and we also applied an ensemble method. After the challenge, we applied additional pre-processing steps to the tree-LSTM model, and we tested the performance of another recursive neural network model called Stack-augmented Parser Interpreter Neural Network (SPINN). Our tree-LSTM model achieved an F-score of 58.53% in the BioCreative VI challenge. Our tree-LSTM model with additional pre-processing and the SPINN model obtained F-scores of 63.7 and 64.1%, respectively.Database URL:
      PubDate: Thu, 21 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay060
      Issue No: Vol. 2018 (2018)
  • dbLGL: an online leukemia gene and literature database for the
           retrospective comparison of adult and childhood leukemia genetics with
           literature evidence

    • Authors: Liu Y; Luo M, Jin Z, et al.
      Abstract: Leukemia is a group of cancers with increased numbers of immature or abnormal leucocytes that originated in the bone marrow and other blood-forming organs. The development of differentially diagnostic biomarkers for different subtypes largely depends on understanding the biological pathways and regulatory mechanisms associated with leukemia-implicated genes. Unfortunately, the leukemia-implicated genes that have been identified thus far are scattered among thousands of published studies, and no systematic summary of the differences between adult and childhood leukemia exists with regard to the causative genetic mutations and genetic mechanisms of the various subtypes. In this study, we performed a systematic literature review of those susceptibility genes reported in small-scale experiments and built an online gene database containing a total of 1805 leukemia-associated genes, available at Our comparison of genes from the four primary subtypes and between adult and childhood cases identified a number of potential genes related to patient survival. These curated genes can satisfy a growing demand for further integrating genomics screening for leukemia-associated low-frequency mutated genes.Database URL:
      PubDate: Thu, 21 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay062
      Issue No: Vol. 2018 (2018)
  • LncCeRBase: a database of experimentally validated human competing
           endogenous long non-coding RNAs

    • Authors: Pian C; Zhang G, Tu T, et al.
      Abstract: Long non-coding RNAs (lncRNAs) are endogenous molecules longer than 200 nucleotides that lack coding potential. LncRNAs that interact with microRNAs (miRNAs) are known as competing endogenous RNAs (ceRNAs) and have the ability to regulate the expression of target genes. The ceRNAs play an important role in the initiation and progression of various cancers. However, until now, there has been no database collecting experimentally verified human ceRNAs. We developed the LncCeRBase database, which encompasses 432 lncRNA–miRNA–mRNA interactions, including 130 lncRNAs, 214 miRNAs and 245 genes from 300 publications. In addition, we compiled the signaling pathways associated with the included lncRNA–miRNA–mRNA interactions as a tool to explore their functions. LncCeRBase is useful for understanding the regulatory mechanisms of lncRNA. Database URL:
      PubDate: Thu, 21 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay061
      Issue No: Vol. 2018 (2018)
  • RBPMetaDB: a comprehensive annotation of mouse RNA-Seq datasets with
           perturbations of RNA-binding proteins

    • Authors: Li J; Deng S, Vieira J, et al.
      Abstract: RNA-binding proteins (RBPs) may play a critical role in gene regulation in various diseases or biological processes by controlling post-transcriptional events such as polyadenylation, splicing and mRNA stabilization via binding activities to RNA molecules. Owing to the importance of RBPs in gene regulation, a great number of studies have been conducted, resulting in a large amount of RNA-Seq datasets. However, these datasets usually do not have structured organization of metadata, which limits their potentially wide use. To bridge this gap, the metadata of a comprehensive set of publicly available mouse RNA-Seq datasets with perturbed RBPs were collected and integrated into a database called RBPMetaDB. This database contains 292 mouse RNA-Seq datasets for a comprehensive list of 187 RBPs. These RBPs account for only ∼10% of all known RBPs annotated in Gene Ontology, indicating that most are still unexplored using high-throughput sequencing. This negative information provides a great pool of candidate RBPs for biologists to conduct future experimental studies. In addition, we found that DNA-binding activities are significantly enriched among RBPs in RBPMetaDB, suggesting that prior studies of these DNA- and RNA-binding factors focus more on DNA-binding activities instead of RNA-binding activities. This result reveals the opportunity to efficiently reuse these data for investigation of the roles of their RNA-binding activities. A web application has also been implemented to enable easy access and wide use of RBPMetaDB. It is expected that RBPMetaDB will be a great resource for improving understanding of the biological roles of RBPs.Database URL:
      PubDate: Tue, 19 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay054
      Issue No: Vol. 2018 (2018)
  • MPD: a pathogen genome and metagenome database

    • Authors: Zhang T; Miao J, Han N, et al.
      Abstract: Advances in high-throughput sequencing have led to unprecedented growth in the amount of available genome sequencing data, especially for bacterial genomes, which has been accompanied by a challenge for the storage and management of such huge datasets. To facilitate bacterial research and related studies, we have developed the Mypathogen database (MPD), which provides access to users for searching, downloading, storing and sharing bacterial genomics data. The MPD represents the first pathogenic database for microbial genomes and metagenomes, and currently covers pathogenic microbial genomes (6604 genera, 11 071 species, 41 906 strains) and metagenomic data from host, air, water and other sources (28 816 samples). The MPD also functions as a management system for statistical and storage data that can be used by different organizations, thereby facilitating data sharing among different organizations and research groups. A user-friendly local client tool is provided to maintain the steady transmission of big sequencing data. The MPD is a useful tool for analysis and management in genomic research, especially for clinical Centers for Disease Control and epidemiological studies, and is expected to contribute to advancing knowledge on pathogenic bacteria genomes and metagenomes.Database URL:
      PubDate: Thu, 14 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay055
      Issue No: Vol. 2018 (2018)
  • LeptoDB: an integrated database of genomics and proteomics resource of

    • Authors: Beriwal S; Padhiyar N, Bhatt D, et al.
      Abstract: Leptospirosis is a potentially fatal zoo-anthroponosis caused by pathogenic species of Leptospira belonging to the family of Leptospiraceae, with a worldwide distribution and effect, in terms of its burden and risk to human health. The ‘LeptoDB’ is a single window dedicated architecture (5 948 311 entries), modeled using heterogeneous data as a core resource for global Leptospira species. LeptoDB facilitates well-structured knowledge of genomics, proteomics and therapeutic aspects with more than 500 assemblies including 17 complete and 496 draft genomes encoding 1.7 million proteins for 23 Leptospira species with more than 250 serovars comprising pathogenic, intermediate and saprophytic strains. It also seeks to be a dynamic compendium of therapeutically essential components such as epitopes, primers, CRISPR/Cas9 and putative drug targets. Integration of JBrowse provides an elaborate locus-centric description of each sequence or contig, Jmol provides structural visualization of proteins and MUSCLE provides interactive multiple sequence alignment annotation and analysis. The data on genomic islands will provide an understanding of virulence and pathogenicity. The integrated phylogenetic analysis suggests the evolutionary division of strains. As LeptoDB is easily accessible on a public web server, we anticipate wide use of this metadata on Leptospira for the development of potential therapeutics. Database URL:
      PubDate: Tue, 12 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay057
      Issue No: Vol. 2018 (2018)
  • ILDgenDB: integrated genetic knowledge resource for interstitial lung
           diseases (ILDs)

    • Authors: Mishra S; Shah M, Sarkar M, et al.
      Abstract: Interstitial lung diseases (ILDs) are a diverse group of ∼200 acute and chronic pulmonary disorders that are characterized by variable amounts of inflammation, fibrosis and architectural distortion with substantial morbidity and mortality. Inaccurate and delayed diagnoses increase the risk, especially in developing countries. Studies have indicated the significant roles of genetic elements in ILD pathogenesis. Therefore, the first genetic knowledge resource, ILDgenDB, has been developed with the objective of providing ILD genetic data and their integrated analyses for a better understanding of disease pathogenesis and the identification of diagnostics-based biomarkers. This resource contains literature-curated disease candidate genes (DCGs) enriched with various regulatory elements that have been generated using an integrated bioinformatics workflow of database searches, literature mining and DCGs–microRNA (miRNAs)–single nucleotide polymorphisms (SNPs) association analyses. To provide statistical significance for disease-gene associations, ILD-specificity index and hypergeometric test scores were also incorporated. Association analyses of miRNAs, SNPs and pathways responsible for the pathogenesis of different sub-classes of ILDs were also incorporated. A total of 299 manually verified DCGs and their significant associations with 1932 SNPs, 2966 miRNAs and 9170 miR-polymorphisms are also provided. Furthermore, 216 literature-mined and proposed biomarkers were identified. The ILDgenDB resource provides user-friendly browsing and extensive query-based information retrieval systems. Additionally, this resource provides a graphical view of predicted DCGs–SNPs/miRNAs and literature-associated DCGs–ILDs interactions for each ILD to facilitate efficient data interpretation. Outcomes of the analyses suggested the significant involvement of the immune system and defense mechanisms in ILD pathogenesis. This resource may potentially facilitate genetic-based disease monitoring and diagnosis. Database URL:
      PubDate: Sat, 09 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay053
      Issue No: Vol. 2018 (2018)
  • A comparative synteny analysis tool for target-gene SNP marker discovery:
           connecting genomics data to breeding in Solanaceae

    • Authors: Choe J; Kim J, Lee B, et al.
      Abstract: It is necessary for molecular breeders to overcome the difficulties in applying abundant genomic information to crop breeding. Candidate orthologs would be discovered more efficiently in less-studied crops if the information gained from studies of related crops were used. We developed a comparative analysis tool and web-based genome viewer to identify orthologous genes based on synteny as well as sequence similarity between tomato, pepper and potato. The tool has a step-by-step interface with multiple viewing levels to support the easy and accurate exploration of functional orthologs. Furthermore, it provides access to single-nucleotide polymorphism markers from the massive genetic resource pool in order to accelerate the development of molecular markers for candidate orthologs in the Solanaceae. This tool provides a bridge between genome data and breeding by supporting effective marker development, data utilization and communication. Database URL:
      PubDate: Sun, 03 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay047
      Issue No: Vol. 2018 (2018)
  • A systematic approach for identifying shared mechanisms in epilepsy and
           its comorbidities

    • Authors: Hoyt C; Domingo-Fernández D, Balzer N, et al.
      Abstract: Cross-sectional epidemiological studies have shown that the incidence of several nervous system diseases is more frequent in epilepsy patients than in the general population. Some comorbidities [e.g. Alzheimer’s disease (AD) and Parkinson’s disease] are also risk factors for the development of seizures, suggesting they may share pathophysiological mechanisms with epilepsy. A literature-based approach was used to identify gene overlap between epilepsy and its comorbidities as a proxy for a shared genetic basis for disease, or genetic pleiotropy, as a first effort to identify shared mechanisms. While the results identified neurological disorders as the group of diseases with the highest gene overlap, this analysis was insufficient for identifying putative common mechanisms shared across epilepsy and its comorbidities. This motivated the use of a dedicated literature mining and knowledge assembly approach in which a cause-and-effect model of epilepsy was captured with Biological Expression Language. After enriching the knowledge assembly with information surrounding epilepsy, its risk factors, its comorbidities, and anti-epileptic drugs, a novel comparative mechanism enrichment approach was used to propose several downstream effectors (including the GABA receptor, GABAergic pathways, etc.) that could explain the therapeutic effects of carbamazepine in both the contexts of epilepsy and AD. We have made the Epilepsy Knowledge Assembly available at and queryable through NeuroMMSig at The source code used for analysis and tutorials for reproduction are available on GitHub at
      PubDate: Sun, 03 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay050
      Issue No: Vol. 2018 (2018)
  • SPRENO: a BioC module for identifying organism terms in figure captions

    • Authors: Dai H; Singh O.
      Abstract: Recent advances in biological research reveal that the majority of the experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of the investigations are often exclusively available in the form of figures in published papers. There is no denying that such findings have been instrumental in the in-depth understanding of biological processes and pathways. However, such data are not machine-readable, as the descriptions in figure captions convey abundant information in an ambiguous manner. The abbreviated term ‘SIN’ exemplifies this issue, as it may stand for Sindbis virus or the sex-lethal interactor gene (Drosophila melanogaster). To overcome this ambiguity, entities should be identified by linking them to the respective entries in notable biological databases. Among all entity types, the task of identifying species plays a pivotal role in disambiguating related entities in the text. In this study, we present our species identification tool SPRENO (Species Recognition and Normalization), which is established for recognizing organism terms mentioned in figure captions and linking them to the NCBI taxonomy database by exploiting the contextual information from both the figure caption and the corresponding full text. To determine the ID of ambiguous organism mentions, two disambiguation methods have been developed. One is based on the majority rule to select the ID that has been successfully linked to previously mentioned organism terms. The other is a convolutional neural network (CNN) model trained by learning both the context and the distance information of the target organism mention. As a system based on the majority rule, SPRENO was one of the top-ranked systems in the BioCreative VI BioID track and achieved micro F-scores of 0.776 (entity recognition) and 0.755 (entity normalization) on the official test set, respectively. Additionally, the SPRENO-CNN exhibited better precision with lower recall and F-scores (0.720/0.711 for entity recognition/normalization). SPRENO is freely available at URL:
      PubDate: Sun, 03 Jun 2018 00:00:00 GMT
      DOI: 10.1093/database/bay048
      Issue No: Vol. 2018 (2018)
  • dbCRSR: a manually curated database for regulation of cancer

    • Authors: Wen P; Xia J, Cao X, et al.
      Abstract: Radiotherapy is used to treat approximately 50% of all cancer patients, with varying prognoses. Intrinsic radiosensitivity is an important factor underlying the radiotherapeutic efficacy of this precise treatment. During the past decades, great efforts have been made to improve radiotherapy treatment through multiple strategies. However, invaluable data remains buried in the extensive radiotherapy literature, making it difficult to obtain an overall view of the detailed mechanisms leading to radiosensitivity, thus limiting advances in radiotherapy. To address this issue, we collected data from the relevant literature contained in the PubMed database and developed a literature-based database that we term the cancer radiosensitivity regulation factors database (dbCRSR). dbCRSR is a manually curated catalogue of radiosensitivity, containing multiple radiosensitivity regulation factors (395 coding genes, 119 non-coding RNAs and 306 chemical compounds) with appropriate annotation. To illustrate the value of the data we collected, data mining was performed including functional annotation and network analysis. In summary, dbCRSR is the first literature-based database to focus on radiosensitivity and provides a resource to better understand the detailed mechanisms of radiosensitivity. We anticipate dbCRSR will be a useful resource to enrich our knowledge and to promote further study of radiosensitivity. Database URL: 8080/dbCRSR/
      PubDate: Wed, 30 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay049
      Issue No: Vol. 2018 (2018)
  • DEXTER: Disease-Expression Relation Extraction from Text

    • Authors: Gupta S; Dingerdissen H, Ross K, et al.
      Abstract: Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research, and diagnostics and prognostics of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression–disease relationships that not only have been captured from large-scale studies but have also been observed in thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), to extract information from literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags behind significantly compared to expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51 and 81.81% for the two evaluations, respectively. Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract differential expression information for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNA in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL:
      PubDate: Wed, 30 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay045
      Issue No: Vol. 2018 (2018)
  • CBD: a biomarker database for colorectal cancer

    • Authors: Zhang X; Sun X, Cao Y, et al.
      Abstract: The colorectal cancer (CRC) biomarker database (CBD) was established based on 870 identified CRC biomarkers and their relevant information from 1115 original articles in PubMed published from 1986 to 2017. In this version of the CBD, CRC biomarker data were collected, sorted, displayed and analysed. With its credible contents, the CBD is a powerful, time-saving tool that provides comprehensive and accurate information for further CRC biomarker research. The CBD was built on a MySQL server. HTML, PHP and JavaScript were used to implement the web interface, and Apache was selected as the HTTP server. All of these web operations were implemented under the Windows system. For each biomarker, the CBD provides information categorized into the biological category, source and application of the biomarker; the experimental methods, results, authors and publication resources; and the research region, average cohort age, gender, race, number of tumours, tumour location and stage. We only collect data from articles with clear and credible results showing that the biomarkers are useful in the diagnosis, treatment or prognosis of CRC. The CBD can also provide a professional platform for researchers who are interested in CRC research to communicate, exchange their research ideas and further design high-quality research in CRC. They can submit their new findings to our database via the submission page and communicate with us in the CBD. Database URL:
      PubDate: Sat, 26 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay046
      Issue No: Vol. 2018 (2018)
  • LnChrom: a resource of experimentally validated lncRNA–chromatin
           interactions in human and mouse

    • Authors: Yu F; Zhang G, Shi A, et al.
      Abstract: Long non-coding RNAs (lncRNAs) constitute an important layer of chromatin regulation that contributes to various biological processes and diseases. By interacting with chromatin, many lncRNAs can regulate the state of chromatin by recruiting chromatin-modifying complexes and thus control large-scale gene expression programs. However, the available information on interactions between lncRNAs and chromatin is hidden in a large amount of dispersed literature and has not been extensively collected. We established the LnChrom database, a manually curated resource of experimentally validated lncRNA–chromatin interactions. The current release of LnChrom includes 382 743 interactions in human and mouse. We also manually collected detailed metadata for each interaction pair, including those of chromatin-modifying factors, epigenetic marks and disease associations. LnChrom provides a user-friendly interface to facilitate browsing, searching and retrieving of lncRNA–chromatin interaction data. Additionally, a large amount of multi-omics data was integrated into LnChrom to aid in characterizing the effects of lncRNA–chromatin interactions on epigenetic modifications and transcriptional expression. We believe that LnChrom is a timely and valuable resource that can greatly motivate mechanistic research into lncRNAs. Database URL:
      PubDate: Fri, 18 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay039
      Issue No: Vol. 2018 (2018)
  • SolCyc: a database hub at the Sol Genomics Network (SGN) for the manual
           curation of metabolic networks in Solanum and Nicotiana specific databases

    • Authors: Foerster H; Bombarely A, Battey J, et al.
      Abstract: SolCyc is the entry portal to pathway/genome databases (PGDBs) for major species of the Solanaceae family hosted at the Sol Genomics Network. Currently, SolCyc comprises six organism-specific PGDBs for tomato, potato, pepper, petunia, tobacco and one Rubiaceae, coffee. The metabolic networks of those PGDBs have been computationally predicted by the PathoLogic component of the Pathway Tools software using the manually curated multi-domain database MetaCyc ( as reference. SolCyc has been recently extended by taxon-specific databases, i.e. the family-specific SolanaCyc database, containing only curated data pertinent to species of the nightshade family, and NicotianaCyc, a genus-specific database that stores all relevant metabolic data of the Nicotiana genus. Through manual curation of the published literature, new metabolic pathways have been created in those databases, which are complemented by the continuously updated, relevant species-specific pathways from MetaCyc. At present, SolanaCyc comprises 199 pathways and 29 superpathways and NicotianaCyc accounts for 72 pathways and 13 superpathways. Curator-maintained, taxon-specific databases such as SolanaCyc and NicotianaCyc are characterized by an enrichment of data specific to these taxa and are free of falsely predicted pathways. Both databases have been used to update recently created Nicotiana-specific databases for Nicotiana tabacum, Nicotiana benthamiana, Nicotiana sylvestris and Nicotiana tomentosiformis by propagating verifiable data into those PGDBs. In addition, in-depth curation of the pathways in N. tabacum has been carried out, which resulted in the elimination of 156 pathways from the 569 pathways predicted by Pathway Tools. Together, in-depth curation of the predicted pathway network and the supplementation with curated data from taxon-specific databases have substantially improved the curation status of the species-specific N. tabacum PGDB. The implementation of this strategy will significantly advance the curation status of all organism-specific databases in SolCyc, resulting in improved database accuracy, data analysis and visualization of biochemical networks in those species. Database URL:
      PubDate: Thu, 10 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay035
      Issue No: Vol. 2018 (2018)
  • CircR2Disease: a manually curated database for experimentally supported
           circular RNAs associated with various diseases

    • Authors: Fan C; Lei X, Fang Z, et al.
      Abstract: CircR2Disease is a manually curated database, which provides a comprehensive resource for circRNA deregulation in various diseases. Increasing evidence has shown that circRNAs play critical roles in transcriptional, post-transcriptional and translational regulation. Therefore, the aberrant expression of circRNAs has been associated with a group of diseases. It is important to develop a high-quality database to deposit the deregulated circRNAs in diseases. The current version of CircR2Disease contains 725 associations between 661 circRNAs and 100 diseases, collected by reviewing the existing literature. Each entry in CircR2Disease contains detailed information for the circRNA–disease relationship, including circRNA name, coordinates and gene symbol, disease name, expression patterns of circRNA, experimental techniques, a brief description of the circRNA–disease relationship, year of publication and the PubMed ID. CircR2Disease provides a user-friendly interface to browse, search and download as well as to submit novel disease-related circRNAs. CircR2Disease could be very beneficial for researchers to investigate the mechanism of disease-related circRNAs and explore the appropriate algorithms for predicting novel associations. Database URL:
      PubDate: Fri, 04 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay044
      Issue No: Vol. 2018 (2018)
  • GEMiCCL: mining genotype and expression data of cancer cell lines with
           elaborate visualization

    • Authors: Jeong I; Yu N, Jang I, et al.
      Abstract: Cancer cell lines are essential components for biomedical research. However, proper choice of cell lines for experimental purposes is often difficult because genotype and/or expression data are missing or scattered in diverse resources. Here, we report Gene Expression and Mutations in Cancer Cell Lines (GEMiCCL), an online database of human cancer cell lines that provides genotype and expression information. We have collected mutation, gene expression and copy number variation (CNV) data from three representative databases on cell lines: Cancer Cell Line Encyclopedia, Catalogue of Somatic Mutations in Cancer and NCI60. In total, GEMiCCL includes 1406 cell lines from 185 cancer types and 29 tissues. Gene expression, mutation and CNV information are available for 1304, 1334 and 1365 cell lines, respectively. We removed batch effects due to different microarray platforms using the ComBat software and re-processed the entire gene expression and SNP chip data. Cell line names and clinical information were standardized using Cellosaurus from ExPASy. Our user interface supports cell line search, gene search, browsing for specific molecular characteristics and complex queries based on Boolean logic rules. We also implemented many interactive features and user-friendly visualizations. Providing molecular characteristics and clinical information, we believe that GEMiCCL would be a valuable resource for biomedical research for functional or screening studies. Database URL: GEMiCCL is available at
      PubDate: Wed, 02 May 2018 00:00:00 GMT
      DOI: 10.1093/database/bay041
      Issue No: Vol. 2018 (2018)
  • AbDb: antibody structure database—a database of PDB-derived antibody

    • Authors: Ferdous S; Martin A.
      Abstract: In order to analyse structures of proteins of a particular class, these need to be extracted from Protein Data Bank (PDB) files. In the case of antibodies, there are a number of special considerations: (i) identifying antibodies in the PDB is not trivial, (ii) they may be crystallized with or without antigen, (iii) for analysis purposes, one is normally only interested in the Fv region of the antibody, (iv) structural analysis of epitopes, in particular, requires individual antibody–antigen complexes from a PDB file which may contain multiple copies of the same, or different, antibodies and (v) standard numbering schemes should be applied. Consequently, there is a need for a specialist resource containing pre-numbered non-redundant antibody Fv structures with their cognate antigens. We have created an automatically updated resource, AbDb, which collects the Fv regions from antibody structures using information from our SACS database which summarizes antibody structures from the PDB. PDB files containing multiple structures are split and numbered and each antibody structure is associated with its antigen where available. Antibody structures with only light or heavy chains have also been processed and sequences of antibodies are compared to identify multiple structures of the same antibody. The data may be queried on the basis of PDB code, or the name or species of the antibody or antigen, and the complete datasets may be downloaded. Database URL:
      PubDate: Fri, 27 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay040
      Issue No: Vol. 2018 (2018)
  • PVCbase: an integrated web resource for the PVC bacterial proteomes

    • Authors: Bordin N; González-Sánchez J, Devos D.
      Abstract: Interest in the Planctomycetes-Verrucomicrobia-Chlamydiae (PVC) bacterial superphylum is growing within the microbiology community. These organisms do not have a specialized web resource that gathers in silico predictions in an integrated fashion. Hence, we are providing the PVC community with PVCbase, a specialized web resource that gathers in silico predictions in an integrated fashion. PVCbase integrates protein function annotations obtained through sequence analysis and tertiary structure prediction for 39 representative PVC proteomes (PVCdb), a protein feature visualizer (Foundation) and a custom BLAST webserver (PVCBlast) that allows users to retrieve the annotation of a hit directly from the DataTables. We display results from various predictors, encompassing most functional aspects, allowing users to have a more comprehensive overview of protein identities. Additionally, we illustrate how PVCdb can be used to address biological questions from raw data. Database URL: PVCbase is freely accessible at
      PubDate: Tue, 24 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay042
      Issue No: Vol. 2018 (2018)
  • SNPversity: a web-based tool for visualizing diversity

    • Authors: Schott D; Vinnakota A, Portwood J, II, et al.
      Abstract: Many stand-alone desktop software suites exist to visualize single nucleotide polymorphism (SNP) diversity, but web-based software that can be easily implemented and used for biological databases is absent. SNPversity was created to answer this need by building an open-source visualization tool that can be implemented on a Unix-like machine and served through a web browser accessible worldwide. SNPversity consists of an HDF5 database back-end for SNPs, a data exchange layer powered by TASSEL libraries that represent data in JSON format, and an interface layer using PHP to visualize SNP information. SNPversity displays data in real-time through a web browser in grids that are color-coded according to a given SNP’s allelic status and mutational state. SNPversity is currently available at MaizeGDB, the maize community’s database, and will be soon available at GrainGenes, the clade-oriented database for Triticeae and Avena species, including wheat, barley, rye, and oat. The code and documentation are uploaded onto GitHub, and they are freely available to the public. We expect that the tool will be highly useful for other biological databases with a similar need to display SNP diversity through their web interfaces. Database URL:
      PubDate: Fri, 20 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay037
      Issue No: Vol. 2018 (2018)
  • StraPep: a structure database of bioactive peptides

    • Authors: Wang J; Yin T, Xiao X, et al.
      Abstract: Bioactive peptides, with a variety of biological activities and wide distribution in nature, have attracted great research interest in biological and medical fields, especially in the pharmaceutical industry. The structural information of bioactive peptides is important for the development of peptide-based drugs. Many databases have been developed cataloguing bioactive peptides. However, to our knowledge, no database dedicated to collecting all bioactive peptides with known structures is available yet. Thus, we developed StraPep, a structure database of bioactive peptides. StraPep holds 3791 bioactive peptide structures, which belong to 1312 unique bioactive peptide sequences. About 905 out of 1312 (68%) bioactive peptides in StraPep contain disulfide bonds, which is significantly higher than that (21%) of PDB. Interestingly, 150 out of 616 (24%) bioactive peptides with three or more disulfide bonds form a structural motif known as cystine knot, which confers considerable structural stability on proteins and is an attractive scaffold for drug design. Detailed information on each peptide, including the experimental structure, the location of disulfide bonds, secondary structure, classification, post-translational modification and so on, has been provided. A wide range of user-friendly tools, such as browsing, sequence and structure-based searching and so on, has been incorporated into StraPep. We hope that this database will be helpful for the research community. Database URL:
      PubDate: Mon, 16 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay038
      Issue No: Vol. 2018 (2018)
  • Maser: one-stop platform for NGS big data from analysis to visualization

    • Authors: Kinjo S; Monma N, Misu S, et al.
      Abstract: A major challenge in analyzing the data from high-throughput next-generation sequencing (NGS) is how to handle the huge amounts of data and variety of NGS tools and visualize the resultant outputs. To address these issues, we developed a cloud-based data analysis platform, Maser (Management and Analysis System for Enormous Reads), and an original genome browser, Genome Explorer (GE). Maser enables users to manage up to 2 terabytes of data to conduct analyses with easy graphical user interface operations and offers analysis pipelines in which several individual tools are combined as a single pipeline for very common and standard analyses. GE automatically visualizes genome assembly and mapping results output from Maser pipelines, without requiring additional data upload. With this function, the Maser pipelines can graphically display the results output from all the embedded tools and mapping results in a web browser. Maser thus offers a more user-friendly analysis platform, especially for beginners, by improving graphical display and providing selected standard pipelines that work with the built-in genome browser. In addition, all the analyses executed on Maser are recorded in the analysis history, helping users to trace and repeat the analyses. The entire process of analysis and its histories can be shared with collaborators or opened to the public. In conclusion, our system is useful for managing, analyzing, and visualizing NGS data and achieves traceability, reproducibility, and transparency of NGS analysis. Database URL:
      PubDate: Fri, 13 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay027
      Issue No: Vol. 2018 (2018)
  • The tragedy of the biodiversity data commons: a data impediment creeping

    • Authors: Escribano N; Galicia D, Ariño A.
      Abstract: Researchers are embracing the open access movement to facilitate unrestricted availability of scientific results. One sign of this willingness is the steady increase in data freely shared online, which has prompted a corresponding increase in the number of papers using such data. Publishing datasets is a time-consuming process that is often seen as a courtesy, rather than a necessary step in the research process. Making data accessible allows further research, provides basic information for decision-making and contributes to transparency in science. Nevertheless, the ease of access to heaps of data carries a perception of ‘free lunch for all’, and the work of data publishers is largely going unnoticed. Acknowledging such a significant effort involving the creation, management and publication of a dataset remains a flimsy, not well established practice in the scientific community. In a meta-analysis of published literature, we have observed various dataset citation practices, but mostly (92%) consisting of merely citing the data repository rather than the data publisher. Failing to recognize the work of data publishers might lead to a decrease in the number of quality datasets shared online, compromising potential research that is dependent on the availability of such data. We make an urgent appeal to raise awareness about this issue.
      PubDate: Mon, 09 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay033
      Issue No: Vol. 2018 (2018)
  • ATD: a comprehensive bioinformatics resource for deciphering the
           association of autophagy and diseases

    • Authors: Wang W; Zhang P, Li L, et al.
      Pages: 1 - 10
      Abstract: Autophagy is the natural, regulated, destructive mechanism of the eukaryotic cell that disassembles unnecessary or dysfunctional components. In recent years, the association between autophagy and diseases has attracted more and more attention, but our understanding of the molecular mechanism of the association from a systems perspective is limited and ambiguous. Hence, we developed the comprehensive bioinformatics resource Autophagy To Disease (ATD, to archive autophagy-associated diseases. This resource provides a bioinformatics annotation system for genes and chemicals related to autophagy and human diseases by extracting results from previous studies with text mining technology. Based on the big data from ATD, we found that some classes of disease tend to be related to autophagy, including respiratory disease, cancer, urogenital disease and digestive system disease. We also found that some classes of autophagy-related diseases have a strong association with each other and constitute modules. Furthermore, we extracted the autophagy–disease-related genes (ADGs) from ATD and provided a novel algorithm, Optimized Random Forest with Label model, to predict potential ADGs. This bioinformatics annotation system for autophagy and human diseases may provide a basic resource for the further detection of the molecular mechanisms linking the autophagy pathway to disease.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay093
      Issue No: Vol. 2018 (2018)
  • Growing and cultivating the forest genomics database, TreeGenes

    • Authors: Falk T; Herndon N, Grau E, et al.
      Pages: 1 - 11
      Abstract: Forest trees are valued sources of pulp, timber and biofuels, and serve a role in carbon sequestration, biodiversity maintenance and watershed stability. Examining the relationships among genetic, phenotypic and environmental factors for these species provides insight on the areas of concern for breeders and researchers alike. The TreeGenes database is a web-based repository that is home to 1790 tree species and over 1500 registered users. The database provides a curated archive for high-throughput genomics, including reference genomes, transcriptomes, genetic maps and variant data. These resources are paired with extensive phenotypic information and environmental layers. TreeGenes recently migrated to Tripal, an integrated and open-source database schema and content management system. This migration enabled developments focused on data exchange, data transfer and improved analytical capacity, as well as providing TreeGenes the opportunity to communicate with the following partner databases: Hardwood Genomics Web, Genome Database for Rosaceae, and the Citrus Genome Database. Recent development in TreeGenes has focused on coordinating information for georeferenced accessions, including metadata acquisition and ontological frameworks, to improve integration across studies combining genetic, phenotypic and environmental data. This focus was paired with the development of tools to enable comparative genomics and data visualization. By combining advanced data importers, relevant metadata standards and integrated analytical frameworks, TreeGenes provides a platform for researchers to store, submit and analyze forest tree data.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay084
      Issue No: Vol. 2018 (2018)
  • An end-to-end deep learning architecture for extracting protein–protein
           interactions affected by genetic mutations

    • Authors: Tran T; Kavuluru R.
      Pages: 1 - 13
      Abstract: The BioCreative VI Track IV (mining protein interactions and mutations for precision medicine) challenge was organized in 2017 with the goal of applying biomedical text mining methods to support advancements in precision medicine approaches. As part of the challenge, a new dataset was introduced for the purpose of building a supervised relation extraction model capable of taking a test article and returning a list of interacting protein pairs identified by their Entrez Gene IDs. Specifically, such pairs represent proteins participating in a binary protein–protein interaction relation where the interaction is additionally affected by a genetic mutation—referred to as a PPIm relation. In this study, we explore an end-to-end approach for PPIm relation extraction by deploying a three-component pipeline involving deep learning-based named-entity recognition and relation classification models along with a knowledge-based approach for gene normalization. We propose several recall-focused improvements to our original challenge entry that placed second when matching on Entrez Gene ID (exact matching) and on HomoloGene ID. On exact matching, the improved system achieved new competitive test results of 37.78% micro-F1 with a precision of 38.22% and recall of 37.34% that corresponds to an improvement from the prior best system by approximately three micro-F1 points. When matching on HomoloGene IDs, we report similarly competitive test results at 46.17% micro-F1 with a precision and recall of 46.67 and 45.59%, respectively, corresponding to an improvement of more than eight micro-F1 points over the prior best result. The code for our deep learning system is made publicly available at
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay092
      Issue No: Vol. 2018 (2018)
  • AgBioData consortium recommendations for sustainable genomics and genetics
           databases for agriculture

    • Authors: Harper L; Campbell J, Cannon E, et al.
      Pages: 1 - 32
      Abstract: The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay088
      Issue No: Vol. 2018 (2018)
  • Improved ontology for eukaryotic single-exon coding sequences in
           biological databases

    • Authors: Jorquera R; González C, Clausen P, et al.
      Pages: 1 - 6
      Abstract: Efficient extraction of knowledge from biological data requires the development of structured vocabularies to unambiguously define biological terms. This paper proposes descriptions and definitions to disambiguate the term ‘single-exon gene’. Eukaryotic Single-Exon Genes (SEGs) have been defined as genes that do not have introns in their protein coding sequences. They have been studied not only to determine their origin and evolution but also because their expression has been linked to several types of human cancer and neurological/developmental disorders and many exhibit tissue-specific transcription. Unfortunately, the term ‘SEGs’ is rife with ambiguity, leading to biological misinterpretations. In the classic definition, no distinction is made between SEGs that harbor introns in their untranslated regions (UTRs) versus those without. This distinction is important to make because the presence of introns in UTRs affects transcriptional regulation and post-transcriptional processing of the mRNA. In addition, recent whole-transcriptome shotgun sequencing has led to the discovery of many examples of single-exon mRNAs that arise from alternative splicing of multi-exon genes, these single-exon isoforms are being confused with SEGs despite their clearly different origin. The increasing expansion of RNA-seq datasets makes it imperative to distinguish the different SEG types before annotation errors become indelibly propagated in biological databases. This paper develops a structured vocabulary for their disambiguation, allowing a major reassessment of their evolutionary trajectories, regulation, RNA processing and transport, and provides the opportunity to improve the detection of gene associations with disorders including cancers, neurological and developmental diseases.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay089
      Issue No: Vol. 2018 (2018)
  • Assisting document triage for human kinome curation via machine learning

    • Authors: Hsu Y; Wei C, Lu Z.
      Pages: 1 - 7
      Abstract: In the era of data explosion, the increasing frequency of published articles presents unorthodox challenges to fulfill specific curation requirements for bio-literature databases. Recognizing these demands, we designed a document triage system with automatic methods that can improve efficiency to retrieve the most relevant articles in curation workflows and reduce workloads for biocurators. Since the BioCreative VI (2017), we have implemented text mining processing in our system in hopes of providing higher effectiveness for curating articles related to human kinase proteins. We tested several machine learning methods together with state-of-the-art concept extraction tools. For features, we extracted rich co-occurrence and linguistic information to model the curation process of human kinome articles by the neXtProt database. As shown in the official evaluation on the human kinome curation task in BioCreative VI, our system can effectively retrieve 5.2 and 6.5 kinase articles with the relevant disease (DIS) and biological process (BP) information, respectively, among the top 100 returned results. Compared with neXtA5, our system demonstrates significant improvements in prioritizing kinome-related articles as follows: our system achieves 0.458 and 0.109 for the DIS axis whereas the neXtA5’s best-reported mean average precision (MAP) and maximum precision observed are 0.41 and 0.04. Our system also outperforms the neXtA5 in retrieving BP axis with 0.195 for MAP and the neXtA5’s reported value was 0.11. These results suggest that our system may be able to assist neXtProt biocurators in practice.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay091
      Issue No: Vol. 2018 (2018)
  • PubMed Labs: an experimental system for improving biomedical literature
           search

    • Authors: Fiorini N; Canese K, Bryzgunov R, et al.
      Pages: 1 - 8
      Abstract: PubMed is a freely accessible system for searching the biomedical literature, with ~2.5 million users worldwide on an average workday. In order to better meet our users’ needs in an era of information overload, we have recently developed PubMed Labs, an experimental system for users to test new search features/tools (e.g. Best Match) and provide feedback, which enables us to make more informed decisions about potential changes to improve the search quality and overall usability of PubMed. In addition, PubMed Labs features a mobile-first and responsive layout that offers better support for accessing PubMed from increasingly popular mobiles and small-screen devices. In this paper, we detail PubMed Labs, its purpose, new features and best practices. We also encourage users to share their experience with us; based on which we are continuously improving PubMed Labs with more advanced features and better user experience.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay094
      Issue No: Vol. 2018 (2018)
  • PalmXplore: oil palm gene database

    • Authors: Sanusi N; Rosli R, Halim M, et al.
      Pages: 1 - 9
      Abstract: A set of Elaeis guineensis genes had been generated by combining two gene prediction pipelines: Fgenesh++ developed by Softberry and Seqping by the Malaysian Palm Oil Board. PalmXplore was developed to provide a scalable data repository and a user-friendly search engine system to efficiently store, manage and retrieve the oil palm gene sequences and annotations. Information deposited in PalmXplore includes predicted genes, their genomic coordinates, as well as the annotations derived from external databases, such as Pfam, Gene Ontology and Kyoto Encyclopedia of Genes and Genomes. Information about genes related to important traits, such as those involved in fatty acid biosynthesis (FAB) and disease resistance, is also provided. The system offers Basic Local Alignment Search Tool homology search, where the results can be downloaded or visualized in the oil palm genome browser (MYPalmViewer). PalmXplore is regularly updated offering new features, improvements to genome annotation and new genomic sequences. The system is freely accessible at
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay095
      Issue No: Vol. 2018 (2018)
  • THE-DB: a threading model database for comparative protein structure
           analysis of the E. coli K12 and human proteomes

    • Authors: Diamond J; Zhang Y.
      Pages: 1 - 9
      Abstract: New methodology must be developed to improve the ability to characterize the growing number of amino acid sequences, which vastly exceeds the number of experimentally determined protein structures. Homologous proteins can be used as structural templates for modeling proteins that do not have experimentally determined structures. However, in many cases, there are no homologous proteins (typically <30% sequence identity) with determined structures from which a query sequence can be reliably modeled. The aim of protein threading is to use features, such as secondary structure, solvent accessibility and torsional angles, in addition to sequence patterns to identify structural templates from the protein databank to assist for full-length atomic-level structural modeling. However, there are still numerous protein sequences for which correct templates cannot be recognized. This raises the question as to what attributes allow query sequences to be matched to the correct but distantly homologous templates. To aid the investigation into this question and to provide genome-score protein structure for the biological community, a database called THE-DB (threading hard and easy protein database) has been developed in which it becomes possible to analyze over 15 000 query sequences from the Escherichia coli (E. coli) K12 and human proteomes, as well as to find their three-dimensional templates derived from the state-of-the-art threading algorithms which is not feasible with existing protein template databases. The E. coli K12 and human data can be downloaded in bulk from the THE-DB page.
      PubDate: Tue, 18 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay090
      Issue No: Vol. 2018 (2018)
  • SMMDB: a web-accessible database for small molecule modulators and their
           targets involved in neurological diseases

    • Authors: Mishra S; Jain N, Shankar U, et al.
      Pages: 1 - 12
      Abstract: High-throughput screening and better understanding of small molecule’s structure–activity relationship (SAR) using computational biology techniques have greatly expanded the face of drug discovery process in better discovery of therapeutics for various disease. Small Molecule Modulators Database (SMMDB) includes >1100 small molecules that have been either approved by US Food and Drug Administration, are under investigation or were rejected in clinical trial for any kind of neurological diseases. The comprehensive information about small molecules includes the details about their molecular targets (such as protein or enzyme, DNA, RNA, antisense RNA etc.), pharmacokinetic and pharmacodynamic properties such as binding affinity to their targets (Kd, Ki, IC50 and EC50 if available), mode of action, log P-value, number of hydrogen bond donor and acceptors, their clinical trial status, their 2D and three-dimensional structures etc. To enrich the basic annotation of every small molecule entry present in SMMDB, it is hyperlinked to their description present in PubChem, DrugBank, PubMed and KEGG database. The annotation about their molecular targets was enriched by linking it with UniProt and GenBank and STRING database that can be utilized to study the interaction and relation between various targets involved in single neurological disease. All molecules present in the SMMDB are made available to download in single file and can be further used in establishing the SAR, structure-based drug designing as well as shape-based virtual screening for developing the novel therapeutics against neurological diseases. The scope of this database majorly covers the interest of scientific community and researchers who are engaged in putting their endeavor toward therapeutic development and investigating the pathogenic mechanism of various neurological diseases. The graphical user interface of the SMMDB is accessible on
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay082
      Issue No: Vol. 2018 (2018)
  • Drug Target Commons 2.0: a community platform for systematic analysis of
           drug–target interaction profiles

    • Authors: Tanoli Z; Alam Z, Vähä-Koskela M, et al.
      Pages: 1 - 13
      Abstract: Drug Target Commons (DTC) is a web platform (database with user interface) for community-driven bioactivity data integration and standardization for comprehensive mapping, reuse and analysis of compound–target interaction profiles. End users can search, upload, edit, annotate and export expert-curated bioactivity data for further analysis, using an application programmable interface, database dump or tab-delimited text download options. To guide chemical biology and drug-repurposing applications, DTC version 2.0 includes updated clinical development information for the compounds and target gene–disease associations, as well as cancer-type indications for mutant protein targets, which are critical for precision oncology developments.
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay083
      Issue No: Vol. 2018 (2018)
  • lncSLdb: a resource for long non-coding RNA subcellular localization

    • Authors: Wen X; Gao L, Guo X, et al.
      Pages: 1 - 6
      Abstract: While long non-coding RNAs (lncRNAs) may play important roles in cellular function and biological process, we still know little about them. Growing evidences indicate that subcellular localization of lncRNAs may provide clues to their functionality. To facilitate researchers functionally characterize thousands of lncRNAs, we developed a database-driven application, lncSLdb, which stores and manages user-collected qualitative and quantitative subcellular localization information of lncRNAs from literature mining. The current release contains >11 000 transcripts from three species. Based on the accumulated region of lncRNAs, we classify transcripts into three basic localization types (nucleus, cytoplasm and nucleus/cytoplasm). In some conditions, the nucleus and cytoplasm types can be divided into three more accurate subtypes (chromosome, nucleoplasm and ribosome). Besides browsing and downloading data in lncSLdb, our system provides a set of comprehensive tools to search by gene symbols, genome coordinates or sequence similarity. We hope that lncSLdb will provide a convenient platform for researchers to investigate the functions and the molecular mechanisms of lncRNAs in the view of subcellular localization.
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay085
      Issue No: Vol. 2018 (2018)
  • WaspBase: a genomic resource for the interactions among parasitic wasps,
           insect hosts and plants

    • Authors: Chen L; Lang K, Bi S, et al.
      Pages: 1 - 9
      Abstract: Insect pests reduce yield and cause economic losses, which are major problems in agriculture. Parasitic wasps are the natural enemies of many agricultural pests and thus have been widely used as biological control agents. Plants, phytophagous insects and parasitic wasps form a tritrophic food chain. Understanding the interactions in this tritrophic system should be helpful for developing parasitic wasps for pest control and deciphering the mechanisms of parasitism. However, the genomic resources for this tritrophic system are not well organized. Here, we describe the WaspBase, a new database that contains 573 transcriptomes of 35 parasitic wasps and the genomes of 12 parasitic wasps, 5 insect hosts and 8 plants. In addition, we identified long non-coding RNA, untranslated regions and 25 widely studied gene families from the genome and transcriptome data of these species. WaspBase provides conventional web services such as Basic Local Alignment Search Tool, search and download, together with several widely used tools such as profile hidden Markov model, Multiple Alignment using Fast Fourier Transform, automated alignment trimming and JBrowse. We also present a collection of active researchers in the field of parasitic wasps, which should be useful for constructing scientific networks in this field.
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay081
      Issue No: Vol. 2018 (2018)
  • MetaHCR: a web-enabled metagenome data management system for hydrocarbon
           resources

    • Authors: Marks P; Bigler M, Alsop E, et al.
      Pages: 1 - 10
      Abstract: The ever-increasing metagenomic data necessitate appropriate cataloguing in a way that facilitates the comparison and better contextualization of the underlying investigations. To this extent, information associated with the sequencing data as well as the original sample and the environment where it was obtained from is crucial. To date, there are not any publicly available repositories able to capture environmental metadata pertaining to hydrocarbon-rich environments. As such, contextualization and comparative analysis among sequencing datasets derived from these environments is to a certain degree hindered or cannot be fully evaluated. The metagenomics data management system for hydrocarbon resources (MetaHCRs) enables the capturing of marker gene and whole metagenome sequencing data as well as over 300 contextual attributes associated with samples, organisms, environments and geological properties, among others. Moreover, MetaHCR implements the Minimum Information about any Sequence–hydrocarbon resource specification from the Genomic Standards Consortium; it integrates a user-friendly web interface and relational database model, and it enables the generation of complex custom search. MetaHCR has been tested with 36 publicly available metagenomic studies, and its modular architecture can be easily customized for other types of environmental and metagenomics studies.
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay087
      Issue No: Vol. 2018 (2018)
  • Spfy: an integrated graph database for real-time prediction of bacterial
           phenotypes and downstream comparative analyses

    • Authors: Le K; Whiteside M, Hopkins J, et al.
      Pages: 1 - 10
      Abstract: Public health laboratories are currently moving to whole-genome sequence (WGS)-based analyses, and require the rapid prediction of standard reference laboratory methods based solely on genomic data. Currently, these predictive genomics tasks rely on workflows that chain together multiple programs for the requisite analyses. While useful, these systems do not store the analyses in a genome-centric way, meaning the same analyses are often re-computed for the same genomes. To solve this problem, we created Spfy, a platform that rapidly performs the common reference laboratory tests, uses a graph database to store and retrieve the results from the computational workflows and links data to individual genomes using standardized ontologies. The Spfy platform facilitates rapid phenotype identification, as well as the efficient storage and downstream comparative analysis of tens of thousands of genome sequences. Though generally applicable to bacterial genome sequences, Spfy currently contains 10 243 Escherichia coli genomes, for which in-silico serotype and Shiga-toxin subtype, as well as the presence of known virulence factors and antimicrobial resistance determinants have been computed. Additionally, the presence/absence of the entire E. coli pan-genome was computed and linked to each genome. Owing to its database of diverse pre-computed results, and the ability to easily incorporate user data, Spfy facilitates hypothesis testing in fields ranging from population genomics to epidemiology, while mitigating the re-computation of analyses. The graph approach of Spfy is flexible, and can accommodate new analysis software modules as they are developed, easily linking new results to those already stored. Spfy provides a database and analyses approach for E. coli that is able to match the rapid accumulation of WGS data in public databases.
      PubDate: Thu, 13 Sep 2018 00:00:00 GMT
      DOI: 10.1093/database/bay086
      Issue No: Vol. 2018 (2018)
  • SAGE: a comprehensive resource of genetic variants integrating South Asian
           whole genomes and exomes

    • Authors: Hariprakash J; Vellarikkal S, Verma A, et al.
      Pages: 1 - 10
      Abstract: South Asia is home to ~20% of the world population and characterized by distinct ethnic, linguistic, cultural and genetic lineages. Only limited representative samples from the region have found their place in large population-scale international genome projects. The recent availability of genome scale data from multiple populations and datasets from South Asian countries in public domain motivated us to integrate the data into a comprehensive resource. In the present study, we have integrated a total of six datasets encompassing 1213 human exomes and genomes to create a compendium of 154 814 557 genetic variants and adding a total of 69 059 255 novel variants. The variants were systematically annotated using public resources and along with the allele frequencies are available as a browsable-online resource South Asian genomes and exomes. As a proof of principle application of the data and resource for genetic epidemiology, we have analyzed the pathogenic genetic variants causing retinitis pigmentosa. Our analysis reveals the genetic landscape of the disease and suggests subset of genetic variants to be highly prevalent in South Asia.
      PubDate: Fri, 31 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay080
      Issue No: Vol. 2018 (2018)
  • PlaNC-TE: a comprehensive knowledgebase of non-coding RNAs and
           transposable elements in plants

    • Authors: Pedro D; Lorenzetti A, Domingues D, et al.
      Pages: 1 - 7
      Abstract: Transposable elements (TEs) play an essential role in the genetic variability of eukaryotic species. In plants, they may comprise up to 90% of the total genome. Non-coding RNAs (ncRNAs) are known to control gene expression and regulation. Although the relationship between ncRNAs and TEs is known, obtaining the organized data for sequenced genomes is not straightforward. In this study, we describe the PlaNC-TE, a user-friendly portal harboring a knowledgebase created by integrating and analysing plant ncRNA-TE data. We identified a total of 14 350 overlaps between ncRNAs and TEs in 40 plant genomes. The database allows users to browse, search and download all ncRNA and TE data analysed. Overall, PlaNC-TE not only organizes data and provides insights about the relationship between ncRNA and TEs in plants but also helps improve genome annotation strategies. Moreover, this is the first database to provide resources to broadly investigate functions and mechanisms involving TEs and ncRNAs in plants.
      PubDate: Fri, 31 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay078
      Issue No: Vol. 2018 (2018)
  • A probabilistic automated tagger to identify human-related publications

    • Authors: Cohen A; Dunivin Z, Smalheiser N.
      Pages: 1 - 8
      Abstract: The Medical Subject Heading ‘Humans’ is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up to date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when they lack Medical Subject Headings. One million MEDLINE records published in 1987–2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.
      PubDate: Fri, 31 Aug 2018 00:00:00 GMT
      DOI: 10.1093/database/bay079
      Issue No: Vol. 2018 (2018)
  • Signalling maps in cancer research: construction and data analysis

    • Authors: Kondratova M; Sompairac N, Barillot E, et al.
      Abstract: Generation and usage of high-quality molecular signalling network maps can be augmented by standardizing notations, establishing curation workflows and application of computational biology methods to exploit the knowledge contained in the maps. In this manuscript, we summarize the major aims and challenges of assembling information in the form of comprehensive maps of molecular interactions. Mainly, we share our experience gained while creating the Atlas of Cancer Signalling Network. In the step-by-step procedure, we describe the map construction process and suggest solutions for map complexity management by introducing a hierarchical modular map structure. In addition, we describe the NaviCell platform, a computational technology using Google Maps API to explore comprehensive molecular maps similar to geographical maps and explain the advantages of semantic zooming principles for map navigation. We also provide the outline to prepare signalling network maps for navigation using the NaviCell platform. Finally, several examples of cancer high-throughput data analysis and visualization in the context of comprehensive signalling maps are presented.
      PubDate: Mon, 09 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay036
  • An entropy-reducing data representation approach for bioinformatic data

    • Authors: McCulloch A; Jauregui R, Maclean P, et al.
      Abstract: Non-semantic approaches to bioinformatic data analysis have potential relevance where semantic resources such as annotated finished reference genomes are lacking, such as in the analysis and utilisation of growing amounts of sequence data from non-model organisms, often associated with sequence-based agricultural, aqua-cultural and environmental sampling studies and commercial services. Even where rich semantic resources are available, semantic approaches to problems such as contrasting and comparing reference assemblies, and utilising multiple references in parallel to avoid reference bias, are costly and difficult to fully automate. We introduce and discuss a non-semantic data representation approach intended mainly for bioinformatic data called non-semantic labelling. Non-semantic labelling involves tensorially combining multiple kinds of model-based entropy-reducing data representation, with multiple representation models, so as to map both data and models into dual metric representation spaces, with goals of both reducing the statistical complexity of the data, and highlighting latent structure via machine learning and statistical analyses conducted within the dual representation spaces. As part of the framework, we introduce a novel algebraic abstraction of data representation mappings, and present four proof-of-concept examples of its application, to problems such as comparing and contrasting sequence assemblies, utilisation of multiple references for annotation and development of quality control diagnostics in a variety of high-throughput sequencing contexts.
      PubDate: Thu, 05 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay029
  • Expert curation for building network-based dynamical models: a case study
           on atherosclerotic plaque formation

    • Authors: Bekkar A; Estreicher A, Niknejad A, et al.
      Abstract: Knowledgebases play an increasingly important role in scientific research, where the expert curation of biological knowledge in forms that are amenable to computational analysis (using ontologies, for example) provides a significant added value and enables new types of computational analyses for high throughput datasets. In this work, we demonstrate how expert curation can also play a more direct role in research, by supporting the use of network-based dynamical models to study a specific biological process. This curation effort is focused on the regulatory interactions between biological entities, such as genes or proteins and compounds, which may interact with each other in a complex manner, including regulatory complexes and conditional dependencies between co-regulators. This critical information has to be captured and encoded in a computable manner, which is currently far beyond the capabilities of automatically constructed networks. As a case study, we report here the prior knowledge network constructed by the sysVASC consortium to model the biological events leading to the formation of atherosclerotic plaques, during the onset of cardiovascular disease and discuss some specific examples to illustrate the main pitfalls and added value provided by the expert curation during this endeavor.
      PubDate: Wed, 04 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay031
  • A tutorial of diverse genome analysis tools found in the CoGe web-platform
           using Plasmodium spp. as a model

    • Authors: Castillo A; Nelson A, Haug-Baltzell A, et al.
      Abstract: Integrated platforms for storage, management, analysis and sharing of large quantities of omics data have become fundamental to comparative genomics. CoGe is an online platform designed to manage and study genomic data, enabling both data- and hypothesis-driven comparative genomics. CoGe’s tools and resources can be used to organize and analyse both publicly available and private genomic data from any species. Here, we demonstrate the capabilities of CoGe through three example workflows using 17 Plasmodium genomes as a model. Plasmodium genomes present unique challenges for comparative genomics due to their rapidly evolving and highly variable genomic AT/GC content. These example workflows are intended to serve as templates to help guide researchers who would like to use CoGe to examine diverse aspects of genome evolution. In the first workflow, trends in genome composition and amino acid usage are explored. In the second, changes in genome structure and the distribution of synonymous (Ks) and non-synonymous (Kn) substitution values are evaluated across species with different levels of evolutionary relatedness. In the third workflow, microsyntenic analyses of multigene families’ genomic organization are conducted using two Plasmodium-specific gene families—serine repeat antigen, and cytoadherence-linked asexual gene—as models. In general, these example workflows show how to achieve quick, reproducible and shareable results using the CoGe platform. We were able to replicate previously published results, as well as leverage CoGe’s tools and resources to gain additional insight into various aspects of Plasmodium genome evolution. Our results highlight the usefulness of the CoGe platform, particularly in understanding complex features of genome evolution.
      PubDate: Tue, 03 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay030
  • dbAMEPNI: a database of alanine mutagenic effects for
           protein–nucleic acid interactions

    • Authors: Liu L; Xiong Y, Gao H, et al.
      Abstract: Protein–nucleic acid interactions play essential roles in various biological activities such as gene regulation, transcription, DNA repair and DNA packaging. Understanding the effects of amino acid substitutions on protein–nucleic acid binding affinities can help elucidate the molecular mechanism of protein–nucleic acid recognition. Until now, no comprehensive and updated database of quantitative binding data on alanine mutagenic effects for protein–nucleic acid interactions is publicly accessible. Thus, we developed a new database of Alanine Mutagenic Effects for Protein-Nucleic Acid Interactions (dbAMEPNI). dbAMEPNI is a manually curated, literature-derived database, comprising over 577 alanine mutagenic data with experimentally determined binding affinities for protein–nucleic acid complexes. It contains several important parameters, such as dissociation constant (Kd), Gibbs free energy change (ΔΔG), experimental conditions and structural parameters of mutant residues. In addition, the database provides an extended dataset of 282 single alanine mutations with only qualitative data (or descriptive effects) of thermodynamic information.Database URL:
      PubDate: Mon, 02 Apr 2018 00:00:00 GMT
      DOI: 10.1093/database/bay034
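
The dbAMEPNI abstract above lists the dissociation constant (Kd) and the free energy change (ΔΔG) among its stored parameters. As a rough illustration of how the two quantities are conventionally related, the sketch below applies the standard thermodynamic relation ΔΔG = RT ln(Kd_mutant / Kd_wild-type). This is an assumption about the usual convention, not a description of the database's own curation procedure; the function name `ddg_alanine` and the default temperature of 298.15 K are hypothetical choices made for this example.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption for this sketch.
R_KCAL_PER_MOL_K = 1.987204e-3


def ddg_alanine(kd_wild_type, kd_mutant, temperature_k=298.15):
    """Return the binding free energy change (kcal/mol) for a mutation.

    Uses ddG = R * T * ln(Kd_mutant / Kd_wild_type). Both Kd values must be
    positive and in the same units. A positive result means the mutant
    binds more weakly than the wild type.
    """
    if kd_wild_type <= 0 or kd_mutant <= 0:
        raise ValueError("Kd values must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_mutant / kd_wild_type)


if __name__ == "__main__":
    # Hypothetical values: a tenfold loss of affinity on alanine substitution.
    print(round(ddg_alanine(1e-9, 1e-8), 2))
```

Under these assumptions, a tenfold increase in Kd corresponds to roughly 1.36 kcal/mol at room temperature, which is a convenient rule of thumb when reading alanine-scanning data.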
  • Probabilistic and machine learning-based retrieval approaches for
           biomedical dataset retrieval

    • Authors: Karisani P; Qin Z, Agichtein E.
      Abstract: The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system.Database URL:
      PubDate: Wed, 28 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bax104
  • Biopanning data bank 2018: hugging next generation phage display

    • Authors: He B; Jiang L, Duan Y, et al.
      Abstract: The 2018 update of the biopanning data bank (BDB) stores phage display data sequenced by Sanger sequencing and next generation sequencing technologies. In this work, we upgraded the database with more biopanning data sets and several new features, including (i) incorporation of next generation biopanning data and the unselected population where the target is not determined and the round of screening is zero; (ii) addition of sequencing information; (iii) improvement of browsing and searching systems and 3D chemical structure viewer; (iv) integration of standalone tools for target-unrelated peptides analysis within conventional phage display and next generation phage display (NGPD) data. In the current version of BDB (released on 19 January 2018), the database houses 3291 sets of biopanning data collected from 1540 published articles, including 95 NGPD data sets and 3196 traditional biopanning data sets. The BDB database serves as an important and comprehensive resource for developing peptide ligands. Database URL: The BDB database is available at
      PubDate: Tue, 27 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay032
  • CITGeneDB: a comprehensive database of human and mouse genes enhancing or
           suppressing cold-induced thermogenesis validated by perturbation
           experiments in mice

    • Authors: Li J; Deng S, Wei G, et al.
      Abstract: Cold-induced thermogenesis increases energy expenditure and can reduce body weight in mammals, so the genes involved in it are thought to be potential therapeutic targets for treating obesity and diabetes. In the quest for more effective therapies, a great deal of research has been conducted to elucidate the regulatory mechanism of cold-induced thermogenesis. Over the last decade, a large number of genes that can enhance or suppress cold-induced thermogenesis have been discovered, but a comprehensive list of these genes is lacking. To fill this gap, we examined all of the annotated human and mouse genes and curated those demonstrated to enhance or suppress cold-induced thermogenesis by in vivo or ex vivo experiments in mice. The results of this highly accurate and comprehensive annotation are hosted on a database called CITGeneDB, which includes a searchable web interface to facilitate broad public use. The database will be updated as new genes are found to enhance or suppress cold-induced thermogenesis. It is expected that CITGeneDB will be a valuable resource in future explorations of the molecular mechanism of cold-induced thermogenesis, helping pave the way for new obesity and diabetes treatments.Database URL:
      PubDate: Fri, 23 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay012
  • GEOMetaCuration: a web-based application for accurate manual curation of
           Gene Expression Omnibus metadata

    • Authors: Li Z; Li J, Yu P.
      Abstract: Metadata curation has become increasingly important for biological discovery and biomedical research because a large amount of heterogeneous biological data is currently freely available. To facilitate efficient metadata curation, we developed an easy-to-use web-based curation application, GEOMetaCuration, for curating the metadata of Gene Expression Omnibus datasets. It can eliminate mechanical operations that consume precious curation time and can help coordinate curation efforts among multiple curators. It improves the curation process by introducing various features that are critical to metadata curation, such as a back-end curation management system and a curator-friendly front-end. The application is based on a commonly used web development framework of Python/Django and is open-sourced under the GNU General Public License V3. GEOMetaCuration is expected to benefit the biocuration community and to contribute to computational generation of biological insights using large-scale biological data. An example use case can be found at the demo website: URL:
      PubDate: Fri, 23 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay019
  • Improved ontology-based similarity calculations using a study-wise
           annotation model

    • Authors: Köhler S.
      Abstract: A typical use case of ontologies is the calculation of similarity scores between items that are annotated with classes of the ontology. For example, in differential diagnostics and disease gene prioritization, the human phenotype ontology (HPO) is often used to compare a query phenotype profile against gold-standard phenotype profiles of diseases or genes. The latter have long been constructed as flat lists of ontology classes, which, as we show in this work, can be improved by exploiting existing structure and information in annotation datasets or full text disease descriptions. We derive a study-wise annotation model of diseases and genes and show that this can improve the performance of semantic similarity measures. Inferred weights of individual annotations are one reason for this improvement, but more importantly, using the study-wise structure further boosts the results of the algorithms according to precision-recall analyses. We test the study-wise annotation model for diseases annotated with classes from the HPO and for genes annotated with gene ontology (GO) classes. We incorporate this annotation model into similarity algorithms and show how this leads to improved performance. This work adds weight to the need for enhancing simple list-based representations of disease or gene annotations. We show how study-wise annotations can be automatically derived from full text summaries of disease descriptions and from the annotation data provided by the GO Consortium, and how semantic similarity measures can utilize this extended annotation model. Database URL:
      PubDate: Fri, 23 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay026
  • TISSUES 2.0: an integrative web resource on mammalian tissue expression

    • Authors: Palasca O; Santos A, Stolte C, et al.
      Abstract: Database (2018), doi: 10.1093/database/bay003
      PubDate: Fri, 16 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay028
  • Finding relevant biomedical datasets: the UC San Diego solution for the
           bioCADDIE Retrieval Challenge

    • Authors: Wei W; Ji Z, He Y, et al.
      Abstract: The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval.Database URL:
      PubDate: Fri, 16 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay017
  • PvaxDB: a comprehensive structural repository of Plasmodium vivax proteome

    • Authors: Singh A; Kaushik R, Kuntal H, et al.
      Abstract: The severity of malaria caused by Plasmodium vivax worldwide and its resistance to the available general antimalarial drugs have created an urgent need for comprehensive insight into its biology and biochemistry in order to develop novel vaccines and therapeutics. The P. vivax proteome comprises 5392 proteins, mostly predicted, of which 4211 are soluble proteins, and 2205 of these belong to the blood and liver stages of the malarial cycle. Presently available public resources report functional annotation (gene ontology) for only 28% (627 proteins) of the enzymatic soluble proteins, and experimental structures have been determined for only 42 proteins of the P. vivax proteome. In this milieu of severe paucity of structural and functional data, we have generated structures of 2205 soluble proteins, validated them thoroughly, identified their binding pockets (including active sites) and annotated their function, increasing the coverage from the existing 28% to 100%. We have pooled all this information together and created a database named PvaxDB, which furnishes extensive sequence, structure, ligand binding site and functional information. We believe PvaxDB could be helpful in identifying novel protein drug targets, expediting the development of new drugs to combat malaria. This is also the first attempt to create a reliable, comprehensive computational structural repository of all the soluble proteins of P. vivax. Database URL:
      PubDate: Wed, 14 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay021
  • Baseline and extensions approach to information retrieval of complex
           medical data: Poznan's approach to the bioCADDIE 2016

    • Authors: Cieslewicz A; Dutkiewicz J, Jedrzejek C.
      Abstract: Information retrieval from biomedical repositories has become a challenging task because of their increasing size and complexity. To facilitate the research aimed at improving the search for relevant documents, various information retrieval challenges have been launched. In this article, we present the improved medical information retrieval systems designed by Poznan University of Technology and Poznan University of Medical Sciences as a contribution to the bioCADDIE 2016 challenge, a task focusing on information retrieval from a collection of 794 992 datasets generated from 20 biomedical repositories. The system developed by our team utilizes the Terrier 4.2 search platform enhanced by a query expansion method using word embeddings. This approach, after post-challenge modifications and improvements (with particular regard to assigning proper weights for original and expanded terms), allowed us to achieve the second-best infNDCG measure (0.4539) compared with the challenge results, and an infAP of 0.3978. This demonstrates that proper utilization of word embeddings can be a valuable addition to the information retrieval process. Some analysis is provided on related work involving other bioCADDIE contributions. We discuss the possibility of improving our results by using better word embedding schemes to find candidates for query expansion. Database URL:
      PubDate: Mon, 12 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bax103
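
The bioCADDIE entries above report results as infNDCG, an inferred variant of Normalized Discounted Cumulative Gain. The sketch below computes plain NDCG only, to show what the underlying measure rewards: highly relevant datasets placed near the top of a ranking. It does not implement the inferred variant, which estimates the score from incomplete, sampled relevance judgments. The function names and the log2 rank discount are assumptions chosen for this example, not the challenge's official evaluation code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, judged_gains):
    """NDCG of a ranking against all judged relevance grades for the query.

    ranked_gains: relevance grades of the returned items, in returned order.
    judged_gains: grades of every judged item for the query (any order);
                  the ideal ranking is the best ordering of these grades,
                  cut to the length of the returned list.
    """
    ideal = sorted(judged_gains, reverse=True)[: len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades: 2 = relevant, 1 = partially relevant, 0 = not relevant.
    print(ndcg([2, 1, 0], [2, 1, 0]))  # best possible ordering
    print(ndcg([0, 1, 2], [2, 1, 0]))  # same items, worst ordering
```

The same three items score 1.0 in the ideal order and noticeably lower when reversed, which is why reweighting query terms to lift relevant datasets up the list can change the reported figure.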
  • SPTEdb: a database for transposable elements in salicaceous plants

    • Authors: Yi F; Jia Z, Xiao Y, et al.
      Abstract: Although transposable elements (TEs) play significant roles in the structural, functional and evolutionary dynamics of salicaceous plant genomes, the accurate identification, definition and classification of TEs are still inadequate. In this study, we identified 18 393 TEs from Populus trichocarpa, Populus euphratica and Salix suchowensis using a combination of signature-based, similarity-based and de novo methods, and annotated them into 1621 families. A comprehensive and user-friendly web-based database, SPTEdb, was constructed for researchers. SPTEdb enables users to browse, retrieve and download the TE sequences from the database. In addition, several analysis tools, including BLAST, HMMER, GetORF and Cut sequence, were integrated into SPTEdb to help users mine the TE data easily and effectively. In summary, SPTEdb will facilitate the study of TE biology and functional genomics in salicaceous plants. Database URL:
      PubDate: Fri, 09 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay024
  • YummyData: providing high-quality open life science data

    • Authors: Yamamoto Y; Yamaguchi A, Splendiani A.
      Abstract: Many life science datasets are now available via Linked Data technologies, meaning that they are represented in a common format (the Resource Description Framework), and are accessible via standard APIs (SPARQL endpoints). While this is an important step toward developing an interoperable bioinformatics data landscape, it also creates a new set of obstacles, as it is often difficult for researchers to find the datasets they need. Different providers frequently offer the same datasets, with different levels of support: as well as having more or less up-to-date data, some providers add metadata to describe the content, structures, and ontologies of the stored datasets while others do not. We currently lack a place where researchers can go to easily assess datasets from different providers in terms of metrics such as service stability or metadata richness. We also lack a space for collecting feedback and improving data providers’ awareness of user needs. To address this issue, we have developed YummyData, which consists of two components. One periodically polls a curated list of SPARQL endpoints, monitoring the states of their Linked Data implementations and content. The other presents the information measured for the endpoints and provides a forum for discussion and feedback. YummyData is designed to improve the findability and reusability of life science datasets provided as Linked Data and to foster its adoption. It is freely accessible at URL:
      PubDate: Fri, 09 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay022
  • The SNPcurator: literature mining of enriched SNP-disease associations

    • Authors: Tawfik N; Spruit M.
      Abstract: The uniqueness of each human genetic structure motivated the shift from the current practice of medicine to a more tailored one. This personalized medicine revolution would not be possible today without the genetics data collected from genome-wide association studies (GWASs) that investigate the relation between different phenotypic traits and single-nucleotide polymorphisms (SNPs). The huge increase in the volume of published literature poses a challenge to the conventional manual curation process, which is becoming more and more expensive. This research aims at automatically extracting the SNP associations of any given disease along with their reported statistical significance (P-value) and odds ratio, as well as cohort information such as size and ethnicity. Our evaluation illustrates that SNPcurator was able to replicate a large number of SNP-disease associations that were also reported in the NHGRI-EBI Catalog of published GWASs. SNPcurator was also tested by eight external genetics experts, who queried the system to examine diseases of their choice, and was found to be efficient and satisfactory. We conclude that the text-mining-based system has great potential for helping researchers and scientists, especially in their preliminary genetics research. SNPcurator is publicly available at URL:
      PubDate: Thu, 08 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay020
  • NDDVD: an integrated and manually curated Neurodegenerative Diseases
           Variation Database

    • Authors: Yang Y; Xu C, Liu X, et al.
      Abstract: Neurodegenerative diseases (NDDs) are associated with genetic variations including point substitutions, copy number alterations, insertions and deletions. At present, a few genetic variation repositories for some individual NDDs have been created; however, these databases need to be integrated and expanded to all NDDs for systems biological investigation. Here we build a relational database, termed NDDVD, to integrate all the variations of NDDs using the Leiden Open Variation Database (LOVD) platform. The items in the NDDVD are collected manually from PubMed or extracted from existing variation databases. The cross-disease database includes over 6374 genetic variations of 289 genes associated with 37 different NDDs. The patterns, conservations and biological functions for variations in different NDDs are statistically compared and a user-friendly interface is provided for NDDVD at:
      PubDate: Mon, 05 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay018
  • Micropublication: incentivizing community curation and placing unpublished
           data into the public domain

    • Authors: Raciti D; Yook K, Harris T, et al.
      Abstract: Large volumes of data generated by research laboratories coupled with the required effort and cost of curation present a significant barrier to inclusion of these data in authoritative community databases. Further, many publicly funded experimental observations remain invisible to curation simply because they are never published: results often do not fit within the scope of a standard publication; trainee-generated data are forgotten when the experimenter (e.g. student, post-doc) leaves the lab; results are omitted from science narratives due to publication bias where certain results are considered irrelevant for the publication. While authors are in the best position to curate their own data, they face a steep learning curve to ensure that appropriate referential tags, metadata, and ontologies are applied correctly to their observations, a task sometimes considered beyond the scope of their research and other numerous responsibilities. Getting researchers to adopt a new system of data reporting and curation requires a fundamental change in behavior among all members of the research community. To solve these challenges, we have created a novel scholarly communication platform that captures data from researchers and directly delivers them to information resources via Micropublication. This platform incentivizes authors to publish their unpublished observations along with associated metadata by providing a deliberately fast and lightweight but still peer-reviewed process that results in a citable publication. Our long-term goal is to develop a data ecosystem that improves reproducibility and accountability of publicly funded research and in turn accelerates both basic and translational discovery.Database URL:
      PubDate: Fri, 02 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay013
  • BioDataome: a collection of uniformly preprocessed and automatically
           annotated datasets for data-driven biology

    • Authors: Lakiotaki K; Vorniotakis N, Tsagris M, et al.
      Abstract: The biotechnology revolution generates a plethora of omics data at an exponential pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be easily used in large-scale experiments and meta-analysis. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package found in: URL:
      PubDate: Fri, 02 Mar 2018 00:00:00 GMT
      DOI: 10.1093/database/bay011
  • AntiTbPdb: a knowledgebase of anti-tubercular peptides

    • Authors: Usmani S; Kumar R, Kumar V, et al.
      Abstract: Tuberculosis is a global menace, caused by Mycobacterium tuberculosis, responsible for millions of premature deaths every year. In the era of drug-resistant tuberculosis, peptide-based therapeutics may provide an alternative to small-molecule-based drugs. To create the knowledgebase AntiTbPdb, experimentally validated anti-tubercular and anti-mycobacterial peptides were compiled from the literature. We curated 10 652 research articles and 35 patents to extract anti-tubercular peptides and annotated these peptides manually. This knowledgebase has 1010 entries; each entry provides extensive information about an anti-tubercular peptide, such as sequence, chemical modification, chirality, nature and source of origin. The tertiary structure of these anti-tubercular peptides, containing natural as well as chemically modified residues, was predicted using PEPstrMOD and I-TASSER. In addition to structural information, the database maintains other properties of peptides, such as physicochemical properties. Numerous web-based tools have been integrated for data retrieval, browsing, sequence similarity search and peptide mapping. To assist a wide range of users, we developed a responsive website suitable for smartphone, tablet and desktop. Database URL:
      PubDate: Wed, 28 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay025
  • miRwayDB: a database for experimentally validated microRNA-pathway
           associations in pathophysiological conditions

    • Authors: Das S; Saha P, Chakravorty N.
      Abstract: MicroRNAs (miRNAs) are well-known as key regulators of diverse biological pathways. A body of experimental evidence has shown that abnormal miRNA expression profiles are responsible for various pathophysiological conditions by modulating genes in disease-associated pathways. In spite of the rapid increase in research data confirming such associations, scientists still do not have access to a consolidated database offering these miRNA-pathway association details for critical diseases. We have developed miRwayDB, a database providing comprehensive information on experimentally validated miRNA-pathway associations in various pathophysiological conditions, utilizing data collected from published literature. To the best of our knowledge, it is the first database that provides information about experimentally validated miRNA-mediated pathway dysregulation as seen specifically in critical human diseases and hence indicative of a cause-and-effect relationship in most cases. The current version of miRwayDB collects an exhaustive list of miRNA-pathway association entries for 76 critical disease conditions by reviewing 663 published articles. Each database entry contains complete information on the name of the pathophysiological condition, associated miRNA(s), experimental sample type(s), regulation pattern (up/down) of miRNA, pathway association(s), targeted member of dysregulated pathway(s) and a brief description. In addition, miRwayDB provides miRNA, gene and pathway scores to evaluate the role of miRNA-regulated pathways in various pathophysiological conditions. The database can also be used for other biomedical approaches such as validation of computational analysis, integrated analysis and prediction of computational models. It also offers a submission page to submit novel data from recently published studies. We believe that miRwayDB will be a useful tool for the miRNA research community. Database URL:
      PubDate: Wed, 28 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay023
  • Prevention of data duplication for high throughput sequencing repositories

    • Authors: Gabdank I; Chan E, Davidson J, et al.
      Abstract: Prevention of unintended duplication is one of the ongoing challenges many databases have to address. When working with high-throughput sequencing data, the complexity of that challenge increases with the complexity of the definition of a duplicate. In a computational data model, a data object represents a real entity like a reagent or a biosample. This representation is similar to how a card represents a book in a paper library catalog. Duplicated data objects not only waste storage, they can mislead users into assuming the model represents more than the single entity. Even if it is clear that two objects represent a single entity, data duplication opens the door to potential inconsistencies between the objects, since the content of the duplicated objects can be updated independently, allowing divergence of the metadata associated with the objects. This is analogous to a situation in which a catalog in a paper library mistakenly contains two cards for a single copy of a book. If these cards simultaneously list two different individuals as current book borrowers, it would be difficult to determine which of the two listed borrowers actually has the book. Unfortunately, in a large database with multiple submitters, unintended duplication is to be expected. In this article, we present three principal guidelines the Encyclopedia of DNA Elements (ENCODE) Portal follows in order to prevent unintended duplication of both actual files and data objects: definition of identifiable data objects (I), object uniqueness validation (II) and de-duplication mechanism (III). In addition to explaining our modus operandi, we elaborate on the methods used for identification of sequencing data files. A comparison of the approach taken by the ENCODE Portal with other widely used biological data repositories is provided. Database URL:
      PubDate: Tue, 27 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay008
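
The ENCODE abstract above names object uniqueness validation as one of its three guidelines but does not say how files are identified. The sketch below shows one common way such a check can be done, assuming a file is identified by a hash of its content. The class name `FileRegistry`, the `register` method, the accession strings and the choice of SHA-256 are all hypothetical and are not taken from the ENCODE Portal.

```python
import hashlib


class DuplicateFileError(ValueError):
    """Raised when identical content is submitted under a second accession."""


class FileRegistry:
    """Toy registry that refuses to store the same content twice."""

    def __init__(self):
        # Maps content checksum -> accession of the first object registered.
        self._accession_by_checksum = {}

    def register(self, accession, content):
        """Register file content (bytes) under an accession and return its checksum."""
        checksum = hashlib.sha256(content).hexdigest()
        existing = self._accession_by_checksum.get(checksum)
        if existing is not None and existing != accession:
            raise DuplicateFileError(
                f"content of {accession} already registered as {existing}"
            )
        self._accession_by_checksum[checksum] = accession
        return checksum


if __name__ == "__main__":
    registry = FileRegistry()
    registry.register("FILE001", b"@read1\nACGT\n+\nIIII\n")
    try:
        registry.register("FILE002", b"@read1\nACGT\n+\nIIII\n")
    except DuplicateFileError as err:
        print("rejected:", err)
```

A byte-level hash only catches exact copies. The same reads recompressed or reordered would pass this check, which may be one reason the abstract treats the identification of sequencing files as a topic in its own right.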
  • Updated regulation curation model at the Saccharomyces Genome Database

    • Authors: Engel S; Skrzypek M, Hellerstedt S, et al.
      Abstract: The Saccharomyces Genome Database (SGD) provides comprehensive, integrated biological information for the budding yeast Saccharomyces cerevisiae, along with search and analysis tools to explore these data, enabling the discovery of functional relationships between sequence and gene products in fungi and higher organisms. We have recently expanded our data model for regulation curation to address regulation at the protein level in addition to transcription, and are presenting the expanded data on the ‘Regulation’ pages at SGD. These pages include a summary describing the context under which the regulator acts, manually curated and high-throughput annotations showing the regulatory relationships for that gene and a graphical visualization of its regulatory network and connected networks. For genes whose products regulate other genes or proteins, the Regulation page includes Gene Ontology enrichment analysis of the biological processes in which those targets participate. For DNA-binding transcription factors, we also provide other information relevant to their regulatory function, such as DNA binding site motifs and protein domains. As with other data types at SGD, all regulatory relationships and accompanying data are available through YeastMine, SGD’s data warehouse based on InterMine.Database URL:
      PubDate: Tue, 27 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay007
  • The NCBI BioCollections Database

    • Authors: Sharma S; Ciufo S, Starchenko E, et al.
      Abstract: The rapidly growing set of GenBank submissions includes sequences that are derived from vouchered specimens. These are associated with culture collections, museums, herbaria and other natural history collections, both living and preserved. Correct identification of the specimens studied, along with a method to associate the sample with its institution, is critical to the outcome of related studies and analyses. The National Center for Biotechnology Information BioCollections Database was established to allow the association of specimen vouchers and related sequence records to their home institutions. This process also allows cross-linking from the home institution for quick identification of all records originating from each collection.Database URL:
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay006
  • TransAtlasDB: an integrated database connecting expression data, metadata
           and variants

    • Authors: Adetunji M; Lamont S, Schmidt C.
      Abstract: High-throughput transcriptome sequencing (RNAseq) is the universally applied method for target-free transcript identification and gene expression quantification, generating huge amounts of data. The difficulty of accessing such data and interpreting results can be a major impediment to postulating suitable hypotheses; thus an innovative storage solution that addresses limitations such as hard disk storage requirements, efficiency and reproducibility is paramount. By offering a uniform data storage and retrieval mechanism, various data can be compared and easily investigated. We present a sophisticated system, TransAtlasDB, which incorporates a hybrid architecture of both relational and NoSQL databases for fast and efficient data storage, processing and querying of large datasets from transcript expression analysis with corresponding metadata, as well as gene-associated variants (such as SNPs) and their predicted gene effects. TransAtlasDB provides a data model for accurate storage of the large amount of data derived from RNAseq analysis and also methods of interacting with the database, either via command-line data management workflows written in Perl, whose functionalities simplify the storage and manipulation of the massive amounts of data generated from RNAseq analysis, or through the web interface. The database application is currently modeled to handle analysis data from agricultural species, and will be expanded to include more species groups. Overall, TransAtlasDB aims to serve as an accessible repository for the large, complex result data files derived from RNAseq gene expression profiling and variant analysis. Database URL:
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay014
  • AllerGAtlas 1.0: a human allergy-related genes database

    • Authors: Liu J; Liu Y, Wang D, et al.
      Abstract: Allergy is a detrimental hypersensitive response to innocuous environmental antigens, caused by the interaction between environmental factors and multiple genetic predispositions. In the past decades, hundreds of allergy-related genes have been identified to illustrate the epidemiology and pathogenesis of allergic diseases, which are associated with better endophenotype, novel biomarkers, early-life risk factors and individual differences in treatment responses. However, the information on all these allergy-related genes is dispersed across thousands of publications. Here, we present AllerGAtlas, a manually curated human allergy-related gene database, which contains 1195 well-annotated human allergy-related genes, determined by text-mining and manual curation. AllerGAtlas will be a valuable bioinformatics resource to search human allergy-related genes and explore their functions in allergy for experimental research. Database URL:
      PubDate: Thu, 22 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay010
  • dbDEPC 3.0: the database of differentially expressed proteins in human
           cancer with multi-level annotation and drug indication

    • Authors: Yang Q; Zhang Y, Cui H, et al.
      Abstract: Proteins are major effectors of biological functions, and differentially expressed proteins (DEPs) are widely reported as biomarkers in pathological mechanism, prognosis prediction as well as treatment targeting in cancer research. High-throughput mass spectrometry (MS) technology has identified large numbers of DEPs in human cancers. By mining published studies with detailed experimental information, dbDEPC was the first database aimed at providing a systematic resource for the storage and query of the DEPs generated by MS in cancer research. It was updated to dbDEPC 2.0 in 2012. Here, we provide another updated version of dbDEPC, with improved database contents and an enhanced web interface. The current version, dbDEPC 3.0, contains 11 669 unique DEPs in 26 different cancer types. Multi-level annotations of DEPs are introduced for the first time, including cancer-related peptide amino acid variations, post-translational modifications and drug information. Moreover, these multi-level annotations can be displayed in biological networks, which can benefit integrative analysis. Finally, an online enrichment analysis tool has been developed to support KEGG enrichment analysis and to browse the relationship between a protein list of interest and known DEPs in KEGG pathways. In summary, dbDEPC 3.0 provides a comprehensive resource for accessing integrated and highly annotated DEPs in human cancer. Database URL:
      PubDate: Thu, 22 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay015
  • Identification of errors in the IEDB using ontologies

    • Authors: Vita R; Overton J, Peters B.
      Abstract: The Immune Epitope Database (IEDB) is a free online resource that has manually curated over 18 500 references from the scientific literature. Our database presents experimental data relating to the recognition of immune epitopes by the adaptive immune system in a structured, searchable manner. In order to be consistent and accurate in our data representation across many different journals, authors and curators, we have implemented several quality control measures, such as curation rules, controlled vocabularies and links to external ontologies and other resources. Ontologies and other resources have greatly benefited the IEDB through improved search interfaces, easier curation practices, interoperability between the IEDB and other databases and the identification of errors within our dataset. Here, we elaborate on how ontology mapping and usage can help find and correct errors in a manually curated database. Database URL:
      PubDate: Thu, 22 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay005
  • GAN: a platform of genomics and genetics analysis and application in
           Nicotiana

    • Authors: Yang S; Zhang X, Li H, et al.
      Abstract: Nicotiana is an important Solanaceae genus, and plays a significant role in modern biological research. Massive Nicotiana biological data have emerged from in-depth genomics and genetics studies. From big data to big discovery, large-scale analysis and application with new platforms is critical. Based on data accumulation, a comprehensive platform of Genomics and Genetics Analysis and Application in Nicotiana (GAN) has been developed, and is publicly available. GAN consists of four main sections: (i) Sources: a total of 5267 germplasm lines, along with detailed descriptions of associated characteristics, are all available on the Germplasm page, which can be queried using eight different inquiry modes. Seven fully sequenced species with accompanying sequences and detailed genomic annotation are available on the Genomics page. (ii) Genetics: detailed descriptions of 10 genetic linkage maps, constructed by different parents, 2239 KEGG metabolic pathway maps and 209 945 gene families across all catalogued genes, along with two co-linearity maps combining N. tabacum with available tomato and potato linkage maps are available here. Furthermore, 3 963 119 genome-SSRs, 10 621 016 SNPs, 12 388 PIPs and 102 895 reverse transcription-polymerase chain reaction primers are all available to be used and searched on the Markers page. (iii) Tools: the genome browser JBrowse and five useful online bioinformatics software tools, Blast, Primer3, SSR-detect, Nucl-Protein and E-PCR, are provided on the JBrowse and Tools pages. (iv) Auxiliary: all the datasets are shown on a Statistics page, and are available for download on a Download page. In addition, the user’s manual is provided on a Manual page in English and Chinese. GAN provides a user-friendly Web interface for searching, browsing and downloading the genomics and genetics datasets in Nicotiana. As far as we can ascertain, GAN is the most comprehensive source of bio-data available, and the most applicable resource for breeding, gene mapping, gene cloning, the study of the origin and evolution of polyploidy, and related studies in Nicotiana. Database URL:
      PubDate: Wed, 21 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay001
  • FAIR principles and the IEDB: short-term improvements and a long-term
           vision of OBO-foundry mediated machine-actionable interoperability

    • Authors: Vita R; Overton J, Mungall C, et al.
      Abstract: The Immune Epitope Database (IEDB) has the mission to make published experimental data relating to the recognition of immune epitopes easily available to the scientific public. By presenting curated data in a searchable database, we have liberated it from the tables and figures of journal articles, making it more accessible and usable by immunologists. Recently, the principles of Findability, Accessibility, Interoperability and Reusability have been formulated as goals that data repositories should meet to enhance the usefulness of their data holdings. We here examine how the IEDB complies with these principles and identify broad areas of success, but also areas for improvement. We describe short-term improvements to the IEDB that are being implemented now, as well as a long-term vision of true ‘machine-actionable interoperability’, which we believe will require community agreement on standardization of knowledge representation that can be built on top of the shared use of ontologies.
      PubDate: Mon, 19 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bax105
  • miRToolsGallery: a tag-based and rankable microRNA bioinformatics
           resources database portal

    • Authors: Chen L; Heikkinen L, Wang C, et al.
      Abstract: Hundreds of bioinformatics tools have been developed for microRNA (miRNA) investigations including those used for identification, target prediction, structure and expression profile analysis. However, finding the correct tool for a specific application requires the tedious and laborious process of locating, downloading, testing and validating the appropriate tool from a group of nearly a thousand. In order to facilitate this process, we developed a novel database portal named miRToolsGallery. We constructed the portal by manually curating > 950 miRNA analysis tools and resources. In the portal, the search for an appropriate tool is expedited because the collection is searchable, filterable and rankable. The ranking feature is vital to quickly identify and prioritize the more useful tools over the obscure ones. Tools are ranked via different criteria including the PageRank algorithm, date of publication, number of citations, average of votes and number of publications. miRToolsGallery provides links and data for the comprehensive collection of currently available miRNA tools with a ranking function which can be adjusted using different criteria according to specific requirements. Database URL:
      PubDate: Mon, 19 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay004
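The abstract above names PageRank as one of the ranking criteria but does not say how it is computed in the portal. The following is only a minimal sketch of standard power-iteration PageRank, assuming tools are nodes in a small directed link graph; the tool names, the `links` dictionary, the damping factor and the iteration count are all hypothetical and not taken from miRToolsGallery.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical tool names and links, for illustration only.
links = {
    "tool_a": ["tool_b", "tool_c"],
    "tool_b": ["tool_c"],
    "tool_c": ["tool_a"],
    "tool_d": ["tool_c"],
}
ranks = pagerank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

A portal that combines several criteria, as described above, would presumably treat a score like this as one sortable column among others rather than as the sole ordering.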
  • Fungal Stress Database (FSD)––a repository of fungal stress
           physiological data

    • Authors: Orosz E; van de Wiele N, Emri T, et al.
      Abstract: The construction of the Fungal Stress Database (FSD) was initiated and fueled by two major goals. First, some outstandingly important groups of filamentous fungi including the aspergilli possess remarkable capabilities to adapt to a wide spectrum of environmental stress conditions but the underlying mechanisms of this stress tolerance remain to be elucidated. Furthermore, the lack of any satisfactory interlaboratory standardization of stress assays, e.g. the widely used stress agar plate experiments, often hinders the direct comparison and discussion of stress physiological data gained for various fungal species by different research groups. In order to overcome these difficulties and to promote multilevel, e.g. combined comparative physiology-based and comparative genomics-based, stress research in filamentous fungi, we constructed FSD, which currently stores 1412 photos of Aspergillus colonies grown under precisely defined stress conditions. This study involved a total of 18 Aspergillus strains representing 17 species with two different strains for Aspergillus niger and covered six different stress conditions. Stress treatments were selected considering the frequency of various stress tolerance studies published in the last decade in the aspergilli and included oxidative (H2O2, menadione sodium bisulphite), high-osmolarity (NaCl, sorbitol), cell wall integrity (Congo Red) and heavy metal (CdCl2) stress exposures. In the future, we would like to expand this database to accommodate further fungal species and stress treatments. URL:
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay009
  • OliveNet™: a comprehensive library of compounds from Olea europaea

    • Authors: Bonvino N; Liang J, McCord E, et al.
      Abstract: Accumulated epidemiological, clinical and experimental evidence has indicated the beneficial health effects of the Mediterranean diet, which is typified by the consumption of virgin olive oil (VOO) as a main source of dietary fat. At the cellular level, compounds derived from various olive (Olea europaea) matrices have demonstrated potent antioxidant and anti-inflammatory effects, which are thought to account, at least in part, for their biological effects. Research efforts are expanding into the characterization of compounds derived from Olea europaea; however, the considerable diversity and complexity of the vast array of chemical compounds have made their precise identification and quantification challenging. As such, only a relatively small subset of olive-derived compounds has been explored for their biological activity and potential health effects to date. Although there is adequate information describing the identification or isolation of olive-derived compounds, this information is not easily searchable, especially when attempting to acquire chemical or biological properties. Therefore, we have created the OliveNet™ database containing a comprehensive catalogue of compounds identified from matrices of the olive, including the fruit, leaf and VOO, as well as in the wastewater and pomace accrued during oil production. From a total of 752 compounds, chemical analysis was sufficient for 676 individual compounds, which have been included in the database. The database is curated and comprehensively referenced, containing information for the 676 compounds, which are divided into 13 main classes and 47 subclasses. Importantly, with respect to current research trends, the database includes 222 olive phenolics, which are divided into 13 subclasses. To our knowledge, OliveNet™ is currently the only curated open access database with a comprehensive collection of compounds associated with Olea europaea. Database URL:
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay016
  • TISSUES 2.0: an integrative web resource on mammalian tissue expression

    • Authors: Palasca O; Santos A, Stolte C, et al.
      Abstract: Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems—model organisms—into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL:
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay003
  • Worldwide Protein Data Bank biocuration supporting open access to
           high-quality 3D structural biology data

    • Authors: Young J; Westbrook J, Feng Z, et al.
      Abstract: The Protein Data Bank (PDB) is the single global repository for experimentally determined 3D structures of biological macromolecules and their complexes with ligands. The worldwide PDB (wwPDB) is the international collaboration that manages the PDB archive according to the FAIR principles: Findability, Accessibility, Interoperability and Reusability. The wwPDB recently developed OneDep, a unified tool for deposition, validation and biocuration of structures of biological macromolecules. All data deposited to the PDB undergo critical review by wwPDB Biocurators. This article outlines the importance of biocuration for structural biology data deposited to the PDB and describes wwPDB biocuration processes and the role of expert Biocurators in sustaining a high-quality archive. Structural data submitted to the PDB are examined for self-consistency, standardized using controlled vocabularies, cross-referenced with other biological data resources and validated for scientific/technical accuracy. We illustrate how biocuration is integral to PDB data archiving, as it facilitates accurate, consistent and comprehensive representation of biological structure data, allowing efficient and effective usage by research scientists, educators, students and the curious public worldwide. Database URL:
      PubDate: Wed, 07 Feb 2018 00:00:00 GMT
      DOI: 10.1093/database/bay002
  • FishTEDB: a collective database of transposable elements identified in the
           complete genomes of fish

    • Authors: Shao F; Wang J, Xu H, et al.
      Abstract: Transposable elements (TEs) are important for host gene regulation and genome evolution. Consensus sequences of TEs can assist investigators in accelerating studies on TE origins, amplification, functions and evolution, as well as comparative analyses and prediction of TEs in different species. In evolution, physiology, ecology and heredity research, fish are important models. However, to date, no comprehensive resource for TE consensus sequences exists for fish. Here, we collected genome-wide data and developed a novel database, FishTEDB, including 27 bony fishes, 1 cartilaginous fish, 1 lamprey and 1 lancelet. De novo, structure-based and homology-based approaches were combined to detect TEs. The database is open-source and user-friendly, and users can browse, search and download all data. FishTEDB also provides GetORF, BLAST and HMMER tools to analyze sequences. Database URL:
      PubDate: Tue, 16 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax106
  • CellExpress: a comprehensive microarray-based cancer cell line and
           clinical sample gene expression analysis online system

    • Authors: Lee Y; Lee C, Lai L, et al.
      Abstract: With the advancement of high-throughput technologies, gene expression profiles in cell lines and clinical samples are widely available in the public domain for research. However, a challenge arises when trying to perform a systematic and comprehensive analysis across independent datasets. To address this issue, we developed a web-based system, CellExpress, for analyzing the gene expression levels in more than 4000 cancer cell lines and clinical samples obtained from public datasets and user-submitted data. First, a normalization algorithm can be utilized to reduce the systematic biases across independent datasets. Next, a similarity assessment of gene expression profiles can be achieved through a dynamic dot plot, along with a distance matrix obtained from principal component analysis. Subsequently, differentially expressed genes can be visualized using hierarchical clustering. Several statistical tests and analytical algorithms are implemented in the system for dissecting gene expression changes based on the groupings defined by users. Lastly, users are able to upload their own microarray and/or next-generation sequencing data to perform a comparison of their gene expression patterns, which can help classify user data, such as stem cells, into different tissue types. In conclusion, CellExpress is a user-friendly tool that provides a comprehensive analysis of gene expression levels in both cell lines and clinical samples. The website is freely available, and the source code is available under the MIT License. Database URL:
      PubDate: Fri, 12 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax101
  • A generic workflow for effective sampling of environmental vouchers with
           UUID assignment and image processing

    • Authors: Triebel D; Reichert W, Bosert S, et al.
      Abstract: Sampling of biological and environmental vouchers in the field is rather challenging, particularly under adverse habitat conditions and when various activities need to be handled simultaneously. The workflow described here includes five procedural steps, which result in professional sampling and the generation of universally identifiable data. In preparation for the field campaign, sample containers need to be labelled with universally unique identifier (UUID)-QR-codes. At the collection site, labelled containers, sampled material and attached supplementary information are imaged using a GNSS- or GPS-enabled smartphone or camera. Image processing, tagging and data storage as a CSV text file are subsequently carried out in a field station or laboratory. For this purpose, the newly implemented tool DiversityImageInspector is used. It addresses combined image and data processing in such a context including the extraction of the QR-coded UUID from the image content and the extraction of geodata and time information from the Exif image header. The import of the resulting data files into a relational database or another kind of data management system is optional but recommended. If applied, the import might be guided by a data transformation tool with compliant schema as described here. The new approach is discussed also with regard to implications for virtual research environments and data publication networks. Database URL:
      PubDate: Tue, 09 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax096
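Two steps of the workflow above, assigning a UUID to each sample container and storing the resulting records as a CSV text file, can be sketched with the Python standard library. This is an illustration under stated assumptions and not the DiversityImageInspector implementation: the column names, the coordinates and the timestamp are hypothetical stand-ins for values that the real workflow reads from the QR code and the Exif header.

```python
import csv
import io
import uuid


def make_labels(count):
    """Return `count` fresh random (version 4) UUID strings, one per container."""
    return [str(uuid.uuid4()) for _ in range(count)]


def write_records(records, handle):
    """Write sampling records as CSV rows keyed by the container UUID."""
    fields = ["uuid", "latitude", "longitude", "timestamp"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


labels = make_labels(3)
# Hypothetical coordinates and time, standing in for values read from Exif.
records = [
    {"uuid": label, "latitude": 48.16, "longitude": 11.50,
     "timestamp": "2018-01-09T10:00:00Z"}
    for label in labels
]
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

Because the identifiers are generated before the field campaign, they do not depend on any central registry, which fits the abstract's aim of universally identifiable data.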
  • YAAM: Yeast Amino Acid Modifications Database

    • Authors: Ledesma L; Sandoval E, Cruz-Martínez U, et al.
      Abstract: Proteins are dynamic molecules that regulate a myriad of cellular functions; these functions may be regulated by protein post-translational modifications (PTMs) that mediate the activity, localization and interaction partners of proteins. Thus, understanding the meaning of a single PTM or the combination of several of them is essential to unravel the mechanisms of protein regulation. Yeast Amino Acid Modification (YAAM) is a comprehensive database that contains information on 121 921 residues of proteins, which are post-translationally modified in the yeast model Saccharomyces cerevisiae. All the PTMs contained in YAAM have been confirmed experimentally. The YAAM database maps PTM residues in a 3D canvas for 680 proteins with a known 3D structure. The structure can be visualized and manipulated using the most common web browsers without the need for any additional plugin. The aim of our database is to retrieve and organize data about the location of modified amino acids, providing information in a concise but comprehensive and user-friendly way, enabling users to find relevant information on PTMs. Given that PTMs influence almost all aspects of the biology of both healthy and diseased cells, identifying and understanding PTMs is critical in the study of molecular and cell biology. YAAM allows users to perform multiple searches, up to three modifications at the same residue, giving the possibility to explore possible regulatory mechanisms for some proteins. Using the YAAM search engine, we found three different PTMs of lysine residues involved in protein translation. This suggests an important regulatory mechanism for protein translation that needs to be further studied. Database URL:
      PubDate: Tue, 09 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax099
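The search for up to three modifications at the same residue, mentioned in the abstract above, amounts to a set-containment query. The sketch below shows one way such a query could work; it is not YAAM's code or schema, and the record layout, protein names and positions are invented for illustration.

```python
from collections import defaultdict


def residues_with_all(records, wanted):
    """Return (protein, position) keys annotated with every modification in `wanted`."""
    if not 1 <= len(wanted) <= 3:
        raise ValueError("expected between one and three modifications")
    by_residue = defaultdict(set)
    for protein, position, modification in records:
        by_residue[(protein, position)].add(modification)
    return sorted(key for key, mods in by_residue.items() if set(wanted) <= mods)


# Hypothetical records: (protein, residue position, modification).
records = [
    ("PROT1", 10, "acetylation"),
    ("PROT1", 10, "ubiquitination"),
    ("PROT1", 10, "succinylation"),
    ("PROT1", 42, "acetylation"),
    ("PROT2", 7, "acetylation"),
    ("PROT2", 7, "ubiquitination"),
]
hits = residues_with_all(records, ["acetylation", "ubiquitination"])
print(hits)
```

Grouping by residue first keeps the query cost independent of how many modifications are requested.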
  • To increase trust, change the social design behind aggregated biodiversity
           data

    • Authors: Franz N; Sterner B.
      Abstract: Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors ‘at the source.’ We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the sources to the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between individual data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies—frequently called ‘backbones’—they generate, and which are in effect novel classification theories that operate at the core of data-structuring process. The Darwin Core standard for sharing occurrence records plays an under-appreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, potentially leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e. unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.
      PubDate: Thu, 04 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax100
  • HTT-DB: new features and updates

    • Authors: Dotto B; Carvalho E, da Silva A, et al.
      Abstract: Horizontal Transfer (HT) of genetic material between species is a common phenomenon among Bacteria and Archaea species and several databases are available for information retrieval and data mining. However, little attention has been given to this phenomenon among eukaryotic species mainly due to the lower proportion of these events. In recent years, a large number of new HT events involving eukaryotic species were reported in the literature, highlighting the need for a common repository to keep the scientific community up to date and describe overall trends. Recently, we published the first HT database focused on HT of transposable elements among eukaryotes: the Horizontal Transposon Transfer DataBase: Database URL: ( 8080/httdatabase/). Here, we present new features and updates of this unique database: (i) its expansion to include virus-host exchange of genetic material, which we called Horizontal Virus Transfer (HVT) and (ii) the availability of a web server for HT detection, where we implemented the online version of vertical and horizontal inheritance consistence analysis (VHICA), an R package developed for HT detection. These improvements will help researchers to navigate through known HVT cases, make data-informed decisions and export figures based on keyword searches. Moreover, the availability of VHICA as an online tool will make this software easily reachable even for researchers with little or no computational knowledge, as well as foster our capability to detect new HT events in a wide variety of taxa. Database URL:
      PubDate: Thu, 04 Jan 2018 00:00:00 GMT
      DOI: 10.1093/database/bax102