
Publisher: Oxford University Press   (Total: 406 journals)


Showing 1 - 200 of 406 Journals sorted alphabetically
ACS Symposium Series     Full-text available via subscription   (Followers: 1, SJR: 0.189, CiteScore: 0)
Acta Biochimica et Biophysica Sinica     Hybrid Journal   (Followers: 5, SJR: 0.79, CiteScore: 2)
Adaptation     Hybrid Journal   (Followers: 9, SJR: 0.143, CiteScore: 0)
Advances in Nutrition     Hybrid Journal   (Followers: 54, SJR: 2.196, CiteScore: 5)
Aesthetic Surgery J.     Hybrid Journal   (Followers: 6, SJR: 1.434, CiteScore: 1)
Aesthetic Surgery J. Open Forum     Open Access  
African Affairs     Hybrid Journal   (Followers: 66, SJR: 1.869, CiteScore: 2)
Age and Ageing     Hybrid Journal   (Followers: 90, SJR: 1.989, CiteScore: 4)
Alcohol and Alcoholism     Hybrid Journal   (Followers: 18, SJR: 1.376, CiteScore: 3)
American Entomologist     Full-text available via subscription   (Followers: 8)
American Historical Review     Hybrid Journal   (Followers: 179, SJR: 0.467, CiteScore: 1)
American J. of Agricultural Economics     Hybrid Journal   (Followers: 44, SJR: 2.113, CiteScore: 3)
American J. of Clinical Nutrition     Hybrid Journal   (Followers: 187, SJR: 3.438, CiteScore: 6)
American J. of Epidemiology     Hybrid Journal   (Followers: 195, SJR: 2.713, CiteScore: 3)
American J. of Health-System Pharmacy     Full-text available via subscription   (Followers: 55, SJR: 0.595, CiteScore: 1)
American J. of Hypertension     Hybrid Journal   (Followers: 26, SJR: 1.322, CiteScore: 3)
American J. of Jurisprudence     Hybrid Journal   (Followers: 19, SJR: 0.281, CiteScore: 1)
American J. of Legal History     Full-text available via subscription   (Followers: 9, SJR: 0.116, CiteScore: 0)
American Law and Economics Review     Hybrid Journal   (Followers: 28, SJR: 1.053, CiteScore: 1)
American Literary History     Hybrid Journal   (Followers: 17, SJR: 0.391, CiteScore: 0)
Analysis     Hybrid Journal   (Followers: 23, SJR: 1.038, CiteScore: 1)
Animal Frontiers     Hybrid Journal   (Followers: 1)
Annals of Behavioral Medicine     Hybrid Journal   (Followers: 16, SJR: 1.423, CiteScore: 3)
Annals of Botany     Hybrid Journal   (Followers: 38, SJR: 1.721, CiteScore: 4)
Annals of Oncology     Hybrid Journal   (Followers: 55, SJR: 5.599, CiteScore: 9)
Annals of the Entomological Society of America     Full-text available via subscription   (Followers: 11, SJR: 0.722, CiteScore: 1)
Annals of Work Exposures and Health     Hybrid Journal   (Followers: 34, SJR: 0.728, CiteScore: 2)
Antibody Therapeutics     Open Access  
AoB Plants     Open Access   (Followers: 4, SJR: 1.28, CiteScore: 3)
Applied Economic Perspectives and Policy     Hybrid Journal   (Followers: 17, SJR: 0.858, CiteScore: 2)
Applied Linguistics     Hybrid Journal   (Followers: 60, SJR: 2.987, CiteScore: 3)
Applied Mathematics Research eXpress     Hybrid Journal   (Followers: 1, SJR: 1.241, CiteScore: 1)
Arbitration Intl.     Full-text available via subscription   (Followers: 21)
Arbitration Law Reports and Review     Hybrid Journal   (Followers: 14)
Archives of Clinical Neuropsychology     Hybrid Journal   (Followers: 30, SJR: 0.731, CiteScore: 2)
Aristotelian Society Supplementary Volume     Hybrid Journal   (Followers: 3)
Arthropod Management Tests     Hybrid Journal   (Followers: 2)
Astronomy & Geophysics     Hybrid Journal   (Followers: 44, SJR: 0.146, CiteScore: 0)
Behavioral Ecology     Hybrid Journal   (Followers: 53, SJR: 1.871, CiteScore: 3)
Bioinformatics     Hybrid Journal   (Followers: 347, SJR: 6.14, CiteScore: 8)
Biology Methods and Protocols     Hybrid Journal  
Biology of Reproduction     Full-text available via subscription   (Followers: 10, SJR: 1.446, CiteScore: 3)
Biometrika     Hybrid Journal   (Followers: 20, SJR: 3.485, CiteScore: 2)
BioScience     Hybrid Journal   (Followers: 29, SJR: 2.754, CiteScore: 4)
Bioscience Horizons : The National Undergraduate Research J.     Open Access   (Followers: 2, SJR: 0.146, CiteScore: 0)
Biostatistics     Hybrid Journal   (Followers: 17, SJR: 1.553, CiteScore: 2)
BJA : British J. of Anaesthesia     Hybrid Journal   (Followers: 188, SJR: 2.115, CiteScore: 3)
BJA Education     Hybrid Journal   (Followers: 66)
Brain     Hybrid Journal   (Followers: 70, SJR: 5.858, CiteScore: 7)
Briefings in Bioinformatics     Hybrid Journal   (Followers: 49, SJR: 2.505, CiteScore: 5)
Briefings in Functional Genomics     Hybrid Journal   (Followers: 3, SJR: 2.15, CiteScore: 3)
British J. for the Philosophy of Science     Hybrid Journal   (Followers: 38, SJR: 2.161, CiteScore: 2)
British J. of Aesthetics     Hybrid Journal   (Followers: 25, SJR: 0.508, CiteScore: 1)
British J. of Criminology     Hybrid Journal   (Followers: 603, SJR: 1.828, CiteScore: 3)
British J. of Social Work     Hybrid Journal   (Followers: 86, SJR: 1.019, CiteScore: 2)
British Medical Bulletin     Hybrid Journal   (Followers: 6, SJR: 1.355, CiteScore: 3)
British Yearbook of Intl. Law     Hybrid Journal   (Followers: 35)
Bulletin of the London Mathematical Society     Hybrid Journal   (Followers: 4, SJR: 1.376, CiteScore: 1)
Cambridge J. of Economics     Hybrid Journal   (Followers: 71, SJR: 0.764, CiteScore: 2)
Cambridge J. of Regions, Economy and Society     Hybrid Journal   (Followers: 12, SJR: 2.438, CiteScore: 4)
Cambridge Quarterly     Hybrid Journal   (Followers: 10, SJR: 0.104, CiteScore: 0)
Capital Markets Law J.     Hybrid Journal   (Followers: 2, SJR: 0.222, CiteScore: 0)
Carcinogenesis     Hybrid Journal   (Followers: 2, SJR: 2.135, CiteScore: 5)
Cardiovascular Research     Hybrid Journal   (Followers: 14, SJR: 3.002, CiteScore: 5)
Cerebral Cortex     Hybrid Journal   (Followers: 52, SJR: 3.892, CiteScore: 6)
CESifo Economic Studies     Hybrid Journal   (Followers: 23, SJR: 0.483, CiteScore: 1)
Chemical Senses     Hybrid Journal   (Followers: 1, SJR: 1.42, CiteScore: 3)
Children and Schools     Hybrid Journal   (Followers: 6, SJR: 0.246, CiteScore: 0)
Chinese J. of Comparative Law     Hybrid Journal   (Followers: 5, SJR: 0.412, CiteScore: 0)
Chinese J. of Intl. Law     Hybrid Journal   (Followers: 23, SJR: 0.329, CiteScore: 0)
Chinese J. of Intl. Politics     Hybrid Journal   (Followers: 10, SJR: 1.392, CiteScore: 2)
Christian Bioethics: Non-Ecumenical Studies in Medical Morality     Hybrid Journal   (Followers: 10, SJR: 0.183, CiteScore: 0)
Classical Receptions J.     Hybrid Journal   (Followers: 27, SJR: 0.123, CiteScore: 0)
Clean Energy     Open Access   (Followers: 1)
Clinical Infectious Diseases     Hybrid Journal   (Followers: 70, SJR: 5.051, CiteScore: 5)
Communication Theory     Hybrid Journal   (Followers: 25, SJR: 2.424, CiteScore: 3)
Communication, Culture & Critique     Hybrid Journal   (Followers: 28, SJR: 0.222, CiteScore: 1)
Community Development J.     Hybrid Journal   (Followers: 27, SJR: 0.268, CiteScore: 1)
Computer J.     Hybrid Journal   (Followers: 9, SJR: 0.319, CiteScore: 1)
Conservation Physiology     Open Access   (Followers: 3, SJR: 1.818, CiteScore: 3)
Contemporary Women's Writing     Hybrid Journal   (Followers: 9, SJR: 0.121, CiteScore: 0)
Contributions to Political Economy     Hybrid Journal   (Followers: 6, SJR: 0.906, CiteScore: 1)
Critical Values     Full-text available via subscription  
Current Developments in Nutrition     Open Access   (Followers: 3)
Current Legal Problems     Hybrid Journal   (Followers: 29)
Current Zoology     Full-text available via subscription   (Followers: 3, SJR: 1.164, CiteScore: 2)
Database : The J. of Biological Databases and Curation     Open Access   (Followers: 9, SJR: 1.791, CiteScore: 3)
Digital Scholarship in the Humanities     Hybrid Journal   (Followers: 14, SJR: 0.259, CiteScore: 1)
Diplomatic History     Hybrid Journal   (Followers: 21, SJR: 0.45, CiteScore: 1)
DNA Research     Open Access   (Followers: 5, SJR: 2.866, CiteScore: 6)
Dynamics and Statistics of the Climate System     Open Access   (Followers: 4)
Early Music     Hybrid Journal   (Followers: 17, SJR: 0.139, CiteScore: 0)
Econometrics J.     Hybrid Journal   (Followers: 32, SJR: 2.926, CiteScore: 1)
Economic J.     Hybrid Journal   (Followers: 116, SJR: 5.161, CiteScore: 3)
Economic Policy     Hybrid Journal   (Followers: 48, SJR: 3.584, CiteScore: 3)
ELT J.     Hybrid Journal   (Followers: 24, SJR: 0.942, CiteScore: 1)
English Historical Review     Hybrid Journal   (Followers: 56, SJR: 0.612, CiteScore: 1)
English: J. of the English Association     Hybrid Journal   (Followers: 18, SJR: 0.1, CiteScore: 0)
Environmental Entomology     Full-text available via subscription   (Followers: 11, SJR: 0.818, CiteScore: 2)
Environmental Epigenetics     Open Access   (Followers: 2)
Environmental History     Hybrid Journal   (Followers: 26, SJR: 0.408, CiteScore: 1)
EP-Europace     Hybrid Journal   (Followers: 3, SJR: 2.748, CiteScore: 4)
Epidemiologic Reviews     Hybrid Journal   (Followers: 9, SJR: 4.505, CiteScore: 8)
ESHRE Monographs     Hybrid Journal  
Essays in Criticism     Hybrid Journal   (Followers: 20, SJR: 0.113, CiteScore: 0)
European Heart J.     Hybrid Journal   (Followers: 66, SJR: 9.315, CiteScore: 9)
European Heart J. - Cardiovascular Imaging     Hybrid Journal   (Followers: 10, SJR: 3.625, CiteScore: 3)
European Heart J. - Cardiovascular Pharmacotherapy     Full-text available via subscription   (Followers: 2)
European Heart J. - Quality of Care and Clinical Outcomes     Hybrid Journal  
European Heart J. : Case Reports     Open Access  
European Heart J. Supplements     Hybrid Journal   (Followers: 8, SJR: 0.223, CiteScore: 0)
European J. of Cardio-Thoracic Surgery     Hybrid Journal   (Followers: 9, SJR: 1.681, CiteScore: 2)
European J. of Intl. Law     Hybrid Journal   (Followers: 205, SJR: 0.694, CiteScore: 1)
European J. of Orthodontics     Hybrid Journal   (Followers: 5, SJR: 1.279, CiteScore: 2)
European J. of Public Health     Hybrid Journal   (Followers: 19, SJR: 1.36, CiteScore: 2)
European Review of Agricultural Economics     Hybrid Journal   (Followers: 10, SJR: 1.172, CiteScore: 2)
European Review of Economic History     Hybrid Journal   (Followers: 30, SJR: 0.702, CiteScore: 1)
European Sociological Review     Hybrid Journal   (Followers: 43, SJR: 2.728, CiteScore: 3)
Evolution, Medicine, and Public Health     Open Access   (Followers: 12)
Family Practice     Hybrid Journal   (Followers: 15, SJR: 1.018, CiteScore: 2)
FEMS Microbiology Ecology     Hybrid Journal   (Followers: 16, SJR: 1.492, CiteScore: 4)
FEMS Microbiology Letters     Hybrid Journal   (Followers: 28, SJR: 0.79, CiteScore: 2)
FEMS Microbiology Reviews     Hybrid Journal   (Followers: 33, SJR: 7.063, CiteScore: 13)
FEMS Yeast Research     Hybrid Journal   (Followers: 14, SJR: 1.308, CiteScore: 3)
Food Quality and Safety     Open Access   (Followers: 1)
Foreign Policy Analysis     Hybrid Journal   (Followers: 24, SJR: 1.425, CiteScore: 1)
Forest Science     Hybrid Journal   (Followers: 8, SJR: 0.89, CiteScore: 2)
Forestry: An Intl. J. of Forest Research     Hybrid Journal   (Followers: 16, SJR: 1.133, CiteScore: 3)
Forum for Modern Language Studies     Hybrid Journal   (Followers: 6, SJR: 0.104, CiteScore: 0)
French History     Hybrid Journal   (Followers: 34, SJR: 0.118, CiteScore: 0)
French Studies     Hybrid Journal   (Followers: 21, SJR: 0.148, CiteScore: 0)
French Studies Bulletin     Hybrid Journal   (Followers: 10, SJR: 0.152, CiteScore: 0)
Gastroenterology Report     Open Access   (Followers: 3)
Genome Biology and Evolution     Open Access   (Followers: 16, SJR: 2.578, CiteScore: 4)
Geophysical J. Intl.     Hybrid Journal   (Followers: 39, SJR: 1.506, CiteScore: 3)
German History     Hybrid Journal   (Followers: 23, SJR: 0.161, CiteScore: 0)
GigaScience     Open Access   (Followers: 6, SJR: 5.022, CiteScore: 7)
Global Summitry     Hybrid Journal   (Followers: 1)
Glycobiology     Hybrid Journal   (Followers: 13, SJR: 1.493, CiteScore: 3)
Health and Social Work     Hybrid Journal   (Followers: 57, SJR: 0.388, CiteScore: 1)
Health Education Research     Hybrid Journal   (Followers: 15, SJR: 0.854, CiteScore: 2)
Health Policy and Planning     Hybrid Journal   (Followers: 24, SJR: 1.512, CiteScore: 2)
Health Promotion Intl.     Hybrid Journal   (Followers: 22, SJR: 0.812, CiteScore: 2)
History Workshop J.     Hybrid Journal   (Followers: 33, SJR: 1.278, CiteScore: 1)
Holocaust and Genocide Studies     Hybrid Journal   (Followers: 28, SJR: 0.105, CiteScore: 0)
Human Communication Research     Hybrid Journal   (Followers: 15, SJR: 2.146, CiteScore: 3)
Human Molecular Genetics     Hybrid Journal   (Followers: 9, SJR: 3.555, CiteScore: 5)
Human Reproduction     Hybrid Journal   (Followers: 75, SJR: 2.643, CiteScore: 5)
Human Reproduction Open     Open Access   (Followers: 1)
Human Reproduction Update     Hybrid Journal   (Followers: 21, SJR: 5.317, CiteScore: 10)
Human Rights Law Review     Hybrid Journal   (Followers: 64, SJR: 0.756, CiteScore: 1)
ICES J. of Marine Science: J. du Conseil     Hybrid Journal   (Followers: 58, SJR: 1.591, CiteScore: 3)
ICSID Review : Foreign Investment Law J.     Hybrid Journal   (Followers: 11)
ILAR J.     Hybrid Journal   (Followers: 3, SJR: 1.732, CiteScore: 4)
IMA J. of Applied Mathematics     Hybrid Journal   (SJR: 0.679, CiteScore: 1)
IMA J. of Management Mathematics     Hybrid Journal   (SJR: 0.538, CiteScore: 1)
IMA J. of Mathematical Control and Information     Hybrid Journal   (Followers: 2, SJR: 0.496, CiteScore: 1)
IMA J. of Numerical Analysis - advance access     Hybrid Journal   (SJR: 1.987, CiteScore: 2)
Industrial and Corporate Change     Hybrid Journal   (Followers: 10, SJR: 1.792, CiteScore: 2)
Industrial Law J.     Hybrid Journal   (Followers: 41, SJR: 0.249, CiteScore: 1)
Inflammatory Bowel Diseases     Hybrid Journal   (Followers: 47, SJR: 2.511, CiteScore: 4)
Information and Inference     Free  
Innovation in Aging     Open Access  
Integrative and Comparative Biology     Hybrid Journal   (Followers: 9, SJR: 1.319, CiteScore: 2)
Integrative Biology     Full-text available via subscription   (Followers: 6, SJR: 1.36, CiteScore: 3)
Integrative Organismal Biology     Open Access  
Interacting with Computers     Hybrid Journal   (Followers: 11, SJR: 0.292, CiteScore: 1)
Interactive CardioVascular and Thoracic Surgery     Hybrid Journal   (Followers: 7, SJR: 0.762, CiteScore: 1)
Intl. Affairs     Hybrid Journal   (Followers: 68, SJR: 1.505, CiteScore: 3)
Intl. Data Privacy Law     Hybrid Journal   (Followers: 27)
Intl. Health     Hybrid Journal   (Followers: 6, SJR: 0.851, CiteScore: 2)
Intl. Immunology     Hybrid Journal   (Followers: 3, SJR: 2.167, CiteScore: 4)
Intl. J. for Quality in Health Care     Hybrid Journal   (Followers: 36, SJR: 1.348, CiteScore: 2)
Intl. J. of Constitutional Law     Hybrid Journal   (Followers: 65, SJR: 0.601, CiteScore: 1)
Intl. J. of Epidemiology     Hybrid Journal   (Followers: 260, SJR: 3.969, CiteScore: 5)
Intl. J. of Law and Information Technology     Hybrid Journal   (Followers: 5, SJR: 0.202, CiteScore: 1)
Intl. J. of Law, Policy and the Family     Hybrid Journal   (Followers: 28, SJR: 0.223, CiteScore: 1)
Intl. J. of Lexicography     Hybrid Journal   (Followers: 10, SJR: 0.285, CiteScore: 1)
Intl. J. of Low-Carbon Technologies     Open Access   (Followers: 1, SJR: 0.403, CiteScore: 1)
Intl. J. of Neuropsychopharmacology     Open Access   (Followers: 3, SJR: 1.808, CiteScore: 4)
Intl. J. of Public Opinion Research     Hybrid Journal   (Followers: 11, SJR: 1.545, CiteScore: 1)
Intl. J. of Refugee Law     Hybrid Journal   (Followers: 39, SJR: 0.389, CiteScore: 1)
Intl. J. of Transitional Justice     Hybrid Journal   (Followers: 11, SJR: 0.724, CiteScore: 2)
Intl. Mathematics Research Notices     Hybrid Journal   (Followers: 1, SJR: 2.168, CiteScore: 1)
Intl. Political Sociology     Hybrid Journal   (Followers: 40, SJR: 1.465, CiteScore: 3)
Intl. Relations of the Asia-Pacific     Hybrid Journal   (Followers: 24, SJR: 0.401, CiteScore: 1)
Intl. Studies Perspectives     Hybrid Journal   (Followers: 9, SJR: 0.983, CiteScore: 1)
Intl. Studies Quarterly     Hybrid Journal   (Followers: 50, SJR: 2.581, CiteScore: 2)
Intl. Studies Review     Hybrid Journal   (Followers: 25, SJR: 1.201, CiteScore: 1)
ISLE: Interdisciplinary Studies in Literature and Environment     Hybrid Journal   (Followers: 2, SJR: 0.15, CiteScore: 0)
ITNOW     Hybrid Journal   (Followers: 1, SJR: 0.103, CiteScore: 0)
J. of African Economies     Hybrid Journal   (Followers: 17, SJR: 0.533, CiteScore: 1)
J. of American History     Hybrid Journal   (Followers: 46, SJR: 0.297, CiteScore: 1)
J. of Analytical Toxicology     Hybrid Journal   (Followers: 14, SJR: 1.065, CiteScore: 2)
J. of Antimicrobial Chemotherapy     Hybrid Journal   (Followers: 15, SJR: 2.419, CiteScore: 4)
J. of Antitrust Enforcement     Hybrid Journal   (Followers: 1)
J. of Applied Poultry Research     Hybrid Journal   (Followers: 5, SJR: 0.585, CiteScore: 1)
J. of Biochemistry     Hybrid Journal   (Followers: 41, SJR: 1.226, CiteScore: 2)
J. of Breast Imaging     Full-text available via subscription   (Followers: 1)
J. of Burn Care & Research     Hybrid Journal   (Followers: 11, SJR: 0.768, CiteScore: 2)

Database : The Journal of Biological Databases and Curation
Journal Prestige (SJR): 1.791
Citation Impact (citeScore): 3
Number of Followers: 9  

  This is an Open Access journal
ISSN (Online) 1758-0463
Published by Oxford University Press  [406 journals]
  • Re-curation and rational enrichment of knowledge graphs in Biological
           Expression Language

    • Authors: Hoyt C; Domingo-Fernández D, Aldisi R, et al.
      Abstract: The rapid accumulation of new biomedical literature not only causes curated knowledge graphs (KGs) to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich KGs. We have developed two workflows: one for re-curating a given KG to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the KGs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made this workflow freely available at
      PubDate: Fri, 21 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz068
      Issue No: Vol. 2019 (2019)
  • CCRDB: a cancer circRNAs-related database and its application in
           hepatocellular carcinoma-related circRNAs

    • Authors: Liu Q; Cai Y, Xiong H, et al.
      Abstract: Circular RNAs (circRNAs) are widely expressed in human cells and tissues and can form a covalently closed exon circularization; they have stable patterns and play important regulatory roles in physiological or pathological processes. There is still a lack of a comprehensive disease-related knowledge base for in-depth analysis of circRNAs. In this paper, a cancer circRNAs-related database (CCRDB) was established. The CCRDB’s initial circRNAs data were collected by sequencing experimental data of 10 samples from 5 patients with hepatocellular carcinoma (HCC), where a total of 11 501 circRNAs were found and can easily be expanded by collecting and analyzing external data sources such as circBASE (1). Using CCRDB, we have further studied the relationships between circRNAs and HCC and found that circRNAs (hsa_circ_0002130, hsa_circ_0084615, hsa_circ_0001445, hsa_circ_0001727 and hsa_circ_0001361) and the corresponding genes ID [C3 (2, 3), ASPH (4), SMARCA5 (5), ZKSCAN1 (6) and FNDC3B (7)], respectively, might be the potential biomarker targets for HCC. Furthermore, our experiment also found that some new circRNAs chromosome sites chr12:23998917 24048958 and chr16:72090429 72093087 and the corresponding genes ID (SOX5 (8) and HP (9), respectively), might be the potential biomarker targets for HCC. These results indicate that CCRDB can effectively reveal the relationships between circRNAs and HCC. As the first circRNAs database to provide analysis and comparison functions, it is of great significance for researchers to further study the rules of circRNAs, to understand the causes of circRNAs in disease discovery and to find target genes for therapeutic approaches.
      PubDate: Wed, 19 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz063
      Issue No: Vol. 2019 (2019)
  • AmyloWiki: an integrated database for Bacillus velezensis FZB42, the model
           strain for plant growth-promoting Bacilli

    • Authors: Fan B; Wang C, Ding X, et al.
      Abstract: Since its isolation 20 years ago, many studies have been devoted to Bacillus velezensis FZB42 (former name Bacillus amyloliquefaciens subsp. plantarum FZB42), which has been gradually accepted as a model organism for Gram-positive rhizobacteria. FZB42 is different from another widely studied bacterial strain, Bacillus subtilis 168, in its many features that are closely associated with plants. FZB42 represents a large group of Bacillus isolates that are beneficial to plants and of great importance in agriculture. In this work a database for FZB42 named ‘AmyloWiki’ is built to integrate all information of FZB42 available to date. The information includes the genomic, transcriptomic, proteomic and post-translational data as well as FZB42 unique genes, protein regulators, mutant availability, publications, etc. The website is built with PHP and MySQL and supports keyword searching, browsing, data downloading and other functions.
      PubDate: Wed, 19 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz071
      Issue No: Vol. 2019 (2019)
  • SpinachBase: a central portal for spinach genomics

    • Authors: Collins K; Zhao K, Jiao C, et al.
      Abstract: Spinach (Spinacia oleracea L.) is a nutritious vegetable enriched with many essential minerals and vitamins. A reference spinach genome has been recently released, and additional spinach genomic resources are being rapidly developed. Therefore, there is an urgent need for a central database to store, query, analyze and integrate various resources of spinach genomic data. To this end, we developed SpinachBase, which provides centralized public access to genomic data as well as analytical tools to assist research and breeding in spinach. The database currently stores the spinach reference genome sequence, and sequences and comprehensive functional annotations of protein-coding genes predicted from the genome. The database also contains gene expression profiles derived from RNA-Seq experiments as well as highly co-expressed genes and genetic variants called from transcriptome sequences of 120 cultivated and wild Spinacia accessions. Biochemical pathways have been predicted from spinach protein-coding genes and are available through a pathway database (SpinachCyc) within SpinachBase. SpinachBase provides a suite of analysis and visualization tools including a genome browser, sequence similarity searches with BLAST, functional enrichment and functional classification analyses and functions to query and retrieve gene sequences and annotations.
      PubDate: Tue, 18 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz072
      Issue No: Vol. 2019 (2019)
  • ChlamBase: a curated model organism database for the Chlamydia research

    • Authors: Putman T; Hybiske K, Jow D, et al.
      Abstract: This manuscript has been amended to include additional authors who were inadvertently omitted.
      PubDate: Tue, 18 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz091
      Issue No: Vol. 2019 (2019)
  • GrainGenes: centralized small grain resources and digital platform for
           geneticists and breeders

    • Authors: Blake V; Woodhouse M, Lazo G, et al.
      Abstract: GrainGenes is an international centralized repository for curated, peer-reviewed datasets useful to researchers working on wheat, barley, rye and oat. GrainGenes manages genomic, genetic, germplasm and phenotypic datasets through a dynamically generated web interface for facilitated data discovery. Since 1992, GrainGenes has served geneticists and breeders in both the public and private sectors on six continents. Recently, several new datasets were curated into the database along with new tools for analysis. The GrainGenes homepage was enhanced by making it more visually intuitive and by adding links to commonly used pages. Several genome assemblies and genomic tracks are displayed through the genome browsers at GrainGenes, including the Triticum aestivum (bread wheat) cv. ‘Chinese Spring’ IWGSC RefSeq v1.0 genome assembly, the Aegilops tauschii (D genome progenitor) Aet v4.0 genome assembly, the Triticum turgidum ssp. dicoccoides (wild emmer wheat) cv. ‘Zavitan’ WEWSeq v.1.0 genome assembly, a T. aestivum (bread wheat) pangenome, the Hordeum vulgare (barley) cv. ‘Morex’ IBSC genome assembly, the Secale cereale (rye) select ‘Lo7’ assembly, a partial hexaploid Avena sativa (oat) assembly and the Triticum durum cv. ‘Svevo’ (durum wheat) RefSeq Release 1.0 assembly. New genetic maps and markers were added and can be displayed through CMAP. Quantitative trait loci, genetic maps and genes from the Wheat Gene Catalogue are indexed and linked through the Wheat Information System (WheatIS) portal. Training videos were created to help users query and reach the data they need. GSP (Genome Specific Primers) and PIECE2 (Plant Intron Exon Comparison and Evolution) tools were implemented and are available to use. As more small grains reference sequences become available, GrainGenes will play an increasingly vital role in helping researchers improve crops.
      PubDate: Tue, 18 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz065
      Issue No: Vol. 2019 (2019)
  • Using association rule mining and ontologies to generate metadata
           recommendations from multiple biomedical databases

    • Authors: Martínez-Romero M; O'Connor M, Egyedi A, et al.
      Abstract: Metadata—the machine-readable descriptions of the data—are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that will help users to enter metadata quickly and accurately. In this paper, we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations when authoring metadata. The novelties of our method are that it is able to combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: US-based National Center for Biotechnology Information BioSample and European Bioinformatics Institute BioSamples. The results show that our approach is able to use analyses of previously entered metadata coupled with ontology-based mappings to present users with accurate recommendations when authoring metadata.
      PubDate: Mon, 10 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz059
      Issue No: Vol. 2019 (2019)
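
The association-rule idea in the abstract above can be illustrated with a minimal sketch. It is not the CEDAR Workbench implementation: the function names `mine_pair_rules` and `recommend`, the thresholds and the sample records are all hypothetical, and only one-to-one rules between single field/value pairs are mined.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(records, min_support=0.3, min_confidence=0.6):
    """Mine rules (field=value) -> (field=value) from metadata records.

    support    = fraction of records containing both items
    confidence = records containing both items / records containing the left item
    """
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = sorted(record.items())
        item_counts.update(items)
        # Count every ordered pair of distinct items that co-occur in a record.
        pair_counts.update(permutations(items, 2))
    rules = []
    for (lhs, rhs), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[lhs]
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


def recommend(rules, entered, target_field):
    """Rank candidate values for target_field given the values already entered."""
    entered_items = set(entered.items())
    scores = {}
    for lhs, (field, value), _support, confidence in rules:
        if lhs in entered_items and field == target_field:
            scores[value] = max(scores.get(value, 0.0), confidence)
    # Highest confidence first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda v: (-scores[v], v))


# Hypothetical sample metadata, invented for illustration only.
records = [
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "male"},
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "female"},
    {"organism": "Homo sapiens", "tissue": "brain", "sex": "male"},
    {"organism": "Mus musculus", "tissue": "brain", "sex": "male"},
]
rules = mine_pair_rules(records, min_support=0.25, min_confidence=0.5)
print(recommend(rules, {"tissue": "liver"}, "organism"))
```

A production system would mine multi-item antecedents (for example with Apriori or FP-Growth) and, as the abstract describes, map values to ontology terms before counting; this sketch omits both steps.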
  • Chickspress: a resource for chicken gene expression

    • Authors: McCarthy F; Pendarvis K, Cooksey A, et al.
      Abstract: High-throughput sequencing and proteomics technologies are markedly increasing the amount of RNA and peptide data that are available to researchers, which are typically made publicly available via data repositories such as the NCBI Sequence Read Archive and proteome archives, respectively. These data sets contain valuable information about when and where gene products are expressed, but this information is not readily obtainable from archived data sets. Here we report Chickspress, the first publicly available gene expression resource for chicken tissues. Since there is no single source of chicken gene models, Chickspress incorporates both NCBI and Ensembl gene models and links these gene sets with experimental gene expression data and QTL information. By linking gene models from both NCBI and Ensembl gene prediction pipelines, researchers can, for the first time, easily compare gene models from each of these prediction workflows to available experimental data for these products. We use Chickspress data to show the differences between these gene annotation pipelines. Chickspress also provides rapid search, visualization and download capacity for chicken gene sets based upon tissue type, developmental stage and experiment type. This first Chickspress release contains 161 gene expression data sets, including expression of mRNAs, miRNAs, proteins and peptides. We provide several examples demonstrating how researchers may use this resource.
      PubDate: Mon, 10 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz058
      Issue No: Vol. 2019 (2019)
  • A web-based tool for the prediction of rice transcription factor function

    • Authors: Chandran A; Moon S, Yoo Y, et al.
      Abstract: Transcription factors (TFs) are an important class of regulatory molecules. Despite their importance, only a small number of genes encoding TFs have been characterized in Oryza sativa (rice), often because gene duplication and functional redundancy complicate their analysis. To address this challenge, we developed a web-based tool called the Rice Transcription Factor Phylogenomics Database (RTFDB) and demonstrate its application for predicting TF function. The RTFDB hosts transcriptome and co-expression analyses. Sources include high-throughput data from oligonucleotide microarray (Affymetrix and Agilent) as well as RNA-Seq-based expression profiles. We used the RTFDB to identify tissue-specific and stress-related gene expression. Subsequently, 273 genes preferentially expressed in specific tissues or organs, 455 genes showing a differential expression pattern in response to 4 abiotic stresses, 179 genes responsive to infection of various pathogens and 512 genes showing differential accumulation in response to various hormone treatments were identified through the meta-expression analysis. Pairwise Pearson correlation coefficient analysis between paralogous genes in a phylogenetic tree was used to assess their expression collinearity and thereby provides a hint on their genetic redundancy. Integrating transcriptome with the gene evolutionary information reveals the possible functional redundancy or dominance played by paralog genes in a highly duplicated genome such as rice. With this method, we estimated a predominant role for 83.3% (65/78) of the TF or transcriptional regulator genes that had been characterized via loss-of-function studies. In this regard, the proposed method is applicable for functional studies of other plant species with annotated genome.
      PubDate: Thu, 06 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz061
      Issue No: Vol. 2019 (2019)
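
The pairwise Pearson comparison of paralogous genes described in the RTFDB abstract above can be illustrated with a minimal sketch. The gene names and expression values below are made up for illustration, and the helper names (`pearson`, `paralog_correlations`) are hypothetical; this is not the RTFDB implementation, only the standard Pearson formula applied to every pair of profiles in a paralog group.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant profile: correlation is undefined")
    return cov / sqrt(var_x * var_y)


def paralog_correlations(profiles):
    """Return {(gene_a, gene_b): r} for every pair of genes in a paralog group."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical expression values across five tissues (illustrative only).
profiles = {
    "TF_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TF_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # same pattern as TF_a
    "TF_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # opposite pattern
}
correlations = paralog_correlations(profiles)
```

A coefficient near 1 for a pair of paralogs indicates collinear expression, which the abstract treats as a hint of possible redundancy; a low or negative value suggests the paralogs are expressed differently.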
  • Endometriosis Knowledgebase: a gene-based resource on endometriosis

    • Authors: Joseph S; Mahale S.
      Abstract: Endometriosis is a complex, benign, estrogen-dependent gynecological disorder with an incidence of ~10% among women of reproductive age. The implantation and growth of endometrial cells outside the uterus leads to the development of endometriosis. Endometriosis is also associated with comorbid conditions like cardiovascular and autoimmune diseases. The absence of non-invasive diagnostic markers, delayed diagnosis, high risk of recurrence of the disease on surgical removal of the tissue and absence of a definitive cure for endometriosis make it imperative to gain insights into the complex etiology of endometriosis. A plethora of genes identified from blood and endometrial biopsies, involved in different pathways like steroid metabolism, angiogenesis, inflammation, etc., have been associated with endometriosis. However, the exact mechanism and genetic etiology of endometriosis still remain unclear. The polygenic nature of the disease, incongruent phenotypic manifestations in different ethnic populations and information scattered in literature make it difficult to delineate the sub-network of genes that will aid in disease diagnosis and effective treatment. Endometriosis Knowledgebase is a manually curated database with information on genes associated with endometriosis. It holds information on 831 genes, their associated polymorphisms, gene ontologies, pathways and diseases. Genes in the database are enriched in pathways important for cell signaling, immune regulation and reproduction. A genetic overlap is seen between endometriosis and cancers, endocrine/reproductive, nervous system, immune and metabolic diseases. Network analysis of genes in the Endometriosis Knowledgebase helped predict 13 new candidate genes for endometriosis. These genes were found to be enriched in biological processes associated with endometriosis. The Endometriosis Knowledgebase and incorporated tools for gene and sequence-based analysis will benefit both researchers and clinicians working in the realm of reproductive biology.
      PubDate: Wed, 05 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz062
      Issue No: Vol. 2019 (2019)
  • ResMarkerDB: a database of biomarkers of response to antibody therapy in
           breast and colorectal cancer

    • Authors: Pérez-Granado J; Piñero J, Furlong L.
      Abstract: The clinical efficacy of therapeutic monoclonal antibodies for breast and colorectal cancer has greatly contributed to the improvement of patients’ outcomes by individualizing their treatments according to their genomic background. However, primary or acquired resistance to treatment reduces its efficacy. In this context, the identification of biomarkers predictive of drug response would support research and development of new alternative treatments. Biomarkers play a major role in the genomic revolution, supporting disease diagnosis and treatment decision-making. Currently, several molecular biomarkers of treatment response for breast and colorectal cancer have been described. However, information on these biomarkers is scattered across several resources, and needs to be identified, collected and properly integrated to be fully exploited to inform monitoring of drug response in patients. Therefore, there is a need of resources that offer biomarker data in a harmonized manner to the user to support the identification of actionable biomarkers of response to treatment in cancer. ResMarkerDB was developed as a comprehensive resource of biomarkers of drug response in colorectal and breast cancer. It integrates data of biomarkers of drug response from existing repositories, and new data extracted and curated from the literature (referred as ResCur). ResMarkerDB currently features 266 biomarkers of diverse nature. Twenty-five percent of these biomarkers are exclusive of ResMarkerDB. Furthermore, ResMarkerDB is one of the few resources offering non-coding DNA data in response to drug treatment. The database contains more than 500 biomarker-drug-tumour associations, covering more than 100 genes. ResMarkerDB provides a web interface to facilitate the exploration of the current knowledge of biomarkers of response in breast and colorectal cancer. It aims to enhance translational research efforts in identifying actionable biomarkers of drug response in cancer.
      PubDate: Tue, 04 Jun 2019 00:00:00 GMT
      DOI: 10.1093/database/baz060
      Issue No: Vol. 2019 (2019)
  • VigSatDB: genome-wide microsatellite DNA marker database of three species
           of Vigna for germplasm characterization and improvement

    • Authors: Jasrotia R; Yadav P, Iquebal M, et al.
      Abstract: Genus Vigna represented by more than 100 species is a source of nutritious edible seeds and sprouts that are rich sources of protein and dietary supplements. It is further valuable because of therapeutic attributes due to its antioxidant and anti-diabetic properties. A highly diverse and an extremely ecological niche of different species can be valuable genomic resources for productivity enhancement. It is one of the most underutilized crops for food security and animal feeds. In spite of huge species diversity, only three species of Vigna have been sequenced; thus, there is a need for molecular markers for the remaining species. Computational approach of microsatellite marker discovery along with evaluation of polymorphism utilizing available genomic data of different genotypes can be a quick and an economical approach for genomic resource development. Cross-species transferability by e-PCR over available genomes can further prioritize the potential SSR markers, which could be used for genetic diversity and population differentiation of the remaining species saving cost and time. We present VigSatDB—the world’s first comprehensive microsatellite database of genus Vigna, containing >875 K putative microsatellite markers with 772 354 simple and 103 865 compound markers mined from six genome assemblies of three Vigna species, namely, Vigna radiata (Mung bean), Vigna angularis (Adzuki bean) and Vigna unguiculata (Cowpea). It also contains 1976 validated published markers. Markers can be selected on the basis of chromosomes/location specificity, and primers can be generated using Primer3core tool integrated at backend. Efficacy of VigSatDB for microsatellite loci genotyping has been evaluated by 15 markers over a panel of 10 diverse genotype of V. radiata. 
      Our web genomic resources can be used in diversity analysis, population and varietal differentiation, discovery of quantitative trait loci/genes, marker-assisted varietal improvement in endeavor of Vigna crop productivity and management.
      PubDate: Fri, 31 May 2019 00:00:00 GMT
      DOI: 10.1093/database/baz055
      Issue No: Vol. 2019 (2019)
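
As a rough illustration of the computational microsatellite discovery mentioned in the VigSatDB abstract above, the sketch below finds perfect simple repeats with a regular expression. It is an assumption-laden toy, not the pipeline used to build VigSatDB: the function name `find_simple_ssrs`, the thresholds (`min_repeats`, `max_motif_len`) and the test sequence are all hypothetical, and compound markers are not handled.

```python
import re


def find_simple_ssrs(sequence, min_repeats=4, max_motif_len=6):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    The motif is matched lazily, so the shortest repeating unit is reported
    (e.g. 'AT' rather than 'ATAT'). Thresholds are illustrative assumptions.
    """
    sequence = sequence.upper()
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif_len, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits


# Made-up sequence containing an (AT)5 and an (AAG)4 repeat.
example = "GGC" + "AT" * 5 + "CCGT" + "AAG" * 4 + "C"
ssrs = find_simple_ssrs(example)
```

Each hit gives the 0-based start position, the repeat unit and the number of copies; flanking sequence around such hits is what a primer-design step would typically consume.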
  • Chemical–protein interaction extraction via contextualized word
           representations and multihead attention

    • Authors: Zhang Y; Lin H, Yang Z, et al.
      Abstract: A rich source of chemical–protein interactions (CPIs) is locked in the exponentially growing biomedical literature. Automatic extraction of CPIs is a crucial task in biomedical natural language processing (NLP), which has great benefits for pharmacological and clinical research. Deep context representation and multihead attention are recent developments in deep learning and have shown their potential in some NLP tasks. Unlike traditional word embedding, deep context representation has the ability to generate comprehensive sentence representation based on the sentence context. The multihead attention mechanism can effectively learn the important features from different heads and emphasize the relatively important features. Integrating deep context representation and multihead attention with a neural network-based model may improve CPI extraction. We present a deep neural model for CPI extraction based on deep context representation and multihead attention. Our model mainly consists of the following three parts: a deep context representation layer, a bidirectional long short-term memory networks (Bi-LSTMs) layer and a multihead attention layer. The deep context representation is employed to provide more comprehensive feature input for Bi-LSTMs. The multihead attention can effectively emphasize the important part of the Bi-LSTMs output. We evaluated our method on the public ChemProt corpus. These experimental results show that both deep context representation and multihead attention are helpful in CPI extraction. Our method can compete with other state-of-the-art methods on ChemProt corpus.
      PubDate: Fri, 24 May 2019 00:00:00 GMT
      DOI: 10.1093/database/baz054
      Issue No: Vol. 2019 (2019)
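
The multihead attention layer named in the abstract above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the authors' model: real multihead attention applies learned linear projections for the queries, keys and values of each head, whereas here each head simply attends over a slice of the input vectors (an assumption made to keep the example dependency-free). The input stands in for Bi-LSTM output states, and all values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def multihead_self_attention(states, num_heads):
    """Slice each state into num_heads parts, attend per slice, concatenate."""
    dim = len(states[0])
    if dim % num_heads != 0:
        raise ValueError("state dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    merged = [[] for _ in states]
    for h in range(num_heads):
        chunk = [s[h * head_dim:(h + 1) * head_dim] for s in states]
        for row, out in zip(merged, attention_head(chunk, chunk, chunk)):
            row.extend(out)
    return merged


# Three hypothetical token states of dimension 4, processed with 2 heads.
states = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multihead_self_attention(states, num_heads=2)
```

Each output row is a weighted average of the input states, with the weights computed separately per head, which is how different heads can emphasize different parts of the sequence.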
  • LanceletDB: an integrated genome database for lancelet, comparing domain
           types and combination in orthologues among lancelet and other species

    • Authors: You L; Chi J, Huang S, et al.
      Abstract: Lancelet (amphioxus) represents the most basally divergent extant chordate (cephalochordates) that diverged from the other two chordate lineages (urochordates and vertebrates) more than half a billion years ago. As it occupies a key position in evolution, it is considered as one of the best proxies for understanding the chordate ancestral state. Thus, the construction of a database with multiple lancelet genomes and gene annotation data, including protein domains, is urgently needed to investigate the loss and gain of domains in orthologues among species, especially ancient domain types (non-vertebrate-specific domains) and novel domain combination, which is helpful for providing new insight into the chordate ancestral state and vertebrate evolution. Here, we present an integrated genome database for lancelet, LanceletDB, which provides reference haploid genome sequence and annotation data for lancelet (Branchiostoma belcheri), including gene models and annotation, protein domain types, gene expression pattern in embryogenesis, different expression sequence tag sets and alternative polyadenylation (APA) sites profiled by the sequencing APA sites method. Especially, LanceletDB allows comparison of domain types and combination in orthologues among type species so as to decode the ancient domain types and novel domain combination during evolution. We also integrated the released diploid lancelet genome annotation data (Branchiostoma floridae) to expand LanceletDB and extend its usefulness. These data are available through the search and analysis page, basic local alignment search tool page and genome browser to provide an integrated display.
      PubDate: Sat, 18 May 2019 00:00:00 GMT
      DOI: 10.1093/database/baz056
      Issue No: Vol. 2019 (2019)
  • GIDB: a knowledge database for the automated curation and multidimensional
           analysis of molecular signatures in gastrointestinal cancer

    • Authors: Wang Y; Wang Y, Wang S, et al.
      Abstract: Gastrointestinal (GI) cancer is common, characterized by high mortality, and includes oesophagus, gastric, liver, bile duct, pancreas, rectal and colon cancers. The insufficient specificity and sensitivity of biomarkers is still a key clinical hindrance for GI cancer diagnosis and successful treatment. The emergence of the ‘precision medicine’, ‘basket trial’ and ‘field cancerization’ concepts creates an urgent need to understand how organ system cancers occur at the molecular level. Knowledge from both the literature and data available in public databases is informative in elucidating the molecular alterations underlying GI cancer. Currently, most available cancer databases have not offered a comprehensive discovery of gene-disease associations, molecular alterations and clinical information by integrated text mining and data mining in GI cancer. We develop GIDB, a panoptic knowledge database that attempts to automate the curation of molecular signatures using natural language processing approaches and multidimensional analyses. GIDB covers information on 8730 genes with both literature and data supporting evidence, 248 miRNAs, 58 lncRNAs, 320 copy number variations, 49 fusion genes and 2381 semantic networks. It presents a comprehensive database, not only in parallelizing supporting evidence and data integration for signatures associated with GI cancer but also in providing the timeline feature of major molecular discoveries. It highlights the most comprehensive overview, research hotspots and the development of historical knowledge of genes in GI cancer. Furthermore, GIDB characterizes genomic abnormalities in multilevel analysis, including simple somatic mutations, gene expression, DNA methylation and prognosis. GIDB offers a user-friendly interface and two customizable online tools (Heatmap and Network) for experimental researchers and clinicians to explore data and help them shorten the learning curve and broaden the scope of knowledge. More importantly, GIDB is an ongoing research project that will continue to be updated and improve the automated method for reducing manual work.
      PubDate: Wed, 15 May 2019 00:00:00 GMT
      DOI: 10.1093/database/baz051
      Issue No: Vol. 2019 (2019)
  • CropCircDB: a comprehensive circular RNA resource for crops in response to
           abiotic stress

    • Authors: Wang K; Wang C, Guo B, et al.
      Abstract: Circular RNAs (circRNAs) may mediate mRNA expression by acting as miRNA sponges. As the community has paid more attention to circRNAs, many circRNA databases have been developed for plants. However, a comprehensive collection of circRNAs involved in crop responses to abiotic stress is still lacking. In this work, we applied a big-data approach to take full advantage of large-scale sequencing data, and developed a rich circRNA resource, CropCircDB, for maize and rice, to be extended later to incorporate more crop species. We also designed a metric, the stress detection score, specifically for detecting circRNAs under stress conditions. In summary, we systematically investigated 244 and 288 RNA-Seq samples for maize and rice, respectively, and found 38 785 circRNAs in maize and 63 048 circRNAs in rice. This resource not only supports a user-friendly JBrowser to visualize the genome easily, but also provides an elegant view of circRNA structures and dynamic profiles of circRNA expression in all samples. Together, this database will host all predicted and validated crop circRNAs that respond to abiotic stress.
      PubDate: Mon, 06 May 2019 00:00:00 GMT
      DOI: 10.1093/database/baz053
      Issue No: Vol. 2019 (2019)
  • The MACADAM database: a MetAboliC pAthways DAtabase for Microbial
           taxonomic groups for mining potential metabolic capacities of archaeal and
           bacterial taxonomic groups

    • Authors: Le Boulch M; Déhais P, Combes S, et al.
      Abstract: Progress in genome sequencing and bioinformatics opens up new possibilities, including that of correlating genome annotations with functional information such as metabolic pathways. Thanks to the development of functional annotation databases, scientists are able to link genome annotations with functional annotations. We present MetAboliC pAthways DAtabase for Microbial taxonomic groups (MACADAM) here, a user-friendly database that makes it possible to find presence/absence/completeness statistics for metabolic pathways at a given microbial taxonomic position. For each prokaryotic ‘RefSeq complete genome’, MACADAM builds a pathway genome database (PGDB) using Pathway Tools software based on MetaCyc data that includes metabolic pathways as well as associated metabolites, reactions and enzymes. To ensure the highest quality of the genome functional annotation data, MACADAM also contains MicroCyc, a manually curated collection of PGDBs; Functional Annotation of Prokaryotic Taxa (FAPROTAX), a manually curated functional annotation database; and the IJSEM phenotypic database. The MACADAM database contains 13 509 PGDBs (13 195 bacterial and 314 archaeal), 1260 unique metabolic pathways, completed with 82 functional annotations from FAPROTAX and 16 from the IJSEM phenotypic database. MACADAM contains a total of 7921 metabolites, 592 enzymatic reactions, 2134 EC numbers and 7440 enzymes. MACADAM can be queried at any rank of the NCBI taxonomy (from phyla to species). It provides the possibility to explore functional information completed with metabolites, enzymes, enzymatic reactions and EC numbers. MACADAM returns a tabulated file containing a list of pathways with two scores (pathway score and pathway frequency score) that are present in the queried taxa. The file also contains the names of the organisms in which the pathways are found and the metabolic hierarchy associated with the pathways. 
      Finally, MACADAM can be downloaded as a single file and queried with SQLite or python command lines or explored through a web interface.
      PubDate: Mon, 29 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz049
      Issue No: Vol. 2019 (2019)
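
The MACADAM abstract above notes that the database can be downloaded as a single file and queried with SQLite or Python. The sketch below shows the general shape of such a query using Python's standard `sqlite3` module. The table name, column names and rows are hypothetical, since the abstract does not describe the real schema; an in-memory database stands in for the downloaded file so the example is self-contained.

```python
import sqlite3

# Hypothetical schema and made-up rows: the real MACADAM table and column
# names are not given in the abstract, so everything here is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pathway_by_taxon (
        taxon TEXT,
        pathway TEXT,
        pathway_score REAL,
        pathway_frequency_score REAL
    );
    INSERT INTO pathway_by_taxon VALUES
        ('Taxon A', 'pathway 1', 1.0, 0.9),
        ('Taxon A', 'pathway 2', 0.6, 0.4),
        ('Taxon B', 'pathway 1', 0.8, 0.7);
""")


def pathways_for_taxon(connection, taxon, min_score=0.0):
    """List pathways for one taxon, highest pathway score first."""
    cursor = connection.execute(
        "SELECT pathway, pathway_score, pathway_frequency_score "
        "FROM pathway_by_taxon "
        "WHERE taxon = ? AND pathway_score >= ? "
        "ORDER BY pathway_score DESC",
        (taxon, min_score),
    )
    return cursor.fetchall()


results = pathways_for_taxon(conn, "Taxon A", min_score=0.5)
```

With a real downloaded file, `sqlite3.connect` would be given the file path instead of `":memory:"`, and the query would need to be adapted to the actual schema. Parameter placeholders (`?`) keep the taxon name out of the SQL string.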
  • The NCBI BioCollections Database

    • Authors: Sharma S; Ciufo S, Starchenko E, et al.
      Abstract: The citation
      PubDate: Mon, 29 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz057
      Issue No: Vol. 2019 (2019)
  • YeasTSS: an integrative web database of yeast transcription start sites

    • Authors: McMillan J; Lu Z, Rodriguez J, et al.
      Abstract: The transcription initiation landscape of eukaryotic genes is complex and highly dynamic. In eukaryotes, genes can generate multiple transcript variants that differ in 5′ boundaries due to the use of alternative transcription start sites (TSSs), and the abundance of transcript isoforms is highly variable. Owing to the large number and complexity of TSSs, it is not feasible to depict details of the transcript initiation landscape of all genes using text-format genome annotation files. Therefore, it is necessary to provide data visualization of TSSs to represent quantitative TSS maps and the core promoters (CPs). In addition, the selection and activity of TSSs are influenced by various factors, such as transcription factors, chromatin remodeling and histone modifications. Thus, integration and visualization of functional genomic data related to these features could provide a better understanding of the gene promoter architecture and regulatory mechanism of transcription initiation. Yeast species play important roles in research and human society, yet no database provides visualization and integration of functional genomic data in yeast. Here, we generated quantitative TSS maps for 12 important yeast species, inferred their CPs and built a public database, YeasTSS. YeasTSS was designed as a central portal for visualization and integration of the TSS maps, CPs and functional genomic data related to transcription initiation in yeast. YeasTSS is expected to benefit the research community and public education by improving genome annotation, studies of promoter structure, regulated control of transcription initiation and inference of gene regulatory networks.
      PubDate: Fri, 26 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz048
      Issue No: Vol. 2019 (2019)
  • rPredictorDB: a predictive database of individual secondary structures of
           RNAs and their formatted plots

    • Authors: Jelínek J; Hoksza D, Hajič J, et al.
      Abstract: Secondary data structure of RNA molecules provides insights into the identity and function of RNAs. With RNAs readily sequenced, the question of their structural characterization is increasingly important. However, RNA structure is difficult to acquire. Its experimental identification is extremely technically demanding, while computational prediction is not accurate enough, especially for large structures of long sequences. We address this difficult situation with rPredictorDB, a predictive database of RNA secondary structures that aims to form a middle ground between experimentally identified structures in PDB and predicted consensus secondary structures in Rfam. The database contains individual secondary structures predicted using a tool for template-based prediction of RNA secondary structure for the homologs of the RNA families with at least one homolog with experimentally solved structure. Experimentally identified structures are used as the structural templates and thus the prediction has higher reliability than de novo predictions in Rfam. The sequences are downloaded from public resources. So far rPredictorDB covers 7365 RNAs with their secondary structures. Plots of the secondary structures use the Traveler package for readable display of RNAs with long sequences and complex structures, such as ribosomal RNAs. The RNAs in the output of rPredictorDB are extensively annotated and can be viewed, browsed, searched and downloaded according to taxonomic, sequence and structure data. Additionally, structure of user-provided sequences can be predicted using the templates stored in rPredictorDB.
      PubDate: Thu, 25 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz047
      Issue No: Vol. 2019 (2019)
  • An effective biomedical document classification scheme in support of
           biocuration: addressing class imbalance

    • Authors: Jiang X; Ringwald M, Blake J, et al.
      Abstract: Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory’s Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
      PubDate: Thu, 25 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz045
      Issue No: Vol. 2019 (2019)
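
The precision, recall and f-measure figures quoted in the abstract above are related by the standard F-beta formula, which the short sketch below computes. The function name `f_measure` is hypothetical, and the abstract does not say how the authors aggregated their scores, so this only shows the textbook relationship applied to the rounded values.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Rounded values quoted in the abstract.
f1_from_rounded = f_measure(0.72, 0.80)

# Lowest F1 still compatible with inputs that round to 0.72 and 0.80.
f1_lower_bound = f_measure(0.715, 0.795)
```

Plugging in the rounded 0.72 and 0.80 gives roughly 0.758, while inputs at the low end of their rounding intervals give roughly 0.753, so the reported 0.75 is within what rounding of the inputs could produce. Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot score well on an imbalanced set by favoring recall alone.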
  • CANCROX: a cross-species cancer therapy database

    • Authors: de Ávila P; e Silva D, de Melo Bernardo P, et al.
      Abstract: Cancer comprises a set of more than 200 diseases resulting from the uncontrolled growth of cells that invade tissues and organs, which can spread to other regions of the body. The types of cancer found in humans are also described in animal models, a fact that has raised the interest of the scientific community in comparative oncology studies. In this study, bioinformatics tools were used to implement a computational model that uses text mining and natural language processing to construct a reference database that relates human and canine genes potentially associated with cancer, defining genetic pathways and information about cancer and cancer therapies. The CANCROX reference database was constructed by processing the scientific literature and lists more than 1300 drugs and therapies used to treat cancer, in addition to over 10 000 combinations of these drugs, including 40 types of cancer. A user-friendly interface was developed that enables researchers to search for different types of information about therapies, drug combinations, genes and types of cancer. In addition, data visualization tools allow users to explore and relate different drugs and therapies for the treatment of cancer, providing information for groups studying animal models, in this case the dog, as well as groups studying cancer in humans.
      PubDate: Thu, 25 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz044
      Issue No: Vol. 2019 (2019)
  • PlantMP: a database for moonlighting plant proteins

    • Authors: Su B; Qian Z, Li T, et al.
      Abstract: Moonlighting proteins are single polypeptide chains capable of executing two or more distinct biochemical and/or biological functions. Here, we describe the development of PlantMP, which is a manually curated online-based database of plant proteins that are known to ‘moonlight’. The database contains searchable UniProt IDs and names, canonical and moonlighting functions, gene ontology numbers, plant species as well as links to the PubMed indexed articles. Proteins homologous to experimentally confirmed moonlighting proteins from the model plant Arabidopsis thaliana are provided as a separate list of ‘likely moonlighters’. Additionally, we also provide a list of predicted Arabidopsis moonlighting proteins reported in the literature. Currently, PlantMP contains 110 plant moonlighting proteins, 10 ‘likely moonlighters’ and 27 ‘predicted moonlighters’. Organizing plant moonlighting proteins in one platform enables researchers to conveniently harvest plant-specific raw and processed data such as the molecular functions, biological roles and structural features essential for hypothesis formulation in basic research and for biotechnological innovations.
      PubDate: Thu, 25 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz050
      Issue No: Vol. 2019 (2019)
  • ChlamBase: a curated model organism database for the Chlamydia research

    • Authors: Putman T; Hybiske K, Jow D, et al.
      Abstract: The accelerating growth of genomic and proteomic information for Chlamydia species, coupled with unique biological aspects of these pathogens, necessitates bioinformatic tools and features that are not provided by major public databases. To meet these growing needs, we developed ChlamBase, a model organism database for Chlamydia that is built upon the WikiGenomes application framework, and Wikidata, a community-curated database. ChlamBase was designed to serve as a central access point for genomic and proteomic information for the Chlamydia research community. ChlamBase integrates information from numerous external databases, as well as important data extracted from the literature that are otherwise not available in structured formats that are easy to use. In addition, a key feature of ChlamBase is that it empowers users in the field to contribute new annotations and data as the field advances with continued discoveries. ChlamBase is freely and publicly available at
      PubDate: Mon, 15 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz041
      Issue No: Vol. 2019 (2019)
  • The Natural History Museum Data Portal

    • Authors: Scott B; Baker E, Woodburn M, et al.
      Abstract: The Natural History Museum, London (NHM), generates and holds some of the largest global data sets relating to the biological and geological diversity of the natural world. A majority of these data were, until 2015, not widely accessible, and, even when published, were typically hard to find, poorly documented and in formats that impede discovery and integration. To better serve the bespoke needs of user communities outside and within the NHM, a dedicated data portal was developed to surface these data sets and provide a sustainable platform to encourage their citation and reuse. This paper describes the technical development of the data portal, from its inception to beta launch in December 2015, its first 2 years of operation, and future plans for the project. It outlines the development principles adopted for this prototypical project, which subsequently informed new digital project management methodologies at the NHM. The process of developing the data portal acted as a driver to implement policies necessary to encourage a culture of data sharing at the NHM.
      PubDate: Thu, 11 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz038
      Issue No: Vol. 2019 (2019)
  • PanglaoDB: a web server for exploration of mouse and human single-cell RNA
           sequencing data

    • Authors: Franzén O; Gan L, Björkegren J.
      Abstract: Single-cell RNA sequencing is an increasingly used method to measure gene expression at the single cell level and build cell-type atlases of tissues. Hundreds of single-cell sequencing datasets have already been published. However, studies are frequently deposited as raw data, a format difficult to access for biological researchers due to the need for data processing using complex computational pipelines. We have implemented an online database, PanglaoDB, accessible through a user-friendly interface that can be used to explore published mouse and human single cell RNA sequencing studies. PanglaoDB contains pre-processed and pre-computed analyses from more than 1054 single-cell experiments covering most major single cell platforms and protocols, based on more than 4 million cells from a wide range of tissues and organs. The online interface allows users to query and explore cell types, genetic pathways and regulatory networks. In addition, we have established a community-curated cell-type marker compendium, containing more than 6000 gene-cell-type associations, as a resource for automatic annotation of cell types.
      PubDate: Fri, 05 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz046
      Issue No: Vol. 2019 (2019)
  • GenDiS database update with improved approach and features to recognize
           homologous sequences of protein domain superfamilies

    • Authors: Iyer M; Bhargava K, Pavalam M, et al.
      Abstract: Since proteins evolve by divergent evolution, proteins with distant homology to each other may or may not bear similar functions. Improved computational approaches are required to recognize distant homologues that are functionally similar. One of the methods of assigning function to sequences is to use profiles derived from sequences of known structure. We describe an update of the Genomic Distribution of protein structural domain Superfamilies (GenDiS) database, namely GenDiS+, which provides a projection of SCOP superfamily members on the sequence space (NR database, NCBI). The sequences are validated using structure-based sequence alignment profiles and domain and full-length sequence alignments. GenDiS+ is a ‘tour de force’ for detecting homologues within around 160 000 taxonomic identifiers, starting from nearly 11 000 domains of known structure. Features, like full-sequence alignment and phylogeny, domain sequence alignment and phylogeny, list of associated structural and sequence domains with strength of interactions, links to databases like Pfam, UniProt and ModBase and list of sequences with a PDB structure, are provided.
      PubDate: Wed, 03 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz042
      Issue No: Vol. 2019 (2019)
  • A dimensional warehouse for integrating operational data from clinical

    • Authors: Farnum M; Mohanty L, Ashok M, et al.
      Abstract: Timely, consistent and integrated access to clinical trial data remains one of the pharmaceutical industry’s most pressing needs. As part of a comprehensive clinical data repository, we have developed a data warehouse that can integrate operational data from any source, conform it to a canonical data model and make it accessible to study teams in a timely, secure and contextualized manner to support operational oversight, proactive risk management and other analytic and reporting needs. Our solution consists of a dimensional relational data warehouse, a set of extraction, transformation and loading processes to coordinate data ingestion and mapping, a generalizable metrics engine to enable the computation of operational metrics and key performance, quality and risk indicators and a set of graphical user interfaces to facilitate configuration, management and administration. When combined with the appropriate data visualization tools, the warehouse enables convenient access to raw operational data and derived metrics to help track study conduct and performance, identify and mitigate risks, monitor and improve operational processes, manage resource allocation, strengthen investigator and sponsor relationships and other purposes.
      PubDate: Wed, 03 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz039
      Issue No: Vol. 2019 (2019)
  • Quantitative phenotype analysis to identify, validate and compare rat
           disease models

    • Authors: Zhao Y; Smith J, Wang S, et al.
      Abstract: The laboratory rat has been widely used as an animal model in biomedical research. There are many strains exhibiting a wide variety of phenotypes. Capturing these phenotypes in a centralized database provides researchers with an easy method for choosing the appropriate strains for their studies. Existing resources have provided some preliminary work in rat phenotype databases. However, existing resources suffer from problems such as small number of animals, lack of updating, web interface queries limitations and lack of standardized metadata. The Rat Genome Database (RGD) PhenoMiner tool has provided the first step in this effort by standardizing and integrating data from individual studies. Our work, mainly utilizing data curated in RGD, involves the following key steps: (i) we developed a meta-analysis pipeline to automatically integrate data from heterogeneous sources and to produce expected ranges (standardized phenotype ranges) for different strains and phenotypes under different experimental conditions; (ii) we created tools to visualize expected ranges for individual strains and strain groups. We developed a meta-analysis pipeline and an interactive web interface that summarizes and visualizes expected ranges produced from the meta-analysis pipeline. Automation of the pipeline allows for updates as additional data becomes available. The interactive web interface provides curators and researchers with a platform for identifying and validating expected ranges for a variety of quantitative phenotypes. The data analysis result and visualization tools will promote an understanding of rat disease models, guide researchers to choose optimal strains for their research needs and encourage data sharing from different research hubs. Such resources also help to promote research reproducibility. The interactive platforms created in this project will continue to provide a valuable resource for translational research efforts.
      PubDate: Tue, 02 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz037
      Issue No: Vol. 2019 (2019)
  • Building deep learning models for evidence classification from the open
           access biomedical literature

    • Authors: Burns G; Li X, Peng N.
      Abstract: We investigate the application of deep learning to biocuration tasks that involve classification of text associated with biomedical evidence in primary research articles. We developed a large-scale corpus of molecular papers derived from PubMed and PubMed Central open access records and used it to train deep learning word embeddings under the GloVe, FastText and ELMo algorithms. We applied those models to a distant supervised method classification task based on text from figure captions or fragments surrounding references to figures in the main text using a variety of models and parameterizations. We then developed document classification (triage) methods for molecular interaction papers by using deep learning mechanisms of attention to aggregate classification-based decisions over selected paragraphs in the document. We were able to obtain triage performance with an accuracy of 0.82 using a combined convolutional neural network, bi-directional long short-term memory architecture augmented by attention to produce a single decision for triage. In this work, we hope to encourage biocuration systems developers to apply deep learning methods to their specialized tasks by repurposing large-scale word embedding to apply to their data.
      PubDate: Tue, 02 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz034
      Issue No: Vol. 2019 (2019)
  • An enhanced workflow for variant interpretation in UniProtKB/Swiss-Prot
           improves consistency and reuse in ClinVar

    • Authors: Famiglietti M; Estreicher A, Breuza L, et al.
      Abstract: Personalized genomic medicine depends on integrated analyses that combine genetic and phenotypic data from individual patients with reference knowledge of the functional and clinical significance of sequence variants. Sources of this reference knowledge include the ClinVar repository of human genetic variants, a community resource that accepts submissions from external groups, and UniProtKB/Swiss-Prot, an expert-curated resource of protein sequences and functional annotation. UniProtKB/Swiss-Prot provides knowledge on the functional impact and clinical significance of over 30 000 human protein-coding sequence variants, curated from peer-reviewed literature reports. Here we present a pilot study that lays the groundwork for the integration of curated knowledge of protein sequence variation from UniProtKB/Swiss-Prot with ClinVar. We show that existing interpretations of variant pathogenicity in UniProtKB/Swiss-Prot and ClinVar are highly concordant, with 88% of variants that are common to the two resources having interpretations of clinical significance that agree. Re-curation of a subset of UniProtKB/Swiss-Prot variants according to American College of Medical Genetics and Genomics (ACMG) guidelines using ClinGen tools further increases this level of agreement, mainly due to the reclassification of supposedly pathogenic variants as benign, based on newly available population frequency data. We have now incorporated ACMG guidelines and ClinGen tools into the UniProt Knowledgebase (UniProtKB) curation workflow and routinely submit variant data from UniProtKB/Swiss-Prot to ClinVar. These efforts will increase the usability and utilization of UniProtKB variant data and will facilitate the continuing (re-)evaluation of clinical variant interpretations as data sets and knowledge evolve.
      PubDate: Tue, 02 Apr 2019 00:00:00 GMT
      DOI: 10.1093/database/baz040
      Issue No: Vol. 2019 (2019)
  • AYbRAH: a curated ortholog database for yeasts and fungi spanning 600
           million years of evolution

    • Authors: Correia K; Yu S, Mahadevan R.
      Abstract: Budding yeasts inhabit a range of environments by exploiting various metabolic traits. The genetic bases for these traits are mostly unknown, preventing their addition or removal in a chassis organism for metabolic engineering. Insight into the evolution of orthologs, paralogs and xenologs in the yeast pan-genome can help bridge these genotypes; however, existing phylogenomic databases do not span diverse yeasts, and sometimes cannot distinguish between these homologs. To help understand the molecular evolution of these traits in yeasts, we created Analyzing Yeasts by Reconstructing Ancestry of Homologs (AYbRAH), an open-source database of predicted and manually curated ortholog groups for 33 diverse fungi and yeasts in Dikarya, spanning 600 million years of evolution. OrthoMCL and OrthoDB were used to cluster protein sequence into ortholog and homolog groups, respectively; MAFFT and PhyML reconstructed the phylogeny of all homolog groups. Ortholog assignments for enzymes and small metabolite transporters were compared to their phylogenetic reconstruction, and curated to resolve any discrepancies. Information on homolog and ortholog groups can be viewed in the AYbRAH web portal, including functional annotations, predictions for mitochondrial localization and transmembrane domains, literature references and phylogenetic reconstructions. Ortholog assignments in AYbRAH were compared to HOGENOM, KEGG Orthology, OMA, eggNOG and PANTHER. PANTHER and OMA had the most congruent ortholog groups with AYbRAH, while the other phylogenomic databases had greater amounts of under-clustering, over-clustering or no ortholog annotations for proteins. Future plans are discussed for AYbRAH, and recommendations are made for other research communities seeking to create curated ortholog databases.
      PubDate: Wed, 20 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz022
      Issue No: Vol. 2019 (2019)
  • Curating gene sets: challenges and opportunities for integrative analysis

    • Authors: Bubier J; Hill D, Mukherjee G, et al.
      Abstract: Genomic data interpretation often requires analyses that move from a gene-by-gene focus to a focus on sets of genes that are associated with biological phenomena such as molecular processes, phenotypes, diseases, drug interactions or environmental conditions. Unique challenges exist in the curation of gene sets beyond the challenges in curation of individual genes. Here we highlight a literature curation workflow whereby gene sets are curated from peer-reviewed published data into GeneWeaver (GW), a data repository and analysis platform. We describe the system features that allow for a flexible yet precise curation procedure. We illustrate the value of curation by gene sets through analysis of independently curated sets that relate to the integrated stress response, showing that sets curated from independent sources all share significant Jaccard similarity. A suite of reproducible analysis tools is provided in GW as services to carry out interactive functional investigation of user-submitted gene sets within the context of over 150 000 gene sets constructed from publicly available resources and published gene lists. A curation interface supports the ability of users to design and maintain curation workflows of gene sets, including assigning, reviewing and releasing gene sets within a curation project context.
      PubDate: Tue, 19 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz036
      Issue No: Vol. 2019 (2019)
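
The Jaccard similarity mentioned in the abstract above is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the gene symbols and the function name `jaccard` are illustrative assumptions and are not taken from GeneWeaver, and the sketch does not compute the statistical significance the abstract refers to.

```python
def jaccard(set_a, set_b):
    """Return |A & B| / |A | B| for two collections of gene identifiers."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Convention chosen here (an assumption): two empty sets have similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene sets, for illustration only.
curated_set_1 = {"ATF4", "DDIT3", "EIF2AK3"}
curated_set_2 = {"ATF4", "DDIT3", "XBP1"}

similarity = jaccard(curated_set_1, curated_set_2)
print(similarity)
```

Two of the four distinct genes are shared here, so the similarity is 0.5; identical sets give 1.0 and disjoint sets give 0.0.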
  • Growing and cultivating the forest genomics database, TreeGenes

    • Authors: Falk T; Herndon N, Grau E, et al.
      Abstract: The ‘Database URL’ has been changed to link to
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz043
      Issue No: Vol. 2019 (2019)
  • A late-binding, distributed, NoSQL warehouse for integrating patient data
           from clinical trials

    • Authors: Yang E; Scheff J, Shen S, et al.
      Abstract: Clinical trial data are typically collected through multiple systems developed by different vendors using different technologies and data standards. That data need to be integrated, standardized and transformed for a variety of monitoring and reporting purposes. The need to process large volumes of often inconsistent data in the presence of ever-changing requirements poses a significant technical challenge. As part of a comprehensive clinical data repository, we have developed a data warehouse that integrates patient data from any source, standardizes it and makes it accessible to study teams in a timely manner to support a wide range of analytic tasks for both in-flight and completed studies. Our solution combines Apache HBase, a NoSQL column store, Apache Phoenix, a massively parallel relational query engine and a user-friendly interface to facilitate efficient loading of large volumes of data under incomplete or ambiguous specifications, utilizing an extract–load–transform design pattern that defers data mapping until query time. This approach allows us to maintain a single copy of the data and transform it dynamically into any desirable format without requiring additional storage. Changes to the mapping specifications can be easily introduced and multiple representations of the data can be made available concurrently. Further, by versioning the data and the transformations separately, we can apply historical maps to current data or current maps to historical data, which simplifies the maintenance of data cuts and facilitates interim analyses for adaptive trials. The result is a highly scalable, secure and redundant solution that combines the flexibility of a NoSQL store with the robustness of a relational query engine to support a broad range of applications, including clinical data management, medical review, risk-based monitoring, safety signal detection, post hoc analysis of completed studies and many others.
      PubDate: Mon, 11 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz032
      Issue No: Vol. 2019 (2019)
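
The extract-load-transform pattern described in the abstract above, in which raw data are stored once and the mapping is deferred until query time, can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: it does not use Apache HBase or Apache Phoenix, and every name in it (`RAW_SNAPSHOTS`, `MAPPINGS`, `query`, the field names) is hypothetical.

```python
# Raw data are loaded as-is and versioned; nothing is transformed at load time.
RAW_SNAPSHOTS = {
    1: [{"SUBJ": "001", "SEX": "M"}],
    2: [{"SUBJ": "001", "SEX": "F"}],  # a later data cut with a correction
}

# Mapping specifications are versioned separately from the data.
MAPPINGS = {
    1: {"SUBJ": "subject_id", "SEX": "sex"},
    2: {"SUBJ": "usubjid", "SEX": "sex"},
}


def query(data_version, map_version):
    """Apply the chosen mapping to the chosen data cut at query time."""
    mapping = MAPPINGS[map_version]
    return [
        {mapping[field]: value for field, value in row.items() if field in mapping}
        for row in RAW_SNAPSHOTS[data_version]
    ]


# A historical map applied to current data, and a current map applied to historical data.
print(query(data_version=2, map_version=1))
print(query(data_version=1, map_version=2))
```

Because only the raw snapshots are stored, changing a mapping requires no reloading, and several representations of the same data can be served side by side.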
  • Update on cpnDB: a reference database of chaperonin sequences

    • Authors: Vancuren S; Hill J.
      Abstract: cpnDB was established in 2004 to provide a manually curated database of type I (60 kDa chaperonin, CPN60, also known as GroEL or HSP60) and type II (CCT, TRiC, thermosome) chaperonin sequences and to support chaperonin sequence-based applications including microbial species identification, detection and quantification, phylogenetic investigations and microbial community profiling. Since its establishment, cpnDB has grown to over 25 000 sequence records including over 4 000 records from bacterial type strains. The updated cpnDB webpage provides tools for text- or sequence-based searches and links to protocols, and selected reference data sets are available for download. Here we present an updated description of the contents and taxonomic coverage of cpnDB and an analysis of cpn60 sequence diversity.
      PubDate: Fri, 01 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz033
      Issue No: Vol. 2019 (2019)
  • MSGP: the first database of the protein components of the mammalian stress

    • Authors: Nunes C; Mestre I, Marcelo A, et al.
      Abstract: In response to different stress stimuli, cells transiently form stress granules (SGs) in order to protect themselves and re-establish homeostasis. Besides these important cellular functions, SGs are now being implicated in different human diseases, such as neurodegenerative disorders and cancer. SGs are ribonucleoprotein granules, constituted by a variety of different types of proteins, RNAs, factors involved in translation and signaling molecules, being capable of regulating mRNA translation to facilitate stress response. However, until now a complete list of the SG components has not been available. Therefore, we aimed at identifying and listing in an open access database all the proteins described so far as components of SGs. The identification was made through an exhaustive search of studies listed in PubMed and double checked. Moreover, for each identified protein several details were also gathered from public databases, such as the molecular function, the cell types in which they were detected, the type of stress stimuli used to induce SG formation and the reference of the study describing the recruitment of the component to SGs. Expression levels in the context of different neurodegenerative diseases were also obtained and are also described in the database. The Mammalian Stress Granules Proteome is available online as a new and unique open access database, the first to list all the protein components of the SGs identified so far. The database constitutes an important and valuable tool for researchers in this research area of growing interest.
      PubDate: Fri, 01 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz031
      Issue No: Vol. 2019 (2019)
  • YESdb: integrative analysis of environmental stress in yeast

    • Authors: Berchtold E; Csaba G, Zimmer R.
      Abstract: The stress response in the model organism Saccharomyces cerevisiae is a well-studied system for which many data sets are available. Already in 2000, it was discovered that yeast cells trigger a similar transcriptional response when different types of stress are applied. However, the exact regulatory mechanisms and differences between the different types of stress are still not understood. Here, we present the Yeast Environmental Stress database (YESdb), a database containing all high-throughput experiments measuring various kinds of stress in yeast. The goal of the database is to allow the user to execute complex, integrative analyses of selected data sets, e.g. the comparison of measurements of the same stress using different platforms or differences between strains, stress strengths or types of stress. The analyses can be visualized in various ways and can be compiled into interactive reports to summarize and communicate the results. The data sets are available as differential conditions (typically stressed vs control), which are grouped to time or concentration series when multiple measurements over time or concentrations are done in one experiment. An annotation ontology has been constructed to annotate the data sets with the type, duration and strength of the applied stress, the used strain and experimental platform as well as the publication date. These annotations can easily be combined to select all relevant data sets for an analysis. YESdb allows users to construct and execute Petri net-based workflows to perform predefined and custom analyses. For example, to compare two types of stress (e.g. salt vs oxidative stress), the corresponding data sets are selected from the database, the consistently changed genes are defined and combined and the shared genes are characterized by enrichment analysis. A broad collection of visualizations is available, most of which are also interactive. The results of all analyses can be summarized in an interactive report. Visualizations of individual steps (transitions) of YESdb workflows can be automatically added to this report, or customized visualizations as well as interpretive text can manually be added to the report. Overall, YESdb aims at making all published data sets on yeast stress immediately available and comparable for integrated analysis of data sets and sets of genes in order to identify and assess hypotheses and mechanisms.
      PubDate: Fri, 01 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz023
      Issue No: Vol. 2019 (2019)
  • PASS2 version 6: a database of structure-based sequence alignments of
           protein domain superfamilies in accordance with SCOPe

    • Authors: Ghosh P; Bhattacharyya T, Mathew O, et al.
      Abstract: The number of protein structures is increasing due to individual initiatives and the rapid development of structure determination techniques. Structure-based sequence alignments of distantly related proteins enable the investigation of structural, evolutionary and functional relationships between proteins and their domains leading to their common evolutionary origin. Protein Alignments organized as Structural Superfamilies (PASS2) is a database that provides such alignments of members of protein domain superfamilies of known structure and with less than 40% sequence identity. PASS2 has been continuously updated in accordance with Structural Classification of Proteins (SCOP), and now Structural Classification of Proteins - extended (SCOPe). The current update directly corresponds to SCOPe 2.06, dealing with 2006 domain superfamilies of known structure and about 14 000 domains. Alignments have been augmented by features such as hidden Markov models, highly conserved residues, structural motifs and gene ontology terms, which are available for download. In this update, we introduce the concepts of ‘extreme structural outliers’ and ‘split superfamilies’ as well.
      PubDate: Fri, 01 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz028
      Issue No: Vol. 2019 (2019)
  • Improved annotation of the insect vector of citrus greening disease:
           biocuration by a diverse genomics community

    • Authors: Saha S; Hosmani P, Villalobos-Ayala K, et al.
      Abstract: Author affiliation for Liliana Cano has been corrected to link to University of Florida/IFAS Indian River Research and Education Center, Ft. Pierce, FL 34945.
      PubDate: Fri, 01 Mar 2019 00:00:00 GMT
      DOI: 10.1093/database/baz035
      Issue No: Vol. 2019 (2019)
  • Ontology based text mining of gene-phenotype associations: application to
           candidate gene prediction

    • Authors: Kafkas Ş; Hoehndorf R.
      Abstract: Gene–phenotype associations play an important role in understanding the disease mechanisms which is a requirement for treatment development. A portion of gene–phenotype associations are observed mainly experimentally and made publicly available through several standard resources such as MGI. However, there is still a vast amount of gene–phenotype associations buried in the biomedical literature. Given the large amount of literature data, we need automated text mining tools to alleviate the burden in manual curation of gene–phenotype associations and to develop comprehensive resources. In this study, we present an ontology-based approach in combination with statistical methods to text mine gene–phenotype associations from the literature. Our method achieved AUC values of 0.90 and 0.75 in recovering known gene–phenotype associations from HPO and MGI respectively. We posit that candidate genes and their relevant diseases should be expressed with similar phenotypes in publications. Thus, we demonstrate the utility of our approach by predicting disease candidate genes based on the semantic similarities of phenotypes associated with genes and diseases. To the best of our knowledge, this is the first study using an ontology based approach to extract gene–phenotype associations from the literature. We evaluated our disease candidate prediction model on the gene–disease associations from MGI. Our model achieved AUC values of 0.90 and 0.87 on OMIM (human) and MGI (mouse) datasets of gene–disease associations respectively. Our manual analysis on the text mined data revealed that our method can accurately extract gene–phenotype associations which are not currently covered by the existing public gene–phenotype resources. Overall, results indicate that our method can precisely extract known as well as new gene–phenotype associations from literature. All the data and methods are available at
      PubDate: Wed, 27 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz019
      Issue No: Vol. 2019 (2019)
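
The AUC values reported in the abstract above summarize how well a score ranks true associations above false ones. As a minimal sketch, assuming the standard rank-based definition of AUC (the fraction of positive-negative pairs in which the positive item scores higher, with ties counted as one half), it can be computed as below. The scores and the function name `auc` are hypothetical and are not the authors' evaluation code.

```python
def auc(positive_scores, negative_scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores for known and non-associated gene-phenotype pairs.
known = [0.9, 0.8, 0.4]
not_associated = [0.7, 0.3]
print(auc(known, not_associated))
```

In this toy example five of the six pairs are ordered correctly. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.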
  • Statistical principle-based approach for recognizing and normalizing
           microRNAs described in scientific literature

    • Authors: Dai H; Wang C, Chang N, et al.
      Abstract: The detection of microRNA (miRNA) mentions in scientific literature helps researchers find relevant and appropriate literature based on queries formulated using miRNA information. Considering most published biological studies elaborated on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNA from both the main content and figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combined the advantages of previous works relying on pattern, dictionary and supervised learning and provided an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data was organized in a human-interpretable manner to understand the reason why the system considers a span of text as a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus by extending the latter corpus with additional Rfam normalization information. Our study highlights and also proposes a better understanding of the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.
      PubDate: Wed, 27 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz030
      Issue No: Vol. 2019 (2019)
  • Tetrahymena Comparative Genomics Database (TCGD): a community resource for

    • Authors: Yang W; Jiang C, Zhu Y, et al.
      Abstract: Ciliates are a large and diverse group of unicellular organisms characterized by having the following two distinct types of nuclei within a single cell: micronucleus (MIC) and macronucleus (MAC). Although the genomes of several ciliates in different groups have been sequenced, comparative genomics data for multiple species within a ciliate genus are not yet available. Here we collected the genome information and comparative genomics analysis results for 10 species in the Tetrahymena genus, including the previously sequenced model organism Tetrahymena thermophila and 9 newly sequenced species, and constructed a genus-level comparative analysis platform, the Tetrahymena Comparative Genomics Database (TCGD). Genome sequences, transcriptomic data, gene models, functional annotation, ortholog groups and synteny maps were built into this database and a user-friendly interface was developed for searching, visualizing and analyzing these data. In summary, the TCGD will be an important and useful resource for the ciliate research community.
      PubDate: Wed, 27 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz029
      Issue No: Vol. 2019 (2019)
  • SELER: a database of super-enhancer-associated lncRNA- directed
           transcriptional regulation in human cancers

    • Authors: Guo Z; Xie C, Li K, et al.
      Abstract: Super-enhancers (SEs) are enriched with a cluster of mediator binding sites, which are major contributors to cell-type-specific gene expression. Currently, a large quantity of long non-coding RNAs has been found to be transcribed from or to interact with SEs, which constitute super-enhancer associated long non-coding RNAs (SE-lncRNAs). These SE-lncRNAs play essential roles in transcriptional regulation through controlling SEs activity to regulate a broad range of physiological and pathological processes, especially tumorigenesis. However, the pathological functions of SE-lncRNAs in tumorigenesis are still obscure. In this paper, we characterized 5056 SE-lncRNAs and their associated genes by analysing 102 SE data sets. Then, we analysed their expression profiles and prognostic information derived from 19 cancer types to identify cancer-related SE-lncRNAs and to explore their potential functions. In total, 436 significantly differentially expressed SE-lncRNAs and 2035 SE-lncRNAs with high prognostic values were identified. Additionally, 3935 significant correlations between SE-lncRNAs and their regulatory genes were further validated by calculating their correlation coefficients in each cancer type. Finally, the SELER database incorporating the aforementioned data was provided for users to explore their physiological and pathological functions to comprehensively understand the blocks of living systems.
      PubDate: Tue, 26 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz027
      Issue No: Vol. 2019 (2019)
  • PIRSitePredict for protein functional site prediction using
           position-specific rules

    • Authors: Chen C; Wang Q, Huang H, et al.
      Abstract: Methods focused on predicting ‘global’ annotations for proteins (such as molecular function, biological process and presence of domains or membership in a family) have reached a relatively mature stage. Methods to provide fine-grained ‘local’ annotation of functional sites (at the level of individual amino acid) are now coming to the forefront, especially in light of the rapid accumulation of genetic variant data. We have developed a computational method and workflow that predicts functional sites within proteins using position-specific conditional template annotation rules (namely PIR Site Rules or PIRSRs for short). Such rules are curated through review of known protein structural and other experimental data by structural biologists and are used to generate high-quality annotations for the UniProt Knowledgebase (UniProtKB) unreviewed section. To share the PIRSR functional site prediction method with the broader scientific community, we have streamlined our workflow and developed a stand-alone Java software package named PIRSitePredict. We demonstrate the use of PIRSitePredict for functional annotation of de novo assembled genome/transcriptome by annotating uncharacterized proteins from Trinity RNA-seq assembly of embryonic transcriptomes of the following three cartilaginous fishes: Leucoraja erinacea (Little Skate), Scyliorhinus canicula (Small-spotted Catshark) and Callorhinchus milii (Elephant Shark). On average about 1200 lines of annotations were predicted for each species.
      PubDate: Tue, 26 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz026
      Issue No: Vol. 2019 (2019)
  • PKAD: a database of experimentally measured pKa values of ionizable groups
           in proteins

    • Authors: Pahari S; Sun L, Alexov E.
      Abstract: Ionizable residues play key roles in many biological phenomena including protein folding, enzyme catalysis and binding. We present PKAD, a database of experimentally measured pKas of protein residues reported in the literature or taken from existing databases. The database contains pKa data for 1350 residues in 157 wild-type proteins and for 232 residues in 45 mutant proteins. Most of these values are for Asp, Glu, His and Lys amino acids. The database is available as a downloadable file as well as a web server. The PKAD database can be used as a benchmarking source for development and improvement of pKa prediction methods. The web server provides additional information taken from the corresponding structures and amino acid sequences, which allows for easy search and grouping of the experimental pKas according to various biophysical characteristics, amino acid type and others.
      PubDate: Tue, 26 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz024
      Issue No: Vol. 2019 (2019)
  • PamulDB: a comprehensive genomic resource for the study of human- and
           animal-pathogenic Pasteurella multocida

    • Authors: Li T; Xu X, Du H, et al.
      Abstract: Pasteurella multocida can infect a wide range of hosts, including humans and animals of economic importance. Genomics studies on the pathogen have produced a large amount of omics data, which are deposited in GenBank but lack a dedicated and comprehensive resource for further analysis and integration, and therefore need to be brought together centrally in a coherent and systematic manner. Here we have collected the genomic data for 176 P. multocida strains that are categorized into 11 host groups and 9 serotype groups, and developed the open-access P. multocida Database (PamulDB) to make this resource readily available. The PamulDB implements and integrates Chado for genome data management, Drupal for web content management, and bioinformatics tools like NCBI BLAST, HMMER, PSORTb and OrthoMCL for data analysis. All the P. multocida genomes have been further annotated for search and analysis of homologous sequence, phylogeny, gene ontology, transposon, protein subcellular localization and secreted protein. Transcriptomic data of P. multocida are also selectively adopted for gene expression analysis. The PamulDB continues to be developed and improved to better aid researchers in identifying and classifying pathogens and in dissecting the mechanisms of pathogen infection and host response.
      PubDate: Mon, 25 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz025
      Issue No: Vol. 2019 (2019)
  • SuCComBase: a manually curated repository of plant sulfur-containing

    • Authors: Harun S; Abdullah-Zawawi M, A-Rahman M, et al.
      Abstract: Plants produce a wide range of secondary metabolites that play important roles in plant defense and immunity, their interaction with the environment and symbiotic associations. Sulfur-containing compounds (SCCs) are a group of important secondary metabolites produced in members of the Brassicales order. SCCs constitute various groups of phytochemicals, but not much is known about them. Findings from previous studies on SCCs were scattered across the published literature, hence SuCComBase was developed to store all molecular information related to the biosynthesis of SCCs. Information that includes genes, proteins and compounds that are involved in the SCC biosynthetic pathway was manually identified from databases and the published scientific literature. Sets of co-expression data were analyzed to search for other possible (previously unknown) genes that might be involved in the biosynthesis of SCC. These genes were named as potential SCC-related encoding genes. A total of 147 known and 92 putative Arabidopsis thaliana SCC-related genes from the literature were used to identify other potential SCC-related encoding genes. We identified 778 potential SCC-related encoding genes, 4026 homologs to the SCC-related encoding genes and 116 SCCs as shown on SuCComBase homepage. Data entries are searchable from the Main page, Search, Browse and Datasets tabs. Users can easily download all data stored in SuCComBase. All publications related to SCCs are also indexed in SuCComBase, which is currently the first and only database dedicated to plant SCCs. SuCComBase aims to become a manually curated and au fait knowledge-based repository for plant SCCs.
      PubDate: Fri, 22 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz021
      Issue No: Vol. 2019 (2019)
  • EnDisease: a manually curated database for enhancer-disease associations

    • Authors: Zeng W; Min X, Jiang R.
      Abstract: Genome-wide association studies have successfully identified thousands of genomic loci potentially associated with hundreds of complex traits in the past decade. Nevertheless, the fact that more than 90% of such disease-associated variants lie in non-coding DNA with unknown functional implications calls for more advanced analysis of these genetic variants. Toward this goal, recent studies focusing on individual non-coding variants have revealed that complex diseases are often the consequences of erroneous interactions between enhancers and their target genes. However, such enhancer-disease associations are dispersed in a variety of independent studies, and thus far it is still difficult to carry out comprehensive downstream analysis with these experimentally supported enhancer-disease associations. To fill this gap, we collected experimentally supported associations between complex diseases and enhancers and then developed a manually curated database called EnDisease. Concretely, EnDisease documents 535 associations between 133 diseases and 454 enhancers, extracted from 199 articles. Moreover, after annotating these enhancers using 649 human and 115 mouse DNase-seq experiments, we find that cancer-related enhancers tend to be open across a large number of cell types. This database provides a user-friendly interface for browsing and searching, and it also allows users to download data freely. EnDisease has the potential to become a helpful and important resource for researchers who aim to understand the molecular mechanisms of enhancers involved in complex diseases.
      PubDate: Thu, 21 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz020
      Issue No: Vol. 2019 (2019)
  • FairBase: a comprehensive database of fungal A-to-I RNA editing

    • Authors: Liu J; Wang D, Su Y, et al.
      Abstract: Frequent A-to-I RNA editing has recently been identified in fungi despite the absence of recognizable homologues of metazoan ADARs (“Adenosine Deaminases Acting on RNA”). In particular, there is emerging evidence showing that A-to-I editing is involved in sexual reproduction of filamentous fungi. Here, we report on the creation of FairBase — a fungal A-to-I RNA editing database that provides a platform for deep exploration of fungal RNA editing to relevant academic communities. This database includes a comprehensive collection of A-to-I editing sites in six filamentous fungal species, together with extensive annotations for each editing site. In FairBase, users can conveniently search editing sites and obtain editing levels for each editing site in various RNA-seq samples. In addition, the pathways involving RNA editing are built in FairBase to help users understand the functions of RNA editing. Furthermore, each fungal species has a genome browser (JBrowse) that allows users to explore A-to-I editing in a genomic context. FairBase is the first fungal RNA editing database.
      PubDate: Tue, 19 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz018
      Issue No: Vol. 2019 (2019)
  • A cross-source, system-agnostic solution for clinical data review

    • Authors: Farnum M; Ashok M, Kowalski D, et al.
      Abstract: Assembly of complete and error-free clinical trial data sets for statistical analysis and regulatory submission requires extensive effort and communication among investigational sites, central laboratories, pharmaceutical sponsors, contract research organizations and other entities. Traditionally, this data is captured, cleaned and reconciled through multiple disjointed systems and processes, which is resource intensive and error prone. Here, we introduce a new system for clinical data review that helps data managers identify missing, erroneous and inconsistent data and manage queries in a unified, system-agnostic and efficient way. Our solution enables timely and integrated access to all study data regardless of source, facilitates the review of validation and discrepancy checks and the management of the resulting queries, tracks the status of page review, verification and locking activities, monitors subject data cleanliness and readiness for database lock and provides extensive configuration options to meet any study’s needs, automation for regular updates and fit-for-purpose user interfaces for global oversight and problem detection.
      PubDate: Mon, 18 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz017
      Issue No: Vol. 2019 (2019)
  • A scalable, aggregated genotypic–phenotypic database for human
           disease variation

    • Authors: Barrett R; Neben C, Zimmer A, et al.
      Abstract: Next generation sequencing multi-gene panels have greatly improved the diagnostic yield and cost effectiveness of genetic testing and are rapidly being integrated into the clinic for hereditary cancer risk. With this technology comes a dramatic increase in the volume, type and complexity of data. This invaluable data though is too often buried or inaccessible to researchers, especially to those without strong analytical or programming skills. To effectively share comprehensive, integrated genotypic–phenotypic data, we built Color Data, a publicly available, cloud-based database that supports broad access and data literacy. The database is composed of 50 000 individuals who were sequenced for 30 genes associated with hereditary cancer risk and provides useful information on allele frequency and variant classification, as well as associated phenotypic information such as demographics and personal and family history. Our user-friendly interface allows researchers to easily execute their own queries with filtering, and the results of queries can be shared and/or downloaded. The rapid and broad dissemination of these research results will help increase the value of, and reduce the waste in, scientific resources and data. Furthermore, the database is able to quickly scale and support integration of additional genes and human hereditary conditions. We hope that this database will help researchers and scientists explore genotype–phenotype correlations in hereditary cancer, identify novel variants for functional analysis and enable data-driven drug discovery and development.
      PubDate: Wed, 13 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz013
      Issue No: Vol. 2019 (2019)
  • LIVE: a manually curated encyclopedia of experimentally validated
           interactions of lncRNAs

    • Authors: An G; Sun J, Ren C, et al.
      Abstract: Advances in studies of long noncoding RNAs (lncRNAs) have provided data regarding the regulatory roles of lncRNAs, which perform functional roles through interactions with other functional elements. To track the underlying relationships among lncRNAs, various databases have been developed as repositories for lncRNA data. However, the ability to comprehensively explore the diverse interactions between lncRNAs and other functional elements is limited. To this end, we developed LIVE (LncRNA Interaction Validated Encyclopaedia), an interactive resource to integrate the diverse interactions of functional elements with lncRNAs. LIVE is a manually curated database of experimentally validated interactions of lncRNAs with genes, proteins and other various functional elements. By mining publications, we constructed LIVE with the following three interaction networks: a binding interaction network, a regulation network and a disease network; then, we combined them to form a comprehensive lncRNA interaction network. The current release of LIVE contains the validated interactions of 572 lncRNAs in humans and mice with 103 proteins, 209 genes, 56 transcription factors and 194 diseases. LIVE provides an interactive interface with charts and figures to aid users in searching and browsing interactions with lncRNAs. LIVE will greatly facilitate further investigation into the regulatory roles of lncRNAs and is freely available.
      PubDate: Wed, 13 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz011
      Issue No: Vol. 2019 (2019)
  • Increased interactivity and improvements to the GigaScience database,

    • Authors: Xiao S; Armit C, Edmunds S, et al.
      Abstract: With a large increase in the volume and type of data archived in GigaScience Database (GigaDB) since its launch in 2011, we have studied the metrics and user patterns to assess the important aspects needed to best suit current and future use. This has led to new front-end developments and enhanced interactivity and functionality that greatly improve user experience. In this article, we present an overview of the current practices including the Biocurational role of the GigaDB staff, the broad usage metrics of GigaDB datasets and an update on how the GigaDB platform has been overhauled and enhanced to improve the stability and functionality of the codebase. Finally, we report on future directions for the GigaDB resource.
      PubDate: Mon, 11 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz016
      Issue No: Vol. 2019 (2019)
  • RiceMetaSysB: a database of blast and bacterial blight responsive genes in
           rice and its utilization in identifying key blast-resistant WRKY genes

    • Authors: Sureshkumar V; Dutta B, Kumar V, et al.
      Abstract: Nearly two decades of revolution in the area of genomics serves as the basis of present-day molecular breeding in major food crops such as rice. Here we report an open source database on two major biotic stresses of rice, named RiceMetaSysB, which provides detailed information about rice blast and bacterial blight (BB) responsive genes (RGs). Meta-analysis of microarray data from different blast- and BB-related experiments across 241 and 186 samples identified 15135 unique genes for blast and 7475 for BB. A total of 9365 and 5375 simple sequence repeats (SSRs) in blast and BB RGs were identified for marker development. Retrieval of candidate genes using different search options like genotypes, tissue, developmental stage of the host, strain, hours/days post-inoculation, physical position and SSR marker information is facilitated in the database. Search options like ‘common genes among varieties’ and ‘strains’ have been enabled to identify robust candidate genes. A 2D representation of the data can be used to compare expression profiles across genes, genotypes and strains. To demonstrate the utility of this database, we queried for blast-responsive WRKY genes (fold change ≥5) using their gene IDs. The structural variations in the 12 WRKY genes so identified and their promoter regions were explored in two rice genotypes contrasting for their reaction to blast infection. Expression analysis of these genes in panicle tissue infected with a virulent and an avirulent strain of Magnaporthe oryzae could identify WRKY7, WRKY58, WRKY62, WRKY64 and WRKY76 as potential candidate genes for resistance to panicle blast, as they showed higher expression only in the resistant genotype against the virulent strain. Thus, we demonstrated that RiceMetaSysB can play an important role in providing robust candidate genes for rice blast and BB.
      PubDate: Mon, 11 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz015
      Issue No: Vol. 2019 (2019)
  • Integrated curation and data mining for disease and phenotype models at
           the Rat Genome Database

    • Authors: Wang S; Laulederkind S, Zhao Y, et al.
      Abstract: Rats have been used as research models in biomedical research for over 150 years. These disease models arise from naturally occurring mutations, selective breeding and, more recently, genome manipulation. Through the innovation of genome-editing technologies, genome-modified rats provide precision models of disease by disrupting or complementing targeted genes. To facilitate the use of these data produced from rat disease models, the Rat Genome Database (RGD) organizes rat strains and annotates these strains with disease and qualitative phenotype terms as well as quantitative phenotype measurements. From the curated quantitative data, the expected phenotype profile ranges were established through a meta-analysis pipeline using inbred rat strains in control conditions. The disease and qualitative phenotype annotations are propagated to their associated genes and alleles if applicable. Currently, RGD has curated nearly 1300 rat strains with disease/phenotype annotations and about 11% of them have known allele associations. All of the annotations (disease and phenotype) are integrated and displayed on the strain, gene and allele report pages. Finding disease and phenotype models at RGD can be done by searching for terms in the ontology browser, browsing the disease or phenotype ontology branches or entering keywords in the general search. Use cases are provided to show different targeted searches of rat strains at RGD.
      PubDate: Mon, 11 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz014
      Issue No: Vol. 2019 (2019)
  • Using deep learning to identify translational research in genomic medicine
           beyond bench to bedside

    • Authors: Hsu Y; Clyne M, Wei C, et al.
      Abstract: Tracking scientific research publications on the evaluation, utility and implementation of genomic applications is critical for the translation of basic research to impact clinical and population health. In this work, we utilize state-of-the-art machine learning approaches to identify translational research in genomics beyond bench to bedside from the biomedical literature. We apply convolutional neural networks (CNNs) and support vector machines (SVMs) to bench/bedside article classification on the weekly manual annotation data of the Public Health Genomics Knowledge Base database. Both classifiers employ salient features to determine the probability of curation-eligible publications, which can effectively reduce the workload of the manual triage and curation process. We applied the CNNs and SVMs to an independent test set (n = 400), and the models achieved F-measures of 0.80 and 0.74, respectively. We further tested the CNNs, which performed better, on the routine annotation pipeline for 2 weeks; this significantly reduced the effort and retrieved more appropriate research articles. Our approaches provide direct insight into the automated curation of genomic translational research beyond bench to bedside. The machine learning classifiers are found to be helpful for annotators to enhance the efficiency of manual curation.
      PubDate: Fri, 08 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz010
      Issue No: Vol. 2019 (2019)
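      Note on the metric: the F-measure quoted in the abstract above is conventionally the harmonic mean of precision and recall. The abstract does not report the underlying counts, so the sketch below uses made-up numbers purely for illustration, and the function name `f_measure` is hypothetical rather than anything taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only (not taken from the paper).
score = f_measure(tp=80, fp=20, fn=20)
print(round(score, 2))
```

      With beta = 1 the score weights precision and recall equally; a triage setting that cares more about not missing eligible articles could pass beta > 1 to favour recall.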
  • ImmunoSPdb: an archive of immunosuppressive peptides

    • Authors: Usmani S; Agrawal P, Sehgal M, et al.
      Abstract: Immunosuppression has proved to be a captivating therapy in several autoimmune disorders and asthma, as well as in organ transplantation. Immunosuppressive peptides specifically reduce the efficacy of the immune system and have a wide range of therapeutic applications. ‘ImmunoSPdb’ is a comprehensive, manually curated database of around 500 experimentally verified immunosuppressive peptides compiled from 79 research articles and 32 patents. The current version comprises 553 entries providing extensive information including peptide name, sequence, chirality, chemical modification, origin, nature of peptide, its target as well as mechanism of action, amino acid frequency and composition, etc. Data analysis revealed that most of the immunosuppressive peptides are linear (91%), are short, i.e. up to 20 amino acids in length (62%), and have the L form of amino acids (81%). About 30% of the peptides are either chemically modified or have end-terminal modifications. Most of the peptides are either derived from proteins (41%) or exist naturally (27%). Blockage of potassium ion channels (24%) is one of the major targets for immunosuppressive peptides. In addition, we have annotated tertiary structures using PEPstrMOD and I-TASSER. Many user-friendly, web-based tools have been integrated to facilitate searching, browsing and analyzing the data. We have developed a user-friendly responsive website to assist a wide range of users.
      PubDate: Fri, 08 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz012
      Issue No: Vol. 2019 (2019)
  • Enhanced taxonomy annotation of antiviral activity data from ChEMBL

    • Authors: Nikitina A; Orlov A, Kozlovskaya L, et al.
      Abstract: The discovery of antiviral drugs is a rapidly developing area of medicinal chemistry research. The emergence of resistant variants and outbreaks of poorly studied viral diseases keep this area constantly developing. The amount of antiviral activity data available in ChEMBL consistently grows, but virus taxonomy annotation of these data is not sufficient for thorough studies of antiviral chemical space. We developed a procedure for semi-automatic extraction of antiviral activity data from ChEMBL and mapped them to the virus taxonomy developed by the International Committee for Taxonomy of Viruses (ICTV). The procedure is based on the lists of virus-related values of ChEMBL annotation fields and a dictionary of virus names and acronyms mapped to ICTV taxa. Application of this data extraction procedure allows retrieving from ChEMBL 1.6 times more assays linked to 2.5 times more compounds and data points than the ChEMBL web interface allows. Mapping of these data to ICTV taxa allows analyzing all the compounds tested against each viral species. Activity values and structures of the compounds were standardized, and the antiviral activity profile was created for each standard structure. The data set compiled using this algorithm was called ViralChEMBL. As case studies, we compared descriptor and scaffold distributions for the full ChEMBL and its ‘viral’ and ‘non-viral’ subsets, identified the most studied compounds and created a self-organizing map for ViralChEMBL. Our approach to data annotation appeared to be a very efficient tool for the study of antiviral chemical space.
      PubDate: Fri, 08 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/bay139
      Issue No: Vol. 2019 (2019)
  • The radish genome database (RadishGD): an integrated information resource
           for radish genomics

    • Authors: Yu H; Baek S, Lee Y, et al.
      Abstract: Radish (Raphanus sativus L.) is an important root vegetable crop in the family Brassicaceae, which provides diverse nutrients for human health and is closely related to the Brassica crop species. Recently, we sequenced and assembled the radish genome into nine chromosome pseudomolecules. In addition, we developed diverse genomic resources, including genetic maps, molecular markers, transcriptome, genome-wide methylation and variome data. In this study, we describe the radish genome database (RadishGD), including details of data sets that we generated and the web interface that allows access to these data. RadishGD comprises six major units that enable researchers and general users to search, browse and analyze the radish genomic data in an integrated manner. The Search unit provides gene structures and sequences for gene models through keyword or BLAST searches. The Genome browser displays graphic representations of gene models, mRNAs, repetitive sequences, genome-wide methylation and variomes among various genotypes. The Functional annotation unit offers gene ontology, plant ontology, pathway and gene family information for gene models. The Genetic map unit provides information about markers and their genetic locations using two types of genetic maps. The Expression unit presents transcriptional characteristics and methylation levels for each gene in 18 tissues. All sequence data incorporated into RadishGD can be downloaded from the Data resources unit. RadishGD will be continually updated to serve as a community resource for radish genomics and breeding research.
      PubDate: Tue, 05 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz009
      Issue No: Vol. 2019 (2019)
  • ZincBind—the database of zinc binding sites

    • Authors: Ireland S; Martin A.
      Abstract: Zinc is one of the most important biologically active metals. Ten per cent of the human genome is thought to encode a zinc binding protein and its uses encompass catalysis, structural stability, gene expression and immunity. At present, there is no specific resource devoted to identifying and presenting all currently known zinc binding sites. Here we present ZincBind, a database of zinc binding sites and its web front-end. Using the structural data in the Protein Data Bank, ZincBind identifies every instance of zinc binding to a protein, identifies its binding site and clusters sites based on 90% sequence identity. There are currently 24 992 binding sites, clustered into 7489 unique sites. The data are available over the web where they can be browsed and downloaded, and via a REST API. ZincBind is regularly updated and will continue to be updated with new data and features.
      PubDate: Tue, 05 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz006
      Issue No: Vol. 2019 (2019)
  • Integration of macromolecular complex data into the Saccharomyces Genome

    • Authors: Wong E; Skrzypek M, Weng S, et al.
      Abstract: Proteins seldom function individually. Instead, they interact with other proteins or nucleic acids to form stable macromolecular complexes that play key roles in important cellular processes and pathways. One of the goals of Saccharomyces Genome Database (SGD) is to provide a complete picture of budding yeast biological processes. To this end, we have collaborated with the Molecular Interactions team that provides the Complex Portal database at EMBL-EBI to manually curate the complete yeast complexome. These data, from a total of 589 complexes, were previously available only in SGD’s YeastMine data warehouse and the Complex Portal. We have now incorporated these macromolecular complex data into the SGD core database and designed complex-specific reports to make these data easily available to researchers. These web pages contain referenced summaries focused on the composition and function of individual complexes. In addition, detailed information about how subunits interact within the complex, their stoichiometry and the physical structure are displayed when such information is available. Finally, we generate network diagrams displaying subunits and Gene Ontology annotations that are shared between complexes. Information on macromolecular complexes will continue to be updated in collaboration with the Complex Portal team and curated as more data become available.
      PubDate: Mon, 04 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz008
      Issue No: Vol. 2019 (2019)
  • CircFunBase: a database for functional circular RNAs

    • Authors: Meng X; Hu D, Zhang P, et al.
      Abstract: Increasing evidence reveals that circular RNAs (circRNAs) are widespread in eukaryotes and play important roles in diverse biological processes. However, a comprehensive functionally annotated circRNA database is still lacking. CircFunBase is a web-accessible database that aims to provide a high-quality functional circRNA resource including experimentally validated and computationally predicted functions. The current version of CircFunBase documents more than 7000 manually curated functional circRNA entries, mainly from species such as Homo sapiens and Mus musculus. CircFunBase provides visualized circRNA-miRNA interaction networks. In addition, a genome browser is provided to visualize the genome context of circRNAs. As a biological information platform for circRNAs, CircFunBase will contribute to circRNA studies and bridge the gap between circRNAs and their functions.
      PubDate: Mon, 04 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz003
      Issue No: Vol. 2019 (2019)
  • Annotation of gene product function from high-throughput studies using the
           Gene Ontology

    • Authors: Attrill H; Gaudet P, Huntley R, et al.
      Abstract: High-throughput studies constitute an essential and valued source of information for researchers. However, high-throughput experimental workflows are often complex, with multiple data sets that may contain large numbers of false positives. The representation of high-throughput data in the Gene Ontology (GO) therefore presents a challenging annotation problem, when the overarching goal of GO curation is to provide the most precise view of a gene's role in biology. To address this, representatives from annotation teams within the GO Consortium reviewed high-throughput data annotation practices. We present an annotation framework for high-throughput studies that will facilitate good standards in GO curation and, through the use of new high-throughput evidence codes, increase the visibility of these annotations to the research community.
      PubDate: Fri, 01 Feb 2019 00:00:00 GMT
      DOI: 10.1093/database/baz007
      Issue No: Vol. 2019 (2019)
  • APID database: redefining protein–protein interaction experimental
           evidences and binary interactomes

    • Authors: Alonso-López D; Campos-Laborie F, Gutiérrez M, et al.
      Abstract: The collection and integration of all the known protein–protein physical interactions within a proteome framework are critical to allow proper exploration of the protein interaction networks that drive biological processes in cells at molecular level. APID Interactomes is a public resource of biological data that provides a comprehensive and curated collection of ‘protein interactomes’ for more than 1100 organisms, including 30 species with more than 500 interactions, derived from the integration of experimentally detected protein-to-protein physical interactions (PPIs). We have performed an update of APID database including a redefinition of several key properties of the PPIs to provide a more precise data integration and to avoid false duplicated records. This includes the unification of all the PPIs from five primary databases of molecular interactions (BioGRID, DIP, HPRD, IntAct and MINT), plus the information from two original systematic sources of human data and from experimentally resolved 3D structures (i.e. PDBs, Protein Data Bank files, where more than two distinct proteins have been identified). Thus, APID provides PPIs reported in published research articles (with traceable PMIDs) and detected by valid experimental interaction methods that give evidences about such protein interactions (following the ‘ontology and controlled vocabulary’ developed by ‘HUPO PSI-MI’). Within this data mining framework, all interaction detection methods have been grouped into two main types: (i) ‘binary’ physical direct detection methods and (ii) ‘indirect’ methods. As a result of these redefinitions, APID provides unified protein interactomes including the specific ‘experimental evidences’ that support each PPI, indicating whether the interactions can be considered ‘binary’ (i.e. supported by at least one binary detection method) or not.
      PubDate: Thu, 31 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/baz005
      Issue No: Vol. 2019 (2019)
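      Note on the ‘binary’ rule: the abstract above states that an interaction is considered binary when at least one of its supporting detection methods is a binary method, and that records are unified to avoid false duplicates. A minimal sketch of that logic follows. It is an illustration only, not APID's implementation: the record layout, the protein and publication identifiers, the function name `unify` and the assignment of "two hybrid" to the binary group are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical set of methods treated as 'binary'; the real grouping
# follows the PSI-MI controlled vocabulary and is not reproduced here.
BINARY_METHODS = {"two hybrid"}


def unify(records):
    """Merge evidence records per unordered protein pair and flag binary pairs.

    Each record is (protein_a, protein_b, detection_method, publication_id).
    """
    evidence = defaultdict(set)
    for prot_a, prot_b, method, pub_id in records:
        # A-B and B-A describe the same interaction, so sort the pair.
        pair = tuple(sorted((prot_a, prot_b)))
        # A set drops exact duplicate evidences for the same pair.
        evidence[pair].add((method, pub_id))
    return {
        pair: {
            "evidences": sorted(ev),
            # Binary if at least one supporting method is a binary method.
            "binary": any(method in BINARY_METHODS for method, _ in ev),
        }
        for pair, ev in evidence.items()
    }


# Made-up records for illustration.
records = [
    ("P2", "P1", "two hybrid", "pub-1"),
    ("P1", "P2", "affinity purification", "pub-2"),
    ("P1", "P2", "affinity purification", "pub-2"),  # exact duplicate
    ("P3", "P4", "affinity purification", "pub-3"),
]
for pair, info in unify(records).items():
    print(pair, info["binary"], len(info["evidences"]))
```

      Keying on the sorted pair is one simple way to stop the same interaction being counted twice when different sources list the two proteins in opposite order.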
  • Meta-omics data and collection objects (MOD-CO): a conceptual schema and
           data model for processing sample data in meta-omics research

    • Authors: Rambold G; Yilmaz P, Harjes J, et al.
      Abstract: With the advent of advanced molecular meta-omics techniques and methods, a new era commenced for analysing and characterizing historic collection specimens, as well as recently collected environmental samples. Nucleic acid and protein sequencing-based analyses are increasingly applied to determine the origin, identity and traits of environmental (biological) objects and organisms. In this context, the need for new data structures is evident and former approaches for data processing need to be expanded according to the new meta-omics techniques and operational standards. Existing schemas and community standards in the biodiversity and molecular domain concentrate on terms important for data exchange and publication. Detailed operational aspects of origin and laboratory as well as object and data management issues are frequently neglected. Meta-omics Data and Collection Objects (MOD-CO) has therefore been set up as a new schema for meta-omics research, with a hierarchical organization of the concepts describing collection samples, as well as products and data objects being generated during operational workflows. It is focussed on object trait descriptions as well as on operational aspects and thereby may serve as a backbone for R&D laboratory information management systems with functions of an electronic laboratory notebook. The schema in its current version 1.0 includes 653 concepts and 1810 predefined concept values, being equivalent to descriptors and descriptor states, respectively. It is published in several representations, like a Semantic Media Wiki publication with 2463 interlinked Wiki pages for concepts and concept values, being grouped in 37 concept collections and subcollections. The SQL database application DiversityDescriptions, a generic tool for maintaining descriptive data and schemas, has been applied for setting up and testing MOD-CO and for concept mapping on elements of corresponding schemas.
      PubDate: Thu, 31 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/baz002
      Issue No: Vol. 2019 (2019)
  • One tool to find them all: a case of data integration and querying in a
           distributed LIMS platform

    • Authors: Grand A; Geda E, Mignone A, et al.
      Abstract: In recent years, Laboratory Information Management Systems (LIMS) have been growing from mere inventory systems into increasingly comprehensive software platforms, spanning functionalities as diverse as data search, annotation and analysis. In 2011, our institution started a LIMS project named the Laboratory Assistant Suite with the purpose of assisting researchers throughout all of their laboratory activities, providing graphical tools to support decision-making tasks and building complex analyses on integrated data. The modular architecture of the system exploits multiple databases with different technologies. To provide an efficient and easy tool for retrieving information of interest, we developed the Multi-Dimensional Data Manager (MDDM). By means of intuitive interfaces, scientists can execute complex queries without any knowledge of query languages or database structures, and easily integrate heterogeneous data stored in multiple databases. Together with the other software modules making up the platform, the MDDM has helped improve the overall quality of the data, substantially reduced the time spent on manual data entry and retrieval and ultimately broadened the spectrum of interconnections among the data, offering novel perspectives to the biomedical analysts.
      PubDate: Wed, 30 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/baz004
      Issue No: Vol. 2019 (2019)
  • Automatic identification of relevant chemical compounds from patents

    • Authors: Akhondi S; Rey H, Schwörer M, et al.
      Abstract: In commercial research and development projects, public disclosure of new chemical compounds often takes place in patents. Only a small proportion of these compounds are published in journals, usually a few years after the patent. Patent authorities make available the patents but do not provide systematic continuous chemical annotations. Content databases such as Elsevier’s Reaxys provide such services mostly based on manual excerptions, which are time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Different text-mining approaches exist to extract chemical entities from patents. The majority of them have been developed using sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to relevancy of a compound in a patent. Relevancy of a compound to a patent is based on the patent’s context. A relevant compound plays a major role within a patent. Identification of relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. supports identifying the main compounds). Annotators of databases like Reaxys only annotate relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and classify their relevance with high performance. This enables the extension of the Reaxys database by means of automation.
      PubDate: Wed, 30 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/baz001
      Issue No: Vol. 2019 (2019)
  • Overview of the BioCreative VI Precision Medicine Track: mining protein
           interactions and mutations for precision medicine

    • Authors: Islamaj Doğan R; Kim S, Chatr-aryamontri A, et al.
      Abstract: The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein–protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
      PubDate: Mon, 28 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay147
      Issue No: Vol. 2019 (2019)
  • EnhancerDB: a resource of transcriptional regulation in the context of

    • Authors: Kang R; Zhang Y, Huang Q, et al.
      Abstract: Enhancers can act as cis-regulatory elements to control transcriptional regulation by recruiting DNA-binding transcription factors (TFs) in a tissue-specific manner. Recent studies show that enhancers regulate not only protein-coding genes but also microRNAs (miRNAs), and mutations within the TF binding sites (TFBSs) located on enhancers will cause a variety of diseases such as cancer. However, a comprehensive resource to integrate these regulation elements for revealing transcriptional regulations in the context of enhancers is not currently available. Here, we introduce EnhancerDB, a web-accessible database to provide a resource to browse and search regulatory relationships identified in this study, including 131 054 581 TF–enhancer, 17 059 enhancer–miRNAs, 318 993 enhancer–genes, 4 639 558 TF–miRNAs, 1 059 695 TF–genes, 11 439 394 enhancer–single-nucleotide polymorphisms (SNPs) and 23 334 genes associated with expression quantitative trait loci (eQTL) SNP and expression profile of TF/gene/miRNA across multiple human tissues/cell lines. We also developed a tool that further allows users to define tissue-specific enhancers by setting the threshold score of tissue specificity of enhancers. In addition, links to external resources are also available at EnhancerDB.
      PubDate: Thu, 24 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay141
      Issue No: Vol. 2019 (2019)
  • Towards comprehensive annotation of Drosophila melanogaster enzymes in

    • Authors: Garapati P; Zhang J, Rey A, et al.
      Abstract: The catalytic activities of enzymes can be described using Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. These annotations are available from numerous biological databases and are routinely accessed by researchers and bioinformaticians to direct their work. However, enzyme data may not be congruent between different resources, while the origin, quality and genomic coverage of these data within any one resource are often unclear. GO/EC annotations are assigned either manually by expert curators or inferred computationally, and there is potential for errors in both types of annotation. If such errors remain unchecked, false positive annotations may be propagated across multiple resources, significantly degrading the quality and usefulness of these data. Similarly, the absence of annotations (false negatives) from any one resource can lead to incorrect inferences or conclusions. We are systematically reviewing and enhancing the functional annotation of the enzymes of Drosophila melanogaster, focusing on improvements within the FlyBase database. We have reviewed four major enzyme groups to date: oxidoreductases, lyases, isomerases and ligases. Herein, we describe our review workflow, the improvement in the quality and coverage of enzyme annotations within FlyBase and the wider impact of our work on other related databases.
      PubDate: Wed, 23 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay144
      Issue No: Vol. 2019 (2019)
  • ccPDB 2.0: an updated version of datasets created and compiled from
           Protein Data Bank

    • Authors: Agrawal P; Patiyal S, Kumar R, et al.
      Abstract: ccPDB 2.0 is an updated version of the manually curated database ccPDB that maintains datasets required for developing methods to predict the structure and function of proteins. The number of datasets compiled from literature increased from 45 to 141 in ccPDB 2.0. Similarly, the number of protein structures used for creating datasets also increased from ~74 000 to ~137 000 (PDB March 2018 release). ccPDB 2.0 provides the same web services and flexible tools which were present in the previous version of the database. In the updated version, links to a number of methods developed in the past few years have also been incorporated. This updated resource is built on responsive templates that are compatible with smartphones (mobile, iPhone, iPad, tablets etc.) and large screen gadgets. In summary, ccPDB 2.0 is a user-friendly web-based platform that provides comprehensive as well as updated information about datasets.
      PubDate: Wed, 23 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay142
      Issue No: Vol. 2019 (2019)
  • A manual corpus of annotated main findings of clinical case reports

    • Authors: Smalheiser N; Luo M, Addepalli S, et al.
      Abstract: Clinical case reports are the ‘eyewitness reports’ of medicine and provide a valuable, unique, albeit noisy and underutilized type of evidence. Generally a case report has a single main finding that represents the reason for writing up the report in the first place. In the present study, we present the results of manual annotation carried out by two individuals on 500 randomly sampled case reports. This corpus contains main finding sentences extracted from title, abstract and full-text of the same article that can be regarded as semantically related and are often paraphrases. The final reconciled corpus of 416 articles comprises an open resource for further study. This is the first step in establishing text mining models and tools that can identify main finding sentences in an automated fashion, and in measuring quantitatively how similar any two main findings are. We envision that case reports in PubMed may be automatically indexed by main finding, so that users can carry out information queries for specific main findings (rather than general topics), and given one case report, a user can retrieve those having the most similar main findings. The metric of main finding similarity may also potentially be relevant to the modeling of paraphrasing, summarization and entailment within the biomedical literature.
      PubDate: Thu, 17 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay143
      Issue No: Vol. 2019 (2019)
  • RRMdb—an evolutionary-oriented database of RNA recognition motif

    • Authors: Nowacka M; Boccaletto P, Jankowska E, et al.
      Abstract: RNA-recognition motif (RRM) is an RNA-interacting protein domain that plays an important role in the processes of RNA metabolism such as the splicing, editing, export, degradation, and regulation of translation. Here, we present the RNA-recognition motif database (RRMdb), which affords rapid identification and annotation of RRM domains in a given protein sequence. The RRMdb database is compiled from ~57 000 collected representative RRM domain sequences, classified into 415 families. Whenever possible, the families are associated with the available literature and structural data. Moreover, the RRM families are organized into a network of sequence similarities that allows for the assessment of the evolutionary relationships between them.
      PubDate: Wed, 16 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay148
      Issue No: Vol. 2019 (2019)
  • Restructured GEO: restructuring Gene Expression Omnibus metadata for
           genome dynamics analysis

    • Authors: Chen G; Ramírez J, Deng N, et al.
      Abstract: Motivation: Gene Expression Omnibus (GEO) and other publicly available data store their metadata in the format of unstructured English text, which is very difficult for automated reuse. Results: We employed text mining techniques to analyze the metadata of GEO and developed Restructured GEO database (ReGEO). ReGEO reorganizes and categorizes GEO series and makes them searchable by two new attributes extracted automatically from each series’ metadata. These attributes are the number of time points tested in the experiment and the disease being investigated. ReGEO also makes series searchable by other attributes available in GEO, such as platform organism, experiment type, associated PubMed ID as well as general keywords in the study’s description. Our approach greatly expands the usability of GEO data, demonstrating a credible approach to improve the utility of vast amount of publicly available data in the era of Big Data research.
      PubDate: Wed, 16 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay145
      Issue No: Vol. 2019 (2019)
  • Involving community in genes and pathway curation

    • Authors: Naithani S; Gupta P, Preece J, et al.
      Abstract: Biocuration plays a crucial role in building databases and complex systems-level platforms required for processing, annotating and analyzing ‘Big Data’ in biology. However, biocuration efforts cannot keep pace with a dramatic increase in the production of omics data; this presents one of the bottlenecks in genomics. In two pathway curation jamborees, Plant Reactome curators tested strategies for introducing researchers to pathway curation tools, harnessing biologists’ expertise in curating plant pathways and developing a network of community biocurators. We summarize the strategy, workflow and outcomes of these exercises, and discuss the role of community biocuration in advancing databases and genomic resources.
      PubDate: Wed, 16 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay146
      Issue No: Vol. 2019 (2019)
  • Extracting chemical–protein interactions from literature using sentence
           structure analysis and feature engineering

    • Authors: Lung P; He Z, Zhao T, et al.
      Abstract: Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time and resource consuming. In this study, we propose a computational method to automatically extract chemical–protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in the track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay138
      Issue No: Vol. 2019 (2019)
  • TP53LNC-DB, the database of lncRNAs in the p53 signalling network

    • Authors: Khan M; Bukhari I, Khan R, et al.
      Abstract: The TP53 gene product, p53, is a pleiotropic transcription factor induced by stress, which functions to promote cell cycle arrest, apoptosis and senescence. Genome-wide profiling has revealed an extensive system of long noncoding RNAs (lncRNAs) that is integral to the p53 signalling network. As a research tool, we implemented a public access database called TP53LNC-DB that annotates currently available information relating lncRNAs to p53 signalling in humans.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay136
      Issue No: Vol. 2019 (2019)
  • RAEdb: a database of enhancers identified by high-throughput reporter

    • Authors: Cai Z; Cui Y, Tan Z, et al.
      Abstract: High-throughput reporter assays have been recently developed to directly and quantitatively assess enhancer activity for thousands of regulatory elements. However, there is still no database to collect these enhancers. We developed RAEdb, the first database to collect enhancers identified by high-throughput reporter assays. RAEdb includes 538 320 enhancers derived from eight studies, most of which were from six human cell lines. An activity score was assigned to each enhancer based on reporter assays. Based on these enhancers, 7658 epromoters (promoters with enhancer activity) were identified and stored in the database. RAEdb provides two ways of searches: the first is to search studies by species and cell line; the other is to search enhancers or epromoters by position, activity score, sequence and gene. RAEdb also provides a genome browser to query, visualize and compare enhancers. All data in RAEdb is freely available for download.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay140
      Issue No: Vol. 2019 (2019)
  • PubTerm: a web tool for organizing, annotating and curating genes,
           diseases, molecules and other concepts from PubMed records

    • Authors: Garcia-Pelaez J; Rodriguez D, Medina-Molina R, et al.
      Abstract: Background and objective: Analysis, annotation and curation of biomedical scientific literature is a recurrent task in biomedical research, database curation and clinics. Commonly, the reading is centered on concepts such as genes, diseases or molecules. Database curators may also need to annotate published abstracts related to a specific topic. However, few free and intuitive tools exist to assist users in this context. Therefore, we developed PubTerm, a web tool to organize, categorize, curate and annotate a large number of PubMed abstracts related to biological entities such as genes, diseases, chemicals, species, sequence variants and other related information. Methods: A variety of interfaces were implemented to facilitate curation and annotation, including the organization of abstracts by terms, by the co-occurrence of terms or by specific phrases. Information includes statistics on the occurrence of terms. The abstracts, terms and other related information can be annotated and categorized using user-defined categories. The session information can be saved and restored, and the data can be exported to other formats. Results: The pipeline in PubTerm starts by specifying a PubMed query or list of PubMed identifiers. Then, the user can specify three lists of categories and specify what information will be highlighted in which colors. The user then utilizes the ‘term view’ to organize the abstracts by gene, disease, species or other information to facilitate the annotation and categorization of terms or abstracts. Other views also facilitate the exploration of abstracts and connections between terms. We have used PubTerm to quickly and efficiently curate collections of more than 400 abstracts that mention more than 350 genes to generate revised lists of susceptibility genes for diseases.
An example is provided for pulmonary arterial hypertension. Conclusions: PubTerm saves time for literature revision by assisting with annotation organization and knowledge acquisition.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay137
      Issue No: Vol. 2019 (2019)
  • TogoGenome/TogoStanza: modularized Semantic Web genome database

    • Authors: Katayama T; Kawashima S, Okamoto S, et al.
      Abstract: TogoGenome is a genome database that is purely based on the Semantic Web technology, which enables the integration of heterogeneous data and flexible semantic searches. All the information is stored as Resource Description Framework (RDF) data, and the reporting web pages are generated on the fly using SPARQL Protocol and RDF Query Language (SPARQL) queries. TogoGenome provides a semantic-faceted search system by gene functional annotation, taxonomy, phenotypes and environment based on the relevant ontologies. TogoGenome also serves as an interface to conduct semantic comparative genomics by which a user can observe pan-organism or organism-specific genes based on the functional aspect of gene annotations and the combinations of organisms from different taxa. The TogoGenome database exhibits a modularized structure, and each module in the report pages is separately served as TogoStanza, which is a generic framework for rendering an information block as IFRAME/Web Components, which can, unlike several other monolithic databases, also be reused to construct other databases. TogoGenome and TogoStanza have been under development since 2012 and are freely available, along with their source code, in GitHub repositories under the MIT license.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay132
      Issue No: Vol. 2019 (2019)
  • The integrated National NeuroAIDS Tissue Consortium database: a rich
           platform for neuroHIV research

    • Authors: Heithoff A; Totusek S, Le D, et al.
      Abstract: Herein we present major updates to the National NeuroAIDS Tissue Consortium (NNTC) database. The NNTC's ongoing multisite clinical research study was established to facilitate access to ante-mortem and post-mortem data, tissues and biofluids for the neuroHIV (human immunodeficiency virus) research community. Recently, the NNTC has expanded to include data from the central nervous system HIV Antiretroviral Therapy Effects Research (CHARTER) study. The data and biospecimens from CHARTER and NNTC cohorts are available to qualified researchers upon request. Data generated by requestors using NNTC biospecimens and tissues are returned to the NNTC upon the conclusion of requestors' work, and these external experimental data are annotated and curated in the publicly accessible NNTC database, thereby extending the utility of each case. A flexible and extensible database ontology allows the integration of disparate data sets, including external experimental data, clinical neuropsychological and neuromedical testing data, tissue pathology and neuroimaging data.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay134
      Issue No: Vol. 2019 (2019)
  • Combining relation extraction with function detection for BEL statement

    • Authors: Liu S; Cheng W, Qian L, et al.
      Abstract: The BioCreative-V community proposed a challenging task of automatic extraction of causal relation network in Biological Expression Language (BEL) from the biomedical literature. Previous studies on this task largely used models induced from other related tasks and then transformed intermediate structures to BEL statements, which left the given training corpus unexplored. To make full use of the BEL training corpus, in this work, we propose a deep learning-based approach to extract BEL statements. Specifically, we decompose the problem into two subtasks: entity relation extraction and entity function detection. First, two attention-based bidirectional long short-term memory networks models are used to extract entity relation and entity function, respectively. Then entity relation and their functions are combined into a BEL statement. In order to boost the overall performance, a strategy of threshold filtering is applied to improve the precision of identified entity functions. We evaluate our approach on the BioCreative-V Track 4 corpus with or without gold entities. The experimental results show that our method achieves the state-of-the-art performance with an overall F1-measure of 46.9% in stage 2 and 21.3% in stage 1, respectively.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay133
      Issue No: Vol. 2019 (2019)
  • AtFusionDB: a database of fusion transcripts in Arabidopsis thaliana

    • Authors: Singh A; Zahra S, Das D, et al.
      Abstract: Fusion transcripts are chimeric RNAs generated as a result of fusion either at DNA or RNA level. These novel transcripts have been extensively studied in the case of human cancers but still remain underexamined in plants. In this study, we introduce the first plant-specific database of fusion transcripts named AtFusionDB. This is a comprehensive database that contains the detailed information about fusion transcripts identified in model plant Arabidopsis thaliana. A total of 82 969 fusion transcript entries generated from 17 181 different genes of A. thaliana are available in this database. Apart from the basic information consisting of the Ensembl gene names, official gene name, tissue type, EricScore, fusion type, AtFusionDB ID and sample ID (e.g. Sequence Read Archive ID), additional information like UniProt, gene coordinates (together with the function of parental genes), junction sequence, expression level of both parent genes and fusion transcript may be of high utility to the user. Two different types of search modules viz. ‘Simple Search’ and ‘Advanced Search’ in addition to the ‘Browse’ option with data download facility are provided in this database. Three different modules for mapping and alignment of the query sequences viz. BLASTN, SW Align and Mapping are incorporated in AtFusionDB. This database is a head start for exploring the complex and unexplored domain of gene/transcript fusion in plants.
      PubDate: Tue, 08 Jan 2019 00:00:00 GMT
      DOI: 10.1093/database/bay135
      Issue No: Vol. 2019 (2019)