
Publisher: Oxford University Press   (Total: 396 journals)



Showing 1 - 200 of 396 Journals sorted alphabetically
ACS Symposium Series     Full-text available via subscription   (SJR: 0.189, CiteScore: 0)
Acta Biochimica et Biophysica Sinica     Hybrid Journal   (Followers: 5, SJR: 0.79, CiteScore: 2)
Adaptation     Hybrid Journal   (Followers: 9, SJR: 0.143, CiteScore: 0)
Advances in Nutrition     Hybrid Journal   (Followers: 46, SJR: 2.196, CiteScore: 5)
Aesthetic Surgery J.     Hybrid Journal   (Followers: 6, SJR: 1.434, CiteScore: 1)
African Affairs     Hybrid Journal   (Followers: 64, SJR: 1.869, CiteScore: 2)
Age and Ageing     Hybrid Journal   (Followers: 91, SJR: 1.989, CiteScore: 4)
Alcohol and Alcoholism     Hybrid Journal   (Followers: 18, SJR: 1.376, CiteScore: 3)
American Entomologist     Full-text available via subscription   (Followers: 7)
American Historical Review     Hybrid Journal   (Followers: 154, SJR: 0.467, CiteScore: 1)
American J. of Agricultural Economics     Hybrid Journal   (Followers: 41, SJR: 2.113, CiteScore: 3)
American J. of Clinical Nutrition     Hybrid Journal   (Followers: 147, SJR: 3.438, CiteScore: 6)
American J. of Epidemiology     Hybrid Journal   (Followers: 175, SJR: 2.713, CiteScore: 3)
American J. of Hypertension     Hybrid Journal   (Followers: 25, SJR: 1.322, CiteScore: 3)
American J. of Jurisprudence     Hybrid Journal   (Followers: 18, SJR: 0.281, CiteScore: 1)
American J. of Legal History     Full-text available via subscription   (Followers: 8, SJR: 0.116, CiteScore: 0)
American Law and Economics Review     Hybrid Journal   (Followers: 27, SJR: 1.053, CiteScore: 1)
American Literary History     Hybrid Journal   (Followers: 15, SJR: 0.391, CiteScore: 0)
Analysis     Hybrid Journal   (Followers: 21, SJR: 1.038, CiteScore: 1)
Animal Frontiers     Hybrid Journal  
Annals of Behavioral Medicine     Hybrid Journal   (Followers: 15, SJR: 1.423, CiteScore: 3)
Annals of Botany     Hybrid Journal   (Followers: 36, SJR: 1.721, CiteScore: 4)
Annals of Oncology     Hybrid Journal   (Followers: 42, SJR: 5.599, CiteScore: 9)
Annals of the Entomological Society of America     Full-text available via subscription   (Followers: 10, SJR: 0.722, CiteScore: 1)
Annals of Work Exposures and Health     Hybrid Journal   (Followers: 32, SJR: 0.728, CiteScore: 2)
AoB Plants     Open Access   (Followers: 4, SJR: 1.28, CiteScore: 3)
Applied Economic Perspectives and Policy     Hybrid Journal   (Followers: 18, SJR: 0.858, CiteScore: 2)
Applied Linguistics     Hybrid Journal   (Followers: 56, SJR: 2.987, CiteScore: 3)
Applied Mathematics Research eXpress     Hybrid Journal   (Followers: 1, SJR: 1.241, CiteScore: 1)
Arbitration Intl.     Full-text available via subscription   (Followers: 20)
Arbitration Law Reports and Review     Hybrid Journal   (Followers: 14)
Archives of Clinical Neuropsychology     Hybrid Journal   (Followers: 30, SJR: 0.731, CiteScore: 2)
Aristotelian Society Supplementary Volume     Hybrid Journal   (Followers: 3)
Arthropod Management Tests     Hybrid Journal   (Followers: 2)
Astronomy & Geophysics     Hybrid Journal   (Followers: 43, SJR: 0.146, CiteScore: 0)
Behavioral Ecology     Hybrid Journal   (Followers: 52, SJR: 1.871, CiteScore: 3)
Bioinformatics     Hybrid Journal   (Followers: 303, SJR: 6.14, CiteScore: 8)
Biology Methods and Protocols     Hybrid Journal  
Biology of Reproduction     Full-text available via subscription   (Followers: 10, SJR: 1.446, CiteScore: 3)
Biometrika     Hybrid Journal   (Followers: 20, SJR: 3.485, CiteScore: 2)
BioScience     Hybrid Journal   (Followers: 29, SJR: 2.754, CiteScore: 4)
Bioscience Horizons : The National Undergraduate Research J.     Open Access   (Followers: 1, SJR: 0.146, CiteScore: 0)
Biostatistics     Hybrid Journal   (Followers: 17, SJR: 1.553, CiteScore: 2)
BJA : British J. of Anaesthesia     Hybrid Journal   (Followers: 168, SJR: 2.115, CiteScore: 3)
BJA Education     Hybrid Journal   (Followers: 64)
Brain     Hybrid Journal   (Followers: 68, SJR: 5.858, CiteScore: 7)
Briefings in Bioinformatics     Hybrid Journal   (Followers: 49, SJR: 2.505, CiteScore: 5)
Briefings in Functional Genomics     Hybrid Journal   (Followers: 3, SJR: 2.15, CiteScore: 3)
British J. for the Philosophy of Science     Hybrid Journal   (Followers: 35, SJR: 2.161, CiteScore: 2)
British J. of Aesthetics     Hybrid Journal   (Followers: 26, SJR: 0.508, CiteScore: 1)
British J. of Criminology     Hybrid Journal   (Followers: 585, SJR: 1.828, CiteScore: 3)
British J. of Social Work     Hybrid Journal   (Followers: 88, SJR: 1.019, CiteScore: 2)
British Medical Bulletin     Hybrid Journal   (Followers: 7, SJR: 1.355, CiteScore: 3)
British Yearbook of Intl. Law     Hybrid Journal   (Followers: 32)
Bulletin of the London Mathematical Society     Hybrid Journal   (Followers: 4, SJR: 1.376, CiteScore: 1)
Cambridge J. of Economics     Hybrid Journal   (Followers: 62, SJR: 0.764, CiteScore: 2)
Cambridge J. of Regions, Economy and Society     Hybrid Journal   (Followers: 11, SJR: 2.438, CiteScore: 4)
Cambridge Quarterly     Hybrid Journal   (Followers: 9, SJR: 0.104, CiteScore: 0)
Capital Markets Law J.     Hybrid Journal   (Followers: 2, SJR: 0.222, CiteScore: 0)
Carcinogenesis     Hybrid Journal   (Followers: 2, SJR: 2.135, CiteScore: 5)
Cardiovascular Research     Hybrid Journal   (Followers: 14, SJR: 3.002, CiteScore: 5)
Cerebral Cortex     Hybrid Journal   (Followers: 45, SJR: 3.892, CiteScore: 6)
CESifo Economic Studies     Hybrid Journal   (Followers: 17, SJR: 0.483, CiteScore: 1)
Chemical Senses     Hybrid Journal   (Followers: 1, SJR: 1.42, CiteScore: 3)
Children and Schools     Hybrid Journal   (Followers: 5, SJR: 0.246, CiteScore: 0)
Chinese J. of Comparative Law     Hybrid Journal   (Followers: 4, SJR: 0.412, CiteScore: 0)
Chinese J. of Intl. Law     Hybrid Journal   (Followers: 23, SJR: 0.329, CiteScore: 0)
Chinese J. of Intl. Politics     Hybrid Journal   (Followers: 9, SJR: 1.392, CiteScore: 2)
Christian Bioethics: Non-Ecumenical Studies in Medical Morality     Hybrid Journal   (Followers: 10, SJR: 0.183, CiteScore: 0)
Classical Receptions J.     Hybrid Journal   (Followers: 26, SJR: 0.123, CiteScore: 0)
Clean Energy     Open Access   (Followers: 1)
Clinical Infectious Diseases     Hybrid Journal   (Followers: 65, SJR: 5.051, CiteScore: 5)
Clinical Kidney J.     Open Access   (Followers: 3, SJR: 1.163, CiteScore: 2)
Communication Theory     Hybrid Journal   (Followers: 22, SJR: 2.424, CiteScore: 3)
Communication, Culture & Critique     Hybrid Journal   (Followers: 26, SJR: 0.222, CiteScore: 1)
Community Development J.     Hybrid Journal   (Followers: 27, SJR: 0.268, CiteScore: 1)
Computer J.     Hybrid Journal   (Followers: 9, SJR: 0.319, CiteScore: 1)
Conservation Physiology     Open Access   (Followers: 2, SJR: 1.818, CiteScore: 3)
Contemporary Women's Writing     Hybrid Journal   (Followers: 9, SJR: 0.121, CiteScore: 0)
Contributions to Political Economy     Hybrid Journal   (Followers: 5, SJR: 0.906, CiteScore: 1)
Critical Values     Full-text available via subscription  
Current Developments in Nutrition     Open Access   (Followers: 1)
Current Legal Problems     Hybrid Journal   (Followers: 27)
Current Zoology     Full-text available via subscription   (Followers: 2, SJR: 1.164, CiteScore: 2)
Database : The J. of Biological Databases and Curation     Open Access   (Followers: 8, SJR: 1.791, CiteScore: 3)
Digital Scholarship in the Humanities     Hybrid Journal   (Followers: 14, SJR: 0.259, CiteScore: 1)
Diplomatic History     Hybrid Journal   (Followers: 20, SJR: 0.45, CiteScore: 1)
DNA Research     Open Access   (Followers: 5, SJR: 2.866, CiteScore: 6)
Dynamics and Statistics of the Climate System     Open Access   (Followers: 4)
Early Music     Hybrid Journal   (Followers: 15, SJR: 0.139, CiteScore: 0)
Economic Policy     Hybrid Journal   (Followers: 39, SJR: 3.584, CiteScore: 3)
ELT J.     Hybrid Journal   (Followers: 24, SJR: 0.942, CiteScore: 1)
English Historical Review     Hybrid Journal   (Followers: 52, SJR: 0.612, CiteScore: 1)
English: J. of the English Association     Hybrid Journal   (Followers: 14, SJR: 0.1, CiteScore: 0)
Environmental Entomology     Full-text available via subscription   (Followers: 11, SJR: 0.818, CiteScore: 2)
Environmental Epigenetics     Open Access   (Followers: 3)
Environmental History     Hybrid Journal   (Followers: 27, SJR: 0.408, CiteScore: 1)
EP-Europace     Hybrid Journal   (Followers: 2, SJR: 2.748, CiteScore: 4)
Epidemiologic Reviews     Hybrid Journal   (Followers: 9, SJR: 4.505, CiteScore: 8)
ESHRE Monographs     Hybrid Journal  
Essays in Criticism     Hybrid Journal   (Followers: 17, SJR: 0.113, CiteScore: 0)
European Heart J.     Hybrid Journal   (Followers: 57, SJR: 9.315, CiteScore: 9)
European Heart J. - Cardiovascular Imaging     Hybrid Journal   (Followers: 9, SJR: 3.625, CiteScore: 3)
European Heart J. - Cardiovascular Pharmacotherapy     Full-text available via subscription   (Followers: 1)
European Heart J. - Quality of Care and Clinical Outcomes     Hybrid Journal  
European Heart J. : Case Reports     Open Access  
European Heart J. Supplements     Hybrid Journal   (Followers: 8, SJR: 0.223, CiteScore: 0)
European J. of Cardio-Thoracic Surgery     Hybrid Journal   (Followers: 9, SJR: 1.681, CiteScore: 2)
European J. of Intl. Law     Hybrid Journal   (Followers: 186, SJR: 0.694, CiteScore: 1)
European J. of Orthodontics     Hybrid Journal   (Followers: 4, SJR: 1.279, CiteScore: 2)
European J. of Public Health     Hybrid Journal   (Followers: 20, SJR: 1.36, CiteScore: 2)
European Review of Agricultural Economics     Hybrid Journal   (Followers: 10, SJR: 1.172, CiteScore: 2)
European Review of Economic History     Hybrid Journal   (Followers: 29, SJR: 0.702, CiteScore: 1)
European Sociological Review     Hybrid Journal   (Followers: 40, SJR: 2.728, CiteScore: 3)
Evolution, Medicine, and Public Health     Open Access   (Followers: 11)
Family Practice     Hybrid Journal   (Followers: 15, SJR: 1.018, CiteScore: 2)
Fems Microbiology Ecology     Hybrid Journal   (Followers: 12, SJR: 1.492, CiteScore: 4)
Fems Microbiology Letters     Hybrid Journal   (Followers: 24, SJR: 0.79, CiteScore: 2)
Fems Microbiology Reviews     Hybrid Journal   (Followers: 30, SJR: 7.063, CiteScore: 13)
Fems Yeast Research     Hybrid Journal   (Followers: 14, SJR: 1.308, CiteScore: 3)
Food Quality and Safety     Open Access   (Followers: 1)
Foreign Policy Analysis     Hybrid Journal   (Followers: 23, SJR: 1.425, CiteScore: 1)
Forest Science     Hybrid Journal   (Followers: 7, SJR: 0.89, CiteScore: 2)
Forestry: An Intl. J. of Forest Research     Hybrid Journal   (Followers: 16, SJR: 1.133, CiteScore: 3)
Forum for Modern Language Studies     Hybrid Journal   (Followers: 6, SJR: 0.104, CiteScore: 0)
French History     Hybrid Journal   (Followers: 33, SJR: 0.118, CiteScore: 0)
French Studies     Hybrid Journal   (Followers: 20, SJR: 0.148, CiteScore: 0)
French Studies Bulletin     Hybrid Journal   (Followers: 10, SJR: 0.152, CiteScore: 0)
Gastroenterology Report     Open Access   (Followers: 2)
Genome Biology and Evolution     Open Access   (Followers: 12, SJR: 2.578, CiteScore: 4)
Geophysical J. Intl.     Hybrid Journal   (Followers: 35, SJR: 1.506, CiteScore: 3)
German History     Hybrid Journal   (Followers: 22, SJR: 0.161, CiteScore: 0)
GigaScience     Open Access   (Followers: 4, SJR: 5.022, CiteScore: 7)
Global Summitry     Hybrid Journal   (Followers: 1)
Glycobiology     Hybrid Journal   (Followers: 14, SJR: 1.493, CiteScore: 3)
Health and Social Work     Hybrid Journal   (Followers: 56, SJR: 0.388, CiteScore: 1)
Health Education Research     Hybrid Journal   (Followers: 15, SJR: 0.854, CiteScore: 2)
Health Policy and Planning     Hybrid Journal   (Followers: 24, SJR: 1.512, CiteScore: 2)
Health Promotion Intl.     Hybrid Journal   (Followers: 22, SJR: 0.812, CiteScore: 2)
History Workshop J.     Hybrid Journal   (Followers: 31, SJR: 1.278, CiteScore: 1)
Holocaust and Genocide Studies     Hybrid Journal   (Followers: 28, SJR: 0.105, CiteScore: 0)
Human Communication Research     Hybrid Journal   (Followers: 13, SJR: 2.146, CiteScore: 3)
Human Molecular Genetics     Hybrid Journal   (Followers: 8, SJR: 3.555, CiteScore: 5)
Human Reproduction     Hybrid Journal   (Followers: 71, SJR: 2.643, CiteScore: 5)
Human Reproduction Open     Open Access  
Human Reproduction Update     Hybrid Journal   (Followers: 20, SJR: 5.317, CiteScore: 10)
Human Rights Law Review     Hybrid Journal   (Followers: 56, SJR: 0.756, CiteScore: 1)
ICES J. of Marine Science: J. du Conseil     Hybrid Journal   (Followers: 52, SJR: 1.591, CiteScore: 3)
ICSID Review     Hybrid Journal   (Followers: 10)
ILAR J.     Hybrid Journal   (Followers: 2, SJR: 1.732, CiteScore: 4)
IMA J. of Applied Mathematics     Hybrid Journal   (SJR: 0.679, CiteScore: 1)
IMA J. of Management Mathematics     Hybrid Journal   (SJR: 0.538, CiteScore: 1)
IMA J. of Mathematical Control and Information     Hybrid Journal   (Followers: 2, SJR: 0.496, CiteScore: 1)
IMA J. of Numerical Analysis - advance access     Hybrid Journal   (SJR: 1.987, CiteScore: 2)
Industrial and Corporate Change     Hybrid Journal   (Followers: 10, SJR: 1.792, CiteScore: 2)
Industrial Law J.     Hybrid Journal   (Followers: 35, SJR: 0.249, CiteScore: 1)
Inflammatory Bowel Diseases     Hybrid Journal   (Followers: 44, SJR: 2.511, CiteScore: 4)
Information and Inference     Free  
Integrative and Comparative Biology     Hybrid Journal   (Followers: 8, SJR: 1.319, CiteScore: 2)
Interacting with Computers     Hybrid Journal   (Followers: 11, SJR: 0.292, CiteScore: 1)
Interactive CardioVascular and Thoracic Surgery     Hybrid Journal   (Followers: 7, SJR: 0.762, CiteScore: 1)
Intl. Affairs     Hybrid Journal   (Followers: 60, SJR: 1.505, CiteScore: 3)
Intl. Data Privacy Law     Hybrid Journal   (Followers: 25)
Intl. Health     Hybrid Journal   (Followers: 6, SJR: 0.851, CiteScore: 2)
Intl. Immunology     Hybrid Journal   (Followers: 3, SJR: 2.167, CiteScore: 4)
Intl. J. for Quality in Health Care     Hybrid Journal   (Followers: 37, SJR: 1.348, CiteScore: 2)
Intl. J. of Constitutional Law     Hybrid Journal   (Followers: 63, SJR: 0.601, CiteScore: 1)
Intl. J. of Epidemiology     Hybrid Journal   (Followers: 227, SJR: 3.969, CiteScore: 5)
Intl. J. of Law and Information Technology     Hybrid Journal   (Followers: 5, SJR: 0.202, CiteScore: 1)
Intl. J. of Law, Policy and the Family     Hybrid Journal   (Followers: 26, SJR: 0.223, CiteScore: 1)
Intl. J. of Lexicography     Hybrid Journal   (Followers: 9, SJR: 0.285, CiteScore: 1)
Intl. J. of Low-Carbon Technologies     Open Access   (Followers: 1, SJR: 0.403, CiteScore: 1)
Intl. J. of Neuropsychopharmacology     Open Access   (Followers: 3, SJR: 1.808, CiteScore: 4)
Intl. J. of Public Opinion Research     Hybrid Journal   (Followers: 9, SJR: 1.545, CiteScore: 1)
Intl. J. of Refugee Law     Hybrid Journal   (Followers: 35, SJR: 0.389, CiteScore: 1)
Intl. J. of Transitional Justice     Hybrid Journal   (Followers: 11, SJR: 0.724, CiteScore: 2)
Intl. Mathematics Research Notices     Hybrid Journal   (Followers: 1, SJR: 2.168, CiteScore: 1)
Intl. Political Sociology     Hybrid Journal   (Followers: 37, SJR: 1.465, CiteScore: 3)
Intl. Relations of the Asia-Pacific     Hybrid Journal   (Followers: 23, SJR: 0.401, CiteScore: 1)
Intl. Studies Perspectives     Hybrid Journal   (Followers: 9, SJR: 0.983, CiteScore: 1)
Intl. Studies Quarterly     Hybrid Journal   (Followers: 45, SJR: 2.581, CiteScore: 2)
Intl. Studies Review     Hybrid Journal   (Followers: 23, SJR: 1.201, CiteScore: 1)
ISLE: Interdisciplinary Studies in Literature and Environment     Hybrid Journal   (Followers: 2, SJR: 0.15, CiteScore: 0)
ITNOW     Hybrid Journal   (Followers: 1, SJR: 0.103, CiteScore: 0)
J. of African Economies     Hybrid Journal   (Followers: 15, SJR: 0.533, CiteScore: 1)
J. of American History     Hybrid Journal   (Followers: 46, SJR: 0.297, CiteScore: 1)
J. of Analytical Toxicology     Hybrid Journal   (Followers: 14, SJR: 1.065, CiteScore: 2)
J. of Antimicrobial Chemotherapy     Hybrid Journal   (Followers: 15, SJR: 2.419, CiteScore: 4)
J. of Antitrust Enforcement     Hybrid Journal   (Followers: 1)
J. of Applied Poultry Research     Hybrid Journal   (Followers: 4, SJR: 0.585, CiteScore: 1)
J. of Biochemistry     Hybrid Journal   (Followers: 40, SJR: 1.226, CiteScore: 2)
J. of Burn Care & Research     Hybrid Journal   (Followers: 9, SJR: 0.768, CiteScore: 2)
J. of Chromatographic Science     Hybrid Journal   (Followers: 18, SJR: 0.36, CiteScore: 1)
J. of Church and State     Hybrid Journal   (Followers: 11, SJR: 0.139, CiteScore: 0)
J. of Communication     Hybrid Journal   (Followers: 53, SJR: 4.411, CiteScore: 5)
J. of Competition Law and Economics     Hybrid Journal   (Followers: 35, SJR: 0.33, CiteScore: 0)
J. of Complex Networks     Hybrid Journal   (Followers: 2, SJR: 1.05, CiteScore: 4)
J. of Computer-Mediated Communication     Open Access   (Followers: 26, SJR: 2.961, CiteScore: 6)
J. of Conflict and Security Law     Hybrid Journal   (Followers: 12, SJR: 0.402, CiteScore: 0)
J. of Consumer Research     Full-text available via subscription   (Followers: 43, SJR: 5.856, CiteScore: 5)


Bioinformatics
Journal Prestige (SJR): 6.14
Citation Impact (citeScore): 8
Number of Followers: 303  
  Hybrid Journal (It can contain Open Access articles)
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press  [396 journals]
  • Branch-recombinant Gaussian processes for analysis of perturbations in biological time series
    • Authors: Penfold C; Sybirna A, Reid J, et al.
      Abstract: Motivation: A common class of behaviour encountered in the biological sciences involves branching and recombination. During branching, a statistical process bifurcates, resulting in two or more potentially correlated processes that may undergo further branching; the contrary is true during recombination, where two or more statistical processes converge. A key objective is to identify the time of this bifurcation (branch or recombination time) from time series measurements, e.g. by comparing a control time series with perturbed time series. Gaussian processes (GPs) represent an ideal framework for such analysis, allowing for nonlinear regression that includes a rigorous treatment of uncertainty. Currently, however, GP models only exist for two-branch systems. Here, we highlight how arbitrarily complex branching processes can be built using the correct composition of covariance functions within a GP framework, thus outlining a general framework for the treatment of branching and recombination in the form of branch-recombinant Gaussian processes (B-RGPs). Results: We first benchmark the performance of B-RGPs compared to a variety of existing regression approaches, and demonstrate robustness to model misspecification. B-RGPs are then used to investigate the branching patterns of Arabidopsis thaliana gene expression following inoculation with the hemibiotrophic bacteria, Pseudomonas syringae DC3000, and a disarmed mutant strain, hrpA. By grouping genes according to the number of branches, we could naturally separate out genes involved in basal immune response from those subverted by the virulent strain, and show enrichment for targets of pathogen protein effectors. Finally, we identify two early branching genes, WRKY11 and WRKY17, and show that genes that branched at similar times to WRKY11/17 were enriched for W-box binding motifs, and overrepresented for genes differentially expressed in WRKY11/17 knockouts, suggesting that branch time could be used for identifying direct and indirect binding targets of key transcription factors. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty603
      Issue No: Vol. 34, No. 17 (2018)
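
The abstract above describes building branching processes by composing covariance functions. As a rough illustration only, and not the authors' B-RGP kernel, the sketch below assumes a control series f(t) with a squared-exponential covariance and a perturbed series g(t) = f(t) + h(t), where the hypothetical deviation process h is forced to zero up to an assumed branch time by a ramp factor. All function names, the ramp construction and the hyperparameters are assumptions made for this example.

```python
import math

def rbf(s, t, length=1.0, var=1.0):
    """Squared-exponential covariance between time points s and t."""
    return var * math.exp(-0.5 * ((s - t) / length) ** 2)

def branch_kernel(s, t, t_branch, length=1.0, var=1.0):
    """Covariance of the deviation process h.

    The ramp factors are zero up to t_branch, so h is exactly zero before
    the branch and grows continuously after it. A product of a rank-one
    kernel and an RBF kernel is still a valid covariance function.
    """
    ramp_s = max(s - t_branch, 0.0)
    ramp_t = max(t - t_branch, 0.0)
    return ramp_s * ramp_t * rbf(s, t, length, var)

def joint_cov(times, t_branch):
    """Joint covariance of [f(times), g(times)] with g = f + h, h independent of f."""
    n = len(times)
    K = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i, s in enumerate(times):
        for j, t in enumerate(times):
            shared = rbf(s, t)
            K[i][j] = shared          # control vs control
            K[i][n + j] = shared      # control vs perturbed: cov(f, f + h) = k_f
            K[n + i][j] = shared      # perturbed vs control
            K[n + i][n + j] = shared + branch_kernel(s, t, t_branch)
    return K

times = [0.0, 1.0, 2.0, 3.0, 4.0]
K = joint_cov(times, t_branch=2.0)
```

Before the assumed branch time the perturbed block of K equals the control block, so the two series are perfectly correlated there; after it the perturbed series gains extra variance. In a full analysis the branch time would be treated as a hyperparameter and inferred from the data.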
  • ECCB 2018: The 17th European Conference on Computational Biology
    • Abstract: This volume of Bioinformatics includes the proceedings papers of the 17th European Conference on Computational Biology (ECCB), an annual international conference for research in computational biology and bioinformatics.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty740
      Issue No: Vol. 34, No. 17 (2018)
  • ECCB 2018 Organization
    • PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty739
      Issue No: Vol. 34, No. 17 (2018)
  • Conditional generative adversarial network for gene expression inference
    • Authors: Wang X; Ghasedi Dizaji K, Huang H.
      Abstract: Motivation: The rapid progress of gene expression profiling has facilitated the prosperity of recent biological studies in various fields, where gene expression data characterizes various cell conditions and regulatory mechanisms under different experimental circumstances. Despite the widespread application of gene expression profiling and advances in high-throughput technologies, profiling at the genome-wide level is still expensive and difficult. Previous studies found that high correlation exists in the expression pattern of different genes, such that a small subset of genes can be informative enough to approximately describe the entire transcriptome. In the Library of Integrated Network-based Cell-Signature program, a set of ∼1000 landmark genes has been identified that contains ∼80% of the information of the whole genome and can be used to predict the expression of the remaining genes. For a cost-effective profiling strategy, traditional methods measure the profiles of landmark genes and then infer the expression of other target genes via linear models. However, linear models do not have the capacity to capture the non-linear associations in gene regulatory networks. Results: As flexible models with high representative power, deep learning models provide an alternative for interpreting the complex relations among genes. In this paper, we propose a deep learning architecture for the inference of target gene expression profiles. We construct a novel conditional generative adversarial network by incorporating both the adversarial and ℓ1-norm loss terms in our model. Unlike the smooth and blurry predictions resulting from a mean squared error objective, the coupled adversarial and ℓ1-norm loss function leads to more accurate and sharp predictions. We validate our method under two different settings and find consistent and significant improvements over all the compared methods.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty563
      Issue No: Vol. 34, No. 17 (2018)
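
The abstract above states only that the generator objective couples an adversarial term with an ℓ1-norm term. A minimal sketch of such a coupled loss is given below; the non-saturating adversarial form, the weighting parameter `l1_weight` and all function names are assumptions for illustration, not details taken from the paper.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error between predicted and measured expression values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def generator_loss(disc_score, pred, target, l1_weight=1.0):
    """Adversarial term plus weighted l1 term for one sample.

    disc_score is the discriminator's probability (in (0, 1]) that the
    generated target-gene profile is real, given the landmark genes.
    """
    eps = 1e-12  # guards against log(0)
    adversarial = -math.log(disc_score + eps)
    return adversarial + l1_weight * l1_loss(pred, target)

# Hypothetical values: the discriminator is undecided, predictions are close.
loss = generator_loss(0.5, pred=[1.0, 2.0], target=[1.0, 4.0], l1_weight=2.0)
```

The ℓ1 term penalises large errors less heavily than a squared error would, which is one common explanation for why it tends to produce less averaged-out predictions.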
  • Prioritising candidate genes causing QTL using hierarchical orthologous
    • Authors: Warwick Vesztrocy A; Dessimoz C, Redestig H.
      Abstract: Motivation: A key goal in plant biotechnology applications is the identification of genes associated with particular phenotypic traits (for example: yield, fruit size, root length). Quantitative Trait Loci (QTL) studies identify genomic regions associated with a trait of interest. However, to infer potential causal genes in these regions, each of which can contain hundreds of genes, these data are usually intersected with prior functional knowledge of the genes. This process is, however, laborious, particularly if the experiment is performed in a non-model species, and the statistical significance of the inferred candidates is typically unknown. Results: This paper introduces QTLSearch, a method and software tool to search for candidate causal genes in QTL studies by combining Gene Ontology annotations across many species, leveraging hierarchical orthologous groups. The usefulness of this approach is demonstrated by re-analysing two metabolic QTL studies: one in Arabidopsis thaliana, the other in Oryza sativa subsp. indica. Even after controlling for statistical significance, QTLSearch inferred potential causal genes for more QTL than BLAST-based functional propagation against UniProtKB/Swiss-Prot, and for more QTL than in the original studies. Availability and implementation: QTLSearch is distributed under the LGPLv3 license. It is available to install from the Python Package Index (as qtlsearch), with the source available from [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty615
      Issue No: Vol. 34, No. 17 (2018)
  • IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection
    • Authors: Platon L; Zehraoui F, Bendahmane A, et al.
      Abstract: Motivation: Non-coding RNAs (ncRNAs) play important roles in many biological processes and are involved in many diseases. Their identification is an important task, and many tools exist in the literature for this purpose. However, almost all of them are focused on the discrimination of coding and ncRNAs without giving more biological insight. In this paper, we propose a new reliable method called IRSOM, based on a supervised Self-Organizing Map (SOM) with a rejection option, that overcomes these limitations. The rejection option in IRSOM improves the accuracy of the method and also allows the identification of ambiguous transcripts. Furthermore, with the visualization of the SOM, we analyze the rejected predictions and highlight the ambiguity of the transcripts. Results: IRSOM was tested on datasets of several species from different kingdoms, and showed better results than the state of the art. The accuracy of IRSOM is always greater than 0.95 for all the species, with an average specificity of 0.98 and an average sensitivity of 0.99. Besides, IRSOM is fast (it takes around 254 s to analyze a dataset of 147 000 transcripts) and is able to handle very large datasets. Availability and implementation: IRSOM is implemented in Python and C++. It is available on our software platform EvryRNA (
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty572
      Issue No: Vol. 34, No. 17 (2018)
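
The abstract above relies on a rejection option: the classifier may decline to label a transcript it finds ambiguous. IRSOM's actual rejection rule is tied to its supervised SOM and is not described here, so the sketch below swaps in a generic fixed confidence threshold purely to illustrate the idea. The function name, the threshold value and the label strings are all assumptions.

```python
def classify_with_rejection(prob_coding, threshold=0.8):
    """Label a transcript, or reject it when the classifier is not confident.

    prob_coding is a hypothetical classifier output: the estimated
    probability that the transcript is coding.
    """
    confidence = max(prob_coding, 1.0 - prob_coding)
    if confidence < threshold:
        return "rejected"  # ambiguous transcript, set aside for inspection
    return "coding" if prob_coding >= 0.5 else "non-coding"

# Hypothetical classifier outputs for four transcripts.
labels = [classify_with_rejection(p) for p in (0.95, 0.10, 0.60, 0.45)]
```

Accuracy measured only on the accepted transcripts is typically higher than accuracy over all of them, at the cost of leaving some transcripts unlabelled.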
  • Discovering epistatic feature interactions from neural network models of regulatory DNA sequences
    • Authors: Greenside P; Shimko T, Fordyce P, et al.
      Abstract: Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation: Code is available at: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty575
      Issue No: Vol. 34, No. 17 (2018)
  • A deep neural network approach for learning intrinsic protein-RNA binding
    • Authors: Ben-Bassat I; Chor B, Orenstein Y.
      Abstract: Motivation: The complexes formed by binding of proteins to RNAs play key roles in many biological processes, such as splicing, gene expression regulation, translation and viral replication. Understanding protein-RNA binding may thus provide important insights into the functionality and dynamics of many cellular processes. This has sparked substantial interest in exploring protein-RNA binding experimentally, and predicting it computationally. The key computational challenge is to efficiently and accurately infer protein-RNA binding models that will enable prediction of novel protein-RNA interactions with additional transcripts of interest. Results: We developed DLPRB (Deep Learning for Protein-RNA Binding), a new deep neural network (DNN) approach for learning intrinsic protein-RNA binding preferences and predicting novel interactions. We present two different network architectures: a convolutional neural network (CNN), and a recurrent neural network (RNN). The novelty of our network hinges upon two key aspects: (i) the joint analysis of both RNA sequence and structure, which is represented as a probability vector of different RNA structural contexts; (ii) novel features in the architecture of the networks, such as the application of RNNs to RNA-binding prediction, and the combination of hundreds of variable-length filters in the CNN. Our results in inferring accurate RNA-binding models from high-throughput in vitro data exhibit substantial improvements compared to all previous approaches for protein-RNA binding prediction (both DNN and non-DNN based). A more modest, yet statistically significant, improvement is achieved for in vivo binding prediction. When incorporating experimentally measured RNA structure, rather than predicted structure, the improvement on in vivo data increases. By visualizing the binding specificities, we can gain biological insights into the mechanism of protein-RNA binding. Availability and implementation: The source code is publicly available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty600
      Issue No: Vol. 34, No. 17 (2018)
  • Bayesian inference on stochastic gene transcription from flow cytometry
    • Authors: Tiberi S; Walsh M, Cavallaro M, et al.
      Abstract: Motivation: Transcription in single cells is an inherently stochastic process as mRNA levels vary greatly between cells, even for genetically identical cells under the same experimental and environmental conditions. We present a stochastic two-state switch model for the population of mRNA molecules in single cells where genes stochastically alternate between a more active ON state and a less active OFF state. We prove that the stationary solution of such a model can be written as a mixture of a Poisson and a Poisson-beta probability distribution. This finding facilitates inference for single cell expression data, observed at a single time point, from flow cytometry experiments such as FACS or fluorescence in situ hybridization (FISH) as it allows one to sample directly from the equilibrium distribution of the mRNA population. We hence propose a Bayesian inferential methodology using a pseudo-marginal approach and a recent approximation to integrate over unobserved states associated with measurement error. Results: We provide a general inferential framework which can be widely used to study transcription in single cells from the kind of data arising in flow cytometry experiments. The approach allows us to separate between the intrinsic stochasticity of the molecular dynamics and the measurement noise. The methodology is tested in simulation studies and results are obtained for experimental multiple single cell expression data from FISH flow cytometry experiments. Availability and implementation: All analyses were implemented in R. Source code and the experimental data are available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty568
      Issue No: Vol. 34, No. 17 (2018)
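The abstract above states that the stationary mRNA distribution is a mixture of a Poisson and a Poisson-beta distribution, which allows direct sampling from the equilibrium. A minimal sketch of such a sampler follows. The parameter names (`weight_poisson`, `rate_off`, `rate_on`, `alpha`, `beta`) and the way the two components are parameterized are assumptions made for illustration; the listing does not give the paper's parameterization, and the authors' own implementation is in R.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_mixture(weight_poisson, rate_off, rate_on, alpha, beta, rng):
    """Draw one count from a Poisson / Poisson-beta mixture.

    With probability `weight_poisson` the count is Poisson(rate_off).
    Otherwise a fraction p ~ Beta(alpha, beta) is drawn first and the
    count is Poisson(rate_on * p), which is a Poisson-beta variate.
    All parameter names are hypothetical.
    """
    if rng.random() < weight_poisson:
        return sample_poisson(rate_off, rng)
    fraction_on = rng.betavariate(alpha, beta)
    return sample_poisson(rate_on * fraction_on, rng)


def simulate_counts(n_cells, weight_poisson, rate_off, rate_on, alpha, beta, seed=0):
    """Simulate mRNA counts for `n_cells` independent cells."""
    rng = random.Random(seed)
    return [
        sample_mixture(weight_poisson, rate_off, rate_on, alpha, beta, rng)
        for _ in range(n_cells)
    ]
```

Under these assumptions the mixture mean is `weight_poisson * rate_off + (1 - weight_poisson) * rate_on * alpha / (alpha + beta)`, which gives a quick sanity check on simulated counts.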
  • Off-target predictions in CRISPR-Cas9 gene editing using deep learning
    • Authors: Lin J; Wong K.
      Abstract: Motivation: The prediction of off-target mutations in CRISPR-Cas9 is a hot topic due to its relevance to gene editing research. Existing prediction methods have been developed; however, most of them just calculated scores based on mismatches to the guide sequence in CRISPR-Cas9. Therefore, the existing prediction methods are unable to scale and improve their performance with the rapid expansion of experimental data in CRISPR-Cas9. Moreover, the existing methods still cannot achieve sufficient precision in off-target predictions for gene editing at the clinical level. Results: To address this, we design and implement two algorithms using deep neural networks to predict off-target mutations in CRISPR-Cas9 gene editing (i.e. deep convolutional neural network and deep feedforward neural network). The models were trained and tested on the recently released off-target dataset, CRISPOR dataset, for performance benchmark. Another off-target dataset identified by GUIDE-seq was adopted for additional evaluation. We demonstrate that the convolutional neural network achieves the best performance on the CRISPOR dataset, yielding an average classification area under the ROC curve (AUC) of 97.2% under stratified 5-fold cross-validation. Interestingly, the deep feedforward neural network can also be competitive at the average AUC of 97.0% under the same setting. We compare the two deep neural network models with the state-of-the-art off-target prediction methods (i.e. CFD, MIT, CROP-IT, and CCTop) and three traditional machine learning models (i.e. random forest, gradient boosting trees, and logistic regression) on both datasets in terms of AUC values, demonstrating the competitive edges of the proposed algorithms. Additional analyses are conducted to investigate the underlying reasons from different perspectives. Availability and implementation: The example code is available at [link missing]. The related datasets are available at [link missing].
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty554
      Issue No: Vol. 34, No. 17 (2018)
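The models in the abstract above are compared by area under the ROC curve (AUC). As a reminder of what that number measures, here is a small sketch that computes AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is illustrative and this is not the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `scores` are predicted scores, `labels` are 1 (off-target) or 0.
    Runs in O(P * N) time, which is fine for small illustrations.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

In a stratified 5-fold setting such as the one described, this value would be computed on each held-out fold and then averaged.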
  • CisPi: a transcriptomic score for disclosing cis-acting disease-associated
    • Authors: Wang Z; Cunningham J, Yang X.
      Abstract: Motivation: Long intergenic noncoding RNAs (lincRNAs) have risen to prominence in cancer biology as new biomarkers of disease. Those lincRNAs transcribed from active cis-regulatory elements (enhancers) have provided mechanistic insight into cis-acting regulation; however, in the absence of an enhancer hallmark, computational prediction of cis-acting transcription of lincRNAs remains challenging. Here, we introduce a novel transcriptomic method: a cis-regulatory lincRNA–gene associating metric, termed ‘CisPi’. CisPi quantifies the mutual information between lincRNAs and local gene expression regarding their response to perturbation, such as disease risk-dependence. To predict risk-dependent lincRNAs in neuroblastoma, an aggressive pediatric cancer, we advance this scoring scheme to measure lincRNAs that represent the minority of reads in RNA-Seq libraries by a novel side-by-side analytical pipeline. Results: Altered expression of lincRNAs that stratifies tumor risk is an informative readout of oncogenic enhancer activity. Our CisPi metric therefore provides a powerful computational model to identify enhancer-templated RNAs (eRNAs), eRNA-like lincRNAs, or active enhancers that regulate the expression of local genes. First, risk-dependent lincRNAs revealed active enhancers, over-represented neuroblastoma susceptibility loci, and uncovered novel clinical biomarkers. Second, the prioritized lincRNAs were significantly prognostic. Third, the predicted target genes further inherited the prognostic significance of these lincRNAs. In sum, RNA-Seq alone is sufficient to identify disease-associated lincRNAs using our methodologies, allowing broader applications to contexts in which enhancer hallmarks are not available or show limited sensitivity. Availability and implementation: The source code is available on request. The prioritized lincRNAs and their target genes are in the Supplementary Material. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty574
      Issue No: Vol. 34, No. 17 (2018)
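The CisPi abstract above describes a score built on mutual information between lincRNA and local gene expression. The listing does not give the CisPi formula, so the sketch below only shows the generic quantity: mutual information (in bits) between two discrete variables, for example expression levels that have already been binned into "low" and "high". The function name and the discretization step are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences.

    Uses the plug-in estimate I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) p(y))),
    with probabilities taken as empirical frequencies.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        total += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return total
```

Two perfectly co-varying binary profiles give 1 bit, and two independent balanced profiles give 0 bits.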
  • SPhyR: tumor phylogeny estimation from single-cell sequencing data under
           loss and error
    • Authors: El-Kebir M.
      Abstract: Motivation: Cancer is characterized by intra-tumor heterogeneity, the presence of distinct cell populations with distinct complements of somatic mutations, which include single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). Single-cell sequencing technology enables one to study these cell populations at single-cell resolution. Phylogeny estimation algorithms that employ appropriate evolutionary models are key to understanding the evolutionary mechanisms behind intra-tumor heterogeneity. Results: We introduce Single-cell Phylogeny Reconstruction (SPhyR), a method for tumor phylogeny estimation from single-cell sequencing data. In light of frequent loss of SNVs due to CNAs in cancer, SPhyR employs the k-Dollo evolutionary model, where a mutation can only be gained once but lost k times. Underlying SPhyR is a novel combinatorial characterization of solutions as constrained integer matrix completions, based on a connection to the cladistic multi-state perfect phylogeny problem. SPhyR outperforms existing methods on simulated data and on a metastatic colorectal cancer. Availability and implementation: SPhyR is available on [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty589
      Issue No: Vol. 34, No. 17 (2018)
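The k-Dollo model named in the SPhyR abstract above allows a mutation to be gained once and lost at most k times. The sketch below checks that condition for a single mutation on a fully labeled rooted tree. It is not SPhyR's matrix-completion algorithm; the tree encoding (a `parent` dictionary with the root mapped to `None`) and the assumption that the mutation is absent above the root are choices made for this illustration.

```python
def count_gains_and_losses(parent, state):
    """Count 0->1 (gain) and 1->0 (loss) events along the edges of a tree.

    `parent` maps each node to its parent (the root maps to None).
    `state` maps each node to 0 (mutation absent) or 1 (present).
    The mutation is assumed absent above the root.
    """
    gains = losses = 0
    for node, par in parent.items():
        before = 0 if par is None else state[par]
        after = state[node]
        if before == 0 and after == 1:
            gains += 1
        elif before == 1 and after == 0:
            losses += 1
    return gains, losses


def is_k_dollo(parent, state, k):
    """True if the mutation is gained at most once and lost at most k times."""
    gains, losses = count_gains_and_losses(parent, state)
    return gains <= 1 and losses <= k
```

For example, a mutation gained on one branch and then lost in one descendant is consistent with 1-Dollo but not with 0-Dollo (the perfect phylogeny case).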
  • S-Cluster++: a fast program for solving the cluster containment problem
           for phylogenetic networks
    • Authors: Yan H; Gunawan A, Zhang L.
      Abstract: Motivation: Comparative genomic studies indicate that extant genomes are more properly considered to be a fusion product of random mutations over generations (vertical evolution) and genomic material transfers between individuals of different lineages (reticulate transfer). This has motivated biologists to use phylogenetic networks and other general models to study genome evolution. Two fundamental algorithmic problems arising from verification of phylogenetic networks and from computing Robinson-Foulds distance in the space of phylogenetic networks are the tree and cluster containment problems. The former asks how to decide whether or not a phylogenetic tree is displayed in a phylogenetic network. The latter is to decide whether a subset of taxa appears as a cluster in some tree displayed in a phylogenetic network. The cluster containment problem (CCP) is also closely related to testing the infinite site model on a recombination network. Both the tree containment and CCP are NP-complete. Although the CCP was introduced a decade ago, there has been little progress in developing fast algorithms for it on arbitrary phylogenetic networks. Results: In this work, we present a fast computer program for the CCP. This program is developed on the basis of a linear-time transformation from the small version of the CCP to the SAT problem. Availability and implementation: The program package is available for download on∼matzlx/ccp.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty594
      Issue No: Vol. 34, No. 17 (2018)
  • Accurate and adaptive imputation of summary statistics in mixed-ethnicity
    • Authors: Togninalli M; Roqueiro D, C, et al.
      Abstract: Motivation: Methods based on summary statistics obtained from genome-wide association studies have gained considerable interest in genetics due to the computational cost and privacy advantages they present. Imputing missing summary statistics has therefore become a key procedure in many bioinformatics pipelines, but available solutions may rely on additional knowledge about the populations used in the original study and, as a result, may not always ensure feasibility or high accuracy of the imputation procedure. Results: We present ARDISS, a method to impute missing summary statistics in mixed-ethnicity cohorts through Gaussian Process Regression and automatic relevance determination. ARDISS is trained on an external reference panel and does not require information about allele frequencies of genotypes from the original study. Our method approximates the original GWAS population by a combination of samples from a reference panel relying exclusively on the summary statistics and without any external information. ARDISS successfully reconstructs the original composition of mixed-ethnicity cohorts and outperforms alternative solutions in terms of speed and imputation accuracy both for heterogeneous and homogeneous datasets. Availability and implementation: The proposed method is available at [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty596
      Issue No: Vol. 34, No. 17 (2018)
  • Towards an accurate and efficient heuristic for species/gene tree
    • Authors: Wang Y; Nakhleh L.
      Abstract: Motivation: Species and gene trees represent how species and individual loci within their genomes evolve from their most recent common ancestors. These trees are central to addressing several questions in biology relating to, among other issues, species conservation, trait evolution and gene function. Consequently, their accurate inference from genomic data is a major endeavor. One approach to their inference is to co-estimate species and gene trees from genome-wide data. Indeed, Bayesian methods based on this approach already exist. However, these methods are very slow, limiting their applicability to datasets with small numbers of taxa. The more commonly used approach is to first infer gene trees individually, and then use gene tree estimates to infer the species tree. Methods in this category rely significantly on the accuracy of the gene trees, which is often not high when the dataset includes closely related species. Results: In this work, we introduce a simple, yet effective, iterative method for co-estimating gene and species trees from sequence data of multiple, unlinked loci. In every iteration, the method estimates a species tree, uses it as a generative process to simulate a collection of gene trees, and then selects gene trees for the individual loci from among the simulated gene trees by making use of the sequence data. We demonstrate the accuracy and efficiency of our method on simulated as well as biological data, and compare them to those of existing competing methods. Availability and implementation: The method has been implemented in PhyloNet, which is publicly available at [link missing].
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty599
      Issue No: Vol. 34, No. 17 (2018)
  • Fast characterization of segmental duplications in genome assemblies
    • Authors: Numanagić I; Gökkaya A, Zhang L, et al.
      Abstract: Motivation: Segmental duplications (SDs), or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner. Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed-up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% ‘pairwise error’ between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome. Availability and implementation: SEDEF is available at [link missing].
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty586
      Issue No: Vol. 34, No. 17 (2018)
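The SEDEF abstract above mentions filtering based on Jaccard similarity. A minimal sketch of the underlying quantity, the Jaccard similarity between the k-mer sets of two sequences, is given below. The function names and the use of exact k-mer sets are assumptions for illustration; the listing does not describe how SEDEF computes or approximates this value in practice.

```python
def kmer_set(sequence, k):
    """Return the set of all length-k substrings (k-mers) of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    kmers_a = kmer_set(seq_a, k)
    kmers_b = kmer_set(seq_b, k)
    union = kmers_a | kmers_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(kmers_a & kmers_b) / len(union)
```

Pairs of regions whose similarity falls below a chosen threshold can be discarded cheaply before any more expensive alignment step.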
  • PAIPline: pathogen identification in metagenomic and clinical next
           generation sequencing samples
    • Authors: Andrusch A; Dabrowski P, Klenner J, et al.
      Abstract: Motivation: Next generation sequencing (NGS) has provided researchers with a powerful tool to characterize metagenomic and clinical samples in research and diagnostic settings. NGS allows an open view into samples useful for pathogen detection in an unbiased fashion and without prior hypothesis about possible causative agents. However, NGS datasets for pathogen detection come with different obstacles, such as a very unfavorable ratio of pathogen to host reads. In addition to frequent false positives and irrelevant organisms such as contaminants, tools are often challenged by samples with low pathogen loads and might not report organisms present below a certain threshold. Furthermore, some metagenomic profiling tools are only focused on one particular set of pathogens, for example bacteria. Results: We present PAIPline, a bioinformatics pipeline specifically designed to address problems associated with detecting pathogens in diagnostic samples. PAIPline particularly focuses on user-friendliness and encapsulates all necessary steps from preprocessing to resolution of ambiguous reads and filtering up to visualization in a single tool. In contrast to existing tools, PAIPline is more specific while maintaining sensitivity. This is shown in a comparative evaluation where PAIPline was benchmarked alongside other well-known metagenomic profiling tools on previously published well-characterized datasets. Additionally, as part of an international cooperation project, PAIPline was applied to an outbreak sample of hemorrhagic fevers of then unknown etiology. The presented results show that PAIPline can serve as a robust, reliable, user-friendly, adaptable and generalizable stand-alone software for diagnostics from NGS samples and as a stepping stone for further downstream analyses. Availability and implementation: PAIPline is freely available under [link missing].
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty595
      Issue No: Vol. 34, No. 17 (2018)
  • An accurate and rapid continuous wavelet dynamic time warping algorithm
           for end-to-end mapping in ultra-long nanopore sequencing
    • Authors: Han R; Li Y, Gao X, et al.
      Abstract: Motivation: Long-reads, point-of-care and polymerase chain reaction-free are the promises brought by nanopore sequencing. Among various steps in nanopore data analysis, the end-to-end mapping between the raw electrical current signal sequence and the reference expected signal sequence serves as the key building block to signal labeling, and the following signal visualization, variant identification and methylation detection. One of the classic algorithms to solve the signal mapping problem is dynamic time warping (DTW). However, the ultra-long nanopore sequencing and an order of magnitude difference in the sampling speed complicate the scenario and make the classical DTW infeasible for the problem. Results: Here, we propose a novel multi-level DTW algorithm, continuous wavelet DTW (cwDTW), based on continuous wavelet transforms with different scales of the two signal sequences. Our algorithm starts from low-resolution wavelet transforms of the two sequences, such that the transformed sequences are short and have similar sampling rates. Then the peaks and nadirs of the transformed sequences are extracted to form feature sequences with similar lengths, which can be easily mapped by the original DTW. Our algorithm then recursively projects the warping path from a lower-resolution level to a higher-resolution one by building a context-dependent boundary and enabling a constrained search for the warping path in the latter. Comprehensive experiments on two real nanopore datasets on human and on Pandoraea pnomenusa demonstrate the efficiency and effectiveness of the proposed algorithm. In particular, cwDTW can gain remarkable acceleration with only a tiny loss of alignment accuracy. On the real nanopore datasets, cwDTW can finish an alignment task in a few seconds, which is about 3000 times faster than the original DTW. By successfully applying cwDTW on the tasks of signal labeling and ultra-long sequence comparison, we further demonstrate the power and applicability of cwDTW. Availability and implementation: Our program is available at [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty555
      Issue No: Vol. 34, No. 17 (2018)
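For context on the cwDTW abstract above, the sketch below implements the classical dynamic time warping recurrence that the paper takes as its starting point, not cwDTW itself. It uses absolute difference as the local cost and fills the full table, so it needs time proportional to the product of the two sequence lengths, which is the cost the abstract describes as infeasible for ultra-long signals. The function name is illustrative.

```python
def dtw_distance(signal_a, signal_b):
    """Classical DTW cost between two numeric sequences.

    D[i][j] = |a_i - b_j| + min(D[i-1][j], D[i][j-1], D[i-1][j-1]),
    computed row by row so only two rows are kept in memory.
    """
    n, m = len(signal_a), len(signal_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    previous = [inf] * (m + 1)
    previous[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        current = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(signal_a[i - 1] - signal_b[j - 1])
            current[j] = cost + min(previous[j], current[j - 1], previous[j - 1])
        previous = current
    return previous[m]
```

A multi-level scheme such as the one described would run this kind of search only inside a narrow band around a path projected from a coarser resolution.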
  • Approximate, simultaneous comparison of microbial genome architectures via
           syntenic anchoring of quiver representations
    • Authors: Salazar A; Abeel T.
      Abstract: Motivation: A long-standing limitation in comparative genomic studies is the dependency on a reference genome, which narrows the spectrum of genetic diversity that can be identified across a population of organisms. This is especially true in the microbial world where genome architectures can significantly vary. There is therefore a need for computational methods that can simultaneously analyze the architectures of multiple genomes without introducing bias from a reference. Results: In this article, we present Ptolemy: a novel method for studying the diversity of genome architectures (such as structural variation and pan-genomes) across a collection of microbial assemblies without the need for a reference. Ptolemy is a ‘top-down’ approach to compare whole genome assemblies. Genomes are represented as labeled multi-directed graphs, known as quivers, which are then merged into a single, canonical quiver by identifying ‘gene anchors’ via synteny analysis. The canonical quiver represents an approximate, structural alignment of all genomes in a given collection encoding structural variation across (sub-) populations within the collection. We highlight various applications of Ptolemy by analyzing structural variation and the pan-genomes of different datasets comprising Mycobacterium, Saccharomyces, Escherichia and Shigella species. Our results show that Ptolemy is flexible and can handle both conserved and highly dynamic genome architectures. Ptolemy is user-friendly (it requires only a FASTA-formatted assembly along with a corresponding GFF-formatted file) and resource-friendly (it can align 24 genomes in ∼10 mins with four CPUs and <2 GB of RAM). Availability and implementation: GitHub: [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty614
      Issue No: Vol. 34, No. 17 (2018)
  • CNEFinder: finding conserved non-coding elements in genomes
    • Authors: Ayad L; Pissis S, Polychronopoulos D.
      Abstract: Motivation: Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. Towards this direction, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes. Results: We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool’s ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows Wheeler Transform (BWT), which makes it flexible to use on a wide scale. Availability and implementation: Free software under the terms of the GNU GPL ([link missing]).
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty601
      Issue No: Vol. 34, No. 17 (2018)
  • A fast adaptive algorithm for computing whole-genome homology maps
    • Authors: Jain C; Koren S, Dilthey A, et al.
      Abstract: Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements. Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser’s segmental duplication annotation. Availability and implementation: [link missing]
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty597
      Issue No: Vol. 34, No. 17 (2018)
  • Recognition of CRISPR/Cas9 off-target sites through ensemble learning of
           uneven mismatch distributions
    • Authors: Peng H; Zheng Y, Zhao Z, et al.
      Abstract: Motivation: CRISPR/Cas9 is driving a broad range of innovative applications from basic biology to biotechnology and medicine. One of its current issues is the effect of off-target editing, which should be critically resolved and should be completely avoided in the ideal use of this system. Results: We developed an ensemble learning method to detect the off-target sites of a single guide RNA (sgRNA) from its thousands of genome-wide candidates. Nucleotide mismatches between on-target and off-target sites have been studied recently. We confirm that there exists strong mismatch enrichment and preferences at the 5′-end close regions of the off-target sequences. Compared with the on-target sites, sequences of no-editing sites can also be characterized by GC composition changes and position-specific mismatch binary features. Under this novel space of features, an ensemble strategy was applied to train a prediction model. The model achieved a mean score of 0.99 for the Area Under the Receiver Operating Characteristic curve and a mean score of 0.45 for the Area Under the Precision-Recall curve in cross-validations on big datasets, outperforming state-of-the-art methods in various test scenarios. Our predicted off-target sites also correspond very well to those detected by high-throughput sequencing techniques. In particular, two case studies for selecting sgRNAs to cure hearing loss and retinal degeneration partly prove the effectiveness of our method. Availability and implementation: The Python and MATLAB versions of the source code for detecting off-target sites of a given sgRNA and the supplementary files are freely available on the web at [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty558
      Issue No: Vol. 34, No. 17 (2018)
  • DREAM-Yara: an exact read mapper for very large databases with short
           update time
    • Authors: Dadi T; Siragusa E, Piro V, et al.
      Abstract: Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such an index takes about 1 day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times. Results: To solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of an approximate search distributor via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM framework. Availability and implementation: [link missing]
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty567
      Issue No: Vol. 34, No. 17 (2018)
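The DREAM-Yara abstract above uses Bloom filters to exclude database parts where a read cannot match. The sketch below illustrates that idea with one ordinary Bloom filter per database bin; it is not the interleaved Bloom filter from the paper, whose layout is not described in this listing. It also only handles the exact-match case, where a bin can be excluded as soon as one k-mer of the read is absent. The class, function names, filter size and hash construction are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash positions by salting SHA-256 with an index.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        return all(self.bits[position] for position in self._positions(item))


def kmers(sequence, k):
    """Return the set of k-mers of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_bin_filters(bins, k):
    """Build one Bloom filter per reference bin from its k-mers."""
    filters = []
    for reference in bins:
        bloom = BloomFilter()
        for kmer in kmers(reference, k):
            bloom.add(kmer)
        filters.append(bloom)
    return filters


def candidate_bins(read, filters, k):
    """Indices of bins that cannot be excluded for an exact match of `read`."""
    read_kmers = kmers(read, k)
    return [
        index
        for index, bloom in enumerate(filters)
        if all(kmer in bloom for kmer in read_kmers)
    ]
```

Only the bins returned by `candidate_bins` would then be searched with a full index, and a bin whose sequences change needs only its own filter and index rebuilt.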
  • Learning structural motif representations for efficient protein structure
    • Authors: Liu Y; Ye Q, Wang L, et al.
      Abstract: Motivation: Given a protein of unknown function, fast identification of similar protein structures from the Protein Data Bank (PDB) is a critical step for inferring its biological function. Such structural neighbors can provide evolutionary insights into protein conformation, interfaces and binding sites that are not detectable from sequence similarity. However, the computational cost of performing pairwise structural alignment against all structures in PDB is prohibitively expensive. Alignment-free approaches have been introduced to enable fast but coarse comparisons by representing each protein as a vector of structure features or fingerprints and only computing similarity between vectors. As a notable example, FragBag represents each protein by a ‘bag of fragments’, which is a vector of frequencies of contiguous short backbone fragments from a predetermined library. Despite being efficient, the accuracy of FragBag is unsatisfactory because its backbone fragment library may not be optimally constructed and long-range interacting patterns are omitted. Results: Here we present a new approach to learning effective structural motif representations using deep learning. We develop DeepFold, a deep convolutional neural network model to extract structural motif features of a protein structure. We demonstrate that DeepFold substantially outperforms FragBag on protein structural search on a non-redundant protein structure database and a set of newly released structures. Remarkably, DeepFold not only extracts meaningful backbone segments but also finds important long-range interacting motifs for structural comparison. We expect that DeepFold will provide new insights into the evolution and hierarchical organization of protein structural motifs. Availability and implementation: [link missing]
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty585
      Issue No: Vol. 34, No. 17 (2018)
  • Insights on the alteration of functionality of a tyrosine kinase 2
           variant: a molecular dynamics study
    • Authors: Lesgidou N; Eliopoulos E, Goulielmos G, et al.
      Abstract: Motivation: The tyrosine kinase 2 protein (Tyk2), encoded by the TYK2 gene, has a crucial role in signal transduction and the pathogenesis of many diseases. A single nucleotide polymorphism of the TYK2 gene, SNP rs34536443, is of major importance, since it has been shown to confer protection against various, mainly autoimmune, diseases. This polymorphism results in a Pro to Ala change at amino acid position 1104 of the encoded Tyk2 protein that affects its enzymatic activity. However, the details of the underlying mechanism are unknown. To address this issue, in this study, we used molecular dynamics simulations on the kinase domains of both wild type and variant Tyk2 protein. Results: Our MD results provided information, at atomic level, on the consequences of the Pro1104 to Ala substitution on the structure and dynamics of the kinase domain of Tyk2 and suggested reduced enzymatic activity of the resulting protein variant due to stabilization of inactive conformations, thus adding to knowledge towards the elucidation of the protection mechanism against autoimmune diseases associated with this point mutation.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty556
      Issue No: Vol. 34, No. 17 (2018)
  • Topology independent structural matching discovers novel templates for
           protein interfaces
    • Authors: Mirabello C; Wallner B.
      Abstract: Motivation: Protein–protein interactions (PPI) are essential for the function of the cellular machinery. The rapid growth of protein–protein complexes with known 3D structures offers a unique opportunity to study PPI to gain crucial insights into protein function and the causes of many diseases. In particular, it would be extremely useful to compare interaction surfaces of monomers, as this would enable the pinpointing of potential interaction surfaces based solely on the monomer structure, without the need to predict the complete complex structure. While there are many structural alignment algorithms for individual proteins, very few have been developed for protein interfaces, and none that can align only the interface residues to other interfaces or surfaces of interacting monomer subunits in a topology independent (non-sequential) manner. Results: We present InterComp, a method for topology and sequence-order independent structural comparisons. The method is general and can be applied to various structural comparison applications. By representing residues as independent points in space rather than as a sequence of residues, InterComp can be applied to a wide range of problems including interface–surface comparisons and interface–interface comparisons. We demonstrate a use-case by applying InterComp to find similar protein interfaces on the surface of proteins. We show that InterComp pinpoints the correct interface for almost half of the targets (283 of 586) when considering the top 10 hits, and for 24% of the top 1, even when no templates can be found with regular sequence-order dependent structural alignment methods. Availability and implementation: The source code and the datasets are available at [link missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty587
      Issue No: Vol. 34, No. 17 (2018)
  • Analysis of single amino acid variations in singlet hot spots of
           protein–protein interfaces
    • Authors: Ozdemir E; Gursoy A, Keskin O.
      Abstract: Motivation: Single amino acid variations (SAVs) in protein–protein interaction (PPI) sites play critical roles in diseases. PPI sites (interfaces) have a small subset of residues called hot spots that contribute significantly to the binding energy, and they may form clusters called hot regions. Singlet hot spots are the single amino acid hot spots outside of the hot regions. The distribution of SAVs on the interface residues may be related to their disease association. Results: We performed statistical and structural analyses of SAVs with literature curated experimental thermodynamics data, and demonstrated that SAVs which destabilize PPIs are more likely to be found in singlet hot spots than in hot regions or energetically less important interface residues. In contrast, non-hot spot residues are significantly enriched in neutral SAVs, which do not affect PPI stability. Surprisingly, we observed that singlet hot spots tend to be enriched in disease-causing SAVs, while benign SAVs significantly occur in non-hot spot residues. Our work demonstrates that SAVs in singlet hot spot residues have a significant effect on protein stability and function. Availability and implementation: The dataset used in this paper is available as Supplementary Material and can also be found online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty569
      Issue No: Vol. 34, No. 17 (2018)
  • Predicting protein–protein interactions through sequence-based deep
           learning
    • Authors: Hashemifar S; Neyshabur B, Khan A, et al.
      Abstract: Motivation: High-throughput experimental techniques have produced a large amount of protein–protein interaction (PPI) data, but their coverage is still low and the PPI data is also very noisy. Computational prediction of PPIs can be used to discover new PPIs and identify errors in the experimental PPI data. Results: We present a novel deep learning framework, DPPI, to model and predict PPIs from sequence information alone. Our model efficiently applies a deep, Siamese-like convolutional neural network combined with random projection and data augmentation to predict PPIs, leveraging existing high-quality experimental PPI data and evolutionary information of a protein pair under prediction. Our experimental results show that DPPI outperforms the state-of-the-art methods on several benchmarks in terms of area under precision-recall curve (auPR), and is computationally more efficient. We also show that DPPI is able to predict homodimeric interactions where other methods fail to work accurately, and demonstrate the effectiveness of DPPI in specific applications such as predicting cytokine-receptor binding affinities. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty573
      Issue No: Vol. 34, No. 17 (2018)
  • iCFN: an efficient exact algorithm for multistate protein design
    • Authors: Karimi M; Shen Y.
      Abstract: Motivation: Multistate protein design addresses real-world challenges, such as multi-specificity design and backbone flexibility, by considering both positive and negative protein states with an ensemble of substates for each. It also presents an enormous challenge to exact algorithms that guarantee the optimal solutions and enable a direct test of mechanistic hypotheses behind models. However, efficient exact algorithms are lacking for multistate protein design. Results: We have developed an efficient exact algorithm called interconnected cost function networks (iCFN) for multistate protein design. Its generic formulation allows for a wide array of applications such as stability, affinity and specificity designs while addressing concerns such as global flexibility of protein backbones. iCFN treats each substate design as a weighted constraint satisfaction problem (WCSP) modeled through a CFN; and it solves the coupled WCSPs using novel bounds and a depth-first branch-and-bound search over a tree structure of sequences, substates, and conformations. When iCFN is applied to specificity design of a T-cell receptor, a problem of unprecedented size to exact methods, it drastically reduces search space and running time to make the problem tractable. Moreover, iCFN generates experimentally-agreeing receptor designs with improved accuracy compared with state-of-the-art methods, highlights the importance of modeling backbone flexibility in protein design, and reveals molecular mechanisms underlying binding specificity. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty564
      Issue No: Vol. 34, No. 17 (2018)
  • DeepDTA: deep drug–target binding affinity prediction
    • Authors: Öztürk H; Özgür A, Ozkirimli E.
      Abstract: Motivation: The identification of novel drug–target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein–ligand interactions assume a continuum of binding strength values, also called binding affinity, and predicting this value remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein–ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs). Results: The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug–target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty593
      Issue No: Vol. 34, No. 17 (2018)
  • Protein pocket detection via convex hull surface evolution and associated
           Reeb graph
    • Authors: Zhao R; Cang Z, Tong Y, et al.
      Abstract: Motivation: Protein pocket information is invaluable for drug target identification, agonist design, virtual screening and receptor-ligand binding analysis. A recent study indicates that about half of holoproteins can simultaneously bind multiple interacting ligands in a large pocket containing structured sub-pockets. Although this hierarchical pocket and sub-pocket structure has a significant impact on multi-ligand synergistic interactions in the protein binding site, there is no method available for this analysis. This work introduces a computational tool based on differential geometry, algebraic topology and physics-based simulation to address this pressing issue. Results: We propose to detect protein pockets by evolving the convex hull surface inwards until it touches the protein surface everywhere. The governing partial differential equations (PDEs) include the mean curvature flow combined with the eikonal equation commonly used in the fast marching algorithm in the Eulerian representation. The surface evolution induced Morse function and Reeb graph are utilized to characterize the hierarchical pocket and sub-pocket structure in controllable detail. The proposed method is validated on PDBbind refined sets of 4414 protein-ligand complexes. Extensive numerical tests indicate that the proposed method not only provides a unique description of pocket-sub-pocket relations, but also offers efficient estimations of pocket surface area, pocket volume and pocket depth. Availability and implementation: Source code and a webserver are available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty598
      Issue No: Vol. 34, No. 17 (2018)
  • MDPbiome: microbiome engineering through prescriptive perturbations
    • Authors: García-Jiménez B; de la Rosa T, Wilkinson M.
      Abstract: Motivation: Recent microbiome dynamics studies highlight the current inability to predict the effects of external perturbations on complex microbial populations. To do so would be particularly advantageous in fields such as medicine, bioremediation or industrial scenarios. Results: MDPbiome statistically models longitudinal metagenomics samples undergoing perturbations as a Markov Decision Process (MDP). Given a starting microbial composition, our MDPbiome system suggests the sequence of external perturbation(s) that will engineer that microbiome to a goal state, for example, a healthier or more performant composition. It also estimates intermediate microbiome states along the path, thus making it possible to avoid particularly undesirable/unhealthy states. We demonstrate MDPbiome performance over three real and distinct datasets, proving its flexibility, and the reliability and universality of its output ‘optimal perturbation policy’. For example, an MDP created using a vaginal microbiome time series, with a goal of recovering from bacterial vaginosis, suggested avoidance of perturbations such as lubricants or sex toys; while another MDP provided a quantitative explanation for why salmonella vaccine accelerates gut microbiome maturation in chicks. This novel analytical approach has clear applications in medicine, where it could suggest low-impact clinical interventions that will lead to achievement or maintenance of a healthy microbial population, or alternately, the sequence of interventions necessary to avoid strongly negative microbiome states. Availability and implementation: Code and result files are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty562
      Issue No: Vol. 34, No. 17 (2018)
  • piMGM: incorporating multi-source priors in mixed graphical models for
           learning disease networks
    • Authors: Manatakis D; Raghu V, Benos P.
      Abstract: Motivation: Learning probabilistic graphs over mixed data is an important way to combine gene expression and clinical disease data. Leveraging the existing, yet imperfect, information in pathway databases for mixed graphical model (MGM) learning is an understudied problem with tremendous potential applications in systems medicine, the problems of which often involve high-dimensional data. Results: We present a new method, piMGM, which can learn with accuracy the structure of probabilistic graphs over mixed data by appropriately incorporating priors from multiple experts with different degrees of reliability. We show that piMGM accurately scores the reliability of prior information from a given expert even at low sample sizes. The reliability scores can be used to determine active pathways in healthy and disease samples. We tested piMGM on both simulated and real data from TCGA, and we found that its performance is not affected by unreliable priors. We demonstrate the applicability of piMGM by successfully using prior information to identify pathway components that are important in breast cancer and improve cancer subtype classification. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty591
      Issue No: Vol. 34, No. 17 (2018)
  • Ontology-based validation and identification of regulatory phenotypes
    • Authors: Kulmanov M; Schofield P, Gkoutos G, et al.
      Abstract: Motivation: Function annotations of gene products, and phenotype annotations of genotypes, provide valuable information about molecular mechanisms that can be utilized by computational methods to identify functional and phenotypic relatedness, improve our understanding of disease and pathobiology, and lead to discovery of drug targets. Identifying functions and phenotypes commonly requires experiments which are time-consuming and expensive to carry out; creating the annotations additionally requires a curator to make an assertion based on reported evidence. Support to validate the mutual consistency of functional and phenotype annotations, as well as a computational method to predict phenotypes from function annotations, would greatly improve the utility of function annotations. Results: We developed a novel ontology-based method to validate the mutual consistency of function and phenotype annotations. We apply our method to mouse and human annotations, and identify several inconsistencies that can be resolved to improve overall annotation quality. We also apply our method to the rule-based prediction of regulatory phenotypes from functions and demonstrate that we can predict these phenotypes with Fmax of up to 0.647.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty605
      Issue No: Vol. 34, No. 17 (2018)
  • Quantitative trait loci identification for brain endophenotypes via new
           additive model with random networks
    • Authors: Wang X; Chen H, Yan J, et al.
      Abstract: Motivation: The identification of quantitative trait loci (QTL) is critical to the study of causal relationships between genetic variations and disease abnormalities. We focus on identifying the QTLs associated with the brain endophenotypes in an imaging genomics study for Alzheimer’s Disease (AD). Existing research works mainly depict the association between single nucleotide polymorphisms (SNPs) and the brain endophenotypes via linear methods, which may introduce high bias due to the simplicity of the models. Since the influence of QTLs on brain endophenotypes is quite complex, appropriate non-linear models are needed to investigate the associations of genotypes and endophenotypes. Results: In this paper, we propose a new additive model to learn the non-linear associations between SNPs and brain endophenotypes in Alzheimer’s disease. Our model can be flexibly employed to explain the non-linear influence of QTLs, and is thus more adaptive for the complex distribution of the high-throughput biological data. Meanwhile, as an important computational learning theory contribution, we provide the generalization error analysis for the proposed approach. Unlike most previous theoretical analysis under the independent and identically distributed samples assumption, our error bound is based on m-dependent observations, which is more appropriate for the high-throughput and noisy biological data. Experiments on the data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort demonstrate the promising performance of our approach for identifying biologically meaningful SNPs. Availability and implementation: An executable is available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty557
      Issue No: Vol. 34, No. 17 (2018)
  • Liquid-chromatography retention order prediction for metabolite
    • Authors: Bach E; Szedmak S, Brouard C, et al.
      Abstract: Motivation: Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. Results: We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: We show that retention order is much better conserved between instruments than retention time. To this end, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn retention behaviour of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run. Availability and implementation: An implementation of the method is available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty590
      Issue No: Vol. 34, No. 17 (2018)
  • fastp: an ultra-fast all-in-one FASTQ preprocessor
    • Authors: Chen S; Zhou Y, Chen Y, et al.
      Abstract: Motivation: Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results: We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation: The open-source code and corresponding instructions are available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty560
      Issue No: Vol. 34, No. 17 (2018)
  • DeepDiff: DEEP-learning for predicting DIFFerential gene expression from
           histone modifications
    • Authors: Sekhon A; Singh R, Qi Y.
      Abstract: Motivation: Computational methods that predict differential gene expression from histone modification signals are highly desirable for understanding how histone modifications control the functional heterogeneity of cells through influencing differential gene regulation. Recent studies either failed to capture combinatorial effects on differential prediction or focused primarily on cell type-specific analysis. In this paper we develop a novel attention-based deep learning architecture, DeepDiff, that provides a unified and end-to-end solution to model and to interpret how dependencies among histone modifications control the differential patterns of gene regulation. DeepDiff uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the spatial structure of input signals and to model how various histone modifications cooperate automatically. We introduce and train two levels of attention jointly with the target prediction, enabling DeepDiff to attend differentially to relevant modifications and to locate important genome positions for each modification. Additionally, DeepDiff introduces a novel deep-learning based multi-task formulation to use the cell-type-specific gene expression predictions as auxiliary tasks, encouraging richer feature embeddings in our primary task of differential expression prediction. Results: Using data from Roadmap Epigenomics Project (REMC) for ten different pairs of cell types, we show that DeepDiff significantly outperforms the state-of-the-art baselines for differential gene expression prediction. The learned attention weights are validated by observations from previous studies about how epigenetic mechanisms connect to differential gene expression. Availability and implementation: Code and results are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty612
      Issue No: Vol. 34, No. 17 (2018)
  • Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene
           prioritization without phenotypes
    • Authors: Alshahrani M; Hoehndorf R.
      Abstract: Motivation: In recent years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization methods. These methods commonly compute the similarity between a disease’s (or patient’s) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes, which is highly incomplete in humans as well as in many model organisms such as the mouse. Results: We developed SmuDGE, a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as between a disease and gene, or a disease and patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprised of multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and furthermore significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty559
      Issue No: Vol. 34, No. 17 (2018)
  • An ontology-based method for assessing batch effect adjustment approaches
           in heterogeneous datasets
    • Authors: Schmidt F; List M, Cukuroglu E, et al.
      Abstract: Motivation: International consortia such as the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) or the International Human Epigenetics Consortium (IHEC) have produced a wealth of genomic datasets with the goal of advancing our understanding of cell differentiation and disease mechanisms. However, utilizing all of these data effectively through integrative analysis is hampered by batch effects, large cell type heterogeneity and low replicate numbers. To study if batch effects across datasets can be observed and adjusted for, we analyze RNA-seq data of 215 samples from ENCODE, Roadmap, BLUEPRINT and DEEP as well as 1336 samples from GTEx and TCGA. While batch effects are a considerable issue, it is non-trivial to determine if batch adjustment leads to an improvement in data quality, especially in cases of low replicate numbers. Results: We present a novel method for assessing the performance of batch effect adjustment methods on heterogeneous data. Our method borrows information from the Cell Ontology to establish if batch adjustment leads to a better agreement between observed pairwise similarity and similarity of cell types inferred from the ontology. A comparison of state-of-the-art batch effect adjustment methods suggests that batch effects in heterogeneous datasets with low replicate numbers cannot be adequately adjusted. Better methods need to be developed, which can be assessed objectively in the framework presented here. Availability and implementation: Our method is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty553
      Issue No: Vol. 34, No. 17 (2018)
  • Computational enhancement of single-cell sequences for inferring tumor
    • Authors: Miura S; Huuki L, Buturla T, et al.
      Abstract: Motivation: Tumor sequencing has entered an exciting phase with the advent of single-cell techniques that are revolutionizing the assessment of single nucleotide variation (SNV) at the highest cellular resolution. However, state-of-the-art single-cell sequencing technologies produce data with many missing bases (MBs) and incorrect base designations that lead to false-positive (FP) and false-negative (FN) detection of somatic mutations. While computational methods are available to make biological inferences in the presence of these errors, the accuracy of the imputed MBs and corrected FPs and FNs remains unknown. Results: Using computer simulated datasets, we assessed the robustness of four existing methods (OncoNEM, SCG, SCITE and SiFit) and one new method (BEAM). BEAM is a Bayesian evolution-aware method that improves the quality of single-cell sequences by using the intrinsic evolutionary information in the single-cell data in a molecular phylogenetic framework. Overall, BEAM and SCITE performed the best. Most of the methods imputed MBs with high accuracy, but effective detection and correction of FPs and FNs is a challenge, especially for small datasets. Analysis of an empirical dataset shows that computational methods can improve both the quality of tumor single-cell sequences and their utility for biological inference. In conclusion, tumor cells descend from pre-existing cells, which creates evolutionary continuity in single-cell sequencing datasets. This information enables BEAM and other methods to correctly impute missing data and incorrect base assignments, but correction of FPs and FNs remains challenging when the number of SNVs sampled is small relative to the number of cells sequenced. Availability and implementation: BEAM is available on the web.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty571
      Issue No: Vol. 34, No. 17 (2018)
  • A Boolean network inference from time-series gene expression data using a
           genetic algorithm
    • Authors: Barman S; Kwon Y.
      Abstract: Motivation: Inferring a gene regulatory network from time-series gene expression data is a fundamental problem in systems biology, and many methods have been proposed. However, most of them were not efficient in inferring regulatory relations involving a large number of genes because they limited the number of regulatory genes or computed an approximated reliability of multivariate relations. Therefore, an improved method is needed to efficiently search for more generalized and scalable regulatory relations. Results: In this study, we propose a genetic algorithm-based Boolean network inference (GABNI) method which can search for an optimal Boolean regulatory function of a large number of regulatory genes. For an efficient search, it solves the problem in two stages. GABNI first exploits an existing method, a mutual information-based Boolean network inference (MIBNI), because it can quickly find an optimal solution in a small-scale inference problem. When MIBNI fails to find an optimal solution, a genetic algorithm (GA) is applied to search for an optimal set of regulatory genes in a wider solution space. In particular, we modified a typical GA framework to efficiently reduce the search space. We compared GABNI with four well-known inference methods through extensive simulations on both the artificial and the real gene expression datasets. Our results demonstrated that GABNI significantly outperformed them in both structural and dynamics accuracies. Conclusion: The proposed method is an efficient and scalable tool to infer a Boolean network from time-series gene expression data. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty584
      Issue No: Vol. 34, No. 17 (2018)
  • Scalable and exhaustive screening of metabolic functions carried out by
           microbial consortia
    • Authors: Frioux C; Fremy E, Trottier C, et al.
      Abstract: Motivation: The selection of species exhibiting metabolic behaviors of interest is a challenging step when switching from the investigation of a large microbiota to the study of functions effectiveness. Approaches based on a compartmentalized framework are not scalable. The output of scalable approaches based on a non-compartmentalized modeling may be so large that it has neither been explored nor handled so far. Results: We present the Miscoto tool to facilitate the selection of a community optimizing a desired function in a microbiome by reporting several possibilities which can then be sorted according to biological criteria. Communities are exhaustively identified using logical programming and by combining the non-compartmentalized and the compartmentalized frameworks. The benchmarking of 4.9 million metabolic functions associated with the Human Microbiome Project shows that Miscoto is suited to screen and classify metabolic producibility in terms of feasibility, functional redundancy and cooperation processes involved. As an illustration of a host-microbial system, screening the Recon 2.2 human metabolism highlights the role of different consortia within a family of 773 intestinal bacteria. Availability and implementation: Miscoto source code, instructions for use and examples are available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty588
      Issue No: Vol. 34, No. 17 (2018)
  • Higher-order molecular organization as a source of biological function
    • Authors: Gaudelet T; Malod-Dognin N, Pržulj N.
      Abstract: Motivation: Molecular interactions have widely been modelled as networks. The local wiring patterns around molecules in molecular networks are linked with their biological functions. However, networks model only pairwise interactions between molecules and cannot explicitly and directly capture the higher-order molecular organization, such as protein complexes and pathways. Hence, we ask whether hypergraphs (hypernetworks), which directly capture entire complexes and pathways along with protein–protein interactions (PPIs), carry additional functional information beyond what can be uncovered from networks of pairwise molecular interactions. The mathematical formalism of a hypergraph has long been known, but not often used in studying molecular networks due to the lack of sophisticated algorithms for mining the underlying biological information hidden in the wiring patterns of molecular systems modelled as hypernetworks. Results: We propose a new, multi-scale, protein interaction hypernetwork model that utilizes hypergraphs to capture different scales of protein organization, including PPIs, protein complexes and pathways. In analogy to graphlets, we introduce hypergraphlets, small, connected, non-isomorphic, induced sub-hypergraphs of a hypergraph, to quantify the local wiring patterns of these multi-scale molecular hypergraphs and to mine them for new biological information. We apply them to model the multi-scale protein networks of baker’s yeast and human and show that the higher-order molecular organization captured by these hypergraphs is strongly related to the underlying biology. Importantly, we demonstrate that our new models and data mining tools reveal different, but complementary biological information compared with classical PPI networks. We apply our hypergraphlets to successfully predict biological functions of uncharacterized proteins. Availability and implementation: Code and data are available online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty570
      Issue No: Vol. 34, No. 17 (2018)
  • FLYCOP: metabolic modeling-based analysis and engineering microbial
    • Authors: García-Jiménez B; García J, Nogales J.
      Abstract: Motivation: Synthetic microbial communities are beginning to be considered as promising multicellular biocatalysts with a large potential to replace engineered single strains in biotechnology applications, in the pharmaceutical, chemical and living architecture sectors. In contrast to single strain engineering, the effective and high-throughput analysis and engineering of microbial consortia face a lack of knowledge, tools and well-defined workflows. This manuscript helps to fill this important gap with a framework, called FLYCOP (FLexible sYnthetic Consortium OPtimization), which contributes to microbial consortia modeling and engineering, while improving the knowledge about how these communities work. FLYCOP selects the best consortium configuration to optimize a given goal, among multiple and diverse configurations, in a flexible way, taking temporal changes in metabolite concentrations into account. Results: In contrast to previous systems optimizing microbial consortia, FLYCOP has novel characteristics to face up to new problems, to represent additional features and to analyze events influencing the consortia behavior. In this manuscript, FLYCOP optimizes a Synechococcus elongatus-Pseudomonas putida consortium to produce the maximum amount of bio-plastic (PHA, polyhydroxyalkanoate), and highlights the influence of metabolite exchange dynamics in a consortium of four auxotrophic Escherichia coli strains with parallel growth. FLYCOP can also provide an explanation about biological evolution driving evolutionary engineering endeavors by describing why and how heterogeneous populations emerge from monoclonal ones. Availability and implementation: Code reproducing the study cases described in this manuscript is available online: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty561
      Issue No: Vol. 34, No. 17 (2018)
  • Single cell network analysis with a mixture of Nested Effects Models
    • Authors: Pirkl M; Beerenwinkel N.
      Abstract: Motivation: New technologies allow for the elaborate measurement of different traits of single cells under genetic perturbations. These interventional data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous. Results: We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular subpopulations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq. Availability and implementation: The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty602
      Issue No: Vol. 34, No. 17 (2018)
  • Hierarchical HotNet: identifying hierarchies of altered subnetworks
    • Authors: Reyna M; Leiserson M, Raphael B.
      Abstract: Motivation: The analysis of high-dimensional ‘omics data is often informed by the use of biological interaction networks. For example, protein–protein interaction networks have been used to analyze gene expression data, to prioritize germline variants, and to identify somatic driver mutations in cancer. In these and other applications, the underlying computational problem is to identify altered subnetworks containing genes that are both highly altered in an ‘omics dataset and topologically close (e.g. connected) on an interaction network. Results: We introduce Hierarchical HotNet, an algorithm that finds a hierarchy of altered subnetworks. Hierarchical HotNet assesses the statistical significance of the resulting subnetworks over a range of biological scales and explicitly controls for ascertainment bias in the network. We evaluate the performance of Hierarchical HotNet and several other algorithms that identify altered subnetworks on the problem of predicting cancer genes and significantly mutated subnetworks. On somatic mutation data from The Cancer Genome Atlas, Hierarchical HotNet outperforms other methods and identifies significantly mutated subnetworks containing both well-known cancer genes and candidate cancer genes that are rarely mutated in the cohort. Hierarchical HotNet is a robust algorithm for identifying altered subnetworks across different ‘omics datasets. Availability and implementation: Supplementary information: Supplementary material is available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty613
      Issue No: Vol. 34, No. 17 (2018)
  • Understanding the evolution of functional redundancy in metabolic networks
    • Authors: Sambamoorthy G; Raman K.
      Abstract: Motivation: Metabolic networks have evolved to reduce the disruption of key metabolic pathways by the establishment of redundant genes/reactions. Synthetic lethals in metabolic networks provide a window to study these functional redundancies. While synthetic lethals have been previously studied in different organisms, there has been no study on how the synthetic lethals are shaped during adaptation/evolution. Results: To understand the adaptive functional redundancies that exist in metabolic networks, we here explore a vast space of ‘random’ metabolic networks evolved on a glucose environment. We examine essential and synthetic lethal reactions in these random metabolic networks, evaluating over 39 billion phenotypes using an efficient algorithm previously developed in our lab, Fast-SL. We establish that nature tends to harbour higher levels of functional redundancies compared with random networks. We then examine the propensity for different reactions to compensate for one another and show that certain key metabolic reactions that are necessary for growth in a particular growth medium show much higher redundancies, and can partner with hundreds of different reactions across the metabolic networks that we studied. We also observe that certain redundancies are unique to environments while some others are observed in all environments. Interestingly, we observe that even very diverse reactions, such as those belonging to distant pathways, show synthetic lethality, illustrating the distributed nature of robustness in metabolism. Our study paves the way for understanding the evolution of redundancy in metabolic networks, and sheds light on the varied compensation mechanisms that serve to enhance robustness. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty604
      Issue No: Vol. 34, No. 17 (2018)
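
      The notions of essential and synthetic lethal reactions used in the abstract above can be illustrated with a toy example. The sketch below is not Fast-SL and does not use flux balance analysis: it assumes a hypothetical four-reaction network (`TOY`) and a simplified growth test (`can_grow`) that only checks whether a target metabolite is reachable from the seed metabolites. A reaction is essential if removing it alone abolishes growth; a pair is synthetic lethal if each reaction is non-essential but removing both abolishes growth.

```python
from itertools import combinations


def can_grow(reactions, seeds, target):
    """Simplified growth test (an assumption, not FBA): is `target` producible?

    `reactions` maps a name to (substrates, products). A reaction fires once
    all of its substrates are available.
    """
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return target in available


def essential_and_synthetic_lethal(reactions, seeds, target):
    """Return (essential reactions, synthetic lethal pairs) by exhaustive deletion."""

    def grows_without(removed):
        kept = {name: rxn for name, rxn in reactions.items() if name not in removed}
        return can_grow(kept, seeds, target)

    essential = {name for name in reactions if not grows_without({name})}
    non_essential = sorted(set(reactions) - essential)
    pairs = [
        (a, b)
        for a, b in combinations(non_essential, 2)
        if not grows_without({a, b})
    ]
    return essential, pairs


# Hypothetical toy network: two redundant routes from glucose to pyruvate.
TOY = {
    "uptake": (["glc_ext"], ["glc"]),
    "route_a": (["glc"], ["pyr"]),
    "route_b": (["glc"], ["pyr"]),
    "biomass": (["pyr"], ["biomass"]),
}

essential, sl_pairs = essential_and_synthetic_lethal(TOY, ["glc_ext"], "biomass")
print(sorted(essential), sl_pairs)
```

      In this toy network the two redundant routes form the only synthetic lethal pair, while uptake and biomass formation are essential. Exhaustive pairwise deletion scales quadratically with the number of reactions, which is why dedicated algorithms are needed for genome-scale networks.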
  • iTOP: inferring the topology of omics data
    • Authors: Aben N; Westerhuis J, Song Y, et al.
      Abstract: Motivation: In biology, we are often faced with multiple datasets recorded on the same set of objects, such as multi-omics and phenotypic data of the same tumors. These datasets are typically not independent from each other. For example, methylation may influence gene expression, which may, in turn, influence drug response. Such relationships can strongly affect analyses performed on the data, as we have previously shown for the identification of biomarkers of drug response. Therefore, it is important to be able to chart the relationships between datasets. Results: We present iTOP, a methodology to infer a topology of relationships between datasets. We base this methodology on the RV coefficient, a measure of matrix correlation, which can be used to determine how much information is shared between two datasets. We extended the RV coefficient for partial matrix correlations, which allows the use of graph reconstruction algorithms, such as the PC algorithm, to infer the topologies. In addition, since multi-omics data often contain binary data (e.g. mutations), we also extended the RV coefficient for binary data. Applying iTOP to pharmacogenomics data, we found that gene expression acts as a mediator between most other datasets and drug response: only proteomics clearly shares information with drug response that is not present in gene expression. Based on this result, we used TANDEM, a method for drug response prediction, to identify which variables predictive of drug response were distinct to either gene expression or proteomics. Availability and implementation: An implementation of our methodology is available in the R package iTOP on CRAN. Additionally, an R Markdown document with code to reproduce all figures is provided as Supplementary Material. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty636
      Issue No: Vol. 34, No. 17 (2018)
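
      As a rough illustration of the matrix correlation mentioned in the abstract above, the following sketch computes the standard RV coefficient between two datasets measured on the same objects (rows). It is not the iTOP package and does not include the partial or binary extensions the authors describe; the function names and the example data are assumptions made for illustration. With column-centered matrices X and Y, it evaluates tr(XX'YY') / sqrt(tr((XX')^2) tr((YY')^2)).

```python
from math import sqrt


def center_columns(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def gram(rows):
    """Object-by-object cross-product matrix X X' for a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def rv_coefficient(x, y):
    """Standard RV coefficient between two datasets with matching rows.

    Both inputs are lists of rows; they may have different numbers of columns.
    Undefined (division by zero) if either dataset has no variance.
    """
    if len(x) != len(y):
        raise ValueError("datasets must describe the same objects (same row count)")
    sx = gram(center_columns(x))
    sy = gram(center_columns(y))
    # Both Gram matrices are symmetric, so tr(Sx Sy) is the sum of
    # elementwise products.
    numerator = sum(a * b for rx, ry in zip(sx, sy) for a, b in zip(rx, ry))
    norm_x = sum(a * a for row in sx for a in row)
    norm_y = sum(b * b for row in sy for b in row)
    return numerator / sqrt(norm_x * norm_y)


# Hypothetical data: four objects, two variables in x and three in y.
x = [[1.0, 2.0], [3.0, 5.0], [4.0, 1.0], [0.0, 2.0]]
y = [[2.0, 0.5, 1.0], [1.0, 3.0, 0.0], [4.0, 4.0, 2.0], [0.0, 1.0, 5.0]]
print(rv_coefficient(x, y))
```

      The coefficient lies between 0 and 1 and equals 1 when one dataset is a rescaled copy of the other, which makes it usable as a similarity measure between whole data matrices.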
  • Comparative Network Reconstruction using mixed integer programming
    • Authors: Bosdriesz E; Prahallad A, Klinger B, et al.
      Abstract: Motivation: Signal-transduction networks are often aberrated in cancer cells, and new anti-cancer drugs that specifically target oncogenes involved in signaling show great clinical promise. However, the effectiveness of such targeted treatments is often hampered by innate or acquired resistance due to feedbacks, crosstalks or network adaptations in response to drug treatment. A quantitative understanding of these signaling networks and how they differ between cells with different oncogenic mutations or between sensitive and resistant cells can help in addressing this problem. Results: Here, we present Comparative Network Reconstruction (CNR), a computational method to reconstruct signaling networks based on possibly incomplete perturbation data, and to identify which edges differ quantitatively between two or more signaling networks. Prior knowledge about network topology is not required but can straightforwardly be incorporated. We extensively tested our approach using simulated data and applied it to perturbation data from a BRAF mutant, PTPN11 KO cell line that developed resistance to BRAF inhibition. Comparing the reconstructed networks of sensitive and resistant cells suggests that the resistance mechanism involves re-establishing wild-type MAPK signaling, possibly through an alternative RAF-isoform. Availability and implementation: CNR is available as a Python module at Additionally, code to reproduce all figures is available at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 08 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty616
      Issue No: Vol. 34, No. 17 (2018)