Journal Prestige (SJR): 6.14
Citation Impact (CiteScore): 8
Number of Followers: 300
  Hybrid journal (it can contain Open Access articles)
ISSN (Print): 1367-4803; ISSN (Online): 1460-2059
Published by Oxford University Press  [396 journals]
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty359
      Issue No: Vol. 34, No. 13 (2018)
  • A graph-based approach to diploid genome assembly
    • Authors: Garg S; Rautiainen M, Novak A, et al.
      Abstract: Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty279
      Issue No: Vol. 34, No. 13 (2018)
  • Strand-seq enables reliable separation of long reads by chromosome via
           expectation maximization
    • Authors: Ghareghani M; Porubský D, Sanders A, et al.
      Abstract: Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately. Results: To address this, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess the amount of uncertainty inherent to sparse Strand-seq data at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty290
      Issue No: Vol. 34, No. 13 (2018)
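The abstract above only outlines the model. As a toy illustration of its core output, a posterior distribution over chromosomes of origin for each read, here is a generic Bayes computation for a multinomial mixture. The marker-count representation, per-chromosome profiles and priors below are invented for illustration and are far simpler than SaaRclust's actual latent variable model, which also covers read directionality and Strand-seq cell states.

```python
import math

def posterior_chromosome(read_counts, chrom_profiles, priors):
    """Posterior P(chromosome | read) for one long read.

    read_counts    : marker-hit counts for the read, one entry per marker class
    chrom_profiles : per-chromosome multinomial probabilities over marker classes
    priors         : prior probability of each chromosome

    Hypothetical data structures -- a sketch of the Bayes step only.
    """
    log_post = []
    for prior, profile in zip(priors, chrom_profiles):
        ll = math.log(prior) + sum(c * math.log(p)
                                   for c, p in zip(read_counts, profile))
        log_post.append(ll)
    # log-sum-exp normalization for numerical stability
    m = max(log_post)
    unnorm = [math.exp(l - m) for l in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# A read whose marker hits overwhelmingly match chromosome 0's profile
post = posterior_chromosome([9, 1], [[0.9, 0.1], [0.2, 0.8]], [0.5, 0.5])
```

An EM algorithm such as SaaRclust's would alternate this E-step with re-estimation of the profiles from the soft assignments.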
  • Scalable preprocessing for sparse scRNA-seq data exploiting prior
           knowledge
    • Authors: Mukherjee S; Zhang Y, Fan J, et al.
      Abstract: Motivation: Single-cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge, such as bulk RNA-seq information of certain cell types or qualitative marker information, is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge. Results: We find that preprocessing using UNCURL consistently improves the performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally, we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells. Availability and implementation: Source code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty293
      Issue No: Vol. 34, No. 13 (2018)
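UNCURL's core primitive, non-negative matrix factorization, can be sketched in a few lines. The following is a generic Lee–Seung multiplicative-update NMF on a tiny made-up count matrix; it is not UNCURL's implementation, which supports non-Gaussian sampling models and prior knowledge, but it shows the factorization X ≈ WH that such preprocessing rests on.

```python
import random

def nmf(X, k, iters=500, seed=0):
    """Tiny pure-Python NMF via multiplicative updates (Frobenius loss).
    X is a list-of-lists with non-negative entries; returns factors W (n x k)
    and H (k x m). A sketch for small matrices only."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]

    def matmul(A, B):
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(r) for r in zip(*A)]

    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, X), matmul(Wt, WH)
        for a in range(k):
            for j in range(m):
                H[a][j] *= num[a][j] / (den[a][j] + eps)
        # W <- W * (X H^T) / (W H H^T)
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(X, Ht), matmul(WH, Ht)
        for i in range(n):
            for a in range(k):
                W[i][a] *= num[i][a] / (den[i][a] + eps)
    return W, H

# Two obvious "cell types" in a 3-cell x 4-gene toy matrix
X = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 4, 4]]
W, H = nmf(X, k=2)
```

The rows of H then play the role of archetypal expression profiles, and W gives each cell's loadings on them.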
  • Asymptotically optimal minimizers schemes
    • Authors: Marçais G; DeBlasio D, Kingsford C.
      Abstract: Motivation: The minimizers technique is a method to sample k-mers that is used in many bioinformatics software tools to reduce computation, memory usage and run time. The number of applications using minimizers keeps growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density, thereby making existing and future bioinformatics tools even more efficient. Results: From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty258
      Issue No: Vol. 34, No. 13 (2018)
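For readers unfamiliar with the technique, here is a minimal minimizers scheme using the default lexicographic order. The paper's contribution concerns choosing better orders so that the density (fraction of k-mer positions selected) computed below gets lower; the sequence and parameters here are arbitrary.

```python
def minimizers(seq, k, w, order=None):
    """Minimizers sampling: in every window of w consecutive k-mers,
    keep the position of the smallest k-mer under `order`
    (default: lexicographic). Returns the sorted selected positions."""
    order = order or (lambda kmer: kmer)
    selected = set()
    n = len(seq) - k + 1          # number of k-mer start positions
    for start in range(n - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: order(seq[i:i + k]))
        selected.add(best)
    return sorted(selected)

seq = "ACGTACGTACGTACGT"
positions = minimizers(seq, k=3, w=4)
density = len(positions) / (len(seq) - 3 + 1)
```

On this periodic sequence the lexicographically smallest 3-mer "ACG" wins every window, so only positions 0, 4, 8 and 12 are sampled.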
  • Predicting CTCF-mediated chromatin loops using CTCF-MP
    • Authors: Zhang R; Wang Y, Yang Y, et al.
      Abstract: Motivation: The three-dimensional organization of chromosomes within the cell nucleus is highly regulated. It is known that CCCTC-binding factor (CTCF) is an important architectural protein that mediates long-range chromatin loops. Recent studies have shown that the majority of CTCF binding motif pairs at chromatin loop anchor regions are in convergent orientation. However, it remains unknown whether the genomic context at the sequence level can determine if a convergent CTCF motif pair is able to form a chromatin loop. Results: In this article, we directly ask whether and what sequence-based features (other than the motif itself) may be important to establish CTCF-mediated chromatin loops. We found that motif conservation measured by ‘branch-of-origin’, which accounts for motif turnover in evolution, is an important feature. We developed a new machine learning algorithm called CTCF-MP, based on word2vec, to demonstrate that sequence-based features alone have the capability to predict whether a pair of convergent CTCF motifs would form a loop. Together with functional genomic signals from CTCF ChIP-seq and DNase-seq, CTCF-MP is able to make highly accurate predictions on whether a convergent CTCF motif pair would form a loop in a single cell type and also across different cell types. Our work represents an important step towards understanding the sequence determinants that may guide the formation of complex chromatin architectures. Availability and implementation: The source code of CTCF-MP is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty248
      Issue No: Vol. 34, No. 13 (2018)
  • Versatile genome assembly evaluation with QUAST-LG
    • Authors: Mikheenko A; Prjibelski A, Saveliev V, et al.
      Abstract: Motivation: The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automated pipelines capable of assembling nearly complete mammalian-size genomes. Results: In this manuscript, we demonstrate the performance of state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low-coverage regions, we introduce a concept of an upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty266
      Issue No: Vol. 34, No. 13 (2018)
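The abstract does not spell out QUAST-LG's metrics. As a small example of the kind of contiguity statistic such evaluators report, here is the classic N50; QUAST-LG itself computes many more metrics, including reference-based variants such as NGA50, so this is only a representative sketch.

```python
def nx(contig_lengths, x=50):
    """Nx statistic: the largest length L such that contigs of length >= L
    together cover at least x% of the total assembly length.
    N50 is the x=50 case."""
    total = sum(contig_lengths)
    threshold = total * x / 100
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= threshold:
            return length
    return 0

# total length 300; the two longest contigs (100 + 80 = 180) reach 50%
n50 = nx([100, 80, 60, 40, 20])
```

The "upper bound assembly" idea in the paper gives a ceiling against which metrics like this can be judged for a given read set.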
  • Optimization and profile calculation of ODE models using second order
           adjoint sensitivity analysis
    • Authors: Stapor P; Fröhlich F, Hasenauer J.
      Abstract: Motivation: Parameter estimation methods for ordinary differential equation (ODE) models of biological processes can exploit gradients and Hessians of objective functions to achieve convergence and computational efficiency. However, the computational complexity of established methods to evaluate the Hessian scales linearly with the number of state variables and quadratically with the number of parameters. This limits their application to low-dimensional problems. Results: We introduce second order adjoint sensitivity analysis for the computation of Hessians and a hybrid optimization-integration-based approach for profile likelihood computation. Second order adjoint sensitivity analysis scales linearly with the number of parameters and state variables. The Hessians are effectively exploited by the proposed profile likelihood computation approach. We evaluate our approaches on published biological models with real measurement data. Our study reveals an improved computational efficiency and robustness of optimization compared to established approaches when using Hessians computed with adjoint sensitivity analysis. The hybrid computation method was more than 2-fold faster than the best competitor. Thus, the proposed methods and implemented algorithms allow for the improvement of parameter estimation for medium- and large-scale ODE models. Availability and implementation: The algorithms for second order adjoint sensitivity analysis are implemented in the Advanced MATLAB Interface to CVODES and IDAS (AMICI). The algorithm for hybrid profile likelihood computation is implemented in the parameter estimation toolbox (PESTO). Both toolboxes are freely available under the BSD license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty230
      Issue No: Vol. 34, No. 13 (2018)
  • AnoniMME: bringing anonymity to the Matchmaker Exchange platform for rare
           disease gene discovery
    • Authors: Oprisanu B; De Cristofaro E.
      Abstract: Summary: Advances in genome sequencing and genomics research are bringing us closer to a new era of personalized medicine, where healthcare can be tailored to the individual’s genetic makeup and to more effective diagnosis and treatment of rare genetic diseases. Much of this progress depends on collaborations and access to data; thus, a number of initiatives have been introduced to support seamless data sharing. Among these, the Global Alliance for Genomics and Health has developed and operates a platform, called Matchmaker Exchange (MME), which allows researchers to perform queries for rare genetic disease discovery over multiple federated databases. Queries include gene variations which are linked to rare diseases, and the ability to find other researchers that have seen or have interest in those variations is extremely valuable. Nonetheless, in some cases, researchers may be reluctant to use the platform since the queries they make (and thus what they are working on) are revealed to other researchers, and this creates concerns with respect to privacy and competitive advantage. In this paper, we present AnoniMME, a framework geared to enable anonymous queries within the MME platform. The framework, building on a cryptographic primitive called Reverse Private Information Retrieval, lets researchers anonymously query the federated platform in a multi-server setting: specifically, they write their query, along with a public encryption key, anonymously in a public database. Responses are also supported, so that other researchers can respond to queries by providing their encrypted contact details.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty269
      Issue No: Vol. 34, No. 13 (2018)
  • A space and time-efficient index for the compacted colored de Bruijn graph
    • Authors: Almodaresi F; Sarkar H, Srivastava A, et al.
      Abstract: Motivation: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly repetitive sequence regions. Yet, much less attention has been given to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grow large. Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimal perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. Availability and implementation: pufferfish is written in C++11, is open source, and is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty292
      Issue No: Vol. 34, No. 13 (2018)
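As a point of reference for what such an index must support, here is the naive hash-map baseline that the paper improves upon: a "colored" k-mer table mapping each k-mer to the references (colors) and positions containing it. The pufferfish structure replaces this with a compacted de Bruijn graph plus minimal perfect hashing to cut memory; the reference names below are made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(references, k):
    """Toy colored k-mer index: k-mer -> list of (reference_name, position).
    The set of reference names for a k-mer is its 'color set'.
    A plain dict is the memory-hungry baseline that succinct indices beat."""
    index = defaultdict(list)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

refs = {"genomeA": "ACGTACGT", "genomeB": "TTACGTTT"}
idx = build_kmer_index(refs, k=4)
hits = idx["ACGT"]                    # occurs in both references
colors = {name for name, _ in hits}   # the k-mer's color set
```

Queries then reduce to dictionary lookups per k-mer of the pattern, exactly the operation the paper's structure accelerates in small space.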
  • Personalized regression enables sample-specific pan-cancer analysis
    • Authors: Lengerich B; Aragam B, Xing E.
      Abstract: Motivation: In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. In order to better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models. Results: To uncover this inter-sample heterogeneity, we propose a novel regularizer for achieving patient-specific personalized estimation. This regularizer operates by learning two latent distance metrics—one between personalized parameters and one between clinical covariates—and attempting to match the induced distances as closely as possible. Crucially, we do not assume these distance metrics are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer. Availability and implementation: Software for personalized linear and personalized logistic regression, along with code to reproduce experimental results, is freely available online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty250
      Issue No: Vol. 34, No. 13 (2018)
  • A scalable estimator of SNP heritability for biobank-scale data
    • Authors: Wu Y; Sankararaman S.
      Abstract: Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments estimator that has a runtime complexity O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log3 N, log3 M)). We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min. Availability and implementation: The RHE-reg software is made freely available to the research community online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty253
      Issue No: Vol. 34, No. 13 (2018)
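The "random matrix-vector multiplications" in the abstract refer to randomized estimation of quadratic forms. A generic Hutchinson trace estimator, shown here on a small explicit matrix, conveys the idea; RHE-reg applies the same trick to genotype-derived kernels without ever materializing them, which is what gives the O(NMB) runtime. The matrix and parameters below are made up.

```python
import random

def hutchinson_trace(matvec, n, b, seed=0):
    """Randomized trace estimator: for random sign vectors z,
    E[z^T A z] = tr(A), so averaging over b draws needs only b
    matrix-vector products instead of forming A."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(b):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / b

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]
mv = lambda z: [sum(a * zj for a, zj in zip(row, z)) for row in A]
est = hutchinson_trace(mv, n=3, b=2000)   # true trace is 12
```

In the heritability setting, `matvec` would be two passes through the genotype matrix (X then Xᵀ), so A = XXᵀ/M is never built.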
  • A unifying framework for joint trait analysis under a non-infinitesimal
           model
    • Authors: Johnson R; Shi H, Pasaniuc B, et al.
      Abstract: Motivation: A large proportion of risk regions identified by genome-wide association studies (GWAS) are shared across multiple diseases and traits. Understanding whether this clustering is due to sharing of causal variants or chance colocalization can provide insights into the shared etiology of complex traits and diseases. Results: In this work, we propose a flexible, unifying framework to quantify the overlap between a pair of traits, called UNITY (Unifying Non-Infinitesimal Trait analYsis). We formulate a Bayesian generative model that relates the overlap between pairs of traits to GWAS summary statistic data under a non-infinitesimal genetic architecture underlying each trait. We propose a Metropolis–Hastings sampler to compute the posterior density of the genetic overlap parameters in this model. We validate our method through comprehensive simulations and analyze summary statistics from height and body mass index GWAS to show that it produces estimates consistent with the known genetic makeup of both traits. Availability and implementation: The UNITY software is made freely available to the research community online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty254
      Issue No: Vol. 34, No. 13 (2018)
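The Metropolis–Hastings machinery the abstract mentions is generic and easy to sketch. The sampler below is a random-walk MH kernel targeting a standard normal rather than UNITY's posterior over genetic-overlap parameters; the step size and step count are arbitrary choices for the demo.

```python
import math
import random

def metropolis_hastings(logpdf, x0, steps, step_size=0.5, seed=0):
    """Random-walk Metropolis-Hastings: propose x' = x + N(0, step_size),
    accept with probability min(1, p(x')/p(x)). Works with an unnormalized
    log-density, which is all a Bayesian posterior usually offers."""
    rng = random.Random(seed)
    x, lp = x0, logpdf(x0)
    samples = []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, step_size)
        lp_prop = logpdf(prop)
        if math.log(rng.random()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

std_normal = lambda x: -0.5 * x * x        # log-density up to a constant
draws = metropolis_hastings(std_normal, x0=0.0, steps=20000)
mean = sum(draws) / len(draws)
```

For a multi-parameter model like UNITY's, the same accept/reject loop runs over a vector-valued state, typically with per-parameter proposals.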
  • ISMB 2018 proceedings
    • Authors: Bromberg Y; Radivojac P.
      Abstract: The 26th Annual Conference on Intelligent Systems for Molecular Biology (ISMB) was held in Chicago, Illinois, USA, July 6–10, 2018. ISMB is the flagship conference of the International Society for Computational Biology (ISCB) and the world’s premier forum for dissemination of scientific research in computational biology and its intersection with other fields. This special issue serves as the Proceedings of ISMB 2018.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty413
      Issue No: Vol. 34, No. 13 (2018)
  • AmpUMI: design and analysis of unique molecular identifiers for deep
           amplicon sequencing
    • Authors: Clement K; Farouni R, Bauer D, et al.
      Abstract: Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take UMI information into account in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments. Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for the determination of the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and the analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on the frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis. Availability and implementation: AmpUMI is open-source and freely available online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty264
      Issue No: Vol. 34, No. 13 (2018)
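The minimum-UMI-length question is essentially a birthday problem. The sketch below assumes UMIs are drawn uniformly from the 4^L sequences of length L, which is a simplification: AmpUMI's published model additionally weights by the allele frequency distribution, so treat this as the uniform special case.

```python
import math

def min_umi_length(n_molecules, max_collision_prob=0.01):
    """Smallest UMI length L (4-letter alphabet) keeping the probability
    that any two molecules draw the same UMI below the target, using the
    birthday approximation P(collision) ~ 1 - exp(-n(n-1) / (2 * 4**L))."""
    L = 1
    while True:
        p = 1.0 - math.exp(-n_molecules * (n_molecules - 1) / (2.0 * 4.0 ** L))
        if p <= max_collision_prob:
            return L
        L += 1

# 10 000 input molecules, 1% tolerated collision probability
length = min_umi_length(10000, 0.01)
```

The quadratic growth of colliding pairs in n is why UMI length must grow with roughly twice the log of the molecule count.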
  • Haplotype phasing in single-cell DNA-sequencing data
    • Authors: Satas G; Raphael B.
      Abstract: Motivation: Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants. Results: We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, lengths which are generally longer than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use the results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with median length 9.2 kb—comparable to typical gene lengths—compared with median lengths of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations. Availability and implementation: Source code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty286
      Issue No: Vol. 34, No. 13 (2018)
  • Dissecting newly transcribed and old RNA using GRAND-SLAM
    • Authors: Jürges C; Dölken L, Erhard F.
      Abstract: Summary: Global quantification of total RNA is used to investigate steady-state levels of gene expression. However, being able to differentiate pre-existing RNA (that has been synthesized prior to a defined point in time) and newly transcribed RNA can provide invaluable information, e.g. to estimate RNA half-lives or identify fast and complex regulatory processes. Recently, new techniques based on metabolic labeling and RNA-seq have emerged that allow quantifying new and old RNA: nucleoside analogs are incorporated into newly transcribed RNA and are made detectable as point mutations in mapped reads. However, relatively infrequent incorporation events and significant sequencing error rates make the differentiation between old and new RNA a highly challenging task. We developed a statistical approach termed GRAND-SLAM that, for the first time, allows estimating the proportion of old and new RNA in such an experiment. Uncertainty in the estimates is quantified in a Bayesian framework. Simulation experiments show our approach to be unbiased and highly accurate. Furthermore, we analyze how uncertainty in the proportion translates into uncertainty in estimating RNA half-lives and give guidelines for planning experiments. Finally, we demonstrate that our estimates of RNA half-lives compare favorably to other experimental approaches and that biological processes affecting RNA half-lives can be investigated with greater power than offered by any other method. Availability and implementation: GRAND-SLAM is freely available for non-commercial use; R scripts to generate all figures are available at Zenodo (doi: 10.5281/zenodo.1162340).
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.5281/zenodo.1162340
      Issue No: Vol. 34, No. 13 (2018)
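The old/new separation problem can be caricatured as a two-component binomial mixture: T-to-C mismatches arise at the sequencing-error rate in old RNA and at the (higher) incorporation rate in new RNA. Below is a grid-search maximum-likelihood estimate of the new-RNA fraction with made-up rates and counts; GRAND-SLAM itself computes a full Bayesian posterior and estimates the rates from the data, so this is only a conceptual sketch.

```python
import math

def estimate_new_fraction(conversions, coverages, p_err, p_conv, grid=1000):
    """MLE of the newly transcribed fraction pi under a binomial mixture:
    a read's conversion count k out of n labelable positions is
    Binomial(n, p_err) if the molecule is old, Binomial(n, p_conv) if new."""
    def log_binom(k, n, p):
        return (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))

    best_pi, best_ll = 0.0, float("-inf")
    for i in range(grid + 1):
        pi = i / grid
        ll = 0.0
        for k, n in zip(conversions, coverages):
            old = (1 - pi) * math.exp(log_binom(k, n, p_err))
            new = pi * math.exp(log_binom(k, n, p_conv))
            ll += math.log(old + new + 1e-300)
        if ll > best_ll:
            best_pi, best_ll = pi, ll
    return best_pi

# 6 reads with 20 labelable positions each: three look old (0 conversions),
# three look new (5 conversions); rates are illustrative only
pi_hat = estimate_new_fraction([0, 0, 0, 5, 5, 5], [20] * 6,
                               p_err=0.01, p_conv=0.2)
```

With these clearly separated counts the estimate lands near 0.5, i.e. half the molecules are inferred to be newly transcribed.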
  • An integration of fast alignment and maximum-likelihood methods for
           electron subtomogram averaging and classification
    • Authors: Zhao Y; Zeng X, Guo Q, et al.
      Abstract: Motivation: Cellular Electron CryoTomography (CECT) is an emerging 3D imaging technique that visualizes the subcellular organization of single cells at sub-molecular resolution and in a near-native state. CECT captures large numbers of macromolecular complexes of highly diverse structures and abundances. However, the structural complexity and imaging limits complicate the systematic de novo structural recovery and recognition of these macromolecular complexes. Efficient and accurate reference-free subtomogram averaging and classification represent the most critical tasks for such analysis. Existing subtomogram-alignment-based methods are prone to missing wedge effects and low signal-to-noise ratio (SNR). Moreover, existing maximum-likelihood based methods rely on integration operations, which are in principle computationally infeasible for accurate calculation. Results: Building on existing work, we propose an integrated method, the Fast Alignment Maximum Likelihood method (FAML), which uses fast subtomogram alignment to sample sub-optimal rigid transformations. The transformations are then used to approximate integrals for the maximum-likelihood update of subtomogram averages through an expectation–maximization algorithm. Our tests on simulated and experimental subtomograms showed that, compared to our previously developed fast alignment method (FA), FAML is significantly more robust to noise and missing wedge effects, with moderate increases in computation cost. Besides, FAML performs well with significantly fewer input subtomograms when the FA method fails. Therefore, FAML can serve as a key component for improved construction of initial structural models from macromolecules captured by CECT.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty267
      Issue No: Vol. 34, No. 13 (2018)
  • Viral quasispecies reconstruction via tensor factorization with successive
           read removal
    • Authors: Ahn S; Ke Z, Vikalo H.
      Abstract: Motivation: As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains: a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small. Results: This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of its components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities of 1–10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains. Availability and implementation: TenSQR is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty291
      Issue No: Vol. 34, No. 13 (2018)
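The successive-removal loop at the heart of TenSQR can be illustrated with a deliberately simplified sketch: no tensor factorization here — a column-wise consensus and a Hamming distance stand in for the clustering step, and `max_dist` is an arbitrary toy threshold, not a published parameter.

```python
from collections import Counter

def consensus(reads):
    """Column-wise majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def successive_reconstruction(reads, max_dist=1):
    """Toy successive strain inference with read removal: the consensus of the
    remaining reads is dominated by the most abundant strain; reads close to
    it (within max_dist mismatches) are discarded and the loop repeats, so
    rarer strains surface in later rounds."""
    strains, remaining = [], list(reads)
    while remaining:
        strain = consensus(remaining)
        strains.append(strain)
        remaining = [r for r in remaining if hamming(r, strain) > max_dist]
    return strains
```

On a toy pool of 5 reads from one strain and 2 from another, the abundant strain is recovered first and its reads removed, after which the rare strain becomes the majority.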
  • Convolutional neural networks for classification of alignments of
           non-coding RNA sequences
    • Authors: Aoki G; Sakakibara Y.
      Abstract: Motivation: The convolutional neural network (CNN) has been applied to the classification of DNA sequences, with the additional purpose of motif discovery. Training CNNs on distributed representations of the four nucleotides has successfully derived position weight matrices, on the learned kernels, that correspond to sequence motifs such as protein-binding sites. Results: We propose a novel application of CNNs to the classification of pairwise alignments of sequences for accurate clustering, and show the benefits of inputting pairwise alignments to a CNN for clustering non-coding RNA (ncRNA) sequences and for motif discovery. Classifying a pairwise alignment of two sequences into positive and negative classes corresponds to clustering the input sequences. After combining the distributed representation of RNA nucleotides with secondary-structure information specific to ncRNAs, and further with mapping profiles of next-generation sequencing reads, training CNNs to classify alignments of RNA sequences yielded accurate clustering in terms of ncRNA families and outperformed the existing clustering methods for ncRNA sequences. Several interesting sequence motifs and secondary-structure motifs known for the snoRNA family and specific to the microRNA and tRNA families were identified. Availability and implementation: The source code of our CNN software, built on the deep-learning framework Chainer, is available at, and the dataset used for performance evaluation in this work is available at the same URL.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty228
      Issue No: Vol. 34, No. 13 (2018)
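The input-encoding step described above can be sketched independently of any network: each column of a pairwise alignment becomes the concatenation of the two aligned characters' one-hot vectors (a gap maps to zeros), yielding an L × 8 matrix a 1-D CNN could consume. The network itself is omitted; this is only the representation.

```python
ALPHABET = "ACGU"

def one_hot(ch):
    """4-dim one-hot for an RNA base; an alignment gap '-' maps to all zeros."""
    return [1.0 if ch == b else 0.0 for b in ALPHABET]

def encode_alignment(seq1, seq2):
    """Encode a pairwise alignment (two gapped, equal-length strings) as an
    L x 8 matrix: each row is the concatenated one-hot vectors of the two
    characters aligned in that column."""
    assert len(seq1) == len(seq2), "aligned sequences must have equal length"
    return [one_hot(a) + one_hot(b) for a, b in zip(seq1, seq2)]
```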
  • DisruPPI: structure-based computational redesign algorithm for protein
           binding disruption
    • Authors: Choi Y; Furlon J, Amos R, et al.
      Abstract: Motivation: Disruption of protein–protein interactions can mitigate antibody recognition of therapeutic proteins, yield monomeric forms of oligomeric proteins, and elucidate signaling mechanisms, among other applications. While designing affinity-enhancing mutations remains generally quite challenging, both statistically and physically based computational methods can precisely identify affinity-reducing mutations. To leverage this ability to design variants of a target protein with disrupted interactions, we developed the DisruPPI protein design method (DISRUpting Protein–Protein Interactions), which optimizes combinations of mutations simultaneously for both disruption and stability, so that incorporated disruptive mutations do not inadvertently affect the target protein adversely. Results: Two existing methods for predicting mutational effects on binding, FoldX and INT5, were demonstrated to be quite precise in selecting disruptive mutations from the SKEMPI and AB-Bind databases of experimentally determined changes in binding free energy. DisruPPI was implemented to use an INT5-based disruption score integrated with an AMBER-based stability assessment, and was applied to disrupt protein interactions in a set of targets representing diverse applications. In retrospective evaluation on three different case studies, comparison of DisruPPI-designed variants to published experimental data showed that DisruPPI was able to identify more diverse interaction-disrupting and stability-preserving variants more efficiently and effectively than previous approaches. In prospective application to an interaction between enhanced green fluorescent protein (EGFP) and a nanobody, DisruPPI was used to design five EGFP variants, all of which were shown to have significantly reduced nanobody binding while maintaining function and thermostability. This demonstrates that DisruPPI may be readily utilized for effective removal of known epitopes of therapeutically relevant proteins. Availability and implementation: DisruPPI is implemented in the EpiSweep package, freely available under an academic use license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty274
      Issue No: Vol. 34, No. 13 (2018)
  • DeepFam: deep learning based alignment-free method for protein family
           modeling and prediction
    • Authors: Seo S; Oh M, Park Y, et al.
      Abstract: Motivation: Next-generation sequencing technologies generate a large number of newly sequenced proteins, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize so many protein sequences, so protein function prediction is primarily done by computational modeling methods such as profile Hidden Markov Models (pHMM) and k-mer based methods. Nevertheless, existing methods have limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMM is not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed. Results: In this paper, we introduce DeepFam, an alignment-free method that can extract functional information directly from sequences without the need for multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) datasets, DeepFam achieved better accuracy and runtime for predicting protein functions than the state-of-the-art methods, both alignment-free and alignment-based. Additionally, we showed that DeepFam is able to capture conserved regions when modeling protein families; in fact, DeepFam detected conserved regions documented in the Prosite database while predicting the functions of proteins. Our deep learning method will be useful in characterizing the functions of the ever-increasing number of protein sequences. Availability and implementation: Code is available at
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty275
      Issue No: Vol. 34, No. 13 (2018)
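For contrast with DeepFam's learned features, the alignment-free k-mer baseline it is compared against reduces to a fixed-order count vector over all possible k-mers, so vectors from different sequences are directly comparable. A minimal sketch:

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Alignment-free k-mer count features for a protein sequence: counts in
    a fixed lexicographic order over all |alphabet|^k possible k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]
```

For k = 2 over the 20 standard amino acids this yields a 400-dimensional vector, regardless of sequence length.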
  • Protein threading using residue co-variation and deep learning
    • Authors: Zhu J; Wang S, Bu D, et al.
      Abstract: Motivation: Template-based modeling, including homology modeling and protein threading, is a popular method for protein 3D structure prediction. However, alignment generation and template selection for protein sequences without close templates remain very challenging. Results: We present a new method called DeepThreader to improve protein threading, including both alignment generation and template selection, by making use of deep learning (DL) and residue co-variation information. Our method first employs DL to predict the inter-residue distance distribution from residue co-variation and sequential information (e.g. sequence profile and predicted secondary structure), and then builds the sequence-template alignment by integrating predicted distance information and sequential features through an ADMM algorithm. Experimental results suggest that predicted inter-residue distance is helpful to both protein alignment and template selection, especially for protein sequences without very close templates, and that our method outperforms the currently popular homology modeling method HHpred and the threading method CNFpred by a large margin, and greatly outperforms the latest contact-assisted protein threading method EigenTHREADER. Availability and implementation:
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty278
      Issue No: Vol. 34, No. 13 (2018)
  • mGPfusion: predicting protein stability changes with Gaussian process
           kernel learning and data fusion
    • Authors: Jokinen E; Heinonen M, Lähdesmäki H.
      Abstract: Motivation: Proteins are commonly used by the biochemical industry for numerous processes. Refining these proteins’ properties via mutations also causes stability effects. An accurate computational method to predict how mutations affect protein stability is necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data. Results: We have developed mGPfusion, a novel Gaussian process (GP) method for predicting a protein’s stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only for the protein of interest and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins, and that incorporating molecular simulation data improves model learning and prediction accuracy. Availability and implementation: Software implementation and datasets are available at
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty238
      Issue No: Vol. 34, No. 13 (2018)
  • DLBI: deep learning guided Bayesian inference for structure reconstruction
           of super-resolution fluorescence microscopy
    • Authors: Li Y; Xu F, Zhang F, et al.
      Abstract: Motivation: Super-resolution fluorescence microscopy, with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and an inability to capture latent structures. Results: Here, we propose a novel deep learning guided Bayesian inference (DLBI) approach for the time-series analysis of high-density fluorescent images. Our method combines the strengths of deep learning and statistical inference: deep learning captures the underlying distribution of the fluorophores that is consistent with the observed time-series fluorescent images, by exploring local features and correlation along the time axis, and statistical inference further refines the ultrastructure extracted by deep learning and endows the final image with physical meaning. In particular, our method contains three main components. The first is a simulator that takes a high-resolution image as input and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, providing supervised training data for the deep learning model. The second is a multi-scale deep learning module that captures both spatial information in each input low-resolution image and temporal information among the time-series images. The third is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local-patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while being more than two orders of magnitude faster. Availability and implementation: The main program is available at
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty241
      Issue No: Vol. 34, No. 13 (2018)
  • A novel methodology on distributed representations of proteins using their
           interacting ligands
    • Authors: Öztürk H; Ozkirimli E, Özgür A.
      Abstract: Motivation: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands, and the chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined using the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, the Basic Local Alignment Search Tool (BLAST) and ProtVec, and two compound fingerprint-based protein representation methods are compared. Results: We showed that ligand-based protein representation, which uses only the SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence- or structure-based representation of proteins, and that this novel approach can be applied to different bioinformatics problems such as prediction of new protein–ligand interactions and protein function annotation. Availability and implementation:
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty287
      Issue No: Vol. 34, No. 13 (2018)
  • HFSP: high speed homology-driven function annotation of proteins
    • Authors: Mahlich Y; Steinegger M, Rost B, et al.
      Abstract: Motivation: The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods. Results: Here we describe homology-derived functional similarity of proteins (HFSP), a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (85% precision) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 16% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty262
      Issue No: Vol. 34, No. 13 (2018)
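The idea of deciding functional similarity from alignment length and sequence identity can be sketched as an identity score measured against a length-dependent threshold: short alignments must be more identical to count. The curve shape below follows the classic HSSP-style family of homology thresholds, but the constant `c` and the exact exponents are illustrative placeholders, not HFSP's published parameterization.

```python
import math

def length_adjusted_threshold(ungapped_len, c=480.0):
    """Illustrative length-dependent identity threshold (HSSP-style curve):
    decreases as the ungapped alignment length grows. The constant c is a
    placeholder, not a fitted HFSP parameter."""
    n = max(ungapped_len, 1)
    return c * n ** (-0.32 * (1 + math.exp(-n / 1000.0)))

def functional_similarity_score(pide, ungapped_len):
    """Positive score -> predicted functional similarity: percent identity
    relative to the length-adjusted threshold."""
    return pide - length_adjusted_threshold(ungapped_len)
```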
  • Enumerating consistent sub-graphs of directed acyclic graphs: an insight
           into biomedical ontologies
    • Authors: Peng Y; Jiang Y, Radivojac P.
      Abstract: Motivation: Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontologies can be seen as directed acyclic graphs (DAGs), where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent sub-graph; that is, a sub-graph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent sub-graphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology. Results: We propose an algorithm for enumerating consistent sub-graphs of DAGs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from the graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs, and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation. Availability and implementation:
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty268
      Issue No: Vol. 34, No. 13 (2018)
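The object being counted (a node set closed under generalization) and the tree base case the recursion bottoms out in can both be made concrete in a few lines. The brute force below is exponential and only serves to check the linear-time tree recursion on tiny examples; the paper's contribution is precisely avoiding this enumeration on general DAGs.

```python
from itertools import combinations

def count_consistent_bruteforce(nodes, parents):
    """Count node sets closed under generalization: if v is in the set, all
    of v's parents must be too (the empty annotation counts as well).
    parents maps each node to the list of its direct generalizations."""
    count = 0
    for r in range(len(nodes) + 1):
        for subset in combinations(nodes, r):
            s = set(subset)
            if all(set(parents.get(v, ())) <= s for v in s):
                count += 1
    return count

def count_consistent_tree(children, root):
    """Linear-time count for a rooted tree: a node is either absent (so its
    whole subtree is absent) or present, with each child subtree free to
    choose independently."""
    prod = 1
    for c in children.get(root, []):
        prod *= count_consistent_tree(children, c)
    return 1 + prod
```

On the diamond DAG r → {a, b} → d there are exactly 6 consistent sub-graphs: {}, {r}, {r,a}, {r,b}, {r,a,b} and {r,a,b,d}.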
  • MicroPheno: predicting environments and host phenotypes from 16S rRNA gene
           sequencing using a k-mer based representation of shallow sub-samples
    • Authors: Asgari E; Garakani K, McHardy A, et al.
      Abstract: Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities from different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis, with proven applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations, which benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn’s disease prediction. Aside from being more accurate, using k-mer features from shallow sub-samples (i) avoids the computationally costly sequence alignments required in OTU picking and (ii) provides a proof of concept that shallow and short-length 16S rRNA sequencing suffices for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. Availability and implementation: The software and datasets are available at
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty296
      Issue No: Vol. 34, No. 13 (2018)
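The core representation — a normalized k-mer distribution computed from a random shallow sub-sample of reads — is small enough to sketch directly. Repeating `shallow_subsample_profile` with different seeds gives the bootstrap replicates used to judge whether a shallow depth already yields a stable profile; depth and seed here are arbitrary toy choices.

```python
import random
from collections import Counter
from itertools import product

def kmer_distribution(reads, k=3, alphabet="ACGT"):
    """Normalized k-mer frequency vector pooled over a set of reads,
    in fixed lexicographic k-mer order."""
    counts = Counter()
    for r in reads:
        counts.update(r[i:i + k] for i in range(len(r) - k + 1))
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]

def shallow_subsample_profile(reads, depth, k=3, seed=0):
    """k-mer profile of a random shallow sub-sample of the reads; calling
    this with several seeds yields bootstrap replicates of the profile."""
    rng = random.Random(seed)
    sample = rng.sample(reads, min(depth, len(reads)))
    return kmer_distribution(sample, k=k)
```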
  • SIMPLE: Sparse Interaction Model over Peaks of moLEcules for fast,
           interpretable metabolite identification from tandem mass spectra
    • Authors: Nguyen D; Nguyen C, Mamitsuka H.
      Abstract: Motivation: Recent success in metabolite identification from tandem mass spectra has been led by machine learning, which has two stages: mapping mass spectra to molecular fingerprint vectors, and then retrieving candidate molecules from the database. In the first stage, i.e. fingerprint prediction, spectrum peaks are features, and considering their interactions would be reasonable for more accurate identification of unknown metabolites. Existing approaches to fingerprint prediction are based only on individual peaks in the spectra, without explicitly considering peak interactions. Also, the current cutting-edge method is based on kernels, which are computationally heavy and difficult to interpret. Results: We propose two learning models that allow us to incorporate peak interactions for fingerprint prediction. First, we extend the state-of-the-art kernel learning method by developing kernels for peak interactions to combine with kernels for peaks through multiple kernel learning (MKL). Second, we formulate a sparse interaction model for metabolite peaks, which we call SIMPLE, that is computationally light and interpretable for fingerprint prediction. The formulation of SIMPLE is convex and guarantees global optimization, for which we develop an alternating direction method of multipliers (ADMM) algorithm. Experiments using the MassBank dataset show that both models achieved prediction accuracy comparable to the current top-performing kernel method. Furthermore, SIMPLE clearly revealed individual peaks and peak interactions which contribute to enhancing the performance of fingerprint prediction. Availability and implementation: The code can be accessed through
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty252
      Issue No: Vol. 34, No. 13 (2018)
  • Bayesian networks for mass spectrometric metabolite identification via
           molecular fingerprints
    • Authors: Ludwig M; Dührkop K, Böcker S.
      Abstract: Motivation: Metabolites, small molecules that are involved in cellular reactions, provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem mass spectrometry to identify the thousands of compounds in a biological sample. Recently, we presented CSI:FingerID for searching molecular structure databases using tandem mass spectrometry data. CSI:FingerID predicts a molecular fingerprint that encodes the structure of the query compound, then uses this to search a molecular structure database such as PubChem. Scoring of the predicted query fingerprint against deterministic target fingerprints is carried out assuming independence between the molecular properties constituting the fingerprint. Results: We present a scoring that takes into account dependencies between molecular properties. As before, we predict posterior probabilities of molecular properties using machine learning. Dependencies between molecular properties are modeled as a Bayesian tree network; the tree structure is estimated on the fly from the instance data. For each edge, we also estimate the expected covariance between the two random variables. For fixed marginal probabilities, we then estimate conditional probabilities using the known covariance. The corrected posterior probability of each candidate can then be computed, and candidates are ranked by this score. Modeling dependencies improves the identification rates of CSI:FingerID by 2.85 percentage points. Availability and implementation: The new Bayesian (fixed tree) scoring is integrated into SIRIUS 4.0 (
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty245
      Issue No: Vol. 34, No. 13 (2018)
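For binary molecular properties, the step described above — going from an edge's covariance and fixed marginals to conditional probabilities — is elementary algebra, sketched here for a single edge of the tree:

```python
def conditionals_from_covariance(p_x, p_y, cov):
    """Binary X, Y with marginals p_x, p_y and covariance
    cov = E[XY] - p_x * p_y. Since E[XY] = P(X=1, Y=1):
      P(X=1 | Y=1) = p_x + cov / p_y
      P(X=1 | Y=0) = p_x - cov / (1 - p_y)
    Setting cov = 0 recovers the independence scoring used previously."""
    return p_x + cov / p_y, p_x - cov / (1.0 - p_y)
```

A quick sanity check: the two conditionals must average back to the marginal under the law of total probability.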
  • A spectral clustering-based method for identifying clones from
           high-throughput B cell repertoire sequencing data
    • Authors: Nouri N; Kleinstein S.
      Abstract: Motivation: B cells derive their antigen specificity from the expression of Immunoglobulin (Ig) receptors on their surface. These receptors are initially generated stochastically by somatic re-arrangement of the DNA and are further diversified following antigen activation by a process of somatic hypermutation, which introduces mainly point substitutions into the receptor DNA at a high rate. Recent advances in next-generation sequencing have enabled large-scale profiling of the B cell Ig repertoire from blood and tissue samples. A key computational challenge in the analysis of these data is partitioning the sequences to identify descendants of a common B cell (i.e. a clone). Current methods group sequences using a fixed distance threshold or a likelihood calculation that is computationally intensive. Here, we propose a new method based on spectral clustering with an adaptive threshold to determine the local sequence neighborhood. Validation using simulated and experimental datasets demonstrates that this method has high sensitivity and specificity compared to a fixed threshold that is optimized for these measures. In addition, this method works on datasets where choosing an optimal fixed threshold is difficult, and is more computationally efficient in all cases. The ability to quickly and accurately identify members of a clone from repertoire sequencing data will greatly improve downstream analyses. Clonally related sequences cannot be treated independently in statistical models, and clonal partitions are used as the basis for the calculation of diversity metrics, lineage reconstruction and selection analysis. Thus, the spectral clustering-based method presented here represents an important contribution to repertoire analysis. Availability and implementation: Source code for this method is freely available in the SCOPe (Spectral Clustering for clOne Partitioning) R package in the Immcantation framework: under the CC BY-SA 4.0 license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty235
      Issue No: Vol. 34, No. 13 (2018)
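The two ingredients above — a pairwise distance matrix and a data-driven, adaptive threshold — can be sketched in a simplified form. This stand-in derives the threshold from the mean nearest-neighbor distance and groups sequences by connected components of the thresholded graph; the published method performs spectral clustering instead, and the `scale` factor here is an arbitrary toy parameter.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def clone_partitions(seqs, scale=1.5):
    """Simplified clonal grouping: link sequences whose Hamming distance falls
    below an adaptive threshold derived from this dataset's nearest-neighbor
    distances, then return connected components (lists of indices)."""
    n = len(seqs)
    d = [[hamming(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    nn = [min(d[i][j] for j in range(n) if j != i) for i in range(n)] if n > 1 else [0]
    threshold = scale * sum(nn) / len(nn)
    # union-find over the thresholded graph
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```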
  • An evolutionary model motivated by physicochemical properties of amino
           acids reveals variation among proteins
    • Authors: Braun E.
      Abstract: Motivation: The relative rates of amino acid interchanges over evolutionary time are likely to vary among proteins. Variation in those rates has the potential to reveal information about constraints on proteins. However, the most straightforward model that could be used to estimate relative rates of amino acid substitution is parameter-rich, and it is therefore impractical to use for this purpose. Results: A six-parameter model of amino acid substitution that incorporates information about the physicochemical properties of amino acids was developed. It showed that amino acid side chain volume, polarity and aromaticity have major impacts on protein evolution. It also revealed variation among proteins in the relative importance of those properties. The same general approach can be used to improve the fit of empirical models such as the commonly used PAM and LG models. Availability and implementation: Perl code and test data are available from
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty261
      Issue No: Vol. 34, No. 13 (2018)
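The modeling idea — interchange rates that decay with physicochemical distance, with per-property weights that can be fit per protein — can be sketched for a handful of amino acids. The property values and weights below are illustrative toy numbers, not the paper's fitted parameters.

```python
import math

# Toy (side-chain volume, polarity) values -- illustrative only.
PROPS = {"A": (88.6, 0.0), "D": (111.1, 49.7), "K": (168.6, 49.5), "F": (189.9, 0.35)}

def exchangeability(a, b, w_volume=0.01, w_polarity=0.02):
    """Relative interchange rate that decays exponentially with weighted
    physicochemical distance: similar amino acids swap more readily. Fitting
    the weights per protein would reveal which properties constrain it."""
    va, pa = PROPS[a]
    vb, pb = PROPS[b]
    return math.exp(-(w_volume * abs(va - vb) + w_polarity * abs(pa - pb)))
```

By construction the matrix is symmetric, equals 1 on the diagonal, and ranks physicochemically similar pairs above dissimilar ones.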
  • Deconvolution and phylogeny inference of structural variations in tumor
           genomic samples
    • Authors: Eaton J; Wang J, Schwartz R.
      Abstract: Motivation: Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic datasets. Despite the growing use of phylogenetics in cancer studies, the field has only slowly adapted to the many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle the inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically. Results: We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal sub-populations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors. Availability and implementation: Python source code and associated documentation are available at
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty270
      Issue No: Vol. 34, No. 13 (2018)
  • Accurate prediction of orthologs in the presence of divergence after
           duplication
    • Authors: Lafond M; Meghdari Miardan M, Sankoff D.
      Abstract: Motivation: When gene duplication occurs, one of the copies may become free of selective pressure and evolve at an accelerated pace. This has important consequences for the prediction of orthology relationships, since two orthologous genes separated by divergence after duplication may differ in both sequence and function. In this work, we make the distinction between primary orthologs, which have not been affected by accelerated mutation rates on their evolutionary path, and secondary orthologs, which have. Similarity-based prediction methods will tend to miss secondary orthologs, whereas phylogeny-based methods cannot separate primary and secondary orthologs. However, both types of orthology have applications in important areas such as gene function prediction and phylogenetic reconstruction, motivating the need for methods that can distinguish the two types. Results: We formalize the notion of divergence after duplication and provide a theoretical basis for the inference of primary and secondary orthologs. We then put these ideas into practice with the Hybrid Prediction of Paralogs and Orthologos (HyPPO) framework, which combines ideas from both similarity and phylogeny approaches. We apply our method to simulated and empirical datasets and show that we achieve superior accuracy in predicting primary orthologs, secondary orthologs and paralogs. Availability and implementation: HyPPO is a modular framework with a core developed in Python, provided with a variety of C++ modules. The source code is available at
      Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty242
      Issue No: Vol. 34, No. 13 (2018)
  • Inference of species phylogenies from bi-allelic markers using
           pseudo-likelihood
    • Authors: Zhu J; Nakhleh L.
      Abstract: Motivation: Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method’s applicability. Results: In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time than those of trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations of the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss. Availability and implementation: The methods have been implemented in PhyloNet (
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty295
      Issue No: Vol. 34, No. 13 (2018)
  • A gene–phenotype relationship extraction pipeline from the biomedical
           literature using a representation learning approach
    • Authors: Xing W; Qi J, Yuan X, et al.
      Abstract: Motivation: The fundamental challenge of modern genetic analysis is to establish gene–phenotype correlations, which are often reported in large-scale publications. Because the lexical features of gene names are relatively regular in text, the main challenge in extracting these relations is phenotype recognition. Phenotypic descriptions are often study- or author-specific, so few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants. Results: We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved the unsupervised word-embedding-to-sentence-embedding cascaded approach as representation learning to recognize the broad range of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated the well-known open information extraction system OLLIE to identify gene–phenotype relations. To demonstrate the applicability of the pipeline, we established two types of comparison experiments using the model organism Arabidopsis thaliana. In a comparison against state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full articles from the manually curated TAIR gene–phenotype relationship dataset to demonstrate its validity. The results showed that our proposed pipeline can cover 70.94% of the original dataset and add 373 new relations to expand it. Availability and implementation: The source code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty263
      Issue No: Vol. 34, No. 13 (2018)
  • Improving genomics-based predictions for precision medicine through active
           elicitation of expert knowledge
    • Authors: Sundin I; Peltola T, Micallef L, et al.
      Abstract: Motivation: Precision medicine requires the ability to predict the efficacies of different treatments for a given individual using high-dimensional genomic measurements. However, identifying predictive features remains a challenge when the sample size is small. Incorporating expert knowledge offers a promising approach to improve predictions, but collecting such knowledge is laborious if the number of candidate features is very large. Results: We introduce a probabilistic framework to incorporate expert feedback about the impact of genomic measurements on the outcome of interest and present a novel approach to collect the feedback efficiently, based on Bayesian experimental design. The new approach outperformed other recent alternatives in two medical applications: prediction of metabolic traits and prediction of sensitivity of cancer cells to different drugs, both using genomic features as predictors. Furthermore, the intelligent approach to collect feedback reduced the workload of the expert to approximately 11%, compared to a baseline approach. Availability and implementation: Source code implementing the introduced computational methods is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty257
      Issue No: Vol. 34, No. 13 (2018)
  • Training for translation between disciplines: a philosophy for life and
           data sciences curricula
    • Authors: Anton Feenstra K; Abeln S, Westerhuis J, et al.
      Abstract: Motivation: Our society has become data-rich to the extent that research in many areas has become impossible without computational approaches. Educational programmes seem to be lagging behind this development. At the same time, there is a growing need not only for strong data science skills, but foremost for the ability to translate between tools and methods on the one hand and applications and problems on the other. Results: Here we present our experiences with shaping and running a master’s programme in bioinformatics and systems biology in Amsterdam. From this, we have developed a comprehensive philosophy, described here, on how translation in training may be achieved in a dynamic and multidisciplinary research area. We furthermore describe two requirements that we have found crucial for enabling translation: sufficient depth and focus on multidisciplinary topic areas, coupled with a balanced breadth from adjacent disciplines. Finally, we present concrete suggestions on how this may be implemented in practice, which may be relevant for the effectiveness of life science and data science curricula in general, and of particular interest to those who are in the process of setting up such curricula. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty233
      Issue No: Vol. 34, No. 13 (2018)
  • Driver gene mutations based clustering of tumors: methods and applications
    • Authors: Zhang W; Flemington E, Zhang K.
      Abstract: Motivation: Somatic mutations in proto-oncogenes and tumor suppressor genes constitute a major category of causal genetic abnormalities in tumor cells. The mutation spectra of thousands of tumors have been generated by The Cancer Genome Atlas (TCGA) and other whole genome (exome) sequencing projects. A promising approach to utilizing these resources for precision medicine is to identify genetic similarity-based sub-types within a cancer type and relate the pinpointed sub-types to the clinical outcomes and pathologic characteristics of patients. Results: We propose two novel methods, ccpwModel and xGeneModel, for mutation-based clustering of tumors. In the former, binary variables indicating the status of cancer driver genes in tumors and the genes’ involvement in the core cancer pathways are treated as the features in the clustering process. In the latter, the functional similarities of putative cancer driver genes and their confidence scores as the ‘true’ driver genes are integrated with the mutation spectra to calculate the genetic distances between tumors. We apply both methods to the TCGA data of 16 cancer types. Promising results are obtained when these methods are compared to state-of-the-art approaches with respect to the associations between the determined tumor clusters and patient race (or survival time). We further extend the analysis to detect mutation-characterized transcriptomic prognostic signatures, which are directly relevant to the etiology of carcinogenesis. Availability and implementation: R codes and example data for ccpwModel and xGeneModel can be obtained online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty232
      Issue No: Vol. 34, No. 13 (2018)
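Clustering tumors on binary driver-gene status requires a genetic distance between mutation profiles. A simple stand-in (illustrative only, not the distance ccpwModel or xGeneModel actually uses) is the Jaccard distance between the sets of mutated driver genes:

```python
def jaccard_distance(genes_a, genes_b):
    """Distance between two tumors from their mutated driver-gene sets:
    1 - |intersection| / |union|. 0.0 means identical mutation profiles."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# two tumors sharing one of three distinct driver mutations
d = jaccard_distance({"TP53", "KRAS"}, {"TP53", "PIK3CA"})
```

A distance matrix built this way can be fed to any standard clustering algorithm; the paper's contribution is to enrich the features with pathway membership and driver-gene confidence scores.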
  • Discriminating early- and late-stage cancers using multiple kernel
           learning on gene sets
    • Authors: Rahimi A; Gönen M.
      Abstract: Motivation: Identifying molecular mechanisms that drive cancers from early to late stages is highly important to develop new preventive and therapeutic strategies. Standard machine learning algorithms could be used to discriminate early- and late-stage cancers from each other using their genomic characterizations. Even though these algorithms would get satisfactory predictive performance, their knowledge extraction capability would be quite restricted due to the highly correlated nature of genomic data. That is why we need algorithms that can also extract relevant information about these biological mechanisms using our prior knowledge about pathways/gene sets. Results: In this study, we addressed the problem of separating early- and late-stage cancers from each other using their gene expression profiles. We proposed to use a multiple kernel learning (MKL) formulation that makes use of pathways/gene sets (i) to obtain satisfactory/improved predictive performance and (ii) to identify biological mechanisms that might have an effect in cancer progression. We extensively compared our proposed MKL on gene sets algorithm against two standard machine learning algorithms, namely, random forests and support vector machines, on 20 diseases from the Cancer Genome Atlas cohorts for two different sets of experiments. Our method obtained statistically significantly better or comparable predictive performance on most of the datasets using significantly fewer gene expression features. We also showed that our algorithm was able to extract meaningful and disease-specific information that gives clues about the progression mechanism. Availability and implementation: Our implementations of the support vector machine and multiple kernel learning algorithms in R are available online, together with the scripts that replicate the reported experiments.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty239
      Issue No: Vol. 34, No. 13 (2018)
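In an MKL-on-gene-sets setup, each pathway/gene set yields its own kernel matrix over the samples, and the learned model combines them as a weighted sum; gene sets receiving zero weight drop out, which is what yields the interpretability described above. A minimal sketch of the combination step, with the weights assumed to be already learned (the paper's actual optimization is more involved):

```python
def combine_kernels(kernels, weights):
    """Weighted sum of precomputed gene-set kernel matrices: the kernel
    combination step of multiple kernel learning. `kernels` is a list of
    n x n matrices (lists of lists), `weights` one weight per kernel."""
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for K, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * K[i][j]
    return combined

# two toy 2-sample kernels, e.g. one per pathway, with learned weights
K1 = [[1.0, 0.5], [0.5, 1.0]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
K = combine_kernels([K1, K2], [0.25, 0.75])
```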
  • LONGO: an R package for interactive gene length dependent analysis for
           neuronal identity
    • Authors: McCoy M; Paul A, Victor M, et al.
      Abstract: Motivation: Reprogramming somatic cells into neurons holds great promise to model neuronal development and disease. The efficiency and success rate of neuronal reprogramming, however, may vary between different conversion platforms and cell types, thereby necessitating an unbiased, systematic approach to estimate neuronal identity of converted cells. Recent studies have demonstrated that long genes (>100 kb from transcription start to end) are highly enriched in neurons, which provides an opportunity to identify neurons based on the expression of these long genes. Results: We have developed a versatile R package, LONGO, to analyze gene expression based on gene length. We propose a systematic analysis of long gene expression (LGE) with a metric termed the long gene quotient (LQ) that quantifies LGE in RNA-seq or microarray data to validate neuronal identity at the single-cell and population levels. This unique feature of neurons provides an opportunity to utilize measurements of LGE in transcriptome data to quickly and easily distinguish neurons from non-neuronal cells. By combining this conceptual advancement and statistical tool in a user-friendly and interactive software package, we intend to encourage and simplify further investigation into LGE, particularly as it applies to validating and improving neuronal differentiation and reprogramming methodologies. Availability and implementation: LONGO is freely available for download online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty243
      Issue No: Vol. 34, No. 13 (2018)
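The idea of a length-based expression summary can be illustrated with a much simpler statistic than the paper's LQ metric: the fraction of total expression contributed by long (>100 kb) genes. This is a hedged proxy for exposition only, not LONGO's actual formula:

```python
LONG_GENE_KB = 100  # per the abstract: genes >100 kb count as "long"

def long_gene_fraction(expr, length_kb):
    """Fraction of total expression contributed by long genes.

    `expr` maps gene -> expression value, `length_kb` maps gene -> length in kb.
    A simple proxy for long gene expression (LGE); the LQ metric in the paper
    is more involved, this only illustrates a length-based summary.
    """
    total = sum(expr.values())
    long_total = sum(v for g, v in expr.items() if length_kb[g] > LONG_GENE_KB)
    return long_total / total if total else 0.0

# toy sample: one long gene (250 kb) and one short gene (20 kb)
frac = long_gene_fraction({"g1": 30.0, "g2": 70.0}, {"g1": 250, "g2": 20})
```

In this spirit, a sample with a high value would be flagged as neuron-like, a low value as non-neuronal.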
  • COSSMO: predicting competitive alternative splice site selection using
           deep learning
    • Authors: Bretschneider H; Gandhi S, Deshwar A, et al.
      Abstract: Motivation: Alternative splice site selection is inherently competitive and the probability of a given splice site to be used also depends on the strength of neighboring sites. Here, we present a new model named the competitive splice site model (COSSMO), which explicitly accounts for these competitive effects and predicts the percent selected index (PSI) distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3′ acceptor site conditional on a fixed upstream 5′ donor site or the choice of a 5′ donor site conditional on a fixed 3′ acceptor site. We build four different architectures that use convolutional layers, communication layers, long short-term memory and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model. Results: COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, and achieves an R² of 0.6 in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences and many known splicing factors with high specificity. Availability and implementation: Model predictions, our training dataset and code are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty244
      Issue No: Vol. 34, No. 13 (2018)
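The competitive aspect of PSI prediction can be sketched in one step: given per-site strength scores (in COSSMO these come from the learned networks; here they are made-up numbers), a softmax turns them into a probability distribution over the competing sites, so raising one site's score necessarily lowers every other site's PSI:

```python
import math

def psi_distribution(site_scores):
    """Convert per-site strength scores into a PSI distribution over the
    competing splice sites via a numerically stable softmax."""
    m = max(site_scores)                      # subtract max to avoid overflow
    exps = [math.exp(s - m) for s in site_scores]
    z = sum(exps)
    return [e / z for e in exps]

# three competing acceptor sites with toy scores
psi = psi_distribution([2.0, 1.0, 0.1])
```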
  • Novo&Stitch: accurate reconciliation of genome assemblies via
           optical maps
    • Authors: Pan W; Wanamaker S, Ah-Fong A, et al.
      Abstract: Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e. sequencing errors, uneven sequencing coverage and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g. mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other. Results: The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in one of our recent papers that none of them can consistently produce assemblies that are better than the assemblies provided in input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g. 50 Kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness. Availability and implementation: Novo&Stitch can be obtained online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty255
      Issue No: Vol. 34, No. 13 (2018)
  • Association mapping in biomedical time series via statistically
           significant shapelet mining
    • Authors: Bock C; Gumbsch T, Moor M, et al.
      Abstract: Motivation: Most modern intensive care units record the physiological and vital signs of patients. These data can be used to extract signatures, commonly known as biomarkers, that help physicians understand the biological complexity of many syndromes. However, most biological biomarkers suffer from either poor predictive performance or weak explanatory power. Recent developments in time series classification focus on discovering shapelets, i.e. subsequences that are most predictive in terms of class membership. Shapelets have the advantage of combining a high predictive performance with an interpretable component—their shape. Currently, most shapelet discovery methods do not rely on statistical tests to verify the significance of individual shapelets. Therefore, identifying associations between the shapelets of physiological biomarkers and patients that exhibit certain phenotypes of interest enables the discovery and subsequent ranking of physiological signatures that are interpretable, statistically validated and accurate predictors of clinical endpoints. Results: We present a novel and scalable method for scanning time series and identifying discriminative patterns that are statistically significant. The significance of a shapelet is evaluated while considering the problem of multiple hypothesis testing and mitigating it by efficiently pruning untestable shapelet candidates with Tarone’s method. We demonstrate the utility of our method by discovering patterns in three of a patient’s vital signs (heart rate, respiratory rate and systolic blood pressure) that are indicators of the severity of a future sepsis event, i.e. an inflammatory response to an infective agent that can lead to organ failure and death if not treated in time. Availability and implementation: We make our method and the scripts required to reproduce the experiments publicly available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty246
      Issue No: Vol. 34, No. 13 (2018)
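The core primitive behind shapelet methods is the distance between a candidate subsequence and a full time series: the minimum Euclidean distance over all sliding windows of the shapelet's length. Thresholding this distance gives the binary "matches / does not match" feature whose association with the phenotype is then tested. A minimal sketch (the paper's contribution, Tarone-based pruning of untestable candidates, is not shown):

```python
def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between `shapelet` and any window of the
    same length in `series`; 0.0 means the shapelet occurs exactly."""
    m = len(shapelet)
    best = float("inf")
    for start in range(len(series) - m + 1):
        d = sum((series[start + k] - shapelet[k]) ** 2 for k in range(m)) ** 0.5
        best = min(best, d)
    return best

# the shapelet [1, 3] occurs exactly inside this toy vital-sign trace
d = shapelet_distance([0.0, 1.0, 3.0, 1.0, 0.0], [1.0, 3.0])
```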
  • Gene prioritization using Bayesian matrix factorization with genomic and
           phenotypic side information
    • Authors: Zakeri P; Simm J, Arany A, et al.
      Abstract: Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene–phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene–phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make non-trivial predictions for genes for which no previous disease association is known. Results: Our gene prioritization method can integrate not only data sources describing genes, but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method, Endeavour. In particular, our proposed method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities, when compared to Endeavour. Availability and implementation: The Bayesian data fusion method is implemented as a Python/C++ package and is also available as a Julia package. All data and benchmarks generated or analyzed during this study can be downloaded online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty289
      Issue No: Vol. 34, No. 13 (2018)
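The underlying matrix-completion idea can be sketched without the Bayesian machinery or side information: fit low-rank factors U, V by gradient descent so that U[i]·V[j] approximates the observed gene–phenotype entries, then read predictions for the unobserved ones off the same products. A plain, illustrative version (the paper's model is Bayesian and adds side-information terms):

```python
import random

def factorize(observed, n_rows, n_cols, rank=2, lr=0.05, epochs=500, seed=0):
    """Fit rank-`rank` factors U, V to a sparsely observed matrix by SGD.

    `observed` maps (row, col) -> value; unobserved entries are predicted
    afterwards as the dot product U[i] . V[j].
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), x in observed.items():
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = x - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]      # save before updating either
                U[i][k] += lr * err * v
                V[j][k] += lr * err * u
    return U, V

# toy 2x2 gene-phenotype matrix, fully observed for the sake of the sketch
obs = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
U, V = factorize(obs, 2, 2)
pred_00 = sum(U[0][k] * V[0][k] for k in range(len(U[0])))
```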
  • Modeling polypharmacy side effects with graph convolutional networks
    • Authors: Zitnik M; Agrawal M, Leskovec J.
      Abstract: Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug–drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity. Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein–protein interactions, drug–protein target interactions and the polypharmacy side effects, which are represented as drug–drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug–drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models particularly well polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies. Availability and implementation: Source code and preprocessed datasets are available online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty294
      Issue No: Vol. 34, No. 13 (2018)
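The building block of a graph convolutional network is a single message-passing layer: each node aggregates its neighbors' feature vectors, applies a linear map and a nonlinearity, and the result feeds the next layer or a link-prediction decoder. A bare single-relation sketch (Decagon itself uses relation-specific weights per edge type, which is not shown):

```python
def gcn_layer(adj, features, weight):
    """One mean-aggregation graph-convolution step.

    `adj` is an n x n 0/1 adjacency matrix, `features` an n x d_in feature
    matrix, `weight` a d_in x d_out linear map. Each node averages its own
    and its neighbors' features, projects them, then applies ReLU.
    """
    n = len(adj)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j]] + [i]   # add self-loop
        agg = [sum(features[j][d] for j in neigh) / len(neigh)
               for d in range(len(features[0]))]
        h = [sum(agg[d] * weight[d][o] for d in range(len(weight)))
             for o in range(len(weight[0]))]
        out.append([max(0.0, x) for x in h])               # ReLU
    return out

# two connected nodes with one-hot features, projected to one dimension
adj = [[0, 1], [1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0], [1.0]]
h = gcn_layer(adj, feats, W)
```

Stacking such layers lets information flow across the drug–protein–protein graph before drug-pair edges are scored.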
  • Finding associated variants in genome-wide association studies on multiple traits
    • Authors: Gai L; Eskin E.
      Abstract: Motivation: Many variants identified by genome-wide association studies (GWAS) have been found to affect multiple traits, either directly or through shared pathways. There is currently a wealth of GWAS data collected in numerous phenotypes, and analyzing multiple traits at once can increase power to detect shared variant effects. However, traditional meta-analysis methods are not suitable for combining studies on different traits. When applied to dissimilar studies, these meta-analysis methods can be underpowered compared to univariate analysis. The degree to which traits share variant effects is often not known, and the vast majority of GWAS meta-analyses consider only one trait at a time. Results: Here, we present a flexible method for finding associated variants from GWAS summary statistics for multiple traits. Our method estimates the degree of shared effects between traits from the data. Using simulations, we show that our method properly controls the false positive rate and increases power when an effect is present in a subset of traits. We then apply our method to the North Finland Birth Cohort and UK Biobank datasets using a variety of metabolic traits and discover novel loci. Availability and implementation: Our source code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty249
      Issue No: Vol. 34, No. 13 (2018)
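A classical fixed-effect way to combine per-trait summary statistics is Stouffer's method: the weighted sum of z-scores divided by the norm of the weights is again a z-score under the null. It assumes the effect is shared across all traits, which is exactly the assumption the paper relaxes by estimating the degree of sharing from data; the sketch below is the baseline, not the paper's method:

```python
import math

def stouffer_z(z_scores, weights=None):
    """Combine per-trait association z-scores into one statistic
    (Stouffer's weighted-Z fixed-effect combination)."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    num = sum(w * z for w, z in zip(weights, z_scores))
    den = math.sqrt(sum(w * w for w in weights))
    return num / den

# a variant with a consistent z = 2 signal in three traits
z = stouffer_z([2.0, 2.0, 2.0])
```

Note how three individually modest signals combine to z ≈ 3.46; conversely, opposite-sign effects cancel, which is why dissimilar studies can make such a combination underpowered.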
  • Quantifying the similarity of topological domains across normal and cancer
           human cell types
    • Authors: Sauerwald N; Kingsford C.
      Abstract: Motivation: Three-dimensional chromosome structure has been increasingly shown to influence various levels of cellular and genomic functions. Through Hi-C data, which maps contact frequency on chromosomes, it has been found that structural elements termed topologically associating domains (TADs) are involved in many regulatory mechanisms. However, we have little understanding of the level of similarity or variability of chromosome structure across cell types and disease states. In this study, we present a method to quantify resemblance and identify structurally similar regions between any two sets of TADs. Results: We present an analysis of 23 human Hi-C samples representing various tissue types in normal and cancer cell lines. We quantify global and chromosome-level structural similarity, and compare the relative similarity between cancer and non-cancer cells. We find that cancer cells show higher structural variability around commonly mutated pan-cancer genes than normal cells at these same locations. Availability and implementation: Software for the methods and analysis can be found online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty265
      Issue No: Vol. 34, No. 13 (2018)
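One simple way to compare two sets of TADs (a stand-in for illustration, not necessarily the paper's measure) is the Jaccard similarity of their boundary positions: identical domain partitions score 1.0, disjoint boundary sets score 0.0:

```python
def tad_boundary_similarity(tads_a, tads_b):
    """Jaccard similarity of the boundary positions of two TAD partitions.

    TADs are (start, end) intervals in genomic coordinates; a boundary is
    either endpoint. Returns |intersection| / |union| of the boundary sets.
    """
    bounds = lambda tads: {p for start, end in tads for p in (start, end)}
    a, b = bounds(tads_a), bounds(tads_b)
    return len(a & b) / len(a | b)

# two partitions agreeing on 3 of 4 distinct boundaries (one shifted 100 -> 120)
s = tad_boundary_similarity([(0, 100), (100, 250)], [(0, 100), (120, 250)])
```

Real Hi-C boundary calls are resolution-limited, so in practice a tolerance window around each boundary would be needed rather than exact matching.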
  • Classifying tumors by supervised network propagation
    • Authors: Zhang W; Ma J, Ideker T.
      Abstract: Motivation: Network propagation has been widely used to aggregate and amplify the effects of tumor mutations using knowledge of molecular interaction networks. However, propagating mutations through interactions irrelevant to cancer leads to erosion of pathway signals and complicates the identification of cancer subtypes. Results: To address this problem we introduce a propagation algorithm, Network-Based Supervised Stratification (NBS2), which learns the mutated subnetworks underlying tumor subtypes using a supervised approach. Given an annotated molecular network and reference tumor mutation profiles for which subtypes have been predefined, NBS2 is trained by adjusting the weights on interaction features such that network propagation best recovers the provided subtypes. After training, weights are fixed such that mutation profiles of new tumors can be accurately classified. We evaluate NBS2 on breast and glioblastoma tumors, demonstrating that it outperforms the best network-based approaches in classifying tumors to known subtypes for these diseases. By interpreting the interaction weights, we highlight characteristic molecular pathways driving selected subtypes. Availability and implementation: The NBS2 package is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty247
      Issue No: Vol. 34, No. 13 (2018)
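Plain (unsupervised) network propagation, the starting point NBS2 builds on, is a random walk with restart: a tumor's binary mutation vector is diffused over the interaction network until convergence, smoothing the signal onto neighboring genes. A minimal sketch on an unweighted toy network (NBS2's contribution is learning per-interaction weights on top of this, which is not shown):

```python
def propagate(adj, seed, alpha=0.5, iters=100):
    """Random walk with restart over an unweighted network.

    `adj` is a symmetric 0/1 adjacency matrix, `seed` the initial mutation
    vector, `alpha` the restart probability. Iterates
    p <- alpha * seed + (1 - alpha) * W^T p with W the row-normalized adjacency.
    """
    n = len(adj)
    deg = [sum(row) or 1 for row in adj]       # guard isolated nodes
    p = seed[:]
    for _ in range(iters):
        p = [alpha * seed[i] +
             (1 - alpha) * sum(adj[j][i] * p[j] / deg[j] for j in range(n))
             for i in range(n)]
    return p

# path network 0 - 1 - 2 with a mutation seeded at node 0
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = propagate(adj, [1.0, 0.0, 0.0])
```

After propagation the signal decays with network distance from the mutated gene, so the direct neighbor scores higher than the two-hop node.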
  • Bayesian parameter estimation for biochemical reaction networks using
           region-based adaptive parallel tempering
    • Authors: Ballnus B; Schaper S, Theis F, et al.
      Abstract: Motivation: Mathematical models have become standard tools for the investigation of cellular processes and the unraveling of signal processing mechanisms. The parameters of these models are usually derived from the available data using optimization and sampling methods. However, the efficiency of these methods is limited by the properties of the mathematical model, e.g. non-identifiabilities, and the resulting posterior distribution. In particular, multi-modal distributions with long valleys or pronounced tails are difficult to optimize and sample. Thus, the development or improvement of optimization and sampling methods is subject to ongoing research. Results: We suggest a region-based adaptive parallel tempering algorithm which adapts to the problem-specific posterior distributions, i.e. modes and valleys. The algorithm combines several established algorithms to overcome their individual shortcomings and to improve sampling efficiency. We assessed its properties for established benchmark problems and two ordinary differential equation models of biochemical reaction networks. The proposed algorithm outperformed state-of-the-art methods in terms of calculation efficiency and mixing. Since the algorithm does not rely on a specific problem structure, but adapts to the posterior distribution, it is suitable for a variety of model classes. Availability and implementation: The code, written in MATLAB, is available both as Supplementary Material and in a Git repository. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty229
      Issue No: Vol. 34, No. 13 (2018)
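The basic parallel tempering scheme the paper builds on runs one Metropolis chain per temperature and periodically proposes state swaps between neighbouring temperatures; hot chains cross the valleys between posterior modes, and swaps carry those states down to the cold chain that targets the actual posterior. A deliberately minimal sketch with a fixed temperature ladder (the paper's algorithm additionally adapts temperatures and proposals per posterior region):

```python
import math, random

def parallel_tempering(log_post, temps, steps=2000, seed=0):
    """Minimal parallel tempering: one Metropolis chain per temperature in
    `temps` (temps[0] must be 1.0, the target), with one attempted swap
    between a random neighbouring pair per sweep. Returns the cold-chain
    trace."""
    rng = random.Random(seed)
    states = [0.0] * len(temps)
    samples = []
    for _ in range(steps):
        for c, t in enumerate(temps):          # within-chain Metropolis moves
            prop = states[c] + rng.gauss(0, 1)
            if math.log(rng.random()) < (log_post(prop) - log_post(states[c])) / t:
                states[c] = prop
        c = rng.randrange(len(temps) - 1)       # swap attempt between c, c+1
        a = (1 / temps[c] - 1 / temps[c + 1]) * (log_post(states[c + 1]) - log_post(states[c]))
        if math.log(rng.random()) < a:
            states[c], states[c + 1] = states[c + 1], states[c]
        samples.append(states[0])               # coldest chain is the target
    return samples

# bimodal toy posterior: well-separated Gaussian modes at -5 and +5
log_post = lambda x: math.log(
    math.exp(-(x - 5) ** 2 / 2) + math.exp(-(x + 5) ** 2 / 2) + 1e-300)
draws = parallel_tempering(log_post, temps=[1.0, 4.0, 16.0])
```

A single Metropolis chain at temperature 1 would typically stay stuck in whichever mode it finds first; the swap moves are what let the cold chain jump between the modes.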
  • An optimization framework for network annotation
    • Authors: Patkar S; Sharan R.
      Abstract: Motivation: A chief goal of systems biology is the reconstruction of large-scale executable models of cellular processes of interest. While accurate continuous models are still beyond reach, a powerful alternative is to learn a logical model of the processes under study, which predicts the logical state of any node of the model as a Boolean function of its incoming nodes. Key to learning such models is the functional annotation of the underlying physical interactions with activation/repression (sign) effects. Such annotations are common only for a few well-studied biological pathways. Results: Here we present a novel optimization framework for large-scale sign annotation that employs different plausible models of signaling and combines them in a rigorous manner. We apply our framework to two large-scale knockout datasets in yeast and evaluate its different components as well as the combined model to predict signs of different subsets of physical interactions. Overall, we obtain an accurate predictor that outperforms previous work by a considerable margin. Availability and implementation: The code is publicly available online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty236
      Issue No: Vol. 34, No. 13 (2018)
  • Learning with multiple pairwise kernels for drug bioactivity prediction
    • Authors: Cichonska A; Pahikkala T, Szedmak S, et al.
      Abstract: Motivation: Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for a small number of input pairs. Results: We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it also automatically identifies data sources relevant for the prediction problem. Availability and implementation: Code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty277
      Issue No: Vol. 34, No. 13 (2018)
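The size problem described above comes from the standard pairwise (Kronecker product) kernel: for drug pairs, K((d, t), (d′, t′)) = K_drug(d, d′) · K_target(t, t′), so n drugs and m targets give an nm × nm matrix. Computing entries on demand instead of materialising that matrix is the kind of trick pairwiseMKL's efficiency rests on. A sketch of a single entry (toy kernel values, illustrative only):

```python
def pairwise_kernel_entry(k_drug, k_target, pair1, pair2):
    """Entry of the Kronecker-product pairwise kernel:
    K((d, t), (d', t')) = K_drug(d, d') * K_target(t, t').
    `k_drug` / `k_target` are nested dicts of precomputed base-kernel values."""
    (d1, t1), (d2, t2) = pair1, pair2
    return k_drug[d1][d2] * k_target[t1][t2]

# toy base kernels over two drugs and two protein targets
Kd = {"dA": {"dA": 1.0, "dB": 0.3}, "dB": {"dA": 0.3, "dB": 1.0}}
Kt = {"t1": {"t1": 1.0, "t2": 0.6}, "t2": {"t1": 0.6, "t2": 1.0}}
k = pairwise_kernel_entry(Kd, Kt, ("dA", "t1"), ("dB", "t2"))
```

With 3120 such pairwise kernels and ~168 000 pairs, even storing one full pairwise matrix would be prohibitive, hence the factorized, on-demand formulation.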
  • Improved pathway reconstruction from RNA interference screens by
           exploiting off-target effects
    • Authors: Srivatsa S; Kuipers J, Schmich F, et al.
      Abstract: Motivation: Pathway reconstruction has proven to be an indispensable tool for analyzing the molecular mechanisms of signal transduction underlying cell function. Nested effects models (NEMs) are a class of probabilistic graphical models designed to reconstruct signalling pathways from high-dimensional observations resulting from perturbation experiments, such as RNA interference (RNAi). NEMs assume that the short interfering RNAs (siRNAs) designed to knockdown specific genes are always on-target. However, it has been shown that most siRNAs exhibit strong off-target effects, which further confound the data, resulting in unreliable reconstruction of networks by NEMs. Results: Here, we present an extension of NEMs called probabilistic combinatorial nested effects models (pc-NEMs), which capitalize on the ancillary siRNA off-target effects for network reconstruction from combinatorial gene knockdown data. Our model employs an adaptive simulated annealing search algorithm for simultaneous inference of network structure and error rates inherent to the data. Evaluation of pc-NEMs on simulated data with varying numbers of phenotypic effects and noise levels as well as real data demonstrates improved reconstruction compared to classical NEMs. Application to Bartonella henselae infection RNAi screening data yielded an eight-node network largely in agreement with previous works, and revealed novel binary interactions of direct impact between established components. Availability and implementation: The software used for the analysis is freely available as an R package online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty240
      Issue No: Vol. 34, No. 13 (2018)
  • Onto2Vec: joint vector-based representation of biological entities and
           their ontology-based annotations
    • Authors: Smaili F; Gao X, Hoehndorf R.
      Abstract: Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications. Results: We propose Onto2Vec, a method for learning feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the gene ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein–protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high-quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state of the art in several predictive applications in which ontologies are involved. Availability and implementation: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty259
      Issue No: Vol. 34, No. 13 (2018)
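Onto2Vec itself learns dense vectors with a Word2Vec-style model over ontology axioms and annotations; as a simplified, hypothetical illustration of why the axioms matter, the sketch below propagates each protein's GO annotations through subClassOf axioms and compares proteins with a set-based similarity (the function names and the Jaccard measure are assumptions for illustration, not the paper's method):

```python
def propagate(annotations, subclass_of):
    """Extend each protein's GO annotations with all ancestor classes
    implied by subClassOf axioms (a transitive closure)."""
    def ancestors(term):
        seen, stack = set(), [term]
        while stack:
            t = stack.pop()
            for parent in subclass_of.get(t, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen
    return {p: set(ts) | set().union(*(ancestors(t) for t in ts))
            for p, ts in annotations.items()}

def jaccard(a, b):
    """Set-based semantic similarity between two annotation sets."""
    return len(a & b) / len(a | b)
```

With propagation, two proteins annotated to nearby classes share ancestors and score as similar even when their direct annotations do not overlap.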
  • A new method for constructing tumor specific gene co-expression networks
           based on samples with tumor purity heterogeneity
    • Authors: Petralia F; Wang L, Peng J, et al.
      Abstract: Motivation: Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with a mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose Tumor Specific Net (TSNet), a new method which constructs tumor-cell-specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models the tumor purity percentage in each tumor sample. Results: Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for TPH. We then apply TSNet to estimate tumor-specific gene co-expression networks based on TCGA ovarian cancer RNA-seq data. We identify novel co-expression modules and hub structure specific to tumor cells. Availability and implementation: R code can be found at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty280
      Issue No: Vol. 34, No. 13 (2018)
  • PrimAlign: PageRank-inspired Markovian alignment for large biological networks
    • Authors: Kalecky K; Cho Y.
      Abstract: Motivation: Cross-species analysis of large-scale protein–protein interaction (PPI) networks has played a significant role in understanding the principles driving the evolution of cellular organizations and functions. Recently, network alignment algorithms have been proposed to predict conserved interactions and functions of proteins. These approaches are based on the notion that orthologous proteins across species are similar in sequence and that the topology of PPIs between orthologs is often conserved. However, high accuracy and scalability of network alignment are still a challenge. Results: We propose a novel pairwise global network alignment algorithm, called PrimAlign, which is modeled as a Markov chain and iterated until convergence. The proposed algorithm also incorporates the principles of PageRank. This approach is evaluated on tasks with human, yeast and fruit fly PPI networks. The experimental results demonstrate that PrimAlign outperforms several prevalent methods, with statistically significant differences in multiple evaluation measures. PrimAlign, which is multi-platform, achieves superior runtime performance owing to its linear asymptotic time complexity. Further evaluation with synthetic networks suggests that popular topological measures do not reflect the real precision of alignments. Availability and implementation: The source code is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty288
      Issue No: Vol. 34, No. 13 (2018)
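PrimAlign is described above as a Markov chain iterated until convergence that incorporates the principles of PageRank. The core of any such scheme is a power iteration with teleportation; a generic sketch of that recurrence (the damping factor and uniform restart are standard PageRank assumptions, not PrimAlign's actual transition model over the joint network):

```python
import numpy as np

def pagerank_scores(T, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on a column-stochastic transition matrix T with
    uniform teleportation: p <- d*T@p + (1-d)/n, iterated to a fixed point."""
    n = T.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_new = damping * (T @ p) + (1 - damping) / n
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p
```

The fixed point is a probability distribution over nodes; in an alignment setting the nodes would be candidate cross-species node pairs rather than single proteins.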
  • SigMat: a classification scheme for gene signature matching
    • Authors: Xiao J; Blatti C, Sinha S.
      Abstract: Motivation: Several large-scale efforts have been made to collect gene expression signatures from a variety of biological conditions, such as the response of cell lines to treatment with drugs, or tumor samples with different characteristics. These gene signature collections are utilized through bioinformatics tools for ‘signature matching’, whereby a researcher studying an expression profile can identify previously cataloged biological conditions most related to their profile. Signature matching tools typically retrieve from the collection the signature that has the highest similarity to the user-provided profile. Alternatively, classification models may be applied where each biological condition in the signature collection is a class label; however, such models are trained on the collection of available signatures and may not generalize to the novel cellular context or cell line of the researcher’s expression profile. Results: We present an advanced multi-way classification algorithm for signature matching, called SigMat, that is trained on a large signature collection from a well-studied cellular context, but can also classify signatures from other cell types by relying on an additional, small collection of signatures representing the target cell type. It uses these ‘tuning data’ to learn two additional parameters that help adapt its predictions for other cellular contexts. SigMat outperforms other similarity scores and classification methods in identifying the correct label of a query expression profile from as many as 244 or 500 candidate classes (drug treatments) cataloged by the LINCS L1000 project. SigMat retains its high accuracy in cross-cell line applications even when the amount of tuning data is severely limited. Availability and implementation: SigMat is available on GitHub at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty251
      Issue No: Vol. 34, No. 13 (2018)
  • GSEA-InContext: identifying novel and common patterns in expression experiments
    • Authors: Powers R; Goodspeed A, Pielke-Lombardo H, et al.
      Abstract: Motivation: Gene Set Enrichment Analysis (GSEA) is routinely used to analyze and interpret coordinate pathway-level changes in transcriptomics experiments. For an experiment where fewer than seven samples per condition are compared, GSEA employs a competitive null hypothesis to test significance. A gene set enrichment score is tested against a null distribution of enrichment scores generated from permuted gene sets, where genes are randomly selected from the input experiment. Looking across a variety of biological conditions, however, genes are not randomly distributed, with many showing consistent patterns of up- or down-regulation. As a result, common patterns of positively and negatively enriched gene sets are observed across experiments. Placing a single experiment into the context of a relevant set of background experiments allows us to identify both the common and the experiment-specific patterns of gene set enrichment. Results: We compiled a compendium of 442 small-molecule transcriptomic experiments and used GSEA to characterize common patterns of positively and negatively enriched gene sets. To identify experiment-specific gene set enrichment, we developed the GSEA-InContext method, which accounts for gene expression patterns within a background set of experiments to identify statistically significantly enriched gene sets. We evaluated GSEA-InContext on experiments using small molecules with known targets to show that it successfully prioritizes gene sets that are specific to each experiment, thus providing valuable insights that complement standard GSEA analysis. Availability and implementation: GSEA-InContext is implemented in Python; Supplementary results and the background expression compendium are available at:
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty271
      Issue No: Vol. 34, No. 13 (2018)
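The competitive null described above scores randomly drawn gene sets from the input experiment and compares the observed enrichment against that distribution. A toy sketch of the idea, using a simple mean statistic in place of GSEA's weighted Kolmogorov–Smirnov enrichment score (the function name and the statistic are simplifying assumptions):

```python
import random

def permutation_p_value(gene_stats, set_size, observed, n_perm=1000, seed=0):
    """Competitive null: score random gene sets of the same size drawn
    from the gene-level statistics; p = fraction of null scores at least
    as extreme as the observed score (with a +1 pseudocount)."""
    rng = random.Random(seed)
    def score(vals):
        return sum(vals) / len(vals)
    null = [score(rng.sample(gene_stats, set_size)) for _ in range(n_perm)]
    hits = sum(1 for s in null if s >= observed)
    return (hits + 1) / (n_perm + 1)
```

GSEA-InContext's refinement is, roughly, to draw the null from gene behavior across a background compendium of experiments instead of from the single input experiment.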
  • Deep neural networks and distant supervision for geographic location
           mention extraction
    • Authors: Magge A; Weissenbacher D, Sarker A, et al.
      Abstract: Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. Our experiments also demonstrate the NER’s capability to embed external features that further boost the system’s performance. We believe that the same methodology can be applied to recognizing similar biomedical entities in the scientific literature.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty273
      Issue No: Vol. 34, No. 13 (2018)
  • NeuroMorphoVis: a collaborative framework for analysis and visualization
           of neuronal morphology skeletons reconstructed from microscopy stacks
    • Authors: Abdellah M; Hernando J, Eilemann S, et al.
      Abstract: Motivation: From image stacks to computational models, processing digital representations of neuronal morphologies is essential to neuroscientific research. Workflows involve various techniques and tools, leading in certain cases to convoluted and fragmented pipelines. The existence of an integrated, extensible and free framework for processing, analysis and visualization of those morphologies is a challenge that is still largely unfulfilled. Results: We present NeuroMorphoVis, an interactive, extensible and cross-platform framework for building, visualizing and analyzing digital reconstructions of neuronal morphology skeletons extracted from microscopy stacks. Our framework is capable of detecting and repairing tracing artifacts, allowing the generation of high-fidelity surface meshes and high-resolution volumetric models for simulation and in silico imaging studies. The applicability of NeuroMorphoVis is demonstrated with two case studies. The first simulates the construction of three-dimensional profiles of neuronal somata and the other highlights how the framework is leveraged to create volumetric models of neuronal circuits for simulating different types of in vitro imaging experiments. Availability and implementation: The source code and documentation are freely available at  under the GNU public license. The morphological analysis, visualization and surface meshing are implemented as an extensible Python API (Application Programming Interface) based on Blender, and the volume reconstruction and analysis code is written in C++ and parallelized using OpenMP. The framework features are accessible from a user-friendly GUI (Graphical User Interface) and a rich CLI (Command Line Interface). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty231
      Issue No: Vol. 34, No. 13 (2018)
  • The Kappa platform for rule-based modeling
    • Authors: Boutillier P; Maasha M, Li X, et al.
      Abstract: Motivation: We present an overview of the Kappa platform, an integrated suite of analysis and visualization techniques for building and interactively exploring rule-based models. The main components of the platform are the Kappa Simulator, the Kappa Static Analyzer and the Kappa Story Extractor. In addition to these components, we describe the Kappa User Interface, which includes a range of interactive visualization tools for rule-based models needed to make sense of the complexity of biological systems. We argue that, in this approach, modeling is akin to programming and can likewise benefit from an integrated development environment. Our platform is a step in this direction. Results: We discuss details about the computation and rendering of static, dynamic and causal views of a model, which include the contact map (CM), snapshots at different resolutions, the dynamic influence network (DIN) and causal compression. We provide use cases illustrating how these concepts generate insight. Specifically, we show how the CM and snapshots provide information about systems capable of polymerization, such as Wnt signaling. A well-understood model of the KaiABC oscillator, translated into Kappa from the literature, is deployed to demonstrate the DIN and its use in understanding system dynamics. Finally, we discuss how pathways might be discovered or recovered from a rule-based model by means of causal compression, as exemplified for early events in EGF signaling. Availability and implementation: The Kappa platform is available via the project website at . All components of the platform are open source and freely available through the authors’ code repositories.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty272
      Issue No: Vol. 34, No. 13 (2018)
  • Covariate-dependent negative binomial factor analysis of RNA sequencing data
    • Authors: Zamani Dadaneh S; Zhou M, Qian X.
      Abstract: Motivation: High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example through differential expression analysis, investigating coordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. Although there exists a variety of co-expression network based methods to analyze microarray data for this purpose, blindly extending these microarray-based methods to RNA-seq data may introduce unnecessary bias; it is crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns. Results: We have developed a fully Bayesian covariate-dependent negative binomial factor analysis (dNBFA) method for RNA-seq count data, to capture coordinated gene expression changes while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad hoc choices on data processing, transformation and co-expression measures, and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques. Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering gene modules with significant differential expression and meaningful biological insight. Availability and implementation: dNBFA is implemented in R and is available at
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty237
      Issue No: Vol. 34, No. 13 (2018)
  • aliFreeFold: an alignment-free approach to predict secondary structure
           from homologous RNA sequences
    • Authors: Glouzon J; Ouangraoua A.
      Abstract: Motivation: Predicting the conserved secondary structure of homologous ribonucleic acid (RNA) sequences is crucial for understanding RNA functions. However, fast and accurate RNA structure prediction is challenging, especially when the number and the divergence of homologous RNAs increase. To address this challenge, we propose aliFreeFold, based on a novel alignment-free approach which computes a representative structure from a set of homologous RNA sequences using sub-optimal secondary structures generated for each sequence. It is based on a vector representation of sub-optimal structures that captures structure conservation signals by weighting structural motifs according to their conservation across the sub-optimal structures. Results: We demonstrate that aliFreeFold provides a good balance between speed and accuracy in predicting representative structures for sets of homologous RNAs compared with traditional methods based on sequence and structure alignment. We show that aliFreeFold is capable of uncovering conserved structural features quickly and effectively thanks to its weighting scheme, which gives more (resp. less) importance to common (resp. uncommon) structural motifs. The weighting scheme is also shown to be capable of capturing the conservation signal as the number of homologous RNAs increases. These results demonstrate the ability of aliFreeFold to efficiently and accurately provide interesting structural representatives of RNA families. Availability and implementation: aliFreeFold was implemented in C++. Source code and Linux binary are freely available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty234
      Issue No: Vol. 34, No. 13 (2018)
  • Random forest based similarity learning for single cell RNA sequencing data
    • Authors: Pouyan M; Kostka D.
      Abstract: Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that the application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal. Results: Here, we present RAFSIL, a random forest based approach to learn cell–cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data. Availability and implementation: The RAFSIL R package is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty260
      Issue No: Vol. 34, No. 13 (2018)
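A common way a random forest yields cell–cell similarities is via proximities: two cells are similar if they land in the same leaf in many trees. A small NumPy sketch of that computation from a precomputed leaf-assignment matrix (whether RAFSIL uses exactly this proximity definition is an assumption; the input here would come from a fitted forest's apply step):

```python
import numpy as np

def forest_proximity(leaves):
    """leaves: (n_cells, n_trees) integer array of leaf indices, one row
    per cell, one column per tree. Proximity between two cells is the
    fraction of trees in which they fall into the same leaf."""
    same = (leaves[:, None, :] == leaves[None, :, :])  # (n, n, n_trees)
    return same.mean(axis=2)
```

The resulting symmetric matrix (ones on the diagonal) can feed directly into clustering, MDS or t-SNE as a similarity or, via 1 - proximity, a dissimilarity.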
  • A pan-genome-based machine learning approach for predicting antimicrobial
           resistance activities of the Escherichia coli strains
    • Authors: Her H; Wu Y.
      Abstract: Motivation: Antimicrobial resistance (AMR) is becoming a huge problem in both developed and developing countries, and identifying strains resistant or susceptible to certain antibiotics is essential in fighting against antibiotic-resistant pathogens. Whole-genome sequences have been collected for different microbial strains in order to identify crucial characteristics that allow certain strains to become resistant to antibiotics; however, a global inspection of the gene content responsible for AMR activities remains to be done. Results: We propose a pan-genome-based approach to characterize antibiotic-resistant microbial strains and test this approach on the bacterial model organism Escherichia coli. By identifying core and accessory gene clusters and predicting AMR genes for the E. coli pan-genome, we not only showed that certain classes of genes are unevenly distributed between the core and accessory parts of the pan-genome, but also demonstrated that only a portion of the identified AMR genes belong to the accessory genome. Application of machine learning algorithms to predict whether specific strains were resistant to antibiotic drugs yielded the best prediction accuracy for the set of AMR genes within the accessory part of the pan-genome, suggesting that these gene clusters were most crucial to AMR activities in E. coli. Selecting subsets of AMR genes for different antibiotic drugs based on a genetic algorithm (GA) achieved better prediction performance than the gene sets established in the literature, hinting that the gene sets selected by the GA may warrant further analysis to investigate in more detail how E. coli fights against antibiotics. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty276
      Issue No: Vol. 34, No. 13 (2018)
  • Unsupervised embedding of single-cell Hi-C data
    • Authors: Liu J; Lin D, Yardımcı G, et al.
      Abstract: Motivation: Single-cell Hi-C (scHi-C) data promises to enable scientists to interrogate the 3D architecture of DNA in the nucleus of the cell, studying how this structure varies stochastically or along developmental or cell-cycle axes. However, Hi-C data analysis requires methods that take into account the unique characteristics of this type of data. In this work, we explore whether methods that were previously developed for the analysis of bulk Hi-C data can be applied to scHi-C data, in conjunction with unsupervised embedding. Results: We find that one of these methods, HiCRep, when used in conjunction with multidimensional scaling (MDS), strongly outperforms three other methods, including a technique that has been used previously for scHi-C analysis. We also provide evidence that the HiCRep/MDS method is robust to extremely low per-cell sequencing depth, that this robustness is improved even further when high-coverage and low-coverage cells are projected together, and that the method can be used to jointly embed cells from multiple published datasets.
      PubDate: Wed, 27 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty285
      Issue No: Vol. 34, No. 13 (2018)
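Given a cell-by-cell dissimilarity matrix (e.g. derived from HiCRep similarity scores), classical MDS embeds the cells by double-centering the squared distances and taking the top eigenvectors. A compact NumPy sketch (classical/Torgerson MDS is one standard variant; the abstract does not specify which MDS flavor the authors use):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: return an (n, k) embedding whose pairwise Euclidean
    distances approximate the symmetric distance matrix D (zero diagonal)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # ascending eigenvalues
    idx = np.argsort(w)[::-1][:k]            # top-k components
    L = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * L
```

For distances that are exactly Euclidean, the embedding reproduces them; for HiCRep-derived dissimilarities it gives the best low-rank approximation in the classical-MDS sense.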
  • Improved genomic island predictions with IslandPath-DIMOB
    • Authors: Bertelli C; Brinkman F, Valencia A.
      Pages: 2161 - 2167
      Abstract: Motivation: Genomic islands (GIs) are clusters of genes of probable horizontal origin that play a major role in bacterial and archaeal genome evolution and microbial adaptability. They are of high medical and industrial interest, due to their enrichment in virulence factors, some antimicrobial resistance genes and adaptive metabolic pathways. The development of more sensitive yet precise prediction tools, using either sequence composition-based methods or comparative genomics, is needed as large-scale analyses of microbial genomes increase. Results: IslandPath-DIMOB, a leading GI prediction tool in the IslandViewer webserver, has now been significantly improved by modifying both the decision algorithm that determines sequence composition biases and the underlying database of HMM profiles for associated mobility genes. The accuracy of IslandPath-DIMOB and other major software has been assessed using a reference GI dataset predicted by comparative genomics, plus a manually curated dataset from a literature review. Compared with the previous version (v0.2.0), IslandPath-DIMOB v1.0.0 achieves an 11.7% and 5.3% increase in recall and precision, respectively. IslandPath-DIMOB has the highest Matthews correlation coefficient among the individual prediction methods tested, combining one of the highest recall measures (46.9%) with high precision (87.4%). The only method with higher recall had notably lower precision (55.1%). This new IslandPath-DIMOB v1.0.0 will facilitate more accurate studies of GIs, including their key roles in microbial adaptability of medical, environmental and industrial interest. Availability and implementation: IslandPath-DIMOB v1.0.0 is freely available through the IslandViewer webserver and as standalone software under the GNU-GPLv3. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty095
      Issue No: Vol. 34, No. 13 (2018)
  • IDP-denovo: de novo transcriptome assembly and isoform annotation by
           hybrid sequencing
    • Authors: Fu S; Ma Y, Yao H, et al.
      Pages: 2168 - 2176
      Abstract: Motivation: In recent years, long-read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been shown to substantially improve the quality of genome assembly and transcriptome characterization. Compared with the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. That is, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is in high demand. Results: Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads. Using the GM12878 human data as a gold standard, we demonstrated that IDP-denovo has superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered a number of novel genes and novel isoforms not reported in the existing annotation library. These results reveal the high diversity of gene isoforms in D. officinale. Availability and implementation: The dataset of Dendrobium officinale used/analyzed during the current study has been deposited in SRA, with accession code SRP094520. IDP-denovo is available for download at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty098
      Issue No: Vol. 34, No. 13 (2018)
  • Hierarchical analysis of RNA-seq reads improves the accuracy of
           allele-specific expression
    • Authors: Raghupathy N; Choi K, Vincent M, et al.
      Pages: 2177 - 2184
      Abstract: Motivation: Allele-specific expression (ASE) refers to the differential abundance of the allelic copies of a transcript. RNA sequencing (RNA-seq) can provide quantitative estimates of ASE for genes with transcribed polymorphisms. When short-read sequences are aligned to a diploid transcriptome, read-mapping ambiguities confound our ability to directly count reads. Multi-mapping reads, which align equally well to multiple genomic locations, isoforms or alleles, can comprise the majority (>85%) of reads. Discarding them can result in biases and substantial loss of information. Methods have been developed that use weighted allocation of read counts, but these methods treat the different types of multi-reads equivalently. We propose a hierarchical approach to the allocation of read counts that first resolves ambiguities among genes, then among isoforms, and lastly between alleles. We have implemented our model in the EMASE software (Expectation-Maximization for Allele-Specific Expression) to estimate total gene expression, isoform usage and ASE based on this hierarchical allocation. Results: Methods that align RNA-seq reads to a diploid transcriptome incorporating known genetic variants improve estimates of ASE and total gene expression compared with methods that use reference genome alignments. Weighted allocation methods outperform methods that discard multi-reads. Hierarchical allocation of reads improves estimation of ASE even when data are simulated from a non-hierarchical model. Analysis of RNA-seq data from F1 hybrid mice using EMASE reveals widespread ASE associated with cis-acting polymorphisms and a small number of parent-of-origin effects. Availability and implementation: EMASE software is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty078
      Issue No: Vol. 34, No. 13 (2018)
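The hierarchical allocation that EMASE describes — resolving read ambiguity among genes before resolving it between alleles — can be sketched as a toy EM loop. This is an illustrative simplification, not the published model: EMASE also handles isoforms and alignment qualities, and the read and gene names below are invented.

```python
# Toy hierarchical EM allocation of multi-mapping reads: each read's unit of
# weight is split first among candidate genes (proportional to current gene
# abundance), then between alleles within each gene.
from collections import defaultdict

def hierarchical_em(reads, n_iter=50):
    """reads: list of alignment sets, each a list of (gene, allele) pairs."""
    # initialise abundances uniformly over every (gene, allele) seen
    theta = defaultdict(float)
    for aligns in reads:
        for ga in aligns:
            theta[ga] += 1.0 / len(aligns)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for aligns in reads:
            # resolve ambiguity among genes first ...
            gene_tot = defaultdict(float)
            for g, a in aligns:
                gene_tot[g] += theta[(g, a)]
            z = sum(gene_tot.values())
            for g, a in aligns:
                # ... then between alleles within each gene
                g_share = gene_tot[g] / z
                a_share = theta[(g, a)] / gene_tot[g]
                counts[(g, a)] += g_share * a_share
        theta = counts
    return dict(theta)

# one read unique to geneA/allele a1, one read ambiguous between the alleles
reads = [[("geneA", "a1")],
         [("geneA", "a1"), ("geneA", "a2")]]
est = hierarchical_em(reads)
```

The unique read pulls the ambiguous read's weight toward allele a1 over the iterations, which is the qualitative behaviour the abstract claims for weighted (vs. discarded) multi-reads.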
  • The lncLocator: a subcellular localization predictor for long non-coding
           RNAs based on a stacked ensemble classifier
    • Authors: Cao Z; Pan X, Yang Y, et al.
      Pages: 2185 - 2194
      Abstract: Motivation: Long non-coding RNA (lncRNA) studies have become hot topics in the field of RNA biology. Recent studies have shown that the subcellular localizations of lncRNAs carry important information for understanding their complex biological functions. Considering the costly and time-consuming experiments required to identify the subcellular localization of lncRNAs, computational methods are urgently needed. However, to the best of our knowledge, no computational tools for predicting lncRNA subcellular locations exist to date. Results: In this study, we report an ensemble classifier-based predictor, lncLocator, for predicting lncRNA subcellular localizations. To fully exploit lncRNA sequence information, we adopt both k-mer features and high-level abstraction features generated by unsupervised deep models, and construct four classifiers by feeding these two types of features to a support vector machine (SVM) and a random forest (RF), respectively. We then use a stacked ensemble strategy to combine the four classifiers and obtain the final prediction results. The current lncLocator can predict five subcellular localizations of lncRNAs (cytoplasm, nucleus, cytosol, ribosome and exosome) and yields an overall accuracy of 0.59 on the constructed benchmark dataset. Availability and implementation: The lncLocator is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 15 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty085
      Issue No: Vol. 34, No. 13 (2018)
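The k-mer features that lncLocator feeds to its base classifiers are simply normalised substring frequencies over the RNA alphabet. A minimal sketch (the example sequence is invented; the published tool additionally uses learned high-level features and the stacked ensemble on top):

```python
# Normalised k-mer frequency vector for an RNA sequence, in a fixed
# alphabetical k-mer order so vectors are comparable across sequences.
from itertools import product

def kmer_features(seq, k=2):
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:          # skip windows with ambiguous bases
            counts[km] += 1
    total = max(1, len(seq) - k + 1)
    return [counts[km] / total for km in kmers]

vec = kmer_features("ACGUACGU", k=2)   # 16-dimensional for k=2
```

For k=2 this yields a 16-dimensional vector; lncLocator-style pipelines concatenate several k values before classification.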
  • ViCTree: an automated framework for taxonomic classification from protein
           sequences
    • Authors: Modha S; Thanki A, Cotmore S, et al.
      Pages: 2195 - 2200
      Abstract: Motivation: The increasing rate of submission of genetic sequences to public databases provides a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum-likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, a JavaScript-based visualization tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate its utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability and implementation: ViCTree is open source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available online under a GPL3 license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 20 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty099
      Issue No: Vol. 34, No. 13 (2018)
  • Enhancing protein fold determination by exploring the complementary
           information of chemical cross-linking and coevolutionary signals
    • Authors: dos Santos R; Ferrari A, de Jesus H, et al.
      Pages: 2201 - 2208
      Abstract: Motivation: Elucidating protein native states from amino acid sequences is a primary computational challenge. Modern computational and experimental methodologies, such as molecular coevolution and chemical cross-linking mass spectrometry, have extended protein structural characterization to previously intractable systems. Despite several independent successes, data from these distinct methodologies have not been systematically studied in conjunction. One challenge of structural inference using coevolution is that it is limited to sequence fragments within a conserved and unique domain for which sufficient sequence datasets are available. Therefore, coupling coevolutionary data with complementary distance constraints from orthogonal sources can add precision to structure prediction methodologies. Results: In this work, we present a methodology that combines residue interaction data obtained from coevolutionary information with cross-linking/mass spectrometry distance constraints in order to identify functional states of proteins. Using a combination of structure-based models (SBMs) with optimized Gaussian-like potentials, secondary structure estimation and simulated-annealing molecular dynamics, we provide an automated methodology for integrating constraint data from diverse sources in order to elucidate the native conformation of full protein systems with distinct complexity and structural topologies. We show that cross-linking mass spectrometry constraints improve the structure predictions obtained from SBMs and coevolution signals, and that the constraints obtained by each method have a useful degree of complementarity that promotes enhanced fold estimates. Availability and implementation: Scripts and procedures to implement the presented methodology are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty074
      Issue No: Vol. 34, No. 13 (2018)
  • LS-align: an atom-level, flexible ligand structural alignment algorithm
           for high-throughput virtual screening
    • Authors: Hu J; Liu Z, Yu D, et al.
      Pages: 2209 - 2218
      Abstract: Motivation: Sequence-order-independent structural comparison, also called structural alignment, of small ligand molecules is often needed for computer-aided virtual drug screening. Although many ligand structure alignment programs have been proposed, most of them build alignments based on rigid-body shape comparison, which can neither provide atom-specific alignment information nor allow structural variation; both abilities are critical to efficient high-throughput virtual screening. Results: We propose a novel ligand comparison algorithm, LS-align, to generate fast and accurate atom-level structural alignments of ligand molecules, through an iterative heuristic search of a target function that combines inter-atom distance with mass and chemical-bond comparisons. LS-align contains two modules, Rigid-LS-align and Flexi-LS-align, designed for rigid-body and flexible alignments, respectively, where a ligand-size-independent, statistics-based scoring function is developed to evaluate the similarity of ligand molecules relative to random ligand pairs. Large-scale benchmark tests were performed on prioritizing chemical ligands of 102 protein targets involving 1 415 871 candidate compounds from the DUD-E (Database of Useful Decoys: Enhanced) database, where LS-align achieves an average enrichment factor (EF) of 22.0 at the 1% cutoff and an AUC score of 0.75, significantly higher than other state-of-the-art methods. Detailed data analyses show that the advanced performance is mainly attributable to the design of the target function, which combines structural and chemical information to enhance the sensitivity of recognizing subtle differences between ligand molecules, and to the introduction of structural flexibility, which helps capture the conformational changes induced by ligand–receptor binding interactions. These data demonstrate a new avenue for improving virtual screening efficiency through the development of sensitive ligand structural alignments. Availability and implementation: LS-align is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 15 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty081
      Issue No: Vol. 34, No. 13 (2018)
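The enrichment factor (EF) quoted in the LS-align benchmark is a standard virtual-screening metric: the fraction of actives in the top-ranked x% of compounds, divided by the fraction of actives overall. A generic sketch (scores and counts below are invented, not the paper's data):

```python
# Enrichment factor at a fractional cutoff over a ranked screening list.
def enrichment_factor(scored, cutoff=0.01):
    """scored: list of (score, is_active) pairs; higher score = better rank."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    n_top = max(1, int(len(ranked) * cutoff))
    top_actives = sum(active for _, active in ranked[:n_top])
    total_actives = sum(active for _, active in ranked)
    return (top_actives / n_top) / (total_actives / len(ranked))

# 1000 compounds, 10 actives all ranked first -> maximal enrichment at 1%
scored = [(1.0, 1)] * 10 + [(0.0, 0)] * 990
ef = enrichment_factor(scored, cutoff=0.01)
```

With 1% of compounds being active, the best achievable EF at the 1% cutoff is 100, which puts the paper's reported average EF of 22.0 in context.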
  • Combining co-evolution and secondary structure prediction to improve
           fragment library generation
    • Authors: de Oliveira S; Deane C, Valencia A.
      Pages: 2219 - 2227
      Abstract: Motivation: Recent advances in co-evolution techniques have made possible the accurate prediction of protein structures in the absence of a template. Here, we provide a general approach that further utilizes co-evolution constraints to generate better fragment libraries for fragment-based protein structure prediction. Results: We have compared five different fragment-library generation programmes on three different datasets encompassing over 400 unique protein folds. We show that considering the secondary structure of the fragments when assembling these libraries provides a critical way to assess their usefulness to structure prediction. We then use co-evolution constraints to improve the fragment libraries by enriching them with fragments that satisfy constraints and discarding those that do not. These improved libraries have better precision and lead to consistently better modelling results. Availability and implementation: The data and the Flib-Coevo program are available for download online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 15 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty084
      Issue No: Vol. 34, No. 13 (2018)
  • Multiobjective multifactor dimensionality reduction to detect
           SNP–SNP interactions
    • Authors: Yang C; Chuang L, Lin Y, et al.
      Pages: 2228 - 2236
      Abstract: Motivation: Single-nucleotide polymorphism (SNP)–SNP interactions (SSIs) are popular markers for understanding disease susceptibility. Multifactor dimensionality reduction (MDR) can successfully detect considerable SSIs. Currently, MDR-based methods mainly adopt a single objective function (a single measure based on contingency tables) to detect SSIs. However, a single-measure function might not yield favorable results in general, owing to potential model preferences and disease complexities. Approach: This study proposes a multiobjective MDR (MOMDR) method that uses measures based on the contingency table of MDR as objective functions. MOMDR considers the incorporated measures, including correct classification and likelihood rates, to detect SSIs, and adopts set theory to predict the most favorable SSIs with cross-validation consistency. MOMDR enables multiple measures to be used simultaneously to determine potential SSIs. Results: Three simulation studies were conducted to compare the detection success rates of MOMDR and single-objective MDR (SOMDR), revealing that MOMDR had higher detection success rates than SOMDR. Furthermore, the Wellcome Trust Case Control Consortium dataset was analyzed by MOMDR to detect SSIs associated with coronary artery disease. Availability and implementation: MOMDR is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 19 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty076
      Issue No: Vol. 34, No. 13 (2018)
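One of the contingency-table measures MOMDR optimizes, the correct classification rate, can be sketched for a two-SNP table: each genotype-combination cell is labelled high- or low-risk by its case:control ratio, and the labelling is scored by balanced accuracy. This is a generic MDR-style sketch; the cell counts are invented and MOMDR combines this with a likelihood-rate objective.

```python
# Correct classification rate (CCR) from a genotype-combination table.
def ccr(table):
    """table: dict mapping a genotype combination -> (n_cases, n_controls)."""
    cases = sum(c for c, _ in table.values())
    ctrls = sum(c for _, c in table.values())
    tp = fp = 0
    for n_case, n_ctrl in table.values():
        # high-risk cell if its case:control ratio meets the overall ratio
        if n_case * ctrls >= n_ctrl * cases:
            tp += n_case
            fp += n_ctrl
    sens = tp / cases                 # cases falling in high-risk cells
    spec = (ctrls - fp) / ctrls       # controls falling in low-risk cells
    return (sens + spec) / 2

table = {("AA", "BB"): (30, 10),
         ("Aa", "Bb"): (10, 30),
         ("aa", "bb"): (10, 10)}
score = ccr(table)
```

A multiobjective search would keep SNP pairs that are non-dominated across this and the other incorporated measures, rather than ranking by CCR alone.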
  • CrossPlan: systematic planning of genetic crosses to validate mathematical
           models
    • Authors: Pratapa A; Adames N, Kraikivski P, et al.
      Pages: 2237 - 2244
      Abstract: Motivation: Mathematical models of cellular processes can systematically predict the phenotypes of novel combinations of multi-gene mutations. Searching for informative predictions and prioritizing them for experimental validation is challenging, since the number of possible combinations grows exponentially in the number of mutations. Moreover, keeping track of the crosses needed to make new mutants and planning sequences of experiments is unmanageable when the experimenter is deluged by hundreds of potentially informative predictions to test. Results: We present CrossPlan, a novel methodology for systematically planning genetic crosses to make a set of target mutants from a set of source mutants. We base our approach on a generic experimental workflow used in performing genetic crosses in budding yeast. We prove that the CrossPlan problem is NP-complete. We develop an integer linear program (ILP) to maximize the number of target mutants that we can make under certain experimental constraints. We apply our method to a comprehensive mathematical model of the protein regulatory network controlling cell division in budding yeast. We also extend our solution to incorporate other experimental conditions, such as a delay factor that determines the availability of a mutant, and genetic markers to confirm gene deletions. The experimental flow that underlies our work is quite generic, and our ILP-based algorithm is easy to modify. Hence, our framework should be relevant in plant and animal systems as well. Availability and implementation: CrossPlan code is freely available online under the GNU General Public Licence v3.0. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 08 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty072
      Issue No: Vol. 34, No. 13 (2018)
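The combinatorial structure behind CrossPlan can be illustrated with a brute-force sketch: crossing two mutants can yield progeny carrying the union of their gene deletions, and the planner searches for the fewest generations of crosses that reach a target genotype. CrossPlan itself formulates this (with experimental constraints) as an ILP; the gene names below are invented and this exhaustive search only scales to tiny examples.

```python
# Brute-force cross planning: how many generations of pairwise crosses are
# needed before a target multi-deletion genotype becomes reachable?
from itertools import combinations

def min_cross_generations(sources, target, max_gens=4):
    """Genotypes are frozensets of deleted genes; a cross yields the union."""
    have = set(sources)
    for gen in range(max_gens + 1):
        if target in have:
            return gen
        new = {a | b for a, b in combinations(have, 2)}
        if new <= have:
            return None            # no progress: target unreachable
        have |= new
    return None

sources = [frozenset({"cln1"}), frozenset({"cln2"}), frozenset({"clb5"})]
n = min_cross_generations(sources, frozenset({"cln1", "cln2", "clb5"}))
```

Reaching the triple mutant needs two generations here (two singles crossed to a double, then the double crossed with the third single), which hints at why the general maximization problem over many targets and limited crosses becomes NP-complete.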
  • flowLearn: fast and precise identification and quality checking of cell
           populations in flow cytometry
    • Authors: Lux M; Brinkman R, Chauve C, et al.
      Pages: 2245 - 2253
      Abstract: Motivation: Identification of cell populations in flow cytometry is a critical part of the analysis and lays the groundwork for many applications and research discoveries. The current paradigm of manual analysis is time-consuming and subjective. A common goal of users is to replace manual analysis with automated methods that replicate their results. Supervised tools provide the best performance in such a use case; however, they require fine parameterization to obtain the best results. Hence, there is a strong need for methods that are fast to set up, accurate and interpretable. Results: flowLearn is a semi-supervised approach for the quality-checked identification of cell populations. Using a very small number of manually gated samples, it is able to predict gates on other samples with high accuracy and speed through density alignments. On two state-of-the-art datasets, our tool achieves median F1-measures exceeding 0.99 for 31%, and 0.90 for 80%, of all analyzed populations. Furthermore, users can directly interpret and adjust automated gates on new sample files to iteratively improve the initial training. Availability and implementation: flowLearn is available as an R package, and evaluation data are publicly available online. Details can be found in the Supplementary Material. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 15 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty082
      Issue No: Vol. 34, No. 13 (2018)
  • pBRIT: gene prioritization by correlating functional and phenotypic
           annotations through integrative data fusion
    • Authors: Kumar A; Van Laer L, Alaerts M, et al.
      Pages: 2254 - 2262
      Abstract: Motivation: Computational gene prioritization can aid disease-gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information-Theoretic model), a novel adaptive and scalable prioritization tool that integrates PubMed abstracts, Gene Ontology, sequence similarities, Mammalian and Human Phenotype Ontology, Pathway, Interactions, Disease Ontology, the Gene Association database and the Human Genome Epidemiology database into the prediction model. We explore and address the effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations. Results: pBRIT models feature dependencies and sparsity with an information-theoretic (data-driven) approach and applies intermediate-integration-based data fusion. Following the hypothesis that genes underlying similar diseases will share functional and phenotype characteristics, it incorporates Bayesian Ridge regression to learn a linear mapping between functional and phenotype annotations. Genes are prioritized on phenotypic concordance with the training genes. We evaluated pBRIT against nine existing methods, and on over 2000 HPO–gene associations retrieved after construction of the pBRIT data sources. We achieve maximum AUC scores ranging from 0.92 to 0.96 against benchmark datasets, and of 0.80 against the time-stamped HPO entries, indicating good performance with high sensitivity and specificity. Our model shows stable performance with regard to changes in the underlying annotation data, and is fast and scalable for implementation in routine pipelines. Availability and implementation: pBRIT is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 14 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty079
      Issue No: Vol. 34, No. 13 (2018)
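The linear mapping at the core of pBRIT is ridge (penalised) regression. Its closed form is easiest to see in one dimension, where the penalty simply shrinks the ordinary least-squares slope. A sketch on invented numbers; the actual tool works in a high-dimensional fused annotation space with a Bayesian treatment of the penalty.

```python
# One-dimensional ridge regression: w = (x.x + lam)^(-1) x.y for a single
# centred feature; lam > 0 shrinks the slope toward zero.
def ridge_1d(x, y, lam=1.0):
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sum(xi * xi for xi in x) + lam
    return num / den

x = [-2.0, -1.0, 1.0, 2.0]
y = [-4.0, -2.0, 2.0, 4.0]    # exactly y = 2x
w = ridge_1d(x, y, lam=0.0)   # no penalty recovers the slope 2
w_shrunk = ridge_1d(x, y, lam=10.0)
```

The shrinkage is what keeps the mapping stable when annotation features are sparse and correlated, which is the regime the abstract describes.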
  • ARGs-OAP v2.0 with an expanded SARG database and Hidden Markov Models for
           enhancement characterization and quantification of antibiotic resistance
           genes in environmental metagenomes
    • Authors: Yin X; Jiang X, Chai B, et al.
      Pages: 2263 - 2270
      Abstract: Motivation: Much global attention has been paid to antibiotic resistance, in monitoring its emergence, accumulation and dissemination. For rapid characterization and quantification of antibiotic resistance genes (ARGs) in metagenomic datasets, an online analysis pipeline, ARGs-OAP, has been developed, consisting of a database termed Structured Antibiotic Resistance Genes (SARG) with a hierarchical structure (ARG type–subtype–reference sequence). Results: The new release of the database, termed SARG version 2.0, contains sequences not only from the CARD and ARDB databases, but also carefully selected and curated sequences from the latest protein collection of the NCBI-NR database, to keep up to date with the increasing number of deposited ARG sequences. SARG v2.0 has three times as many sequences as the first version and demonstrates improved coverage of ARG detection in metagenomes from various environmental samples. In addition to annotation of high-throughput raw reads using a similarity-search strategy, ARGs-OAP v2.0 now provides model-based identification of assembled sequences using SARGfam, a high-quality profile Hidden Markov Model (HMM) containing profiles of ARG subtypes. Additionally, ARGs-OAP v2.0 improves cell-number quantification by using the average coverage of essential single-copy marker genes, as an option in addition to the previous method based on the 16S rRNA gene. Availability and implementation: ARGs-OAP can be accessed online, and the database can be downloaded from the same site. Source code for this study is also available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 02 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty053
      Issue No: Vol. 34, No. 13 (2018)
  • WDL-RF: predicting bioactivities of ligand molecules acting with G
           protein-coupled receptors by combining weighted deep learning and random
           forest
    • Authors: Wu J; Zhang Q, Wu W, et al.
      Pages: 2271 - 2282
      Abstract: Motivation: Precise assessment of ligand bioactivities (including IC50, EC50, Ki, Kd, etc.) is essential for virtual screening and lead-compound identification. However, not all ligands have experimentally determined activities. In particular, many G protein-coupled receptors (GPCRs), which form the largest integral-membrane protein family and represent the targets of nearly 40% of drugs on the market, lack published experimental data about ligand interactions. Computational methods that can accurately predict the bioactivity of ligands can help address this problem efficiently. Results: We propose a new method, WDL-RF, using weighted deep learning and random forest, to model the bioactivity of GPCR-associated ligand molecules. The pipeline of our algorithm consists of two consecutive stages: (i) molecular-fingerprint generation through a new weighted deep-learning method, and (ii) bioactivity calculation with a random-forest model; one unique aspect of the approach is that the model allows end-to-end learning of prediction pipelines with input ligands of arbitrary size. The method was tested on a set of twenty-six non-redundant GPCRs that have a high number of active ligands, each with 200–4000 ligand associations. The results of our benchmark show that WDL-RF can generate bioactivity predictions with an average root-mean-square error of 1.33 and a squared correlation coefficient (r2) of 0.80 relative to the experimental measurements, significantly more accurate than control predictors built on different molecular fingerprints and descriptors. In particular, the data-driven molecular-fingerprint features extracted by the weighted deep-learning models can help overcome deficiencies stemming from the use of traditional hand-crafted features, and significantly increase the efficiency of short molecular fingerprints in virtual screening. Availability and implementation: The WDL-RF web server, as well as the source code and datasets of WDL-RF, is freely available online for academic purposes. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 08 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty070
      Issue No: Vol. 34, No. 13 (2018)
  • RaMWAS: fast methylome-wide association study pipeline for enrichment
           platforms
    • Authors: Shabalin A; Hattab M, Clark S, et al.
      Pages: 2283 - 2285
      Abstract: Motivation: Enrichment-based technologies can provide measurements of DNA methylation at tens of millions of CpGs for thousands of samples. Existing tools for methylome-wide association studies cannot analyze datasets of this size and lack important features such as principal component analysis, combined analysis with SNP data, and outcome predictions based on all informative methylation sites. Results: We present a Bioconductor R package called RaMWAS with a full set of tools for large-scale methylome-wide association studies. It is free, cross-platform, open source, memory efficient and fast. Availability and implementation: The release version and vignettes with a small case study are available from Bioconductor; a development version is also available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 12 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty069
      Issue No: Vol. 34, No. 13 (2018)
  • chromswitch: a flexible method to detect chromatin state switches
    • Authors: Jessa S; Kleinman C, Hancock J.
      Pages: 2286 - 2288
      Abstract: Summary: Chromatin state plays a major role in controlling gene expression, and comparative analysis of ChIP-seq data is key to understanding epigenetic regulation. We present chromswitch, an R/Bioconductor package that integrates epigenomic data in a defined window of interest to detect an overall switch in chromatin state. Chromswitch accurately classifies a benchmarking dataset, and when applied genome-wide, the tool successfully detects chromatin changes that result in brain-specific expression. Availability and implementation: Chromswitch is implemented as an R package available from Bioconductor. All data and code for reproducing the analysis presented in this paper are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 09 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty075
      Issue No: Vol. 34, No. 13 (2018)
  • IWTomics: testing high-resolution sequence-based ‘Omics’ data at
           multiple locations and scales
    • Authors: Cremona M; Pini A, Cumbo F, et al.
      Pages: 2289 - 2291
      Abstract: Summary: With the increased generation of high-resolution sequence-based ‘Omics’ data, detecting statistically significant effects at different genomic locations and scales has become key to addressing several scientific questions. IWTomics is an R/Bioconductor package (integrated in Galaxy) that, by exploiting sophisticated Functional Data Analysis techniques (i.e. statistical techniques for the analysis of curves), allows users to pre-process, visualize and test these data at multiple locations and scales. The package provides a friendly, flexible and complete workflow that can be employed in many genomic and epigenomic applications. Availability and implementation: IWTomics is freely available at the Bioconductor website and on the main Galaxy instance. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 20 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty090
      Issue No: Vol. 34, No. 13 (2018)
  • SEED 2: a user-friendly platform for amplicon high-throughput sequencing
           data analyses
    • Authors: Větrovský T; Baldrian P, Morais D, et al.
      Pages: 2292 - 2294
      Abstract: Motivation: Modern molecular methods have increased our ability to describe microbial communities. Along with the advances brought by new sequencing technologies, we now require intensive computational resources to make sense of the large numbers of sequences continuously produced. The software developed by the scientific community to address this demand, although very useful, requires experience of the command-line environment and extensive training, and has a steep learning curve, limiting its use. We created SEED 2, a graphical user interface for handling high-throughput amplicon-sequencing data under Windows operating systems. Results: SEED 2 is the only sequence visualizer that empowers users with tools to handle amplicon-sequencing data of microbial community markers. It is suitable for any marker-gene sequences obtained through Illumina, IonTorrent or Sanger sequencing. SEED 2 allows the user to process raw sequencing data, identify specific taxa, produce OTU tables, create sequence alignments and construct phylogenetic trees. A standard dual-core laptop with 8 GB of RAM can handle ca. 8 million Illumina PE 300 bp sequences (ca. 4 GB of data). Availability and implementation: SEED 2 was implemented in Object Pascal and uses internal functions and external software for amplicon data processing. SEED 2 is freeware, available online as a self-contained file including all dependencies, and does not require installation. Supplementary information: Supplementary data, containing a comprehensive list of supported functions, are available at Bioinformatics online.
      PubDate: Wed, 14 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty071
      Issue No: Vol. 34, No. 13 (2018)
  • SecretSanta: flexible pipelines for functional secretome prediction
    • Authors: Gogleva A; Drost H, Schornack S, et al.
      Pages: 2295 - 2296
      Abstract: Motivation: The secretome denotes the collection of secreted proteins exported outside of the cell. The functional roles of secreted proteins include the maintenance and remodelling of the extracellular matrix, as well as signalling between host and non-host cells. These features make secretomes rich reservoirs of biomarkers for disease classification and host–pathogen interaction studies. Common biomarkers are extracellular proteins secreted via classical pathways, which can be predicted from sequence by annotating the presence or absence of N-terminal signal peptides. Several heterogeneous command-line tools and web interfaces exist to identify individual motifs, signal sequences and domains that are either characteristic of or strictly excluded from secreted proteins. However, a single flexible secretome-prediction workflow that combines all analytic steps is still missing. Results: To bridge this gap, the SecretSanta package implements wrapper and parser functions around established command-line tools for the integrative prediction of extracellular proteins secreted via classical pathways. The modularity of SecretSanta enables users to create tailored pipelines and apply them across the whole tree of life to facilitate comparison of secretomes across multiple species or under various conditions. Availability and implementation: SecretSanta is implemented in the R programming language and is released under the GPL-3 license. All functions have been optimized and parallelized to allow large-scale processing of sequences. The open-source code, installation instructions and a vignette with use-case scenarios can be downloaded online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 16 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty088
      Issue No: Vol. 34, No. 13 (2018)
  • SubRecon: ancestral reconstruction of amino acid substitutions along a
           branch in a phylogeny
    • Authors: Monit C; Goldstein R, Kelso J.
      Pages: 2297 - 2299
      Abstract: Summary: Existing ancestral sequence reconstruction techniques are ill-suited to investigating substitutions on a single branch of interest. We present SubRecon, an implementation of a hybrid technique integrating joint and marginal reconstruction for protein sequence data. SubRecon calculates the joint probability of states at adjacent internal nodes in a phylogeny, i.e. how the state has changed along a branch. This does not condition on states at other internal nodes, and it includes site rate variation. Simulation experiments show the technique to be accurate and powerful. SubRecon has a user-friendly command-line interface and produces concise output that is intuitive yet suitable for subsequent parsing in an automated pipeline. Availability and implementation: SubRecon is platform independent, requiring Java v1.8 or above. Source code, installation instructions and an example dataset are freely available online under the Apache 2.0 license.
      PubDate: Wed, 28 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty101
      Issue No: Vol. 34, No. 13 (2018)
  • PhyloMAd: efficient assessment of phylogenomic model adequacy
    • Authors: Duchêne D; Duchêne S, Ho S, et al.
      Pages: 2300 - 2301
      Abstract: Summary: Statistical phylogenetic inference plays an important role in evolutionary biology. The accuracy of phylogenetic methods relies on having suitable models of the evolutionary process. Various tools allow comparisons of candidate phylogenetic models, but assessing the absolute performance of models remains a considerable challenge. We introduce PhyloMAd, a user-friendly application for assessing the adequacy of commonly used models of nucleotide substitution and among-lineage rate variation. Our software implements a fast, likelihood-based method of model assessment that is tractable for analyses of large multi-locus datasets. PhyloMAd provides a means of informing model improvement, or of selecting data to enhance the evolutionary signal in phylogenomic analyses. Availability and implementation: PhyloMAd, together with a manual, a tutorial and the source code, is freely available from its GitHub repository.
      PubDate: Wed, 21 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty103
      Issue No: Vol. 34, No. 13 (2018)
  • StructureMapper: a high-throughput algorithm for analyzing protein
           sequence locations in structural data
    • Authors: Nurminen A; Hytönen V, Valencia A.
      Pages: 2302 - 2304
      Abstract: Motivation: StructureMapper is a high-throughput algorithm for automated mapping of protein primary amino acid sequence locations to existing three-dimensional protein structures. The algorithm is intended to facilitate easy and efficient utilization of structural information in protein characterization and proteomics. StructureMapper provides an analysis of the identified structural locations that includes surface accessibility, flexibility, protein–protein interfacing, intrinsic-disorder prediction, secondary-structure assignment, biological-assembly information and sequence-identity percentages, among other metrics. Results: We have showcased the use of the algorithm by estimating the structural coverage of the human proteome, identifying critical interface residues in DNA polymerase γ, structurally profiling protease cleavage sites and post-translational modification sites, and identifying putative novel phosphoswitches. Availability and implementation: The StructureMapper algorithm is available as an online service and as a standalone implementation. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 14 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty086
      Issue No: Vol. 34, No. 13 (2018)
  • GAIT: Gene expression Analysis for Interval Time
    • Authors: Kim Y; Kang Y, Seok J, et al.
      Pages: 2305 - 2307
      Abstract: Motivation: Despite its potential usefulness, association analysis of gene expression with the interval time between two events has been hampered because the occurrence of the events can be censored and conventional survival analysis is not suited to handling two censored events. However, recent advances in multivariate survival analysis, which consider multiple censored events together, provide an unprecedented opportunity to address this problem. Based on these advances, we have developed a software tool, GAIT, for the association analysis of gene expression with the interval time between two events. Results: The performance of GAIT was demonstrated by simulation studies and a real data analysis. The results indicate the usefulness of GAIT in a wide range of biomedical applications. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 02 Mar 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty111
      Issue No: Vol. 34, No. 13 (2018)
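      The data setting GAIT addresses can be illustrated with a toy sketch: each subject has two event times, either of which may be censored, and the quantity of interest is the interval between them, which is fully observed only when both events are observed. The field names below are invented for illustration and are not GAIT's interface:

```python
# Toy illustration (not the GAIT implementation): derive per-subject interval
# times between two events, with an interval-level censoring indicator.

def interval_times(subjects):
    out = []
    for s in subjects:
        interval = s["t2"] - s["t1"]
        observed = s["obs1"] and s["obs2"]  # both events seen -> uncensored
        out.append((interval, observed))
    return out

cohort = [
    {"t1": 2.0, "obs1": True, "t2": 5.5, "obs2": True},   # interval observed
    {"t1": 1.0, "obs1": True, "t2": 4.0, "obs2": False},  # second event censored
]
print(interval_times(cohort))  # [(3.5, True), (3.0, False)]
```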
  • Bacmeta: simulator for genomic evolution in bacterial metapopulations
    • Authors: Sipola A; Marttinen P, Corander J, et al.
      Pages: 2308 - 2310
      Abstract: Summary: The advent of genomic data from densely sampled bacterial populations has created a need for flexible simulators by which models and hypotheses can be efficiently investigated in the light of empirical observations. Bacmeta provides fast stochastic simulation of neutral evolution within a large collection of interconnected bacterial populations with a completely adjustable connectivity network. Stochastic events of mutation, recombination, insertion/deletion, migration and micro-epidemics can be simulated in discrete non-overlapping generations with a Wright–Fisher model that operates on explicit sequence data of any desired genome length. Each model component, including locus, bacterial strain, population and ultimately the whole metapopulation, is efficiently simulated using C++ objects, and detailed metadata from each level can be acquired. The software can be executed in a cluster environment using simple textual input files, enabling, e.g., large-scale simulations and likelihood-free inference. Availability and implementation: Bacmeta is implemented in C++ for Linux, Mac and Windows, and is available under the BSD 3-clause license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 20 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty093
      Issue No: Vol. 34, No. 13 (2018)
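      The Wright–Fisher machinery the abstract describes can be sketched in miniature: discrete non-overlapping generations, constant population size, and per-site neutral mutation on explicit sequences. This is a toy in Python rather than Bacmeta's C++ implementation, and the parameter names are assumptions:

```python
# Minimal Wright-Fisher sketch with per-site mutation on explicit sequences,
# in the spirit of (but far simpler than) Bacmeta: each offspring copies a
# uniformly chosen parent, then each site mutates with probability mu.
import random

BASES = "ACGT"

def mutate(seq, mu, rng):
    # A mutated site is replaced by one of the three other bases.
    return "".join(rng.choice(BASES.replace(b, "")) if rng.random() < mu else b
                   for b in seq)

def wright_fisher(pop, generations, mu, rng):
    for _ in range(generations):
        pop = [mutate(rng.choice(pop), mu, rng) for _ in range(len(pop))]
    return pop

rng = random.Random(42)
pop0 = ["ACGTACGT"] * 20  # monomorphic starting population
pop = wright_fisher(pop0, generations=50, mu=0.001, rng=rng)
print(len(set(pop)), "distinct haplotypes after 50 generations")
```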
  • ShinyKGode: an interactive application for ODE parameter inference using
           gradient matching
    • Authors: Wandy J; Niu M, Giurghita D, et al.
      Pages: 2314 - 2315
      Abstract: Motivation: Mathematical modelling based on ordinary differential equations (ODEs) is widely used to describe the dynamics of biological systems, particularly in systems and pathway biology. Often the kinetic parameters of these ODE systems are unknown and have to be inferred from the data. Approximate parameter inference methods based on gradient matching, which do not require computationally expensive numerical integration of the ODEs, have gained popularity in recent years, but many implementations are difficult to run without expert knowledge. Here, we introduce ShinyKGode, an interactive web application for fast parameter inference on ODEs using gradient matching. Results: ShinyKGode can be used to infer ODE parameters from simulated and observed data using gradient matching. Users can easily load their own models in Systems Biology Markup Language format, and a set of pre-defined ODE benchmark models is provided in the application. Inferred parameters are visualized alongside diagnostic plots to assess convergence. Availability and implementation: The R package for ShinyKGode can be installed through the Comprehensive R Archive Network (CRAN); installation instructions, tutorial videos and source code are also available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 27 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty089
      Issue No: Vol. 34, No. 13 (2018)
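      The idea behind gradient matching can be shown on the simplest possible case. For dx/dt = -k·x, finite-difference gradients of the observed trajectory are matched to the ODE right-hand side by least squares, so k is estimated without any numerical integration. This is an illustrative sketch, not the kernel-based method the ShinyKGode application wraps:

```python
# Gradient-matching sketch for dx/dt = -k*x: fit k so that -k*x best matches
# the finite-difference gradient of the trajectory, in the least-squares sense.
import math

def estimate_k(ts, xs):
    # Central finite-difference gradient at interior time points.
    grads = [(xs[i + 1] - xs[i - 1]) / (ts[i + 1] - ts[i - 1])
             for i in range(1, len(xs) - 1)]
    mids = xs[1:-1]
    # Least squares for g ~ -k*x  =>  k = -sum(g*x) / sum(x*x)
    num = sum(g * x for g, x in zip(grads, mids))
    den = sum(x * x for x in mids)
    return -num / den

k_true = 0.7
ts = [0.05 * i for i in range(60)]
xs = [math.exp(-k_true * t) for t in ts]
print(round(estimate_k(ts, xs), 3))  # close to 0.7, with no ODE solver involved
```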
  • BEL2ABM: agent-based simulation of static models in Biological Expression Language
    • Authors: Gündel M; Hoyt C, Hofmann-Apitius M, et al.
      Pages: 2316 - 2318
      Abstract: Summary: While cause-and-effect knowledge assembly models encoded in Biological Expression Language (BEL) are able to support the generation of mechanistic hypotheses, they are static and limited in their ability to encode temporality. Here, we present BEL2ABM, software for producing continuous, dynamic, executable agent-based models from BEL templates. Availability and implementation: The tool has been developed in Java and NetLogo. Code, data and documentation are available under the Apache 2.0 License. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty107
      Issue No: Vol. 34, No. 13 (2018)
  • MARSI: metabolite analogues for rational strain improvement
    • Authors: Cardoso J; Zeidan A, Jensen K, et al.
      Pages: 2319 - 2321
      Abstract: Summary: Metabolite analogues (MAs) mimic the structure of native metabolites, can competitively inhibit their utilization in enzymatic reactions, and are commonly used as selection tools for isolating desirable mutants of industrial microorganisms. Genome-scale metabolic models representing all biochemical reactions in an organism can be used to predict the effects of MAs on cellular phenotypes. Here, we present the metabolite analogues for rational strain improvement (MARSI) framework. MARSI provides a rational approach to strain improvement by searching for metabolites as targets instead of genes or reactions. The designs found by MARSI can be implemented by supplying MAs in the culture media, enabling metabolic rewiring without recombinant DNA technologies, which cannot always be used due to regulations. To facilitate experimental implementation, MARSI provides tools to identify candidate MAs for a target metabolite from a database of known drugs and analogues. Availability and implementation: MARSI is implemented in Python; the code is freely available under the Apache License V2. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty108
      Issue No: Vol. 34, No. 13 (2018)
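      The target-search idea, picking a metabolite whose competitive inhibition would shut down undesired reactions while sparing ones that must keep running, can be caricatured with plain set logic. This toy ignores the genome-scale flux simulations MARSI actually performs; all reaction and metabolite names below are invented:

```python
# Toy set-logic sketch (not the MARSI algorithm): a candidate target metabolite
# participates in every reaction we want to block but in none we must preserve,
# so a competitive analogue of it could selectively inhibit the targets.

def candidate_targets(reactions, to_block, to_preserve):
    blockers = set.intersection(*(set(reactions[r]) for r in to_block))
    protected = set().union(*(set(reactions[r]) for r in to_preserve))
    return sorted(blockers - protected)

model = {
    "R1": ["pyruvate", "nadh"],
    "R2": ["pyruvate", "coa"],
    "R3": ["nadh", "atp"],
}
print(candidate_targets(model, to_block={"R1", "R2"}, to_preserve={"R3"}))
# ['pyruvate']  (nadh is excluded: it is shared with the preserved reaction R3)
```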
  • nVenn: generalized, quasi-proportional Venn and Euler diagrams
    • Authors: Pérez-Silva J; Araujo-Voces M, Quesada V, et al.
      Pages: 2322 - 2324
      Abstract: Motivation: Venn and Euler diagrams are extensively used for the visualization of relationships between experiments and datasets. However, representing more than three datasets while keeping the proportions of each region is still not feasible with existing tools. Results: We present an algorithm to render all the regions of a generalized n-dimensional Venn diagram while keeping the area of each region approximately proportional to the number of elements it contains. In addition, missing regions in Euler diagrams lead to simplified representations. The algorithm generates an n-dimensional Venn diagram and inserts circles of given areas in each region. The diagram is then rearranged with a dynamic, self-correcting simulation in which each set border is contracted until it contacts the circles inside. The algorithm is implemented in a C++ tool (nVenn), with or without a web interface. Availability and implementation: The source code and pre-compiled binaries of nVenn are freely available, and a web interface for up to six sets can be accessed online; the web interface also provides the ability to analyze the regions of the diagram. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty109
      Issue No: Vol. 34, No. 13 (2018)
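      The quantities a proportional diagram must encode are easy to compute even though the layout is hard: an n-set Venn diagram has 2^n - 1 regions, one per non-empty combination of sets, each sized by the elements belonging to exactly that combination. The sketch below computes those region sizes only; nVenn's geometric simulation is not attempted:

```python
# Region sizes for a generalized Venn diagram: for every non-empty combination
# of sets, count the elements inside all sets of the combination and outside
# all the others. These counts are what a proportional layout must represent.
from itertools import combinations

def region_sizes(sets):
    names = sorted(sets)
    regions = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            inside = set.intersection(*(sets[n] for n in combo))
            outside = set().union(*(sets[n] for n in names if n not in combo))
            regions[combo] = len(inside - outside)
    return regions

sets = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
for combo, size in sorted(region_sizes(sets).items()):
    print("&".join(combo), size)
```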
  • MutHTP: mutations in human transmembrane proteins
    • Authors: Kulandaisamy A; Binny Priya S, Sakthivel R, et al.
      Pages: 2325 - 2326
      Abstract: Motivation: Existing sources of experimental mutation data neither consider the structural environment of amino acid substitutions nor distinguish between soluble and membrane proteins. They also suffer from a number of further limitations, including data redundancy, lack of disease classification, incompatible information content and ambiguous annotations (e.g. the same mutation being annotated as both disease-associated and benign). Results: We have developed a novel database, MutHTP, which contains information on 183 395 disease-associated and 17 827 neutral mutations in human transmembrane proteins. For each mutation site, MutHTP provides a description of its location with respect to the membrane protein topology, its structural environment (if available) and its functional features. Comprehensive visualization, search, display and download options are available. Availability and implementation: The database is publicly available online. The website is implemented using HTML, PHP and JavaScript and supports recent versions of all major browsers, such as Firefox, Chrome and Opera. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 01 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty054
      Issue No: Vol. 34, No. 13 (2018)
  • PartsGenie: an integrated tool for optimizing and sharing synthetic
           biology parts
    • Authors: Swainston N; Dunstan M, Jervis A, et al.
      Pages: 2327 - 2329
      Abstract: Motivation: Synthetic biology is typified by developing novel genetic constructs from the assembly of reusable synthetic DNA parts, which contain one or more features such as promoters, ribosome binding sites, coding sequences and terminators. PartsGenie is introduced to facilitate the computational design of such synthetic biology parts, bridging the gap between optimization tools for the design of novel parts, the representation of such parts in community-developed data standards such as the Synthetic Biology Open Language, and their sharing in journal-recommended data repositories. Consisting of a drag-and-drop web interface, a number of DNA optimization algorithms and an interface to the widely used data repository JBEI ICE, PartsGenie facilitates the design, optimization and dissemination of reusable synthetic biology parts through an integrated application. Availability and implementation: PartsGenie is freely available online.
      PubDate: Fri, 23 Feb 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty105
      Issue No: Vol. 34, No. 13 (2018)
  • 2018 ISCB Overton Prize awarded to Cole Trapnell
    • Authors: Fogg C; Kovats D, Shamir R.
      Pages: 2330 - 2331
      Abstract: Each year the International Society for Computational Biology (ISCB) recognizes the achievements of an early- to mid-career scientist with the Overton Prize. The prize was established in memory of Dr. G. Christian Overton, a respected computational biologist and founding ISCB Board member, following his untimely death. The Overton Prize recognizes independent investigators who are in the early to middle phases of their careers and who are selected for their significant contributions to computational biology through research, teaching and service.
      PubDate: Sat, 02 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty360
      Issue No: Vol. 34, No. 13 (2018)
  • Message from the ISCB: 2018 ISCB Accomplishments by a Senior Scientist
    • Authors: Fogg C; Kovats D, Shamir R.
      Pages: 2332 - 2333
      Abstract: Every year ISCB recognizes a leader in the computational biology and bioinformatics fields with the Accomplishments by a Senior Scientist Award. This is the highest award bestowed by ISCB in recognition of a scientist’s significant research, education and service contributions. Ruth Nussinov, Senior Principal Scientist and Principal Investigator at the National Cancer Institute, National Institutes of Health and Professor Emeritus in the Department of Human Molecular Genetics & Biochemistry, School of Medicine at Tel Aviv University, Israel is being honored as the 2018 winner of the Accomplishment by a Senior Scientist Award. She will receive her award and present a keynote address at ISCB’s premiere annual meeting, the 2018 Intelligent Systems for Molecular Biology (ISMB) conference in Chicago, IL being held on July 6–10, 2018.
      PubDate: Sat, 02 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty284
      Issue No: Vol. 34, No. 13 (2018)
  • Message from the ISCB: 2018 Outstanding Contributions to ISCB Award: Russ Altman
    • Authors: Fogg C; Kovats D, Shamir R.
      Pages: 2334 - 2335
      Abstract: The Outstanding Contributions to International Society for Computational Biology (ISCB) Award was introduced in 2015 to recognize Society members who have made lasting and beneficial contributions through their leadership, service and educational work or a combination of these areas. Russ Altman, Kenneth Fong Professor and Professor of Bioengineering, of Genetics, of Medicine (General Medicine Discipline), of Biomedical Data Science and, by courtesy, of Computer Science, is the 2018 winner of the Outstanding Contributions to ISCB Award and will be recognized at the 2018 Intelligent Systems for Molecular Biology (ISMB) meeting in Chicago, IL being held on July 6–10, 2018.
      PubDate: Sat, 02 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty308
      Issue No: Vol. 34, No. 13 (2018)
  • 2018 ISCB Innovator Award recognizes M. Madan Babu
    • Authors: Fogg C; Kovats D, Shamir R.
      Pages: 2336 - 2337
      Abstract: The ISCB Innovator Award recognizes an ISCB scientist who is within two decades of having completed his or her graduate degree and has consistently made outstanding contributions to the field of computational biology. The 2018 winner is Dr. M. Madan Babu, Programme Leader at the MRC Laboratory of Molecular Biology, Cambridge, UK. Madan will receive his award and deliver a keynote presentation at the 2018 International Conference on Intelligent Systems for Molecular Biology in Chicago, Illinois being held on July 6–10, 2018.
      PubDate: Sat, 02 Jun 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty328
      Issue No: Vol. 34, No. 13 (2018)
  • Using combined evidence from replicates to evaluate ChIP-seq peaks
    • Authors: Jalili V; Matteucci M, Masseroli M, et al.
      Pages: 2338 - 2338
      Abstract: Bioinformatics (2015)
      PubDate: Tue, 13 Mar 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty119
      Issue No: Vol. 34, No. 13 (2018)
  • Accurate mapping of tRNA reads
    • Authors: Hoffmann A; Fallmann J, Vilardo E, et al.
      Pages: 2339 - 2339
      Abstract: Bioinformatics (2017)
      PubDate: Tue, 13 Mar 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty118
      Issue No: Vol. 34, No. 13 (2018)
  • CGManalyzer: an R package for analyzing continuous glucose monitoring
    • Authors: Zhang X; Zhang Z, Wang D.
      Pages: 2340 - 2340
      Abstract: Bioinformatics (2018)
      PubDate: Wed, 14 Mar 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty144
      Issue No: Vol. 34, No. 13 (2018)
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
JournalTOCs © 2009-