Journal Prestige (SJR): 6.14
Citation Impact (CiteScore): 8
  Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press
    • PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz494
      Issue No: Vol. 35, No. 14 (2019)
  • Learning a mixture of microbial networks using minorization–maximization
    • Authors: Tavakoli S; Yooseph S.
      Abstract: Motivation: The interactions among the constituent members of a microbial community play a major role in determining the overall behavior of the community and the abundance levels of its members. These interactions can be modeled using a network whose nodes represent microbial taxa and edges represent pairwise interactions. A microbial network is typically constructed from a sample-taxa count matrix that is obtained by sequencing multiple biological samples and identifying taxa counts. From large-scale microbiome studies, it is evident that microbial community compositions and interactions are impacted by environmental and/or host factors. Thus, it is not unreasonable to expect that a sample-taxa matrix generated as part of a large study involving multiple environmental or clinical parameters can be associated with more than one microbial network. However, to our knowledge, microbial network inference methods proposed thus far assume that the sample-taxa matrix is associated with a single network. Results: We present a mixture model framework to address the scenario in which the sample-taxa matrix is associated with K microbial networks. This count matrix is modeled using a mixture of K Multivariate Poisson Log-Normal distributions and parameters are estimated using a maximum likelihood framework. Our parameter estimation algorithm is based on the minorization–maximization principle combined with gradient ascent and block updates. Synthetic datasets were generated to assess the performance of our approach on absolute count data, compositional data and normalized data. We also addressed the recovery of sparse networks based on an l1-penalty model. Availability and implementation: MixMPLN is implemented in R and is freely available at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz370
      Issue No: Vol. 35, No. 14 (2019)
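The mixture estimation described above can be sketched in miniature. The toy below fits a two-component mixture of univariate Poisson distributions by plain EM; this is a deliberate simplification of MixMPLN's mixture of Multivariate Poisson Log-Normal distributions and its minorization–maximization estimator, and every name in it is illustrative rather than taken from the MixMPLN code.

```python
import math

def poisson_logpmf(k, lam):
    """Log P(K = k) for a Poisson distribution with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def em_poisson_mixture(counts, n_iter=100):
    """Fit a two-component Poisson mixture by EM (a toy stand-in for a
    mixture of Multivariate Poisson Log-Normal distributions)."""
    lams = [min(counts) + 0.5, max(counts) + 0.5]  # crude initialization
    pis = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observed count
        resp = []
        for k in counts:
            logs = [math.log(pis[j]) + poisson_logpmf(k, lams[j]) for j in range(2)]
            m = max(logs)
            ws = [math.exp(l - m) for l in logs]
            s = sum(ws)
            resp.append([w / s for w in ws])
        # M-step: re-estimate mixing weights and Poisson rates
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(counts)
            lams[j] = sum(r[j] * k for r, k in zip(resp, counts)) / nj
    return sorted(lams), pis

# Counts drawn from two regimes (rates ~2 and ~20), mimicking two networks.
counts = [2, 1, 3, 2, 2] * 20 + [19, 21, 20, 18, 22] * 20
lams, pis = em_poisson_mixture(counts)
```

With these synthetic counts, the recovered rates land near 2 and 20, the two generating regimes.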
  • ISMB/ECCB 2019 Proceedings
    • Authors: Bromberg Y; El-Mabrouk N, Radivojac P.
      Abstract: The biennial joint meeting of ISMB (27th Annual Conference on Intelligent Systems for Molecular Biology) and ECCB (18th European Conference on Computational Biology) was held in Basel, Switzerland, July 21–25, 2019. ISMB is the flagship conference of the International Society for Computational Biology and the world’s premier forum for dissemination of scientific research in computational biology and its intersection with other areas. ECCB is similarly a top venue in the field, with a long tradition of publishing and presenting world-class research. This special issue serves as the Proceedings of ISMB/ECCB 2019.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz439
      Issue No: Vol. 35, No. 14 (2019)
  • ShaKer: RNA SHAPE prediction using graph kernel
    • Authors: Mautner S; Montaseri S, Miladi M, et al.
      Abstract: Summary: SHAPE experiments are used to probe the structure of RNA molecules. We present ShaKer to predict SHAPE data for RNA using a graph-kernel-based machine learning approach that is trained on experimental SHAPE information. While other available methods require a manually curated reference structure, ShaKer predicts reactivity data based on sequence input only and by sampling the ensemble of possible structures. Thus, ShaKer is well placed to enable experiment-driven, transcriptome-wide SHAPE data prediction, supporting the study of RNA structuredness and improving RNA structure and RNA–RNA interaction prediction. For performance evaluation, we use accuracy and accessibility, comparing to experimental SHAPE data and competing methods. We show that ShaKer outperforms its competitors and is able to predict high-quality SHAPE annotations even when no reference structure is provided. Availability and implementation: ShaKer is freely available at [URL].
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz395
      Issue No: Vol. 35, No. 14 (2019)
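The idea of comparing structures with a graph kernel can be shown with the simplest possible member of that family: a vertex-label histogram kernel, i.e. the inner product of node-label counts. ShaKer's actual kernel over sampled RNA structures is far richer; the labels and graphs below are purely illustrative.

```python
from collections import Counter

def label_histogram_kernel(labels_g, labels_h):
    """Toy graph kernel: inner product of node-label histograms.
    ShaKer uses a much more expressive kernel; this only illustrates
    comparing structures via feature counts."""
    cg, ch = Counter(labels_g), Counter(labels_h)
    return sum(cg[l] * ch[l] for l in cg)

# Node labels could encode the paired/unpaired state of each RNA base.
g = ["paired", "paired", "unpaired", "unpaired", "unpaired"]
h = ["paired", "unpaired", "unpaired"]
k_gh = label_histogram_kernel(g, h)
k_gg = label_histogram_kernel(g, g)
```

A kernel like this is symmetric and a graph always scores at least as high against itself as against a smaller graph with the same label mix.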
  • Using the structure of genome data in the design of deep neural networks
           for predicting amyotrophic lateral sclerosis from genotype
    • Authors: Yin B; Balvert M, van der Spek R, et al.
      Abstract: Motivation: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease caused by aberrations in the genome. While several disease-causing variants have been identified, a major part of heritability remains unexplained. ALS is believed to have a complex genetic basis in which non-additive combinations of variants contribute to disease, which cannot be picked up using the linear models employed in classical genotype–phenotype association studies. Deep learning, on the other hand, is highly promising for identifying such complex relations. We therefore developed a deep-learning based approach for the classification of ALS patients versus healthy individuals from the Dutch cohort of the Project MinE dataset. Based on recent insight that regulatory regions harbor the majority of disease-associated variants, we employ a two-step approach: first, promoter regions that are likely associated with ALS are identified, and second, individuals are classified based on their genotype in the selected genomic regions. Both steps employ a deep convolutional neural network. The network architecture accounts for the structure of genome data by applying convolution only to parts of the data where this makes sense from a genomics perspective. Results: Our approach identifies potentially ALS-associated promoter regions, and generally outperforms other classification methods. Test results support the hypothesis that non-additive combinations of variants contribute to ALS. The architectures and protocols developed are tailored toward processing population-scale, whole-genome data. We consider this a relevant first step toward deep learning assisted genotype–phenotype association in whole genome-sized data. Availability and implementation: Our code will be available on GitHub, together with a synthetic dataset ([URL]). The data used in this study are available to bona fide researchers upon request. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz369
      Issue No: Vol. 35, No. 14 (2019)
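The building block behind "applying convolution only where it makes sense" is an ordinary 1-D convolution over an encoded genotype vector. The sketch below shows that operation alone, with allele dosages (0/1/2) as input and a made-up filter; the real model learns many such filters inside a deep network.

```python
def conv1d_valid(x, w):
    """1-D valid convolution (really cross-correlation, as in deep
    learning libraries): slide the kernel w over x without padding."""
    n, k = len(x), len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

genotype = [0, 1, 2, 0, 0, 1]   # allele dosages at six consecutive variants
kernel = [1.0, -1.0, 1.0]       # hypothetical learned filter
feature_map = conv1d_valid(genotype, kernel)
```

Each output position summarizes a local window of adjacent variants, which is why convolution is a natural fit for genomically contiguous input.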
  • Multifaceted protein–protein interaction prediction based on Siamese
           residual RCNN
    • Authors: Chen M; Ju C, Zhou G, et al.
      Abstract: Motivation: Sequence-based protein–protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage of the PPI information. Results: We present an end-to-end framework, PIPR (Protein–Protein Interaction Prediction Based on Siamese Residual RCNN), for PPI predictions using only the protein sequences. PIPR incorporates a deep residual recurrent convolutional neural network in the Siamese architecture, which leverages both robust local features and contextualized information, both significant for capturing the mutual influence of protein sequences. PIPR reduces the data pre-processing effort required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows promising performance on the more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short. Availability and implementation: The implementation is available at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz328
      Issue No: Vol. 35, No. 14 (2019)
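The Siamese part of the design can be shown in isolation: both sequences pass through the *same* encoder, and the two embeddings are combined symmetrically before scoring. The encoder below is a trivial k-mer counter standing in for PIPR's residual RCNN, and the score is a bare inner product; everything here is illustrative.

```python
from collections import Counter

def kmer_embedding(seq, k=2):
    """Shared toy encoder: k-mer counts (a stand-in for PIPR's RCNN)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def siamese_combine(seq_a, seq_b):
    """Encode both sequences with the same encoder, then combine the two
    embeddings symmetrically (here an inner product of count vectors),
    as in Siamese architectures."""
    ea, eb = kmer_embedding(seq_a), kmer_embedding(seq_b)
    return sum(ea[m] * eb[m] for m in ea)

score_sim = siamese_combine("MKTAYIAK", "MKTAYIAK")
score_dis = siamese_combine("MKTAYIAK", "GGGGGGGG")
```

Weight sharing is what makes the score order-independent: swapping the two inputs cannot change the result.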
  • Weighted elastic net for unsupervised domain adaptation with application
           to age prediction from DNA methylation data
    • Authors: Handl L; Jalali A, Scherer M, et al.
      Abstract: Motivation: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains. Results: We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared with a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples. Availability and implementation: Source code is available at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz338
      Issue No: Vol. 35, No. 14 (2019)
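The core idea, penalizing features whose behavior differs between domains, can be sketched without the solver: compare each feature's correlation with an anchor feature in the training and test domains, and raise the elastic net penalty weight where the correlations disagree. The weighting rule and anchor choice below are simplified assumptions, not the paper's exact construction.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def penalty_weights(train_cols, test_cols, anchor_idx=0):
    """Per-feature penalty weights for a weighted elastic net: features
    whose correlation with an anchor feature differs between domains get
    a larger weight, i.e. are penalized harder (toy version of the idea)."""
    ws = []
    for j in range(len(train_cols)):
        r_train = pearson(train_cols[j], train_cols[anchor_idx])
        r_test = pearson(test_cols[j], test_cols[anchor_idx])
        ws.append(1.0 + abs(r_train - r_test))
    return ws

train_cols = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
test_cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
weights = penalty_weights(train_cols, test_cols)
```

Feature 2 flips its relationship to the anchor between domains, so it receives the largest penalty weight and would be shrunk hardest by the downstream elastic net.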
  • Statistical compression of protein sequences and inference of marginal
           probability landscapes over competing alignments using finite state models
           and Dirichlet priors
    • Authors: Sumanaweera D; Allison L, Konagurthu A.
      Abstract: The information criterion of minimum message length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field, by systematically addressing the problem of arbitrariness in alignment parameters, and the disconnect between substitution scores and gap costs. Furthermore, our framework enables the generation of marginal probability landscapes over all possible alignment hypotheses, with the potential to help users simultaneously rationalize and assess competing alignment relationships between protein sequences, beyond simply reporting a single (best) alignment. We demonstrate the performance of our program on benchmarks containing distantly related protein sequences. Availability and implementation: The open-source program supporting this work is available from [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz368
      Issue No: Vol. 35, No. 14 (2019)
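The flavor of MML scoring with a Dirichlet prior can be shown on a tiny example: the length in bits of encoding a categorical sequence, using the closed-form Dirichlet-multinomial marginal likelihood. This is only the encoding arithmetic, not the paper's finite-state alignment models; the symmetric prior and two-symbol alphabet are assumptions for illustration.

```python
import math

def message_length_bits(counts, alpha=1.0):
    """Length in bits of encoding a categorical sequence with the given
    symbol counts, under a symmetric Dirichlet(alpha) prior (closed-form
    marginal likelihood of the Dirichlet-multinomial)."""
    n = sum(counts)
    k = len(counts)
    log_marg = (math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
                + sum(math.lgamma(c + alpha) - math.lgamma(alpha)
                      for c in counts))
    return -log_marg / math.log(2)

skewed = message_length_bits([10, 0])   # highly predictable sequence
uniform = message_length_bits([5, 5])   # maximally uncertain sequence
```

A predictable sequence compresses to a shorter message than an evenly mixed one, which is exactly the property MML exploits to rank competing hypotheses.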
  • MCS2: minimal coordinated supports for fast enumeration of minimal cut
           sets in metabolic networks
    • Authors: Miraskarshahi R; Zabeti H, Stephen T, et al.
      Abstract: Motivation: Constraint-based modeling of metabolic networks helps researchers gain insight into the metabolic processes of many organisms, both prokaryotic and eukaryotic. Minimal cut sets (MCSs) are minimal sets of reactions whose inhibition blocks a target reaction in a metabolic network. Most approaches for finding the MCSs in constraint-based models require, either as an intermediate step or as a byproduct of the calculation, the computation of the set of elementary flux modes (EFMs), a convex basis for the valid flux vectors in the network. Recently, Ballerstein et al. proposed a method for computing the MCSs of a network without first computing its EFMs, by creating a dual network whose EFMs are a superset of the MCSs of the original network. However, their dual network is always larger than the original network and depends on the target reaction. Here we propose the construction of a different dual network, which is typically smaller than the original network and is independent of the target reaction, for the same purpose. We prove the correctness of our approach, minimal coordinated support (MCS2), and describe how it can be modified to compute the few smallest MCSs for a given target reaction. Results: We compare MCS2 to the method of Ballerstein et al. and two other existing methods. We show that MCS2 succeeds in calculating the full set of MCSs in many models where other approaches cannot finish within a reasonable amount of time. Thus, in addition to its theoretical novelty, our approach provides a practical advantage over existing methods. Availability and implementation: MCS2 is freely available at [URL] under the GNU 3.0 license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz393
      Issue No: Vol. 35, No. 14 (2019)
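What an MCS *is* can be made concrete with the brute-force baseline the abstract alludes to: once the EFMs supporting the target are known, an MCS is just an inclusion-minimal hitting set of those EFMs. The toy below enumerates them directly, which is precisely the expensive route that dual-network methods like MCS2 avoid; the reaction names are invented.

```python
from itertools import combinations

def minimal_cut_sets(efms, reactions):
    """Brute-force MCS enumeration: a cut set must intersect every
    elementary flux mode that supports the target; keep only the
    inclusion-minimal ones."""
    cuts = []
    for size in range(1, len(reactions) + 1):
        for cand in combinations(sorted(reactions), size):
            cset = set(cand)
            if all(cset & efm for efm in efms):             # hits every EFM
                if not any(prev <= cset for prev in cuts):  # minimality
                    cuts.append(cset)
    return cuts

efms = [{"r1", "r2"}, {"r1", "r3"}]   # two toy flux modes feeding the target
mcs = minimal_cut_sets(efms, {"r1", "r2", "r3"})
```

Here knocking out r1 alone blocks both modes, and so does knocking out r2 and r3 together; no other minimal combination works.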
  • TADA: phylogenetic augmentation of microbiome samples enhances phenotype classification
    • Authors: Sayyari E; Kawas B, Mirarab S.
      Abstract: Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from microbiome data. Data augmentation, a technique that has proved helpful for many machine learning tasks, consists of building synthetic samples and adding them to the training data. Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low sample size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation: TADA is available at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz394
      Issue No: Vol. 35, No. 14 (2019)
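The simplest augmentation in this spirit is multinomial resampling: draw a synthetic sample at the same sequencing depth from an observed sample's taxon proportions. TADA's real generative model additionally exploits the phylogeny, which this sketch deliberately ignores; taxon names and counts are invented.

```python
import random

def augment_sample(taxa_counts, seed=0):
    """Draw one synthetic sample with the same sequencing depth by
    multinomial resampling from the observed taxon proportions.
    (TADA's generative model also uses the phylogeny; this toy ignores it.)"""
    rng = random.Random(seed)
    taxa = sorted(taxa_counts)
    weights = [taxa_counts[t] for t in taxa]
    depth = sum(weights)
    draws = rng.choices(taxa, weights=weights, k=depth)
    synthetic = {t: 0 for t in taxa}
    for t in draws:
        synthetic[t] += 1
    return synthetic

observed = {"Bacteroides": 60, "Prevotella": 30, "Akkermansia": 10}
synthetic = augment_sample(observed)
```

The synthetic sample preserves the depth and the taxon support of the original while perturbing the counts, which is the basic mechanism augmentation relies on.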
  • MOLI: multi-omics late integration with deep neural networks for drug
           response prediction
    • Authors: Sharifi-Noghabi H; Zolotareva O, Collins C, et al.
      Abstract: Motivation: Historically, gene expression has been shown to be the most informative data for drug response prediction. Recent evidence suggests that integrating additional omics can improve the prediction accuracy, which raises the question of how to integrate the additional omics. Regardless of the integration strategy, clinical utility and translatability are crucial. Thus, we reasoned that a multi-omics approach combined with clinical datasets would improve drug response prediction and clinical relevance. Results: We propose MOLI, a multi-omics late integration method based on deep neural networks. MOLI takes somatic mutation, copy number aberration and gene expression data as input, and integrates them for drug response prediction. MOLI uses type-specific encoding sub-networks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. The former makes the representations of responder samples more similar to each other and different from the non-responders, and the latter makes this representation predictive of the response values. We validate MOLI on in vitro and in vivo datasets for five chemotherapy agents and two targeted therapeutics. Compared to state-of-the-art single-omics and early integration multi-omics methods, MOLI achieves higher prediction accuracy in external validations. Moreover, a significant improvement in MOLI’s performance is observed for targeted drugs when training on a pan-drug input, i.e. using all the drugs with the same target, compared to training only on drug-specific inputs. MOLI’s high predictive power suggests it may have utility in precision oncology. Availability and implementation: [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz318
      Issue No: Vol. 35, No. 14 (2019)
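The combined cost described above can be written down directly: a triplet term pulls a responder's representation toward another responder and away from a non-responder, and a binary cross-entropy term scores the response prediction. The margin, vectors and probability below are made-up numbers, not MOLI's values.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a,p) - d(a,n) + margin): pull same-class representations
    together, push the negative away by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One responder triplet plus its response prediction, combined as in a
# MOLI-style cost function.
anchor, positive, negative = [0.0, 0.0], [1.0, 0.0], [3.0, 4.0]
loss = triplet_loss(anchor, positive, negative) + bce(0.9, 1)
```

When the negative is already further from the anchor than the positive by more than the margin, the triplet term vanishes and only the classification term drives learning.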
  • DIFFUSE: predicting isoform functions from sequences and expression
           profiles via deep learning
    • Authors: Chen H; Shaw D, Zeng J, et al.
      Abstract: Motivation: Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: (i) unlike genes, isoform-level functional annotations are scarce; (ii) the information about isoform functions is concealed in various types of data, including isoform sequences, co-expression relationships among isoforms, etc. Results: In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences, and then refining the predictions using a conditional random field (CRF) based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE can effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristic curve of 0.840 and area under the precision–recall curve of 0.581 over 4184 GO functional categories, both significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlations between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. Availability and implementation: [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz367
      Issue No: Vol. 35, No. 14 (2019)
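The refinement step can be caricatured as label propagation on the co-expression graph: each isoform's score is repeatedly blended with the mean score of its co-expressed neighbors. DIFFUSE's actual CRF inference is different; the isoform names, scores and blending weight here are hypothetical.

```python
def propagate(scores, neighbors, lam=0.5, n_iter=20):
    """Blend each node's DNN score with the mean score of its
    co-expression neighbours - a toy stand-in for CRF-based refinement."""
    cur = dict(scores)
    for _ in range(n_iter):
        nxt = {}
        for node in cur:
            nbrs = neighbors.get(node, [])
            if nbrs:
                mean_nbr = sum(cur[v] for v in nbrs) / len(nbrs)
                nxt[node] = (1 - lam) * scores[node] + lam * mean_nbr
            else:
                nxt[node] = scores[node]
        cur = nxt
    return cur

dnn_scores = {"iso1": 0.9, "iso2": 0.1, "iso3": 0.8}
coexpr = {"iso1": ["iso3"], "iso3": ["iso1"], "iso2": []}
refined = propagate(dnn_scores, coexpr)
```

The two co-expressed isoforms are pulled toward each other's scores, while the isolated one keeps its original prediction.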
  • Prediction of mRNA subcellular localization using deep recurrent neural networks
    • Authors: Yan Z; Lécuyer E, Blanchette M.
      Abstract: Motivation: Messenger RNA subcellular localization mechanisms play a crucial role in post-transcriptional gene regulation. This trafficking is mediated by trans-acting RNA-binding proteins interacting with cis-regulatory elements called zipcodes. While new sequencing-based technologies allow the high-throughput identification of RNAs localized to specific subcellular compartments, the precise mechanisms at play, and their dependency on specific sequence elements, remain poorly understood. Results: We introduce RNATracker, a novel deep neural network built to predict, from their sequence alone, the distributions of mRNA transcripts over a predefined set of subcellular compartments. RNATracker integrates several state-of-the-art deep learning techniques (e.g. CNN, LSTM and attention layers) and can make use of both sequence and secondary structure information. We report on a variety of evaluations showing RNATracker’s strong predictive power, which is significantly superior to a variety of baseline predictors. Despite its complexity, several aspects of the model can be isolated to yield valuable, testable mechanistic hypotheses, and to locate candidate zipcode sequences within transcripts. Availability and implementation: Code and data can be accessed at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz337
      Issue No: Vol. 35, No. 14 (2019)
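The attention layer mentioned above plays a pooling role: it turns a variable-length sequence of per-position features into one fixed-size vector, and its weights indicate which positions (candidate zipcodes) the model attends to. Below is minimal softmax attention pooling with made-up features and scores, not RNATracker's layer.

```python
import math

def attention_pool(features, scores):
    """Softmax the per-position relevance scores, then take the weighted
    sum of the per-position feature vectors."""
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return pooled, weights

# Three sequence positions, two feature channels; position 1 scores high.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(features, [0.0, 5.0, 0.0])
```

The pooled vector is dominated by the highly scored position, which is what makes attention weights readable as position-level importance.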
  • PRISM: methylation pattern-based, reference-free inference of subclonal makeup
    • Authors: Lee D; Lee S, Kim S.
      Abstract: Motivation: Characterizing cancer subclones is crucial for the ultimate conquest of cancer. Thus, a number of bioinformatic tools have been developed to infer heterogeneous tumor populations based on genomic signatures such as mutations and copy number variations. Despite accumulating evidence for the significance of global DNA methylation reprogramming in certain cancer types, including myeloid malignancies, none of these bioinformatic tools is designed to exploit subclonally reprogrammed methylation patterns to reveal the constituent populations of a tumor. In accordance with the notion of global methylation reprogramming, our preliminary observations on acute myeloid leukemia (AML) samples implied the existence of subclonally occurring focal methylation aberrance throughout the genome. Results: We present PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing. PRISM adopts DNA methyltransferase 1-like hidden Markov model-based in silico proofreading for the correction of erroneous methylation patterns. With error-corrected methylation patterns, PRISM focuses on short individual genomic regions harboring dichotomous patterns that can be split into fully methylated and unmethylated patterns. The frequencies of these two patterns form a sufficient statistic for subclonal abundance. A set of statistics collected from each genomic region is modeled with a beta-binomial mixture. Fitting the mixture with the expectation–maximization algorithm finally provides the inferred composition of subclones. Applying PRISM to two AML samples, we demonstrate that PRISM can infer the evolutionary history of malignant samples from an epigenetic point of view. Availability and implementation: PRISM is freely available on GitHub ([URL]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz327
      Issue No: Vol. 35, No. 14 (2019)
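The grouping step can be illustrated in miniature: per region, the fraction of fully methylated patterns estimates the abundance of a methylated subclone, and grouping those per-region fractions suggests the subclone abundances. The sketch uses plain 1-D k-means with two clusters instead of PRISM's beta-binomial mixture EM, and the fractions are invented.

```python
def kmeans_1d(values, n_iter=50):
    """Plain two-cluster 1-D k-means. PRISM actually fits a beta-binomial
    mixture by EM; this is only a cheap stand-in for grouping
    per-region pattern frequencies."""
    centers = [min(values), max(values)]
    for _ in range(n_iter):
        groups = [[], []]
        for v in values:
            j = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[j].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

# Fraction of fully methylated patterns observed in each genomic region.
region_fractions = [0.18, 0.22, 0.20, 0.71, 0.69, 0.70]
centers = kmeans_1d(region_fractions)
```

The two cluster centers, about 0.2 and 0.7, would be read as the abundances of two epigenetically distinct subclones.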
  • Inference of clonal selection in cancer populations using single-cell
           sequencing data
    • Authors: Skums P; Tsyvina V, Zelikovsky A.
      Abstract: Summary: Intra-tumor heterogeneity is one of the major factors influencing cancer progression and treatment outcome. However, the evolutionary dynamics of cancer clone populations remain poorly understood. Quantification of clonal selection and inference of the fitness landscapes of tumors are key steps toward understanding the evolutionary mechanisms driving cancer. These problems can be addressed using single-cell sequencing (scSeq), which provides unprecedented insight into intra-tumor heterogeneity and makes it possible to study and quantify the selective advantages of individual clones. Here, we present Single Cell Inference of FItness Landscape (SCIFIL), a computational tool for inferring the fitness landscapes of heterogeneous cancer clone populations from scSeq data. SCIFIL estimates maximum likelihood fitnesses of clone variants and measures their selective advantages and order of appearance by fitting an evolutionary model to the tumor phylogeny. We demonstrate the accuracy of our approach, and show how it can be applied to experimental tumor data to study clonal selection and infer evolutionary history. SCIFIL can be used to provide new insight into the evolutionary dynamics of cancer. Availability and implementation: Source code is available at [URL].
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz392
      Issue No: Vol. 35, No. 14 (2019)
  • scOrange—a tool for hands-on training of concepts from single-cell
           data analytics
    • Authors: Stražar M; Žagar L, Kokošar J, et al.
      Abstract: Motivation: Single-cell RNA sequencing allows us to simultaneously profile the transcriptomes of thousands of cells and to explore cell diversity, development and the discovery of new molecular mechanisms. Analysis of scRNA data involves a combination of non-trivial steps from statistics, data visualization, bioinformatics and machine learning. Training molecular biologists in single-cell data analysis and empowering them to review and analyze their data can be challenging, both because of the complexity of the methods and the steep learning curve. Results: We propose a workshop-style training in single-cell data analytics that relies on an explorative data analysis toolbox and a hands-on teaching style. The training relies on scOrange, a newly developed extension of a data mining framework that features workflow design through visual programming and interactive visualizations. Workshops with scOrange can proceed much faster than similar training that relies on computer programming and analysis through scripting in R or Python, allowing the trainer to cover more ground in the same time-frame. We here review the design principles of the scOrange toolbox that support such workshops and propose a syllabus for the course. We also provide examples of data analysis workflows that instructors can use during the training. Availability and implementation: scOrange is open-source software. The software, documentation and an emerging set of educational videos are available at [URL].
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz348
      Issue No: Vol. 35, No. 14 (2019)
  • Reconstructing signaling pathways using regular language constrained paths
    • Authors: Wagner M; Pratapa A, Murali T.
      Abstract: Motivation: High-quality curation of the proteins and interactions in signaling pathways is slow and painstaking. As a result, many experimentally detected interactions are not annotated to any pathways. A natural question that arises is whether or not it is possible to automatically leverage existing pathway annotations to identify new interactions for inclusion in a given pathway. Results: We present RegLinker, an algorithm that achieves this purpose by computing multiple short paths from pathway receptors to transcription factors within a background interaction network. The key idea underlying RegLinker is the use of regular language constraints to control the number of non-pathway interactions present in the computed paths. We systematically evaluate RegLinker and five alternative approaches against a comprehensive set of 15 signaling pathways and demonstrate that RegLinker recovers withheld pathway proteins and interactions with the best precision and recall. We used RegLinker to propose new extensions to the pathways, and we discuss the literature that supports the inclusion of these proteins in the pathways. These results show the broad potential of automated analysis to attenuate the difficulties of traditional manual inquiry. Availability and implementation: [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz360
      Issue No: Vol. 35, No. 14 (2019)
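The regular-language constraint can be sketched as a BFS over the product of the interaction graph with a tiny automaton whose state counts how many non-pathway edges have been used. Edges are labelled "p" (pathway) or "x" (non-pathway), and paths with more than `max_x` non-pathway edges are pruned; node names and the one-counter language are illustrative, not RegLinker's actual grammar.

```python
from collections import deque

def constrained_shortest_path(edges, source, target, max_x=1):
    """BFS over (node, #non-pathway-edges-used) product states: find a
    shortest source->target path using at most max_x edges labelled 'x'.
    Mimics a regular-language path constraint in miniature."""
    adj = {}
    for u, v, label in edges:
        adj.setdefault(u, []).append((v, label))
    start = (source, 0)
    prev = {start: None}
    q = deque([start])
    while q:
        node, used = q.popleft()
        if node == target:                  # reconstruct the path
            path, state = [], (node, used)
            while state is not None:
                path.append(state[0])
                state = prev[state]
            return path[::-1]
        for v, label in adj.get(node, []):
            nused = used + (1 if label == "x" else 0)
            if nused <= max_x and (v, nused) not in prev:
                prev[(v, nused)] = (node, used)
                q.append((v, nused))
    return None

edges = [
    ("receptor", "a", "p"), ("a", "tf", "x"),                   # one 'x' edge
    ("receptor", "b", "x"), ("b", "c", "x"), ("c", "tf", "p"),  # two 'x' edges
]
path = constrained_shortest_path(edges, "receptor", "tf")
```

With the budget of one non-pathway edge, only the upper route survives; tightening the budget to zero leaves no valid path at all.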
  • hicGAN infers super resolution Hi-C data with generative adversarial networks
    • Authors: Liu Q; Lv H, Jiang R.
      Abstract: Motivation: Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis, such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High-resolution Hi-C data are valuable resources that implicate the relationship between 3D genome conformation and function, especially in linking distal regulatory elements to their target genes. However, high-resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data. Results: We propose hicGAN, an open-source framework for inferring high-resolution Hi-C data from low-resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low-resolution Hi-C data by generating matrices that are highly consistent with the original high-resolution Hi-C matrices. A typical usage scenario for our approach is to enhance low-resolution Hi-C data in new cell types, especially where high-resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides fascinating insights into the complex mechanisms underlying the formation of chromatin contacts. Availability and implementation: We release hicGAN as open-source software at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz317
      Issue No: Vol. 35, No. 14 (2019)
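What "low resolution" means for a contact matrix can be shown by bin merging: summing 2×2 blocks of counts halves the resolution, and hicGAN learns the inverse of exactly this kind of degradation. The sketch below shows only the data side, not the GAN; the matrix values are invented.

```python
def downsample_contacts(matrix, factor=2):
    """Merge factor x factor bins of a Hi-C contact matrix by summing
    counts - producing the kind of low-resolution input that a model
    like hicGAN learns to map back to high resolution."""
    n = len(matrix)
    m = n // factor
    out = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            out[i][j] = sum(matrix[factor * i + a][factor * j + b]
                            for a in range(factor) for b in range(factor))
    return out

high_res = [
    [4, 1, 0, 0],
    [1, 5, 1, 0],
    [0, 1, 6, 2],
    [0, 0, 2, 7],
]
low_res = downsample_contacts(high_res)
```

Total contact counts are preserved, but fine structure such as the diagonal blocks is blurred, which is the information a super-resolution model must recover.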
  • Robust network inference using response logic
    • Authors: Gross T; Wongchenko M, Yan Y, et al.
      Abstract: Motivation: A major challenge in molecular and cellular biology is to map out the regulatory networks of cells. As regulatory interactions typically cannot be directly observed experimentally, various computational methods have been proposed to disentangle direct and indirect effects. Most of these rely on assumptions that are rarely met or cannot be adapted to a given context. Results: We present a network inference method that is based on a simple response logic with minimal presumptions. It requires only that we can experimentally observe whether or not some of the system’s components respond to perturbations of some other components, and it then identifies the directed networks that most accurately account for the observed propagation of the signal. To cope with the intractable number of possible networks, we developed a logic programming approach that can infer networks of hundreds of nodes, while being robust to noisy, heterogeneous or missing data. This makes it possible to directly integrate prior network knowledge and additional constraints such as sparsity. We systematically benchmark our method on KEGG pathways, and show that it outperforms existing approaches in the DREAM3 and DREAM4 challenges. Applied to a novel perturbation dataset on the PI3K and MAPK pathways in isogenic models of a colon cancer cell line, it generates plausible network hypotheses that explain distinct sensitivities toward various targeted inhibitors due to different PI3K mutants. Availability and implementation: A Python/Answer Set Programming implementation can be accessed at [URL]. Data and analysis scripts are available at [URL]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz326
      Issue No: Vol. 35, No. 14 (2019)
  • Precise modelling and interpretation of bioactivities of ligands targeting
           G protein-coupled receptors
    • Authors: Wu J; Liu B, Chan W, et al.
      Abstract: Motivation: Accurate prediction and interpretation of ligand bioactivities are essential for virtual screening and drug discovery. Unfortunately, many important drug targets lack experimental data about ligand bioactivities; this is particularly true for G protein-coupled receptors (GPCRs), which account for the targets of about a third of drugs currently on the market. Computational approaches with the potential for precise assessment of ligand bioactivities and determination of the key substructural features which determine them are needed to address this issue. Results: A new method, SED, was proposed to predict ligand bioactivities and to recognize key substructures associated with GPCRs through the coupling of screening for Lasso of long extended-connectivity fingerprints (ECFPs) with deep neural network training. The SED pipeline contains three successive steps: (i) representation of long ECFPs for ligand molecules, (ii) feature selection by screening for Lasso of ECFPs and (iii) bioactivity prediction through a deep neural network regression model. The method was examined on a set of 16 representative GPCRs that cover most subfamilies of human GPCRs, where each has 300–5000 ligand associations. The results show that SED achieves excellent performance in modelling ligand bioactivities, especially for those GPCR datasets without sufficient ligand associations, where SED improved the baseline predictors by 12% in correlation coefficient (r2) and 19% in root mean square error. Detailed data analyses suggest that the major advantage of SED lies in its ability to detect substructures from long ECFPs, which significantly improves the predictive performance. Availability and implementation: The source code and datasets of SED are freely available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz336
      Issue No: Vol. 35, No. 14 (2019)
  • Inheritance and variability of kinetic gene expression parameters in
           microbial cells: modeling and inference from lineage tree data
    • Authors: Marguet A; Lavielle M, Cinquemani E.
      Abstract: Motivation: Modern experimental technologies enable monitoring of gene expression dynamics in individual cells and quantification of its variability in isogenic microbial populations. Among the sources of this variability is the randomness that affects inheritance of gene expression factors at cell division. Known parental relationships among individually observed cells provide invaluable information for the characterization of this extrinsic source of gene expression noise. Despite this fact, most existing methods to infer stochastic gene expression models from single-cell data dedicate little attention to the reconstruction of mother–daughter inheritance dynamics. Results: Starting from a transcription and translation model of gene expression, we propose a stochastic model for the evolution of gene expression dynamics in a population of dividing cells. Based on this model, we develop a method for the direct quantification of inheritance and variability of kinetic gene expression parameters from single-cell gene expression and lineage data. We demonstrate that our approach provides unbiased estimates of mother–daughter inheritance parameters, whereas indirect approaches using lineage information only in the post-processing of individual-cell parameters underestimate inheritance. Finally, we show on yeast osmotic shock response data that daughter cell parameters are largely determined by the mother, thus confirming the relevance of our method for the correct assessment of the onset of gene expression variability and the study of the transmission of regulatory factors. Availability and implementation: Software code is available at . Lineage tree data are available upon request. Supplementary information: Supplementary material is available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz378
      Issue No: Vol. 35, No. 14 (2019)
  • Efficient haplotype matching between a query and a panel for genealogical search
    • Authors: Naseri A; Holzhauser E, Zhi D, et al.
      Abstract: Motivation: With the wide availability of whole-genome genotype data, there is an increasing need to conduct genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant-time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. Results: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5 that can find all L-long matches, i.e. matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, then scan up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using simulated data and the UK Biobank data. Our results show that our proposed algorithms can efficiently detect related individuals for a given query in very large cohorts, enabling fast online query searches. Availability and implementation: . Supplementary information: Supplementary data are available at Bioinformatics online.
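What an L-long match is can be illustrated with a naive quadratic scan (our own toy code, not the PBWT-based algorithms, which replace this per-haplotype scan with a panel index and a virtually inserted query):

```python
def l_long_matches(query, panel, L):
    """Find all matches of length >= L between a query haplotype and each
    panel haplotype (naive O(N*M) scan, for illustration only)."""
    matches = []
    for h, hap in enumerate(panel):
        run = 0
        for pos, (a, b) in enumerate(zip(query, hap)):
            if a == b:
                run += 1
            else:
                if run >= L:
                    matches.append((h, pos - run, pos))  # (haplotype, start, end)
                run = 0
        if run >= L:   # match running to the end of the haplotype
            matches.append((h, len(query) - run, len(query)))
    return matches

panel = ["0110100110", "0110111111", "1010100100"]
query = "0110100111"
print(l_long_matches(query, panel, 4))   # -> [(0, 0, 9), (1, 0, 5), (2, 2, 8)]
```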
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz347
      Issue No: Vol. 35, No. 14 (2019)
  • A divide-and-conquer method for scalable phylogenetic network inference
           from multilocus data
    • Authors: Zhu J; Liu X, Ogilvie H, et al.
      Abstract: Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied the method’s performance, in terms of both running time and accuracy, on simulated as well as biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementation: We implemented the algorithms in the publicly available software package PhyloNet ( ). Supplementary information: Supplementary data are available at Bioinformatics online.
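The 'simple heuristic' for the Hitting Set formulation is not spelled out in the abstract; a standard greedy sketch (our own illustrative code, not the PhyloNet implementation) looks like this:

```python
def greedy_hitting_set(subsets):
    """Greedy heuristic for Hitting Set: repeatedly pick the element that
    hits (intersects) the most not-yet-hit subsets, until all are hit."""
    remaining = [set(s) for s in subsets]
    chosen = set()
    while remaining:
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = max(sorted(counts), key=counts.get)   # deterministic tie-break
        chosen.add(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

# Each subset is a constraint that must contain at least one chosen element.
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_hitting_set(subsets))   # element 4 hits three subsets, then 1 hits the rest
```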
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz359
      Issue No: Vol. 35, No. 14 (2019)
  • pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework
    • Authors: Yang H; Chi H, Zeng W, et al.
      Abstract: Motivation: De novo peptide sequencing based on tandem mass spectrometry data is the key technology in shotgun proteomics for identifying peptides without any database and assembling unknown proteins. However, owing to the low ion coverage in tandem mass spectra, the order of certain consecutive amino acids cannot be determined if all of their supporting fragment ions are missing, which results in the low precision of de novo sequencing. Results: To solve this problem, we developed pNovo 3, which uses a learning-to-rank framework to distinguish similar peptide candidates for each spectrum. Three metrics for measuring the similarity between each experimental spectrum and its corresponding theoretical spectrum are used as important features; the theoretical spectra can be precisely predicted by the pDeep algorithm using deep learning. On seven benchmark datasets from six diverse species, pNovo 3 recalled 29–102% more correct spectra, and its precision was 11–89% higher, than three other state-of-the-art de novo sequencing algorithms. Furthermore, compared with the newly developed DeepNovo, which also uses a deep learning approach, pNovo 3 still identified 21–50% more spectra on the nine datasets used in the DeepNovo study. In summary, the deep learning and learning-to-rank techniques implemented in pNovo 3 significantly improve the precision of de novo sequencing, and such a machine learning framework is worth extending to other related research fields to distinguish similar sequences. Availability and implementation: pNovo 3 can be freely downloaded from . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz366
      Issue No: Vol. 35, No. 14 (2019)
  • Inferring signalling dynamics by integrating interventional with
           observational data
    • Authors: Cardner M; Meyer-Schaller N, Christofori G, et al.
      Abstract: Motivation: In order to infer a cell signalling network, we generally need interventional data from perturbation experiments. If the perturbation experiments are time-resolved, then signal progression through the network can be inferred. However, such designs are infeasible for large signalling networks, where it is more common to have steady-state perturbation data on the one hand, and a non-interventional time series on the other. Such was the design in a recent experiment investigating the coordination of epithelial–mesenchymal transition (EMT) in murine mammary gland cells. We aimed to infer the underlying signalling network of transcription factors and microRNAs coordinating EMT, as well as the signal progression during EMT. Results: In the context of nested effects models, we developed a method for integrating perturbation data with a non-interventional time series. We applied the model to RNA sequencing data obtained from an EMT experiment. Part of the network inferred from RNA interference was validated experimentally using luciferase reporter assays. Our model extension is formulated as an integer linear programme, which can be solved efficiently using heuristic algorithms. This extension allowed us to infer the signal progression through the network during an EMT time course, and thereby assess when each regulator is necessary for EMT to advance. Availability and implementation: R package at . The RNA sequencing data and microscopy images can be explored through a Shiny app at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz325
      Issue No: Vol. 35, No. 14 (2019)
  • FunDMDeep-m6A: identification and prioritization of functional
           differential m6A methylation genes
    • Authors: Zhang S; Zhang S, Fan X, et al.
      Abstract: Motivation: As the most abundant mammalian mRNA methylation, N6-methyladenosine (m6A) exists in >25% of human mRNAs and is involved in regulating many different aspects of mRNA metabolism, stem cell differentiation and diseases like cancer. However, our current knowledge about dynamic changes of m6A levels, and about how the change of m6A levels for a specific gene can play a role in biological processes like stem cell differentiation and diseases like cancer, is largely elusive. Results: To address this, we propose FunDMDeep-m6A, a novel pipeline for identifying context-specific (e.g. disease versus normal, differentiated cells versus stem cells or gene knockdown cells versus wild-type cells) m6A-mediated functional genes. FunDMDeep-m6A includes, as its first step, DMDeep-m6A, a novel method based on a deep learning model and a statistical test for identifying differential m6A methylation (DmM) sites from MeRIP-Seq data at single-base resolution. FunDMDeep-m6A then identifies and prioritizes functional DmM genes (FDmMGenes) by combining the DmM genes (DmMGenes) with differential expression analysis using a network-based method. This network method includes a novel m6A-signaling bridge (MSB) score to quantify the functional significance of DmMGenes by assessing the functional interaction of DmMGenes with their signaling pathways using a heat diffusion process in protein–protein interaction (PPI) networks. The test results on 4 context-specific MeRIP-Seq datasets showed that FunDMDeep-m6A can identify more context-specific and functionally significant FDmMGenes than m6A-Driver. The functional enrichment analysis of these genes revealed that m6A targets key genes of many important context-related biological processes, including embryonic development, stem cell differentiation, transcription, translation, cell death, cell proliferation and cancer-related pathways. These results demonstrate the power of FunDMDeep-m6A for elucidating m6A regulatory functions and its roles in biological processes and diseases. Availability and implementation: The R package for DMDeep-m6A is freely available from . Supplementary information: Supplementary data are available at Bioinformatics online.
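The heat diffusion step underlying a pathway-bridge score of this kind can be sketched on a toy PPI graph (illustrative code with invented names; not the authors' MSB implementation):

```python
def heat_diffusion(adj, seed, restart=0.3, iters=200):
    """Heat diffusion with restart on a small undirected PPI graph.
    adj maps node -> list of neighbours; seed maps node -> initial heat."""
    nodes = sorted(adj)
    heat = {n: seed.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        # each node receives heat from neighbours (degree-normalized),
        # plus a constant re-injection at the seed nodes
        heat = {n: (1 - restart) * sum(heat[m] / len(adj[m]) for m in adj[n])
                   + restart * seed.get(n, 0.0)
                for n in nodes}
    return heat

# Toy network: a differentially methylated gene A and its pathway neighbours.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
heat = heat_diffusion(adj, seed={"A": 1.0})
print({n: round(h, 3) for n, h in sorted(heat.items())})
```

Nodes topologically closer to the seed retain more heat, which is the intuition behind scoring a gene by how strongly it connects to its pathway.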
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz316
      Issue No: Vol. 35, No. 14 (2019)
  • Enhancing the drug discovery process: Bayesian inference for the analysis
           and comparison of dose–response experiments
    • Authors: Labelle C; Marinier A, Lemieux S.
      Abstract: Motivation: The efficacy of a chemical compound is often tested through dose–response experiments from which efficacy metrics, such as the IC50, can be derived. The Marquardt–Levenberg algorithm (non-linear regression) is commonly used to compute estimates of these metrics. The analyses are, however, limited and can lead to biased conclusions: the approach neither evaluates the certainty (or uncertainty) of the estimates nor allows for the statistical comparison of two datasets. To compensate for these shortcomings, intuition plays an important role in the interpretation of results and the formulation of conclusions. We here propose a Bayesian inference methodology for the analysis and comparison of dose–response experiments. Results: Our results demonstrate the gain in informativeness of our Bayesian approach in comparison to the commonly used Marquardt–Levenberg algorithm. It is capable of characterizing the noise of a dataset while inferring probable value distributions for the efficacy metrics. It can also evaluate the difference between the metrics of two datasets and compute the probability that one value is greater than the other. The conclusions that can be drawn from such analyses are more precise. Availability and implementation: We implemented a simple web interface that allows users to analyze a single dose–response dataset, as well as to statistically compare the metrics of two datasets.
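A minimal illustration of Bayesian inference for a dose-response metric: a grid posterior over the IC50 of a Hill curve with fixed slope and Gaussian noise. All names, the grid approach and the fixed-parameter simplification are ours, not the authors' model:

```python
import math

def hill(dose, ic50, top=1.0, bottom=0.0, slope=1.0):
    """Four-parameter logistic (Hill) dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (dose / ic50) ** slope)

def ic50_posterior(doses, responses, grid, sigma=0.02):
    """Posterior over IC50 on a grid (flat prior, Gaussian measurement noise);
    the other curve parameters are held fixed for brevity."""
    logliks = [sum(-(r - hill(d, ic50)) ** 2 / (2 * sigma ** 2)
                   for d, r in zip(doses, responses))
               for ic50 in grid]
    m = max(logliks)                       # subtract max for numerical stability
    w = [math.exp(ll - m) for ll in logliks]
    z = sum(w)
    return [x / z for x in w]

grid = [0.5 + 0.01 * i for i in range(200)]        # candidate IC50 values
doses = [0.1, 0.3, 1.0, 3.0, 10.0]
responses = [hill(d, ic50=1.5) for d in doses]     # idealized measurements
post = ic50_posterior(doses, responses, grid)
ic50_mean = sum(g * p for g, p in zip(grid, post))
print(round(ic50_mean, 2))   # posterior mean recovers the true IC50 (~1.5)
```

With posteriors for two datasets, P(IC50_1 > IC50_2) follows by summing the joint grid weights, which is the kind of comparison the abstract describes.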
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz335
      Issue No: Vol. 35, No. 14 (2019)
  • Efficient merging of genome profile alignments
    • Authors: Hennig A; Nieselt K.
      Abstract: Motivation: Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results: Here, we present genome profile alignment (GPA), an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide-and-conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility of parallel computation of independent genome alignments. Our results, based on various bacterial datasets of up to several hundred genomes, show that we can reduce the runtime from months to hours with a quality that is negligibly worse than that of the WGA computed with the conventional progressiveMauve tool. Availability and implementation: GPA is freely available at . GPA is implemented in Java, uses progressiveMauve and offers parallel computation of WGAs. Supplementary information: Supplementary data are available at Bioinformatics online.
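The bidirectional mapping between sequence and alignment coordinates can be sketched for a single gapped alignment row (our own illustrative code; the actual SuperGenome covers multiple genomes and rearrangements):

```python
def coordinate_maps(aligned_seq, gap="-"):
    """Bidirectional mapping between sequence coordinates and alignment
    columns for one gapped row of a whole-genome alignment."""
    seq_to_aln, aln_to_seq = {}, {}
    seq_pos = 0
    for col, ch in enumerate(aligned_seq):
        if ch != gap:
            seq_to_aln[seq_pos] = col
            aln_to_seq[col] = seq_pos
            seq_pos += 1
        else:
            aln_to_seq[col] = None   # gap column: no sequence coordinate
    return seq_to_aln, aln_to_seq

s2a, a2s = coordinate_maps("AC--GT-A")
print(s2a)   # -> {0: 0, 1: 1, 2: 4, 3: 5, 4: 7}
```

Two such maps, composed through the shared alignment columns, transfer a coordinate from one genome's system into another's, which is how profiles can be merged into a common coordinate system.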
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz377
      Issue No: Vol. 35, No. 14 (2019)
  • Large scale microbiome profiling in the cloud
    • Authors: Valdes C; Stebliankin V, Narasimhan G.
      Abstract: Motivation: Bacterial metagenomics profiling for metagenomic whole-genome shotgun sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources. Results: We developed Flint, a metagenomics profiling pipeline built on top of the Apache Spark framework and designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. Availability and implementation: Flint is open source software, available under the MIT License (MIT). Source code is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz356
      Issue No: Vol. 35, No. 14 (2019)
  • Alignment-free filtering for cfNA fusion fragments
    • Authors: Yang X; Saito Y, Rao A, et al.
      Abstract: Motivation: Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths and specialized barcodes, such as unique molecular identifiers. Results: AF4 was developed to address these challenges. It uses a novel alignment-free, k-mer-based method to detect candidate fusion fragments with high sensitivity, orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes on benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data. Availability and implementation: AF4 is open sourced, licensed under Apache License 2.0, and is available at:
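A first-stage alignment-free k-mer filter of the kind described can be sketched as follows (toy code with invented sequences and thresholds; AF4's real filter additionally handles reverse complements, barcodes and the max-cover criterion):

```python
def kmers(seq, k):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_fusion_fragments(fragments, gene_a, gene_b, k=6, min_hits=2):
    """Alignment-free filter: keep fragments sharing >= min_hits k-mers with
    BOTH partner genes, a rough stand-in for a first-stage fusion filter."""
    idx_a, idx_b = kmers(gene_a, k), kmers(gene_b, k)
    out = []
    for frag in fragments:
        fk = kmers(frag, k)
        if len(fk & idx_a) >= min_hits and len(fk & idx_b) >= min_hits:
            out.append(frag)
    return out

gene_a = "ATGGCGTACGTTAGC"
gene_b = "TTGACCTGGAATCCG"
frags = [gene_a[4:12] + gene_b[2:10],   # simulated fusion fragment
         gene_a[0:14],                  # wild-type gene A only
         "GGGGGGGGCCCCCCCC"]            # unrelated fragment
print(candidate_fusion_fragments(frags, gene_a, gene_b))   # keeps only the fusion
```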
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz346
      Issue No: Vol. 35, No. 14 (2019)
  • Identifying and ranking potential driver genes of Alzheimer’s disease
           using multiview evidence aggregation
    • Authors: Mukherjee S; Perumal T, Daily K, et al.
      Abstract: Motivation: Late-onset Alzheimer’s disease is currently a disease with no known effective treatment options. To better understand the disease, new multi-omic datasets have recently been generated with the goal of identifying molecular causes of disease. However, most analytic studies using these datasets focus on uni-modal analysis of the data. Here, we propose a data-driven approach to integrate multiple data types and analytic outcomes in order to aggregate evidence supporting the hypothesis that a gene is a genetic driver of the disease. The main algorithmic contributions of our article are: (i) a general machine learning framework to learn the key characteristics of a few known driver genes from multiple feature sets and identify other potential driver genes with similar feature representations, and (ii) a flexible ranking scheme with the ability to integrate external validation in the form of Genome-Wide Association Study summary statistics. While we currently focus on demonstrating the effectiveness of the approach using different analytic outcomes from RNA-Seq studies, this method is easily generalizable to other data modalities and analysis types. Results: We demonstrate the utility of our machine learning algorithm on two benchmark multiview datasets by significantly outperforming the baseline approaches in predicting missing labels. We then use the algorithm to predict and rank potential drivers of Alzheimer’s. We show that our ranked genes show a significant enrichment for single nucleotide polymorphisms associated with Alzheimer’s, and are enriched in pathways that have been previously associated with the disease. Availability and implementation: Source code and a link to all feature sets are available at
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz365
      Issue No: Vol. 35, No. 14 (2019)
  • SCRIBER: accurate and partner type-specific prediction of protein-binding
           residues from protein sequences
    • Authors: Zhang J; Kurgan L.
      Abstract: Motivation: Accurate prediction of protein-binding residues (PBRs) enhances our understanding of the molecular-level rules governing protein–protein interactions, helps protein–protein docking and facilitates the annotation of protein functions. Recent studies show that current sequence-based predictors of PBRs severely cross-predict residues that interact with other types of protein partners (e.g. RNA and DNA) as PBRs. Moreover, these methods are relatively slow, prohibiting genome-scale use. Results: We propose a novel, accurate and fast sequence-based predictor of PBRs that minimizes cross-predictions. Our SCRIBER (SeleCtive pRoteIn-Binding rEsidue pRedictor) method takes advantage of three innovations: a comprehensive dataset that covers multiple types of binding residues, novel types of inputs that are relevant to the prediction of PBRs, and an architecture that is tailored to reduce cross-predictions. The dataset includes complete protein chains and offers improved coverage of binding annotations that are transferred from multiple protein–protein complexes. We utilize an innovative two-layer architecture, where the first layer generates a prediction of protein-binding, RNA-binding, DNA-binding and small-ligand-binding residues. The second layer re-predicts PBRs by reducing the overlap between PBRs and the other types of binding residues produced in the first layer. Empirical tests on an independent test dataset reveal that SCRIBER significantly outperforms current predictors and that all three innovations contribute to its high predictive performance. SCRIBER reduces cross-predictions by between 41% and 69%, and our conservative estimates show that it is at least 3 times faster. We provide putative PBRs produced by SCRIBER for the entire human proteome and use these results to hypothesize that about 14% of currently known human protein domains bind proteins. Availability and implementation: The SCRIBER webserver is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz324
      Issue No: Vol. 35, No. 14 (2019)
  • Learning signaling networks from combinatorial perturbations by exploiting
           siRNA off-target effects
    • Authors: Tiuryn J; Szczurek E.
      Abstract: Motivation: Perturbation experiments constitute the central means to study cellular networks. Several confounding factors complicate computational modeling of signaling networks from this data. First, the technique of RNA interference (RNAi), designed and commonly used to knock down specific genes, suffers from off-target effects; as a result, each experiment is a combinatorial perturbation of multiple genes. Second, the perturbations propagate along unknown connections in the signaling network. Once the signal is blocked by perturbation, proteins downstream of the targeted proteins also become inactivated. Finally, all perturbed network members, either directly targeted by the experiment or reached by propagation in the network, contribute to the observed effect, in either a positive or negative manner. One of the key questions of computational inference of signaling networks from such data is: how many, and what combinations of, perturbations are required to uniquely and accurately infer the model? Results: Here, we introduce an enhanced version of linear effects models (LEMs), which extends the original by accounting for both negative and positive contributions of the perturbed network proteins to the observed phenotype. We prove that the enhanced LEMs are identified from data measured under perturbations of all single, pairs and triplets of network proteins. For small networks of up to five nodes, only perturbations of singles and pairs of proteins are required for identifiability. Extensive simulations demonstrate that enhanced LEMs achieve excellent accuracy of parameter estimation and network structure learning, outperforming the previous version on realistic data. LEMs applied to Bartonella henselae infection RNAi screening data identified known interactions between eight nodes of the infection network, confirming the high specificity of our model, and suggested one new interaction. Availability and implementation: . Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz334
      Issue No: Vol. 35, No. 14 (2019)
  • TideHunter: efficient and sensitive tandem repeat detection from noisy
           long-reads using seed-and-chain
    • Authors: Gao Y; Liu B, Wang Y, et al.
      Abstract: Motivation: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long reads up to tens of kilobases, but with high error rates. In order to reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. Linear products of the RCA contain multiple tandem copies of the template molecule. By integrating additional in silico processing steps, these tandem sequences can be collapsed into a consensus sequence with a higher accuracy than the original raw reads. Existing pipelines using alignment-based methods to discover the tandem repeat patterns from the long reads are either inefficient or lack sensitivity. Results: We present a novel tandem repeat detection and consensus calling tool, TideHunter, to efficiently discover tandem repeat patterns and generate high-quality consensus sequences from amplified tandemly repeated long-read sequencing data. TideHunter works with noisy long reads (PacBio and ONT) at error rates of up to 20% and has no limitation on the maximal repeat pattern size. We benchmarked TideHunter using simulated and real datasets with varying error rates and repeat pattern sizes. TideHunter is tens of times faster than state-of-the-art methods and has higher sensitivity and accuracy. Availability and implementation: TideHunter is written in C; it is open source and is available at
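The core task, detecting the tandem repeat period in a noisy read, can be illustrated with a simple error-tolerant self-comparison scan (our own sketch; TideHunter uses a seed-and-chain strategy and also builds a consensus):

```python
def estimate_period(read, max_period=50, min_period=3, min_identity=0.8):
    """Estimate the tandem-repeat unit length of a noisy read: the smallest
    self-alignment shift whose identity clears a threshold. Taking the
    smallest qualifying shift avoids reporting multiples of the true period."""
    for p in range(min_period, min(max_period, len(read) // 2) + 1):
        matches = sum(read[i] == read[i + p] for i in range(len(read) - p))
        if matches / (len(read) - p) >= min_identity:
            return p
    return None

unit = "ACGTTGCA"
read = (unit * 6)[:45]              # RCA-like read: ~5.6 tandem copies
read = read[:10] + "T" + read[11:]  # inject one sequencing error
print(estimate_period(read))   # -> 8, the length of the repeated unit
```

Once the period is known, columns of the implied copies can be combined by majority vote into a consensus that is more accurate than the raw read.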
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz376
      Issue No: Vol. 35, No. 14 (2019)
  • Bayesian metabolic flux analysis reveals intracellular flux couplings
    • Authors: Heinonen M; Osmala M, Mannerström H, et al.
      Abstract: Motivation: Metabolic flux balance analysis (FBA) is a standard tool for analyzing metabolic reaction rates compatible with measurements, steady state and the metabolic reaction network stoichiometry. Flux analysis methods commonly place model assumptions on fluxes due to the convenience of formulating the problem as a linear programming model, while many methods do not consider the inherent uncertainty in flux estimates. Results: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms, and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. 13C) flux measurements, steady-state assumptions and objective function assumptions. The Bayesian model couples all fluxes jointly together in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as FBA. Our experiments indicate that we can characterize the genome-scale flux covariances, reveal flux couplings, and determine more intracellular unobserved fluxes in Clostridium acetobutylicum from 13C data than flux variability analysis. Availability and implementation: The COBRA-compatible software is available at . Supplementary information: Supplementary data are available at Bioinformatics online.
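The idea of a flux posterior constrained by steady state can be illustrated with a deliberately crude rejection sampler on a three-reaction toy pathway (a conceptual sketch only, with invented names; the paper's model is an analytic truncated multivariate distribution, not rejection sampling):

```python
import random

def sample_fluxes(stoich, bounds, n=100000, tol=0.3, seed=1):
    """Toy flux sampler: draw flux vectors uniformly within bounds and keep
    those approximately satisfying steady state (S.v = 0)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        v = [rng.uniform(lo, hi) for lo, hi in bounds]
        if all(abs(sum(c * x for c, x in zip(row, v))) <= tol for row in stoich):
            kept.append(v)
    return kept

def corr(x, y):
    """Pearson correlation, to expose flux couplings across samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Linear pathway: uptake -> A, A -> B, B -> secretion; mass balance couples all fluxes.
S = [[1, -1, 0],   # metabolite A
     [0, 1, -1]]   # metabolite B
samples = sample_fluxes(S, bounds=[(0, 10)] * 3)
v_in = [v[0] for v in samples]
v_out = [v[2] for v in samples]
print(len(samples), round(corr(v_in, v_out), 2))   # uptake and secretion are tightly coupled
```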
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz315
      Issue No: Vol. 35, No. 14 (2019)
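For contrast with the Bayesian treatment above, the conventional point-estimate FBA it replaces can be written as a small linear program: maximize an objective flux subject to the steady-state constraint S v = 0 and flux bounds. The three-reaction toy network below (uptake, conversion, export) is invented for illustration:

```python
# Toy flux balance analysis: the point-estimate method that the abstract's
# Bayesian approach generalizes. Network: -> A (v1), A -> B (v2), B -> (v3).
import numpy as np
from scipy.optimize import linprog

S = np.array([[1, -1,  0],   # metabolite A: produced by v1, consumed by v2
              [0,  1, -1]])  # metabolite B: produced by v2, consumed by v3

res = linprog(c=[0, 0, -1],          # maximize v3 (linprog minimizes, so negate)
              A_eq=S, b_eq=[0, 0],   # steady state: S v = 0
              bounds=[(0, 10)] * 3)  # every flux capped at 10
```

A Bayesian method in the spirit of the abstract would instead place a posterior distribution over the whole flux vector v (a truncated multivariate distribution) rather than returning this single optimum.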
  • Unsupervised segmentation of mass spectrometric ion images characterizes
           morphology of tissues
    • Authors: Guo D; Bemis K, Rawlins C, et al.
      Abstract: Motivation: Mass spectrometry imaging (MSI) characterizes the spatial distribution of ions in complex biological samples such as tissues. Since many tissues have complex morphology, treatments and conditions often affect the spatial distribution of the ions in morphology-specific ways. Evaluating the selectivity and the specificity of ion localization and regulation across morphology types is biologically important. However, MSI lacks algorithms for segmenting images at both single-ion and spatial resolution. Results: This article contributes the spatial Dirichlet Gaussian mixture model (spatial-DGMM), an algorithm and a workflow for the analysis of MSI experiments that detects components of single-ion images with homogeneous spatial composition. The approach extends DGMMs to account for the spatial structure of MSI. Evaluations on simulated and experimental datasets with diverse MSI workflows demonstrated that spatial-DGMM accurately segments ion images, and can distinguish ions with homogeneous and heterogeneous spatial distributions. We also demonstrated that the extracted spatial information is useful for downstream analyses, such as detecting morphology-specific ions, finding groups of ions with similar spatial patterns, and detecting changes in the chemical composition of tissues between conditions. Availability and implementation: The data and code are freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz345
      Issue No: Vol. 35, No. 14 (2019)
  • Collaborative intra-tumor heterogeneity detection
    • Authors: Khakabimamaghani S; Malikic S, Tang J, et al.
      Abstract: Motivation: Despite the remarkable advances in sequencing and computational techniques, noise in the data and the complexity of the underlying biological mechanisms make deconvolution of the phylogenetic relationships between cancer mutations difficult. In addition, the majority of the existing datasets consist of bulk sequencing data of a single tumor sample per individual. Accurate inference of the phylogenetic order of mutations is particularly challenging in these cases, and the existing methods face several theoretical limitations. To overcome these limitations, new methods are required for integrating and harnessing the full potential of the existing data. Results: We introduce a method called Hintra for intra-tumor heterogeneity detection. Hintra integrates sequencing data for a cohort of tumors and infers a tumor phylogeny for each individual based on the evolutionary information shared between different tumors. Through an iterative process, Hintra learns the repeating evolutionary patterns and uses this information to resolve the phylogenetic ambiguities of individual tumors. The results of synthetic experiments show improved performance compared to two state-of-the-art methods. The experimental results with a recent breast cancer dataset are consistent with the existing knowledge and provide potentially interesting findings. Availability and implementation: The source code for Hintra is freely available.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz355
      Issue No: Vol. 35, No. 14 (2019)
  • Adversarial domain adaptation for cross data source macromolecule in situ
           structural classification in cellular electron cryo-tomograms
    • Authors: Lin R; Zeng X, Kitani K, et al.
      Abstract: Motivation: Since 2017, an increasing amount of attention has been paid to supervised deep learning-based macromolecule in situ structural classification (i.e. subtomogram classification) in cellular electron cryo-tomography (CECT), due to the substantially higher scalability of deep learning. However, the success of such a supervised approach relies heavily on the availability of large amounts of labeled training data. For CECT, creating valid training data from the same data source as the prediction data is usually laborious and computationally intensive. It would be beneficial to have training data from a separate data source where the annotation is readily available or can be performed in a high-throughput fashion. However, cross data source prediction is often biased due to the different image intensity distributions (a.k.a. domain shift). Results: We adapt a deep learning-based adversarial domain adaptation (3D-ADA) method to address the domain shift problem in CECT data analysis. 3D-ADA first uses a source domain feature extractor to extract discriminative features from the training data as the input to a classifier. Then it adversarially trains a target domain feature extractor to reduce the distribution differences of the extracted features between training and prediction data. As a result, the same classifier can be directly applied to the prediction data. We tested 3D-ADA on both experimental and realistically simulated subtomogram datasets under different imaging conditions. 3D-ADA stably improved cross data source prediction and outperformed two popular domain adaptation methods. Furthermore, we demonstrate that 3D-ADA can improve cross data source recovery of novel macromolecular structures. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz364
      Issue No: Vol. 35, No. 14 (2019)
  • LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic
           programming and beam search
    • Authors: Huang L; Zhang H, Deng D, et al.
      Abstract: Motivation: Predicting the secondary structure of a ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results: We present a novel alternative O(n3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5′-to-3′) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models. Availability and implementation: Our source code and web server are freely available (web server sequence limit: 100 000 nt). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz375
      Issue No: Vol. 35, No. 14 (2019)
  • Block HSIC Lasso: model-free biomarker detection for ultra-high
           dimensional data
    • Authors: Climente-González H; Azencott C, Kaski S, et al.
      Abstract: Motivation: Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including, among others, lack of parsimony, non-convexity and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that does not present these drawbacks. Results: We compare block HSIC Lasso to other state-of-the-art feature selection techniques in both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons. Availability and implementation: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI; source code is available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz333
      Issue No: Vol. 35, No. 14 (2019)
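The dependence measure at the heart of HSIC Lasso can be sketched in a few lines. Below is the standard (biased) HSIC estimator with Gaussian kernels, tr(K H L H) / (n-1)^2; the bandwidth and test data are arbitrary choices for the sketch, not the paper's settings:

```python
# Minimal biased HSIC estimator: the kernel dependence measure that
# HSIC Lasso builds its feature selection objective from.
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
dep = hsic(x, x**2)                     # nonlinearly dependent pair
indep = hsic(x, rng.normal(size=200))   # independent pair
```

Because the kernel is nonlinear, y = x**2 registers as strongly dependent on x even though their linear correlation is near zero; this is the "model-free" property the abstract refers to.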
  • Large-scale inference of competing endogenous RNA networks with sparse
           partial correlation
    • Authors: List M; Dehghani Amirabad A, Kostka D, et al.
      Abstract: Motivation: MicroRNAs (miRNAs) are important non-coding post-transcriptional regulators that are involved in many biological processes and human diseases. Individual miRNAs may regulate hundreds of genes, giving rise to a complex gene regulatory network in which transcripts carrying miRNA binding sites act as competing endogenous RNAs (ceRNAs). Several methods for the analysis of ceRNA interactions exist, but these often do not adjust for statistical confounders or address the problem that more than one miRNA interacts with a target transcript. Results: We present SPONGE, a method for the fast construction of ceRNA networks. SPONGE uses 'multiple sensitivity correlation', a newly defined measure for which we can estimate a distribution under a null hypothesis. SPONGE can accurately quantify the contribution of multiple miRNAs to a ceRNA interaction with a probabilistic model that addresses previously neglected confounding factors and allows fast P-value calculation, thus outperforming existing approaches. We applied SPONGE to paired miRNA and gene expression data from The Cancer Genome Atlas to study global effects of miRNA-mediated cross-talk. Our results highlight already established and novel protein-coding and non-coding ceRNAs that could serve as biomarkers in cancer. Availability and implementation: SPONGE is available as an R/Bioconductor package (doi: 10.18129/B9.bioc.SPONGE). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.18129/B9.bioc.SPONGE
      Issue No: Vol. 35, No. 14 (2019)
  • A joint method for marker-free alignment of tilt series in electron tomography
    • Authors: Han R; Bao Z, Zeng X, et al.
      Abstract: Motivation: Electron tomography (ET) is a widely used technology for 3D macromolecular structure reconstruction. To obtain a satisfactory tomogram reconstruction, several key processes are involved, one of which is the calibration of the projection parameters of the tilt series. Although fiducial marker-based alignment of tilt series has been well studied, marker-free alignment remains a challenge, as it requires identifying and tracking identical objects (landmarks) across different projections. However, the tracking of these landmarks is usually affected by the pixel density (intensity) change caused by the geometry difference between views. The tracked landmarks are used to determine the projection parameters; meanwhile, different projection parameters also affect the localization of landmarks. Currently, there is no alignment method that takes the interrelationship between the projection parameters and the landmarks into account. Results: Here, we propose a novel joint method for marker-free alignment of tilt series in ET that utilizes the information underlying the interrelationship between the projection model and the landmarks. The proposed method is the first joint solution that combines the extrinsic (track-based) alignment and the intrinsic (intensity-based) alignment, in which the localization of landmarks and the projection parameters keep refining each other until convergence. This iterative approach makes our solution robust to different initial parameters and extreme geometric changes, which ensures a better reconstruction for marker-free ET. Comprehensive experimental results on three real datasets show that our new method achieved a significant improvement in alignment accuracy and reconstruction quality compared to the state-of-the-art methods. Availability and implementation: The main program is freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz323
      Issue No: Vol. 35, No. 14 (2019)
  • TreeMerge: a new method for improving the scalability of species tree
           estimation methods
    • Authors: Molloy E; Warnow T.
      Abstract: Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species. Results: Here we present a new method called 'TreeMerge' that improves on NJMerge in two ways: it is guaranteed to return a tree, and it has dramatically faster running time within the same divide-and-conquer framework, requiring only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets that they would otherwise fail on, when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz344
      Issue No: Vol. 35, No. 14 (2019)
  • Locality-sensitive hashing for the edit distance
    • Authors: Marçais G; DeBlasio D, Pandey P, et al.
      Abstract: Motivation: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively little computation, the pairs of sequences that do not have a high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy in practice. Results: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of OMH as a gapped LSH. Availability and implementation: The code to generate the results is freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz354
      Issue No: Vol. 35, No. 14 (2019)
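A toy rendition of the Order Min Hash idea helps make the abstract concrete: apply minHash over (k-mer, occurrence-rank) pairs, then order the selected k-mers by their positions in the sequence, so the resulting sketch reflects k-mer order as well as k-mer content. All parameter values here are illustrative, not the paper's:

```python
# Toy Order-Min-Hash-style sketch: minHash selection of k-mers, but the
# selected k-mers are emitted in sequence order, making the sketch
# sensitive to k-mer order, not just k-mer content.
import hashlib

def omh_sketch(seq, k=3, n=4):
    # Pair each k-mer with its occurrence rank so repeated k-mers stay distinct.
    seen, hashed = {}, []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        rank = seen.get(kmer, 0)
        seen[kmer] = rank + 1
        h = hashlib.sha1(f"{kmer}:{rank}".encode()).hexdigest()
        hashed.append((h, pos, kmer))
    smallest = sorted(hashed)[:n]        # minHash step: keep the n smallest hashes
    smallest.sort(key=lambda t: t[1])    # OMH twist: order them by position
    return tuple(kmer for _, _, kmer in smallest)

a = omh_sketch("ACGTACGGTCA")
b = omh_sketch("ACGTACGGTCA")
```

Two sequences are then compared by the agreement of their sketches; the positional ordering is what lets the estimate track edit distance rather than plain Jaccard similarity.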
  • NPS: scoring and evaluating the statistical significance of peptidic
           natural product–spectrum matches
    • Authors: Tagirdzhanov A; Shlemov A, Gurevich A.
      Abstract: Motivation: Peptidic natural products (PNPs) are considered a promising compound class with many applications in medicine. Recently developed mass spectrometry-based pipelines are transforming PNP discovery into a high-throughput technology. However, the current computational methods for PNP identification via database search of mass spectra are still in their infancy and could be substantially improved. Results: Here we present NPS, a statistical learning-based approach for scoring PNP–spectrum matches. We incorporated NPS into two leading PNP discovery tools and benchmarked them on millions of natural product mass spectra. The results demonstrate a more than 45% increase in the number of identified spectra and 20% more PNPs found at a false discovery rate of 1%. Availability and implementation: NPS is available as a command line tool and as a web application. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz374
      Issue No: Vol. 35, No. 14 (2019)
  • Estimating the predictability of cancer evolution
    • Authors: Hosseini S; Diaz-Uriarte R, Markowetz F, et al.
      Abstract: Motivation: How predictable is the evolution of cancer? This fundamental question is of immense relevance for the diagnosis, prognosis and treatment of cancer. Evolutionary biologists have approached the question of predictability based on the underlying fitness landscape. However, empirical fitness landscapes of tumor cells are impossible to determine in vivo. Thus, in order to quantify the predictability of cancer evolution, alternative approaches are required that circumvent the need for fitness landscapes. Results: We developed a computational method based on conjunctive Bayesian networks (CBNs) to quantify the predictability of cancer evolution directly from mutational data, without the need for measuring or estimating fitness. Using simulated data derived from >200 different fitness landscapes, we show that our CBN-based notion of evolutionary predictability strongly correlates with the classical notion of predictability based on fitness landscapes under the strong selection weak mutation assumption. The statistical framework enables robust and scalable quantification of evolutionary predictability. We applied our approach to driver mutation data from the TCGA and the MSK-IMPACT clinical cohorts to systematically compare the predictability of 15 different cancer types. We found that cancer evolution is remarkably predictable, as only a small fraction of evolutionary trajectories are feasible during cancer progression. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz332
      Issue No: Vol. 35, No. 14 (2019)
  • Large-scale mammalian genome rearrangements coincide with chromatin interactions
    • Authors: Swenson K; Blanchette M.
      Abstract: Motivation: Genome rearrangements drastically change gene order along great stretches of a chromosome. There has been initial evidence that these apparently non-local events in the 1D sense may have breakpoints that are close in the 3D sense. We harness the power of the Double Cut and Join model of genome rearrangement, along with Hi-C chromosome conformation capture data, to test this hypothesis between human and mouse. Results: We devise novel statistical tests that show that, indeed, rearrangement scenarios that transform the human into the mouse gene order are enriched for pairs of breakpoints that have frequent chromosome interactions. This is observed for both intra-chromosomal breakpoint pairs and inter-chromosomal pairs. For intra-chromosomal rearrangements, the enrichment exists from close (<20 Mb) to very distant (100 Mb) pairs. Further, the pattern exists across multiple cell lines in Hi-C data produced by different laboratories and at different stages of the cell cycle. We show that similarities in the contact frequencies between these many experiments contribute to the enrichment. We conclude that either (i) rearrangements usually involve breakpoints that are spatially close or (ii) there is selection against rearrangements that act on spatially distant breakpoints. Availability and implementation: Our pipeline is freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz343
      Issue No: Vol. 35, No. 14 (2019)
  • Predicting drug-induced transcriptome responses of a wide range of human
           cell lines by a novel tensor-train decomposition algorithm
    • Authors: Iwata M; Yuan L, Zhao Q, et al.
      Abstract: Motivation: Genome-wide identification of the transcriptomic responses of human cell lines to drug treatments is a challenging issue in medical and pharmaceutical research. However, drug-induced gene expression profiles are largely unknown and unobserved for all combinations of drugs and human cell lines, which is a serious obstacle in practical applications. Results: Here, we developed a novel computational method to predict the unknown parts of drug-induced gene expression profiles for various human cell lines and to predict new drug therapeutic indications for a wide range of diseases. We proposed a tensor-train weighted optimization (TT-WOPT) algorithm to predict the potential values of the unknown parts in tensor-structured gene expression data. Our results revealed that the proposed TT-WOPT algorithm can accurately reconstruct drug-induced gene expression data for a range of human cell lines in the Library of Integrated Network-based Cellular Signatures. The results also revealed that, in comparison with the use of original gene expression profiles, the use of imputed gene expression profiles improved the accuracy of drug repositioning. We also performed a comprehensive prediction of drug indications for diseases with gene expression profiles, which suggested many potential drug indications that were not predicted by previous approaches. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz313
      Issue No: Vol. 35, No. 14 (2019)
  • Rotation equivariant and invariant neural networks for microscopy image analysis
    • Authors: Chidester B; Zhou T, Do M, et al.
      Abstract: Motivation: Neural networks have been widely used to analyze high-throughput microscopy images. However, the performance of neural networks can be significantly improved by encoding known invariances for particular tasks. Highly relevant to the goal of automated cell phenotyping from microscopy image data is rotation invariance. Here we consider the application of two schemes for encoding rotation equivariance and invariance in a convolutional neural network, namely, the group-equivariant CNN (G-CNN) and a new architecture with simple, efficient conic convolution, for classifying microscopy images. We additionally integrate the 2D discrete Fourier transform (2D-DFT) as an effective means of encoding global rotational invariance. We call our new method the Conic Convolution and DFT Network (CFNet). Results: We evaluated the efficacy of CFNet and the G-CNN compared to a standard CNN for several different image classification tasks, including simulated and real microscopy images of subcellular protein localization, and demonstrated improved performance. We believe CFNet has the potential to improve many high-throughput microscopy image analysis applications. Availability and implementation: Source code of CFNet is freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz353
      Issue No: Vol. 35, No. 14 (2019)
  • GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs
    • Authors: Shrikumar A; Prakash E, Kundaje A.
      Abstract: Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in silico mutagenesis (ISM) and SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines. Availability and implementation: Code and example notebooks to reproduce the results are freely available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz322
      Issue No: Vol. 35, No. 14 (2019)
  • Comprehensive evaluation of transcriptome-based cell-type quantification
           methods for immuno-oncology
    • Authors: Sturm G; Finotello F, Petitprez F, et al.
      Abstract: Motivation: The composition and density of immune cells in the tumor microenvironment (TME) profoundly influence tumor progression and the success of anti-cancer therapies. Flow cytometry, immunohistochemistry staining and single-cell sequencing are often unavailable, so that we must rely on computational methods to estimate the immune-cell composition from bulk RNA-sequencing (RNA-seq) data. Various methods have been proposed recently, yet their capabilities and limitations have not been evaluated systematically. A general guideline leading the research community through cell-type deconvolution is missing. Results: We developed a systematic approach for benchmarking such computational methods and assessed the accuracy of tools at estimating nine different immune and stromal cell types from bulk RNA-seq samples. We used a single-cell RNA-seq dataset of ∼11 000 cells from the TME to simulate bulk samples of known cell-type proportions, and validated the results using independent, publicly available gold-standard estimates. This allowed us to analyze and condense the results of more than a hundred thousand predictions to provide an exhaustive evaluation across seven computational methods over nine cell types and ∼1800 samples from five simulated and real-world datasets. We demonstrate that computational deconvolution performs at high accuracy for well-defined cell-type signatures and propose how fuzzy cell-type signatures can be improved. We suggest that future efforts should be dedicated to refining cell population definitions and finding reliable signatures. Availability and implementation: A Snakemake pipeline to reproduce the benchmark is freely available, and an R package allows the community to perform integrated deconvolution using different methods. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz363
      Issue No: Vol. 35, No. 14 (2019)
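The shared core of signature-based deconvolution methods like those benchmarked above can be sketched as a non-negative least-squares fit of cell-type proportions against a signature matrix of marker-gene expression. The 4-gene, 3-cell-type signature matrix below is fabricated purely for illustration:

```python
# Toy bulk deconvolution: recover cell-type proportions from a mixed
# expression profile via non-negative least squares against a signature.
import numpy as np
from scipy.optimize import nnls

signature = np.array([[10., 0., 1.],   # rows: genes, columns: cell types
                      [ 0., 8., 1.],
                      [ 2., 1., 9.],
                      [ 5., 5., 5.]])
true_props = np.array([0.5, 0.3, 0.2])
bulk = signature @ true_props          # noise-free simulated bulk sample

est, _ = nnls(signature, bulk)         # non-negativity keeps proportions valid
est /= est.sum()                       # renormalize to proportions
```

Real methods differ mainly in how the signature matrix is built and regularized; on noisy data with fuzzy signatures, this simple fit degrades, which is the failure mode the benchmark above quantifies.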
  • Representation transfer for differentially private drug sensitivity prediction
    • Authors: Niinimäki T; Heikkilä M, Honkela A, et al.
      Abstract: Motivation: Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymization strategies fail to provide a sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) must also avoid leaking private information. Results: We study an approach that uses a large public dataset of a similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification and drug sensitivity prediction. The experiments demonstrate significant benefit from all representation learning methods, with variational autoencoders providing the most accurate predictions most often. Our results significantly improve over the previous state of the art in the accuracy of differentially private drug sensitivity prediction. Availability and implementation: Code used in the experiments is freely available.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz373
      Issue No: Vol. 35, No. 14 (2019)
  • Drug repositioning based on bounded nuclear norm regularization
    • Authors: Yang M; Luo H, Li Y, et al.
      Abstract: Motivation: Computational drug repositioning is a cost-effective strategy to identify novel indications for existing drugs. Drug repositioning is often modeled as a recommendation-system problem. Taking advantage of the known drug–disease associations, the objective of the recommendation system is to identify new treatments by filling out the unknown entries in the drug–disease association matrix; this is known as matrix completion. Underpinned by the fact that common molecular pathways contribute to many different diseases, the recommendation system assumes that the underlying latent factors determining drug–disease associations are highly correlated. In other words, the drug–disease matrix to be completed is low-rank. Accordingly, matrix completion algorithms that efficiently construct low-rank drug–disease matrix approximations consistent with known associations can be of immense help in discovering novel drug–disease associations. Results: In this article, we propose a bounded nuclear norm regularization (BNNR) method to complete the drug–disease matrix under the low-rank assumption. Instead of strictly fitting the known elements, BNNR is designed to tolerate noisy drug–drug and disease–disease similarities by incorporating a regularization term that balances the approximation error and the rank properties. Moreover, additional constraints are incorporated into BNNR to ensure that all predicted matrix entries lie within a specific interval. BNNR is carried out on the adjacency matrix of a heterogeneous drug–disease network, which integrates the drug–drug, drug–disease and disease–disease networks. It not only makes full use of available drug, disease and association information, but is also capable of dealing with the cold-start problem naturally. Our computational results show that BNNR yields higher drug–disease association prediction accuracy than current state-of-the-art methods. The most significant gain is in prediction precision, measured as the fraction of positive predictions that are truly positive, which is particularly useful in drug-design practice. Case studies also confirm the accuracy and reliability of BNNR. Availability and implementation: The code of BNNR is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
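As a rough illustration of the bounded low-rank completion idea (not the authors' BNNR solver, which is ADMM-based), the following numpy sketch alternates singular-value shrinkage with re-imposing the observed entries and the [0, 1] bound; the function name `complete_bounded` and all parameter choices are ours:

```python
import numpy as np

def complete_bounded(M, mask, tau=0.2, iters=300):
    # Minimal sketch: alternate a singular-value shrinkage step (pushing the
    # matrix toward low rank) with re-imposing the known entries and the
    # [0, 1] bound constraint. NOT the authors' ADMM-based BNNR solver.
    X = M * mask
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt  # nuclear-norm shrinkage
        X = np.where(mask, M, X)                 # keep observed associations
        X = np.clip(X, 0.0, 1.0)                 # bounded entry values
    return X
```

On a small low-rank toy matrix this fills hidden entries approximately; BNNR additionally tunes the trade-off between approximation error and rank via its regularization weight.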
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz331
      Issue No: Vol. 35, No. 14 (2019)
  • Summarizing the solution space in tumor phylogeny inference by multiple
           consensus trees
    • Authors: Aguse N; Qi Y, El-Kebir M.
      Abstract: Motivation: Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. Results: We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP on simulated data in a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. Supplementary information: Supplementary data are available at Bioinformatics online.
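The consensus step can be pictured with a toy majority vote over parent assignments (our simplification, not the paper's algorithm: MCT solves clustering and consensus jointly via MILP, and a plain majority vote need not even yield a valid tree):

```python
from collections import Counter

def parent_consensus(trees):
    # trees: list of dicts mapping child -> parent (rooted trees on a shared
    # label set). For each child, take the most frequent parent across trees.
    # Caveat: the voted result is not guaranteed to be acyclic.
    children = set().union(*trees)
    return {c: Counter(t[c] for t in trees if c in t).most_common(1)[0][0]
            for c in children}
```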
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz312
      Issue No: Vol. 35, No. 14 (2019)
  • Deep learning with multimodal representation for pancancer prognosis
    • Authors: Cheerla A; Gevaert O.
      Abstract: Motivation: Estimating the future course of patients with cancer lesions is invaluable to physicians; however, current clinical methods fail to effectively use the vast amount of multimodal data that is available for cancer patients. To tackle this problem, we constructed a multimodal neural network-based model to predict the survival of patients for 20 different cancer types using clinical data, mRNA expression data, microRNA expression data and histopathology whole-slide images (WSIs). We developed an unsupervised encoder to compress these four data modalities into a single feature vector for each patient, handling missing data through a resilient, multimodal dropout method. Encoding methods were tailored to each data type: deep highway networks to extract features from clinical and genomic data, and convolutional neural networks to extract features from WSIs. Results: We used pancancer data to train these feature encodings and predict single-cancer and pancancer overall survival, achieving a C-index of 0.78 overall. This work shows that it is possible to build a pancancer model for prognosis that also predicts prognosis in single cancer sites. Furthermore, our model handles multiple data modalities, efficiently analyzes WSIs and flexibly represents patient multimodal data in an unsupervised, informative representation. We thus present a powerful automated tool to accurately determine prognosis, a key step towards personalized treatment for cancer patients.
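The multimodal dropout idea, dropping whole modalities during training so the encoder learns to tolerate missing data, can be sketched as follows (a hypothetical numpy version; the paper's implementation and rescaling details may differ):

```python
import numpy as np

def multimodal_dropout(modalities, p=0.25, rng=None):
    # Zero out entire modality feature vectors at random (as if missing)
    # and rescale the survivors, mimicking missing-data patterns at
    # training time. Illustrative sketch only.
    rng = rng or np.random.default_rng(0)
    keep = rng.random(len(modalities)) >= p
    if not keep.any():
        keep[rng.integers(len(modalities))] = True  # always keep one modality
    scale = len(modalities) / keep.sum()
    return np.concatenate([m * scale if k else np.zeros_like(m)
                           for k, m in zip(keep, modalities)])
```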
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz342
      Issue No: Vol. 35, No. 14 (2019)
  • Integrating regulatory DNA sequence and gene expression to predict
           genome-wide chromatin accessibility across cellular contexts
    • Authors: Nair S; Kim D, Perricone J, et al.
      Abstract: Motivation: Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. Results: We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin-accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. Availability and implementation: The code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz352
      Issue No: Vol. 35, No. 14 (2019)
  • PRECISE: a domain adaptation approach to transfer predictors of drug
           response from pre-clinical models to tumors
    • Authors: Mourragui S; Loog M, van de Wiel M, et al.
      Abstract: Motivation: Cell lines and patient-derived xenografts (PDXs) have been used extensively to understand the molecular underpinnings of cancer. While core biological processes are typically conserved, these models also show important differences compared to human tumors, hampering the translation of findings from pre-clinical models to the human setting. In particular, employing drug response predictors generated on data derived from pre-clinical models to predict patient response remains a challenging task. As very large drug response datasets have been collected for pre-clinical models, and patient drug response data are often lacking, there is an urgent need for methods that efficiently transfer drug response predictors from pre-clinical models to the human setting. Results: We show that cell lines and PDXs share common characteristics and processes with human tumors. We quantify this similarity and show that a regression model cannot simply be trained on cell lines or PDXs and then applied to tumors. We developed PRECISE, a novel methodology based on domain adaptation that captures the common information shared amongst pre-clinical models and human tumors in a consensus representation. Employing this representation, we train predictors of drug response on pre-clinical data and apply these predictors to stratify human tumors. We show that the resulting domain-invariant predictors show a small reduction in predictive performance in the pre-clinical domain but, importantly, reliably recover known associations between independent biomarkers and their companion drugs on human tumors. Availability and implementation: PRECISE and the scripts for running our experiments are available on our GitHub page. Supplementary information: Supplementary data are available at Bioinformatics online.
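A minimal sketch of building a consensus representation by aligning the PCA subspaces of a pre-clinical and a tumor expression matrix through their principal vectors (our simplification of the domain-adaptation idea, not the released PRECISE package; the function name is ours):

```python
import numpy as np

def principal_vectors(Xs, Xt, k=3):
    # Top-k PCA loadings of source (cell lines/PDXs) and target (tumors),
    # then the SVD of their inner products yields paired directions ordered
    # by cosine similarity between the two domains. Illustrative only.
    Vs = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)[2][:k]  # k x genes
    Vt = np.linalg.svd(Xt - Xt.mean(0), full_matrices=False)[2][:k]
    U, cos, Wt = np.linalg.svd(Vs @ Vt.T)
    return U.T @ Vs, Wt @ Vt, cos  # aligned factor pairs + their cosines
```

When the two datasets are identical the cosines are all 1; drug response predictors would then be trained on the shared, high-cosine directions.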
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz372
      Issue No: Vol. 35, No. 14 (2019)
  • Selfish: discovery of differential chromatin interactions via a
           self-similarity measure
    • Authors: Ardakany A; Ay F, Lonardi S.
      Abstract: Motivation: High-throughput conformation capture experiments, such as Hi-C, provide genome-wide maps of chromatin interactions, enabling life scientists to investigate the role of the three-dimensional structure of genomes in gene regulation and other essential cellular functions. A fundamental problem in the analysis of Hi-C data is how to compare two contact maps derived from Hi-C experiments. Detecting similarities and differences between contact maps is critical in evaluating the reproducibility of replicate experiments and for identifying differential genomic regions with biological significance. Due to the complexity of chromatin conformations and the presence of technology-driven and sequence-specific biases, the comparative analysis of Hi-C data is analytically and computationally challenging. Results: We present a novel method called Selfish for the comparative analysis of Hi-C data that takes advantage of the structural self-similarity in contact maps. We define a novel self-similarity measure to design algorithms for (i) measuring reproducibility for Hi-C replicate experiments and (ii) finding differential chromatin interactions between two contact maps. Extensive experimental results on simulated and real data show that Selfish is more accurate and robust than state-of-the-art methods.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz362
      Issue No: Vol. 35, No. 14 (2019)
  • A statistical simulator scDesign for rational scRNA-seq experimental design
    • Authors: Li W; Li J.
      Abstract: Motivation: Single-cell RNA sequencing (scRNA-seq) has revolutionized the biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. Results: Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and the comparison of scRNA-seq computational methods based on specific research goals. Availability and implementation: We have implemented our method in the R package scDesign, which is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
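As a toy illustration of the kind of generative model such a simulator rests on, here is a hypothetical gamma-Poisson sketch with depth-dependent dropout; scDesign itself fits protocol- and cell-type-specific distributions to real data, which this sketch does not:

```python
import numpy as np

def simulate_counts(n_genes, n_cells, depth, rng=None):
    # Toy gamma-Poisson count simulator with expression-dependent dropout,
    # illustrating the depth/cell-number trade-off. Hypothetical parameters;
    # NOT scDesign's fitted model.
    rng = rng or np.random.default_rng(1)
    rel = rng.gamma(shape=0.5, scale=1.0, size=n_genes)   # relative expression
    lam = rel / rel.sum() * depth                         # expected reads/gene
    counts = rng.poisson(lam[:, None], size=(n_genes, n_cells))
    dropout = rng.random((n_genes, n_cells)) < np.exp(-lam)[:, None]
    return counts * ~dropout
```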
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz321
      Issue No: Vol. 35, No. 14 (2019)
  • DeepLigand: accurate prediction of MHC class I ligands using peptide embedding
    • Authors: Zeng H; Gifford D.
      Abstract: Motivation: The computational modeling of peptide display by class I major histocompatibility complexes (MHCs) is essential for peptide-based therapeutics design. Existing computational methods for peptide display focus on modeling the peptide-MHC-binding affinity. However, such models are not able to characterize the sequence features for the other cellular processes in the peptide display pathway that determine MHC ligand selection. Results: We introduce a semi-supervised model, DeepLigand, that outperforms the state-of-the-art models in MHC class I ligand prediction. DeepLigand combines a peptide language model and a peptide binding affinity predictor to score MHC class I peptide presentation. The peptide language model characterizes sequence features that correspond to secondary factors in MHC ligand selection other than binding affinity. The peptide embedding is learned by pre-training on natural ligands, and can discriminate between ligands and non-ligands even in the absence of binding affinity prediction. Although conventional affinity-based models fail to classify peptides with moderate affinities, DeepLigand discriminates ligands from non-ligands with consistently high accuracy. Availability and implementation: We make DeepLigand available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz330
      Issue No: Vol. 35, No. 14 (2019)
  • Fully-sensitive seed finding in sequence graphs using a hybrid index
    • Authors: Ghaffaari A; Marschall T.
      Abstract: Motivation: Sequence graphs are versatile data structures that can, for instance, represent the genetic variation found in a population and facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information, especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods. Results: We present the Pan-genome Seed Index (PSI), a fully sensitive hybrid method for seed finding which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and a whole human genome graph constructed from variants in the 1000 Genomes Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity. Availability and implementation: The C++ implementation is publicly available online.
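The seed-finding side can be pictured with a toy hash index over path sequences (our stand-in: PSI combines an index over selected graph paths with an index over the query reads, and uses succinct data structures rather than Python dicts):

```python
def build_seed_index(paths, k):
    # Toy path index: hash every k-mer of the selected path sequences to
    # its (path_id, offset) occurrences.
    idx = {}
    for pid, seq in enumerate(paths):
        for i in range(len(seq) - k + 1):
            idx.setdefault(seq[i:i + k], []).append((pid, i))
    return idx

def seed_hits(read, idx, k):
    # All exact seed matches of the read's k-mers against the path index,
    # as (read_offset, (path_id, path_offset)) pairs.
    return {(i, hit) for i in range(len(read) - k + 1)
            for hit in idx.get(read[i:i + k], [])}
```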
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz341
      Issue No: Vol. 35, No. 14 (2019)
  • Minnow: a principled framework for rapid simulation of dscRNA-seq data at
           the read level
    • Authors: Sarkar H; Srivastava A, Patro R.
      Abstract: Summary: With the advancement of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that result from such studies. For example, there exist methods for pseudo-time-series analysis, differential cell usage, cell-type detection and RNA velocity in single cells. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start from a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines for producing gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. Supplementary information: Supplementary data are available at Bioinformatics online.
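A toy sketch of the PCR-plus-UMI-deduplication effect that such a simulator must model at the sequence level (hypothetical; minnow additionally models barcode selection, fragmentation and sequencing errors, which break the perfect dedup shown here):

```python
import random

def amplify_and_dedup(molecules, cycles=3, rng=None):
    # Each (cell_barcode, umi, gene) molecule is duplicated with some
    # probability per PCR cycle; UMI dedup then collapses identical tuples.
    # With error-free amplification, dedup recovers the original molecules.
    rng = rng or random.Random(0)
    pool = list(molecules)
    for _ in range(cycles):
        pool += [m for m in pool if rng.random() < 0.8]  # amplification
    return len(pool), len(set(pool))
```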
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz351
      Issue No: Vol. 35, No. 14 (2019)
  • Controlling large Boolean networks with single-step perturbations
    • Authors: Baudin A; Paul S, Su C, et al.
      Abstract: Motivation: The control of Boolean networks has traditionally focused on strategies where perturbations are applied to the nodes of the network for an extended period of time. In this work, we study if and how a Boolean network can be controlled by perturbing a minimal set of nodes for a single step and letting the system evolve afterwards according to its original dynamics. More precisely, given a Boolean network (BN), we compute a minimal subset Cmin of the nodes such that the BN can be driven from any initial state in an attractor to another 'desired' attractor by perturbing some or all of the nodes of Cmin for a single step. This kind of control is attractive for biological systems because it is less time-consuming than traditional control strategies while also being financially more viable. However, due to the phenomenon of state-space explosion, computing such a minimal subset is computationally expensive, and an approach that deals with the entire network in one go does not scale well for large networks. Results: We develop a 'divide-and-conquer' approach that decomposes the network into smaller partitions, computes the minimal control on the projection of the attractors to these partitions and then composes the results to obtain Cmin for the whole network. We implement our method and test it on various real-life biological networks to demonstrate its applicability and efficiency. Supplementary information: Supplementary data are available at Bioinformatics online.
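For intuition, here is a brute-force version of single-step control on a toy synchronous BN; the paper's contribution is precisely avoiding this enumeration (which suffers from state-space explosion) via decomposition:

```python
from itertools import combinations, product

def step(state, fns):
    # Synchronous update: every node applies its Boolean function at once.
    return tuple(f(state) for f in fns)

def attractor(state, fns):
    # Iterate the deterministic dynamics until a state repeats; the cycle
    # from its first occurrence is the attractor.
    trace, seen = [], {}
    while state not in seen:
        seen[state] = len(trace)
        trace.append(state)
        state = step(state, fns)
    return frozenset(trace[seen[state]:])

def minimal_control(source, target_attr, fns):
    # Smallest node set whose one-step perturbation drives `source` into
    # `target_attr`. Brute force: toy-sized networks only.
    n = len(source)
    for size in range(n + 1):
        for nodes in combinations(range(n), size):
            for values in product((0, 1), repeat=size):
                s = list(source)
                for i, v in zip(nodes, values):
                    s[i] = v
                if attractor(tuple(s), fns) == target_attr:
                    return set(nodes)
```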
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz371
      Issue No: Vol. 35, No. 14 (2019)
  • Building large updatable colored de Bruijn graphs via merging
    • Authors: Muggli M; Alipanahi B, Boucher C.
      Abstract: Motivation: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, where both scalability and the ability to update the construction are needed. Results: We present a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph for each partition using an FM-index-based representation, and succinctly merging these representations to build a single graph. The last step, succinct merging, is the algorithmic challenge we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show that it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods, including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT, and illustrate that VariMerge is the only method capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to a dataset of this size or do not allow additions without reconstruction. Availability and implementation: VariMerge is available online under the GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.
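The merge semantics can be illustrated with a plain dictionary version of a colored de Bruijn graph (illustrative only; VariMerge's contribution is performing this merge succinctly on FM-index-based representations, which a dict does not capture):

```python
def colored_dbg(reads, k, color):
    # k-mer -> set of colors (samples/strains) in which it occurs.
    g = {}
    for r in reads:
        for i in range(len(r) - k + 1):
            g.setdefault(r[i:i + k], set()).add(color)
    return g

def merge(g1, g2):
    # Union the k-mer sets and their color annotations. Adding a new
    # sample is then just another merge, with no reconstruction.
    out = {km: set(cols) for km, cols in g1.items()}
    for km, cols in g2.items():
        out.setdefault(km, set()).update(cols)
    return out
```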
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz350
      Issue No: Vol. 35, No. 14 (2019)
  • Integrating read-based and population-based phasing for dense and accurate
           haplotyping of individual genomes
    • Authors: Bansal V.
      Abstract: Motivation: Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes; however, the resulting haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping. Results: In this paper, we describe a likelihood-based method to integrate short-range haplotype information from a population reference panel with the long-range haplotype information present in sequence reads from methods such as Hi-C, in order to assemble dense and highly accurate haplotypes for individual genomes. Our method leverages a statistical phasing method and a maximum spanning tree algorithm to determine the optimal second-order approximation of the population-based haplotype likelihood for an individual genome. The population-based likelihood is encoded using pseudo-reads, which are then used as input, along with the sequence reads, for haplotype assembly using an existing tool, HapCUT2. Using whole-genome Hi-C data for two human genomes (NA19240 and NA12878), we demonstrate that this integrated phasing method enables the phasing of 97–98% of variants, reduces switch error rates by 3- to 6-fold, and outperforms an existing method for combining phase information from sequence reads with population-based phasing. On Strand-seq data for NA12878, our method improves haplotype completeness from 71.4% to 94.6% and reduces the switch error rate 2-fold, demonstrating its utility for phasing using multiple sequencing technologies. Availability and implementation: Code and datasets are available online.
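The maximum spanning tree step can be sketched with Kruskal's algorithm over pairwise weights (a toy version; in the paper, edge weights come from the population-based likelihood and the kept pairs are emitted as pseudo-reads, neither of which is reproduced here):

```python
def max_spanning_tree(n, edges):
    # edges: (weight, u, v) over variant pairs, weight = strength of the
    # pairwise phase information. Kruskal keeps the heaviest acyclic subset;
    # a second-order approximation retains only these pairwise terms.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```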
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz329
      Issue No: Vol. 35, No. 14 (2019)
  • Identifying progressive imaging genetic patterns via multi-task sparse
           canonical correlation analysis: a longitudinal study of the ADNI cohort
    • Authors: Du L; Liu K, Zhu L, et al.
      Abstract: Motivation: Identifying the genetic basis of brain structure, function and disorder by using imaging quantitative traits (QTs) as endophenotypes is an important task in brain science. Brain QTs often change as the disorder progresses, and thus understanding how genetic factors influence these progressive brain QT changes is of great importance. Most existing imaging genetics methods analyze only the baseline neuroimaging data, omitting the longitudinal imaging data across multiple time points that contain important disease progression information. Results: We propose a novel temporal imaging genetic model which performs multi-task sparse canonical correlation analysis (T-MTSCCA). Our model uses longitudinal neuroimaging data to uncover how single nucleotide polymorphisms (SNPs) affect brain QTs over time. By incorporating the relationships within the longitudinal imaging data and within the SNPs, T-MTSCCA can identify a trajectory of progressive imaging genetic patterns over time. We propose an efficient algorithm to solve the problem and show its convergence. We evaluate T-MTSCCA on 408 subjects from the Alzheimer's Disease Neuroimaging Initiative database with longitudinal magnetic resonance imaging data and genetic data available. The experimental results show that T-MTSCCA performs as well as or better than the state-of-the-art methods. In particular, T-MTSCCA identifies higher canonical correlation coefficients and captures clearer canonical weight patterns. This suggests that T-MTSCCA identifies time-consistent and time-dependent SNPs and imaging QTs, which further helps in understanding the genetic basis of brain QT changes during disease progression. Availability and implementation: The software and simulation data are publicly available online. Supplementary information: Supplementary data are available at Bioinformatics online.
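A single-pair sparse CCA can be sketched with soft-thresholded power iterations on the cross-covariance (our simplification; T-MTSCCA additionally couples tasks across time points and uses structured penalties, none of which appears here):

```python
import numpy as np

def soft(x, lam):
    # Soft-thresholding operator: the proximal map of the l1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_cca(X, Y, lam=0.1, iters=200):
    # Alternate soft-thresholded power iterations on the centered
    # cross-covariance to obtain one sparse canonical weight pair (u, v).
    C = (X - X.mean(0)).T @ (Y - Y.mean(0))
    u = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(iters):
        u = soft(C @ v, lam)
        u /= np.linalg.norm(u) + 1e-12
        v = soft(C.T @ u, lam)
        v /= np.linalg.norm(v) + 1e-12
    return u, v
```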
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz320
      Issue No: Vol. 35, No. 14 (2019)
  • Model-based optimization of subgroup weights for survival analysis
    • Authors: Richter J; Madjar K, Rahnenführer J.
      Abstract: Motivation: Obtaining a reliable prediction model for a specific cancer subgroup or cohort is often difficult due to limited sample size and, in survival analysis, due to potentially high censoring rates. Sometimes similar data from other patient subgroups are available, e.g. from other clinical centers. Simple pooling of all subgroups can decrease the variance of the predicted parameters of the prediction models, but also increase the bias due to heterogeneity between the cohorts. A promising compromise is to identify those subgroups with a similar relationship between covariates and target variable and then include only these for model building. Results: We propose a subgroup-based weighted likelihood approach for survival prediction with high-dimensional genetic covariates. When predicting survival for a specific subgroup, an individual weight for every other subgroup determines the strength with which its observations enter into model building. Model-based optimization (MBO) can be used to quickly find a good prediction model in the presence of a large number of hyperparameters. We use MBO to identify the best model for survival prediction of a specific subgroup by optimizing, for a Cox model, the weights assigned to the additional subgroups. The approach is evaluated on a set of lung cancer cohorts with gene expression measurements. The resulting models have competitive prediction quality, and they reflect the similarity of the corresponding cancer subgroups, with weights close to 0, close to 1 and at intermediate values. Availability and implementation: mlrMBO is implemented as an R package and is freely available online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz361
      Issue No: Vol. 35, No. 14 (2019)
  • Modeling clinical and molecular covariates of mutational process activity
           in cancer
    • Authors: Robinson W; Sharan R, Leiserson M.
      Abstract: Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection and the understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single-base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic features (e.g. smoking history) or molecular features (e.g. inactivations of DNA damage repair genes). Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor's observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic-modeling-based approaches, particularly in recovering the ground-truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort, using cancer type as a covariate, and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas. Availability and implementation: TCSM is implemented in Python 3 and is available online, along with a data workflow for reproducing the experiments in the paper. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz340
      Issue No: Vol. 35, No. 14 (2019)
  • cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs
    • Authors: Tolstoganov I; Bankevich A, Chen Z, et al.
      Abstract: Motivation: The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although new barcoding protocols are emerging and the range of SLR applications is expanding, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. Results: We describe the algorithmic challenges of SLR assembly and present the cloudSPAdes algorithm, which is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies and applications and demonstrate that it improves on state-of-the-art SLR assemblers in accuracy and speed. Availability and implementation: Source code and an installation manual for cloudSPAdes are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz349
      Issue No: Vol. 35, No. 14 (2019)
  • ADAPTIVE: leArning DAta-dePendenT, concIse molecular VEctors for fast,
           accurate metabolite identification from tandem mass spectra
    • Authors: Nguyen D; Nguyen C, Mamitsuka H.
      Abstract: Motivation: Metabolite identification is an important task in metabolomics to enhance the knowledge of biological systems. A number of machine learning-based methods have been proposed for this task, which predict a chemical structure for a given spectrum through an intermediate representation called a molecular fingerprint. They usually have two steps: (i) predicting fingerprints from spectra; (ii) searching chemical compounds (in a database) corresponding to the predicted fingerprints. Fingerprints are feature vectors, which are usually very large in order to cover all possible substructures and chemical properties, and are therefore heavily redundant, in the sense of having many molecular (sub)structures irrelevant to the task, causing limited predictive performance and slow prediction. Results: We propose ADAPTIVE, which has two parts: learning two mappings, (i) from structures to molecular vectors and (ii) from spectra to molecular vectors. The first part learns molecular vectors for metabolites from given data, so as to be consistent with both the spectra and the chemical structures of the metabolites. In more detail, molecular vectors are generated by a model parameterized by a message passing neural network, and the parameters are estimated by maximizing the correlation between molecular vectors and the corresponding spectra in terms of the Hilbert-Schmidt Independence Criterion (HSIC). Molecular vectors generated by this model are compact and, importantly, adaptive (specific) to both the given data and the task of metabolite identification. The second part uses input-output kernel regression (IOKR), the current cutting-edge method of metabolite identification. We empirically confirmed the effectiveness of ADAPTIVE on benchmark data, where ADAPTIVE outperformed the original IOKR in both predictive performance and computational efficiency. Availability and implementation: The code will be made available online upon acceptance of this article.
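The HSIC objective used to match molecular vectors to spectra can be illustrated with the standard biased empirical estimator (the estimator itself is standard; using it as a stand-alone function like this is our sketch, not ADAPTIVE's training code):

```python
import numpy as np

def hsic(K, L):
    # Biased empirical Hilbert-Schmidt Independence Criterion for two
    # kernel matrices K, L computed on the same n samples. Larger values
    # indicate stronger statistical dependence.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def rbf(X, gamma=1.0):
    # Gaussian (RBF) kernel matrix for rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)
```

Maximizing HSIC between the kernel on learned molecular vectors and the kernel on spectra pushes the vectors to encode exactly the structure the spectra can distinguish.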
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz319
      Issue No: Vol. 35, No. 14 (2019)
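The ADAPTIVE abstract above estimates parameters by maximizing dependence between molecular vectors and spectra via the Hilbert–Schmidt Independence Criterion. As a minimal numpy sketch of the empirical HSIC statistic with linear kernels (ADAPTIVE itself parameterizes one side with a message passing neural network; the variable names and toy data here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def hsic(X, Y):
    """Biased empirical HSIC with linear kernels:
    HSIC = tr(K H L H) / (n - 1)^2, where K = X X^T, L = Y Y^T
    and H = I - (1/n) 1 1^T centers the kernel matrices."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = X @ X.T
    L = Y @ Y.T
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
spectra = rng.normal(size=(50, 10))           # stand-in for spectrum features
vectors = spectra @ rng.normal(size=(10, 4))  # "molecular vectors" dependent on spectra
noise = rng.normal(size=(50, 4))              # independent vectors

# Dependent pairs score much higher than independent ones.
print(hsic(spectra, vectors) > hsic(spectra, noise))  # True
```

Maximizing this quantity over the parameters generating `vectors` pushes the learned representation to covary with the spectra, which is the role HSIC plays in the abstract.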
  • Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities
    • Authors: Trabelsi A; Chaabane M, Ben-Hur A.
      Abstract: Motivation: Deep learning architectures have recently demonstrated their power in predicting DNA- and RNA-binding specificity. Existing methods fall into three classes: some are based on convolutional neural networks (CNNs), others use recurrent neural networks (RNNs) and others rely on hybrid architectures combining CNNs and RNNs. However, based on existing studies, the relative merits of the various architectures remain unclear. Results: In this study we present a systematic exploration of deep learning architectures for predicting DNA- and RNA-binding specificity. For this purpose, we present deepRAM, an end-to-end deep learning tool that provides an implementation of a wide selection of architectures; its fully automatic model selection procedure allows us to perform a fair and unbiased comparison of deep learning architectures. We find that deeper, more complex architectures provide a clear advantage given sufficient training data, and that hybrid CNN/RNN architectures outperform other methods in terms of accuracy. Our work provides guidelines that can assist the practitioner in choosing an appropriate network architecture, and provides insight into the differences between the models learned by convolutional and recurrent networks. In particular, we find that although recurrent networks improve model accuracy, this comes at the expense of a loss in the interpretability of the features learned by the model. Availability and implementation: The source code for deepRAM is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 05 Jul 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz339
      Issue No: Vol. 35, No. 14 (2019)
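The first convolutional layer of the sequence models the deepRAM abstract compares acts, in effect, as a learned motif scanner over one-hot-encoded DNA. A minimal numpy sketch of that idea (the `TATA` filter and sequences are hypothetical examples, not anything from the paper):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (length, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, BASES.index(base)] = 1.0
    return m

def conv_scan(x, motif_filter):
    """Slide a (k, 4) filter along the sequence with stride 1,
    as the first convolutional layer of a sequence CNN does."""
    k = motif_filter.shape[0]
    return np.array([np.sum(x[i:i + k] * motif_filter)
                     for i in range(x.shape[0] - k + 1)])

# A filter whose weights exactly match the motif "TATA".
tata = one_hot("TATA")
scores = conv_scan(one_hot("GGTATAGG"), tata)
print(int(np.argmax(scores)))  # 2: the motif starts at position 2
```

In a trained CNN the filter weights are learned rather than fixed, and the per-position scores feed pooling and further layers; the recurrent layers in the hybrid architectures then model dependencies between such motif matches, which is the interpretability trade-off the abstract notes.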
  • Classifying tumors by supervised network propagation
    • Authors: Zhang W; Ma J, Ideker T.
      Pages: 2528 - 2528
      Abstract: Correction to: Bioinformatics (2018), doi: 10.1093/bioinformatics/bty247
      PubDate: Mon, 04 Feb 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty1072
      Issue No: Vol. 35, No. 14 (2019)