Journal Prestige (SJR): 6.14
Citation Impact (citeScore): 8
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press
  • polymapR—linkage analysis and genetic map construction from F1
           populations of outcrossing polyploids
    • Authors: Bourke P; van Geest G, Voorrips R, et al.
      Pages: 540 - 540
      Abstract: Bioinformatics (2018), 34(20): 3496-3502.
      PubDate: Wed, 09 Jan 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty1002
      Issue No: Vol. 35, No. 3 (2019)
  • Systematic discovery of novel and valuable plant gene modules by
           large-scale RNA-seq samples
    • Authors: Yu H; Lu L, Jiao B, et al.
      Pages: 361 - 364
      Abstract:
        Motivation: The complex cellular networks underlying phenotypes are formed by interacting gene modules. Building and analyzing genome-wide, high-quality Gene Co-expression Networks (GCNs) is useful for uncovering these modules and understanding the phenotypes of an organism.
        Results: Using large-scale RNA-seq samples, we constructed high-coverage, high-confidence GCNs in two monocot species (rice and maize) and two eudicot species (Arabidopsis and soybean), and subdivided them into co-expressed gene modules. Taking rice as an example, we discovered many interesting and valuable modules, for instance, pollen-specific modules and a starch biosynthesis module. We explored the regulatory mechanism of modules and revealed synergistic effects of gene expression regulation. In addition, we discovered that the modules conserved among plants participated in basic biological processes, whereas the species-specific modules were involved in spatiotemporal-specific processes linking genotypes to phenotypes. Our study suggests gene regulatory relationships and modules relating to cellular activities and agronomic traits in several model and crop plants, thus providing a valuable data source for plant genetics research and breeding.
        Availability and implementation: The analyzed gene expression data, reconstructed GCNs, modules and detailed annotations can be freely downloaded from [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty642
      Issue No: Vol. 35, No. 3 (2018)
  • Re-identification of individuals in genomic data-sharing beacons via
           allele inference
    • Authors: von Thenen N; Ayday E, Cicek A, et al.
      Pages: 365 - 371
      Abstract:
        Motivation: Genomic data-sharing beacons aim to provide a secure, easy-to-implement and standardized interface for data-sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. Previously deemed secure against re-identification attacks, beacons were shown to be vulnerable despite their stringent policy. Recent studies have demonstrated that it is possible to determine whether the victim is in the dataset, by repeatedly querying the beacon for his/her single-nucleotide polymorphisms (SNPs). Here, we propose a novel re-identification attack and show that the privacy risk is more serious than previously thought.
        Results: Using the proposed attack, even if the victim systematically hides informative SNPs, it is possible to infer the alleles at positions of interest as well as the beacon query results with very high confidence. Our method is based on the fact that alleles at different loci are not necessarily independent. We use linkage disequilibrium and a high-order Markov chain-based algorithm for inference. We show that in a simulated beacon with 65 individuals from the European population, we can infer membership of individuals with 95% confidence with only 5 queries, even when SNPs with MAF <0.05 are hidden. We need less than 0.5% of the number of queries that existing works require, to determine beacon membership under the same conditions. We show that countermeasures such as hiding certain parts of the genome or setting a query budget for the user would fail to protect the privacy of the participants.
        Availability and implementation: Software is available at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 20 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty643
      Issue No: Vol. 35, No. 3 (2018)
  • Kinome-wide identification of phosphorylation networks in eukaryotic
    • Authors: Parca L; Ariano B, Cabibbo A, et al.
      Pages: 372 - 379
      Abstract:
        Motivation: Signaling and metabolic pathways are finely regulated by a network of protein phosphorylation events. Unraveling the nature of this intricate network, composed of kinases, target proteins and their interactions, is therefore of crucial importance. Although thousands of kinase-specific phosphorylations (KsP) have been annotated in model organisms, their kinase-target network is far from being complete, with less-studied organisms lagging behind.
        Results: In this work, we achieved an automated and accurate identification of kinase domains, inferring the residues that most likely contribute to peptide specificity. We integrated this information with the target peptides of known human KsP to predict kinase-specific interactions in other eukaryotes through a deep neural network, outperforming similar methods. We analyzed the differential conservation of kinase specificity among eukaryotes, revealing the high conservation of the specificity of tyrosine kinases. With this approach we discovered 1590 novel KsP of potential clinical relevance in the human proteome.
        Availability and implementation: [URL missing]
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty545
      Issue No: Vol. 35, No. 3 (2018)
  • A parallel computational framework for ultra-large-scale sequence
           clustering analysis
    • Authors: Zheng W; Mao Q, Genco R, et al.
      Pages: 380 - 388
      Abstract:
        Motivation: The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desirable that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing.
        Results: In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods while maintaining the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method.
        Availability and implementation: Open-source software for the proposed method is freely available at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty617
      Issue No: Vol. 35, No. 3 (2018)
  • DMCM: a Data-adaptive Mutation Clustering Method to identify
           cancer-related mutation clusters
    • Authors: Lu X; Qian X, Li X, et al.
      Pages: 389 - 397
      Abstract:
        Motivation: Functional somatic mutations within coding amino acid sequences confer a growth advantage in pathogenic processes. Most existing methods for identifying cancer-related mutations focus on the single amino acid or the entire gene level. However, gain-of-function mutations often cluster in specific protein regions instead of existing independently in the amino acid sequences. Some approaches for identifying mutation clusters from the mutation density along the amino acid chain have been proposed recently, but their performance in identifying mutation clusters remains to be improved.
        Results: Here we present a Data-adaptive Mutation Clustering Method (DMCM), in which a kernel density estimate (KDE) with a data-adaptive bandwidth is applied to estimate the mutation density, to find variable clusters with different lengths on amino acid sequences. We apply this approach to the mutation data of 571 genes in over twenty cancer types from The Cancer Genome Atlas (TCGA). We compare DMCM with M2C, OncodriveCLUST and Pfam Domain and find that DMCM tends to identify more significant clusters. The cross-validation analysis shows that DMCM is robust, and cluster cancer type enrichment analysis shows that specific cancer types are enriched for specific mutation clusters.
        Availability and implementation: DMCM is written in Python and analysis methods of DMCM are written in R. They are all released online, available through [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty624
      Issue No: Vol. 35, No. 3 (2018)
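The kernel density idea in the DMCM abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a generic Abramson-style sample-point bandwidth rule, made-up residue positions, and an arbitrary density threshold, all chosen only to show how a data-adaptive bandwidth narrows kernels in dense regions and widens them around isolated mutations.

```python
import math

def gaussian_kde(points, x, bandwidths):
    """Density at x from Gaussian kernels with one bandwidth per sample point."""
    total = 0.0
    for p, h in zip(points, bandwidths):
        z = (x - p) / h
        total += math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
    return total / len(points)

def adaptive_bandwidths(points, pilot_h, alpha=0.5):
    """Abramson-style local bandwidths (an assumed rule, not taken from DMCM)."""
    # Pilot estimate with a single fixed bandwidth.
    pilot = [gaussian_kde(points, p, [pilot_h] * len(points)) for p in points]
    # The geometric mean of the pilot densities normalizes the local factors.
    g = math.exp(sum(math.log(d) for d in pilot) / len(pilot))
    # Higher pilot density gives a smaller bandwidth, and vice versa.
    return [pilot_h * (d / g) ** (-alpha) for d in pilot]

# Hypothetical mutated residue positions along one protein (illustrative only).
positions = [12, 13, 13, 14, 15, 15, 16, 60, 95, 140, 141, 142, 142, 143, 210]
bandwidths = adaptive_bandwidths(positions, pilot_h=5.0)
density = [gaussian_kde(positions, x, bandwidths) for x in range(1, 251)]

# Contiguous runs of residues above a hypothetical threshold count as clusters.
threshold = 2.0 / len(density)  # twice the uniform density; an arbitrary choice
clusters, start = [], None
for residue, d in enumerate(density, start=1):
    if d > threshold and start is None:
        start = residue
    elif d <= threshold and start is not None:
        clusters.append((start, residue - 1))
        start = None
if start is not None:
    clusters.append((start, len(density)))
print(clusters)
```

With these toy inputs the isolated mutations receive wide, flat kernels and stay below the threshold, while the two dense groups form separate clusters of different lengths.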
  • pLoc_bal-mAnimal: predict subcellular localization of animal proteins by
           balancing training dataset and PseAAC
    • Authors: Cheng X; Lin W, Xiao X, et al.
      Pages: 398 - 406
      Abstract:
        Motivation: A cell contains numerous protein molecules. One of the fundamental goals in cell biology is to determine their subcellular locations, which can provide useful clues about their functions. Knowledge of protein subcellular localization is also indispensable for prioritizing and selecting the right targets for drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools for identifying their subcellular localization in a timely and effective manner, based on the sequence information alone. Recently, a predictor called ‘pLoc-mAnimal’ was developed for identifying the subcellular localization of animal proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with the multi-label systems in which some proteins, called ‘multiplex proteins’, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mAnimal was trained on an extremely skewed dataset in which some subset (subcellular location) was about 128 times the size of the other subsets. Accordingly, such an uneven training dataset will inevitably lead to biased results.
        Results: To alleviate this bias, we have developed a new and bias-reducing predictor called pLoc_bal-mAnimal by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mAnimal, the existing state-of-the-art predictor, in identifying the subcellular localization of animal proteins.
        Availability and implementation: To maximize the convenience for the vast majority of experimental scientists, a user-friendly web-server for the new predictor has been established at [URL missing], by which users can easily get their desired results without the need to go through the complicated mathematics.
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty628
      Issue No: Vol. 35, No. 3 (2018)
  • Dynamic compression schemes for graph coloring
    • Authors: Mustafa H; Schilken I, Karasikov M, et al.
      Pages: 407 - 414
      Abstract:
        Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata.
        Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain.
        Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty632
      Issue No: Vol. 35, No. 3 (2018)
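The lossy scheme in the graph coloring abstract above stores each color in its own Bloom filter. The sketch below shows only that basic idea under stated assumptions: the `BloomFilter` class, the sample names, the reads, `k`, and the filter sizes are all hypothetical, and the graph-topology-based correction mentioned in the abstract is not modeled. The authors' prototype is in C++; Python is used here purely for illustration.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def merge(self, other):
        """Union with a filter of identical shape (bitwise OR of the bit arrays)."""
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))


def kmers(sequence, k):
    """All distinct substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Hypothetical samples ("colors") and reads; every value here is illustrative.
samples = {"sample_A": "ACGTACGTGA", "sample_B": "TTGACGTACC"}
coloring = {}
for name, read in samples.items():
    bf = BloomFilter(num_bits=1024, num_hashes=3)
    for kmer in kmers(read, k=4):
        bf.add(kmer)
    coloring[name] = bf  # one independently stored filter per color


def colors_of(kmer):
    """Colors whose filter reports the k-mer (may include false positives)."""
    return sorted(name for name, bf in coloring.items() if kmer in bf)


print(colors_of("ACGT"))
```

Because each color lives in a separate filter, filters can be built in parallel and combined with `merge`, which loosely mirrors the modular, mergeable design the abstract describes.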
  • Toward fast and accurate SNP genotyping from whole genome sequencing data
           for bedside diagnostics
    • Authors: Sun C; Medvedev P, Berger B.
      Pages: 415 - 420
      Abstract:
        Motivation: Genotyping a set of variants from a database is an important step for identifying known genetic traits and disease-related variants within an individual. The growing size of variant databases as well as the high depth of sequencing data poses an efficiency challenge. In clinical applications, where time is crucial, alignment-based methods are often not fast enough. To fill the gap, Shajii et al. proposed LAVA, an alignment-free genotyping method which is able to more quickly genotype single nucleotide polymorphisms (SNPs); however, there remains considerable room for improvement in running time and accuracy.
        Results: We present the VarGeno method for SNP genotyping from Illumina whole genome sequencing data. VarGeno builds upon LAVA by improving the speed of k-mer querying as well as the accuracy of the genotyping strategy. We evaluate VarGeno on several read datasets using different genotyping SNP lists. VarGeno performs 7–13 times faster than LAVA with similar memory usage, while improving accuracy.
        Availability and implementation: VarGeno is freely available at: [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty641
      Issue No: Vol. 35, No. 3 (2018)
  • Scaling read aligners to hundreds of threads on general-purpose processors
    • Authors: Langmead B; Wilks C, Antonescu V, et al.
      Pages: 421 - 432
      Abstract:
        Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners.
        Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
        Availability and implementation: Experiments for this study: [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty648
      Issue No: Vol. 35, No. 3 (2018)
  • StackDPPred: a stacking based prediction of DNA-binding protein from
    • Authors: Mishra A; Pokhrel P, Hoque M, et al.
      Pages: 433 - 441
      Abstract:
        Motivation: Identification of DNA-binding proteins from only sequence information is one of the most challenging problems in the field of genome annotation. DNA-binding proteins play an important role in various biological processes such as DNA replication, repair, transcription and splicing. Existing experimental techniques for identifying DNA-binding proteins are time-consuming and expensive. Thus, prediction of DNA-binding proteins from sequences alone using computational methods can be useful to quickly annotate and guide the experimental process. Most of the methods developed for predicting DNA-binding proteins use only the information from the evolutionary profile, called the position-specific scoring matrix (PSSM) profile, and the accuracies of such methods have been limited. Here, we propose a method, called StackDPPred, which utilizes features extracted from PSSM and residue-specific contact energy to help train a stacking-based machine learning method for the effective prediction of DNA-binding proteins.
        Results: Based on benchmark sequences of 1063 (518 DNA-binding and 545 non-DNA-binding) proteins and using jackknife validation, StackDPPred achieved an ACC of 89.96%, MCC of 0.799 and AUC of 94.50%. This outcome outperforms several state-of-the-art approaches. Furthermore, when tested on two recently designed independent test datasets, StackDPPred outperforms existing approaches consistently. The proposed StackDPPred can be used for effective prediction of DNA-binding proteins from sequence alone.
        Availability and implementation: Online server is at [URL missing] and code-data is at [...]∼tamjid/Software/StackDPPred/
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty653
      Issue No: Vol. 35, No. 3 (2018)
  • Effusion: prediction of protein function from sequence similarity networks
    • Authors: Yunes J; Babbitt P, Hancock J.
      Pages: 442 - 451
      Abstract:
        Motivation: Critical evaluation of methods for protein function prediction shows that data integration improves the performance of methods that predict protein function, but a basic BLAST-based method is still a top contender. We sought to engineer a method that modernizes the classical approach while avoiding pitfalls common to state-of-the-art methods.
        Results: We present a method for predicting protein function, Effusion, which uses a sequence similarity network to add context for homology transfer, a probabilistic model to account for the uncertainty in labels and function propagation, and the structure of the Gene Ontology (GO) to best utilize sparse input labels and make consistent output predictions. Effusion’s model makes it practical to integrate rare experimental data and abundant primary sequence and sequence similarity. We demonstrate Effusion’s performance using a critical evaluation method and provide an in-depth analysis. We also dissect the design decisions we used to address challenges for predicting protein function. Finally, we propose directions in which the framework of the method can be modified for additional predictive power.
        Availability and implementation: The source code for an implementation of Effusion is freely available at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 01 Aug 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty672
      Issue No: Vol. 35, No. 3 (2018)
  • Automatic recognition of ligands in electron density by machine learning
    • Authors: Kowiel M; Brzezinski D, Porebski P, et al.
      Pages: 452 - 461
      Abstract:
        Motivation: The correct identification of ligands in crystal structures of protein complexes is the cornerstone of structure-guided drug design. However, cognitive bias can sometimes mislead investigators into modeling fictitious compounds without solid support from the electron density maps. Ligand identification can be aided by automatic methods, but existing approaches are based on time-consuming iterative fitting.
        Results: Here we report a new machine learning algorithm called CheckMyBlob that identifies ligands from experimental electron density maps. In benchmark tests on portfolios of up to 219 931 ligand binding sites containing the 200 most popular ligands found in the Protein Data Bank, CheckMyBlob markedly outperforms the existing automatic methods for ligand identification, in some cases doubling the recognition rates, while requiring significantly less time. Our work shows that machine learning can improve the automation of structure modeling and significantly accelerate the drug screening process of macromolecule-ligand complexes.
        Availability and implementation: Code and data are available on GitHub at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty626
      Issue No: Vol. 35, No. 3 (2018)
  • SKEMPI 2.0: an updated benchmark of changes in protein–protein binding
           energy, kinetics and thermodynamics upon mutation
    • Authors: Jankauskaitė J; Jiménez-García B, Dapkūnas J, et al.
      Pages: 462 - 469
      Abstract:
        Motivation: Understanding the relationship between the sequence, structure, binding energy, binding kinetics and binding thermodynamics of protein–protein interactions is crucial to understanding cellular signaling, the assembly and regulation of molecular complexes, the mechanisms through which mutations lead to disease, and protein engineering.
        Results: We present SKEMPI 2.0, a major update to our database of binding free energy changes upon mutation for structurally resolved protein–protein interactions. This version now contains manually curated binding data for 7085 mutations, an increase of 133%, including changes in kinetics for 1844 mutations, enthalpy and entropy changes for 443 mutations, and 440 mutations that abolish detectable binding.
        Availability and implementation: The database is available as supplementary data and at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty635
      Issue No: Vol. 35, No. 3 (2018)
  • BIPSPI: a method for the prediction of partner-specific
           protein–protein interfaces
    • Authors: Sanchez-Garcia R; Sorzano C, Carazo J, et al.
      Pages: 470 - 477
      Abstract:
        Motivation: Protein–Protein Interactions (PPI) are essential for most cellular processes; thus, unveiling how proteins interact is a crucial question that can be better understood by identifying which residues are responsible for the interaction. Computational approaches are orders of magnitude cheaper and faster than experimental ones, leading to a proliferation of methods aimed at predicting which residues belong to the interface of an interaction.
        Results: We present BIPSPI, a new machine learning-based method for the prediction of partner-specific PPI sites. Contrary to most binding site prediction methods, the proposed approach takes into account a pair of interacting proteins rather than a single one in order to predict partner-specific binding sites. BIPSPI has been trained employing sequence-based and structural features from both protein partners of each complex compiled in the Protein–Protein Docking Benchmark version 5.0 and in an additional, independently compiled set. A version trained only on sequences has also been developed. The performance of our approach has been assessed by a leave-one-out cross-validation over different benchmarks, outperforming state-of-the-art methods.
        Availability and implementation: BIPSPI web server is freely available at [URL missing]. BIPSPI code is available at [URL missing]. Docker image is available at [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty647
      Issue No: Vol. 35, No. 3 (2018)
  • SMSSVD: SubMatrix Selection Singular Value Decomposition
    • Authors: Henningsson R; Fontes M, Birol I.
      Pages: 478 - 486
      Abstract:
        Motivation: High-throughput biomedical measurements normally capture multiple overlaid biologically relevant signals and often also signals representing different types of technical artefacts, such as batch effects. Signal identification and decomposition are accordingly main objectives in statistical biomedical modeling and data analysis. Existing methods aimed at signal reconstruction and deconvolution are, in general, either supervised, contain parameters that need to be estimated or present other types of ad hoc features. We here introduce SubMatrix Selection Singular Value Decomposition (SMSSVD), a parameter-free unsupervised signal decomposition and dimension reduction method, designed to reduce noise adaptively for each low-rank signal in a given data matrix, and to represent the signals in the data in a way that enables unbiased exploratory analysis and reconstruction of multiple overlaid signals, including identifying groups of variables that drive different signals.
        Results: The SMSSVD method produces a denoised signal decomposition from a given data matrix. It also guarantees orthogonality between signal components in a straightforward manner and it is designed to make automation possible. We illustrate SMSSVD by applying it to several real and synthetic datasets and compare its performance to gold-standard methods like PCA (Principal Component Analysis) and SPC (Sparse Principal Components, using Lasso constraints). The SMSSVD is computationally efficient and, despite being a parameter-free method, in general outperforms existing statistical learning methods.
        Availability and implementation: A Julia implementation of SMSSVD is openly available on GitHub ([URL missing]).
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 13 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty566
      Issue No: Vol. 35, No. 3 (2018)
  • Heritability estimation and differential analysis of count data with
           generalized linear mixed models in genomic sequencing studies
    • Authors: Sun S; Zhu J, Mozaffari S, et al.
      Pages: 487 - 496
      Abstract:
        Motivation: Genomic sequencing studies, including RNA sequencing and bisulfite sequencing studies, are becoming increasingly common and increasingly large. Large genomic sequencing studies open doors for accurate molecular trait heritability estimation and powerful differential analysis. Heritability estimation and differential analysis in sequencing studies require the development of statistical methods that can properly account for the count nature of the sequencing data and that are computationally efficient for large datasets.
        Results: Here, we develop such a method, PQLseq (Penalized Quasi-Likelihood for sequencing count data), to enable effective and efficient heritability estimation and differential analysis using the generalized linear mixed model framework. With extensive simulations and comparisons to previous methods, we show that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data. In addition, we show that PQLseq is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods. Finally, we apply PQLseq to perform gene expression heritability estimation and differential expression analysis in a large RNA sequencing study in the Hutterites.
        Availability and implementation: PQLseq is implemented as an R package with source code freely available at [URL missing] and [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty644
      Issue No: Vol. 35, No. 3 (2018)
  • Random walk with restart on multiplex and heterogeneous biological
    • Authors: Valdeolivas A; Tichit L, Navarro C, et al.
      Pages: 497 - 505
      Abstract:
        Motivation: Recent years have witnessed an exponential growth in the number of identified interactions between biological molecules. These interactions are usually represented as large and complex networks, calling for the development of appropriate tools to exploit the functional information they contain. Random walk with restart (RWR) is the state-of-the-art guilt-by-association approach. It explores the network vicinity of gene/protein seeds to study their functions, based on the premise that nodes related to similar functions tend to lie close to each other in the networks.
        Results: In this study, we extended the RWR algorithm to multiplex and heterogeneous networks. The walk can now explore different layers of physical and functional interactions between genes and proteins, such as protein–protein interactions and co-expression associations. In addition, the walk can also jump to a network containing different sets of edges and nodes, such as phenotype similarities between diseases. We devised a leave-one-out cross-validation strategy to evaluate the algorithm's ability to predict disease-associated genes. We demonstrate the improved performance of the multiplex-heterogeneous RWR as compared to several random walks on monoplex or heterogeneous networks. Overall, our framework is able to leverage the different interaction sources to outperform current approaches. Finally, we applied the algorithm to predict candidate genes for the Wiedemann–Rautenstrauch syndrome, and to explore the network vicinity of the SHORT syndrome.
        Availability and implementation: The source code is available on GitHub at: [URL missing]. In addition, an R package is freely available through Bioconductor at: [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 18 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty637
      Issue No: Vol. 35, No. 3 (2018)
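For readers unfamiliar with the baseline that the abstract above extends, here is a minimal sketch of classical random walk with restart on a single (monoplex) network. It does not implement the multiplex or heterogeneous extension; the toy graph, the node names, and the restart probability of 0.7 are illustrative assumptions, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the scores stop changing."""
    nodes = sorted(adjacency)
    # Column-normalized transitions: a walker at u picks a neighbor uniformly.
    transition = {u: {v: 1.0 / len(adjacency[u]) for v in adjacency[u]} for u in nodes}
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            for v, w in transition[u].items():
                nxt[v] += (1.0 - restart) * w * p[u]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical undirected gene network, given as symmetric neighbor lists.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
scores = random_walk_with_restart(graph, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes close to the seed accumulate more probability mass than distant ones, which is the guilt-by-association ranking the abstract refers to.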
  • CIRCOAST: a statistical hypothesis test for cellular colocalization with
           network structures
    • Authors: Corliss B; Ray H, Patrie J, et al.
      Pages: 506 - 514
      Abstract:
        Motivation: Colocalization of structures in biomedical images can lead to insights into biological behaviors. One class of colocalization problems is examining an annular structure (disk-shaped such as a cell, vesicle or molecule) interacting with a network structure (vascular, neuronal, cytoskeletal, organellar). Examining colocalization events across conditions is often complicated by changes in density of both structure types, confounding traditional statistical approaches since colocalization cannot be normalized to the density of both structure types simultaneously. We have developed a technique to measure colocalization independent of structure density and applied it to characterizing intercellular colocalization with blood vessel networks. This technique could be used to analyze colocalization of any annular structure with an arbitrarily shaped network structure.
        Results: We present the circular colocalization affinity with network structures test (CIRCOAST), a novel statistical hypothesis test to probe for enriched network colocalization in 2D z-projected multichannel images by using agent-based Monte Carlo modeling and image processing to generate the pseudo-null distribution of random cell placement unique to each image. This hypothesis test was validated by confirming that adipose-derived stem cells (ASCs) exhibit enriched colocalization with endothelial cells forming arborized networks in culture and then applied to show that locally delivered ASCs have enriched colocalization with murine retinal microvasculature in a model of diabetic retinopathy. We demonstrate that the CIRCOAST test provides superior power and type I error rates in characterizing intercellular colocalization compared to generic approaches that are confounded by changes in cell or vessel density.
        Availability and implementation: CIRCOAST source code is available at: [URL missing].
        Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty638
      Issue No: Vol. 35, No. 3 (2018)
  • MAVIS: merging, annotation, validation, and illustration of structural
           variants
    • Authors: Reisle C; Mungall K, Choo C, et al.
      Pages: 515 - 517
      Abstract: Summary: Reliably identifying genomic rearrangements and interpreting their impact is a key step in understanding their role in human cancers and inherited genetic diseases. Many short read algorithmic approaches exist but all have appreciable false negative rates. A common approach is to evaluate the union of multiple tools increasing sensitivity, followed by filtering to retain specificity. Here we describe an application framework for the rapid generation of structural variant consensus, unique in its ability to visualize the genetic impact and context as well as process both genome and transcriptome data. Availability and implementation: http://mavis.bcgsc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty621
      Issue No: Vol. 35, No. 3 (2018)
  • TreeGrafter: phylogenetic tree-based annotation of proteins with Gene
           Ontology terms and other annotations
    • Authors: Tang H; Finn R, Thomas P, et al.
      Pages: 518 - 520
      Abstract: Summary: TreeGrafter is a new software tool for annotating protein sequences using pre-annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology (GO) terms, and PANTHER family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequence, finds the best matching homologous family in a library of pre-calculated, pre-annotated gene trees, and then grafts it to the best location in the tree. It then annotates the sequence by propagating annotations from ancestral nodes in the reference tree. We show that TreeGrafter outperforms subfamily HMM scoring for correctly assigning subfamily membership, and that it produces highly specific annotations of GO terms based on annotated reference phylogenetic trees. This method will be further integrated into InterProScan, enabling an even broader user community. Availability and implementation: TreeGrafter is freely available on the web at, including as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty625
      Issue No: Vol. 35, No. 3 (2018)
  • Simulating Illumina metagenomic data with InSilicoSeq
    • Authors: Gourlé H; Karlsson-Lindsjö O, Hayer J, et al.
      Pages: 521 - 522
      Abstract: Motivation: The accurate in silico simulation of metagenomic datasets is of great importance for benchmarking bioinformatics tools as well as for experimental design. Users are dependent on large-scale simulation to not only design experiments and new projects but also for accurate estimation of computational needs within a project. Unfortunately, most current read simulators are either not suited for metagenomics, out of date or relatively poorly documented. In this article, we describe InSilicoSeq, a software package to simulate metagenomic Illumina sequencing data. InSilicoSeq has a simple command-line interface and extensive documentation. Results: InSilicoSeq is implemented in Python and capable of simulating realistic Illumina (meta)genomic data in a parallel fashion with sensible default parameters. Availability and implementation: Source code and documentation are available under the MIT license at and Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty630
      Issue No: Vol. 35, No. 3 (2018)
  • MinIONQC: fast and simple quality control for MinION sequencing data
    • Authors: Lanfear R; Schalamun M, Kainer D, et al.
      Pages: 523 - 525
      Abstract: Summary: MinIONQC provides rapid diagnostic plots and quality control data from one or more flowcells of sequencing data from Oxford Nanopore Technologies’ MinION instrument. It can be used to assist with the optimisation of extraction, library preparation, and sequencing protocols, to quickly and directly compare the data from many flowcells, and to provide publication-ready figures summarising sequencing data. Availability and implementation: MinIONQC is implemented in R and released under an MIT license. It is available for all platforms from
      PubDate: Mon, 23 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty654
      Issue No: Vol. 35, No. 3 (2018)
  • ape 5.0: an environment for modern phylogenetics and evolutionary analyses
           in R
    • Authors: Paradis E; Schliep K, Schwartz R.
      Pages: 526 - 528
      Abstract: Summary: After more than fifteen years of existence, the R package ape has continuously grown its contents, and has been used by a growing community of users. The release of version 5.0 has marked a leap towards a modern software for evolutionary analyses. Efforts have been put to improve efficiency, flexibility, support for ‘big data’ (R’s long vectors), ease of use and quality check before a new release. These changes will hopefully make ape a useful software for the study of biodiversity and evolution in a context of increasing data quantity. Availability and implementation: ape is distributed through the Comprehensive R Archive Network: Further information may be found at
      PubDate: Tue, 17 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty633
      Issue No: Vol. 35, No. 3 (2018)
  • MendelProb: probability and sample size calculations for Mendelian studies
           of exome and whole genome sequence data
    • Authors: He Z; Wang L, DeWan A, et al.
      Pages: 529 - 531
      Abstract: Motivation: For the design of genetic studies, it is necessary to perform power calculations. Although for Mendelian traits the power of detecting linkage for pedigree(s) can be determined, it is also of great interest to determine the probability of identifying multiple pedigrees or unrelated cases with variants in the same gene. For many diseases, due to extreme locus heterogeneity this probability can be small. If only one family is observed segregating a variant classified as likely pathogenic or of unknown significance, the gene cannot be implicated in disease etiology. The probability of identifying several disease families or cases is dependent on the gene-specific disease prevalence and the sample size. The observation of multiple disease families or cases with variants in the same gene as well as evidence of pathogenicity from other sources, e.g. in silico prediction, expression and functional studies, can aid in implicating a gene in disease etiology. MendelProb can determine the probability of detecting a minimum number of families or cases with variants in the same gene. It can also calculate the probability of detecting genes with variants in different data types, e.g. identifying a variant in at least one family that can establish linkage and more than two additional families regardless of their size. Additionally, for a specified probability MendelProb can determine the number of probands which need to be screened to detect a minimum number of individuals with variants within the same gene. Results: A single Mendelian disease family is not sufficient to implicate a gene in disease etiology. It is necessary to observe multiple families or cases with potentially pathogenic variants in the same gene. MendelProb, an R library, was developed to determine the probability of observing multiple families and cases with variants within a gene and to also establish the numbers of probands to screen to detect multiple observations of variants within a gene. Availability and implementation:
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty542
      Issue No: Vol. 35, No. 3 (2018)
  • MoDentify: phenotype-driven module identification in metabolomics networks
           at different resolutions
    • Authors: Do K; Rasp D, Kastenmüller G, et al.
      Pages: 532 - 534
      Abstract: Summary: Associations of metabolomics data with phenotypic outcomes are expected to span functional modules, which are defined as sets of correlating metabolites that are coordinately regulated. Moreover, these associations occur at different scales, from entire pathways to only a few metabolites; an aspect that has not been addressed by previous methods. Here, we present MoDentify, a free R package to identify regulated modules in metabolomics networks at different layers of resolution. Importantly, MoDentify shows higher statistical power than classical association analysis. Moreover, the package offers direct interactive visualization of the results in Cytoscape. We present an application example using complex, multifluid metabolomics data. Due to its generic character, the method is widely applicable to other types of data. Availability and implementation: (vignette includes detailed workflow). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 19 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty650
      Issue No: Vol. 35, No. 3 (2018)
  • gMCS: fast computation of genetic minimal cut sets in large networks
    • Authors: Apaolaza I; Valcarcel L, Planes F, et al.
      Pages: 535 - 537
      Abstract: Motivation: The identification of minimal gene knockout strategies to engineer metabolic systems constitutes one of the most relevant applications of the COnstraint-Based Reconstruction and Analysis (COBRA) framework. In the last years, the minimal cut sets (MCSs) approach has emerged as a promising tool to carry out this task. However, MCSs define reaction knockout strategies, which are not necessarily transformed into feasible strategies at the gene level. Results: We present a more general, easy-to-use and efficient computational implementation of a previously published algorithm to calculate MCSs to the gene level (gMCSs). Our tool was compared with existing methods in order to calculate essential genes and synthetic lethals in metabolic networks of different complexity, showing a significant reduction in model size and computation time. Availability and implementation: gMCS is publicly and freely available under GNU license in the COBRA toolbox. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 20 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty656
      Issue No: Vol. 35, No. 3 (2018)
  • iSwathX: an interactive web-based application for extension of DIA peptide
           reference libraries
    • Authors: Noor Z; Wu J, Pascovici D, et al.
      Pages: 538 - 539
      Abstract: Summary: Large-scale peptide mass spectrometry (MS)/MS reference libraries are essential for the comprehensive analysis of data-independent acquisition (DIA) MS datasets, providing a comprehensive set of spectra for identification and quantification of proteins. We have developed a novel web-based R-package (iSwathX) for combining reference libraries that is compatible with different DIA analysis software. This open-source web GUI automates the process of normalization and combination of spectral libraries and provides a user-friendly method for performing library format conversions, analysis and visualizations, with no need for programming familiarity. Availability and implementation: iSwathX is freely accessible at with the R-package and Shiny source code available from GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 23 Jul 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty660
      Issue No: Vol. 35, No. 3 (2018)