Bioinformatics
  [SJR: 4.643]   [H-I: 271]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
   Published by Oxford University Press  [370 journals]
  • An introduction to deep learning on biological sequence data: examples and
    • Authors: Jurtz V; Johansen A, Nielsen M, et al.
      Abstract: Motivation: Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular as machine learning tools in recent years. The availability of greater computational resources, more data, new algorithms for training deep models and easy to use libraries for implementation and training of neural networks are the drivers of this development. The use of deep learning has been especially successful in image recognition; and the development of tools, applications and code examples is in most cases centered within this field rather than within biology. Results: Here, we aim to further the development of deep learning methods within biology by providing application examples and ready to apply and adapt code templates. Given such examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules. Availability and implementation: All implementations and datasets are available online to the scientific community at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-23
  • Motif independent identification of potential RNA G-quadruplexes by G4RNA
    • Authors: Garant J; Perreault J, Scott M.
      Abstract: Motivation: G-quadruplex structures in RNA molecules are known to have regulatory impacts in cells but are difficult to locate in the genome. The minimal requirements for G-quadruplex folding in RNA (G≥3 N1–7 G≥3 N1–7 G≥3 N1–7 G≥3) are being challenged by observations made on specific examples in recent years. The definition of potential G-quadruplex sequences has major repercussions on the observation of the structure since it introduces a bias. The canonical motif only describes a sub-population of the reported G-quadruplexes. To address these issues, we propose an RNA G-quadruplex prediction strategy that does not rely on a motif definition. Results: We trained an artificial neural network with sequences of experimentally validated G-quadruplexes from the G4RNA database encoded using an abstract definition of their sequence. This artificial neural network, G4NN, evaluates the similarity of a given sequence to known G-quadruplexes and reports it as a score. G4NN has a predictive power comparable to the reported G richness and G/C skewness evaluations that are the current state-of-the-art for the identification of potential RNA G-quadruplexes. We combined these approaches in the G4RNA screener, a program designed to manage and evaluate the sequences to identify potential G-quadruplexes. Availability and implementation: G4RNA screener is available for download at [URLs missing]. Contact: michelle.scott@usherbrooke.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-03
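The canonical motif quoted in the G4RNA abstract above (four runs of at least three G, separated by loops of 1 to 7 nucleotides) can be written as a regular expression. The Python sketch below is an illustration only: it is not part of G4RNA screener, and the function name `find_canonical_g4` is hypothetical. It shows the kind of motif-based search that, according to the abstract, captures only a sub-population of reported G-quadruplexes.

```python
import re

# Canonical motif: G>=3 N1-7 G>=3 N1-7 G>=3 N1-7 G>=3,
# where N is any RNA nucleotide (loops may themselves contain G).
G4_CANONICAL = re.compile(r"G{3,}(?:[ACGU]{1,7}G{3,}){3}")

def find_canonical_g4(sequence):
    """Return (start, end, match) for each non-overlapping canonical hit.

    Hypothetical helper for illustration; DNA input is converted to RNA
    by replacing T with U before matching.
    """
    rna = sequence.upper().replace("T", "U")
    return [(m.start(), m.end(), m.group()) for m in G4_CANONICAL.finditer(rna)]

if __name__ == "__main__":
    for hit in find_canonical_g4("AAGGGAGGGAGGGAGGGAA"):
        print(hit)
```

Because `finditer` reports non-overlapping matches, a sequence with many adjacent G runs yields one hit per stretch rather than every possible combination of runs.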
  • DelPhiForce web server: electrostatic forces and energy calculations and
    • Authors: Li L; Jia Z, Peng Y, et al.
      Abstract: Summary: Electrostatic force is an essential component of the total force acting between atoms and macromolecules. Therefore, accurate calculations of electrostatic forces are crucial for revealing the mechanisms of many biological processes. We developed a DelPhiForce web server to calculate and visualize the electrostatic forces at the molecular level. The DelPhiForce web server enables modeling of electrostatic forces on individual atoms, residues, domains and molecules, and generates an output that can be visualized by VMD software. Here we demonstrate the usage of the server for various biological problems including protein–cofactor, domain–domain, protein–protein, protein–DNA and protein–RNA interactions. Availability and implementation: The DelPhiForce web server is available at: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-03
  • ComplexViewer: visualization of curated macromolecular complexes
    • Authors: Combe C; Sivade M, Hermjakob H, et al.
      Abstract: Summary: Proteins frequently function as parts of complexes, assemblages of multiple proteins and other biomolecules, yet network visualizations usually only show proteins as parts of binary interactions. ComplexViewer visualizes interactions with more than two participants and thereby avoids the need to first expand these into multiple binary interactions. Furthermore, if binding regions between molecules are known then these can be displayed in the context of the larger complex. Availability and implementation: Freely available under Apache version 2 license; EMBL-EBI Complex Portal: [URL missing]; Source code: [URL missing]; Package: [URL missing]; Language: JavaScript; Web technology: Scalable Vector Graphics; Libraries: [names missing].
      PubDate: 2017-08-03
  • MFIB: a repository of protein complexes with mutual folding induced by
    • Authors: Fichó E; Reményi I, Simon I, et al.
      Abstract: Motivation: It is commonplace that intrinsically disordered proteins (IDPs) are involved in crucial interactions in the living cell. However, the study of protein complexes formed exclusively by IDPs is hindered by the lack of data and such analyses remain sporadic. Systematic studies benefited other types of protein–protein interactions paving a way from basic science to therapeutics; yet these efforts require reliable datasets that are currently lacking for synergistically folding complexes of IDPs. Results: Here we present the Mutual Folding Induced by Binding (MFIB) database, the first systematic collection of complexes formed exclusively by IDPs. MFIB contains an order of magnitude more data than any dataset used in corresponding studies and offers a wide coverage of known IDP complexes in terms of flexibility, oligomeric composition and protein function from all domains of life. The included complexes are grouped using a hierarchical classification and are complemented with structural and functional annotations. MFIB is backed by a firm development team and infrastructure, and together with possible future community collaboration it will provide the cornerstone for structural and functional studies of IDP complexes. Availability and implementation: MFIB is freely accessible at [URL missing]. The MFIB application is hosted by Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance a MySQL database was also [text missing]. Contact: meszaros.balint@ttk.mta.hu. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-03
  • BiobankUniverse: automatic matchmaking between datasets for biobank data
           discovery and integration
    • Authors: Pang C; Kelpin F, van Enckevort D, et al.
      Abstract: Motivation: Biobanks are indispensable for large-scale genetic/epidemiological studies, yet it remains difficult for researchers to determine which biobanks contain data matching their research questions. Results: To overcome this, we developed a new matching algorithm that identifies pairs of related data elements between biobanks and research variables with high precision and recall. It integrates lexical comparison, Unified Medical Language System ontology tagging and semantic query expansion. The result is BiobankUniverse, a fast matchmaking service for biobanks and researchers. Biobankers upload their data elements and researchers their desired study variables; BiobankUniverse automatically shortlists matching attributes between them. Users can quickly explore matching potential and search for biobanks/data elements matching their research. They can also curate matches and define personalized data-universes. Availability and implementation: BiobankUniverse is available at [URL missing] or can be downloaded as part of the open source MOLGENIS suite at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-02
  • LRCstats, a tool for evaluating long reads correction methods
    • Authors: La S; Haghshenas E, Chauve C.
      Abstract: Motivation: Third-generation sequencing (TGS) platforms that generate long reads, such as PacBio and Oxford Nanopore technologies, have had a dramatic impact on genomics research. However, despite recent improvements, TGS reads suffer from high error rates and the development of read correction methods is an active field of research. This motivates the need to develop tools that can evaluate the accuracy of noisy long reads correction tools. Results: We introduce LRCstats, a tool that measures the accuracy of long reads correction tools. LRCstats takes advantage of long reads simulators that provide each simulated read with an alignment to the reference genome segment they originate from, and does not rely on a step of mapping corrected reads onto the reference genome. This allows for the measurement of the accuracy of the correction while being consistent with the actual errors introduced in the simulation process used to generate noisy reads. We illustrate the usefulness of LRCstats by analyzing the accuracy of four hybrid correction methods for PacBio long reads over three datasets. Availability and implementation: [URL missing]. Contact: cedric.chauve@sfu.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-08-02
  • Reference genome assessment from a population scale perspective: an
           accurate profile of variability and noise
    • Authors: Carbonell-Caballero J; Amadoz A, Alonso R, et al.
      Abstract: Motivation: Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome. Results: The reliability of our protocol has been extensively tested through different experiments and organisms with accurate results, improving state-of-the-art methods. Our analysis demonstrates synergistic relations between quality control scores and allelic variability estimators that improve the detection of misassembled regions, and is able to find strong artifact signals even within the human reference assembly. Furthermore, we demonstrated how our model can be trained to properly rank the confidence of a set of candidate variants obtained from new independent samples. Availability and implementation: This tool is freely available at [URL missing]. Contact: joaquin.dopazo@juntadeandalucia.es. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-29
  • Improved prediction of breast cancer outcome by identifying heterogeneous
    • Authors: Choi J; Park S, Yoon Y, et al.
      Abstract: Motivation: Identification of genes that can be used to predict prognosis in patients with cancer is important in that it can lead to improved therapy, and can also promote our understanding of tumor progression on the molecular level. One of the common but fundamental problems that render identification of prognostic genes and prediction of cancer outcomes difficult is the heterogeneity of patient samples. Results: To reduce the effect of sample heterogeneity, we clustered data samples using the K-means algorithm and applied modified PageRank to functional interaction (FI) networks weighted using gene expression values of samples in each cluster. Hub genes among the resulting prioritized genes were selected as biomarkers to predict the prognosis of samples. This process outperformed traditional feature selection methods as well as several network-based prognostic gene selection methods when applied to Random Forest. We were able to find many cluster-specific prognostic genes for each dataset. Functional study showed that distinct biological processes were enriched in each cluster, which seems to reflect different aspects of tumor progression or oncogenesis among distinct patient groups. Taken together, these results provide support for the hypothesis that our approach can effectively identify heterogeneous prognostic genes, and these are complementary to each other, improving prediction accuracy. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-29
  • FAF-Drugs4: free ADME-tox filtering computations for chemical biology and
           early stages drug discovery
    • Authors: Lagorce D; Bouslama L, Becot J, et al.
      Abstract: Motivation: Identification of small molecules that could be interesting starting points for drug discovery or to investigate a biological system as in chemical biology endeavours is both time consuming and costly. In silico approaches that assist the design of quality compound collections or help to prioritize molecules before synthesis or purchase are therefore valuable. Here quality refers to the selection of molecules that pass one or several selected filters that can be tuned by the users according to the project and the stage of the project. These filters can involve prediction of physicochemical properties, search for toxicophores or other unwanted chemical groups. Results: FAF-Drugs4 is a novel version of our online server dedicated to the preparation and annotation of compound collections. The tool is now faster and several parameters have been optimized. In addition, a new service referred to as FAF-QED, an implementation of the quantitative estimate of drug-likeness method, is now available. Availability and implementation: The server is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-29
  • 3DBIONOTES v2.0: a web server for the automatic annotation of
           macromolecular structures
    • Authors: Segura J; Sanchez-Garcia R, Martinez M, et al.
      Abstract: Motivation: Complementing structural information with biochemical and biomedical annotations is a powerful approach to explore the biological function of macromolecular complexes. However, currently the compilation of annotations and structural data is a feature only available for those structures that have been released as entries to the Protein Data Bank. Results: To help researchers in assessing the consistency between structures and biological annotations for structural models not deposited in databases, we present 3DBIONOTES v2.0, a web application designed for the automatic annotation of biochemical and biomedical information onto macromolecular structural models determined by any experimental or computational technique. Availability and implementation: The web server is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-28
  • Sequence2Vec: a novel embedding approach for modeling transcription factor
           binding affinity landscape
    • Authors: Dai H; Umarov R, Kuwahara H, et al.
      Abstract: Motivation: An accurate characterization of the transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of the TF-DNA binding affinity landscape still remains a challenging problem. Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. Availability and implementation: Our program is freely available at [URLs missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-27
  • pgRNAFinder: a web-based tool to design distance independent paired-gRNA
    • Authors: Xiong Y; Xie X, Wang Y, et al.
      Abstract: Summary: The CRISPR/Cas system has been shown to be an efficient and accurate genome-editing technique. There exist a number of tools to design the guide RNA sequences and predict potential off-target sites. However, most of the existing computational tools for gRNA design are restricted to small deletions. To address this issue, we present pgRNAFinder, with an easy-to-use web interface, which enables researchers to design single or distance-free paired-gRNA sequences. The web interface of pgRNAFinder contains both a gRNA search and a scoring system. After users input query sequences, it searches for gRNAs by 3' protospacer-adjacent motif (PAM) and for possible off-targets, and rapidly scores the conservation of the deleted sequences. Filters can be applied to identify high-quality CRISPR sites. pgRNAFinder offers gRNA design functionality for 8 vertebrate genomes. Furthermore, to keep pgRNAFinder open and extensible to any organism, we provide the source package for local use. Availability and implementation: pgRNAFinder is freely available at [URL missing], and the source code and user manual can be obtained from [URLs missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-27
  • iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide
           chemical properties
    • Authors: Chen W; Yang H, Feng P, et al.
      Abstract: Motivation: DNA N4-methylcytosine (4mC) is an epigenetic modification. The knowledge about the distribution of 4mC is helpful for understanding its biological functions. Although experimental methods have been proposed to detect 4mC sites, they are expensive for performing genome-wide detections. Thus, it is necessary to develop computational methods for predicting 4mC sites. Results: In this work, we developed iDNA4mC, the first webserver to identify 4mC sites, in which DNA sequences are encoded with both nucleotide chemical properties and nucleotide frequency. The predictive results of the rigorous jackknife test and cross species test demonstrated that the performance of iDNA4mC is quite promising and holds high potential to become a useful tool for identifying 4mC sites. Availability and implementation: The user-friendly web-server, iDNA4mC, is freely accessible at [URLs missing].
      PubDate: 2017-07-26
  • Towards clinically more relevant dissection of patient heterogeneity via
           survival-based Bayesian clustering
    • Authors: Ahmad A; Fröhlich H.
      Abstract: Motivation: Discovery of clinically relevant disease sub-types is of prime importance in personalized medicine. Disease sub-type identification has in the past often been explored in an unsupervised machine learning paradigm which involves clustering of patients based on available omics data, such as gene expression. A follow-up analysis involves determining the clinical relevance of the molecular sub-types, such as that reflected by comparing their disease progressions. The above methodology, however, fails to guarantee the separability of the sub-types based on their subtype-specific survival curves. Results: We propose a new algorithm, Survival-based Bayesian Clustering (SBC), which simultaneously clusters heterogeneous omics and clinical end point data (time to event) in order to discover clinically relevant disease subtypes. For this purpose we formulate a novel Hierarchical Bayesian Graphical Model which combines a Dirichlet Process Gaussian Mixture Model with an Accelerated Failure Time model. In this way we make sure that patients are grouped in the same cluster only when they show similar characteristics with respect to molecular features across data types (e.g. gene expression, miRNA) as well as survival times. We extensively test our model in simulation studies and apply it to cancer patient data from the Breast Cancer dataset and The Cancer Genome Atlas repository. Notably, our method is not only able to find clinically relevant sub-groups, but is also able to predict cluster membership and survival on test data better than other competing methods. Availability and implementation: Our R code can be accessed at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-26
  • pLoc-mAnimal: predict subcellular localization of animal proteins with
           both single and multiple sites
    • Authors: Cheng X; Zhao S, Lin W, et al.
      Abstract: Motivation: Cells are deemed the basic unit of life. However, many important functions of cells as well as their growth and reproduction are performed via the protein molecules located at their different organelles or locations. Facing explosive growth of protein sequences, we are challenged to develop fast and effective methods to annotate their subcellular localization. However, this is by no means an easy task. In particular, mounting evidence indicates that proteins have a multi-label feature, meaning that they may simultaneously exist at, or move between, two or more different subcellular location sites. Unfortunately, most of the existing computational methods can only be used to deal with single-label proteins. Although the recently developed ‘iLoc-Animal’ predictor is quite powerful and can also deal with animal proteins with multiple locations, its prediction quality needs to be improved, particularly in enhancing the absolute true rate and reducing the absolute false rate. Results: Here we propose a new predictor called ‘pLoc-mAnimal’, which is superior to iLoc-Animal. When tested by the most rigorous cross-validation on the same high-quality benchmark dataset, the absolute true success rate achieved by the new predictor is 37% higher and the absolute false rate is four times lower in comparison with the state-of-the-art predictor. Availability and implementation: To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at [URL missing], by which users can easily get their desired results without the need to go through the complicated mathematics. Contact: kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-24
  • SPRINT: an SNP-free toolkit for identifying RNA editing sites
    • Authors: Zhang F; Lu Y, Yan S, et al.
      Abstract: Motivation: RNA editing generates post-transcriptional sequence alterations. Detection of RNA editing sites (RESs) typically requires the filtering of SNVs called from RNA-seq data using an SNP database, an obstacle that is difficult to overcome for most organisms. Results: Here, we present a novel method named SPRINT that identifies RESs without the need to filter out SNPs. SPRINT also integrates the detection of hyper RESs from remapped reads, and has been fully automated for any RNA-seq data with a reference genome sequence available. We have rigorously validated SPRINT’s effectiveness in detecting RESs using RNA-seq data of samples in which genes encoding RNA editing enzymes are knocked down or over-expressed, and have also demonstrated its superiority over current methods. We have applied SPRINT to investigate RNA editing across tissues and species, and also in the development of the mouse embryonic central nervous system. A web resource ([URL missing]) of RESs identified by SPRINT has been constructed. Availability and implementation: The software and related data are available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-24
  • CrocoBLAST: Running BLAST efficiently in the age of next-generation
    • Authors: Tristão Ramos R; de Azevedo Martins A, da Silva Delgado G, et al.
      Abstract: Summary: CrocoBLAST is a tool for dramatically speeding up BLAST+ execution on any computer. Alignments that would take days or weeks with NCBI BLAST+ can be run overnight with CrocoBLAST. Additionally, CrocoBLAST provides features critical for NGS data analysis, including: results identical to those of BLAST+; compatibility with any BLAST+ version; real-time information regarding calculation progress and remaining run time; access to partial alignment results; queueing, pausing, and resuming BLAST+ calculations without information loss. Availability and implementation: CrocoBLAST is freely available online, with ample documentation ([URL missing]). No installation or user registration is required. CrocoBLAST is implemented in C, while the graphical user interface is implemented in Java. CrocoBLAST is supported under Linux and Windows, and can be run under Mac OS X in a Linux virtual machine. Contact: jkoca@ceitec.cz. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-24
  • myVCF: a desktop application for high-throughput mutations data management
    • Authors: Pietrelli A; Valenti L.
      Abstract: Summary: Next-generation sequencing technologies have become the most powerful tool to discover genetic variants associated with human diseases. Although the dramatic reductions in costs facilitate their use in the wet-lab and clinics, the huge amount of data generated renders their management by non-expert researchers and physicians extremely difficult. Therefore, there is an urgent need for novel approaches and tools aimed at getting the ‘end-users’ closer to the sequencing data, facilitating access by non-bioinformaticians, and speeding up the functional interpretation of genetic variants. We developed myVCF, a standalone, easy-to-use desktop application, which is based on a browser interface and is suitable for Windows, Mac and UNIX systems. myVCF is an efficient platform that manages multiple sequencing projects created from VCF files within the system; stores genetic variants and sample genotypes from annotated VCF files in a SQLite database; and implements a flexible search engine for data exploration, allowing queries by chromosomal region, gene, single variant or dbSNP ID. In addition, myVCF generates a summary statistics report about the distribution of mutations across samples and across the genome/exome by aggregating the information within the VCF file. In summary, the myVCF platform allows end-users without strong programming and bioinformatics skills to explore, query, visualize and export mutation data in a simple and straightforward way. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-24
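The myVCF abstract above describes storing variants in a SQLite database and querying them by chromosomal region, gene or dbSNP ID. As a rough sketch of what a region query could look like, the following uses Python's standard `sqlite3` module. The table layout, column names, toy rows and the helpers `make_demo_db` and `variants_in_region` are all assumptions made for illustration; they do not reflect myVCF's actual schema or code.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory database with a hypothetical variants table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT, rsid TEXT)"
    )
    # Made-up placeholder rows, not real variants.
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("1", 120, "A", "G", "GENE_A", "rs_demo1"),
            ("1", 180, "C", "T", "GENE_A", "rs_demo2"),
            ("1", 950, "G", "A", "GENE_B", "rs_demo3"),
            ("2", 150, "T", "C", "GENE_C", "rs_demo4"),
        ],
    )
    conn.commit()
    return conn

def variants_in_region(conn, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end, ordered by position."""
    cur = conn.execute(
        "SELECT chrom, pos, ref, alt, gene, rsid FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()

if __name__ == "__main__":
    db = make_demo_db()
    for row in variants_in_region(db, "1", 100, 200):
        print(row)
```

Parameter placeholders (`?`) keep user-supplied region bounds out of the SQL string, which matters for any tool that exposes such queries through a browser interface.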
  • ggseqlogo: a versatile R package for drawing sequence logos
    • Authors: Wagih O.
      Abstract: Summary: Sequence logos have become a crucial visualization method for studying underlying sequence patterns in the genome. Despite this, there remains a scarcity of software packages that provide the versatility often required for such visualizations. ggseqlogo is an R package built on the ggplot2 package that aims to address this issue. ggseqlogo offers native illustration of publication-ready DNA, RNA and protein sequence logos in a highly customizable fashion with features including multi-logo plots, qualitative and quantitative colour schemes, annotation of logos and integration with other plots. The package is intuitive to use and seamlessly integrates into R analysis pipelines. Availability and implementation: ggseqlogo is released under the GNU licence and is freely available via CRAN, the Comprehensive R Archive Network ([URL missing]). A detailed tutorial can be found at [URL missing].
      PubDate: 2017-07-20
  • Phylotyper: in silico predictor of gene subtypes
    • Authors: Whiteside M; Gannon V, Laing C.
      Abstract: Summary: Whole genome sequencing (WGS) is being adopted in public health for improved surveillance and outbreak analysis. In public health, subtyping has been used to infer phenotypes and distinguish bacterial strain groups. In silico tools that predict subtypes from sequence data are needed to transition historical data to WGS-based protocols. Phylotyper is a novel solution for in silico subtype prediction from gene sequences. Designed for incorporation into WGS pipelines, it is a general prediction tool that can be applied to different subtype schemes. Phylotyper uses phylogeny to model the evolution of the subtype and infer subtypes for unannotated sequences. The phylogenetic framework in Phylotyper improves accuracy over approaches based solely on sequence similarity and provides useful contextual feedback. Availability and implementation: Phylotyper is a Python and R package. It is available from: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-18
  • SPATKIN: a simulator for rule-based modeling of biomolecular site dynamics
           on surfaces
    • Authors: Kochańczyk M; Hlavacek W, Lipniacki T.
      Abstract: Summary: Rule-based modeling is a powerful approach for studying biomolecular site dynamics. Here, we present SPATKIN, a general-purpose simulator for rule-based modeling in two spatial dimensions. The simulation algorithm is a lattice-based method that tracks Brownian motion of individual molecules and the stochastic firing of rule-defined reaction events. Because rules are used as event generators, the algorithm is network-free, meaning that it does not require generating the complete reaction network implied by rules prior to simulation. In a simulation, each molecule (or complex of molecules) is taken to occupy a single lattice site that cannot be shared with another molecule (or complex). SPATKIN is capable of simulating a wide array of membrane-associated processes, including adsorption, desorption and crowding. Models are specified using an extension of the BioNetGen language, which makes it possible to account for spatial features of the simulated process. Availability and implementation: The C++ source code for SPATKIN is distributed freely under the terms of the GNU GPLv3 license. The source code can be compiled for execution on popular platforms (Windows, Mac and Linux). An installer for 64-bit Windows and a macOS app are available. The source code and precompiled binaries are available at the SPATKIN Web site ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-17
  • Analysis and prediction of protein folding energy changes upon mutation by
           element specific persistent homology
    • Authors: Cang Z; Wei G.
      Abstract: Motivation: Site-directed mutagenesis is widely used to understand the structure and function of biomolecules. Computational prediction of mutation impacts on protein stability offers a fast, economical and potentially accurate alternative to laboratory mutagenesis. Most existing methods rely on geometric descriptions; this work introduces a topology-based approach to provide an entirely new representation of mutation-induced protein stability changes that could not be obtained from conventional techniques. Results: A topology-based mutation predictor (T-MP) is introduced to dramatically reduce the geometric complexity and number of degrees of freedom of proteins, while element-specific persistent homology is proposed to retain essential biological information. The present approach is found to outperform other existing methods in the prediction of globular protein stability changes upon mutation. A Pearson correlation coefficient of 0.82 with an RMSE of 0.92 kcal/mol is obtained on a test set of 350 mutation samples. For the prediction of membrane protein stability changes upon mutation, the proposed topological approach has an 84% higher Pearson correlation coefficient than the current state-of-the-art empirical methods, achieving a Pearson correlation of 0.57 and an RMSE of 1.09 kcal/mol in a 5-fold cross validation on a set of 223 membrane protein mutation samples. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-14
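The abstract above reports its accuracy as a Pearson correlation coefficient and a root-mean-square error (RMSE). For readers who want the two definitions spelled out, here is a minimal sketch using only the standard library. The function names `pearson` and `rmse` are illustrative and are not taken from the T-MP software.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations share the same 1/n factor, so it cancels.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rmse(predicted, observed):
    """Root-mean-square error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

Pearson correlation measures how well predictions track the trend of the measured values, while RMSE (here in kcal/mol) measures their absolute deviation, so the two are usually reported together.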
  • Prediction and modeling of pre-analytical sampling errors as a strategy to
           improve plasma NMR metabolomics data
    • Authors: Brunius C; Pedersen A, Malmodin D, et al.
      Abstract: Motivation: Biobanks are important infrastructures for life science research. Optimal sample handling, e.g. regarding collection and processing of biological samples, is highly complex, with many variables that could alter sample integrity, and becomes even more complex when considering multiple study centers or using legacy samples with limited documentation on sample management. Novel means to understand and take into account such variability would enable high-quality research on archived samples. Results: This study investigated whether pre-analytical sample variability could be predicted and reduced by modeling alterations in the plasma metabolome, measured by NMR, as a function of pre-centrifugation conditions (1–36 h pre-centrifugation delay time at 4 °C and 22 °C) in 16 individuals. Pre-centrifugation temperature and delay times were predicted using random forest modeling and performance was validated on independent samples. Alterations in the metabolome were modeled at each temperature using a cluster-based approach, revealing reproducible effects of delay time on energy metabolism intermediates at both temperatures, but more pronounced at 22 °C. Moreover, pre-centrifugation delay at 4 °C resulted in large, specific variability at 3 h, predominantly of lipids. Pre-analytical sample handling error correction resulted in significant improvement of data quality, particularly at 22 °C. This approach offers the possibility to predict pre-centrifugation delay temperature and time in biobanked samples before use in costly downstream applications. Moreover, the results suggest potential to decrease the impact of undesired, delay-induced variability. However, these findings need to be validated in multiple, large sample sets and with analytical techniques covering a wider range of the metabolome, such as LC-MS. Availability and implementation: The method is available as the sampleDrift R package. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-14
  • A robust DF-REML framework for variance components estimation in genetic
    • Authors: Lourenço V; Rodrigues P, Pires A, et al.
      Abstract: Motivation: In genetic association studies, linear mixed models (LMMs) are used to test for associations between phenotypes and candidate single nucleotide polymorphisms (SNPs). These same models are also used to estimate heritability, which is central not only to evolutionary biology but also to the prediction of the response to selection in plant and animal breeding, as well as the prediction of disease risk in humans. However, when one or more of the underlying assumptions are violated, the estimation of variance components may be compromised and therefore so may the estimates of heritability and any other functions of these. Considering that datasets obtained from real-life experiments are prone to several sources of contamination, which usually induce the violation of the assumption of the normality of the errors, a robust derivative-free restricted-maximum likelihood framework (DF-REML) together with a robust coefficient of determination are proposed for the LMM in the context of genetic studies of continuous traits. Results: The proposed approach, in addition to the robust estimation of variance components and robust computation of the coefficient of determination, allows in particular for the robust estimation of SNP-based heritability by reducing the bias and increasing the precision of its estimates. The performance of both classical and robust DF-REML approaches is compared via a Monte Carlo simulation study. Additionally, three examples of application of the methodologies to real datasets are given in order to validate the usefulness of the proposed robust approach. Although the main focus of this article is on plant breeding applications, the proposed methodology is applicable to both human and animal genetic studies. Availability and implementation: Source code implemented in R is available in the Supplementary Material. Contact: vmml@fct.unl.pt Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-14
  • Standardizing biomass reactions and ensuring complete mass balance in
           genome-scale metabolic models
    • Authors: Chan S; Cai J, Wang L, et al.
      Abstract: Motivation: In a genome-scale metabolic model, the biomass produced is defined to have a molecular weight (MW) of 1 g mmol⁻¹. This is critical for correctly predicting growth yields, contrasting multiple models and, more importantly, modeling microbial communities. However, the standard is rarely verified in current practice, and the chemical formulae of biomass components such as proteins, nucleic acids and lipids are often represented by undefined side groups (e.g. X, R). Results: We introduced a systematic procedure for checking the biomass weight and ensuring complete mass balance of a model. We identified significant departures after examining 64 published models. The biomass weights of 34 models differed by 5–50%, while 8 models have discrepancies >50%. In total, 20 models were manually curated. By maximizing the original versus corrected biomass reactions, flux balance analysis revealed >10% differences in growth yields for 12 of the curated models. Biomass MW discrepancies are accentuated in microbial community simulations as they can cause significant and systematic errors in the community composition. Microbes with underestimated biomass MWs are overpredicted in the community whereas microbes with overestimated biomass weights are underpredicted. The observed departures in community composition are disproportionately larger than the discrepancies in the biomass weight estimate. We propose the presented procedure as a standard practice for metabolic reconstructions. Availability and implementation: The MATLAB and Python scripts are available in the Supplementary Material. Contact: joshua.chan@connect.polyu.hk Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-14
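The 1 g mmol⁻¹ convention described in the abstract above can be illustrated with a simple mass sum over a biomass reaction. The sketch below is a deliberately simplified illustration, not the authors' procedure (which also has to deal with undefined side groups). The function names, the sign convention and the toy metabolites A, B and C are all assumptions made for this example.

```python
def biomass_weight(stoichiometry, molecular_weights):
    """Net mass (g) converted into one unit of biomass.

    stoichiometry: metabolite -> signed coefficient in mmol
                   (negative = consumed, positive = released by-product).
    molecular_weights: metabolite -> molecular weight in g/mmol.
    """
    return -sum(coef * molecular_weights[met] for met, coef in stoichiometry.items())


def relative_discrepancy(weight, target=1.0):
    """Fractional deviation of the biomass weight from the 1 g/mmol standard."""
    return abs(weight - target) / target


# Toy values chosen only so the arithmetic is easy to follow (hypothetical).
toy_stoichiometry = {"A": -2.0, "B": -1.0, "C": 1.0}
toy_weights = {"A": 0.3, "B": 0.5, "C": 0.1}
weight = biomass_weight(toy_stoichiometry, toy_weights)
print(f"biomass weight: {weight:.3f} g/mmol, "
      f"discrepancy: {relative_discrepancy(weight):.1%}")
```

A model whose biomass reaction sums to, say, 1.25 g per mmol would show a 25% discrepancy under this check, which falls in the 5–50% band the abstract reports for 34 of the examined models.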
  • DynaPho: a web platform for inferring the dynamics of time-series
    • Authors: Hsu C; Wang J, Lu P, et al.
      Abstract: Summary: Large-scale phosphoproteomics studies have improved our understanding of dynamic cellular signaling, but the downstream analysis of phosphoproteomics data is still a bottleneck. We developed DynaPho, a web-based tool providing comprehensive and in-depth analyses of time-course phosphoproteomics data, making analysis intuitive and accessible to non-bioinformatics experts. The tool currently implements five analytic modules, which reveal the transition of biological pathways, kinase activity, dynamics of interaction networks and the predicted kinase-substrate associations. These features can assist users in translating their large-scale time-course phosphoproteomics data into valuable biological discoveries. Availability and implementation: DynaPho is freely available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-07
  • The value of prior knowledge in machine learning of complex network
    • Authors: Ferranti D; Krane D, Craft D.
      Abstract: Motivation: Our overall goal is to develop machine-learning approaches based on genomics and other relevant accessible information for use in predicting how a patient will respond to a given proposed drug or treatment. Given the complexity of this problem, we begin by developing, testing and analyzing learning methods using data from simulated systems, which allows us access to a known ground truth. We examine the benefits of using prior system knowledge and investigate how learning accuracy depends on various system parameters as well as the amount of training data available. Results: The simulations are based on Boolean networks (directed graphs with 0/1 node states and logical node update rules), which are the simplest computational systems that can mimic the dynamic behavior of cellular systems. Boolean networks can be generated and simulated at scale, have complex yet cyclical dynamics and as such provide a useful framework for developing machine-learning algorithms for modular and hierarchical networks such as biological systems in general and cancer in particular. We demonstrate that utilizing prior knowledge (in the form of network connectivity information), without detailed state equations, greatly increases the power of machine-learning algorithms to predict network steady-state node values (‘phenotypes’) and perturbation responses (‘drug effects’). Availability and implementation: Links to code and datasets are provided online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-07-07
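The Boolean networks described in the abstract above can be simulated in a few lines. The sketch below assumes synchronous updates (every node updated at once), which the abstract does not specify, and uses a hypothetical three-node network. The names `step` and `find_attractor` are illustrative and not taken from the authors' code.

```python
def step(state, rules):
    """Apply every node's update rule synchronously to a tuple of 0/1 states."""
    return tuple(int(rule(state)) for rule in rules)


def find_attractor(state, rules):
    """Iterate until a state repeats and return the repeating cycle of states.

    A cycle of length 1 is a steady state; longer cycles are oscillations.
    """
    seen = {}        # state -> index at which it was first visited
    trajectory = []  # states in visiting order
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return trajectory[seen[state]:]


# Hypothetical network: A = B AND C, B = A OR C, C = C (an input held fixed).
rules = [
    lambda s: s[1] and s[2],
    lambda s: s[0] or s[2],
    lambda s: s[2],
]

for start in [(0, 0, 1), (1, 0, 0)]:
    print(start, "->", find_attractor(start, rules))
```

Because the state space is finite and the update is deterministic, every trajectory must eventually revisit a state, so the loop always terminates. The returned cycle corresponds to the steady-state or cyclical behavior the abstract refers to.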
  • gVolante for standardizing completeness assessment of genome and
           transcriptome assemblies
    • Authors: Nishimura O; Hara Y, Kuraku S.
      Abstract: Motivation: Along with the increasing accessibility of comprehensive sequence information, such as whole genomes and transcriptomes, the demand for assessing their quality has multiplied. To this end, metrics based on sequence lengths, such as N50, have become a standard, but they only evaluate one aspect of assembly quality. Conversely, analyzing the coverage of pre-selected reference protein-coding genes provides essential content-based quality assessment, but the currently available pipelines for this purpose, CEGMA and BUSCO, do not have a user-friendly interface to serve as a uniform environment for assembly completeness assessment. Results: Here, we introduce a brand-new web server, gVolante, which provides an online tool for (i) on-demand completeness assessment of sequence sets by means of the previously developed pipelines CEGMA and BUSCO and (ii) browsing pre-computed completeness scores for publicly available data in its database section. Completeness assessments performed on gVolante report scores based not just on the coverage of reference genes but also on sequence lengths (e.g. N50 scaffold length), allowing quality control in multiple aspects. Using gVolante, one can compare the quality of original assemblies between their multiple versions (obtained through program choice and parameter tweaking, for example) and evaluate them in comparison to the scores of public resources found in the database section. Availability and implementation: gVolante is freely available online.
      PubDate: 2017-07-07
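The length-based N50 metric mentioned in the abstract above is conventionally defined as the length L such that sequences of length L or longer together cover at least half of the total assembly. A minimal sketch of that definition follows. The function name `n50` is illustrative and is not part of gVolante, CEGMA or BUSCO.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least half of the assembly is covered
            return length
```

As the abstract notes, a metric like this captures only contiguity. Two assemblies with the same N50 can differ widely in how many reference genes they contain, which is why gVolante reports it alongside content-based completeness scores.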
  • CausalR: extracting mechanistic sense from genome scale data
    • Authors: Bradley G; Barrett S.
      Abstract: Summary: Utilization of causal interaction data enables mechanistic rather than descriptive interpretation of genome-scale data. Here we present CausalR, the first open source causal network analysis platform. Implemented functions enable regulator prediction and network reconstruction, with network and annotation files created for visualization in Cytoscape. False positives are limited using the introduced Sequential Causal Analysis of Networks approach. Availability and implementation: CausalR is implemented in R, parallelized, and is available from Bioconductor. Contact: glyn.x.bradley@gsk.com Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-06-29
  • TSGSIS: a high-dimensional grouped variable selection approach for
           detection of whole-genome SNP–SNP interactions
    • Authors: Fang Y; Wang J, Hsiung C.
      Abstract: Motivation: Identification of single nucleotide polymorphism (SNP) interactions is an important and challenging topic in genome-wide association studies (GWAS). Many approaches have been applied to detecting whole-genome interactions. However, these approaches to interaction analysis tend to miss causal interaction effects when the individual marginal effects are uncorrelated with the trait, while their interaction effects are highly associated with the trait. Results: A grouped variable selection technique, called two-stage grouped sure independence screening (TS-GSIS), is developed to study interactions that may not have marginal effects. The proposed TS-GSIS is shown to be very helpful in identifying not only causal SNP effects that are uncorrelated with the trait but also their corresponding SNP–SNP interaction effects. TS-GSIS gains power to detect interaction effects by using the joint information among the SNPs and by determining the size of candidate sets in the model. Simulation studies under various scenarios are performed to compare the performance of TS-GSIS and current approaches. We also apply our approach to a real rheumatoid arthritis (RA) dataset. Both the simulation and real data studies show that TS-GSIS performs very well in detecting SNP–SNP interactions. Availability and implementation: The R package is delivered through CRAN. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-06-23
  • Multimodal mechanistic signatures for neurodegenerative diseases
           (NeuroMMSig): a web server for mechanism enrichment
    • Authors: Domingo-Fernández D; Kodamullil A, Iyappan A, et al.
      Abstract: Motivation: The concept of a ‘mechanism-based taxonomy of human disease’ is currently replacing the outdated paradigm of diseases classified by clinical appearance. We have tackled the paradigm of mechanism-based patient subgroup identification in the challenging area of research on neurodegenerative diseases. Results: We have developed a knowledge base representing essential pathophysiology mechanisms of neurodegenerative diseases. Together with dedicated algorithms, this knowledge base forms the basis for a ‘mechanism-enrichment server’ that supports the mechanistic interpretation of multiscale, multimodal clinical data. Availability and implementation: NeuroMMSig is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-06-23