Bioinformatics
  [SJR: 4.643]   [H-I: 271]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
   Published by Oxford University Press  [370 journals]
  • Molecular signatures that can be transferred across different omics
    • Authors: Altenbuchinger MM; Schwarzfischer PP, Rehberg TT, et al.
      Abstract: Bioinformatics (2017) 33 (14): i333-i340.
      PubDate: 2017-08-08
  • Kollector: transcript-informed, targeted de novo assembly of gene loci
    • Authors: Kucuk E; Chu J, Vandervalk BP, et al.
      Abstract: Bioinformatics (2017) 33(12), 1782–1788
      PubDate: 2017-07-11
  • MD-TASK: a software suite for analyzing molecular dynamics trajectories
    • Authors: Brown DK; Penkler DL, Sheik Amamuddy O, et al.
      Abstract: Summary: Molecular dynamics (MD) determines the physical motions of atoms of a biological macromolecule in a cell-like environment and is an important method in structural bioinformatics. Traditionally, measurements such as root mean square deviation, root mean square fluctuation, radius of gyration, and various energy measures have been used to analyze MD simulations. Here, we present MD-TASK, a novel software suite that employs graph theory techniques, perturbation response scanning, and dynamic cross-correlation to provide unique ways for analyzing MD trajectories. Availability and implementation: MD-TASK has been open-sourced and is available for download from, implemented in Python and supported on Linux/
      PubDate: 2017-05-31
  • On patterns and re-use in bioinformatics databases
    • Authors: Bell MJ; Lord P.
      Abstract: Motivation: As the quantity of data being deposited into biological databases continues to increase, it becomes ever more vital to develop methods that enable us to understand this data and ensure that the knowledge is correct. It is widely held that data percolates between different databases, which causes particular concerns for data correctness; if this percolation occurs, incorrect data in one database may eventually affect many others while, conversely, corrections in one database may fail to percolate to others. In this paper, we test this widely held belief by directly looking for sentence reuse both within and between databases. Further, we investigate patterns of how sentences are reused over time. Finally, we consider the limitations of this form of analysis and the implications that this may have for bioinformatics database design. Results: We show that reuse of annotation is common within many different databases, and that there is also a detectable level of reuse between databases. In addition, we show that there are patterns of reuse that have previously been shown to be associated with percolation errors. Availability and implementation: Analytical software is available on
      PubDate: 2017-05-19
  • PIXiE: an algorithm for automated ion mobility arrival time extraction and
           collision cross section calculation using global data association
    • Authors: Ma J; Casey CP, Zheng X, et al.
      Abstract: Motivation: Drift tube ion mobility spectrometry coupled with mass spectrometry (DTIMS-MS) is increasingly implemented in high throughput omics workflows, and new informatics approaches are necessary for processing the associated data. To automatically extract arrival times for molecules measured by DTIMS at multiple electric fields and compute their associated collisional cross sections (CCS), we created the PNNL Ion Mobility Cross Section Extractor (PIXiE). The primary application presented for this algorithm is the extraction of data that can then be used to create a reference library of experimental CCS values for use in high throughput omics analyses. Results: We demonstrate the utility of this approach by automatically extracting arrival times and calculating the associated CCSs for a set of endogenous metabolites and xenobiotics. The PIXiE-generated CCS values were within error of those calculated using commercially available instrument vendor software. Availability and implementation: PIXiE is an open-source tool, freely available on Github. The documentation, source code of the software, and a GUI can be found at and the source code of the backend workflow library used by PIXiE can be found at or thomas.metz@pnnl.gov Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-15
  • MAGenTA: a Galaxy implemented tool for complete Tn-Seq analysis and data
    • Authors: McCoy K; Antonio ML, van Opijnen T.
      Abstract: Motivation: Transposon insertion sequencing (Tn-Seq) is a microbial systems-level tool that can determine, on a genome-wide scale and in high throughput, whether a gene, or a specific genomic region, is important for fitness under a specific experimental condition. Results: Here, we present MAGenTA, a suite of analysis tools which accurately calculate the growth rate for each disrupted gene in the genome to enable the discovery of: (i) new leads for gene function; (ii) non-coding RNAs; (iii) genes, pathways and ncRNAs that are involved in tolerating drugs or induce disease; (iv) higher order genome organization; and (v) host-factors that affect bacterial host susceptibility. MAGenTA is a complete Tn-Seq analysis pipeline making sensitive genome-wide fitness (i.e. growth rate) analysis available for most transposons and Tn-Seq associated approaches (e.g. TraDis, HiTS, IN-Seq) and includes fitness (growth rate) calculations, sliding window analysis, bottleneck calculations and corrections, statistics to compare experiments and strains and genome-wide fitness visualization. Availability and implementation: MAGenTA is available at the Galaxy public ToolShed repository and all source code can be found and is freely available at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-11
  • PROXiMATE: a database of mutant protein–protein complex
           thermodynamics and kinetics
    • Authors: Jemimah S; Yugandhar KK, Michael Gromiha MM.
      Abstract: Summary: We have developed PROXiMATE, a database of thermodynamic data for more than 6000 missense mutations in 174 heterodimeric protein–protein complexes, supplemented with interaction network data from STRING database, solvent accessibility, sequence, structural and functional information, experimental conditions and literature information. Additional features include complex structure visualization, search and display options, download options and a provision for users to upload their data. Availability and implementation: The database is freely available at The website is implemented in Python, and supports recent versions of major browsers such as IE10, Firefox, Chrome and Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-11
  • POSSUM: a bioinformatics toolkit for generating numerical sequence feature
           descriptors based on PSSM profiles
    • Authors: Wang J; Yang B, Revote J, et al.
      Abstract: Summary: Evolutionary information in the form of a Position-Specific Scoring Matrix (PSSM) is a widely used and highly informative representation of protein sequences. Accordingly, PSSM-based feature descriptors have been successfully applied to improve the performance of various predictors of protein attributes. Even though a number of algorithms have been proposed in previous studies, there is currently no universal web server or toolkit available for generating this wide variety of descriptors. Here, we present POSSUM (Position-Specific Scoring matrix-based feature generator for machine learning), a versatile toolkit with an online web server that can generate 21 types of PSSM-based feature descriptors, thereby addressing a crucial need for bioinformaticians and computational biologists. We envisage that this comprehensive toolkit will be widely used as a powerful tool to facilitate feature extraction, selection, and benchmarking of machine learning-based models, thereby contributing to a more effective analysis and modeling pipeline for bioinformatics research. Availability and implementation: or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-10
  • rMFilter: acceleration of long read-based structure variation calling by
           chimeric read filtering
    • Authors: Liu B; Jiang T, Yiu SM, et al.
      Abstract: Motivation: Long read sequencing technologies provide new opportunities to investigate genome structural variations (SVs) more accurately. However, the state-of-the-art SV calling pipelines are computationally intensive and the applications of long reads are restricted. Results: We propose a local region match-based filter (rMFilter) to efficiently nail down chimeric noisy long reads based on short token matches within local genomic regions. rMFilter is able to substantially accelerate long read-based SV calling pipelines without loss of effectiveness. It can be easily integrated into current long read-based pipelines to facilitate SV studies. Availability and implementation: The C++ source code of rMFilter is available at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-08
  • Foldit Standalone: a video game-derived protein structure manipulation
           interface using Rosetta
    • Authors: Kleffner R; Flatten J, Leaver-Fay A, et al.
      Abstract: Summary: Foldit Standalone is an interactive graphical interface to the Rosetta molecular modeling package. In contrast to most command-line or batch interactions with Rosetta, Foldit Standalone is designed to allow easy, real-time, direct manipulation of protein structures, while also giving access to the extensive power of Rosetta computations. Derived from the user interface of the scientific discovery game Foldit (itself based on Rosetta), Foldit Standalone has added more advanced features and removed the competitive game elements. Foldit Standalone was built from the ground up with a custom rendering and event engine, configurable visualizations and interactions driven by Rosetta. Foldit Standalone contains, among other features: electron density and contact map visualizations, multiple sequence alignment tools for template-based modeling, rigid body transformation controls, RosettaScripts support and an embedded Lua interpreter. Availability and Implementation: Foldit Standalone is available for download at, under the Rosetta license, which is free for academic and non-profit users. It is implemented in cross-platform C++ and binary executables are available for Windows, macOS and
      PubDate: 2017-05-08
  • RankProd 2.0: a refactored bioconductor package for detecting
           differentially expressed features in molecular profiling datasets
    • Authors: Del Carratore F; Jankevics A, Eisinga R, et al.
      Abstract: Motivation: The Rank Product (RP) is a statistical technique widely used to detect differentially expressed features in molecular profiling experiments such as transcriptomics, metabolomics and proteomics studies. An implementation of the RP and the closely related Rank Sum (RS) statistics has been available in the RankProd Bioconductor package for several years. However, several recent advances in the understanding of the statistical foundations of the method have made a complete refactoring of the existing package desirable. Results: We implemented a completely refactored version of the RankProd package, which provides a more principled implementation of the statistics for unpaired datasets. Moreover, the permutation-based P-value estimation methods have been replaced by exact methods, providing faster and more accurate results. Availability and implementation: RankProd 2.0 is available at Bioconductor ( and as part of the mzMatch pipeline ( Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-08
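      Note: As a rough illustration of the Rank Product statistic named in the RankProd 2.0 abstract above, the sketch below ranks each feature within every replicate and takes the geometric mean of those ranks. It is not the RankProd package's code or API: the function name `rank_products`, the input layout (one row per feature, one column per replicate), the choice that rank 1 is the largest value, and the absence of tie handling are all assumptions made for illustration, and no P-values (exact or permutation-based) are computed.

```python
import math


def rank_products(values):
    """Geometric mean of per-replicate ranks for each feature.

    `values` is a list of rows (one per feature); each row holds one
    measurement per replicate, e.g. a log fold change. Rank 1 is given to
    the largest value in a replicate. Ties are broken by input order,
    which is a simplification (hypothetical helper, not RankProd's API).
    """
    if not values or not values[0]:
        raise ValueError("values must be a non-empty matrix")
    n_features = len(values)
    n_replicates = len(values[0])
    if any(len(row) != n_replicates for row in values):
        raise ValueError("all rows must have the same number of replicates")

    # Accumulate log-ranks so the product cannot overflow for many replicates.
    log_rank_sums = [0.0] * n_features
    for j in range(n_replicates):
        column = [row[j] for row in values]
        order = sorted(range(n_features), key=lambda i: column[i], reverse=True)
        for rank, feature_index in enumerate(order, start=1):
            log_rank_sums[feature_index] += math.log(rank)

    return [math.exp(s / n_replicates) for s in log_rank_sums]


if __name__ == "__main__":
    # Three features measured in two replicates (made-up numbers).
    example = [[3.0, 2.5], [1.0, 0.5], [2.0, 3.0]]
    print(rank_products(example))
```

      Features with a small rank product are consistently near the top of every replicate; deciding how small is significant is the part the package addresses with its exact methods.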
  • An efficient graph kernel method for non-coding RNA functional prediction
    • Authors: Navarin N; Costa F.
      Abstract: Motivation: The importance of RNA protein-coding gene regulation is by now well appreciated. Non-coding RNAs (ncRNAs) are known to regulate gene expression at practically every stage, ranging from chromatin packaging to mRNA translation. However, the functional characterization of specific instances remains a challenging task in genome-scale settings. For this reason, automatic annotation approaches are of interest. Existing computational methods are either efficient but inaccurate, or they offer increased precision but present scalability problems. Results: In this article, we present a predictive system based on kernel methods, a type of machine learning algorithm grounded in statistical learning theory. We employ a flexible graph encoding to preserve multiple structural hypotheses and exploit recent advances in representation and model induction to scale to large data volumes. Experimental results on tens of thousands of ncRNA sequences available from the Rfam database indicate that we can not only improve upon state-of-the-art predictors, but also achieve speedups of several orders of magnitude. Availability and implementation: The code is available from
      PubDate: 2017-05-05
  • Assembling draft genomes using contiBAIT
    • Authors: O’Neill K; Hills M, Gottlieb M, et al.
      Abstract: Summary: Massively parallel sequencing is now widely used, but data interpretation is only as good as the reference assembly to which it is aligned. While the number of reference assemblies has rapidly expanded, most of these remain at intermediate stages of completion, either as scaffold builds, or as chromosome builds (consisting of correctly ordered, but not necessarily correctly oriented scaffolds separated by gaps). Completion of de novo assemblies remains difficult, as regions that are repetitive or hard to sequence prevent the accumulation of larger scaffolds, and create errors such as misorientations and mislocalizations. Thus, complementary methods for determining the orientation and positioning of fragments are important for finishing assemblies. Strand-seq is a method for determining template strand inheritance in single cells, information that can be used to determine relative genomic distance and orientation between scaffolds, and find errors within them. We present contiBAIT, an R/Bioconductor package which uses Strand-seq data to repair and improve existing assemblies. Availability and Implementation: contiBAIT is available on Bioconductor. Source files are available from or mark.hills@stemcell.com Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-05
  • twoddpcr: an R/Bioconductor package and Shiny app for Droplet Digital PCR
    • Authors: Chiu A; Ayub M, Dive C, et al.
      Abstract: Summary: Droplet Digital PCR (ddPCR) is a sensitive platform used to quantify specific nucleic acid molecules amplified by polymerase chain reactions. Its sensitivity makes it particularly useful for the detection of rare mutant molecules, such as those present in a sample of circulating free tumour DNA obtained from cancer patients. ddPCR works by partitioning a sample into individual droplets, the majority of which contain zero or one target molecule. Each droplet then becomes a reaction chamber for PCR, which through the use of fluorochrome-labelled probes allows the target molecules to be detected by measuring the fluorescence intensity of each droplet. The technology supports two channels, allowing, for example, mutant and wild type molecules to be detected simultaneously in the same sample. As yet, no open source software is available for the automatic gating of two channel ddPCR experiments in the case where the droplets can be grouped into four clusters. Here, we present an open source R package ‘twoddpcr’, which uses Poisson statistics to estimate the number of molecules in such two channel ddPCR data. Using the Shiny framework, an accompanying graphical user interface (GUI) is also included for the package, allowing users to adjust parameters and see the results in real-time. Availability and implementation: twoddpcr is available from Bioconductor (3.5) at A Shiny-based GUI suitable for non-R users is available as a standalone application from within the package and also as a web application at or
      PubDate: 2017-05-05
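      Note: The twoddpcr abstract above mentions using Poisson statistics to estimate molecule numbers from droplet counts. The sketch below shows the generic Poisson correction commonly used for digital PCR, where the mean number of target copies per droplet is estimated as -ln(1 - p) from the fraction p of positive droplets. It is an illustration only, not the twoddpcr package's code: the function names and the example counts are hypothetical, and the gating of droplets into clusters is assumed to have been done already.

```python
import math


def copies_per_droplet(n_positive: int, n_total: int) -> float:
    """Estimate the mean number of target copies per droplet.

    If copies are Poisson-distributed across droplets with mean lam, the
    chance that a droplet is empty is exp(-lam), so
    lam = -ln(1 - n_positive / n_total).
    (Hypothetical helper for illustration, not part of twoddpcr.)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_positive < n_total:
        # With every droplet positive the estimate is unbounded.
        raise ValueError("need 0 <= n_positive < n_total")
    return -math.log(1.0 - n_positive / n_total)


def estimated_molecules(n_positive: int, n_total: int) -> float:
    """Estimated number of target molecules across all counted droplets."""
    return copies_per_droplet(n_positive, n_total) * n_total


if __name__ == "__main__":
    # Made-up counts for one channel: 1,200 positive droplets out of 15,000.
    print(copies_per_droplet(1200, 15000))
    print(estimated_molecules(1200, 15000))
```

      In a two channel experiment the same estimate would be applied to each channel separately, counting a droplet as positive for a channel whenever it falls in a cluster that is positive for that channel.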
  • MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the
    • Authors: Expósito RR; Veiga J, González-Domínguez J, et al.
      Abstract: Summary: This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool. Availability and implementation: Source code in Java and Hadoop as well as a user’s guide are freely available under the GNU GPLv3 license at
      PubDate: 2017-05-05
  • FlashPCA2: principal component analysis of Biobank-scale genotype datasets
    • Authors: Abraham G; Qiu Y, Inouye M.
      Abstract: Motivation: Principal component analysis (PCA) is a crucial step in quality control of genomic data and a common approach for understanding population genetic structure. With the advent of large genotyping studies involving hundreds of thousands of individuals, standard approaches are no longer feasible. However, when the full decomposition is not required, substantial computational savings can be made. Results: We present FlashPCA2, a tool that can perform partial PCA on 1 million individuals faster than competing approaches, while requiring substantially less memory. Availability and implementation: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-05
  • Exploring spatially adjacent TFBS-clustered regions with Hi-C data
    • Authors: Chen H; Jiang S, Zhang Z, et al.
      Abstract: Motivation: Transcription factor binding sites (TFBSs) are clustered in the human genome, forming the TFBS-clustered regions that regulate gene transcription, which requires dynamic chromatin configurations between promoters and distal regulatory elements. Here, we propose a regulatory model called spatially adjacent TFBS-clustered regions (SATs), in which TFBS-clustered regions are connected by spatial proximity as identified by high-resolution Hi-C data. Results: TFBS-clustered regions forming SATs appeared less frequently in gene promoters than did isolated TFBS-clustered regions, whereas SATs as a whole appeared more frequently. These observations indicate that multiple distal TFBS-clustered regions combined to form SATs to regulate genes. Further examination confirmed that a substantial portion of genes regulated by SATs were located between the paired TFBS-clustered regions rather than downstream of them. We reconstructed the chromosomal conformation of the H1 human embryonic stem cell line using the ShRec3D algorithm and proposed the SAT regulatory or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-04
  • Annotating function to differentially expressed LincRNAs in
           myelodysplastic syndrome using a network-based method
    • Authors: Liu K; Beck D, Thoms JI, et al.
      Abstract: Motivation: Long non-coding RNAs (lncRNAs) have been implicated in the regulation of diverse biological functions. The number of newly identified lncRNAs has increased dramatically in recent years but their expression and function have not yet been described for most diseases. To elucidate lncRNA function in human disease, we have developed a novel network-based method (NLCFA) integrating correlations between lncRNA, protein coding genes and noncoding miRNAs. We have also integrated target gene associations and protein-protein interactions and designed our model to provide information on the combined influence of mRNAs, lncRNAs and miRNAs on cellular signal transduction networks. Results: We have generated lncRNA expression profiles from the CD34+ haematopoietic stem and progenitor cells (HSPCs) from patients with Myelodysplastic syndromes (MDS) and healthy donors. We report, for the first time, aberrantly expressed lncRNAs in MDS and further prioritize biologically relevant lncRNAs using the NLCFA. Taken together, our data suggest that aberrant levels of specific lncRNAs are intimately involved in network modules that control multiple cancer-associated signalling pathways and cellular processes. Importantly, our method can be applied to prioritize aberrantly expressed lncRNAs for functional validation in other diseases and biological contexts. Availability and implementation: The method is implemented in R language and Matlab. Contact: xizhou@wakehealth.edu Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-04
  • Accounting for tumor purity improves cancer subtype classification from
           DNA methylation data
    • Authors: Zhang W; Feng H, Wu H, et al.
      Abstract: Motivation: Tumor sample classification has long been an important task in cancer research. Classifying tumors into different subtypes greatly benefits therapeutic development and facilitates application of precision medicine to patients. In practice, solid tumor tissue samples obtained from clinical settings are always mixtures of cancer and normal cells. Thus, the data obtained from these samples are mixed signals. The ‘tumor purity’, or the percentage of cancer cells in a cancer tissue sample, will bias the clustering results if not properly accounted for. Results: In this article, we developed a model-based clustering method and an R function which uses DNA methylation microarray data to infer tumor subtypes with the consideration of tumor purity. Simulation studies and the analyses of The Cancer Genome Atlas data demonstrate improved results compared with existing methods. Availability and implementation: InfiniumClust is part of R package InfiniumPurify, which is freely available from CRAN ( or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-04
  • modlAMP: Python for antimicrobial peptides
    • Authors: Müller AT; Gabernet G, Hiss JA, et al.
      Abstract: Summary: We have implemented the molecular design laboratory’s antimicrobial peptides package (modlAMP), a Python-based software package for the design, classification and visual representation of peptide data. modlAMP offers functions for molecular descriptor calculation and the retrieval of amino acid sequences from public or local sequence databases, and provides instant access to precompiled datasets for machine learning. The package also contains methods for the analysis and representation of circular dichroism spectra. Availability and Implementation: The modlAMP Python package is available under the BSD license from URL or via pip from the Python Package Index (PyPI). Contact: gisbert.schneider@pharma.ethz.ch Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-04
  • KMC 3: counting and manipulating k-mer statistics
    • Authors: Kokot M; Długosz M, Deorowicz S.
      Abstract: Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. The usefulness of the tools is demonstrated on a few real problems. Availability and implementation: The program is freely available at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-04
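      Note: To make concrete what "counting all k-mers" in the KMC 3 abstract above refers to, the sketch below counts the length-k substrings of a DNA sequence in memory. It is a naive illustration and not the KMC3 algorithm or its interface: the function names are hypothetical, and the choices to merge each k-mer with its reverse complement (canonical counting) and to skip windows containing characters other than A, C, G and T are assumptions.

```python
from collections import Counter

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int, canonical: bool = True) -> Counter:
    """Count k-mers in `seq` with a plain in-memory Counter.

    With canonical=True a k-mer and its reverse complement are counted
    under the lexicographically smaller of the two. Windows containing
    any character outside ACGT (for example N) are skipped.
    (Hypothetical helper for illustration, not part of KMC.)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) - set("ACGT"):
            continue
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    print(count_kmers("ACGTACGT", 4))
    print(count_kmers("ACGTACGT", 4, canonical=False))
```

      An in-memory counter like this is only practical for small inputs; handling sequencing-scale datasets efficiently is the problem that dedicated k-mer counters are built to solve.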
  • PopFly: the Drosophila population genomics browser
    • Authors: Hervas S; Sanz E, Casillas S, et al.
      Abstract: Summary: The recent compilation of over 1100 worldwide wild-derived Drosophila melanogaster genome sequences reassembled using a standardized pipeline provides a unique resource for population genomic studies (Drosophila Genome Nexus, DGN). A visual display of the estimated metrics describing genome-wide variation and selection patterns would give researchers a global view and understanding of the evolutionary forces shaping genome variation. Availability and implementation: Here, we present PopFly, a population genomics-oriented genome browser, based on JBrowse software, that contains a complete inventory of population genomic parameters estimated from DGN data. This browser is designed for the automatic analysis and display of genetic variation data within and between populations along the D. melanogaster genome. PopFly allows the visualization and retrieval of functional annotations, estimates of nucleotide diversity metrics, linkage disequilibrium statistics, recombination rates, a battery of neutrality tests, and population differentiation parameters at different window sizes through the euchromatic chromosomes. PopFly is open and freely available at site or
      PubDate: 2017-05-04
  • EBT: a statistic test identifying moderate size of significant features
           with balanced power and precision for genome-wide rate comparisons
    • Authors: Hui X; Hu Y, Sun M, et al.
      Abstract: Motivation: In genome-wide rate comparison studies, objectively identifying an appropriate number of significant features is a major challenge, since traditional statistical comparisons without multi-testing correction can generate a large number of false positives while multi-testing correction tremendously decreases the statistical power. Results: In this study, we proposed a new exact test based on the translation of rate comparison to two binomial distributions. With modeling and real datasets, the exact binomial test (EBT) showed an advantage in balancing the statistical precision and power, by providing an appropriate size of significant features for further studies. Both correlation analysis and bootstrapping tests demonstrated that EBT is as robust as the typical rate-comparison methods, e.g. χ2 test, Fisher’s exact test and Binomial test. Performance comparison among machine learning models with features identified by different statistical tests further demonstrated the advantage of EBT. The new test was also applied to analyze the genome-wide somatic gene mutation rate difference between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), two main lung cancer subtypes, and a list of new markers was identified that could be lineage-specifically associated with carcinogenesis of LUAD and LUSC, respectively. Interestingly, three cilia genes were found selectively with high mutation rates in LUSC, possibly implying the importance of cilia dysfunction in the carcinogenesis. Availability and implementation: An R package implementing EBT can be downloaded freely from the website: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-03
  • A deep learning framework for improving long-range residue–residue
           contact prediction using a hierarchical strategy
    • Authors: Xiong D; Zeng J, Gong H.
      Abstract: Motivation: Residue–residue contacts are of great value for protein structure prediction, since contact information, especially from those long-range residue pairs, can significantly reduce the complexity of conformational sampling for protein structure prediction in practice. Despite progress in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfactory. Methodologies for these hard targets still need further improvement. Results: We presented a computational program DeepConPred, which includes a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) as well as a contact refinement step, to improve the prediction of long-range residue contacts from primary sequences. When compared with previous prediction approaches, our framework employed an effective scheme to identify optimal and important features for contact prediction, and was only trained with coevolutionary information derived from a limited number of homologous sequences to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets. Availability and implementation: All source data and codes are available at or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-03
  • OMSim: a simulator for optical map data
    • Authors: Miclotte G; Plaisance S, Rombauts S, et al.
      Abstract: Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, they can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimic real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements. Availability and implementation: The Python simulation tool and a cross-platform graphical user interface are available as open source software under the GNU GPL v2 license ( Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-03
  • STOPGAP: a database for systematic target opportunity assessment by
           genetic association predictions
    • Authors: Shen J; Song K, Slater AJ, et al.
      Abstract: Summary: We developed the STOPGAP (Systematic Target OPportunity assessment by Genetic Association Predictions) database, an extensive catalog of human genetic associations mapped to effector gene candidates. STOPGAP draws on a variety of publicly available GWAS associations, linkage disequilibrium (LD) measures, functional genomic and variant annotation sources. Algorithms were developed to merge the association data, partition associations into non-overlapping LD clusters, map variants to genes and produce a variant-to-gene score used to rank the relative confidence among potential effector genes. This database can be used for a multitude of investigations into the genes and genetic mechanisms underlying inter-individual variation in human traits, as well as supporting drug discovery applications. Availability and implementation: Shell, R, Perl and Python scripts and STOPGAP R data files (version 2.5.1 at publication) are available at Some of the most useful STOPGAP fields can be queried through an R Shiny web application at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-05-02
  • RIblast: an ultrafast RNA–RNA interaction prediction system based on a
           seed-and-extension approach
    • Authors: Fukunaga T; Hamada M.
      Abstract: Motivation: LncRNAs play important roles in various biological processes. Although more than 58 000 human lncRNA genes have been discovered, most known lncRNAs are still poorly characterized. One approach to understanding the functions of lncRNAs is the detection of the interacting RNA target of each lncRNA. Because experimental detections of comprehensive lncRNA–RNA interactions are difficult, computational prediction of lncRNA–RNA interactions is an indispensable technique. However, the high computational costs of existing RNA–RNA interaction prediction tools prevent their application to large-scale lncRNA datasets. Results: Here, we present ‘RIblast’, an ultrafast RNA–RNA interaction prediction method based on the seed-and-extension approach. RIblast discovers seed regions using suffix arrays and subsequently extends seed regions based on an RNA secondary structure energy model. Computational experiments indicate that RIblast achieves a level of prediction accuracy similar to those of existing programs, but at speeds over 64 times faster than existing programs. Availability and implementation: The source code of RIblast is freely available at or mhamada@waseda.jp Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-28
  • SigSeeker: a peak-calling ensemble approach for constructing epigenetic
    • Authors: Lichtenberg J; Elnitski L, Bodine DM.
      Abstract: Motivation: Epigenetic data are invaluable when determining the regulatory programs governing a cell. Based on use of next-generation sequencing data for characterizing epigenetic marks and transcription factor binding, numerous peak-calling approaches have been developed to determine sites of genomic significance in these data. Such analyses can produce a large number of false positive predictions, suggesting that sites supported by multiple algorithms provide a stronger foundation for inferring and characterizing regulatory programs associated with the epigenetic data. Few methodologies integrate epigenetic-based predictions of multiple approaches when combining profiles generated by different tools. Results: The SigSeeker peak-calling ensemble uses multiple tools to identify peaks, and with user-defined thresholds for peak overlap and signal strength it retains only those peaks that are concordant across multiple tools. Peaks predicted to be co-localized by only a very small number of tools, discovered to be only marginally overlapping, or found to represent significant outliers to the approximation model are removed from the results, providing concise and high quality epigenetic datasets. SigSeeker has been validated using established benchmarks for transcription factor binding and histone modification ChIP-Seq data. These comparisons indicate that the quality of our ensemble technique exceeds that of single tool approaches, enhances existing peak-calling ensembles, and results in epigenetic profiles of higher confidence. Availability and implementation: http://sigseeker.org Contact: lichtenbergj@mail.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-25
  • Neuro-symbolic representation learning on biological knowledge graphs
    • Authors: Alshahrani M; Khan M, Maddouri O, et al.
      Abstract: Motivation: Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. In the past years, feature learning methods that are applicable to graph-structured data have become available, but have not yet been widely applied and evaluated on structured biological knowledge. Results: We develop a novel method for feature learning on biological knowledge graphs. Our method combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode related information within knowledge graphs. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. We apply these embeddings to the prediction of edges in the knowledge graph representing problems of function prediction, finding candidate genes of diseases, protein-protein interactions, or drug target relations, and demonstrate performance that matches and sometimes outperforms traditional approaches based on manually crafted features. Our method can be applied to any biological knowledge graph, and will thereby open up the increasing amount of Semantic Web based knowledge bases in biology to use in machine learning and data analytics. Availability and implementation: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-25
  • veqtl-mapper: variance association mapping for molecular phenotypes
    • Authors: Brown A.
      Abstract: Motivation: Genetic loci associated with the variance of phenotypic traits have been of recent interest as they can be signatures of genetic interactions, gene by environment interactions, parent of origin effects and canalization. We present a fast, efficient tool to map loci affecting variance of gene expression and other molecular phenotypes in cis. Results: Applied to the publicly available Geuvadis gene expression dataset, we identify 816 loci associated with variance of gene expression using an additive model, and 32 showing differences in variance between homozygous and heterozygous alleles, signatures of parent of origin effects. Availability and implementation: Documentation and links to source code and binaries for Linux can be found at andrew.brown@unige.ch Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-25
  • HLA class I binding prediction via convolutional neural networks
    • Authors: Vang YS; Xie X.
      Abstract: Motivation: Many biological processes are governed by protein–ligand interactions. One such example is the recognition of self and non-self cells by the immune system. This immune response process is regulated by the major histocompatibility complex (MHC) protein which is encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides can lead to the design of more potent, peptide-based vaccines and immunotherapies for infectious autoimmune diseases. Results: We apply machine learning techniques from the natural language processing (NLP) domain to address the task of MHC-peptide binding prediction. More specifically, we introduce a new distributed representation of amino acids, named HLA-Vec, that can be used for a variety of downstream proteomic machine learning tasks. We then propose a deep convolutional neural network architecture, named HLA-CNN, for the task of HLA class I-peptide binding prediction. Experimental results show that combining the new distributed representation with our HLA-CNN architecture achieves state-of-the-art results in the majority of the latest two Immune Epitope Database (IEDB) weekly automated benchmark datasets. We further apply our model to predict binding on the human genome and identify 15 genes with potential for self binding. Availability and implementation: Codes to generate the HLA-Vec and HLA-CNN are publicly available at: Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-21
  • EigenTHREADER: analogous protein fold recognition by efficient contact map
    • Authors: Buchan DA; Jones DT.
      Abstract: Motivation: Protein fold recognition when appropriate, evolutionarily-related, structural templates can be identified is often trivial and may even be viewed as a solved problem. However, in cases where no homologous structural templates can be detected, fold recognition is a notoriously difficult problem (Moult et al., 2014). Here we present EigenTHREADER, a novel fold recognition method capable of identifying folds where no homologous structures can be identified. EigenTHREADER takes a query amino acid sequence, generates a map of intra-residue contacts, and then searches a library of contact maps of known structures. To allow the contact maps to be compared, we use eigenvector decomposition to resolve the principal eigenvectors; these can then be aligned using standard dynamic programming algorithms. The approach is similar to the Al-Eigen approach of Di Lena et al. (2010), but with improvements made both to speed and accuracy. With this search strategy, EigenTHREADER does not depend directly on sequence homology between the target protein and entries in the fold library to generate models. This in turn enables EigenTHREADER to correctly identify analogous folds where little or no sequence homology information is available. Results: EigenTHREADER outperforms well-established fold recognition methods such as pGenTHREADER and HHSearch in terms of True Positive Rate in the difficult task of analogous fold recognition. This should allow template-based modelling to be extended to many new protein families that were previously intractable to homology based fold recognition methods. Availability and implementation: All code used to generate these results and the computational protocol can be downloaded from EigenTHREADER, the benchmark code and the data this paper is based on can be downloaded from:
      PubDate: 2017-04-13
  • Network module-based model in the differential expression analysis for
    • Authors: Lei M; Xu J, Huang L, et al.
      Abstract: Motivation: RNA-seq has emerged as a powerful technology for the detection of differential gene expression in the transcriptome. The commonly used statistical methods for RNA-seq differential expression analysis were designed for individual genes, which may detect too many irrelevant significant genes or too few genes to interpret the phenotypic changes. Recently, network module-based methods have been proposed as a powerful approach to analyze and interpret expression data in microarray and shotgun proteomics. But the module-based statistical model has not been adequately addressed for RNA-seq data. Results: We proposed a network module-based generalized linear model for differential expression analysis of the count-based sequencing data from RNA-seq. The simulation studies demonstrated the effectiveness of the proposed model and the improvement of the statistical power for identifying the differentially expressed modules in comparison to the existing methods. We also applied our method to tissue datasets and identified 207 significantly differentially expressed kidney-active or liver-active modules. For liver cancer datasets, significantly differentially expressed modules, including Wnt signaling pathway and VEGF pathway, were found to be tightly associated with liver cancer. Besides, in comparison with the single gene-level analysis, our method could identify more biologically significant modules that are related to liver cancer. Availability and implementation: The R package SeqMADE is available at Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-12
  • Pattern fusion analysis by adaptive alignment of multiple heterogeneous
           omics data
    • Authors: Shi Q; Zhang C, Peng M, et al.
      Abstract: Motivation: Integrating different omics profiles is a challenging task, which provides a comprehensive way to understand complex diseases in a multi-view manner. One key for such an integration is to extract intrinsic patterns in concordance with data structures, so as to discover consistent information across various data types even with noise pollution. Thus, we proposed a novel framework called ‘pattern fusion analysis’ (PFA), which performs automated information alignment and bias correction, to fuse local sample-patterns (e.g. from each data type) into a global sample-pattern corresponding to phenotypes (e.g. across most data types). In particular, PFA can identify significant sample-patterns from different omics profiles by optimally adjusting the effects of each data type to the patterns, thereby alleviating the problems of processing different platforms and different reliability levels of heterogeneous data. Results: To validate the effectiveness of our method, we first tested PFA on various synthetic datasets, and found that PFA can not only capture the intrinsic sample clustering structures from the multi-omics data in contrast to the state-of-the-art methods, such as iClusterPlus, SNF and moCluster, but also provide an automatic weight-scheme to measure the corresponding contributions by data types or even samples. In addition, the computational results show that PFA can reveal shared and complementary sample-patterns across data types with distinct signal-to-noise ratios in Cancer Cell Line Encyclopedia (CCLE) datasets, and outperforms other works at identifying clinically distinct cancer subtypes in The Cancer Genome Atlas (TCGA) datasets. Availability and implementation: PFA has been implemented as a Matlab package, which is available at, or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-12
  • emMAW: computing minimal absent words in external memory
    • Authors: Héliou A; Pissis SP, Puglisi SJ.
      Abstract: Motivation: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words of a sequence of length n on a fixed-sized alphabet based on suffix arrays. A standard implementation of this algorithm, when applied to a large sequence of length n, requires more than 20n bytes of RAM. Such memory requirements are a significant hurdle to the computation of minimal absent words in large datasets. Results: We present emMAW, the first external-memory algorithm for computing minimal absent words. A free open-source implementation of our algorithm is made available. This allows for computation of minimal absent words on far bigger data sets than was previously possible. Our implementation requires less than 3 h on a standard workstation to process the full human genome when as little as 1 GB of RAM is made available. We stress that our implementation, despite making use of external memory, is fast; indeed, even on relatively smaller datasets when enough RAM is available to hold all necessary data structures, it is less than two times slower than state-of-the-art internal-memory implementations. Availability and implementation: (free software under the terms of the GNU GPL) or Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-04-12
  • Entropy-based consensus clustering for patient stratification
    • Authors: Liu H; Zhao R, Fang H, et al.
      Abstract: Motivation: Patient stratification or disease subtyping is crucial for precision medicine and personalized treatment of complex diseases. The increasing availability of high-throughput molecular data provides a great opportunity for patient stratification. Many clustering methods have been employed to tackle this problem in a purely data-driven manner. Yet, existing methods leveraging high-throughput molecular data often suffer from various limitations, e.g. noise, data heterogeneity, high dimensionality or poor interpretability. Results: Here we introduced an Entropy-based Consensus Clustering (ECC) method that overcomes those limitations altogether. Our ECC method employs an entropy-based utility function to fuse many basic partitions to a consensus one that agrees with the basic ones as much as possible. Maximizing the utility function in ECC has a much more meaningful interpretation than any other consensus clustering method. Moreover, we exactly map the complex utility maximization problem to the classic K-means clustering problem, which can then be efficiently solved with linear time and space complexity. Our ECC method can also naturally integrate multiple molecular data types measured from the same set of subjects, and easily handle missing values without any imputation. We applied ECC to 110 synthetic and 48 real datasets, including 35 cancer gene expression benchmark datasets and 13 cancer types with four molecular data types from The Cancer Genome Atlas. We found that ECC shows superior performance against existing clustering methods. Our results clearly demonstrate the power of ECC in clinically relevant patient stratification. Availability and implementation: The Matlab package is available at or yyl@channing.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: 2017-03-24