Bioinformatics
  [SJR: 4.643]   [H-I: 271]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
   Published by Oxford University Press  [393 journals]
  • A benchmark study of scoring methods for non-coding mutations
    • Authors: Drubay D; Gautheret D, Michiels S.
      Pages: 1635 - 1641
      Abstract: Motivation: Detailed knowledge of coding sequences has led to different candidate models for pathogenic variant prioritization. Several deleteriousness scores have been proposed for the non-coding part of the genome, but no large-scale comparison has been realized to date to assess their performance. Results: We compared the leading scoring tools (CADD, FATHMM-MKL, Funseq2 and GWAVA) and some recent competitors (DANN, SNP and SOM scores) for their ability to discriminate assumed pathogenic variants from assumed benign variants (using the ClinVar, COSMIC and 1000 genomes project databases). Using the ClinVar benchmark, CADD was the best tool for detecting the pathogenic variants that are mainly located in protein coding gene regions. Using the COSMIC benchmark, FATHMM-MKL, GWAVA and SOMliver outperformed the other tools for pathogenic variants that are typically located in lincRNAs, pseudogenes and other parts of the non-coding genome. However, all tools had low precision, which could potentially be improved by future non-coding genome feature discoveries. These results may have been influenced by the presence of potential benign variants in the COSMIC database. The development of a gold standard as consistent as ClinVar for these regions will be necessary to confirm our tool ranking. Availability and implementation: The Snakemake, C++ and R codes are freely available from […] and supported on […]. Contact: […] or stefan.michiels@gustaveroussy.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 11 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty008
      Issue No: Vol. 34, No. 10 (2018)
  • Tumor purity quantification by clonal DNA methylation signatures
    • Authors: Benelli M; Romagnoli D, Demichelis F.
      Pages: 1642 - 1649
      Abstract: Motivation: Controlling for tumor purity in molecular analyses is essential to allow for reliable genomic aberration calls, for inter-sample comparison and to monitor heterogeneity of cancer cell populations. In genome wide screening studies, the assessment of tumor purity is typically performed by means of computational methods that exploit somatic copy number aberrations. Results: We present a strategy, called Purity Assessment from clonal MEthylation Sites (PAMES), which uses the methylation level of a few dozen, highly clonal, tumor type specific CpG sites to estimate the purity of tumor samples, without the need of a matched benign control. We trained and validated our method in more than 6000 samples from different datasets. Purity estimates by PAMES were highly concordant with other state-of-the-art tools and its evaluation in a cancer cell line dataset highlights its reliability to accurately estimate tumor admixtures. We extended the capability of PAMES to the analysis of CpG islands instead of the more platform-specific CpG sites and demonstrated its accuracy in a set of advanced tumors profiled by high throughput DNA methylation sequencing. These analyses show that PAMES is a valuable tool to assess the purity of tumor samples in the settings of clinical research and diagnostics. Availability and implementation: […]. Contact: […] or f.demichelis@unitn.it. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 08 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty011
      Issue No: Vol. 34, No. 10 (2018)
  • LeNup: learning nucleosome positioning from DNA sequences with improved
           convolutional neural networks
    • Authors: Zhang J; Peng W, Wang L.
      Pages: 1705 - 1712
      Abstract: Motivation: Nucleosome positioning plays significant roles in proper genome packing and its accessibility to execute transcription regulation. Despite a multitude of nucleosome positioning resources available online, including experimental datasets of genome-wide nucleosome occupancy profiles and computational tools for the analysis of these data, the complex language of eukaryotic nucleosome positioning remains incompletely understood. Results: Here, we address this challenge using an approach based on a state-of-the-art machine learning method. We present a novel convolutional neural network (CNN) to understand nucleosome positioning. We combined Inception-like networks with a gating mechanism for the response of multiple patterns and long term association in DNA sequences. We developed the open-source package LeNup based on the CNN to predict nucleosome positioning in Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster as well as Saccharomyces cerevisiae genomes. We trained LeNup on four benchmark datasets. LeNup achieved greater predictive accuracy than previously published methods. Availability and implementation: LeNup is freely available as Python and Lua script source code under a BSD style license from […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 10 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty003
      Issue No: Vol. 34, No. 10 (2018)
  • Sensitive and specific post-call filtering of genetic variants in
           xenograft and primary tumors
    • Authors: Mannakee B; Balaji U, Witkiewicz A, et al.
      Pages: 1713 - 1718
      Abstract: Motivation: Tumor genome sequencing offers great promise for guiding research and therapy, but spurious variant calls can arise from multiple sources. Mouse contamination can generate many spurious calls when sequencing patient-derived xenografts. Paralogous genome sequences can also generate spurious calls when sequencing any tumor. We developed a BLAST-based algorithm, Mouse And Paralog EXterminator (MAPEX), to identify and filter out spurious calls from both these sources. Results: When calling variants from xenografts, MAPEX has similar sensitivity and specificity to more complex algorithms. When applied to any tumor, MAPEX also automatically flags calls that potentially arise from paralogous sequences. Our implementation, mapexr, runs quickly and easily on a desktop computer. MAPEX is thus a useful addition to almost any pipeline for calling genetic variants in tumors. Availability and implementation: The mapexr package for R is available at […] under the MIT license. Contact: […] or […] or eknudsen@email.arizona.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 08 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty010
      Issue No: Vol. 34, No. 10 (2018)
  • A network approach to exploring the functional basis of gene–gene
           epistatic interactions in disease susceptibility
    • Authors: Yip D; Chan L, Pang I, et al.
      Pages: 1741 - 1749
      Abstract: Motivation: Individual genetic variants explain only a small fraction of heritability in some diseases. Some variants have weak marginal effects on disease risk, but their joint effects are significantly stronger when occurring together. Most studies on such epistatic interactions have focused on methods for identifying the interactions and interpreting individual cases, but few have explored their general functional basis. This was due to the lack of a comprehensive list of epistatic interactions and uncertainties in associating variants to genes. Results: We conducted a large-scale survey of published research articles to compile the first comprehensive list of epistatic interactions in human diseases with detailed annotations. We used various methods to associate these variants to genes to ensure robustness. We found that these genes are significantly more connected in protein interaction networks, are more co-expressed and participate more often in the same pathways. We demonstrate using the list to discover novel disease […] Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 10 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty005
      Issue No: Vol. 34, No. 10 (2018)
  • Enrichment analysis with EpiAnnotator
    • Authors: Pageaud Y; Plass C, Assenov Y.
      Pages: 1781 - 1783
      Abstract: Motivation: Deciphering relevant biological insights from epigenomic data can be a challenging task. One commonly used approach is to perform enrichment analysis. However, finding, downloading and using the publicly available functional annotations require time, programming skills and IT infrastructure. Here we describe the online tool EpiAnnotator for performing enrichment analyses on epigenomic data in a fast and user-friendly way. Results: EpiAnnotator is an R Package accompanied by a web interface. It contains regularly updated annotations from 4 public databases: Blueprint, RoadMap, GENCODE and the UCSC Genome Browser. Annotations are hosted locally or in a server environment and automatically updated by scripts of our own design. Thousands of tracks are available, reflecting data on a variety of tissues, cell types and cell lines from the human and mouse genomes. Users need to upload sets of selected and background regions. Results are displayed in customizable and easily interpretable figures. Availability and implementation: The R package and Shiny app are open source and available under the GPL v3 license. EpiAnnotator’s web interface is accessible at […]
      PubDate: Wed, 10 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty007
      Issue No: Vol. 34, No. 10 (2018)
  • PyChimera: use UCSF Chimera modules in any Python 2.7 project
    • Authors: Rodríguez-Guerra Pedregal J; Maréchal J.
      Pages: 1784 - 1785
      Abstract: Motivation: UCSF Chimera is a powerful visualization tool widely used in the computational chemistry and structural biology communities. Built on a C++ core wrapped under a Python 2.7 environment, one could expect to easily import UCSF Chimera’s arsenal of resources in custom scripts or software projects. Nonetheless, this is not readily possible if the script is not executed within UCSF Chimera due to the isolation of the platform. UCSF ChimeraX, successor to the original Chimera, partially solves the problem, but it still needs major upgrades before it can offer all UCSF Chimera features. Results: PyChimera has been developed to overcome these limitations and provide access to the UCSF Chimera codebase from any Python 2.7 interpreter, including interactive programming with tools like IPython and Jupyter Notebooks, making it easier to use with additional third-party software. Availability and implementation: PyChimera is LGPL-licensed and available at […]. Contact: […] or jeandidier.marechal@uab.cat. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 11 Jan 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty021
      Issue No: Vol. 34, No. 10 (2018)
  • A high-resolution map of the human small non-coding transcriptome
    • Authors: Fehlmann T; Backes C, Alles J, et al.
      Pages: 1621 - 1628
      Abstract: Motivation: Although the amount of small non-coding RNA-sequencing data is continuously increasing, it is still unclear to what extent small RNAs are represented in the human genome. Results: In this study we analyzed 303 billion sequencing reads from nearly 25 000 datasets to answer this question. We determined that 0.8% of the human genome is reliably covered by 874 123 regions with an average length of 31 nt. On the basis of these regions, we found that among the known small non-coding RNA classes, microRNAs were the most prevalent. In subsequent steps, we characterized variations of miRNAs and performed a staged validation of 11 877 candidate miRNAs. Of these, many were actually expressed and significantly dysregulated in lung cancer. Selected candidates were finally validated by northern blots. Although isolated miRNAs could still be present in the human genome, our presented set likely contains the largest fraction of human […] Contact: […] or andreas.keller@ccb.uni-saarland.de. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx814
      Issue No: Vol. 34, No. 10 (2018)
  • Genome U-Plot: a whole genome visualization
    • Authors: Gaitatzes A; Johnson S, Smadbeck J, et al.
      Pages: 1629 - 1634
      Abstract: Motivation: The ability to produce and analyze whole genome sequencing (WGS) data from samples with structural variations (SV) generated the need to visualize such abnormalities in simplified plots. Conventional two-dimensional representations of WGS data frequently use either circular or linear layouts. Both representations have several advantages, but their major disadvantage is that they do not use the two-dimensional space very efficiently. We propose a layout, termed the Genome U-Plot, which spreads the chromosomes on a two-dimensional surface and essentially quadruples the spatial resolution. We present the Genome U-Plot for producing clear and intuitive graphs that allow researchers to generate novel insights and hypotheses by visualizing SVs such as deletions, amplifications, and chromoanagenesis events. The main features of the Genome U-Plot are its layered layout, its high spatial resolution and its improved aesthetic qualities. We compare conventional visualization schemas with the Genome U-Plot using visualization metrics such as number of line crossings and crossing angle resolution measures. Based on our metrics, we improve the readability of the resulting graph by at least 2-fold, making important features apparent and important genomic changes easy to identify. Results: A whole genome visualization tool with high spatial resolution and improved aesthetic qualities. Availability and implementation: An implementation and documentation of the Genome U-Plot is publicly available at […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx829
      Issue No: Vol. 34, No. 10 (2018)
  • CALQ: compression of quality values of aligned sequencing data
    • Authors: Voges J; Ostermann J, Hernaez M.
      Pages: 1650 - 1658
      Abstract: Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from […]. Contact: […] or mhernaez@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 23 Nov 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx737
      Issue No: Vol. 34, No. 10 (2018)
  • Mapping-free variant calling using haplotype reconstruction from k-mer
    • Authors: Audano P; Ravishankar S, Vannberg F.
      Pages: 1659 - 1665
      Abstract: Motivation: The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known references. De novo assembly or hybrid methods can recover genomic variation, but the cost of computation is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely packed SNPs and large indels without mapping, assembly or de Bruijn graphs. Results: When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumoniae, we found near perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when applied to the whole genome, and while Kestrel identified many variants missed by other methods, limitations of a purely k-mer based approach affect overall sensitivity. Availability and implementation: Source code and documentation for a Java implementation of Kestrel can be found at […]. All test code for this publication is located at […]. Contact: […] or fredrik.vannberg@biology.gatech.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 24 Nov 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx753
      Issue No: Vol. 34, No. 10 (2018)
  • Machine learning for classifying tuberculosis drug-resistance from DNA
           sequencing data
    • Authors: Yang Y; Niehaus K, Walker T, et al.
      Pages: 1666 - 1671
      Abstract: Motivation: Correct and rapid determination of Mycobacterium tuberculosis (MTB) resistance against available tuberculosis (TB) drugs is essential for the control and management of TB. Conventional molecular diagnostic tests assume that the presence of any well-studied single nucleotide polymorphism is sufficient to cause resistance, which yields low sensitivity for resistance classification. Summary: Given the availability of DNA sequencing data from MTB, we developed machine learning models for a cohort of 1839 UK bacterial isolates to classify MTB resistance against eight anti-TB drugs (isoniazid, rifampicin, ethambutol, pyrazinamide, ciprofloxacin, moxifloxacin, ofloxacin, streptomycin) and to classify multi-drug resistance. Results: Compared to the previous rules-based approach, the sensitivities from the best-performing models increased by 2–4% for isoniazid, rifampicin and ethambutol to 97% (P < 0.01), respectively; for ciprofloxacin and multi-drug resistant TB, they increased to 96%. For moxifloxacin and ofloxacin, sensitivities increased by 12 and 15% from 83 and 81% based on existing known resistance alleles to 95% and 96% (P < 0.01), respectively. In particular, our models improved sensitivities compared to the previous rules-based approach by 15 and 24% to 84 and 87% for pyrazinamide and streptomycin (P < 0.01), respectively. The best-performing models increase the area-under-the-ROC curve by 10% for pyrazinamide and streptomycin (P < 0.01), and 4–8% for other drugs (P < 0.01). Availability and implementation: The details of source code are provided at […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 12 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx801
      Issue No: Vol. 34, No. 10 (2018)
  • Computational identification of micro-structural variations and their
           proteogenomic consequences in cancer
    • Authors: Lin Y; Gawronski A, Hach F, et al.
      Pages: 1672 - 1681
      Abstract: Motivation: Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results: We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. Availability and implementation: MiStrVar is available for download at […], and ProTIE is available at […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 18 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx807
      Issue No: Vol. 34, No. 10 (2018)
  • K2 and K2*: efficient alignment-free sequence similarity measurement based
           on Kendall statistics
    • Authors: Lin J; Adjeroh D, Jiang B, et al.
      Pages: 1682 - 1689
      Abstract: Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared to the other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating the phylogenetic tree, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which is able to determine the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. Availability and implementation: The K2 and K2* approaches are implemented in the R language as a package and are freely available for open access ([…]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 15 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx809
      Issue No: Vol. 34, No. 10 (2018)
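The general idea behind the K2 entry above, comparing sequences through a rank statistic instead of an alignment, can be sketched in a few lines. This is only an illustration under stated assumptions: it counts k-mers in two DNA sequences and computes Kendall's tau-b between the count vectors. It is not the authors' K2 or K2* implementation, and the function names `kmer_counts` and `kendall_tau_b` are hypothetical.

```python
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count every DNA k-mer (in lexicographic order) occurring in seq."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words with ambiguous bases such as N
            counts[word] += 1
    return [counts[w] for w in words]


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length vectors (simple O(n^2) version)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both vectors: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    # tau-b denominator: sqrt(pairs untied in x * pairs untied in y)
    denom = sqrt((untied + only_y_tied) * (untied + only_x_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Assumed toy input: two short sequences compared at k = 2.
a = kmer_counts("ACGTACGTAC", 2)
b = kmer_counts("ACGTTCGTAC", 2)
similarity = kendall_tau_b(a, b)
```

A value near 1 means the two sequences rank their k-mers similarly. The quadratic pair loop is chosen for readability, not speed.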
  • DeepSig: deep learning improves signal peptide detection in proteins
    • Authors: Savojardo C; Martelli P, Fariselli P, et al.
      Pages: 1690 - 1696
      Abstract: Motivation: The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Results: Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. Availability and implementation: DeepSig is available as both standalone program and web server at […]. All datasets used in this study can be obtained from the same website. Contact: pierluigi.martelli@unibo.it. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx818
      Issue No: Vol. 34, No. 10 (2018)
  • ChopStitch: exon annotation and splice graph construction using
           transcriptome assembly and whole genome sequencing data
    • Authors: Khan H; Mohamadi H, Vandervalk B, et al.
      Pages: 1697 - 1704
      Abstract: Motivation: Sequencing studies on non-model organisms often interrogate both genomes and transcriptomes with massive amounts of short sequences. Such studies require de novo analysis tools and techniques, when the species and closely related species lack high quality reference resources. For certain applications such as de novo annotation, information on putative exons and alternative splicing may be desirable. Results: Here we present ChopStitch, a new method for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are represented as splice graphs in DOT output format. Availability and implementation: ChopStitch is written in Python and C++ and is released under the GPL license. It is freely available at […]. Contact: […] or ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 29 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx839
      Issue No: Vol. 34, No. 10 (2018)
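The ChopStitch entry above relies on a Bloom filter that represents the k-mer spectrum of genomic reads. As a rough illustration of that data structure only, the sketch below stores read k-mers in a small Bloom filter and reports which transcript k-mers are absent from it. It rests on an assumption made here for illustration: that k-mers spanning an exon-exon junction tend to be missing from genomic reads. The class and function names, the filter size and the hash scheme are all hypothetical and are not taken from ChopStitch.

```python
import hashlib


class KmerBloomFilter:
    """Tiny Bloom filter for k-mers; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(seq, k):
    """Yield all overlapping k-mers of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_filter(reads, k):
    """Insert every k-mer of every read into a fresh filter."""
    bloom = KmerBloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def unsupported_positions(transcript, bloom, k):
    """Start positions of transcript k-mers that the read filter does not contain."""
    return [i for i, kmer in enumerate(kmers(transcript, k)) if kmer not in bloom]


# Assumed toy data: one genomic read and k = 4.
reads = ["ACGTACGTAA"]
bloom = build_filter(reads, 4)
```

Under that assumption, a run of consecutive unsupported positions in a transcript would be a candidate boundary. A real tool would also have to tolerate base substitutions, as the abstract notes.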
  • mTM-align: an algorithm for fast and accurate multiple protein structure
    • Authors: Dong R; Peng Z, Zhang Y, et al.
      Pages: 1719 - 1725
      Abstract: Motivation: As protein structure is more conserved than sequence during evolution, multiple structure alignment can be more informative than multiple sequence alignment, especially for distantly related proteins. With the rapid increase of the number of protein structures in the Protein Data Bank, it becomes urgent to develop efficient algorithms for multiple structure alignment. Results: A new multiple structure alignment algorithm (mTM-align) was proposed, which is an extension of the highly efficient pairwise structure alignment program TM-align. The algorithm was benchmarked on four widely used datasets, HOMSTRAD, SABmark_sup, SABmark_twi and SISY-multiple, showing that mTM-align consistently outperforms other algorithms. In addition, the comparison with the manually curated alignments in the HOMSTRAD database shows that the automated alignments built by mTM-align are in general more accurate. Therefore, mTM-align may be used as a reliable complement to construct multiple structure alignments for real-world applications. Availability and implementation: […] or […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx828
      Issue No: Vol. 34, No. 10 (2018)
  • Multiple hot-deck imputation for network inference from RNA sequencing
    • Authors: Imbert A; Valsesia A, Le Gall C, et al.
      Pages: 1726 - 1732
      Abstract: Motivation: Network inference provides a global view of the relations existing between gene expression in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it is still a challenging problem: even if the cost of sequencing techniques has decreased over the last years, the number of samples in a given experiment is still (very) small compared to the number of genes. Results: We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data that are considered as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. From a biological point of view, hd-MI was also found relevant to infer networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Availability and implementation: Software and sample data are available as an R package, RNAseqNet, that can be downloaded from the Comprehensive R Archive Network (CRAN). Contact: […] or nathalie.villa-vialaneix@inra.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx819
      Issue No: Vol. 34, No. 10 (2018)
  • CRNET: an efficient sampling approach to infer functional regulatory
    • Authors: Chen X; Gu J, Wang X, et al.
      Pages: 1733 - 1740
      Abstract: Motivation: NGS techniques have been widely applied in genetic and epigenetic studies. Multiple ChIP-seq and RNA-seq profiles can now be jointly used to infer functional regulatory networks (FRNs). However, existing methods suffer from either oversimplified assumptions on transcription factor (TF) regulation or slow convergence of sampling for FRN inference from large-scale ChIP-seq and time-course RNA-seq data. Results: We developed an efficient Bayesian integration method (CRNET) for FRN inference using a two-stage Gibbs sampler to iteratively estimate hidden TF activities and the posterior probabilities of binding events. A novel statistical measure that jointly considers regulation strength and regression error enables the sampling process of CRNET to converge quickly, thus making CRNET very efficient for large-scale FRN inference. Experiments on synthetic and benchmark data showed a significantly improved performance of CRNET when compared with existing methods. CRNET was applied to breast cancer data to identify FRNs functional at promoter or enhancer regions in breast cancer MCF-7 cells. Transcription factor MYC is predicted as a key functional factor in both promoter and enhancer FRNs. We experimentally validated the regulation effects of MYC on CRNET-predicted target genes using appropriate RNAi approaches in MCF-7 cells. Availability and implementation: R scripts of CRNET are available at […]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx827
      Issue No: Vol. 34, No. 10 (2017)
  • Ontological function annotation of long non-coding RNAs through
           hierarchical multi-label classification
    • Authors: Zhang J; Zhang Z, Wang Z, et al.
      Pages: 1750 - 1757
      Abstract: Motivation: Long non-coding RNAs (lncRNAs) are an enormous collection of functional non-coding RNAs. Over the past decades, a large number of novel lncRNA genes have been identified. However, most of the lncRNAs remain functionally uncharacterized at present. Computational approaches provide new insight into the potential functional implications of lncRNAs. Results: Considering that each lncRNA may have multiple functions and a function may be further specialized into sub-functions, here we describe NeuraNetL2GO, a computational ontological function prediction approach for lncRNAs using a hierarchical multi-label classification strategy based on multiple neural networks. The neural networks are incrementally trained level by level, each performing the prediction of gene ontology (GO) terms belonging to a given level. In NeuraNetL2GO, we use topological features of the lncRNA similarity network as the input of the neural networks and employ the output results to annotate the lncRNAs. We show that NeuraNetL2GO achieves the best performance and the overall advantage in maximum F-measure and coverage on the manually annotated lncRNA2GO-55 dataset compared to other state-of-the-art methods. Availability and implementation: The source code and data are available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx833
      Issue No: Vol. 34, No. 10 (2017)
  • Algorithmic identification of discrepancies between published ratios and
           their reported confidence intervals and P-values
    • Authors: Georgescu C; Wren J.
      Pages: 1758 - 1766
      Abstract: Motivation: Studies, mostly from the operations/management literature, have shown that the rate of human error increases with task complexity. What is not known is how many errors make it into the published literature, given that they must slip by peer-review. By identifying paired, dependent values within text for reported calculations of varying complexity, we can identify discrepancies, quantify error rates and identify mitigating factors. Results: We extracted statistical ratios from MEDLINE abstracts (hazard ratio, odds ratio, relative risk), their 95% CIs, and their P-values. We re-calculated the ratios and P-values using the reported CIs. For comparison, we also extracted percent–ratio pairs, one of the simplest calculation tasks. Over 486 000 published values were found and analyzed for discrepancies, allowing for rounding and significant figures. Per reported item, discrepancies were less frequent in percent–ratio calculations (2.7%) than in ratio–CI and P-value calculations (5.6–7.5%), and smaller discrepancies were more frequent than large ones. Systematic discrepancies (multiple incorrect calculations of the same type) were higher for more complex tasks (14.3%) than simple ones (6.7%). Discrepancy rates decreased with increasing journal impact factor (JIF) and increasing number of authors, but with diminishing returns and JIF accounting for most of the effect. Approximately 87% of the 81 937 extracted P-values were ≤ 0.05. Conclusion: Using a simple, yet accurate, approach to identifying paired values within text, we offer the first quantitative evaluation of published error frequencies within these types of [text missing] or jdwren@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 22 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx811
      Issue No: Vol. 34, No. 10 (2017)
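The re-calculation described in the Georgescu and Wren abstract above (recovering a ratio and its P-value from a reported 95% CI) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the CI is symmetric on the log scale and uses a normal (Wald) approximation, and the function names `ratio_from_ci` and `p_value_from_ci` are hypothetical.

```python
import math

# Assumption: two-sided 95% CI, so the critical value is the 97.5th
# percentile of the standard normal distribution.
Z_95 = 1.959964


def ratio_from_ci(lower, upper):
    """Ratio implied by a CI that is symmetric on the log scale.

    The midpoint of ln(lower) and ln(upper) is the log of the point
    estimate, which is the geometric mean of the two bounds.
    """
    return math.sqrt(lower * upper)


def p_value_from_ci(ratio, lower, upper, z_crit=Z_95):
    """Two-sided P-value implied by a ratio and its CI (Wald approximation)."""
    # Standard error of ln(ratio), recovered from the CI width on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(ratio) / se
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    lo, hi = 1.0, 4.0
    r = ratio_from_ci(lo, hi)
    print(r, p_value_from_ci(r, lo, hi))
```

A reported ratio or P-value could then be compared against these implied values, with a tolerance for rounding and significant figures as the abstract describes; the tolerance rule itself is not specified in the abstract and is left out here.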
  • A benchmark for comparing precision medicine methods in thyroid cancer
           diagnosis using tissue microarrays
    • Authors: Wang C; Lee Y, Calista E, et al.
      Pages: 1767 - 1773
      Abstract: Motivation: The aim of precision medicine is to harness new knowledge and technology to optimize the timing and targeting of interventions for maximal therapeutic benefit. This study explores the possibility of building AI models without precise pixel-level annotation in prediction of the tumor size, extrathyroidal extension, lymph node metastasis, cancer stage and BRAF mutation in thyroid cancer diagnosis, providing the patients’ background information, histopathological and immunohistochemical tissue images. Results: A novel framework for objective evaluation of automatic patient diagnosis algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2017: A Grand Challenge for Tissue Microarray Analysis in Thyroid Cancer Diagnosis. Here, we present the datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. The main contributions of the challenge include the creation of the data repository of tissue microarrays; the creation of the clinical diagnosis classification data repository of thyroid cancer; and the definition of objective quantitative evaluation for comparison and ranking of the algorithms. With this benchmark, three automatic methods for predictions of the five clinical outcomes have been compared, and detailed quantitative evaluation results are presented in this paper. Based on the quantitative evaluation results, we believe automatic patient diagnosis is still a challenging and unsolved problem. Availability and implementation: The datasets and the evaluation software will be made available to the research community, further encouraging future developments in this field ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx838
      Issue No: Vol. 34, No. 10 (2017)
  • SV2: accurate structural variation genotyping and de novo mutation
           detection from whole genomes
    • Authors: Antaki D; Brandler W, Sebat J.
      Pages: 1774 - 1777
      Abstract: Motivation: Structural variation (SV) detection from short-read whole genome sequencing is error prone, presenting significant challenges for population or family-based studies of disease. Results: Here, we describe SV2, a machine-learning algorithm for genotyping deletions and duplications from paired-end sequencing data. SV2 can rapidly integrate variant calls from multiple structural variant discovery algorithms into a unified call set with high genotyping accuracy and capability to detect de novo mutations. Availability and implementation: SV2 is freely available on GitHub ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 29 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx813
      Issue No: Vol. 34, No. 10 (2017)
  • Personal Cancer Genome Reporter: variant interpretation report for
           precision oncology
    • Authors: Nakken S; Fournous G, Vodák D, et al.
      Pages: 1778 - 1780
      Abstract: Summary: Individual tumor genomes pose a major challenge for clinical interpretation due to their unique sets of acquired mutations. There is a general scarcity of tools that can (i) systematically interrogate cancer genomes in the context of diagnostic, prognostic, and therapeutic biomarkers, (ii) prioritize and highlight the most important findings and (iii) present the results in a format accessible to clinical experts. We have developed a stand-alone, open-source software package for somatic variant annotation that integrates a comprehensive set of knowledge resources related to tumor biology and therapeutic biomarkers, both at the gene and variant level. Our application generates a tiered report that will aid the interpretation of individual cancer genomes in a clinical setting. Availability and implementation: The software is implemented in Python/R, and is freely available through Docker technology. Documentation, example reports, and installation instructions are accessible via the project GitHub page: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 20 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx817
      Issue No: Vol. 34, No. 10 (2017)
  • PGA: post-GWAS analysis for disease gene identification
    • Authors: Lin J; Jaroslawicz D, Cai Y, et al.
      Pages: 1786 - 1788
      Abstract: Summary: Although the genome-wide association study (GWAS) is a powerful method to identify disease-associated variants, it does not directly address the biological mechanisms underlying such genetic association signals. Here, we present PGA, a Perl- and Java-based program for post-GWAS analysis that predicts likely disease genes given a list of GWAS-reported variants. Designed with a command line interface, PGA incorporates genomic and eQTL data in identifying disease gene candidates and uses gene network and ontology data to score them based upon the strength of their relationship to the disease in question. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 29 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx845
      Issue No: Vol. 34, No. 10 (2017)
  • R2DGC: threshold-free peak alignment and identification for 2D gas
           chromatography-mass spectrometry in R
    • Authors: Ramaker R; Gordon E, Cooper S.
      Pages: 1789 - 1791
      Abstract: Summary: Comprehensive 2D gas chromatography-mass spectrometry is a powerful method for analyzing complex mixtures of volatile compounds, but produces a large amount of raw data that requires downstream processing to align signals of interest (peaks) across multiple samples and match peak characteristics to reference standard libraries prior to downstream statistical analysis. Very few existing tools address this aspect of analysis and those that do have shortfalls in usability or performance. We have developed an R package that implements retention time and mass spectra similarity threshold-free alignments, seamlessly integrates retention time standards for universally reproducible alignments, performs common ion filtering and provides compatibility with multiple peak quantification methods. We demonstrate that our package’s performance compares favorably to existing tools on a controlled mix of metabolite standards separated under variable chromatography conditions and data generated from cell lines. Availability and implementation: R2DGC can be downloaded at [URL missing] or installed via the Comprehensive R Archive Network (CRAN). Contact: sjcooper@hudsonalpha.org. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx825
      Issue No: Vol. 34, No. 10 (2017)
  • polyPK: an R package for pharmacokinetic analysis of multi-component drugs
           using a metabolomics approach
    • Authors: Li M; Wang S, Xie G, et al.
      Pages: 1792 - 1794
      Abstract: Summary: Pharmacokinetics (PK) is a long-standing bottleneck for botanical drug and traditional medicine research. By using an integrated phytochemical and metabolomics approach coupled with multivariate statistical analysis, we propose a new strategy, Poly-PK, to simultaneously monitor the performance of drug constituents and endogenous metabolites, taking into account both the diversity of the drug’s chemical composition and its complex effects on the mammalian metabolic pathways. Poly-PK is independent of specific measurement platforms and has been successfully applied in the PK studies of Puerh tea, a traditional Chinese medicine Huangqi decoction and many other multi-component drugs. Here, we introduce an R package, polyPK, the first and only automation of the data analysis pipeline of the Poly-PK strategy. polyPK provides 10 functions for data pre-processing, differential compound identification and grouping, traditional PK parameter calculation, multivariate statistical analysis, correlations, cluster analyses and visualization of results. It may serve a wide range of users, including pharmacologists, biologists and doctors, in understanding the metabolic fate of multi-component drugs. Availability and implementation: The polyPK package is freely available from the R archive CRAN ([URL missing]) [text missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx834
      Issue No: Vol. 34, No. 10 (2017)
  • Aligning dynamic networks with DynaWAVE
    • Authors: Vijayan V; Milenković T.
      Pages: 1795 - 1798
      Abstract: Motivation: Network alignment (NA) aims to find similar (conserved) regions between networks, such as cellular networks of different species. Until recently, existing methods were limited to aligning static networks. However, real-world systems, including cellular functioning, are dynamic. Hence, in our previous work, we introduced the first ever dynamic NA method, DynaMAGNA++, which improved upon the traditional static NA. However, DynaMAGNA++ does not necessarily scale well to larger networks in terms of alignment quality or runtime. Results: To address this, we introduce a new dynamic NA approach, DynaWAVE. We show that DynaWAVE complements DynaMAGNA++: while DynaMAGNA++ is more accurate yet slower than DynaWAVE for smaller networks, DynaWAVE is both more accurate and faster than DynaMAGNA++ for larger networks. We provide a friendly user interface and source code for DynaWAVE. Availability and implementation: [URL prefix missing]∼cone/DynaWAVE/. Contact: tmilenko@nd.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 28 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx841
      Issue No: Vol. 34, No. 10 (2017)
  • ChronQC: a quality control monitoring system for clinical next generation
    • Authors: Tawari N; Seow J, Perumal D, et al.
      Pages: 1799 - 1800
      Abstract: Summary: ChronQC is a quality control (QC) tracking system for clinical implementation of next-generation sequencing (NGS). ChronQC generates time series plots for various QC metrics to allow comparison of current runs to historical runs. ChronQC has multiple features for tracking QC data including Westgard rules for clinical validity, laboratory-defined thresholds and historical observations within a specified time period. Users can record their notes and corrective actions directly onto the plots for long-term recordkeeping. ChronQC facilitates regular monitoring of clinical NGS to enable adherence to high quality clinical standards. Availability and implementation: ChronQC is freely available on GitHub ([URL missing]), Docker ([URL missing]) and the Python Package Index. ChronQC is implemented in Python and runs on all common operating systems (Windows, Linux and Mac OS X) [text missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 28 Dec 2017 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btx843
      Issue No: Vol. 34, No. 10 (2017)