Journal Prestige (SJR): 6.14
Citation Impact (citeScore): 8
  Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press
  • EPIP: a novel approach for condition-specific enhancer–promoter
           interaction prediction
    • Authors: Talukder A; Saadat S, Li X, et al.
      Pages: 3877 - 3883
      Abstract: Motivation: The identification of enhancer–promoter interactions (EPIs), especially condition-specific ones, is important for the study of gene transcriptional regulation. Existing experimental approaches for EPI identification are still expensive, and available computational methods either do not consider condition-specific EPIs or predict them poorly. Results: We developed a novel computational method called EPIP to reliably predict EPIs, especially condition-specific ones. EPIP is capable of predicting interactions in samples with limited data as well as in samples with abundant data. Tested on more than eight cell lines, EPIP reliably identifies EPIs, with an average area under the receiver operating characteristic curve of 0.95 and an average area under the precision–recall curve of 0.73. Tested on condition-specific EPIs, EPIP correctly identified 99.26% of them. EPIP also achieves better accuracy than two recently developed methods. Availability and implementation: The EPIP tool is freely available at [URL truncated]~xiaoman/EPIP/. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 13 Aug 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz641
      Issue No: Vol. 35, No. 20 (2019)
  • Sub-dominant principal components inform new vaccine targets for HIV Gag
    • Authors: Ahmed S; Quadeer A, Morales-Jimenez D, et al.
      Pages: 3884 - 3889
      Abstract: Motivation: Patterns of mutational correlations, learnt from patient-derived sequences of human immunodeficiency virus (HIV) proteins, are informative of biochemically linked networks of interacting sites that may enable viral escape from the host immune system. Accurate identification of these networks is important for rationally designing vaccines which can effectively block immune escape pathways. Previous computational methods have partly identified such networks by examining the principal components (PCs) of the mutational correlation matrix of HIV Gag proteins. However, driven by a conservative approach, these methods analyze the few dominant (strongest) PCs, potentially missing information embedded within the sub-dominant (relatively weaker) ones that may be important for vaccine design. Results: By using sequence data for HIV Gag, complemented by model-based simulations, we revealed that certain networks of interacting sites that appear important for vaccine design purposes are not accurately reflected by the dominant PCs. Rather, these networks are encoded jointly by both dominant and sub-dominant PCs. By incorporating information from the sub-dominant PCs, we identified a network of interacting sites of HIV Gag that associated very strongly with viral control. Based on this network, we propose several new candidates for a potent T-cell-based HIV vaccine. Availability and implementation: Accession numbers of all sequences used and the source code scripts for all analysis and figures reported in this work are available online at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 28 Jun 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz524
      Issue No: Vol. 35, No. 20 (2019)
  • Multiresolution correction of GC bias and application to identification of
           copy number alterations
    • Authors: Jang H; Lee H, Birol I.
      Pages: 3890 - 3897
      Abstract: Motivation: Whole-genome sequencing (WGS) data are affected by various sequencing biases such as GC bias and mappability bias. These biases degrade the detection of genetic variations such as copy number alterations. Existing methods use regression models to relate the GC proportion of markers to their depth of coverage (DOC). Nonetheless, the severity of the GC bias varies from sample to sample. We developed a new method for correction of GC bias on the basis of multiresolution analysis. We used a translation-invariant wavelet transform to decompose biased raw signals into high- and low-frequency coefficients. Then, we modeled the relation between GC proportion and DOC of the genomic regions and constructed new control DOC signals that reflect the GC bias. The control DOC signals are used for normalizing genomic sequences by correcting the GC bias. Results: When we applied our method to simulated sequencing data with various degrees of GC bias, it corrected the GC bias more robustly than the other methods did. We also applied our method to real-world cancer sequencing datasets and successfully identified cancer-related focal alterations even when cancer genomes were not normalized to normal control samples. In conclusion, our method can be employed for WGS data with different degrees of GC bias. Availability and implementation: The code is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz174
      Issue No: Vol. 35, No. 20 (2019)
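The GC proportion that the abstract above relates to depth of coverage is simple to compute per genomic window. The following is a minimal sketch of that one quantity only, not the authors' wavelet-based correction; the function names `gc_fraction` and `windowed_gc` and the choice of non-overlapping windows are assumptions made for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def windowed_gc(seq: str, window: int) -> list:
    """GC fraction of consecutive non-overlapping windows (the last may be shorter)."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


if __name__ == "__main__":
    # Hypothetical toy sequence: a GC-rich window followed by an AT-rich one.
    print(windowed_gc("GGCCAATT", 4))
```

Pairing each window's GC fraction with its read depth would give the kind of GC-versus-DOC relation that regression-based corrections model.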
  • Dissecting differential signals in high-throughput data from complex
    • Authors: Li Z; Wu Z, Jin P, et al.
      Pages: 3898 - 3905
      Abstract: Motivation: Samples from clinical practices are often mixtures of different cell types. The high-throughput data obtained from these samples are thus mixed signals. The cell mixture brings complications to data analysis, and will lead to biased results if not properly accounted for. Results: We develop a method to model the high-throughput data from mixed, heterogeneous samples, and to detect differential signals. Our method allows flexible statistical inference for detecting a variety of cell-type specific changes. Extensive simulation studies and analyses of two real datasets demonstrate the favorable performance of our proposed method compared with existing ones serving similar purpose. Availability and implementation: The proposed method is implemented as an R package and is freely available on GitHub ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz196
      Issue No: Vol. 35, No. 20 (2019)
  • ORE identifies extreme expression effects enriched for rare variants
    • Authors: Richter F; Hoffman G, Manheimer K, et al.
      Pages: 3906 - 3912
      Abstract: Motivation: Non-coding rare variants (RVs) may contribute to Mendelian disorders but have been challenging to study due to small sample sizes, genetic heterogeneity and uncertainty about relevant non-coding features. Previous studies identified RVs associated with expression outliers, but varying outlier definitions were employed and no comprehensive open-source software was developed. Results: We developed Outlier-RV Enrichment (ORE) to identify biologically-meaningful non-coding RVs. We implemented ORE combining whole-genome sequencing and cardiac RNAseq from congenital heart defect patients from the Pediatric Cardiac Genomics Consortium and deceased adults from Genotype-Tissue Expression. Use of rank-based outliers maximized sensitivity while a most extreme outlier approach maximized specificity. Rarer variants had stronger associations, suggesting they are under negative selective pressure and providing a basis for investigating their contribution to Mendelian disorders. Availability and implementation: ORE, source code, and documentation are available at [URL missing] under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz202
      Issue No: Vol. 35, No. 20 (2019)
  • ERVcaller: identifying polymorphic endogenous retrovirus and other
           transposable element insertions using whole-genome sequencing data
    • Authors: Chen X; Li D, Birol I.
      Pages: 3913 - 3922
      Abstract: Motivation: Approximately 8% of the human genome is derived from endogenous retroviruses (ERVs). In recent years, an increasing number of human diseases have been found to be associated with ERVs. However, it remains challenging to accurately detect the full spectrum of polymorphic (unfixed) ERVs using whole-genome sequencing (WGS) data. Results: We designed a new tool, ERVcaller, to detect and genotype transposable element (TE) insertions, including ERVs, in the human genome. We evaluated ERVcaller using both simulated and real benchmark WGS datasets. Compared to existing tools, ERVcaller consistently obtained both the highest sensitivity and precision for detecting simulated ERV and other TE insertions derived from real polymorphic TE sequences. For the WGS data from the 1000 Genomes Project, ERVcaller detected the largest number of TE insertions per sample based on consensus TE loci. By analyzing the experimentally verified TE insertions, ERVcaller had 94.0% TE detection sensitivity and 96.6% genotyping accuracy. Polymerase chain reaction and Sanger sequencing in a small sample set verified 86.7% of examined insertion statuses and 100% of examined genotypes. In conclusion, ERVcaller is capable of detecting and genotyping TE insertions using WGS data with both high sensitivity and precision. This tool can be applied broadly to other species. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz205
      Issue No: Vol. 35, No. 20 (2019)
  • Discovery of tandem and interspersed segmental duplications using
           high-throughput sequencing
    • Authors: Soylev A; Le T, Amini H, et al.
      Pages: 3923 - 3930
      Abstract: Motivation: Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve the calling accuracy of simpler variants. Results: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, at 30× coverage, TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate of our approach for predicting tandem, direct and inverted interspersed segmental duplications on CHM1 (<5% for the top 50 predictions). Availability and implementation: TARDIS source code is available at [URL missing], and a corresponding Docker image is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 01 Apr 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz237
      Issue No: Vol. 35, No. 20 (2019)
  • Stable H3K4me3 is associated with transcription initiation during early
           embryo development
    • Authors: Huang X; Gao X, Li W, et al.
      Pages: 3931 - 3936
      Abstract: Motivation: During development of the mammalian embryo, histone modification H3K4me3 plays an important role in regulating gene expression and exhibits extensive reprogramming on the parental genomes. In addition to these dramatic epigenetic changes, certain unchanging regulatory elements are also essential for embryonic development. Results: Using large-scale H3K4me3 chromatin immunoprecipitation sequencing data, we identified a form of H3K4me3 that was present during all eight stages of the mouse embryo before implantation. This ‘stable H3K4me3’ was highly accessible and much longer than normal H3K4me3. Moreover, most of the stable H3K4me3 was in the promoter region and was enriched in higher chromatin architecture. Using in-depth analysis, we demonstrated that stable H3K4me3 was related to higher gene expression levels and transcriptional initiation during embryonic development. Furthermore, stable H3K4me3 was much more active in blood tumor cells than in normal blood cells, suggesting a potential mechanism of cancer progression. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 12 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz173
      Issue No: Vol. 35, No. 20 (2019)
  • Neural networks with circular filters enable data efficient inference of
           sequence motifs
    • Authors: Blum C; Kollmann M, Hancock J.
      Pages: 3937 - 3943
      Abstract: Motivation: Nucleic acids and proteins often have localized sequence motifs that enable highly specific interactions. Due to the biological relevance of sequence motifs, numerous inference methods have been developed. Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance. These methods were able to learn transcription factor binding sites from ChIP-seq data, resulting in accurate predictions on test data. However, CNNs typically distribute learned motifs across multiple filters, making them difficult to interpret. Furthermore, networks trained on small datasets often do not generalize well to new sequences. Results: Here we present circular filters, a novel convolutional architecture that convolves sequences with circularly permuted variants of the same filter. We motivate circular filters by the observation that CNNs frequently learn filters that correspond to shifted and truncated variants of the true motif. Circular filters enable learning of full-length motifs and allow easy interpretation of the learned filters. We show that circular filters improve motif inference performance over a wide range of hyperparameters as well as sequence length. Furthermore, we show that CNNs with circular filters in most cases outperform conventional CNNs at inferring DNA binding sites from ChIP-seq data. Availability and implementation: Code is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz194
      Issue No: Vol. 35, No. 20 (2019)
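The idea of scanning a sequence with circularly permuted variants of one filter can be illustrated without any neural-network library. This is a rough sketch under stated assumptions, not the authors' architecture: the helper names (`circular_variants`, `one_hot`, `best_score`), the A/C/G/T one-hot encoding and the max-over-variants scoring are all hypothetical choices, and no training is involved.

```python
def circular_variants(filt):
    """All circular shifts of a filter given as a list of per-position rows."""
    return [filt[k:] + filt[:k] for k in range(len(filt))]


def one_hot(seq):
    """Encode a DNA string as rows of 0/1 weights in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]


def best_score(seq, filt):
    """Highest match score over all sequence positions and all circular variants."""
    x = one_hot(seq)
    width = len(filt)
    best = float("-inf")
    for variant in circular_variants(filt):
        for start in range(len(x) - width + 1):
            window = x[start:start + width]
            score = sum(
                w * v
                for row_w, row_x in zip(variant, window)
                for w, v in zip(row_w, row_x)
            )
            best = max(best, score)
    return best


if __name__ == "__main__":
    # A filter tuned to "ACG" still scores a full match on the shifted site "CGA".
    print(best_score("TTCGATT", one_hot("ACG")))
```

In this toy setting the unshifted filter alone matches at most two of three positions in `TTCGATT`, while one of its circular shifts matches all three, which is the kind of shifted-motif situation the abstract describes.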
  • SArKS: de novo discovery of gene expression regulatory motif sites and
           domains by suffix array kernel smoothing
    • Authors: Wylie D; Hofmann H, Zemelman B, et al.
      Pages: 3944 - 3952
      Abstract: Motivation: We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test statistic or P-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. Results: We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz198
      Issue No: Vol. 35, No. 20 (2019)
  • FLAS: fast and high-throughput algorithm for PacBio long-read
    • Authors: Bao E; Xie F, Song C, et al.
      Pages: 3953 - 3960
      Abstract: Motivation: Third-generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For projects with long reads only, it is challenging to correct reads quickly, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017). Results: Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT’s fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0–50.6% larger throughput than MECAT. FLAS is 2–13× faster than the self-correction algorithms other than MECAT, and its throughput is also 9.8–281.8% larger. FLAS-corrected long reads can be assembled into contigs with 13.1–29.8% larger N50 sizes than MECAT-corrected reads. Availability and implementation: The FLAS software can be downloaded for free from this site: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz206
      Issue No: Vol. 35, No. 20 (2019)
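The N50 statistic used above to compare assemblies is the contig length at which the contigs of that length or longer cover at least half of the total assembled bases. A minimal sketch of the standard definition follows; the function name `n50` and the toy contig lengths are illustrative and unrelated to the FLAS code base.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical contig lengths: the two longest (6 + 5) cover 11 of 20 bases.
    print(n50([2, 3, 4, 5, 6]))
```

A larger N50 means that longer contigs account for half of the assembly, which is why it serves as a contiguity measure.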
  • ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and
    • Authors: Yin J; Zhang C, Mirarab S, et al.
      Pages: 3961 - 3969
      Abstract: Motivation: Evolutionary histories can change from one part of the genome to another. The potential for discordance between the gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is quickly growing. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind the data growth trends and is not able to analyze the largest available datasets in a reasonable time. Results: ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism and also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, can have up to 158× speedups compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10 000 species or datasets with more than 100 000 genes in <2 days. Availability and implementation: ASTRAL-MP is available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz211
      Issue No: Vol. 35, No. 20 (2019)
  • Protein multiple alignments: sequence-based versus structure-based
    • Authors: Carpentier M; Chomilier J, Valencia A.
      Pages: 3970 - 3980
      Abstract: Motivation: Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article we evaluate the added value of considering structures. Results: We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, whatever the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP perform better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, and for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs set fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. Availability and implementation: All data and results presented in this study are available at: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 03 Apr 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz236
      Issue No: Vol. 35, No. 20 (2019)
  • SCL: a lattice-based approach to infer 3D chromosome structures from
           single-cell Hi-C data
    • Authors: Zhu H; Wang Z, Valencia A.
      Pages: 3981 - 3988
      Abstract: Motivation: In contrast to population-based Hi-C data, single-cell Hi-C data are zero-inflated and do not indicate the frequency of proximate DNA segments. There are a limited number of computational tools that can model the 3D structures of chromosomes based on single-cell Hi-C data. Results: We developed single-cell lattice (SCL), a computational method to reconstruct 3D structures of chromosomes based on single-cell Hi-C data. We designed a loss function and a 2D Gaussian function specifically for the characteristics of single-cell Hi-C data. A chromosome is represented as beads-on-a-string and stored in a 3D cubic lattice. Metropolis–Hastings simulation and simulated annealing are used to simulate the structure and minimize the loss function. We evaluated the SCL-inferred 3D structures (at both 500 and 50 kb resolutions) using multiple criteria and compared them with the ones generated by another modeling software program. The results indicate that the 3D structures generated by SCL closely fit single-cell Hi-C data. We also found similar patterns of trans-chromosomal contact beads, Lamin-B1 enriched topologically associating domains (TADs), and H3K4me3 enriched TADs by mapping data from previous studies onto the SCL-inferred 3D structures. Availability and implementation: The C++ source code of SCL is freely available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz181
      Issue No: Vol. 35, No. 20 (2019)
  • Classical scoring functions for docking are unable to exploit large
           volumes of structural and interaction data
    • Authors: Li H; Peng J, Sidorov P, et al.
      Pages: 3989 - 3995
      Abstract: Motivation: Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results: We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes that are dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the other SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation: [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz183
      Issue No: Vol. 35, No. 20 (2019)
  • Significance tests for analyzing gene expression data with small sample
    • Authors: Ullah I; Paul S, Hong Z, et al.
      Pages: 3996 - 4003
      Abstract: Motivation: Under two biologically different conditions, we are often interested in identifying differentially expressed genes. When a large number of genes must be filtered or ranked, the assumption of equal variances in the two groups is usually violated for many of them. In these cases, exact tests are unavailable and Welch’s approximate test is the most reliable one. Welch’s test involves two layers of approximation: approximating the distribution of the statistic by a t-distribution, which in turn depends on approximate degrees of freedom. This study attempts to improve upon Welch’s approximate test by avoiding one layer of approximation. Results: We introduce a new distribution that generalizes the t-distribution and propose a Monte Carlo based test that uses only one layer of approximation for statistical inferences. Experimental results based on extensive simulation studies show that the Monte Carlo based tests enhance the statistical power and perform better than Welch’s t-approximation, especially when the equal variance assumption is not met and the sample with the larger variance has the smaller sample size. We analyzed two gene-expression datasets, namely the childhood acute lymphoblastic leukemia gene-expression dataset with 22 283 genes and the Golden Spike dataset produced by a controlled experiment with 13 966 genes. The new test identified additional genes of interest in both datasets. Some of these genes have been shown in the medical literature to play important roles. Availability and implementation: The R package mcBFtest is available on CRAN, and R scripts to reproduce all reported results are available at the GitHub repository, [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 15 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz189
      Issue No: Vol. 35, No. 20 (2019)
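The two layers of approximation in Welch's test mentioned above are visible in its textbook form: the statistic is referred to a t-distribution whose degrees of freedom are themselves estimated with the Welch–Satterthwaite formula. The sketch below computes both quantities for the standard Welch test only; it is not the Monte Carlo test proposed in the article, and the function name `welch_t` and the sample values are illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of the mean of x
    vy = variance(y) / len(y)  # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical expression values for one gene under two conditions.
    t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
    print(round(t, 4), round(df, 4))
```

With equal sample sizes and equal sample variances the estimated degrees of freedom reduce to n1 + n2 - 2, the value used by the pooled-variance t-test.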
  • Developing structural profile matrices for protein secondary structure and
           solvent accessibility prediction
    • Authors: Aydin Z; Azginoglu N, Bilgin H, et al.
      Pages: 4004 - 4010
      Abstract: Motivation: Predicting the secondary structure and solvent accessibility of proteins is among the essential steps that precede more elaborate 3D structure prediction tasks. Incorporating class label information contained in templates with known structures has the potential to improve the accuracy of prediction methods. Building a structural profile matrix is one such technique that provides a distribution for class labels at each amino acid position of the target. Results: In this paper, a new structural profiling technique is proposed that is based on deriving PFAM families and is combined with an existing approach. Cross-validation experiments on two benchmark datasets and at various similarity intervals demonstrate that the proposed profiling strategy performs significantly better than Homolpro, a state-of-the-art method for incorporating template information, as assessed by statistical hypothesis tests. Availability and implementation: The DSPRED method can be accessed by visiting the PSP server at [URL missing]. Source code and binaries are freely available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 01 Apr 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz238
      Issue No: Vol. 35, No. 20 (2019)
  • Probabilistic count matrix factorization for single cell expression data
    • Authors: Durif G; Modolo L, Mold J, et al.
      Pages: 4011 - 4019
      Abstract: Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. Those emerging patterns appear to be very challenging from the statistical point of view, especially to represent a summarized view of single-cell expression data. Principal component analysis (PCA) is a powerful tool for high dimensional data representation, searching for the latent directions that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data. Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It is able to jointly build a low dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared with other standard representation methods like t-SNE, and we illustrate its performance for the representation of single-cell expression data. Availability and implementation: Our work is implemented in the pCMF R-package ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz177
      Issue No: Vol. 35, No. 20 (2019)
  • maTE: discovering expressed interactions between microRNAs and their
    • Authors: Yousef M; Abdallah L, Allmer J, et al.
      Pages: 4020 - 4028
      Abstract: Motivation: Disease is often manifested via changes in transcript and protein abundance. MicroRNAs (miRNAs) are instrumental in regulating protein abundance and may measurably influence transcript levels. miRNAs often target more than one mRNA (for humans, the average is three), and mRNAs are often targeted by more than one miRNA (for the genes considered in this study, the average is also three). Therefore, it is difficult to determine the miRNAs that may cause the observed differential gene expression. We present a novel machine-learning-based approach, maTE, that integrates information about miRNA target genes with gene expression data. maTE depends on the availability of a sufficient number of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per-miRNA basis. Multiple high-scoring miRNAs are used to build a final classifier to improve separation. Results: The aim of the study is to find a set of miRNAs causing the regulation of their target genes that best explains the difference between groups (e.g. cancer versus control). maTE provides a list of significant groups of genes where each group is targeted by a specific miRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. Also, the results show that when the accuracy is much lower (e.g. ∼50%), the set of miRNAs provided is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, this approach opens new avenues for exploring miRNA regulation and may enable the development of miRNA-based biomarkers and drugs. Availability and implementation: The KNIME workflow, implementing maTE, is available at Bioinformatics online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz204
      Issue No: Vol. 35, No. 20 (2019)
  • Simultaneous clustering of multiview biomedical data using manifold
    • Authors: Yu Y; Zhang L, Zhang S, et al.
      Pages: 4029 - 4037
      Abstract: Motivation: Multiview clustering has attracted much attention in recent years. Several models and algorithms have been proposed for finding the clusters. However, these methods are developed either to find the consistent/common clusters across different views, or to identify the differential clusters among different views. In reality, both consistent and differential clusters may exist in multiview datasets. Thus, the development of simultaneous clustering methods that can identify both the consistent and the differential clusters is of great importance. Results: In this paper, we propose a method for simultaneous clustering of multiview data based on manifold optimization. The binary optimization model for finding the clusters is relaxed to a real-valued optimization problem on the Stiefel manifold, which is solved by a line-search algorithm on the manifold. We applied the proposed method to both simulation data and four real datasets from TCGA. Both studies show that when the underlying clusters are consistent, our method performs competitively with the state-of-the-art algorithms. When there are differential clusters, our method performs much better. In the real data study, we performed experiments on cancer stratification and differential cluster (module) identification across multiple cancer subtypes. For the patients of different subtypes, both consistent clusters and differential clusters are identified at the same time. The proposed method identifies more clusters that are enriched for gene ontology and KEGG pathways. The differential clusters could be used to explain the different mechanisms of cancer development in patients of different subtypes. Availability and implementation: Code is available for download. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz217
      Issue No: Vol. 35, No. 20 (2019)
  • SummaryAUC: a tool for evaluating the performance of polygenic risk
           prediction models in validation datasets with only summary level
    • Authors: Song L; Liu A, Shi J, et al.
      Pages: 4038 - 4044
      Abstract: Motivation: Polygenic risk score (PRS) methods based on genome-wide association studies (GWAS) have a potential for predicting the risk of developing complex diseases and are expected to become more accurate with larger training datasets and innovative statistical methods. The area under the ROC curve (AUC) is often used to evaluate the performance of PRSs, which requires individual genotypic and phenotypic data in an independent GWAS validation dataset. We are motivated to develop methods for approximating AUC of PRSs based on the summary level data of the validation dataset, which will greatly facilitate the development of PRS models for complex diseases. Results: We develop statistical methods and an R package SummaryAUC for approximating the AUC and its variance of a PRS when only the summary level data of the validation dataset are available. SummaryAUC can be applied to PRSs with SNPs either genotyped or imputed in the validation dataset. We examined the performance of SummaryAUC using a large-scale GWAS of schizophrenia. SummaryAUC provides accurate approximations to AUCs and their variances. The bias of AUC is typically <0.5% in most analyses. SummaryAUC cannot be applied to PRSs that use all SNPs in the genome because it is computationally prohibitive. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 26 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz176
      Issue No: Vol. 35, No. 20 (2019)
  • Modelling G×E with historical weather information improves genomic
           prediction in new environments
    • Authors: Gillberg J; Marttinen P, Mamitsuka H, et al.
      Pages: 4045 - 4052
      Abstract: Motivation: Interaction between the genotype and the environment (G×E) has a strong impact on the yield of major crop plants. Although influential, taking G×E explicitly into account in plant breeding has remained difficult. Recently G×E has been predicted from environmental and genomic covariates, but existing works have not shown that generalization to new environments and years without access to in-season data is possible, and practical applicability remains unclear. Using data from a barley breeding programme in Finland, we construct an in silico experiment to study the viability of G×E prediction under practical constraints. Results: We show that the response to the environment of a new generation of untested barley cultivars can be predicted in new locations and years using genomic data, machine learning and historical weather observations for the new locations. Our results highlight the need for models of G×E: non-linear effects clearly dominate linear ones, and the interaction between the soil type and daily rain is identified as the main driver for G×E for barley in Finland. Our study implies that genomic selection can be used to capture the yield potential in G×E effects for future growth seasons, providing a possible means to achieve the yield improvements needed for feeding the growing population. Availability and implementation: The data, accompanied by the method code, are available in the form of kernels to allow the results to be reproduced. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 12 Apr 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz197
      Issue No: Vol. 35, No. 20 (2019)
  • SodaPop: a forward simulation suite for the evolutionary dynamics of
           asexual populations on protein fitness landscapes
    • Authors: Gauthier L; Di Franco R, Serohijos A, et al.
      Pages: 4053 - 4062
      Abstract: Motivation: Protein evolution is determined by forces at multiple levels of biological organization. Random mutations have an immediate effect on the biophysical properties, structure and function of proteins. These same mutations also affect the fitness of the organism. However, the evolutionary fate of mutations, whether they reach fixation or are purged, also depends on population size and dynamics. There is an emerging interest, both theoretically and experimentally, in integrating these two factors in protein evolution. Although there are several tools available for simulating protein evolution, most of them focus on either the biophysical or the population-level determinants, but not both. Hence, there is a need for a publicly available computational tool to explore the effects of both protein biophysics and population dynamics on protein evolution. Results: To address this need, we developed SodaPop, a computational suite to simulate protein evolution in the context of the population dynamics of asexual populations. SodaPop accepts as input several fitness landscapes based on protein biochemistry or other user-defined fitness functions. The user can also provide as input experimental fitness landscapes derived from deep mutational scanning approaches or theoretical landscapes derived from physical force field estimates. Here, we demonstrate the broad utility of SodaPop with different applications describing the interplay of selection for protein properties and population dynamics. SodaPop is designed such that population geneticists can explore the influence of protein biochemistry on patterns of genetic variation, and biochemists and biophysicists can explore the role of population size and demography in protein evolution. Availability and implementation: Source code and binaries are freely available under the GNU GPLv3 license. The software is implemented in C++ and supported on Linux, Mac OS/X and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz175
      Issue No: Vol. 35, No. 20 (2019)
  • CyTOFmerge: integrating mass cytometry data across multiple panels
    • Authors: Abdelaal T; Höllt T, van Unen V, et al.
      Pages: 4063 - 4071
      Abstract: Motivation: High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results: To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection. Availability and implementation: Implementation is available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 15 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz180
      Issue No: Vol. 35, No. 20 (2019)
  • Simulation-assisted machine learning
    • Authors: Deist T; Patti A, Wang Z, et al.
      Pages: 4072 - 4080
      Abstract: Motivation: In a predictive modeling setting, if sufficient details of the system behavior are known, one can build and use a simulation for making predictions. When sufficient system details are not known, one typically turns to machine learning, which builds a black-box model of the system using a large dataset of input sample features and outputs. We consider a setting which is between these two extremes: some details of the system mechanics are known but not enough for creating simulations that can be used to make high quality predictions. In this context we propose using approximate simulations to build a kernel for use in kernelized machine learning methods, such as support vector machines. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to build the kernel. Results: We demonstrate and explore the simulation-based kernel (SimKern) concept using four synthetic complex systems: three biologically inspired models and one network flow optimization model. We show that, when the number of training samples is small compared to the number of features, the SimKern approach dominates over no-prior-knowledge methods. This approach should be applicable in all disciplines where predictive models are sought and informative yet approximate simulations are available. Availability and implementation: The Python SimKern software, the demonstration models (in MATLAB, R), and the datasets are available. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz199
      Issue No: Vol. 35, No. 20 (2019)
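      The simulation-based kernel idea described in the abstract above can be illustrated with a minimal sketch. This is not the SimKern software: the simulator `toy_simulation`, the Gaussian similarity, and the parameters `n_sims` and `gamma` are all assumptions made for illustration. Each sample is run through the simulator under many random parameter draws, and two samples score as similar when their simulated outcomes agree across those draws.

```python
import math
import random


def toy_simulation(sample, params):
    """Hypothetical approximate simulator (an assumption, not from the paper).

    Maps one sample's features to a scalar outcome for one parameter draw.
    """
    return math.tanh(sum(x * p for x, p in zip(sample, params)))


def simulation_kernel(samples, n_sims=100, gamma=1.0, seed=0):
    """Build a similarity matrix from simulated behaviour, not raw features."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    # Each draw stands for one uncertainty scenario of the unknown parameters.
    scenarios = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                 for _ in range(n_sims)]
    # outcomes[i][s] is the simulated behaviour of sample i under scenario s.
    outcomes = [[toy_simulation(x, p) for p in scenarios] for x in samples]
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Mean squared difference of outcomes across all scenarios.
            msd = sum((a - b) ** 2
                      for a, b in zip(outcomes[i], outcomes[j])) / n_sims
            # Gaussian similarity: 1.0 for identical behaviour, lower otherwise.
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * msd)
    return kernel


if __name__ == "__main__":
    data = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.5]]
    for row in simulation_kernel(data):
        print([round(v, 3) for v in row])
```

      Because the similarity is a Gaussian kernel on the outcome vectors, the resulting matrix is symmetric and positive semi-definite, so it could be handed to a learner that accepts precomputed kernels (for example, scikit-learn's `SVC(kernel="precomputed")`).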
  • Enhanced Waddington landscape model with cell–cell communication can
           explain molecular mechanisms of self-organization
    • Authors: Fooladi H; Moradi P, Sharifi-Zarchi A, et al.
      Pages: 4081 - 4088
      Abstract: Motivation: The molecular mechanisms of self-organization that orchestrate embryonic cells to create astonishing patterns have been among the major questions of developmental biology. It was recently shown that embryonic stem cells (ESCs), when cultured in particular micropatterns, can self-organize and mimic the early steps of pre-implantation embryogenesis. A systems-biology model to address this observation from a dynamical systems perspective is essential and can enhance understanding of the phenomenon. Results: Here, we propose a multicellular mathematical model for pattern formation during in vitro gastrulation of human ESCs. This model enhances the basic principles of the Waddington epigenetic landscape with cell–cell communication, in order to enable pattern and tissue formation. We have shown the sufficiency of a simple mechanism by using a minimal number of parameters in the model, in order to address a variety of experimental observations such as the formation of three germ layers and trophectoderm, responses to altered culture conditions and micropattern diameters, and unexpected spotted forms of the germ layers under certain conditions. Moreover, we have tested different boundary conditions as well as various shapes, observing that the pattern is initiated from the boundary and gradually spreads towards the center. This model provides a basis for in silico modeling of self-organization. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz201
      Issue No: Vol. 35, No. 20 (2019)
  • CliqueMS: a computational tool for annotating in-source metabolite ions
           from LC-MS untargeted metabolomics data based on a coelution similarity
    • Authors: Senan O; Aguilar-Mogas A, Navarro M, et al.
      Pages: 4089 - 4097
      Abstract: Motivation: The analysis of biological samples in untargeted metabolomic studies using LC-MS yields tens of thousands of ion signals. Annotating these features is of the utmost importance for answering questions as fundamental as, e.g. how many metabolites are there in a given sample. Results: Here, we introduce CliqueMS, a new algorithm for annotating in-source LC-MS1 data. CliqueMS is based on the similarity between coelution profiles and therefore, as opposed to most methods, allows for the annotation of a single spectrum. Furthermore, CliqueMS improves upon the state of the art in several dimensions: (i) it uses a more discriminatory feature similarity metric; (ii) it treats the similarities between features in a transparent way by means of a simple generative model; (iii) it uses a well-grounded maximum likelihood inference approach to group features; (iv) it uses empirical adduct frequencies to identify the parental mass and (v) it deals more flexibly with the identification of the parental mass by proposing and ranking alternative annotations. We validate our approach with simple mixtures of standards and with real complex biological samples. CliqueMS reduces the thousands of features typically obtained in complex samples to hundreds of metabolites, and it is able to correctly annotate more metabolites and adducts from a single spectrum than available tools. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz207
      Issue No: Vol. 35, No. 20 (2019)
  • Differential proteostatic regulation of insoluble and abundant proteins
    • Authors: Ramakrishnan R; Houben B, Rousseau F, et al.
      Pages: 4098 - 4107
      Abstract: Motivation: Despite intense effort, it has been difficult to explain chaperone dependencies of proteins from sequence or structural properties. Results: We constructed a database collecting all publicly available experimental chaperone interaction and dependency data for the Escherichia coli proteome, and enriched it with an extensive set of protein-specific as well as cell-context-dependent proteostatic parameters. Employing this new resource, we performed a comprehensive meta-analysis of the key determinants of chaperone interaction. Our study confirms that GroEL client proteins are biased toward insoluble proteins of low abundance, but for client proteins of the Trigger Factor/DnaK axis, we instead find that cellular parameters such as high protein abundance, translational efficiency and mRNA turnover are key determinants. We experimentally confirmed the finding that chaperone dependence is a function of translation rate and not protein-intrinsic parameters by tuning chaperone dependence of Green Fluorescent Protein (GFP) in E. coli by synonymous mutations only. The juxtaposition of both protein-intrinsic and cell-contextual chaperone triage mechanisms explains how the E. coli proteome combines reliable production of abundant and conserved proteins while also enabling the evolution of diverging metabolic functions. Availability and implementation: The database will be made available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz214
      Issue No: Vol. 35, No. 20 (2019)
  • Drug repositioning through integration of prior knowledge and projections
           of drugs and diseases
    • Authors: Xuan P; Cao Y, Zhang T, et al.
      Pages: 4108 - 4119
      Abstract: Motivation: Identifying and developing novel therapeutic effects for existing drugs contributes to the reduction of drug development costs. Most of the previous methods focus on integrating the heterogeneous data of drugs and diseases from multiple sources for predicting the candidate drug–disease associations. However, they fail to take the prior knowledge of drugs and diseases and their sparse characteristic into account. It is essential to develop a method that exploits the more useful information to predict the reliable candidate associations. Results: We present a method based on non-negative matrix factorization, DisDrugPred, to predict the drug-related candidate disease indications. A new type of drug similarity is first calculated based on their associated diseases. DisDrugPred completely integrates two types of disease similarities, the associations between drugs and diseases, and the various similarities between drugs from different levels including the chemical structures of drugs, the target proteins of drugs, the diseases associated with drugs and the side effects of drugs. The prior knowledge of drugs and diseases and the sparse characteristic of drug–disease associations provide a deep biological perspective for capturing the relationships between drugs and diseases. Simultaneously, the possibility that a drug is associated with a disease is also dependent on their projections in the low-dimension feature space. Therefore, DisDrugPred deeply integrates the diverse prior knowledge, the sparse characteristic of associations and the projections of drugs and diseases. DisDrugPred achieves better prediction performance than several state-of-the-art methods for drug–disease association prediction. During the validation process, DisDrugPred can also retrieve more actual drug–disease associations in the top part of the prediction result, which often attracts more attention from biologists. Moreover, case studies on five drugs further confirm DisDrugPred’s ability to discover potential candidate disease indications for drugs. Availability and implementation: The fourth type of drug similarity and the predicted candidates for all the drugs are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz182
      Issue No: Vol. 35, No. 20 (2019)
  • Compressed filesystem for managing large genome collections
    • Authors: Navarro G; Sepúlveda V, Marín M, et al.
      Pages: 4120 - 4128
      Abstract: Motivation: Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, those repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be usable only for archival; it must also allow direct access to the sequences, ideally in transparent form so that applications do not need to be rewritten. Results: We present a highly compressed filesystem that specializes in storing large collections of genomes and reads. The system obtains orders-of-magnitude compression by using Relative Lempel-Ziv, which exploits the high similarities between genomes of the same species. The filesystem transparently stores the files in compressed form, intercepting the system calls of the applications without the need to modify them. A client/server variant of the system stores the compressed files in a server, while the client’s filesystem transparently retrieves and updates the data from the server. The data between client and server are also transferred in compressed form, which saves an order of magnitude in network time. Availability and implementation: The C++ source code of our implementation is available for download.
      PubDate: Mon, 18 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz192
      Issue No: Vol. 35, No. 20 (2019)
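      To make the Relative Lempel-Ziv idea in the abstract above concrete, here is a minimal sketch, not the authors' filesystem code. A target sequence is encoded as a list of copies from a reference genome. The function names `rlz_compress` and `rlz_decompress` and the phrase encoding (a literal character is stored as `(char, 0)`) are assumptions chosen for illustration, and the longest-match search is deliberately naive; practical implementations would be expected to use an index over the reference instead.

```python
def rlz_compress(reference, target):
    """Greedy Relative Lempel-Ziv parse of `target` against `reference`.

    Returns a list of phrases. A phrase is either (position, length), meaning
    "copy `length` characters of the reference starting at `position`", or
    (char, 0), a literal for a character that never occurs in the reference.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Naive longest-match search over every reference position
        # (quadratic time, for illustration only).
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append((target[i], 0))  # literal character
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decompress(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    parts = []
    for head, length in phrases:
        parts.append(head if length == 0 else reference[head:head + length])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGTTTGA"
    seq = "ACGTTTGAACGTN"
    parsed = rlz_compress(ref, seq)
    print(parsed)
    assert rlz_decompress(ref, parsed) == seq
```

      The closer the target is to the reference, the fewer and longer the phrases become, which is why collections of genomes from the same species compress so well under this scheme.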
  • Health assistant: answering your questions anytime from biomedical
    • Authors: Jin Z; Zhang B, Fang F, et al.
      Pages: 4129 - 4139
      Abstract: Motivation: With the abundant medical resources available online, especially literature, it is possible for people to understand their own health status and relevant problems autonomously. However, obtaining the most appropriate answer from an increasingly large-scale database remains a great challenge. Here, we present a biomedical question answering framework and implement a system, Health Assistant, to enable the search process. Methods: In Health Assistant, a search engine is first designed to rank biomedical documents based on their contents. Then various query processing and search techniques are utilized to find the relevant documents. Afterwards, the titles and abstracts of the top-N documents are extracted to generate candidate snippets. Finally, our custom-designed query processing and retrieval approaches for short text are applied to locate the relevant snippets to answer the questions. Results: Our system is evaluated on the BioASQ benchmark datasets, and experimental results demonstrate the effectiveness and robustness of our system, compared to BioASQ participant systems and some state-of-the-art methods on both document retrieval and snippet retrieval tasks. Availability and implementation: A demo of our system is available online.
      PubDate: Mon, 18 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz195
      Issue No: Vol. 35, No. 20 (2019)
  • SPRINT-Gly: predicting N- and O-linked glycosylation sites of human and
           mouse proteins by using sequence and predicted structural properties
    • Authors: Taherzadeh G; Dehzangi A, Golchin M, et al.
      Pages: 4140 - 4146
      Abstract: Motivation: Protein glycosylation is one of the most abundant post-translational modifications and plays an important role in immune responses, intercellular signaling, inflammation and host-pathogen interactions. However, due to the poor ionization efficiency and microheterogeneity of glycopeptides, identifying glycosylation sites is a challenging task, and there is a demand for computational methods. Here, we constructed the largest dataset of human and mouse glycosylation sites to train deep learning neural networks and support vector machine classifiers to predict N-/O-linked glycosylation sites, respectively. Results: The method, called SPRINT-Gly, achieved consistent results between ten-fold cross validation and independent test for predicting human and mouse glycosylation sites. For N-glycosylation, a mouse-trained model performs equally well in human glycoproteins and vice versa; however, due to significant differences in O-linked sites, separate models were generated. Overall, SPRINT-Gly is 18% and 50% higher in Matthews correlation coefficient than the next best method compared in N-linked and O-linked sites, respectively. This improved performance is due to the inclusion of novel structure and sequence-based features. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz215
      Issue No: Vol. 35, No. 20 (2019)
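      For readers unfamiliar with the Matthews correlation coefficient quoted in the abstract above, the following sketch computes it from confusion-matrix counts using the standard definition. The function name `mcc_from_counts` and the example counts are hypothetical and unrelated to the SPRINT-Gly results; returning 0.0 when a marginal total is zero is a common convention, adopted here as an assumption.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    It ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined when any marginal total is zero; 0.0 by convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced site-prediction task.
    print(round(mcc_from_counts(tp=40, tn=900, fp=30, fn=30), 3))
```

      Unlike plain accuracy, this score stays informative when true sites are rare relative to non-sites, which is the usual situation in modification-site prediction.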
  • BrAPI—an application programming interface for plant breeding
    • Authors: Selby P; Abbeloos R, Backlund J, et al.
      Pages: 4147 - 4155
      Abstract: Motivation: Modern genomic breeding methods rely heavily on very large amounts of phenotyping and genotyping data, presenting new challenges in effective data management and integration. Recently, the size and complexity of datasets have increased significantly, with the result that data are often stored on multiple systems. As analyses of interest increasingly require aggregation of datasets from diverse sources, data exchange between disparate systems becomes a challenge. Results: To facilitate interoperability among breeding applications, we present the public plant Breeding Application Programming Interface (BrAPI). BrAPI is a standardized web service API specification. The development of BrAPI is a collaborative, community-based initiative involving a growing global community of over a hundred participants representing several dozen institutions and companies. Development of such a standard is recognized as critical to a number of important large breeding system initiatives as a foundational technology. The focus of the first version of the API is on providing services for connecting systems and retrieving basic breeding data including germplasm, study, observation, and marker data. A number of BrAPI-enabled applications, termed BrAPPs, have been written that take advantage of the emerging support of BrAPI by many databases. Availability and implementation: More information on BrAPI, including links to the specification, test suites, BrAPPs, and sample implementations, is available online. The BrAPI specification and the developer tools are provided as free and open source.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz190
      Issue No: Vol. 35, No. 20 (2019)
  • ppsPCP: a plant presence/absence variants scanner and pan-genome
           construction pipeline
    • Authors: Tahir Ul Qamar M; Zhu X, Xing F, et al.
      Pages: 4156 - 4158
      Abstract: Summary: Since the idea of pan-genomics emerged, several tools and pipelines have been introduced for prokaryotic pan-genomics. However, not a single comprehensive pipeline has been reported which could overcome the multiple challenges associated with eukaryotic pan-genomics. To aid eukaryotic pan-genomic studies, here we present the ppsPCP pipeline, which is designed for eukaryotes, especially plants. It is capable of scanning presence/absence variants (PAVs) and constructing a fully annotated pan-genome. We believe that, with these unique features of PAV scanning and building a pan-genome together with its annotation, ppsPCP will be useful for plant pan-genomic studies and will help researchers study genetic/phenotypic variations and genomic diversity. Availability and implementation: ppsPCP is freely available on GitHub, with an associated DOI, and via a project webpage. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 09 Mar 2019 00:00:00 GMT
      Issue No: Vol. 35, No. 20 (2019)
  • ScanNeo: identifying indel-derived neoantigens using RNA-Seq data
    • Authors: Wang T; Wang L, Alam S, et al.
      Pages: 4159 - 4161
      Abstract: Summary: Insertions and deletions (indels) have been recognized as an important source of tumor-specific mutant peptides (neoantigens). The focus of indel-derived neoantigen identification has been on leveraging DNA sequencing such as whole exome sequencing, with the use of RNA-seq less well explored. Here we present ScanNeo, a fast, streamlined computational pipeline for analyzing RNA-seq to predict neoepitopes derived from small to large-sized indels. We applied ScanNeo in a prostate cancer cell line and validated our predictions with matched mass spectrometry data. Finally, we demonstrated that indel neoantigens predicted from RNA-seq were associated with checkpoint inhibitor response in a cohort of melanoma patients. Availability and implementation: ScanNeo is implemented in Python. It is freely accessible in a GitHub repository. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 19 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz193
      Issue No: Vol. 35, No. 20 (2019)
  • GToTree: a user-friendly workflow for phylogenomics
    • Authors: Lee M; Ponty Y.
      Pages: 4162 - 4164
      Abstract: Summary: Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists’ work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required (such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering and stitching different tools together) can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability and implementation: GToTree is open-source and freely available for download. It is implemented primarily in bash with helper scripts written in Python. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz188
      Issue No: Vol. 35, No. 20 (2019)
  • Lemon: a framework for rapidly mining structural information from the
           Protein Data Bank
    • Authors: Fine J; Chopra G, Valencia A.
      Pages: 4165 - 4167
      Abstract: Motivation: The Protein Data Bank (PDB) currently holds over 140 000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community to develop software that mines, uses, categorizes and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of 3D experimentally determined structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner that results in slow distribution and updates with new experimental structures. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB. Results: Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in <10 min on modern, multithreaded hardware. The speed in parsing is obtained by using the recently developed MacroMolecule Transmission Format to reduce the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing the user to write custom functions to suit their objective. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features. Availability and implementation: The Lemon software is available as a C++ header library along with a PyPI package and example functions. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz178
      Issue No: Vol. 35, No. 20 (2019)
  • FoldX 5.0: working with RNA, small molecules and a new graphical interface
    • Authors: Delgado J; Radusky L, Cianferoni D, et al.
      Pages: 4168 - 4169
      Abstract: Summary: A new version of FoldX has been released. Its main new features allow running classic FoldX commands on structures containing RNA molecules, and it includes a module (ParamX) that allows parametrization of ligands or small molecules that were not recognized in previous versions. An extended FoldX graphical user interface has also been developed (available as a Python plugin for the YASARA molecular viewer), allowing user-friendly parametrization of new custom user molecules encoded in JSON format.
      PubDate: Fri, 15 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz184
      Issue No: Vol. 35, No. 20 (2019)
  • CABS-dock standalone: a toolbox for flexible protein–peptide docking
    • Authors: Kurcinski M; Pawel Ciemny M, Oleniecki T, et al.
      Pages: 4170 - 4172
      Abstract: Summary: CABS-dock standalone is a multiplatform Python package for protein–peptide docking with backbone flexibility. The main feature of the CABS-dock method is its ability to simulate significant backbone flexibility of the entire protein–peptide system in a reasonable computational time. In the default mode, the package runs a simulation of a fully flexible peptide searching for a binding site on the surface of a flexible protein receptor. The flexibility level of the molecules may be defined by the user. Furthermore, the CABS-dock standalone application provides users with full control over the docking simulation from the initial setup to the analysis of results. The standalone version is an upgrade of the original web server implementation: it introduces a number of customizable options, provides support for large-sized systems and offers a framework for deeper analysis of docking results. Availability and implementation: CABS-dock standalone is distributed under the MIT licence, which is free for academic and non-profit users. It is implemented in Python and Fortran. The CABS-dock standalone source code, a wiki with documentation and examples of use, and installation instructions for Linux, macOS and Windows are available in the CABS-dock standalone repository.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz185
      Issue No: Vol. 35, No. 20 (2019)
  • Holistic optimization of an RNA-seq workflow for multi-threaded
    • Authors: Hung L; Lloyd W, Agumbe Sridhar R, et al.
      Pages: 4173 - 4175
      Abstract: Summary: For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed, especially when using many threads. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of three steps: split, align and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads. Availability and implementation: Code (MIT license), supporting scripts, Dockerfiles and Docker images are available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 11 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz169
      Issue No: Vol. 35, No. 20 (2019)
  • Pinetree: a step-wise gene expression simulator with codon-specific
           translation rates
    • Authors: Jack B; Wilke C, Stegle O.
      Pages: 4176 - 4178
      Abstract: Motivation: Stochastic gene expression simulations often assume steady-state transcript levels, or they model transcription in more detail than translation. Moreover, they lack accessible programming interfaces, which limits their utility. Results: We present Pinetree, a step-wise gene expression simulator with codon-specific translation rates. Pinetree models both transcription and translation in a stochastic framework with individual polymerase and ribosome-level detail. Pinetree is written in C++ with a Python front-end, so any user familiar with Python can specify a genome and simulate gene expression. Pinetree was designed to be efficient and to scale to simulations of large plasmids or viral genomes. Availability and implementation: Pinetree is available on GitHub and the Python Package Index.
      PubDate: Thu, 28 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz203
      Issue No: Vol. 35, No. 20 (2019)
  • PleioNet: a web-based visualization tool for exploring pleiotropy across
           complex traits
    • Authors: Gao X; Huang H, Valencia A.
      Pages: 4179 - 4180
      Abstract: Summary: Pleiotropy plays an important role in furthering our understanding of the shared genetic architecture of different human diseases and traits. However, exploring and visualizing pleiotropic information with currently available public tools is limited and challenging. To aid researchers in constructing and digesting pleiotropic networks, we present PleioNet, a web-based visualization tool for exploring this information across human diseases and traits. This program provides an intuitive and interactive web interface that seamlessly integrates large database queries with visualizations that enable users to quickly explore complex high-dimensional pleiotropic information. PleioNet works on all modern computer and mobile web browsers, making pleiotropic information readily available to a broad range of researchers and clinicians with diverse technical backgrounds. We expect that PleioNet will be an important tool for studying the underlying pleiotropic connections among human diseases and traits. Availability and implementation: PleioNet is hosted on Google Cloud and is freely available.
      PubDate: Wed, 13 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz179
      Issue No: Vol. 35, No. 20 (2019)
  • ECOGEMS: efficient compression and retrieve of SNP data of 2058 rice
           accessions with integer sparse matrices
    • Authors: Yao W; Huang F, Zhang X, et al.
      Pages: 4181 - 4183
      Abstract: Summary: We proposed to store large-scale genotype data as integer sparse matrices, which consumed far fewer computing resources for storage and analysis than traditional approaches. In addition, the raw genotype data could be readily recovered from the integer sparse matrices. Utilizing this approach, we stored the genotype data of 1612 Asian cultivated rice accessions and 446 Asian wild rice accessions across 8 584 244 SNP sites in the ECOGEMS database with 310 MB of disk usage. Graphical interfaces for visualization, analysis and download of SNP data were implemented in ECOGEMS, which made it a valuable resource for rice functional genomic studies. Availability and implementation: The code and data of ECOGEMS are freely available, and ECOGEMS is deployed for online use. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz186
      Issue No: Vol. 35, No. 20 (2019)
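
The storage idea summarized in the ECOGEMS abstract can be illustrated with a minimal sketch. It assumes a hypothetical integer coding (0 = matches the reference, 1 = alternate allele, 2 = heterozygous, 3 = missing call) and a plain dictionary keyed by (site, accession); neither the coding nor the function names `compress` and `recover` come from ECOGEMS itself. Only non-zero codes are stored, and the raw genotype table is rebuilt by treating every absent key as a reference call.

```python
from typing import Dict, List, Tuple

# Hypothetical integer codes (an assumption for illustration, not ECOGEMS's scheme).
CODES = {"ref": 0, "alt": 1, "het": 2, "missing": 3}
NAMES = {code: name for name, code in CODES.items()}

# Sparse matrix as a dictionary of keys: (site index, accession index) -> code.
SparseMatrix = Dict[Tuple[int, int], int]


def compress(genotypes: List[List[str]]) -> SparseMatrix:
    """Store only the calls that differ from the reference (non-zero codes)."""
    sparse: SparseMatrix = {}
    for site, row in enumerate(genotypes):
        for accession, call in enumerate(row):
            code = CODES[call]
            if code != 0:
                sparse[(site, accession)] = code
    return sparse


def recover(sparse: SparseMatrix, n_sites: int, n_accessions: int) -> List[List[str]]:
    """Rebuild the full genotype table; absent keys are reference calls."""
    return [
        [NAMES[sparse.get((site, accession), 0)] for accession in range(n_accessions)]
        for site in range(n_sites)
    ]


# Toy data: 4 SNP sites (rows) by 3 accessions (columns).
raw = [
    ["ref", "ref", "alt"],
    ["ref", "ref", "ref"],
    ["het", "ref", "ref"],
    ["ref", "missing", "ref"],
]
sparse = compress(raw)
restored = recover(sparse, n_sites=4, n_accessions=3)
```

In this toy table only 3 of the 12 cells are stored, which is where the saving would come from when most calls match the reference; a production system would more likely use a dedicated sparse-matrix library than a dictionary.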
  • PedigreeNet: a web-based pedigree viewer for biological databases
    • Authors: Braun B; Schott D, Portwood J, et al.
      Pages: 4184 - 4186
      Abstract: Motivation: Plant breeding aims to improve current germplasm so that it can tolerate a wide range of biotic and abiotic stresses. To accomplish this goal, breeders rely on developing a deeper understanding of genetic makeup and relationships between plant varieties to make informed plant selections. Although rapid advances in genotyping technology have generated a large amount of data for breeders, tools that facilitate pedigree analysis and visualization are scant, leaving breeders to use classical, but inherently limited, hierarchical pedigree diagrams for a handful of plant varieties. To answer this need, we developed PedigreeNet, a simple web-based tool that can be easily implemented at biological databases to create and visualize customizable pedigree relationships in a network context, displaying pre-loaded and user-uploaded data. Results: As a proof of concept, we implemented PedigreeNet at the maize model organism database, MaizeGDB. The PedigreeNet viewer at MaizeGDB has a dynamically generated pedigree network of 4706 maize lines and 5487 relationships that are currently available as both a stand-alone web-based tool and integrated directly on the MaizeGDB Stock Pages. The tool allows the user to apply a number of filters, select or upload their own breeding relationships, center a pedigree network on a plant variety, identify the common ancestor between two varieties, and display the shortest path(s) between two varieties on the pedigree network. The PedigreeNet code layer is written as a JavaScript wrapper around Cytoscape Web. PedigreeNet fills a great need for breeders to have access to an online tool to represent and visually customize pedigree relationships. Availability and implementation: PedigreeNet is accessible online. The open source code is publicly and freely available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz208
      Issue No: Vol. 35, No. 20 (2019)
  • CytoBackBone: an algorithm for merging of phenotypic information from
           different cytometric profiles
    • Authors: Leite Pereira A; Lambotte O, Le Grand R, et al.
      Pages: 4187 - 4189
      Abstract: Motivation: Flow and mass cytometry are experimental techniques used to measure the level of proteins expressed by cells at the single-cell resolution. Several algorithms were developed in flow cytometry to increase the number of simultaneously measurable markers. These approaches aim to combine phenotypic information of different cytometric profiles obtained from different cytometry panels. Results: We present here a new algorithm, called CytoBackBone, which can merge phenotypic information from different cytometric profiles. This algorithm is based on nearest-neighbor imputation, but introduces the notion of acceptable and non-ambiguous nearest neighbors. We used mass cytometry data to illustrate the merging of cytometric profiles obtained by the CytoBackBone algorithm. Availability and implementation: CytoBackBone is implemented in R and the source code is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz212
      Issue No: Vol. 35, No. 20 (2019)
  • GladiaTOX: GLobal Assessment of Dose-IndicAtor in TOXicology
    • Authors: Belcastro V; Cano S, Marescotti D, et al.
      Pages: 4190 - 4192
      Abstract: Summary: The GladiaTOX R package is an open-source, flexible solution for high-content screening data processing and reporting in biomedical research. GladiaTOX takes advantage of the ‘tcpl’ core functionalities and provides a number of extensions: it provides a web-service solution to fetch raw data; it computes severity scores and exports ToxPi formatted files; and it contains a suite of functionalities to generate PDF reports for quality control and data processing. Availability and implementation: The GladiaTOX R package is available from Bioconductor and via git clone. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz187
      Issue No: Vol. 35, No. 20 (2019)
  • PgpRules: a decision tree based prediction server for P-glycoprotein
           substrates and inhibitors
    • Authors: Wang P; Tu Y, Tseng Y, et al.
      Pages: 4193 - 4195
      Abstract: Summary: P-glycoprotein (P-gp) is a member of the ABC transporter family that actively pumps xenobiotics out of cells to protect organisms from toxic compounds. P-gp substrates can be easily pumped out of the cells, which reduces their absorption; conversely, P-gp inhibitors can reduce such pumping activity. Hence, it is crucial for pharmacokinetics to know whether a drug is a P-gp substrate or inhibitor. Here we present PgpRules, an online P-gp substrate and P-gp inhibitor prediction server with rule sets. The two models were built using the classification and regression tree algorithm. For each compound uploaded, PgpRules not only predicts whether the compound is a P-gp substrate or a P-gp inhibitor, but also provides the rules containing chemical structural features for further structural optimization. Availability and implementation: PgpRules is freely accessible online. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 27 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz213
      Issue No: Vol. 35, No. 20 (2019)
  • onlineFDR: an R package to control the false discovery rate for growing
           data repositories
    • Authors: Robertson D; Wildenhain J, Javanmard A, et al.
      Pages: 4196 - 4199
      Abstract: Summary: In many areas of biological research, hypotheses are tested in a sequential manner, without having access to future P-values or even the number of hypotheses to be tested. A key setting where this online hypothesis testing occurs is in the context of publicly available data repositories, where the family of hypotheses to be tested is continually growing as new data is accumulated over time. Recently, Javanmard and Montanari proposed the first procedures that control the FDR for online hypothesis testing. We present an R package, onlineFDR, which implements these procedures and provides wrapper functions to apply them to a historic dataset or a growing data repository. Availability and implementation: The R package is freely available through Bioconductor. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 14 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz191
      Issue No: Vol. 35, No. 20 (2019)
  • TISIDB: an integrated repository portal for tumor–immune system
    • Authors: Ru B; Wong C, Tong Y, et al.
      Pages: 4200 - 4202
      Abstract: Summary: The interaction between tumor and immune system plays a crucial role in both cancer development and treatment response. To facilitate comprehensive investigation of tumor–immune interactions, we have designed a user-friendly web portal, TISIDB, which integrates multiple types of data resources in oncoimmunology. First, we manually curated 4176 records from 2530 publications, which reported 988 genes related to anti-tumor immunity. Second, genes associated with the resistance or sensitivity of tumor cells to T cell-mediated killing and immunotherapy were identified by analyzing high-throughput screening and genomic profiling data. Third, associations between any gene and immune features, such as lymphocytes, immunomodulators and chemokines, were pre-calculated for 30 TCGA cancer types. In TISIDB, biologists can cross-check the role of a gene of interest in tumor–immune interactions through literature mining and high-throughput data analysis, and generate testable hypotheses and high-quality figures for publication. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 23 Mar 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz210
      Issue No: Vol. 35, No. 20 (2019)
  • The TMCrys server for supporting crystallization of transmembrane proteins
    • Authors: Varga J; Tusnády G, Valencia A.
      Pages: 4203 - 4204
      Abstract: Motivation: Due to their special properties, the structures of transmembrane proteins are extremely hard to determine. Several methods exist to predict the propensity of successful completion of the structure determination process. However, available predictors incorporate data from all kinds of proteins, hence they can hardly differentiate between crystallizable and non-crystallizable membrane proteins. Results: We implemented a web server to simplify running the TMCrys prediction method, which was developed specifically to separate crystallizable and non-crystallizable membrane proteins. Availability and implementation: http://tmcrys.enzim.ttk.mta.hu Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 21 Feb 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz108
      Issue No: Vol. 35, No. 20 (2019)
  • relax: the analysis of biomolecular kinetics and thermodynamics using NMR
           relaxation dispersion data
    • Authors: Morin S; Linnet T, Lescanne M, et al.
      Pages: 4205 - 4205
      Abstract: Bioinformatics, (2014) doi: 10.1093/bioinformatics/btu166
      PubDate: Tue, 04 Jun 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz397
      Issue No: Vol. 35, No. 20 (2019)
  • Corrigendum to: A new cis-acting regulatory element driving gene
           expression in the zebrafish pineal gland
    • Authors: Alon S; Eisenberg E, Jacob-Hirsch J, et al.
      Pages: 4206 - 4206
      Abstract: Bioinformatics (2009) doi:10.1093/bioinformatics/btp031
      PubDate: Sat, 18 May 2019 00:00:00 GMT
      DOI: 10.1093/bioinformatics/btz410
      Issue No: Vol. 35, No. 20 (2019)