Journal Prestige (SJR): 6.14
Citation Impact (citeScore): 8
Number of Followers: 315  
  Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press
  • Recombinational DSBs-intersected genes converge on specific disease- and
           adaptability-related pathways
    • Authors: Yang Z; Luo H, Zhang Y, et al.
      Pages: 3421 - 3426
      Abstract: Motivation: The budding yeast Saccharomyces cerevisiae is a model species powerful for studying the recombination of eukaryotes. Although many recombination studies have been performed for this species by experimental methods, the population genomic study based on bioinformatics analyses is urgently needed to greatly increase the range and accuracy of recombination detection. Here, we carry out the population genomic analysis of recombination in S. cerevisiae to reveal the potential rules between recombination and evolution in eukaryotes. Results: By population genomic analysis, we discover significantly more and longer recombination events in clinical strains, which indicates that adverse environmental conditions create an obviously wider range of genetic combination in response to the selective pressure. Based on the analysis of recombinational double strand breaks (DSBs)-intersected genes (RDIGs), we find that RDIGs significantly converge on specific disease- and adaptability-related pathways, indicating that recombination plays a biologically key role in the repair of DSBs related to diseases and environmental adaptability, especially the human neurological disorders. By evolutionary analysis of RDIGs, we find that the RDIGs highly prevailing in populations of yeast tend to be more evolutionarily conserved, indicating the accurate repair of DSBs in these RDIGs is critical to ensure the eukaryotic survival or fitness. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 03 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty376
      Issue No: Vol. 34, No. 20 (2018)
  • Predicting RNA–protein binding sites and motifs through combining local
           and global deep convolutional neural networks
    • Authors: Pan X; Shen H, Valencia A.
      Pages: 3427 - 3436
      Abstract: Motivation: RNA-binding proteins (RBPs) take over 5–10% of the eukaryotic proteome and play key roles in many biological processes, e.g. gene regulation. Experimental detection of RBP binding sites is still time-intensive and high-costly. Instead, computational prediction of the RBP binding sites using patterns learned from existing annotation knowledge is a fast approach. From the biological point of view, the local structure context derived from local sequences will be recognized by specific RBPs. However, in computational modeling using deep learning, to our best knowledge, only global representations of entire RNA sequences are employed. So far, the local sequence information is ignored in the deep model construction process. Results: In this study, we present a computational method iDeepE to predict RNA–protein binding sites from RNA sequences by combining global and local convolutional neural networks (CNNs). For the global CNN, we pad the RNA sequences into the same length. For the local CNN, we split a RNA sequence into multiple overlapping fixed-length subsequences, where each subsequence is a signal channel of the whole sequence. Next, we train deep CNNs for multiple subsequences and the padded sequences to learn high-level features, respectively. Finally, the outputs from local and global CNNs are combined to improve the prediction. iDeepE demonstrates a better performance over state-of-the-art methods on two large-scale datasets derived from CLIP-seq. We also find that the local CNN runs 1.8 times faster than the global CNN with comparable performance when using GPUs. Our results show that iDeepE has captured experimentally verified binding motifs. Availability and implementation: [missing in source] Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 02 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty364
      Issue No: Vol. 34, No. 20 (2018)
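The two input layouts described in the iDeepE abstract above (padding for the global CNN, overlapping fixed-length subsequences for the local CNN) can be sketched roughly as follows. This is only an illustration: the function names, the pad character `N`, and the window and stride values are assumptions, not iDeepE's actual code or parameters.

```python
def pad_sequence(seq, length, pad_char="N"):
    """Pad (or truncate) seq so that it has exactly `length` characters."""
    return seq[:length].ljust(length, pad_char)


def split_overlapping(seq, window, stride, pad_char="N"):
    """Split seq into overlapping fixed-length windows.

    Consecutive windows start `stride` characters apart; the last window
    is padded with pad_char if it runs past the end of the sequence.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        chunk = seq[start:start + window]
        windows.append(chunk.ljust(window, pad_char))
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


# Hypothetical toy input: one short RNA sequence.
global_view = pad_sequence("ACGUA", 8)
local_view = split_overlapping("ACGUA", window=4, stride=2)
```

With a stride smaller than the window, neighbouring windows share characters, which is what makes the subsequences overlapping.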
  • Generic accelerated sequence alignment in SeqAn using vectorization and
    • Authors: Rahn R; Budach S, Costanza P, et al.
      Pages: 3437 - 3445
      Abstract: Motivation: Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (single instruction multiple data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we (a) distribute many independent alignments on multiple threads and (b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. Results: We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. Availability and implementation: The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 03 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty380
      Issue No: Vol. 34, No. 20 (2018)
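For readers unfamiliar with the "standard dynamic programming kernel" the SeqAn abstract above refers to, here is a minimal scalar sketch of global alignment scoring (Needleman-Wunsch with a linear gap cost). It is not SeqAn's API and uses no SIMD or threading; the function name and the scoring values are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
        prev = curr
    return prev[len(b)]
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; that independence is what a wavefront parallelization of a single alignment can exploit.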
  • Base-pair resolution detection of transcription factor binding site by
           deep deconvolutional network
    • Authors: Salekin S; Zhang J, Huang Y, et al.
      Pages: 3446 - 3453
      Abstract: Motivation: Transcription factor (TF) binds to the promoter region of a gene to control gene expression. Identifying precise TF binding sites (TFBSs) is essential for understanding the detailed mechanisms of TF-mediated gene regulation. However, there is a shortage of computational approach that can deliver single base pair resolution prediction of TFBS. Results: In this paper, we propose DeepSNR, a Deep Learning algorithm for predicting TF binding location at Single Nucleotide Resolution de novo from DNA sequence. DeepSNR adopts a novel deconvolutional network (deconvNet) model and is inspired by the similarity to image segmentation by deconvNet. The proposed deconvNet architecture is constructed on top of ‘DeepBind’ and we trained the entire model using TF-specific data from ChIP-exonuclease (ChIP-exo) experiments. DeepSNR has been shown to outperform motif search–based methods for several evaluation metrics. We have also demonstrated the usefulness of DeepSNR in the regulatory analysis of TFBS as well as in improving the TFBS prediction specificity using ChIP-seq data. Availability and implementation: DeepSNR is available open source in the GitHub repository ([URL missing in source]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 10 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty383
      Issue No: Vol. 34, No. 20 (2018)
  • In vitro versus in vivo compositional landscapes of histone sequence
           preferences in eucaryotic genomes
    • Authors: Giancarlo R; Rombo S, Utro F, et al.
      Pages: 3454 - 3460
      Abstract: Motivation: Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vitro and in vivo have a counterpart in terms of the underlying genomic sequences. Results: We present the first quantitative comparison between the in vitro and in vivo nucleosome maps of two model organisms (S. cerevisiae and C. elegans). The comparison is based on the construction of weighted k-mer dictionaries. Our findings show that there is a good level of sequence conservation between in vitro and in vivo in both the two organisms, in contrast to the abovementioned important differences in chromatin structural organization. Moreover, our results provide evidence that the two organisms predispose themselves differently, in terms of sequence composition and both in vitro and in vivo, for the nucleosome occupancy. This leads to the conclusion that, although the notion of a genome encoding for its own nucleosome occupancy is general, the intrinsic histone k-mer sequence preferences tend to be species-specific. Availability and implementation: The files containing the dictionaries and the main results of the analysis are available at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 10 Sep 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty799
      Issue No: Vol. 34, No. 20 (2018)
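The abstract above does not say how its weighted k-mer dictionaries are defined. Purely to make the general idea concrete, the sketch below accumulates a per-sequence weight for every k-mer occurrence; the function name, the input layout, and the choice of weight (which might stand for something like an occupancy score) are all assumptions and not the authors' construction.

```python
from collections import defaultdict


def weighted_kmer_dictionary(weighted_seqs, k):
    """Map each k-mer to the summed weight of the sequences it occurs in.

    weighted_seqs is an iterable of (sequence, weight) pairs; a k-mer that
    occurs several times in one sequence contributes that weight each time.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    totals = defaultdict(float)
    for seq, weight in weighted_seqs:
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            totals[seq[i:i + k]] += weight
    return dict(totals)


# Hypothetical toy data: two short sequences with made-up weights.
example = weighted_kmer_dictionary([("AAAT", 2.0), ("AAT", 0.5)], k=2)
```

Two such dictionaries, for instance one built from in vitro data and one from in vivo data, could then be compared key by key.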
  • Efficient flexible backbone protein–protein docking for challenging
    • Authors: Marze N; Roy Burman S, Sheffler W, et al.
      Pages: 3461 - 3469
      Abstract: Motivation: Binding-induced conformational changes challenge current computational docking algorithms by exponentially increasing the conformational space to be explored. To restrict this search to relevant space, some computational docking algorithms exploit the inherent flexibility of the protein monomers to simulate conformational selection from pre-generated ensembles. As the ensemble size expands with increased flexibility, these methods struggle with efficiency and high false positive rates. Results: Here, we develop and benchmark RosettaDock 4.0, which efficiently samples large conformational ensembles of flexible proteins and docks them using a novel, six-dimensional, coarse-grained score function. A strong discriminative ability allows an eight-fold higher enrichment of near-native candidate structures in the coarse-grained phase compared to RosettaDock 3.2. It adaptively samples 100 conformations each of the ligand and the receptor backbone while increasing computational time by only 20–80%. In local docking of a benchmark set of 88 proteins of varying degrees of flexibility, the expected success rate (defined as cases with ≥50% chance of achieving 3 near-native structures in the 5 top-ranked ones) for blind predictions after resampling is 77% for rigid complexes, 49% for moderately flexible complexes and 31% for highly flexible complexes. These success rates on flexible complexes are a substantial step forward from all existing methods. Additionally, for highly flexible proteins, we demonstrate that when a suitable conformer generation method exists, the method successfully docks the complex. Availability and implementation: As a part of the Rosetta software suite, RosettaDock 4.0 is available at [URL missing in source] to all non-commercial users for free and to commercial users for a fee. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 30 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty355
      Issue No: Vol. 34, No. 20 (2018)
  • JRmGRN: joint reconstruction of multiple gene regulatory networks with
           common hub genes using data from multiple tissues or conditions
    • Authors: Deng W; Zhang K, Liu S, et al.
      Pages: 3470 - 3478
      Abstract: Motivation: Joint reconstruction of multiple gene regulatory networks (GRNs) using gene expression data from multiple tissues/conditions is very important for understanding common and tissue/condition-specific regulation. However, there are currently no computational models and methods available for directly constructing such multiple GRNs that not only share some common hub genes but also possess tissue/condition-specific regulatory edges. Results: In this paper, we proposed a new graphic Gaussian model for joint reconstruction of multiple gene regulatory networks (JRmGRN), which highlighted hub genes, using gene expression data from several tissues/conditions. Under the framework of Gaussian graphical model, JRmGRN method constructs the GRNs through maximizing a penalized log likelihood function. We formulated it as a convex optimization problem, and then solved it with an alternating direction method of multipliers (ADMM) algorithm. The performance of JRmGRN was first evaluated with synthetic data and the results showed that JRmGRN outperformed several other methods for reconstruction of GRNs. We also applied our method to real Arabidopsis thaliana RNA-seq data from two light regime conditions in comparison with other methods, and both common hub genes and some conditions-specific hub genes were identified with higher accuracy and precision. Availability and implementation: JRmGRN is available as a R program from: [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 30 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty354
      Issue No: Vol. 34, No. 20 (2018)
  • Edge-group sparse PCA for network-guided high dimensional data analysis
    • Authors: Min W; Liu J, Zhang S, et al.
      Pages: 3479 - 3487
      Abstract: Motivation: Principal component analysis (PCA) has been widely used to deal with high-dimensional gene expression data. In this study, we proposed an Edge-group Sparse PCA (ESPCA) model by incorporating the group structure from a prior gene network into the PCA framework for dimension reduction and feature interpretation. ESPCA enforces sparsity of principal component (PC) loadings through considering the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key of this algorithm is to solve a new k-edge sparse projection problem and a greedy strategy has been adapted to address it. Here we adopted ESPCA for analyzing multiple gene expression matrices simultaneously. By incorporating prior knowledge, our method can overcome the drawbacks of sparse PCA and capture some gene modules with better biological interpretations. Results: We evaluated the performance of ESPCA using a set of artificial datasets and two real biological datasets (including TCGA pan-cancer expression data and ENCODE expression data), and compared their performance with PCA and sparse PCA. The results showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Availability and implementation: An R package of ESPCA is available at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 03 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty362
      Issue No: Vol. 34, No. 20 (2018)
  • geck: trio-based comparative benchmarking of variant calls
    • Authors: Kómár P; Kural D, Stegle O.
      Pages: 3488 - 3495
      Abstract: Motivation: Classical methods of comparing the accuracies of variant calling pipelines are based on truth sets of variants whose genotypes are previously determined with high confidence. An alternative way of performing benchmarking is based on Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide truth set-independent benchmarking information, and enable benchmarking less-studied variants and diverse populations. Results: We introduce a statistical mixture model for comparing two variant calling pipelines from genotype data they produce after running on individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high-confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with 10⁻³ uncertainty. Availability and implementation: The Python library geck, and usage examples are available at the following URL: [URL missing in source], under the GNU General Public License v3. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 29 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty415
      Issue No: Vol. 34, No. 20 (2018)
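The Mendelian constraint that trio-based benchmarking builds on can be illustrated with a small check. This is not geck's mixture model or its API; it is a sketch assuming biallelic autosomal sites with genotypes encoded as alternate-allele counts (0, 1 or 2), and all function names are hypothetical.

```python
def transmittable(genotype):
    """Alt-allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can arise from one allele per parent."""
    return any(
        m + f == child
        for m in transmittable(mother)
        for f in transmittable(father)
    )


def violation_rate(trios):
    """Fraction of (mother, father, child) genotype triples that violate
    Mendelian inheritance; returns 0.0 for an empty input."""
    trios = list(trios)
    if not trios:
        return 0.0
    violations = sum(not is_mendelian_consistent(*trio) for trio in trios)
    return violations / len(trios)


# Hypothetical calls at four sites: the second and fourth are violations.
rate = violation_rate([(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)])
```

A raw violation rate like this only flags inconsistent sites; it does not by itself say which family member's call is wrong, which is the kind of question a statistical model has to address.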
  • polymapR—linkage analysis and genetic map construction from F1
           populations of outcrossing polyploids
    • Authors: Bourke P; van Geest G, Voorrips R, et al.
      Pages: 3496 - 3502
      Abstract: Motivation: Polyploid species carry more than two copies of each chromosome, a condition found in many of the world’s most important crops. Genetic mapping in polyploids is more complex than in diploid species, resulting in a lack of available software tools. These are needed if we are to realize all the opportunities offered by modern genotyping platforms for genetic research and breeding in polyploid crops. Results: polymapR is an R package for genetic linkage analysis and integrated genetic map construction from bi-parental populations of outcrossing autopolyploids. It can currently analyse triploid, tetraploid and hexaploid marker datasets and is applicable to various crops including potato, leek, alfalfa, blueberry, chrysanthemum, sweet potato or kiwifruit. It can detect, estimate and correct for preferential chromosome pairing, and has been tested on high-density marker datasets from potato, rose and chrysanthemum, generating high-density integrated linkage maps in all of these crops. Availability and implementation: polymapR is freely available under the general public license from the Comprehensive R Archive Network (CRAN) at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 02 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty371
      Issue No: Vol. 34, No. 20 (2018)
  • REGGAE: a novel approach for the identification of key transcriptional
    • Authors: Kehl T; Schneider L, Kattler K, et al.
      Pages: 3503 - 3510
      Abstract: Motivation: Transcriptional regulators play a major role in most biological processes. Alterations in their activities are associated with a variety of diseases and in particular with tumor development and progression. Hence, it is important to assess the effects of deregulated regulators on pathological processes. Results: Here, we present REGulator-Gene Association Enrichment (REGGAE), a novel method for the identification of key transcriptional regulators that have a significant effect on the expression of a given set of genes, e.g. genes that are differentially expressed between two sample groups. REGGAE uses a Kolmogorov–Smirnov-like test statistic that implicitly combines associations between regulators and their target genes with an enrichment approach to prioritize the influence of transcriptional regulators. We evaluated our method in two different application scenarios, which demonstrate that REGGAE is well suited for uncovering the influence of transcriptional regulators and is a valuable tool for the elucidation of complex regulatory mechanisms. Availability and implementation: REGGAE is freely available at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 07 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty372
      Issue No: Vol. 34, No. 20 (2018)
  • Application of network smoothing to glycan LC-MS profiling
    • Authors: Klein J; Carvalho L, Zaia J, et al.
      Pages: 3511 - 3518
      Abstract: Motivation: Glycosylation is one of the most heterogeneous and complex protein post-translational modifications. Liquid chromatography coupled mass spectrometry (LC-MS) is a common high throughput method for analyzing complex biological samples. Accurate study of glycans require high resolution mass spectrometry. Mass spectrometry data contains intricate sub-structures that encode mass and abundance, requiring several transformations before it can be used to identify biological molecules, requiring automated tools to analyze samples in a high throughput setting. Existing tools for interpreting the resulting data do not take into account related glycans when evaluating individual observations, limiting their sensitivity. Results: We developed an algorithm for assigning glycan compositions from LC-MS data by exploring biosynthetic network relationships among glycans. Our algorithm optimizes a set of likelihood scoring functions based on glycan chemical properties but uses network Laplacian regularization and optionally prior information about expected glycan families to smooth the likelihood and thus achieve a consistent and more representative solution. Our method was able to identify as many, or more glycan compositions compared to previous approaches, and demonstrated greater sensitivity with regularization. Our network definition was tailored to N-glycans but the method may be applied to glycomics data from other glycan families like O-glycans or heparan sulfate where the relationships between compositions can be expressed as a graph. Availability and implementation: Built Executable and Source Code: [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 22 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty397
      Issue No: Vol. 34, No. 20 (2018)
  • Identification and characterization of moonlighting long non-coding RNAs
           based on RNA and protein interactome
    • Authors: Cheng L; Leung K, Berger B.
      Pages: 3519 - 3528
      Abstract: Motivation: Moonlighting proteins are a class of proteins having multiple distinct functions, which play essential roles in a variety of cellular and enzymatic functioning systems. Although there have long been calls for computational algorithms for the identification of moonlighting proteins, research on approaches to identify moonlighting long non-coding RNAs (lncRNAs) has never been undertaken. Here, we introduce a novel methodology, MoonFinder, for the identification of moonlighting lncRNAs. MoonFinder is a statistical algorithm identifying moonlighting lncRNAs without a priori knowledge through the integration of protein interactome, RNA–protein interactions and functional annotation of proteins. Results: We identify 155 moonlighting lncRNA candidates and uncover that they are a distinct class of lncRNAs characterized by specific sequence and cellular localization features. The non-coding genes that transcript moonlighting lncRNAs tend to have shorter but more exons and the moonlighting lncRNAs have a variable localization pattern with a high chance of residing in the cytoplasmic compartment in comparison to the other lncRNAs. Moreover, moonlighting lncRNAs and moonlighting proteins are rather mutually exclusive in terms of both their direct interactions and interacting partners. Our results also shed light on how the moonlighting candidates and their interacting proteins implicated in the formation and development of cancers and other diseases. Availability and implementation: The code implementing MoonFinder is supplied as an R package in the supplementary material. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 16 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty399
      Issue No: Vol. 34, No. 20 (2018)
  • Covariate-adjusted heatmaps for visualizing biological data via
           correlation decomposition
    • Authors: Wu H; Tien Y, Ho M, et al.
      Pages: 3529 - 3538
      Abstract: Motivation: Heatmap is a popular visualization technique in biology and related fields. In this study, we extend heatmaps within the framework of matrix visualization (MV) by incorporating a covariate adjustment process through the estimation of conditional correlations. MV can explore the embedded information structure of high-dimensional large-scale datasets effectively without dimension reduction. The benefit of the proposed covariate-adjusted heatmap is in the exploration of conditional association structures among the subjects or variables that cannot be done with conventional MV. Results: For adjustment of a discrete covariate, the conditional correlation is estimated by the within and between analysis. This procedure decomposes a correlation matrix into the within- and between-component matrices. The contribution of the covariate effects can then be assessed through the relative structure of the between-component to the original correlation matrix while the within-component acts as a residual. When a covariate is of continuous nature, the conditional correlation is equivalent to the partial correlation under the assumption of a joint normal distribution. A test is then employed to identify the variable pairs which possess the most significant differences at varying levels of correlation before and after a covariate adjustment. In addition, a z-score significance map is constructed to visualize these results. A simulation and three biological datasets are employed to illustrate the power and versatility of our proposed method. Availability and implementation: GAP is available to readers and is free to non-commercial applications. The installation instructions, the user’s manual, and the detailed tutorials can be found at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Thu, 26 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty335
      Issue No: Vol. 34, No. 20 (2018)
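The partial correlation mentioned in the abstract above, for a single continuous covariate z, has the standard first-order form r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below computes it with the standard library only. It is not GAP's implementation; the function names and the toy vectors are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, adjusting for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data: x and y are correlated only through z.
z = [1, 1, -1, -1]
x = [2, 0, 0, -2]   # z plus a component unrelated to z
y = [2, 0, -2, 0]   # z plus a different, unrelated component
raw = pearson(x, y)
adjusted = partial_correlation(x, y, z)
```

In this toy case the raw correlation between x and y is 0.5, while the adjusted value is zero up to floating-point rounding, which is the kind of before-and-after difference a covariate-adjusted heatmap is meant to expose.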
  • D3NER: biomedical named entity recognition using CRF-biLSTM improved with
           fine-tuned embeddings of various linguistic information
    • Authors: Dang T; Le H, Nguyen T, et al.
      Pages: 3539 - 3546
      Abstract: Motivation: Recognition of biomedical named entities in the textual literature is a highly challenging research topic with great interest, playing as the prerequisite for extracting huge amount of high-valued biomedical knowledge deposited in unstructured text and transforming them into well-structured formats. Long Short-Term Memory (LSTM) networks have recently been employed in various biomedical named entity recognition (NER) models with great success. They, however, often did not take advantages of all useful linguistic information and still have many aspects to be further improved for better performance. Results: We propose D3NER, a novel biomedical named entity recognition (NER) model using conditional random fields and bidirectional long short-term memory improved with fine-tuned embeddings of various linguistic information. D3NER is thoroughly compared with seven very recent state-of-the-art NER models, of which two are even joint models with named entity normalization (NEN), which was proven to bring performance improvements to NER. Experimental results on benchmark datasets, i.e. the BioCreative V Chemical Disease Relation (BC5 CDR), the NCBI Disease and the FSU-PRGE gene/protein corpus, demonstrate the out-performance and stability of D3NER over all compared models for chemical, gene/protein NER and over all models (without NEN jointed, as D3NER) for disease NER, in almost all cases. On the BC5 CDR corpus, D3NER achieves F1 of 93.14 and 84.68% for the chemical and disease NER, respectively; while on the NCBI Disease corpus, its F1 for the disease NER is 84.41%. Its F1 for the gene/protein NER on FSU-PRGE is 87.62%. Availability and implementation: Data and source code are available at: [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 30 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty356
      Issue No: Vol. 34, No. 20 (2018)
  • MiRGOFS: a GO-based functional similarity measurement for miRNAs, with
           applications to the prediction of miRNA subcellular localization and
           miRNA–disease association
    • Authors: Yang Y; Fu X, Qu W, et al.
      Pages: 3547 - 3556
      Abstract: Motivation: Benefiting from high-throughput experimental technologies, whole-genome analysis of microRNAs (miRNAs) has been more and more common to uncover important regulatory roles of miRNAs and identify miRNA biomarkers for disease diagnosis. As a complementary information to the high-throughput experimental data, domain knowledge like the Gene Ontology and KEGG pathway is usually used to guide gene function analysis. However, functional annotation for miRNAs is scarce in the public databases. Till now, only a few methods have been proposed for measuring the functional similarity between miRNAs based on public annotation data, and these methods cover a very limited number of miRNAs, which are not applicable to large-scale miRNA analysis. Results: In this paper, we propose a new method to measure the functional similarity for miRNAs, called miRGOFS, which has two notable features: (i) it adopts a new GO semantic similarity metric which considers both common ancestors and descendants of GO terms; (ii) it computes similarity between GO sets in an asymmetric manner, and weights each GO term by its statistical significance. The miRGOFS-based predictor achieves an F1 of 61.2% on a benchmark dataset of miRNA localization, and AUC values of 87.7 and 81.1% on two benchmark sets of miRNA–disease association, respectively. Compared with the existing functional similarity measurements of miRNAs, miRGOFS has the advantages of higher accuracy and larger coverage of human miRNAs (over 1000 miRNAs). Availability and implementation: [missing in source] Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Fri, 27 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty343
      Issue No: Vol. 34, No. 20 (2018)
  • ProteomeVis: a web app for exploration of protein properties from
           structure to sequence evolution across organisms’ proteomes
    • Authors: Razban R; Gilson A, Durfee N, et al.
      Pages: 3557 - 3565
      Abstract: Motivation: Protein evolution spans time scales and its effects span the length of an organism. A web app named ProteomeVis is developed to provide a comprehensive view of protein evolution in the Saccharomyces cerevisiae and Escherichia coli proteomes. ProteomeVis interactively creates protein chain graphs, where edges between nodes represent structure and sequence similarities within user-defined ranges, to study the long time scale effects of protein structure evolution. The short time scale effects of protein sequence evolution are studied by sequence evolutionary rate (ER) correlation analyses with protein properties that span from the molecular to the organismal level. Results: We demonstrate the utility and versatility of ProteomeVis by investigating the distribution of edges per node in organismal protein chain universe graphs (oPCUGs) and putative ER determinants. S. cerevisiae and E. coli oPCUGs are scale-free with scaling constants of 1.79 and 1.56, respectively. Both scaling constants can be explained by a previously reported theoretical model describing protein structure evolution. Protein abundance most strongly correlates with ER among properties in ProteomeVis, with Spearman correlations of –0.49 (P-value < 10⁻¹⁰) and –0.46 (P-value < 10⁻¹⁰) for S. cerevisiae and E. coli, respectively. This result is consistent with previous reports that found protein expression to be the most important ER determinant. Availability and implementation: ProteomeVis is freely accessible at [URL missing in source]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 08 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty370
      Issue No: Vol. 34, No. 20 (2018)
  • MRMAssayDB: an integrated resource for validated targeted proteomics
    • Authors: Bhowmick P; Mohammed Y, Borchers C, et al.
      Pages: 3566 - 3571
      Abstract: Motivation: Multiple Reaction Monitoring (MRM)-based targeted proteomics is increasingly being used to study the molecular basis of disease. When combined with an internal standard, MRM allows absolute quantification of proteins in virtually any type of sample, but the development and validation of an MRM assay for a specific protein is laborious. Therefore, several public repositories now host targeted proteomics MRM assays, including NCI’s Clinical Proteomic Tumor Analysis Consortium assay portals, PeptideAtlas SRM Experiment Library, SRMAtlas, PanoramaWeb and PeptideTracker, all of which contain different levels of information. Results: Here we present MRMAssayDB, a web-based application that integrates these repositories into a single resource. MRMAssayDB maps and links the targeted assays, annotates the proteins with information from UniProtKB, KEGG pathways and Gene Ontologies, and provides several visualization options on the peptide and protein level. Currently MRMAssayDB contains >168K assays covering more than 34K proteins from 63 organisms; >13.5K of these proteins are present in >2.3K KEGG biological pathways corresponding to >300 master pathways, and map to >13K GO biological processes. MRMAssayDB allows comprehensive searches for a targeted-proteomics assay depending on the user’s interests, by using target-protein name or accession number, or using annotations such as subcellular localization, biological pathway, or disease or drug associations. The user can see how many data repositories include a specific peptide assay, and the commonly used transitions for each peptide in all empirical data from the repositories. Availability and implementation: http://mrmassaydb.proteincentre.com Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 14 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty385
      Issue No: Vol. 34, No. 20 (2018)
  • AnnotSV: an integrated tool for structural variations annotation
    • Authors: Geoffroy V; Herenger Y, Kress A, et al.
      Pages: 3572 - 3574
      Abstract: Summary: Structural Variations (SV) are a major source of variability in the human genome that shaped its current structure during evolution. Moreover, many human diseases are caused by SV, highlighting the need not only to accurately detect those genomic events but also to annotate them and assist their biological interpretation. Therefore, we developed AnnotSV, which compiles functional, regulatory and clinically relevant information and aims at providing annotations useful to (i) interpret the potential pathogenicity of SV and (ii) filter out potential false-positive SV. In particular, AnnotSV reports heterozygous and homozygous counts of single nucleotide variations (SNVs) and small insertions/deletions called within each SV for the analyzed patients; this genomic information is extremely useful to support or question the existence of an SV. We also report the computed allelic frequency relative to overlapping variants from DGV (MacDonald et al., 2014), which is especially powerful for filtering out common SV. To delineate the strength of AnnotSV, we annotated the 4751 SV from one sample of the 1000 Genomes Project, integrating the sample information of four million SNVs/indels, in less than 60 s. Availability and implementation: AnnotSV is implemented in Tcl and runs in command line on all platforms. The source code is available under the GNU GPL license. Source code, README and Supplementary data are available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Sat, 14 Apr 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty304
      Issue No: Vol. 34, No. 20 (2018)
  • FlexiDot: highly customizable, ambiguity-aware dotplots for visual
           sequence analyses
    • Authors: Seibt K; Schmidt T, Heitkam T, et al.
      Pages: 3575 - 3577
      Abstract: Summary: FlexiDot is a cross-platform dotplot suite generating high quality self, pairwise and all-against-all visualizations. To improve dotplot suitability for comparison of consensus and error-prone sequences, FlexiDot harbors routines for strict and relaxed handling of ambiguities and substitutions. Our shading modules facilitate dotplot interpretation and motif identification by adding information on sequence annotations and sequence similarities. Combined with collage-like outputs, FlexiDot supports simultaneous visual screening of large sequence sets, enabling dotplot use for routine analyses. Availability and implementation: FlexiDot is implemented in Python 2.7. Software and documentation are freely available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 14 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty395
      Issue No: Vol. 34, No. 20 (2018)
  • YAMDA: thousandfold speedup of EM-based motif discovery using deep
           learning libraries and GPU
    • Authors: Quang D; Guan Y, Parker S, et al.
      Pages: 3578 - 3580
      Abstract: Motivation: Motif discovery in large biopolymer sequence datasets can be computationally demanding, presenting significant challenges for discovery in omics research. MEME, arguably one of the most popular motif discovery programs, takes quadratic time with respect to dataset size, leading to excessively long runtimes for large datasets. Therefore, there is a demand for fast programs that can generate results of the same quality as MEME. Results: Here we describe YAMDA, a highly scalable motif discovery software package. It is built on PyTorch, a tensor computation deep learning library with strong GPU acceleration that is highly optimized for tensor operations that are also useful for motifs. YAMDA takes linear time to find motifs as accurately as MEME, completing in seconds or minutes, which translates to speedups over a thousandfold. Availability and implementation: YAMDA is freely available on GitHub ([URL missing]). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 22 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty396
      Issue No: Vol. 34, No. 20 (2018)
  • bcSeq: an R package for fast sequence mapping in high-throughput shRNA and
           CRISPR screens
    • Authors: Lin J; Gresham J, Wang T, et al.
      Pages: 3581 - 3583
      Abstract: Summary: CRISPR-Cas9 and shRNA high-throughput sequencing screens have abundant applications for basic and translational research. Methods and tools for the analysis of these screens must properly account for sequencing error, resolve ambiguous mappings among similar sequences in the barcode library in a statistically principled manner, and be computationally efficient. Herein we present bcSeq, an open source R package that implements a fast and parallelized algorithm for mapping high-throughput sequencing reads to a barcode library while tolerating sequencing error. The algorithm uses a Trie data structure for speed and resolves ambiguous mappings by using a statistical sequencing error model based on Phred scores for each read. Availability and implementation: The package source code and an accompanying tutorial are available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 22 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty402
      Issue No: Vol. 34, No. 20 (2018)
  • Jwalk and MNXL web server: model validation using restraints from
           crosslinking mass spectrometry
    • Authors: Bullock J; Thalassinos K, Topf M, et al.
      Pages: 3584 - 3585
      Abstract: Motivation: Crosslinking Mass Spectrometry generates restraints that can be used to model proteins and protein complexes. Previously, we developed two methods to help users achieve better modelling performance from their crosslinking restraints: Jwalk, to estimate solvent accessible distances between crosslinked residues, and MNXL, to assess the quality of the models based on these distances. Results: Here, we present the Jwalk and MNXL webservers, which streamline the process of validating monomeric protein models using restraints from crosslinks. We demonstrate this by using the MNXL server to filter models of varying quality, selecting the most native-like. Availability and implementation: The webserver and source code are freely available from [URL missing] and [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 07 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty366
      Issue No: Vol. 34, No. 20 (2018)
  • CAVER Analyst 2.0: analysis and visualization of channels and tunnels in
           protein structures and molecular dynamics trajectories
    • Authors: Jurcik A; Bednar D, Byska J, et al.
      Pages: 3586 - 3588
      Abstract: Motivation: Studying the transport paths of ligands, solvents, or ions in transmembrane proteins and proteins with buried binding sites is fundamental to the understanding of their biological function. A detailed analysis of the structural features influencing the transport paths is also important for engineering proteins for biomedical and biotechnological applications. Results: CAVER Analyst 2.0 is a software tool for quantitative analysis and real-time visualization of tunnels and channels in static and dynamic structures. This version provides the users with many new functions, including advanced techniques for intuitive visual inspection of the spatiotemporal behavior of tunnels and channels. Novel integrated algorithms allow an efficient analysis and data reduction in large protein structures and molecular dynamics simulations. Availability and implementation: CAVER Analyst 2.0 is a multi-platform standalone Java-based application. Binaries and documentation are freely available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 08 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty386
      Issue No: Vol. 34, No. 20 (2018)
  • SimExTargId: a comprehensive package for real-time LC-MS data acquisition
           and analysis
    • Authors: Edmands W; Hayes J, Rappaport S, et al.
      Pages: 3589 - 3590
      Abstract: Summary: Liquid chromatography mass spectrometry (LC-MS) is the favored method for untargeted metabolomic analysis of small molecules in biofluids. Here we present SimExTargId, an open-source R package for autonomous analysis of metabolomic data and real-time observation of experimental runs. This simultaneous, fully automated and (optionally) multi-threaded package is a wrapper for vendor-independent format conversion (ProteoWizard), xcms- and CAMERA-based peak-picking, MetMSLine-based pre-processing and covariate-based statistical analysis. Users are notified of detrimental instrument drift or errors by email. Also included are two shiny applications: targetId for real-time MS2 target identification, and peakMonitor to monitor targeted metabolites. Availability and implementation: SimExTargId is publicly available under the GNU LGPL v3.0 license at [URL missing], which includes a vignette with example data. SimExTargId should be installed on a dedicated data-processing workstation or server that is networked to the LC-MS platform to facilitate MS1 profiling of metabolomic data. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 22 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty218
      Issue No: Vol. 34, No. 20 (2018)
  • pyABC: distributed, likelihood-free inference
    • Authors: Klinger E; Rickert D, Hasenauer J, et al.
      Pages: 3591 - 3593
      Abstract: Summary: Likelihood-free methods are often required for inference in systems biology. While approximate Bayesian computation (ABC) provides a theoretical solution, its practical application has often been challenging due to its high computational demands. To scale likelihood-free inference to computationally demanding stochastic models, we developed pyABC: a distributed and scalable ABC-Sequential Monte Carlo (ABC-SMC) framework. It implements a scalable, runtime-minimizing parallelization strategy for multi-core and distributed environments, scaling to thousands of cores. The framework is accessible to non-expert users and also enables advanced users to experiment with and custom-implement many options of ABC-SMC schemes, such as acceptance threshold schedules, transition kernels and distance functions, without alteration of pyABC’s source code. pyABC includes a web interface to visualize ongoing and finished ABC-SMC runs and exposes an API for data querying and post-processing. Availability and implementation: pyABC is written in Python 3 and is released under a 3-clause BSD license. The source code is hosted on [URL missing] and the documentation on [URL missing]. It can be installed from the Python Package Index (PyPI). Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 14 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty361
      Issue No: Vol. 34, No. 20 (2018)
  • PANDA-view: an easy-to-use tool for statistical analysis and visualization
           of quantitative proteomics data
    • Authors: Chang C; Xu K, Guo C, et al.
      Pages: 3594 - 3596
      Abstract: Summary: Compared with the numerous software tools developed for identification and quantification of -omics data, there remains a lack of suitable tools for both downstream analysis and data visualization. To help researchers better understand the biological meanings in their -omics data, we present an easy-to-use tool, named PANDA-view, for both statistical analysis and visualization of quantitative proteomics data and other -omics data. PANDA-view contains various kinds of analysis methods such as normalization, missing value imputation, statistical tests, clustering and principal component analysis, as well as the most commonly used data visualization methods including an interactive volcano plot. Additionally, it provides user-friendly interfaces for protein-peptide-spectrum representation of the quantitative proteomics data. Availability and implementation: PANDA-view is freely available at [URL missing]. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Tue, 22 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty408
      Issue No: Vol. 34, No. 20 (2018)
  • Transform-MinER: transforming molecules in enzyme reactions
    • Authors: Tyzack J; Ribeiro A, Borkakoti N, et al.
      Pages: 3597 - 3599
      Abstract: Motivation: One goal of synthetic biology is to make new enzymes to generate new products, but identifying the starting enzymes for further investigation is often elusive and relies on expert knowledge, intensive literature searching and trial and error. Results: We present Transform Molecules in Enzyme Reactions, an online computational tool that transforms query substrate molecules into products using enzyme reactions. The most similar native enzyme reactions for each transformation are found, highlighting those that may be of most interest for enzyme design and directed evolution approaches. Availability and implementation: [URL missing]
      PubDate: Mon, 14 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty394
      Issue No: Vol. 34, No. 20 (2018)
  • Snakemake—a scalable bioinformatics workflow engine
    • Authors: Köster J; Rahmann S.
      Pages: 3600 - 3600
      Abstract: Bioinformatics, Volume 28, Issue 19, 1 October 2012, Pages 2520–2522,
      PubDate: Wed, 16 May 2018 00:00:00 GMT
      DOI: 10.1093/bioinformatics/bty350