Bioinformatics
Journal Prestige (SJR): 6.14
Citation Impact (citeScore): 8
Number of Followers: 283
ISSN (Print) 1367-4803 - ISSN (Online) 1460-2059
Published by Oxford University Press
- Correction to: Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data
First page: btad321
Abstract: This is a correction to: Mohammadamin Edrisi, Monica V Valecha, Sunkara B V Chowdary, Sergio Robledo, Huw A Ogilvie, David Posada, Hamim Zafar, Luay Nakhleh, Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i195–i202, https://doi.org/10.1093/bioinformatics/btac254
PubDate: Tue, 23 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad321
Issue No: Vol. 39, No. 5 (2023)
-
- Correction to: Integrative analysis of individual-level data and high-dimensional summary statistics
First page: btad324
Abstract: This is a correction to: Sheng Fu, Lu Deng, Han Zhang, William Wheeler, Jing Qin, Kai Yu, Integrative analysis of individual-level data and high-dimensional summary statistics, Bioinformatics, Volume 39, Issue 4, April 2023, btad156, https://doi.org/10.1093/bioinformatics/btad156
PubDate: Fri, 19 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad324
Issue No: Vol. 39, No. 5 (2023)
-
- Correction to: wpLogicNet: logic gate and structure inference in gene regulatory networks
First page: btad304
Abstract: This is a correction to: Seyed Amir Malekpour, Maryam Shahdoust, Rosa Aghdam, Mehdi Sadeghi, wpLogicNet: logic gate and structure inference in gene regulatory networks, Bioinformatics, Volume 39, Issue 2, February 2023, https://doi.org/10.1093/bioinformatics/btad072
PubDate: Wed, 17 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad304
Issue No: Vol. 39, No. 5 (2023)
-
- NanoPack2: population-scale evaluation of long-read sequencing data
First page: btad311
Abstract: Summary: Increases in the cohort size in long-read sequencing projects necessitate more efficient software for quality assessment and processing of sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Here, we describe novel tools for summarizing experiments, filtering datasets, and visualizing phased alignment results, as well as updates to the NanoPack software suite. Availability and implementation: The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries built on musl are available for broad compatibility. NanoPlot and NanoComp are written in Python3. Links to the separate tools and their documentation can be found at https://github.com/wdecoster/nanopack. All tools are compatible with Linux, Mac OS, and the MS Windows Subsystem for Linux and are released under the MIT license. The repositories include test data, and the tools are continuously tested using GitHub Actions and can be installed with the conda dependency manager.
PubDate: Fri, 12 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad311
Issue No: Vol. 39, No. 5 (2023)
-
- CscoreTool-M infers 3D sub-compartment probabilities within cell population
First page: btad314
Abstract: Motivation: Computational inference of genome organization based on Hi-C sequencing has greatly aided the understanding of chromatin and nuclear organization in three dimensions (3D). However, existing computational methods fail to address cell population heterogeneity. Here we describe a probabilistic-modeling-based method called CscoreTool-M that infers multiple 3D genome sub-compartments from Hi-C data. Results: The compartment scores inferred using CscoreTool-M represent the probability of a genomic region being located in a specific sub-compartment. Compared to published methods, CscoreTool-M is more accurate in inferring sub-compartments corresponding to both active and repressed chromatin. The compartment scores calculated by CscoreTool-M also help to quantify the levels of heterogeneity in sub-compartment localization within cell populations. By comparing proliferating cells and terminally differentiated non-proliferating cells, we show that the proliferating cells have higher genome organization heterogeneity, which is likely caused by cells at different cell-cycle stages. By analyzing 10 sub-compartments, we found a sub-compartment containing chromatin potentially related to the early-G1 chromatin regions proximal to the nuclear lamina in HCT116 cells, suggesting the method can deconvolve cell cycle stage-specific genome organization among asynchronously dividing cells. Finally, we show that CscoreTool-M can identify sub-compartments that contain genes enriched in housekeeping or cell-type-specific functions. Availability and implementation: https://github.com/scoutzxb/CscoreTool-M.
PubDate: Thu, 11 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad314
Issue No: Vol. 39, No. 5 (2023)
-
- High-quality, customizable heuristics for RNA 3D structure alignment
First page: btad315
Abstract: Motivation: Tertiary structure alignment is one of the main challenges in the computer-aided comparative study of molecular structures. Its aim is to optimally overlay the 3D shapes of two or more molecules in space to find the correspondence between their nucleotides. Alignment is the starting point for most algorithms that assess structural similarity or find common substructures. Thus, it has applications in solving a variety of bioinformatics problems, e.g. in the search for structural patterns, structure clustering, identifying structural redundancy, and evaluating the prediction accuracy of 3D models. To date, several tools have been developed to align 3D structures of RNA. However, most of them are not applicable to arbitrarily large structures and do not allow users to parameterize the optimization algorithm. Results: We present two customizable heuristics for flexible alignment of 3D RNA structures, geometric search (GEOS), and genetic algorithm (GENS). They work in sequence-dependent/independent mode and find the suboptimal alignment of expected quality (below a predefined RMSD threshold). We compare their performance with those of state-of-the-art methods for aligning RNA structures. We show the results of quantitative and qualitative tests run for all of these algorithms on benchmark sets of RNA structures. Availability and implementation: Source codes for both heuristics are hosted at https://github.com/RNApolis/rnahugs.
PubDate: Thu, 11 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad315
Issue No: Vol. 39, No. 5 (2023)
-
- TRASH: Tandem Repeat Annotation and Structural Hierarchy
First page: btad308
Abstract: Motivation: The advent of long-read DNA sequencing is allowing complete assembly of highly repetitive genomic regions for the first time, including the megabase-scale satellite repeat arrays found in many eukaryotic centromeres. The assembly of such repetitive regions creates a need for their de novo annotation, including patterns of higher order repetition. To annotate tandem repeats, methods are required that can be widely applied to diverse genome sequences, without prior knowledge of monomer sequences. Results: Tandem Repeat Annotation and Structural Hierarchy (TRASH) is a tool that identifies and maps tandem repeats in nucleotide sequence, without prior knowledge of repeat composition. TRASH analyses a fasta assembly file, identifies regions occupied by repeats and then precisely maps them and their higher order structures. To demonstrate the applicability and scalability of TRASH for centromere research, we apply our method to the recently published Col-CEN genome of Arabidopsis thaliana and the complete human CHM13 genome. Availability and implementation: TRASH is freely available at: https://github.com/vlothec/TRASH and supported on Linux.
PubDate: Wed, 10 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad308
Issue No: Vol. 39, No. 5 (2023)
-
- Genome mining for anti-CRISPR operons using machine learning
First page: btad309
Abstract: Motivation: Encoded by (pro-)viruses, anti-CRISPR (Acr) proteins inhibit the CRISPR-Cas immune system of their prokaryotic hosts. As a result, Acr proteins can be employed to develop more controllable CRISPR-Cas genome editing tools. Recent studies revealed that known acr genes often coexist with other acr genes and with phage structural genes within the same operon. For example, we found that 47 of 98 known acr genes (or their homologs) coexist in the same operons. None of the current Acr prediction tools have considered this important genomic context feature. We have developed a new software tool, AOminer, to facilitate the improved discovery of new Acrs by fully exploiting the genomic context of known acr genes and their homologs. Results: AOminer is the first machine learning based tool focused on the discovery of Acr operons (AOs). A two-state HMM (hidden Markov model) was trained to learn the conserved genomic context of operons that contain known acr genes or their homologs, and the learnt features could distinguish AOs from non-AOs. AOminer allows automated mining for potential AOs from query genomes or operons. AOminer outperformed all existing Acr prediction tools with an accuracy of 0.85. AOminer will facilitate the discovery of novel anti-CRISPR operons. Availability and implementation: The webserver is available at: http://aca.unl.edu/AOminer/AOminer_APP/. The python program is at: https://github.com/boweny920/AOminer.
PubDate: Tue, 09 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad309
Issue No: Vol. 39, No. 5 (2023)
-
- GTExVisualizer: a web platform for supporting ageing studies
First page: btad303
Abstract: Motivation: Studying the effects of ageing on molecules is an important new topic in life science. Such studies require data, models, algorithms, and tools to elucidate molecular mechanisms. The GTEx (Genotype-Tissue Expression) portal is a web-based data source for retrieving patients’ transcriptomics data annotated with tissue, gender, and age information. It is one of the more complete data sources for studying ageing effects. Nevertheless, it lacks functionalities to query data at the sex/age level, as well as tools for protein interaction studies, thereby limiting ageing studies. As a result, users need to download query results before proceeding to further analysis, such as retrieving the expression of a given gene across different age (or sex) classes in many tissues. Results: We present the GTExVisualizer, a platform to query and analyse GTEx data. This tool contains a web interface able to: (i) graphically represent and study query results; (ii) analyse genes using sex/age expression patterns, also integrated with network-based modules; and (iii) report results as plot-based representations as well as (gene) networks. Finally, it allows the user to obtain basic statistics that show differences in gene expression among sex/age groups. Conclusion: The novelty of GTExVisualizer lies in providing a tool for studying ageing/sex-related effects on molecular processes. Availability and implementation: GTExVisualizer is available at: http://gtexvisualizer.herokuapp.com. The source code and data are available at: https://github.com/UgoLomoio/gtex_visualizer.
PubDate: Mon, 08 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad303
Issue No: Vol. 39, No. 5 (2023)
-
- STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data
First page: btad302
Abstract: Motivation: As the resolution of metagenomic analysis increases, the evolution of microbial genomes in longitudinal metagenomic data has become a research focus. Some software has been developed for the simulation of complex microbial communities at the strain level. However, a tool for simulating within-strain evolutionary signals in longitudinal samples is still lacking. Results: In this study, we introduce STEMSIM, a user-friendly command-line simulator of short-term evolutionary mutations for longitudinal metagenomic data. The input is simulated longitudinal raw sequencing reads of microbial communities or single species. The output is the modified reads with within-strain evolutionary mutations and the relevant information about these mutations. STEMSIM will be of great use for the evaluation of analytic tools that detect short-term evolutionary mutations in metagenomic data. Availability and implementation: STEMSIM and its tutorial are freely available online at https://github.com/BoyanZhou/STEMSim.
PubDate: Mon, 08 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad302
Issue No: Vol. 39, No. 5 (2023)
-
- Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph transformer
First page: btad298
Abstract: Motivation: State-of-the-art protein structure prediction methods such as AlphaFold are being widely used to predict structures of uncharacterized proteins in biomedical research. There is a significant need to further improve the quality and nativeness of the predicted structures to enhance their usability. In this work, we develop ATOMRefine, a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses an SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph. Results: The method is first trained and tested on the structural models in AlphaFoldDB whose experimental structures are known, and then blindly tested on 69 CASP14 regular targets and 7 CASP14 refinement targets. ATOMRefine improves the quality of both backbone atoms and all-atom conformation of the initial structural models generated by AlphaFold. It also performs better than two state-of-the-art refinement methods in multiple evaluation metrics, including an all-atom model quality score: the MolProbity score, which is based on the analysis of all-atom contacts, bond length, atom clashes, torsion angles, and side-chain rotamers. As ATOMRefine can refine a protein structure quickly, it provides a viable, fast solution for improving protein geometry and fixing structural errors of predicted structures through direct coordinate refinement. Availability and implementation: The source code of ATOMRefine is available in the GitHub repository (https://github.com/BioinfoMachineLearning/ATOMRefine). All the required data for training and testing are available at https://doi.org/10.5281/zenodo.6944368.
PubDate: Fri, 05 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad298
Issue No: Vol. 39, No. 5 (2023)
-
- ROptimus: a parallel general-purpose adaptive optimization engine
First page: btad292
Abstract: Motivation: Various computational biology calculations require a probabilistic optimization protocol to determine the parameters that capture the system at a desired state in the configurational space. Many existing methods excel in certain scenarios, but fail in others due, in part, to an inefficient exploration of the parameter space and easy trapping in local minima. Here, we developed a general-purpose optimization engine in R that can be plugged into any modelling initiative, simple or complex, through a few lucid interfacing functions, to perform a seamless optimization with rigorous parameter sampling. Results: ROptimus features simulated annealing and replica exchange implementations equipped with adaptive thermoregulation to drive the Monte Carlo optimization process in a flexible manner, through constrained acceptance frequency but unconstrained adaptive pseudo temperature regimens. We exemplify the applicability of our R optimizer to a diverse set of problems spanning data analyses and computational biology tasks. Availability and implementation: ROptimus is written and implemented in R, and is freely available from CRAN (http://cran.r-project.org/web/packages/ROptimus/index.html) and GitHub (http://github.com/SahakyanLab/ROptimus).
PubDate: Thu, 04 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad292
Issue No: Vol. 39, No. 5 (2023)
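The idea of annealing with a constrained acceptance frequency and a free pseudo temperature, as described in the ROptimus abstract, can be illustrated with a minimal Python sketch. This is not ROptimus code or its API: the function `adaptive_annealing`, its parameters (`target_accept`, `window`, `t_step`), and the multiplicative temperature update are all assumptions made for illustration.

```python
import math
import random


def adaptive_annealing(energy, propose, x0, n_iter=5000, target_accept=0.3,
                       window=50, t_init=1.0, t_step=1.1, seed=0):
    """Minimise `energy` with Metropolis moves.

    Every `window` steps the pseudo temperature is nudged so that the
    observed acceptance ratio tracks `target_accept`. The update rule is a
    hypothetical stand-in, not the rule used by ROptimus.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temp, accepted = t_init, 0
    for step in range(1, n_iter + 1):
        cand = propose(x, rng)
        e_cand = energy(cand)
        delta = e_cand - e
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x, e = cand, e_cand
            accepted += 1
            if e < best_e:
                best_x, best_e = x, e
        if step % window == 0:
            ratio = accepted / window
            # Too few acceptances: heat up. Too many: cool down.
            temp = temp * t_step if ratio < target_accept else temp / t_step
            accepted = 0
    return best_x, best_e


def quadratic(x):
    # Toy objective with its minimum at x = 3.
    return (x - 3.0) ** 2


def gaussian_step(x, rng):
    # Toy proposal: a small Gaussian perturbation of the current state.
    return x + rng.gauss(0.0, 0.5)


best_x, best_e = adaptive_annealing(quadratic, gaussian_step, x0=-10.0)
```

The temperature itself is unconstrained here; only the acceptance ratio is steered, which is one plausible reading of "constrained acceptance frequency but unconstrained adaptive pseudo temperature".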
-
- HAMPLE: deciphering TF-DNA binding mechanism in different cellular environments by characterizing higher-order nucleotide dependency
First page: btad299
Abstract: Motivation: Transcription factors (TFs) bind to conserved DNA binding sites in different cellular environments and developmental stages through physical interaction with interdependent nucleotides. However, systematic computational characterization of the relationship between higher-order nucleotide dependency and TF-DNA binding mechanism in diverse cell types remains challenging. Results: Here, we propose a novel multi-task learning framework, HAMPLE, to simultaneously predict TF binding sites (TFBS) in distinct cell types by characterizing higher-order nucleotide dependencies. Specifically, HAMPLE first represents a DNA sequence through three higher-order nucleotide dependencies, including k-mer encoding, DNA shape and histone modification. Then, HAMPLE uses the customized gate control and the channel attention convolutional architecture to further capture cell-type-specific and cell-type-shared DNA binding motifs and epigenomic languages. Finally, HAMPLE exploits the joint loss function to optimize the TFBS prediction for different cell types in an end-to-end manner. Extensive experimental results on seven datasets demonstrate that HAMPLE significantly outperforms the state-of-the-art approaches in terms of auROC. In addition, feature importance analysis illustrates that k-mer encoding, DNA shape, and histone modification have predictive power for TF-DNA binding in different cellular environments and are complementary to each other. Furthermore, ablation study and interpretability analysis validate the effectiveness of the customized gate control and the channel attention convolutional architecture in characterizing higher-order nucleotide dependencies. Availability and implementation: The source code is available at https://github.com/ZhangLab312/Hample.
PubDate: Thu, 04 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad299
Issue No: Vol. 39, No. 5 (2023)
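The HAMPLE abstract does not specify its k-mer encoding, so the following is only a generic sketch of what a k-mer representation of a DNA sequence can look like: overlapping k-mers are mapped to integer indices in a fixed vocabulary. The function name `kmer_index_encoding` and the choice of `-1` for k-mers containing ambiguous bases are assumptions, not details taken from the paper.

```python
from itertools import product


def kmer_index_encoding(seq, k=3, alphabet="ACGT"):
    """Map each overlapping k-mer of `seq` to its index in the k-mer vocabulary.

    K-mers containing characters outside `alphabet` (e.g. N) are encoded
    as -1. This is a generic illustration, not HAMPLE's own encoder.
    """
    vocab = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], -1) for i in range(len(seq) - k + 1)]


encoded = kmer_index_encoding("ACGT", k=2)      # [1, 6, 11]
with_gap = kmer_index_encoding("ACNT", k=2)     # [1, -1, -1]
```

Because neighbouring k-mers overlap by k-1 bases, such an encoding exposes dependencies between adjacent nucleotides that a per-base one-hot encoding leaves implicit.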
-
- ppBAM: ProteinPaint BAM track for read alignment visualization and variant genotyping
First page: btad300
Abstract: Summary: ProteinPaint BAM track (ppBAM) is designed to assist variant review for cancer research and clinical genomics. With performant server-side computing and rendering, ppBAM supports on-the-fly variant genotyping of thousands of reads using Smith–Waterman alignment. To better visualize support for complex variants, reads are realigned against the mutated reference sequence using ClustalO. ppBAM also supports the BAM slicing API of the NCI Genomic Data Commons (GDC) portal, letting researchers conveniently examine genomic details of vast amounts of cancer sequencing data and reinterpret variant calls. Availability and implementation: BAM track examples, tutorial, and GDC file access links are available at https://proteinpaint.stjude.org/bam/. Source code is available at https://github.com/stjude/proteinpaint.
PubDate: Thu, 04 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad300
Issue No: Vol. 39, No. 5 (2023)
-
- Kimma: flexible linear mixed effects modeling with kinship covariance for RNA-seq data
First page: btad279
Abstract: Motivation: The identification of differentially expressed genes (DEGs) from transcriptomic datasets is a major avenue of research across diverse disciplines. However, current bioinformatic tools do not support covariance matrices in DEG modeling. Here, we introduce kimma (Kinship In Mixed Model Analysis), an open-source R package for flexible linear mixed effects modeling including covariates, weights, random effects, covariance matrices, and fit metrics. Results: In simulated datasets, kimma detects DEGs with similar specificity, sensitivity, and computational time as limma unpaired and dream paired models. Unlike other software, kimma supports covariance matrices as well as fit metrics like Akaike information criterion (AIC). Utilizing genetic kinship covariance, kimma revealed that kinship impacts model fit and DEG detection in a related cohort. Thus, kimma equals or outcompetes current DEG pipelines in sensitivity, computational time, and model complexity. Availability and implementation: Kimma is freely available on GitHub https://github.com/BIGslu/kimma with an instructional vignette at https://bigslu.github.io/kimma_vignette/kimma_vignette.html.
PubDate: Thu, 04 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad279
Issue No: Vol. 39, No. 5 (2023)
-
- PascalX: a Python library for GWAS gene and pathway enrichment tests
First page: btad296
Abstract: Summary: ‘PascalX’ is a Python library providing fast and accurate tools for mapping SNP-wise GWAS summary statistics. Specifically, it allows for scoring genes and annotated gene sets for enrichment signals based on data from both single GWAS and pairs of GWAS. The gene scores take into account the correlation pattern between SNPs. They are based on the cumulative distribution function of a linear combination of χ²-distributed random variables, which can be calculated either approximately or exactly to high precision. Acceleration via multithreading and GPU is supported. The code of PascalX is fully open source and well suited as a base for method development in the GWAS enrichment test context. Availability and implementation: The source code is available at https://github.com/BergmannLab/PascalX and archived under doi://10.5281/zenodo.4429922. A user manual with usage examples is available at https://bergmannlab.github.io/PascalX/.
PubDate: Wed, 03 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad296
Issue No: Vol. 39, No. 5 (2023)
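To make the quantity behind the PascalX gene scores concrete, the sketch below estimates the cumulative distribution function of a weighted sum of independent χ² variables (one degree of freedom each) by plain Monte Carlo sampling. This is a deliberately simple stand-in: PascalX is described as computing the function approximately or exactly to high precision, and none of its actual algorithms or API are reproduced here. The function name `weighted_chi2_cdf_mc` and the example weights are hypothetical.

```python
import math
import random


def weighted_chi2_cdf_mc(weights, x, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(sum_i w_i * Z_i**2 <= x) with Z_i ~ N(0, 1).

    Each Z_i**2 is chi-squared with one degree of freedom, so the sum is a
    linear combination of chi-squared variables. Sampling is used only for
    illustration; it is far less precise than dedicated numerical methods.
    """
    rng = random.Random(seed)
    below = 0
    for _ in range(n_samples):
        total = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if total <= x:
            below += 1
    return below / n_samples


# Sanity check 1: a single unit weight is an ordinary chi-squared(1),
# whose 95th percentile is about 3.841.
cdf_single = weighted_chi2_cdf_mc([1.0], 3.841)

# Sanity check 2: two unit weights give chi-squared(2), whose CDF is
# 1 - exp(-x / 2); at x = 2 ln 2 this equals 0.5.
cdf_pair = weighted_chi2_cdf_mc([1.0, 1.0], 2.0 * math.log(2.0))

# Illustrative unequal weights (arbitrary numbers, not derived from data).
cdf_weighted = weighted_chi2_cdf_mc([2.0, 1.0, 0.5], 6.0)
```

A gene-level p-value would correspond to one minus this function evaluated at the observed test statistic; resolving very small p-values by sampling is impractical, which is why precise numerical evaluation matters in this setting.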
-
- pyGOMoDo: GPCRs modeling and docking with python
First page: btad294
Abstract: Motivation: We present pyGOMoDo, a Python library to perform homology modeling and docking, specifically designed for human GPCRs. pyGOMoDo is a Python wrapper around the updated functionalities of the GOMoDo web server (https://molsim.sci.univr.it/gomodo). It was developed with Jupyter notebooks in mind, where users can create their own protocols for modeling and docking of GPCRs. In this article, we focus on the internal structure and general capabilities of pyGOMoDo and on how it can be useful for carrying out structural biology studies of GPCRs. Results: The source code is freely available at https://github.com/rribeiro-sci/pygomodo under the Apache 2.0 license. Tutorial notebooks containing minimal working examples can be found at https://github.com/rribeiro-sci/pygomodo/tree/main/examples.
PubDate: Wed, 03 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad294
Issue No: Vol. 39, No. 5 (2023)
-
- Signed Distance Correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis
First page: btad210
Abstract: Motivation: There is a need for easily accessible implementations that measure the strength of both linear and non-linear relationships between metabolites in biological systems as an approach for data-driven network development. While multiple tools implement linear Pearson and Spearman methods, there are no such tools that assess distance correlation. Results: We present here SIgned Distance COrrelation (SiDCo). SiDCo is a GUI platform for calculation of distance correlation in omics data, measuring linear and non-linear dependencies between variables, as well as correlation between vectors of different lengths, e.g. different sample sizes. By combining the sign of the overall trend from Pearson’s correlation with distance correlation values, we further provide a novel “signed distance correlation” of particular use in metabolomic and lipidomic analyses. Distance correlations can be selected as one-to-one or one-to-all correlations, showing relationships between each feature and all other features one at a time or in combination. Additionally, we implement “partial distance correlation,” calculated using the Gaussian Graphical model approach adapted to distance covariance. Our platform provides an easy-to-use software implementation that can be applied to the investigation of any dataset. Availability and implementation: The SiDCo software application is freely available at https://complimet.ca/sidco. Supplementary help pages are provided at https://complimet.ca/sidco. Supplementary Material: The Supplementary Material shows an example application of SiDCo in metabolomics.
PubDate: Wed, 03 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad210
Issue No: Vol. 39, No. 5 (2023)
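The "signed distance correlation" described in the SiDCo abstract combines the sign of Pearson's correlation with the magnitude of the distance correlation. A minimal pure-Python sketch for two equal-length one-dimensional samples follows. It uses the standard sample distance correlation (double-centred pairwise distance matrices); the function names, and the choice to treat a Pearson coefficient of exactly zero as positive, are assumptions and not taken from the SiDCo implementation.

```python
import math


def _centered_distances(v):
    """Pairwise |v_i - v_j| matrix, double-centred by row, column and grand mean."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]          # symmetric, so row means = column means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D samples (0 to 1)."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either sample is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def signed_distance_correlation(x, y):
    """Distance correlation carrying the sign of the Pearson trend (assumed rule)."""
    sign = -1.0 if pearson(x, y) < 0 else 1.0
    return sign * distance_correlation(x, y)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
rising = signed_distance_correlation(xs, [2 * v + 1 for v in xs])
falling = signed_distance_correlation(xs, [-v for v in xs])
parabola = [v * v for v in xs]
```

For the parabola, Pearson's coefficient is zero while the distance correlation is clearly positive, which illustrates why distance correlation is useful for detecting non-linear dependence.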
-
- PepGM: a probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores
First page: btad289
Abstract: Motivation: Inferring taxonomy in mass spectrometry-based shotgun proteomics is a complex task. In multi-species or viral samples of unknown taxonomic origin, the presence of proteins and corresponding taxa must be inferred from a list of identified peptides, which is often complicated by protein homology: many proteins share peptides not only within a taxon but also between taxa. However, correct taxonomic inference is crucial when identifying different viral strains with high sequence homology, considering, e.g., the different epidemiological characteristics of the various strains of severe acute respiratory syndrome-related coronavirus-2. Additionally, many viruses mutate frequently, further complicating the correct identification of viral proteomic samples. Results: We present PepGM, a probabilistic graphical model for the taxonomic assignment of virus proteomic samples with strain-level resolution and associated confidence scores. PepGM combines the results of a standard proteomic database search algorithm with belief propagation to calculate the marginal distributions, and thus confidence scores, for potential taxonomic assignments. We demonstrate the performance of PepGM using several publicly available virus proteomic datasets, showing its strain-level resolution performance. In two out of eight cases, the taxonomic assignments were only correct on the species level, which PepGM clearly indicates by lower confidence scores. Availability and implementation: PepGM is written in Python and embedded into a Snakemake workflow. It is available at https://github.com/BAMeScience/PepGM.
PubDate: Tue, 02 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad289
Issue No: Vol. 39, No. 5 (2023)
-
- BUSZ: compressed BUS files
First page: btad295
Abstract: Summary: We describe a compression scheme for BUS files and an implementation of the algorithm in the BUStools software. Our compression algorithm yields smaller file sizes than gzip, at significantly faster compression and decompression speeds. We evaluated our algorithm on 533 BUS files from scRNA-seq experiments with a total size of 1 TB. Our compression is 2.2× faster than the fastest gzip option, 35% slower than the fastest zstd option, and results in 1.5× smaller files than both methods. This amounts to an 8.3× reduction in the file size, resulting in a compressed size of 122 GB for the dataset. Availability and implementation: A complete description of the format is available at https://github.com/BUStools/BUSZ-format and an implementation at https://github.com/BUStools/bustools. The code to reproduce the results of this article is available at https://github.com/pmelsted/BUSZ_paper.
PubDate: Tue, 02 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad295
Issue No: Vol. 39, No. 5 (2023)
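As a quick sanity check, the size figures quoted in the BUSZ abstract can be related to each other with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are ours, and treating gzip and zstd output as exactly 1.5× larger than BUSZ output is an assumption made for illustration, since the abstract gives only a rounded ratio.

```python
# Figures quoted in the abstract (rounded values as reported there).
compressed_gb = 122.0       # reported BUSZ-compressed size of the dataset
reduction_factor = 8.3      # reported overall reduction in file size
ratio_vs_gzip_zstd = 1.5    # reported size advantage over gzip and zstd

# Uncompressed size implied by the compressed size and the reduction factor.
implied_original_gb = compressed_gb * reduction_factor

# Assumption: gzip/zstd output is exactly 1.5x the BUSZ output.
implied_gzip_zstd_gb = compressed_gb * ratio_vs_gzip_zstd

print(f"implied original size:  {implied_original_gb:.1f} GB")
print(f"implied gzip/zstd size: {implied_gzip_zstd_gb:.1f} GB")
```

The implied uncompressed size comes out at roughly 1 TB, in line with the dataset size stated in the abstract.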
-
- VirPipe: an easy-to-use and customizable pipeline for detecting viral
genomes from Nanopore sequencing-
First page: btad293
Abstract: Summary: Detection and analysis of viral genomes with Nanopore sequencing has shown great promise in the surveillance of pathogen outbreaks. However, the number of virus detection pipelines supporting Nanopore sequencing is very limited. Here, we present VirPipe, a new pipeline for the detection of viral genomes from Nanopore or Illumina sequencing input featuring streamlined installation and customization. Availability and implementation: VirPipe source code and documentation are freely available for download at https://github.com/KijinKims/VirPipe, implemented in Python and Nextflow.
PubDate: Tue, 02 May 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad293
Issue No: Vol. 39, No. 5 (2023)
-
- Predicting allosteric pockets in protein biological assemblages
-
First page: btad275
Abstract: Motivation: Allostery enables changes to the dynamic behavior of a protein at distant positions induced by binding. Here, we present APOP, a new allosteric pocket prediction method, which perturbs the pockets formed in the structure by stiffening pairwise interactions in the elastic network across the pocket, to emulate ligand binding. Ranking the pockets based on the shifts in the global mode frequencies, as well as their mean local hydrophobicities, leads to high prediction success when tested on a dataset of allosteric proteins, composed of both monomers and multimeric assemblages. Results: Out of the 104 test cases, APOP predicts known allosteric pockets for 92 within the top 3 ranks out of the multiple pockets available in the protein. In addition, we demonstrate that APOP can also find new alternative allosteric pockets in proteins. Particularly interesting findings are the discovery of previously overlooked large pockets located in the centers of many protein biological assemblages; binding of ligands at these sites would likely be particularly effective in changing the protein’s global dynamics. Availability and implementation: APOP is freely available as open-source code (https://github.com/Ambuj-UF/APOP) and as a web server at https://apop.bb.iastate.edu/.
PubDate: Fri, 28 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad275
Issue No: Vol. 39, No. 5 (2023)
-
- CNV-ClinViewer: enhancing the clinical interpretation of large copy-number
variants online-
First page: btad290
Abstract: Motivation: Pathogenic copy-number variants (CNVs) can cause a heterogeneous spectrum of rare and severe disorders. However, most CNVs are benign and are part of natural variation in human genomes. CNV pathogenicity classification, genotype–phenotype analyses, and therapeutic target identification are challenging and time-consuming tasks that require the integration and analysis of information from multiple scattered sources by experts. Results: Here, we introduce the CNV-ClinViewer, an open-source web application for clinical evaluation and visual exploration of CNVs. The application enables real-time interactive exploration of large CNV datasets in a user-friendly interface and facilitates semi-automated clinical CNV interpretation following the ACMG guidelines by integrating the ClassifCNV tool. In combination with clinical judgment, the application enables clinicians and researchers to formulate novel hypotheses and guide their decision-making process. Consequently, the CNV-ClinViewer enhances patient care for clinical investigators and translational genomic research for basic scientists. Availability and implementation: The web application is freely available at https://cnv-ClinViewer.broadinstitute.org and the open-source code can be found at https://github.com/LalResearchGroup/CNV-clinviewer.
PubDate: Thu, 27 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad290
Issue No: Vol. 39, No. 5 (2023)
-
- A maximum kernel-based association test to detect the pleiotropic genetic
effects on multiple phenotypes-
First page: btad291
Abstract: Motivation: Testing the association of multiple phenotypes with a set of genetic variants simultaneously, rather than analyzing one trait at a time, is receiving increasing attention for its high statistical power and straightforward interpretation of pleiotropic effects. The kernel-based association test (KAT), being free of data dimensions and structures, has proven to be a good alternative method for genetic association analysis with multiple phenotypes. However, KAT suffers from substantial power loss when multiple phenotypes have moderate to strong correlations. To handle this issue, we propose a maximum KAT (MaxKAT) and suggest using the generalized extreme value distribution to calculate its statistical significance under the null hypothesis. Results: We show that MaxKAT reduces computational intensity greatly while maintaining high accuracy. Extensive simulations demonstrate that MaxKAT can properly control type I error rates and obtain remarkably higher power than KAT under most of the considered scenarios. Application to a porcine dataset used in biomedical experiments of human disease further illustrates its practical utility. Availability and implementation: The R package MaxKAT that implements the proposed method is available on GitHub at https://github.com/WangJJ-xrk/MaxKAT.
PubDate: Thu, 27 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad291
Issue No: Vol. 39, No. 5 (2023)
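The MaxKAT abstract mentions using the generalized extreme value (GEV) distribution to assess the significance of a maximum statistic. The sketch below shows only the generic upper-tail calculation for a GEV with location `mu`, scale `sigma`, and shape `xi`. It is not the MaxKAT R package: the function names, the example statistics, and the parameter values are all hypothetical, and the abstract does not say how the authors estimate the GEV parameters.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """CDF of the generalized extreme value distribution at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Shape 0 is the Gumbel limit.
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0:
        # x lies outside the support: below the lower bound when xi > 0,
        # above the upper bound when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gev_pvalue(stat, mu, sigma, xi):
    """Upper-tail probability P(M >= stat) under a GEV null for the maximum M."""
    return 1.0 - gev_cdf(stat, mu, sigma, xi)

# Hypothetical per-kernel test statistics and hypothetical null parameters.
kat_stats = [1.2, 3.4, 2.1]
max_stat = max(kat_stats)
p_value = gev_pvalue(max_stat, mu=2.0, sigma=1.0, xi=0.1)
print(f"max statistic = {max_stat}, GEV p-value = {p_value:.4f}")
```

In practice the null parameters would have to be estimated, for example from permutation or simulation replicates of the maximum statistic; that step is deliberately left out here.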
-
- DeepMicroGen: a generative adversarial network-based method for
longitudinal microbiome data imputation-
First page: btad286
Abstract: Motivation: The human microbiome, which growing evidence links to various diseases, has a profound impact on human health. Since changes in the composition of the microbiome across time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods. Results: This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies, by providing imputation for an incomplete longitudinal dataset used to train the classifier. Availability and implementation: DeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.
PubDate: Wed, 26 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad286
Issue No: Vol. 39, No. 5 (2023)
-
- twas_sim, a Python-based tool for simulation and power analysis of
transcriptome-wide association analysis-
First page: btad288
Abstract: Summary: Genome-wide association studies (GWASs) have identified numerous genetic variants associated with complex disease risk; however, most of these associations are non-coding, complicating the identification of their proximal target genes. Transcriptome-wide association studies (TWASs) have been proposed to address this gap by integrating expression quantitative trait loci (eQTL) data with GWAS data. Numerous methodological advancements have been made for TWAS, yet each approach requires ad hoc simulations to demonstrate feasibility. Here, we present twas_sim, a computationally scalable and easily extendable tool for simplified performance evaluation and power analysis for TWAS methods. Availability and implementation: Software and documentation are available at https://github.com/mancusolab/twas_sim.
PubDate: Wed, 26 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad288
Issue No: Vol. 39, No. 5 (2023)
-
- Fixing molecular complexes in BioPAX standards to enrich interactions and
detect redundancies using semantic web technologies-
First page: btad257
Abstract: Motivation: Molecular complexes play a major role in the regulation of biological pathways. The Biological Pathway Exchange format (BioPAX) facilitates the integration of data sources describing interactions, some of which involve complexes. The BioPAX specification explicitly prevents complexes from having any component that is another complex (unless this component is a black-box complex whose composition is unknown). However, we observed that the well-curated Reactome pathway database contains such recursive complexes of complexes. We propose reproducible and semantically rich SPARQL queries for identifying and fixing invalid complexes in BioPAX databases, and evaluate the consequences of fixing these nonconformities in the Reactome database. Results: For the Homo sapiens version of Reactome, we identify 5833 recursively defined complexes out of the 14 987 complexes (39%). This situation is not specific to the human dataset, as all tested species of Reactome exhibit between 30% (Plasmodium falciparum) and 40% (Sus scrofa, Bos taurus, Canis familiaris, and Gallus gallus) of recursive complexes. As an additional consequence, the procedure also allows the detection of complex redundancies. Overall, this method improves the conformity and the automated analysis of the graph by repairing the topology of the complexes in the graph. This will allow further reasoning methods to be applied to more consistent data. Availability and implementation: We provide a Jupyter notebook detailing the analysis at https://github.com/cjuigne/non_conformities_detection_biopax.
PubDate: Tue, 25 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad257
Issue No: Vol. 39, No. 5 (2023)
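The idea of detecting and repairing recursive complexes can be illustrated on a toy example. The authors work with SPARQL queries over BioPAX RDF data; the sketch below instead uses a plain Python dictionary that maps each complex to its components, so it is a simplified stand-in and not their procedure. All identifiers (`C1`, `P1`, `BB`, ...) are hypothetical, a complex with an empty component list is assumed to represent a black-box complex, and stoichiometry is ignored.

```python
def find_recursive_complexes(components):
    """Return complexes having at least one component that is itself a
    complex with known composition (black-box complexes are allowed)."""
    return {
        cid
        for cid, parts in components.items()
        if any(components.get(part) for part in parts)
    }

def flatten(cid, components, _seen=frozenset()):
    """Expand nested complexes into their leaf components."""
    if cid in _seen:
        raise ValueError(f"cyclic complex definition through {cid}")
    leaves = set()
    for part in components[cid]:
        if components.get(part):
            # Nested complex with known composition: expand it.
            leaves |= flatten(part, components, _seen | {cid})
        else:
            # Protein, small molecule, or black-box complex: keep as a leaf.
            leaves.add(part)
    return leaves

def redundant_groups(components):
    """Group complexes that share the same flattened leaf set."""
    by_leaves = {}
    for cid, parts in components.items():
        if parts:  # skip black-box complexes
            key = frozenset(flatten(cid, components))
            by_leaves.setdefault(key, set()).add(cid)
    return [group for group in by_leaves.values() if len(group) > 1]

# Hypothetical toy data.
components = {
    "C1": ["P1", "P2"],
    "C2": ["C1", "P3"],        # recursive: contains complex C1
    "C3": ["P1", "P2", "P3"],  # same leaves as the flattened C2
    "BB": [],                  # black-box complex, composition unknown
    "C4": ["BB", "P4"],        # allowed: its only complex component is a black box
}

print(find_recursive_complexes(components))
print(flatten("C2", components))
print(redundant_groups(components))

# Share of recursive complexes reported for Homo sapiens in the abstract.
print(round(100 * 5833 / 14987), "%")
```

Flattening `C2` gives it the same leaf set as `C3`, which is how repairing nested complexes can also expose redundant ones, as the abstract notes.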
-
- pyInfinityFlow: optimized imputation and analysis of high-dimensional flow
cytometry data for millions of cells-
First page: btad287
Abstract: Motivation: While conventional flow cytometry is limited to dozens of markers, new experimental and computational strategies, such as Infinity Flow, allow for the generation and imputation of hundreds of cell surface protein markers in millions of cells. Here, we describe an end-to-end analysis workflow for Infinity Flow data in Python. Results: pyInfinityFlow enables the efficient analysis of millions of cells, without down-sampling, through direct integration with well-established Python packages for single-cell genomics analysis. pyInfinityFlow accurately identifies both common and extremely rare cell populations which are challenging to define from single-cell genomics studies alone. We demonstrate that this workflow can nominate novel markers to design new flow cytometry gating strategies for predicted cell populations. pyInfinityFlow can be extended to diverse cell discovery analyses with flexibility to adapt to diverse Infinity Flow experimental designs. Availability and implementation: pyInfinityFlow is freely available on GitHub (https://github.com/KyleFerchen/pyInfinityFlow) and on PyPI (https://pypi.org/project/pyInfinityFlow/). Package documentation with tutorials on a test dataset is available on Read the Docs (pyinfinityflow.readthedocs.io). The scripts and data for reproducing the results are available at https://github.com/KyleFerchen/pyInfinityFlow/tree/main/analysis_scripts, along with the raw flow cytometry input data.
PubDate: Tue, 25 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad287
Issue No: Vol. 39, No. 5 (2023)
-
- epiTCR: a highly sensitive predictor for TCR–peptide binding
-
First page: btad284
Abstract: Motivation: Predicting the binding between T-cell receptor (TCR) and peptide presented by human leucocyte antigen molecule is a highly challenging task and a key bottleneck in the development of immunotherapy. Existing prediction tools, despite exhibiting good performance on the datasets they were built with, suffer from low true positive rates when used to predict epitopes capable of eliciting T-cell responses in patients. Therefore, an improved tool for TCR–peptide prediction built upon a large dataset combining existing publicly available data is still needed. Results: We collected data from five public databases (IEDB, TBAdb, VDJdb, McPAS-TCR, and 10X) to form a dataset of >3 million TCR–peptide pairs, 3.27% of which were binding interactions. We proposed epiTCR, a Random Forest-based method dedicated to predicting the TCR–peptide interactions. epiTCR used simple input of TCR CDR3β sequences and antigen sequences, which are encoded by flattened BLOSUM62. epiTCR achieved an area under the curve of 0.98 and a higher sensitivity (0.94) than other existing tools (NetTCR, Imrex, ATM-TCR, and pMTnet), while maintaining comparable prediction specificity (0.9). We identified seven epitopes that contributed to 98.67% of false positives predicted by epiTCR and exerted similar effects on other tools. We also demonstrated a considerable influence of peptide sequences on prediction, highlighting the need for more diverse peptides in a more balanced dataset. In conclusion, epiTCR is among the best-performing tools, thanks to the use of combined data from public sources, and its use will contribute to the quest to identify neoantigens for precision cancer immunotherapy. Availability and implementation: epiTCR is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR).
PubDate: Mon, 24 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad284
Issue No: Vol. 39, No. 5 (2023)
-
- DIGGER-Bac: prediction of seed regions for high-fidelity construction of
synthetic small RNAs in bacteria-
First page: btad285
Abstract: Summary: Synthetic small RNAs (sRNAs) are gaining increasing attention in the field of synthetic biology and bioengineering for efficient post-transcriptional regulation of gene expression. However, the optimal design of synthetic sRNAs is challenging because alterations may impair functions or off-target effects can arise. Here, we introduce DIGGER-Bac, a toolbox for Design and Identification of seed regions for Golden Gate assembly and Expression of synthetic sRNAs in Bacteria. The SEEDling tool predicts optimal sRNA seed regions in combination with user-defined sRNA scaffolds for efficient regulation of specified mRNA targets. Results are passed on to the G-GArden tool, which assists with primer design for high-fidelity Golden Gate assembly of the desired synthetic sRNA constructs.
PubDate: Sat, 22 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad285
Issue No: Vol. 39, No. 5 (2023)
-
- Digital PCR cluster predictor: a universal R-package and shiny app for the
automated analysis of multiplex digital PCR data-
First page: btad282
Abstract: Summary: Digital polymerase chain reaction (dPCR) is an emerging technology that enables accurate and sensitive quantification of nucleic acids. Most available dPCR systems have two-channel optics, with ad hoc software limited to the analysis of single and duplex assays. Although multiplexing strategies have been developed, variable assay designs, differing dPCR systems, and the analysis of low DNA input data have hindered a universal automated clustering approach. To overcome these issues, we developed dPCR Cluster Predictor (dPCP), an R package and a Shiny app for automated analysis of up to 4-plex dPCR data. dPCP can analyse and visualize data generated by multiple dPCR systems, carrying out accurate and fast clustering that is not influenced by the amount and integrity of the input nucleic acids. With the companion Shiny app, the functionalities of dPCP can be accessed through a web browser.
PubDate: Sat, 22 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad282
Issue No: Vol. 39, No. 5 (2023)
-
- ICAT: a novel algorithm to robustly identify cell states following
perturbations in single-cell transcriptomes-
First page: btad278
Abstract: Motivation: The detection of distinct cellular identities is central to the analysis of single-cell RNA sequencing (scRNA-seq) experiments. However, in perturbation experiments, current methods typically fail to correctly match cell states between conditions or erroneously remove population substructure. Here, we present the novel, unsupervised algorithm Identify Cell states Across Treatments (ICAT) that employs self-supervised feature weighting and control-guided clustering to accurately resolve cell states across heterogeneous conditions. Results: Using simulated and real datasets, we show ICAT is superior in identifying and resolving cell states compared with current integration workflows. While requiring no a priori knowledge of extant cell states or discriminatory marker genes, ICAT is robust to low signal strength, high perturbation severity, and disparate cell type proportions. We empirically validate ICAT in a developmental model and find that only ICAT identifies a perturbation-unique cellular response. Taken together, our results demonstrate that ICAT offers a significant improvement in defining cellular responses to perturbation in scRNA-seq data. Availability and implementation: https://github.com/BradhamLab/icat.
PubDate: Sat, 22 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad278
Issue No: Vol. 39, No. 5 (2023)
-
- HOTSPOT: hierarchical host prediction for assembled plasmid contigs with
transformer-
First page: btad283
Abstract: Motivation: As prevalent extrachromosomal replicons in many bacteria, plasmids play an essential role in their hosts’ evolution and adaptation. The host range of a plasmid refers to the taxonomic range of bacteria in which it can replicate and thrive. Understanding host ranges of plasmids sheds light on the roles of plasmids in bacterial evolution and adaptation. Metagenomic sequencing has become a major means to obtain new plasmids and derive their hosts. However, host prediction for assembled plasmid contigs still needs to tackle several challenges: different sequence compositions and copy numbers between plasmids and the hosts, high diversity in plasmids, and limited plasmid annotations. Existing tools have not yet achieved an ideal tradeoff between sensitivity and precision on metagenomic assembled contigs. Results: In this work, we construct a hierarchical classification tool named HOTSPOT, whose backbone is a phylogenetic tree of the bacterial hosts from phylum to species. By incorporating the state-of-the-art language model, Transformer, in each node’s taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs. We rigorously tested HOTSPOT on multiple datasets, including RefSeq complete plasmids, artificial contigs, simulated metagenomic data, mock metagenomic data, the Hi-C dataset, and the CAMI2 marine dataset. All experiments show that HOTSPOT outperforms other popular methods. Availability and implementation: The source code of HOTSPOT is available at https://github.com/Orin-beep/HOTSPOT.
PubDate: Sat, 22 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad283
Issue No: Vol. 39, No. 5 (2023)
-
- Predicting the pathogenicity of missense variants using features derived
from AlphaFold2-
First page: btad280
Abstract: Motivation: Missense variants are a frequent class of variation within the coding genome, and some of them cause Mendelian diseases. Despite advances in computational prediction, classifying missense variants as pathogenic or benign remains a major challenge in the context of personalized medicine. Recently, the structure of the human proteome was derived with unprecedented accuracy using the artificial intelligence system AlphaFold2. This raises the question of whether AlphaFold2 wild-type structures can improve the accuracy of computational pathogenicity prediction for missense variants. Results: To address this, we first engineered a set of features for each amino acid from these structures. We then trained a random forest to distinguish between relatively common (proxy-benign) and singleton (proxy-pathogenic) missense variants from gnomAD v3.1. This yielded a novel AlphaFold2-based pathogenicity prediction score, termed AlphScore. Important feature classes used by AlphScore are solvent accessibility, amino acid network-related features, features describing the physicochemical environment, and AlphaFold2’s quality parameter (predicted local distance difference test). AlphScore alone showed lower performance than existing in silico scores used for missense prediction, such as CADD or REVEL. However, when AlphScore was added to those scores, the performance increased, as measured by the approximation of deep mutational scan data, as well as the prediction of expert-curated missense variants from the ClinVar database. Overall, our data indicate that the integration of AlphaFold2-predicted structures can improve pathogenicity prediction of missense variants. Availability and implementation: AlphScore, combinations of AlphScore with existing scores, as well as variants used for training and testing are publicly available.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad280
Issue No: Vol. 39, No. 5 (2023)
-
- Effective design and inference for cell sorting and sequencing based
massively parallel reporter assays-
First page: btad277
Abstract: Motivation: The ability to measure the phenotype of millions of different genetic designs using Massively Parallel Reporter Assays (MPRAs) has revolutionized our understanding of genotype-to-phenotype relationships and opened avenues for data-centric approaches to biological design. However, our knowledge of how best to design these costly experiments and the effect that our choices have on the quality of the data produced is lacking. Results: In this article, we tackle the issues of data quality and experimental design by developing FORECAST, a Python package that supports the accurate simulation of cell-sorting and sequencing-based MPRAs and robust maximum likelihood-based inference of genetic design function from MPRA data. We use FORECAST’s capabilities to reveal rules for MPRA experimental design that help ensure accurate genotype-to-phenotype links and show how the simulation of MPRA experiments can help us better understand the limits of prediction accuracy when these data are used for training deep learning-based classifiers. As the scale and scope of MPRAs grow, tools like FORECAST will help ensure we make informed decisions during their development and make the most of the data produced. Availability and implementation: The FORECAST package is available at: https://gitlab.com/Pierre-Aurelien/forecast. Code for the deep learning analysis performed in this study is available at: https://gitlab.com/Pierre-Aurelien/rebeca.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad277
Issue No: Vol. 39, No. 5 (2023)
-
- AcrNET: predicting anti-CRISPR with deep learning
-
First page: btad259
Abstract: Motivation: As an important group of proteins discovered in phages, anti-CRISPR inhibits the activity of the immune system of bacteria (i.e. CRISPR-Cas), offering promise for gene editing and phage therapy. However, the prediction and discovery of anti-CRISPR are challenging due to their high variability and fast evolution. Existing biological studies rely on known CRISPR and anti-CRISPR pairs, which may not be practical considering the huge number. Computational methods struggle with prediction performance. To address these issues, we propose a novel deep neural network for anti-CRISPR analysis (AcrNET), which achieves significant performance. Results: On both the cross-fold and cross-dataset validation, our method outperforms the state-of-the-art methods. Notably, AcrNET improves the prediction performance by at least 15% regarding the F1 score for the cross-dataset test problem compared with the state-of-the-art deep learning method. Moreover, AcrNET is the first computational method to predict the detailed anti-CRISPR classes, which may help illustrate the anti-CRISPR mechanism. Taking advantage of a Transformer protein language model ESM-1b, which was pre-trained on 250 million protein sequences, AcrNET overcomes the data scarcity problem. Extensive experiments and analysis suggest that the Transformer model feature, evolutionary feature, and local structure feature complement each other, which indicates the critical properties of anti-CRISPR proteins. AlphaFold prediction, motif analysis, and docking experiments further demonstrate that AcrNET can capture the evolutionarily conserved pattern and the interaction between anti-CRISPR and the target implicitly. Availability and implementation: Web server: https://proj.cse.cuhk.edu.hk/aihlab/AcrNET/. Training code and pre-trained model are available at.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad259
Issue No: Vol. 39, No. 5 (2023)
-
- FAS: assessing the similarity between proteins using multi-layered feature
architectures-
First page: btad226
Abstract: Motivation: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. Results: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. Availability and implementation: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad226
Issue No: Vol. 39, No. 5 (2023)
-
- Deciphering associations between gut microbiota and clinical factors using
microbial modules-
First page: btad213
Abstract: Motivation: Human gut microbiota plays a vital role in maintaining body health. The dysbiosis of gut microbiota is associated with a variety of diseases. It is critical to uncover the associations between gut microbiota and disease states as well as other intrinsic or environmental factors. However, inferring alterations of individual microbial taxa based on relative abundance data likely leads to false associations and conflicting discoveries in different studies. Moreover, the effects of underlying factors and microbe–microbe interactions could lead to the alteration of larger sets of taxa. It might be more robust to investigate gut microbiota using groups of related taxa instead of the composition of individual taxa. Results: We proposed a novel method to identify underlying microbial modules, i.e. groups of taxa with similar abundance patterns affected by a common latent factor, from longitudinal gut microbiota and applied it to inflammatory bowel disease (IBD). The identified modules demonstrated closer intragroup relationships, indicating potential microbe–microbe interactions and influences of underlying factors. Associations between the modules and several clinical factors were investigated, especially disease states. The IBD-associated modules performed better in stratifying the subjects compared with the relative abundance of individual taxa. The modules were further validated in external cohorts, demonstrating the efficacy of the proposed method in identifying general and robust microbial modules. The study reveals the benefit of considering the ecological effects in gut microbiota analysis and the great promise of linking clinical factors with underlying microbial modules. Availability and implementation: https://github.com/rwang-z/microbial_module.git.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad213
Issue No: Vol. 39, No. 5 (2023)
-
- A functional analysis of omic network embedding spaces reveals key altered
functions in cancer-
First page: btad281
Abstract: Motivation: Advances in omics technologies have revolutionized cancer research by producing massive datasets. A common approach to deciphering these complex data is to apply embedding algorithms to molecular interaction networks. These algorithms find a low-dimensional space in which similarities between the network nodes are best preserved. Currently available embedding approaches mine the gene embeddings directly to uncover new cancer-related knowledge. However, these gene-centric approaches produce incomplete knowledge, since they do not account for the functional implications of genomic alterations. We propose a new, function-centric perspective and approach to complement the knowledge obtained from omic data. Results: We introduce our Functional Mapping Matrix (FMM) to explore the functional organization of different tissue-specific and species-specific embedding spaces generated by a Non-negative Matrix Tri-Factorization algorithm. Also, we use our FMM to define the optimal dimensionality of these molecular interaction network embedding spaces. For this optimal dimensionality, we compare the FMMs of the most prevalent cancers in humans to FMMs of their corresponding control tissues. We find that cancer alters the positions in the embedding space of cancer-related functions, while it keeps the positions of the noncancer-related ones. We exploit this spatial ‘movement’ to predict novel cancer-related functions. Finally, we predict novel cancer-related genes that the currently available methods for gene-centric analyses cannot identify; we validate these predictions by literature curation and retrospective analyses of patient survival data. Availability and implementation: Data and source code can be accessed at https://github.com/gaiac/FMM.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad281
Issue No: Vol. 39, No. 5 (2023)
-
- matchRanges: generating null hypothesis genomic ranges via
covariate-matched sampling-
First page: btad197
Abstract: Motivation: Deriving biological insights from genomic data commonly requires comparing attributes of selected genomic loci to a null set of loci. The selection of this null set is non-trivial, as it requires careful consideration of potential covariates, a problem that is exacerbated by the non-uniform distribution of genomic features including genes, enhancers, and transcription factor binding sites. Propensity score-based covariate matching methods allow the selection of null sets from a pool of possible items while controlling for multiple covariates; however, existing packages do not operate on genomic data classes and can be slow for large data sets, making them difficult to integrate into genomic workflows. Results: To address this, we developed matchRanges, a propensity score-based covariate matching method for the efficient and convenient generation of matched null ranges from a set of background ranges within the Bioconductor framework. Availability and implementation: Package: https://bioconductor.org/packages/nullranges, Code: https://github.com/nullranges, Documentation: https://nullranges.github.io/nullranges.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad197
Issue No: Vol. 39, No. 5 (2023)
-
- DFHiC: a dilated full convolution model to enhance the resolution of Hi-C
data-
First page: btad211
Abstract: Motivation: Hi-C technology has been the most widely used chromosome conformation capture (3C) experiment that measures the frequency of all paired interactions in the entire genome, which makes it a powerful tool for studying the 3D structure of the genome. The fineness of the constructed genome structure depends on the resolution of Hi-C data. However, because high-resolution Hi-C data require deep sequencing and thus high experimental cost, most available Hi-C data are at low resolution. Hence, it is essential to enhance the quality of Hi-C data by developing effective computational methods. Results: In this work, we propose a novel method, called DFHiC, which generates the high-resolution Hi-C matrix from the low-resolution Hi-C matrix in the framework of the dilated convolutional neural network. The dilated convolution is able to effectively explore the global patterns in the overall Hi-C matrix by taking advantage of the information in the Hi-C matrix over longer genomic distances. Consequently, DFHiC can improve the resolution of the Hi-C matrix reliably and accurately. More importantly, the super-resolution Hi-C data enhanced by DFHiC are more in line with the real high-resolution Hi-C data than those produced by other existing methods, in terms of both significant chromatin interactions and the identification of topologically associating domains. Availability and implementation: https://github.com/BinWangCSU/DFHiC.
PubDate: Fri, 21 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad211
Issue No: Vol. 39, No. 5 (2023)
-
- Molecular property prediction by contrastive learning with
attention-guided positive sample selection-
First page: btad258
Abstract: Motivation: Predicting molecular properties is one of the fundamental problems in drug design and discovery. In recent years, self-supervised learning (SSL) has shown promising performance in image recognition, natural language processing, and single-cell data analysis. Contrastive learning (CL) is a typical SSL method used to learn the features of data so that the trained model can more effectively distinguish the data. One important issue of CL is how to select positive samples for each training example, which will significantly impact the performance of CL. Results: In this article, we propose a new method for molecular property prediction (MPP) by Contrastive Learning with Attention-guided Positive-sample Selection (CLAPS). First, we generate positive samples for each training example based on an attention-guided selection scheme. Second, we employ a Transformer encoder to extract latent feature vectors and compute the contrastive loss aiming to distinguish positive and negative sample pairs. Finally, we use the trained encoder for predicting molecular properties. Experiments on various benchmark datasets show that our approach outperforms the state-of-the-art (SOTA) methods in most cases. Availability and implementation: The code is publicly available at https://github.com/wangjx22/CLAPS.
PubDate: Thu, 20 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad258
Issue No: Vol. 39, No. 5 (2023)
-
- LogBTF: gene regulatory network inference using Boolean threshold network
model from single-cell gene expression data-
First page: btad256
Abstract: Motivation: From a systematic perspective, it is crucial to infer and analyze gene regulatory networks (GRNs) from high-throughput single-cell RNA sequencing data. However, most existing GRN inference methods mainly focus on the network topology; only a few of them consider how to explicitly describe the updated logic rules of regulation in GRNs to obtain their dynamics. Moreover, some inference methods also fail to deal with the over-fitting problem caused by the noise in time series data. Results: In this article, we propose a novel embedded Boolean threshold network method called LogBTF, which effectively infers GRNs by integrating regularized logistic regression and Boolean threshold function. First, the continuous gene expression values are converted into Boolean values and the elastic net regression model is adopted to fit the binarized time series data. Then, the estimated regression coefficients are applied to represent the unknown Boolean threshold function of the candidate Boolean threshold network as the dynamical equations. To overcome the multi-collinearity and over-fitting problems, a new and effective approach is designed to optimize the network topology by adding a perturbation design matrix to the input data and thereafter setting sufficiently small elements of the output coefficient vector to zero. In addition, the cross-validation procedure is implemented into the Boolean threshold network model framework to strengthen the inference capability. Finally, extensive experiments on one simulated Boolean value dataset, dozens of simulation datasets, and three real single-cell RNA sequencing datasets demonstrate that the LogBTF method can infer GRNs from time series data more accurately than some other alternative methods for GRN inference. Availability and implementation: The source data and code are available at https://github.com/zpliulab/LogBTF.
PubDate: Thu, 20 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad256
Issue No: Vol. 39, No. 5 (2023)
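The Boolean threshold update mentioned in the LogBTF abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three-gene network, its weights and biases, and the convention that a gene switches on when its weighted input is strictly greater than zero are all assumptions made for illustration. In a fitted model, the weights would plausibly come from the regularized regression coefficients.

```python
def boolean_threshold_step(state, weights, biases):
    """Advance a Boolean network by one synchronous step.

    state   -- list of 0/1 values, one per gene
    weights -- weights[i][j] is the (hypothetical) influence of gene j on gene i
    biases  -- one intercept per gene
    A gene is set to 1 when its weighted input plus bias is strictly positive
    (an assumed convention), otherwise to 0.
    """
    next_state = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, state)) + bias
        next_state.append(1 if total > 0 else 0)
    return next_state


# Hypothetical three-gene network: gene 1 activates gene 0,
# gene 2 activates gene 1, and gene 0 represses gene 2.
WEIGHTS = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
BIASES = [0.0, 0.0, 0.5]

if __name__ == "__main__":
    state = [1, 0, 0]
    for _ in range(4):
        print(state)
        state = boolean_threshold_step(state, WEIGHTS, BIASES)
```

Setting small coefficients to exactly zero, as the abstract describes, would correspond here to zeroing entries of `WEIGHTS`, which removes the matching edges from the network.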
-
- RING-PyMOL: residue interaction networks of structural ensembles and
molecular dynamics-
First page: btad260
Abstract: RING-PyMOL is a plugin for PyMOL providing a set of analysis tools for structural ensembles and molecular dynamics simulations. RING-PyMOL combines residue interaction networks, as provided by the RING software, with structural clustering to enhance the analysis and visualization of the conformational complexity. It combines precise calculation of non-covalent interactions with the power of PyMOL to manipulate and visualize protein structures. The plugin identifies and highlights correlating contacts and interaction patterns that can explain structural allostery, active sites, and structural heterogeneity connected with molecular function. It is easy to use and extremely fast, processing and rendering hundreds of models and long trajectories in seconds. RING-PyMOL generates a number of interactive plots and output files for use with external tools. The underlying RING software has been improved extensively. It is 10 times faster, can process mmCIF files, and also identifies typed interactions for nucleic acids. Availability and implementation: https://github.com/BioComputingUP/ring-pymol
PubDate: Thu, 20 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad260
Issue No: Vol. 39, No. 5 (2023)
-
- CONNECTOR, fitting and clustering of longitudinal data to reveal a new
risk stratification system-
First page: btad201
Abstract: Motivation: The transition from evaluating a single time point to examining the entire dynamic evolution of a system is possible only in the presence of the proper framework. The strong variability of dynamic evolution makes the definition of an explanatory procedure for data fitting and clustering challenging. Results: We developed CONNECTOR, a data-driven framework able to analyze and inspect longitudinal data in a straightforward and revealing way. When used to analyze tumor growth kinetics over time in 1599 patient-derived xenograft growth curves from ovarian and colorectal cancers, CONNECTOR allowed the aggregation of time-series data through an unsupervised approach in informative clusters. We give a new perspective on mechanism interpretation; specifically, we define novel model aggregations and identify unanticipated molecular associations with response to clinically approved therapies. Availability and implementation: CONNECTOR is freely available under the GNU GPL license at https://qbioturin.github.io/connector and https://doi.org/10.17504/protocols.io.8epv56e74g1b/v1.
PubDate: Thu, 20 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad201
Issue No: Vol. 39, No. 5 (2023)
-
- ConsAlign: simultaneous RNA structural aligner based on rich transfer
learning and thermodynamic ensemble model of alignment scoring-
First page: btad255
Abstract: Motivation: To capture structural homology in RNAs, alignment and folding (AF) of RNA homologs has been a fundamental framework around RNA science. Learning sufficient scoring parameters for simultaneous AF (SAF) is an undeveloped subject because evaluating them is computationally expensive. Results: We developed ConsTrain, a gradient-based machine learning method for rich SAF scoring. We also implemented ConsAlign, a SAF tool composed of ConsTrain’s learned scoring parameters. To aim for better AF quality, ConsAlign employs (1) transfer learning from well-defined scoring models and (2) the ensemble model between the ConsTrain model and a well-established thermodynamic scoring model. Keeping comparable running time, ConsAlign demonstrated competitive AF prediction quality among current AF tools. Availability and implementation: Our code and our data are freely available at https://github.com/heartsh/consalign and https://github.com/heartsh/consprob-trained.
PubDate: Wed, 19 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad255
Issue No: Vol. 39, No. 5 (2023)
-
- Evolink: a phylogenetic approach for rapid identification of
genotype–phenotype associations in large-scale microbial multispecies
data-
First page: btad215
Abstract: Motivation: The discovery of the genetic features that underlie a phenotype is a fundamental task in microbial genomics. With the growing number of microbial genomes that are paired with phenotypic data, new challenges and opportunities are arising for genotype–phenotype inference. Phylogenetic approaches are frequently used to adjust for the population structure of microbes, but scaling them to trees with thousands of leaves representing heterogeneous populations is highly challenging. This greatly hinders the identification of prevalent genetic features that contribute to phenotypes that are observed in a wide diversity of species. Results: In this study, Evolink was developed as an approach to rapidly identify genotypes associated with phenotypes in large-scale multispecies microbial datasets. Compared with other similar tools, Evolink was consistently among the top-performing methods in terms of precision and sensitivity when applied to simulated and real-world flagella datasets. In addition, Evolink significantly outperformed all other approaches in terms of computation time. Application of Evolink on flagella and gram-staining datasets revealed findings that are consistent with known markers and supported by the literature. In conclusion, Evolink can rapidly detect phenotype-associated genotypes across multiple species, demonstrating its potential to be broadly utilized to identify gene families associated with traits of interest. Availability and implementation: The source code, docker container, and web server for Evolink are freely available at https://github.com/nlm-irp-jianglab/Evolink.
PubDate: Wed, 19 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad215
Issue No: Vol. 39, No. 5 (2023)
-
- PyHMMER: a Python library binding to HMMER for efficient sequence analysis
-
First page: btad214
Abstract: Summary: PyHMMER provides Python integration of the popular profile Hidden Markov Model software HMMER via Cython bindings. This allows the annotation of protein sequences with profile HMMs and building new ones directly with Python. PyHMMER increases flexibility of use, allowing users to create queries directly from Python code, launch searches, and obtain results without I/O, or access previously unavailable statistics like uncorrected P-values. A new parallelization model greatly improves performance when running multithreaded searches, while producing the exact same results as HMMER. Availability and implementation: PyHMMER supports all modern Python versions (Python 3.6+) and similar platforms as HMMER (x86 or PowerPC UNIX systems). Pre-compiled packages are released via PyPI (https://pypi.org/project/pyhmmer/) and Bioconda (https://anaconda.org/bioconda/pyhmmer). The PyHMMER source code is available under the terms of the open-source MIT licence and hosted on GitHub (https://github.com/althonos/pyhmmer); its documentation is available on ReadTheDocs (https://pyhmmer.readthedocs.io).
PubDate: Wed, 19 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad214
Issue No: Vol. 39, No. 5 (2023)
-
- 3D-MSNet: a point cloud-based deep learning model for untargeted feature
detection and quantification in profile LC-HRMS data-
First page: btad195
Abstract: Motivation: Liquid chromatography coupled with high-resolution mass spectrometry is widely used in composition profiling in untargeted metabolomics research. While retaining complete sample information, mass spectrometry (MS) data naturally have the characteristics of high dimensionality, high complexity, and huge data volume. None of the existing mainstream quantification methods can perform direct 3D analysis on lossless profile MS signals. All existing software simplifies calculations by dimensionality reduction or lossy grid transformation, ignoring the full 3D signal distribution of MS data and resulting in inaccurate feature detection and quantification. Results: On the basis that the neural network is effective for high-dimensional data analysis and can discover implicit features from large amounts of complex data, in this work, we propose 3D-MSNet, a novel deep learning-based model for untargeted feature extraction. 3D-MSNet performs direct feature detection on 3D MS point clouds as an instance segmentation task. After training on a self-annotated 3D feature dataset, we compared our model with nine popular software packages (MS-DIAL, MZmine 2, XCMS Online, MarkerView, Compound Discoverer, MaxQuant, Dinosaur, DeepIso, PointIso) on two metabolomics and one proteomics public benchmark datasets. Our 3D-MSNet model outperformed the other software with significant improvement in feature detection and quantification accuracy on all evaluation datasets. Furthermore, 3D-MSNet has high feature extraction robustness and can be widely applied to profile MS data acquired with various high-resolution mass spectrometers with various resolutions. Availability and implementation: 3D-MSNet is an open-source model and is freely available at https://github.com/CSi-Studio/3D-MSNet under a permissive license. Benchmark datasets, training dataset, evaluation methods, and results are available at https://doi.org/10.5281/zenodo.6582912.
PubDate: Tue, 18 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad195
Issue No: Vol. 39, No. 5 (2023)
-
- Neither random nor censored: estimating intensity-dependent probabilities
for missing values in label-free proteomics-
First page: btad200
Abstract: Motivation: Mass spectrometry proteomics is a powerful tool in biomedical research but its usefulness is limited by the frequent occurrence of missing values in peptides that cannot be reliably quantified (detected) for particular samples. Many analysis strategies have been proposed for missing values where the discussion often focuses on distinguishing whether values are missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Results: Statistical models and algorithms are proposed for estimating the detection probabilities and for evaluating how much statistical information can or cannot be recovered from the missing value pattern. The probability that an intensity is detected is shown to be accurately modeled as a logit-linear function of the underlying intensity, showing that the missing value process is intermediate between MAR and censoring. The detection probability asymptotes to 100% for high intensities, showing that missing values unrelated to intensity are rare. The rule applies globally to each dataset and is appropriate for both highly and lowly expressed peptides. A probability model is developed that allows the distribution of unobserved intensities to be inferred from the observed values. The detection probability model is incorporated into a likelihood-based approach for assessing differential expression and successfully recovers statistical power compared to omitting the missing values from the analysis. In contrast, imputation methods are shown to perform poorly, either reducing statistical power or increasing the false discovery rate to unacceptable levels. Availability and implementation: Data and code to reproduce the results shown in this article are available from https://mengbo-li.github.io/protDP/.
PubDate: Mon, 17 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad200
Issue No: Vol. 39, No. 5 (2023)
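The logit-linear detection model described in the abstract above (btad200) can be written as logit(p) = b0 + b1 * intensity. The following Python lines are a minimal sketch of that functional form only; the coefficient values are hypothetical placeholders rather than estimates from the paper, and the function name is invented for illustration.

```python
import math


def detection_probability(intensity, intercept, slope):
    """Probability that a value is detected, under a logit-linear model.

    logit(p) = intercept + slope * intensity, so p is the logistic
    function of a linear predictor in the underlying intensity.
    """
    linear_predictor = intercept + slope * intensity
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients chosen only to make the curve easy to read:
# detection is a coin flip at intensity 12 and near-certain well above it.
INTERCEPT = -6.0
SLOPE = 0.5

if __name__ == "__main__":
    for intensity in (6, 12, 18, 40):
        p = detection_probability(intensity, INTERCEPT, SLOPE)
        print(f"intensity={intensity:>3}  P(detected)={p:.4f}")
```

With a positive slope the curve rises monotonically toward 1, which matches the abstract's statement that the detection probability asymptotes to 100% at high intensities, while low intensities are only probabilistically (not deterministically) missing.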
-
- EBD: an eye biomarker database
-
First page: btad194
Abstract: Motivation: Many ophthalmic disease biomarkers have been identified through comprehensive multiomics profiling, and hold significant potential in advancing the diagnosis, prognosis, and management of diseases. Meanwhile, the eye itself serves as a natural biomarker for several systemic diseases, including neurological, renal, and cardiovascular diseases. We aimed to collect and standardize this eye biomarker information and construct the eye biomarker database (EBD) to provide ophthalmologists with a platform to search, analyze, and download these eye biomarker data. Results: In this study, we present the EBD <http://www.eyeseeworld.com/ebd/index.html>, a world-first online compilation comprising 889 biomarkers for 26 ocular diseases and 939 eye biomarkers for 181 systemic diseases. The EBD also includes information on 78 “nonbiomarkers”, i.e. objects that have been shown not to be biomarkers. Biological function and network analysis were conducted for these ocular disease biomarkers, and several hub pathways and common network topology characteristics were newly identified, which may promote future ocular disease biomarker discovery and characterize the landscape of biomarkers for eye diseases at the pathway and network level. The EBD is expected to yield broader utility among developmental biologists and clinical scientists in and outside of the eye field by assisting in the identification of biomarkers linked to eye disorders and related systemic diseases. Availability and implementation: EBD is available at http://www.eyeseeworld.com/ebd/index.html.
PubDate: Thu, 13 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad194
Issue No: Vol. 39, No. 5 (2023)
-
- bootRanges: flexible generation of null sets of genomic ranges for
hypothesis testing-
First page: btad190
Abstract: Motivation: Enrichment analysis is a widely utilized technique in genomic analysis that aims to determine if there is a statistically significant association between two sets of genomic features. To conduct this type of hypothesis testing, an appropriate null model is typically required. However, the null distribution that is commonly used can be overly simplistic and may result in inaccurate conclusions. Results: bootRanges provides fast functions for generation of block bootstrapped genomic ranges representing the null hypothesis in enrichment analysis. As part of a modular workflow, bootRanges offers greater flexibility for computing various test statistics leveraging other Bioconductor packages. We show that shuffling or permutation schemes may result in overly narrow test statistic null distributions and over-estimation of statistical significance, while creating new range sets with a block bootstrap preserves local genomic correlation structure and generates more reliable null distributions. It can also be used in more complex analyses, such as assessing correlations between cis-regulatory elements (CREs) and genes across cell types or providing optimized thresholds, e.g. log fold change (logFC) from differential analysis. Availability and implementation: bootRanges is freely available in the R/Bioconductor package nullranges hosted at https://bioconductor.org/packages/nullranges.
PubDate: Wed, 12 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad190
Issue No: Vol. 39, No. 5 (2023)
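To illustrate why a block bootstrap preserves local structure where shuffling does not, as the bootRanges abstract above argues, here is a minimal one-dimensional sketch in Python. It is a deliberate simplification and not the bootRanges algorithm or API (bootRanges is an R/Bioconductor tool operating on genomic ranges); the function name, the fixed block length, and the toy data are assumptions for illustration.

```python
import random


def block_bootstrap(values, block_length, rng):
    """Resample a sequence by concatenating randomly chosen contiguous blocks.

    Each block is copied intact, so neighbouring values inside a block keep
    their original order and spacing. Shuffling individual values would
    destroy that local correlation structure.
    """
    n = len(values)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(values)")
    resampled = []
    while len(resampled) < n:
        # Pick a block start so the whole block fits inside the sequence.
        start = rng.randrange(n - block_length + 1)
        resampled.extend(values[start:start + block_length])
    return resampled[:n]  # trim the final block to the original length


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    signal = list(range(12))  # toy stand-in for features ordered along a genome
    print(block_bootstrap(signal, block_length=3, rng=rng))
```

Repeating the resampling many times and recomputing a test statistic on each replicate would give a null distribution; the choice of block length controls how much local correlation is retained.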
-
- FISHFactor: a probabilistic factor model for spatial transcriptomics data
with subcellular resolution-
First page: btad183
Abstract: Motivation: Factor analysis is a widely used tool for unsupervised dimensionality reduction of high-throughput datasets in molecular biology, with recently proposed extensions designed specifically for spatial transcriptomics data. However, these methods expect (count) matrices as data input and are therefore not directly applicable to single molecule resolution data, which are in the form of coordinate lists annotated with genes and provide insight into subcellular spatial expression patterns. To address this, we here propose FISHFactor, a probabilistic factor model that combines the benefits of spatial, non-negative factor analysis with a Poisson point process likelihood to explicitly model and account for the nature of single molecule resolution data. In addition, FISHFactor shares information across a potentially large number of cells in a common weight matrix, allowing consistent interpretation of factors across cells and yielding improved latent variable estimates. Results: We compare FISHFactor to existing methods that rely on aggregating information through spatial binning and cannot combine information from multiple cells, and show that our method leads to more accurate results on simulated data. We show that our method is scalable and can be readily applied to large datasets. Finally, we demonstrate on a real dataset that FISHFactor is able to identify major subcellular expression patterns and spatial gene clusters in a data-driven manner. Availability and implementation: The model implementation, data simulation and experiment scripts are available at https://www.github.com/bioFAM/FISHFactor.
PubDate: Tue, 11 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad183
Issue No: Vol. 39, No. 5 (2023)
-
- Accurate flux predictions using tissue-specific gene expression in plant
metabolic modeling-
First page: btad186
Abstract: Motivation: The accurate prediction of complex phenotypes such as metabolic fluxes in living systems is a grand challenge for systems biology and central to efficiently identifying biotechnological interventions that can address pressing industrial needs. The application of gene expression data to improve the accuracy of metabolic flux predictions using mechanistic modeling methods such as flux balance analysis (FBA) has not been previously demonstrated in multi-tissue systems, despite their biotechnological importance. We hypothesized that a method for generating metabolic flux predictions informed by relative expression levels between tissues would improve prediction accuracy. Results: Relative gene expression levels derived from multiple transcriptomic and proteomic datasets were integrated into FBA predictions of a multi-tissue, diel model of Arabidopsis thaliana’s central metabolism. This integration dramatically improved the agreement of flux predictions with experimentally based flux maps from 13C metabolic flux analysis (MFA) compared with a standard parsimonious FBA approach. Disagreement between FBA predictions and MFA flux maps was measured using weighted averaged percent error values, and for parsimonious FBA this was 169%–180% for high light conditions and 94%–103% for low light conditions, depending on the gene expression dataset used. This fell to 10%–13% and 9%–11%, respectively, upon incorporating expression data into the modeling process, which also substantially altered the predicted carbon and energy economy of the plant. Availability and implementation: Code and data generated as part of this study are available from https://github.com/Gibberella/ArabidopsisGeneExpressionWeights.
PubDate: Tue, 11 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad186
Issue No: Vol. 39, No. 5 (2023)
-
- Finite mixtures of matrix variate Poisson-log normal distributions for
three-way count data-
First page: btad167
Abstract: Motivation: Three-way data structures, characterized by three entities (the units, the variables and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. Results: In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo-based approach, a variational Gaussian approximation-based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery. Availability and implementation: The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.
PubDate: Wed, 05 Apr 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad167
Issue No: Vol. 39, No. 5 (2023)
-
- A framework for high-throughput sequence alignment using real
processing-in-memory systems-
First page: btad155
Abstract: Motivation: Sequence alignment is a memory-bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. Results: Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. Availability and implementation: Our code is available at https://github.com/safaad/aim.
PubDate: Mon, 27 Mar 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad155
Issue No: Vol. 39, No. 5 (2023)
-
- nf-core/isoseq: simple gene and isoform annotation with PacBio Iso-Seq
long-read sequencing-
First page: btad150
Abstract: Motivation: Iso-Seq RNA long-read sequencing enables the identification of full-length transcripts and isoforms, removing the need for complex analysis such as transcriptome assembly. However, the raw sequencing data need to be processed in a series of steps before annotation is complete. Here, we present nf-core/isoseq, a pipeline for automatic read processing and genome annotation. Following nf-core guidelines, the pipeline has few dependencies and can be run on any platform. Availability and implementation: The pipeline is freely available online on the nf-core website (https://nf-co.re/isoseq) and on GitHub (https://github.com/nf-core/isoseq) under the MIT License (DOI: 10.5281/zenodo.7116979).
PubDate: Fri, 24 Mar 2023 00:00:00 GMT
Issue No: Vol. 39, No. 5 (2023)
-
- Scrooge: a fast and memory-frugal genomic sequence aligner for CPUs, GPUs,
and ASICs-
First page: btad151
Abstract: Motivation: Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work. Results: We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1×, 1.7×, and 2.1× speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0×, 80.4×, 6.8×, 12.6×, and 5.9× speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6× less chip area and 2.1× less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge. Availability and implementation: https://github.com/CMU-SAFARI/Scrooge.
PubDate: Fri, 24 Mar 2023 00:00:00 GMT
DOI: 10.1093/bioinformatics/btad151
Issue No: Vol. 39, No. 5 (2023)
-