BIOLOGY (1491 journals)

AAPS Journal     Hybrid Journal   (Followers: 31)
Abasyn Journal of Life Sciences     Open Access   (Followers: 3)
ACS Pharmacology & Translational Science     Hybrid Journal   (Followers: 5)
ACS Synthetic Biology     Hybrid Journal   (Followers: 38)
Acta Biologica Hungarica     Full-text available via subscription   (Followers: 5)
Acta Biologica Marisiensis     Open Access   (Followers: 3)
Acta Biologica Sibirica     Open Access   (Followers: 2)
Acta Biologica Turcica     Open Access   (Followers: 1)
Acta Biomaterialia     Hybrid Journal   (Followers: 31)
Acta Biotheoretica     Hybrid Journal   (Followers: 3)
Acta Chiropterologica     Full-text available via subscription   (Followers: 5)
acta ethologica     Hybrid Journal   (Followers: 7)
Acta Fytotechnica et Zootechnica     Open Access   (Followers: 3)
Acta Ichthyologica et Piscatoria     Open Access   (Followers: 5)
Acta Médica Costarricense     Open Access   (Followers: 2)
Acta Musei Silesiae, Scientiae Naturales     Open Access  
Acta Neurobiologiae Experimentalis     Open Access  
Acta Scientiae Biological Research     Open Access   (Followers: 1)
Acta Scientiarum. Biological Sciences     Open Access   (Followers: 2)
Acta Scientifica Naturalis     Open Access   (Followers: 4)
Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis     Open Access   (Followers: 2)
Acta Universitatis Lodziensis : Folia Biologica et Oecologica     Open Access  
Actualidades Biológicas     Open Access   (Followers: 1)
Advanced Biology     Hybrid Journal   (Followers: 1)
Advanced Health Care Technologies     Open Access   (Followers: 12)
Advanced Journal of Graduate Research     Open Access   (Followers: 1)
Advanced Membranes     Open Access   (Followers: 5)
Advanced Quantum Technologies     Hybrid Journal   (Followers: 3)
Advances in Bioinformatics     Open Access   (Followers: 22)
Advances in Biological Regulation     Hybrid Journal   (Followers: 4)
Advances in Biology     Open Access   (Followers: 12)
Advances in Biomarker Sciences and Technology     Open Access   (Followers: 3)
Advances in Biosensors and Bioelectronics     Open Access   (Followers: 6)
Advances in Cell Biology/ Medical Journal of Cell Biology     Open Access   (Followers: 26)
Advances in Ecological Research     Full-text available via subscription   (Followers: 45)
Advances in Environmental Sciences - International Journal of the Bioflux Society     Open Access   (Followers: 17)
Advances in Enzyme Research     Open Access   (Followers: 10)
Advances in High Energy Physics     Open Access   (Followers: 26)
Advances in Human Biology     Open Access   (Followers: 8)
Advances in Life Science and Technology     Open Access   (Followers: 12)
Advances in Life Sciences     Open Access   (Followers: 5)
Advances in Marine Biology     Full-text available via subscription   (Followers: 29)
Advances in Tropical Biodiversity and Environmental Sciences     Open Access   (Followers: 5)
Advances in Virus Research     Full-text available via subscription   (Followers: 8)
Adversity and Resilience Science : Journal of Research and Practice     Hybrid Journal   (Followers: 3)
African Journal of Ecology     Hybrid Journal   (Followers: 18)
African Journal of Range & Forage Science     Hybrid Journal   (Followers: 12)
AFRREV STECH : An International Journal of Science and Technology     Open Access   (Followers: 3)
Ageing Research Reviews     Hybrid Journal   (Followers: 13)
Aggregate     Open Access   (Followers: 1)
Aging Cell     Open Access   (Followers: 22)
Agrokémia és Talajtan     Full-text available via subscription   (Followers: 2)
AJP Cell Physiology     Hybrid Journal   (Followers: 16)
AJP Endocrinology and Metabolism     Hybrid Journal   (Followers: 26)
AJP Lung Cellular and Molecular Physiology     Hybrid Journal   (Followers: 4)
Al-Kauniyah : Jurnal Biologi     Open Access  
Alasbimn Journal     Open Access   (Followers: 1)
Alces : A Journal Devoted to the Biology and Management of Moose     Open Access  
Alfarama Journal of Basic & Applied Sciences     Open Access   (Followers: 8)
All Life     Open Access   (Followers: 1)
AMB Express     Open Access   (Followers: 1)
Ambix     Hybrid Journal   (Followers: 3)
American Journal of Agricultural and Biological Sciences     Open Access   (Followers: 7)
American Journal of Bioethics     Hybrid Journal   (Followers: 18)
American Journal of Human Biology     Hybrid Journal   (Followers: 17)
American Journal of Medical and Biological Research     Open Access   (Followers: 4)
American Journal of Plant Sciences     Open Access   (Followers: 24)
American Journal of Primatology     Hybrid Journal   (Followers: 17)
American Naturalist     Full-text available via subscription   (Followers: 80)
Amphibia-Reptilia     Hybrid Journal   (Followers: 5)
Anaerobe     Hybrid Journal   (Followers: 3)
Analytical Methods     Hybrid Journal   (Followers: 8)
Analytical Science Advances     Open Access   (Followers: 1)
Anatomia     Open Access   (Followers: 12)
Anatomical Science International     Hybrid Journal   (Followers: 3)
Animal Cells and Systems     Hybrid Journal   (Followers: 5)
Animal Microbiome     Open Access   (Followers: 3)
Animal Models and Experimental Medicine     Open Access  
Annales françaises d'Oto-rhino-laryngologie et de Pathologie Cervico-faciale     Full-text available via subscription   (Followers: 2)
Annales Henri Poincaré     Hybrid Journal   (Followers: 2)
Annales Universitatis Mariae Curie-Sklodowska, sectio C – Biologia     Open Access   (Followers: 1)
Annals of Applied Biology     Hybrid Journal   (Followers: 6)
Annals of Biomedical Engineering     Hybrid Journal   (Followers: 18)
Annals of Human Biology     Hybrid Journal   (Followers: 5)
Annals of Science and Technology     Open Access   (Followers: 2)
Annual Research & Review in Biology     Open Access  
Annual Review of Biomedical Engineering     Full-text available via subscription   (Followers: 18)
Annual Review of Biophysics     Full-text available via subscription   (Followers: 24)
Annual Review of Cancer Biology     Full-text available via subscription   (Followers: 3)
Annual Review of Cell and Developmental Biology     Full-text available via subscription   (Followers: 44)
Annual Review of Food Science and Technology     Full-text available via subscription   (Followers: 13)
Annual Review of Genomics and Human Genetics     Full-text available via subscription   (Followers: 31)
Annual Review of Phytopathology     Full-text available via subscription   (Followers: 11)
Anthropological Review     Open Access   (Followers: 28)
Antibiotics     Open Access   (Followers: 12)
Antioxidants     Open Access   (Followers: 4)
Antioxidants & Redox Signaling     Hybrid Journal   (Followers: 8)
Antonie van Leeuwenhoek     Hybrid Journal   (Followers: 3)
Anzeiger für Schädlingskunde     Hybrid Journal   (Followers: 1)
Apidologie     Hybrid Journal   (Followers: 4)
Apmis     Hybrid Journal   (Followers: 1)
APOPTOSIS     Hybrid Journal   (Followers: 8)
Applied Biology     Open Access  
Applied Bionics and Biomechanics     Open Access   (Followers: 4)
Applied Phycology     Open Access  
Applied Vegetation Science     Full-text available via subscription   (Followers: 9)
Aquaculture Environment Interactions     Open Access   (Followers: 7)
Aquaculture International     Hybrid Journal   (Followers: 25)
Aquaculture Reports     Open Access   (Followers: 3)
Aquaculture, Aquarium, Conservation & Legislation - International Journal of the Bioflux Society     Open Access   (Followers: 9)
Aquatic Biology     Open Access   (Followers: 9)
Aquatic Ecology     Hybrid Journal   (Followers: 42)
Aquatic Ecosystem Health & Management     Hybrid Journal   (Followers: 16)
Aquatic Science and Technology     Open Access   (Followers: 4)
Aquatic Toxicology     Hybrid Journal   (Followers: 26)
Arabian Journal of Scientific Research / المجلة العربية للبحث العلمي     Open Access  
Archaea     Open Access   (Followers: 3)
Archiv für Molluskenkunde: International Journal of Malacology     Full-text available via subscription   (Followers: 1)
Archives of Biological Sciences     Open Access  
Archives of Microbiology     Hybrid Journal   (Followers: 9)
Archives of Natural History     Hybrid Journal   (Followers: 8)
Archives of Oral Biology     Hybrid Journal   (Followers: 2)
Archives of Virology     Hybrid Journal   (Followers: 6)
Archivum Immunologiae et Therapiae Experimentalis     Hybrid Journal   (Followers: 2)
Arctic     Open Access   (Followers: 8)
Arid Ecosystems     Hybrid Journal   (Followers: 2)
Arquivos do Instituto Biológico     Open Access   (Followers: 1)
Arquivos do Museu Dinâmico Interdisciplinar     Open Access  
Arthropod Structure & Development     Hybrid Journal   (Followers: 2)
Arthropod Systematics & Phylogeny     Open Access   (Followers: 3)
Artificial DNA: PNA & XNA     Hybrid Journal   (Followers: 2)
Artificial Intelligence in the Life Sciences     Open Access  
Asian Bioethics Review     Full-text available via subscription   (Followers: 2)
Asian Journal of Biological Sciences     Open Access   (Followers: 2)
Asian Journal of Biology     Open Access  
Asian Journal of Biotechnology and Bioresource Technology     Open Access  
Asian Journal of Cell Biology     Open Access   (Followers: 4)
Asian Journal of Developmental Biology     Open Access   (Followers: 1)
Asian Journal of Medical and Biological Research     Open Access   (Followers: 3)
Asian Journal of Nematology     Open Access   (Followers: 4)
Asian Journal of Poultry Science     Open Access   (Followers: 3)
Atti della Accademia Peloritana dei Pericolanti - Classe di Scienze Medico-Biologiche     Open Access  
Australian Life Scientist     Full-text available via subscription   (Followers: 2)
Australian Mammalogy     Hybrid Journal   (Followers: 8)
Autophagy     Hybrid Journal   (Followers: 8)
Avian Biology Research     Hybrid Journal   (Followers: 4)
Avian Conservation and Ecology     Open Access   (Followers: 17)
Bacterial Empire     Open Access   (Followers: 1)
Bacteriology Journal     Open Access   (Followers: 2)
Bacteriophage     Full-text available via subscription   (Followers: 2)
Bangladesh Journal of Bioethics     Open Access  
Bangladesh Journal of Plant Taxonomy     Open Access  
Bangladesh Journal of Scientific Research     Open Access  
Berita Biologi     Open Access  
Between the Species     Open Access   (Followers: 2)
BIO Web of Conferences     Open Access  
Bio-Grafía. Escritos sobre la Biología y su enseñanza     Open Access  
Bio-Lectura     Open Access  
BIO-SITE : Biologi dan Sains Terapan     Open Access  
Bioactive Compounds in Health and Disease     Open Access  
Biocatalysis and Biotransformation     Hybrid Journal   (Followers: 5)
BioCentury Innovations     Full-text available via subscription   (Followers: 2)
Biochemistry and Cell Biology     Hybrid Journal   (Followers: 18)
Biochimie     Hybrid Journal   (Followers: 4)
BioControl     Hybrid Journal   (Followers: 2)
Biocontrol Science and Technology     Hybrid Journal   (Followers: 5)
Biodemography and Social Biology     Hybrid Journal   (Followers: 1)
BIODIK : Jurnal Ilmiah Pendidikan Biologi     Open Access  
BioDiscovery     Open Access   (Followers: 2)
Biodiversitas : Journal of Biological Diversity     Open Access   (Followers: 2)
Biodiversity : Research and Conservation     Open Access   (Followers: 30)
Biodiversity Data Journal     Open Access   (Followers: 8)
Biodiversity Informatics     Open Access   (Followers: 3)
Biodiversity Information Science and Standards     Open Access   (Followers: 2)
Biodiversity Observations     Open Access   (Followers: 2)
Bioeduca : Journal of Biology Education     Open Access   (Followers: 1)
Bioeduscience     Open Access   (Followers: 2)
Bioeksperimen : Jurnal Penelitian Biologi     Open Access  
Bioelectrochemistry     Hybrid Journal   (Followers: 1)
Bioelectromagnetics     Hybrid Journal   (Followers: 1)
Bioenergy Research     Hybrid Journal   (Followers: 3)
Bioengineering and Bioscience     Open Access   (Followers: 1)
BioEssays     Hybrid Journal   (Followers: 11)
Bioethica     Open Access   (Followers: 1)
Bioethics     Hybrid Journal   (Followers: 21)
BioéthiqueOnline     Open Access   (Followers: 1)
Biogeographia : The Journal of Integrative Biogeography     Open Access   (Followers: 2)
Biogeosciences (BG)     Open Access   (Followers: 17)
Biogeosciences Discussions (BGD)     Open Access   (Followers: 4)
Bioinformatics     Hybrid Journal   (Followers: 283)
Bioinformatics Advances : Journal of the International Society for Computational Biology     Open Access   (Followers: 3)
Bioinformatics and Biology Insights     Open Access   (Followers: 13)
Biointerphases     Open Access   (Followers: 1)
Biojournal of Science and Technology     Open Access  
BioLink : Jurnal Biologi Lingkungan, Industri, Kesehatan     Open Access  
Biologia     Hybrid Journal   (Followers: 1)
Biologia Futura     Hybrid Journal  
Biologia on-line : Revista de divulgació de la Facultat de Biologia     Open Access  
Biological Bulletin     Partially Free   (Followers: 6)
Biological Control     Hybrid Journal   (Followers: 6)

Bioinformatics Advances : Journal of the International Society for Computational Biology
Number of Followers: 3  

  This is an Open Access journal
ISSN (Online) 2635-0041
Published by Oxford University Press  [425 journals]
  • Correction: promor: a comprehensive R package for label-free proteomics data analysis and predictive modeling

    • First page: vbad041
      Abstract: This is a correction to: Chathurani Ranathunge, Sagar S. Patel, Lubna Pinky, Vanessa L. Correll, Shimin Chen, O. John Semmes, Robert K. Armstrong, C. Donald Combs, Julius O. Nyalwidhe, promor: a comprehensive R package for label-free proteomics data analysis and predictive modeling, Bioinformatics Advances, Volume 3, Issue 1, 2023, vbad025, https://doi.org/10.1093/bioadv/vbad025
      PubDate: Thu, 06 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad041
      Issue No: Vol. 3, No. 1 (2023)
       
  • NRRS: a re-tracing strategy to refine neuron reconstruction

    • First page: vbad054
      Abstract: It is crucial to develop accurate and reliable algorithms for fine reconstruction of neural morphology from whole-brain image datasets. Although involving human experts in the reconstruction process helps to ensure the quality and accuracy of the reconstructions, automated refinement algorithms are necessary to handle the substantial deviations of reconstructed branches and bifurcation points that arise from the large-scale and high-dimensional nature of the image data. Our proposed Neuron Reconstruction Refinement Strategy (NRRS) is a novel approach to address the problem of deviation errors in neuron morphology reconstruction. Our method partitions the reconstruction into fixed-size segments and resolves the deviation problems by re-tracing in two steps. We also validate the performance of our method using a synthetic dataset. Our results show that NRRS outperforms existing solutions and can handle most deviation errors. We apply our method to the SEU-ALLEN/BICCN dataset containing 1741 complete neuron reconstructions and achieve remarkable improvements in the accuracy of the neuron skeleton representation, radius estimation and axonal bouton detection. Our findings demonstrate the critical role of NRRS in refining neuron morphology reconstruction. Availability and implementation: The proposed refinement method is implemented as a Vaa3D plugin and the source code is available in the repository under vaa3d_tools/hackathon/Levy/refinement. The original fMOST images of mouse brains can be found at the BICCN’s Brain Image Library (BIL) (https://www.brainimagelibrary.org). The synthetic dataset is hosted on GitHub (https://github.com/Vaa3D/vaa3d_tools/tree/master/hackathon/Levy/refinement). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 18 May 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad054
      Issue No: Vol. 3, No. 1 (2023)
       
  • Identification of representative species-specific genes for abundance measurements

    • First page: vbad060
      Abstract: Motivation: Metagenomic binning facilitates the reconstruction of genomes and identification of Metagenomic Species Pan-genomes or Metagenomic Assembled Genomes. We propose a method for identifying a set of de novo representative genes, termed signature genes, which can be used to measure the relative abundance of each metagenomic species with high accuracy and to serve as its markers. Results: An initial set of the 100 genes that correlate with the median gene abundance profile of the entity is selected. A variant of the coupon collector’s problem is used to evaluate the probability of identifying a certain number of unique genes in a sample. This allows us to reject the abundance measurements of strains exhibiting a significantly skewed gene representation. A rank-based negative binomial model is employed to assess the performance of different gene sets across a large set of samples, facilitating identification of an optimal signature gene set for the entity. When benchmarked on a synthetic gene catalog, our optimized signature gene sets estimate relative abundance significantly closer to the true relative abundance than the starting gene sets extracted from the metagenomic species. The method was able to replicate results from a study with real data and identify around three times as many metagenomic entities. Availability and implementation: The code used for the analysis is available on GitHub: https://github.com/trinezac/SG_optimization. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 08 May 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad060
      Issue No: Vol. 3, No. 1 (2023)
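
      Note: The coupon-collector calculation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the textbook setting in which every read lands on one of the signature genes uniformly at random; the variant actually used by the authors is not described in the abstract, and the function names below are hypothetical.

```python
from math import comb


def prob_distinct(n_genes: int, n_reads: int, k: int) -> float:
    """P(exactly k distinct genes are hit) when n_reads reads fall
    uniformly at random on n_genes genes (classic occupancy problem)."""
    if k < 0 or k > min(n_genes, n_reads):
        return 0.0
    # Inclusion-exclusion: number of ways n_reads reads can cover
    # a fixed set of k genes with none of them left empty.
    onto = sum((-1) ** j * comb(k, j) * (k - j) ** n_reads
               for j in range(k + 1))
    return comb(n_genes, k) * onto / n_genes ** n_reads


def expected_distinct(n_genes: int, n_reads: int) -> float:
    """Expected number of distinct genes hit under the same assumption."""
    return n_genes * (1.0 - (1.0 - 1.0 / n_genes) ** n_reads)


if __name__ == "__main__":
    n, reads = 100, 150  # 100 signature genes, 150 mapped reads (toy numbers)
    print(f"expected distinct genes: {expected_distinct(n, reads):.1f}")
    # Lower-tail probability of seeing at most 60 distinct genes.
    tail = sum(prob_distinct(n, reads, k) for k in range(61))
    print(f"P(at most 60 distinct genes): {tail:.3g}")
```

      A sample whose number of detected genes falls far into the lower tail of this distribution would be a candidate for rejection under such a scheme.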
       
  • TRviz: a Python library for decomposing and visualizing tandem repeat sequences

    • First page: vbad058
      Abstract: Summary: TRviz is an open-source Python library for decomposing, encoding, aligning and visualizing tandem repeat (TR) sequences. TRviz takes a collection of alleles (TR-containing sequences) and one or more motifs as input and generates a plot showing the motif composition of the TR sequences. Availability and implementation: TRviz is an open-source Python library and freely available at https://github.com/Jong-hun-Park/trviz. Detailed documentation is available at https://trviz.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 26 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad058
      Issue No: Vol. 3, No. 1 (2023)
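
      Note: To make the idea of decomposing an allele into motifs concrete, here is a small sketch. It is a simple greedy left-to-right matcher written for illustration only; it does not use the TRviz API and is not claimed to be the algorithm TRviz implements. The function name `decompose` is hypothetical.

```python
def decompose(sequence: str, motifs: list[str]) -> list[str]:
    """Greedily split a tandem-repeat allele into known motifs.

    At each position the longest matching motif is consumed; bases that
    match no motif are emitted as single-character tokens.
    """
    ordered = sorted((m for m in motifs if m), key=len, reverse=True)
    parts = []
    i = 0
    while i < len(sequence):
        for motif in ordered:
            if sequence.startswith(motif, i):
                parts.append(motif)
                i += len(motif)
                break
        else:
            # No motif matches here: keep the base as its own token.
            parts.append(sequence[i])
            i += 1
    return parts


if __name__ == "__main__":
    print(decompose("CAGCAGCAACAG", ["CAG", "CAA"]))
    print(decompose("CAGTCAG", ["CAG"]))
```

      A greedy matcher can pick a poor split when motifs overlap; it is used here only because it is short enough to read at a glance.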
       
  • PGPointNovo: an efficient neural network-based tool for parallel de novo peptide sequencing

    • First page: vbad057
      Abstract: Summary: De novo peptide sequencing for tandem mass spectrometry data is not only a key technology for novel peptide identification, but also a prerequisite for many downstream tasks, such as vaccine and antibody studies. In recent years, neural network models for de novo peptide sequencing have shown a remarkable ability to accommodate various data sources and have outperformed conventional peptide identification tools. However, such a model is computationally expensive, taking up to 1 week to process about 400 000 spectra. This article presents PGPointNovo, a novel neural network-based tool for parallel de novo peptide sequencing. PGPointNovo uses data parallelization to accelerate training and inference and addresses the training obstacles caused by large batch sizes. Extensive experiments on multiple datasets of different sizes demonstrate that, compared with PointNovo, an excellent neural network-based de novo peptide sequencing tool, PGPointNovo accelerates de novo peptide sequencing by up to 7.35× without compromising precision or recall. Availability and implementation: The source code and the parameter settings are available at https://github.com/shallFun4Learning/PGPointNovo. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Tue, 25 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad057
      Issue No: Vol. 3, No. 1 (2023)
       
  • RBAtools: a programming interface for Resource Balance Analysis models

    • First page: vbad056
      Abstract: Motivation: Efficient resource allocation can contribute to an organism’s fitness and can improve evolutionary success. Resource Balance Analysis (RBA) is a computational framework that models an organism’s growth-optimal proteome configurations in various environments. RBA software enables the construction of RBA models on genome scale and the calculation of medium-specific, growth-optimal cell states including metabolic fluxes and the abundance of macromolecular machines. However, existing software lacks a simple programming interface that is easy for non-expert users to use and interoperable with other software. Results: The Python package RBAtools provides convenient access to RBA models. As a flexible programming interface, it enables the implementation of custom workflows and the modification of existing genome-scale RBA models. Its high-level functions comprise simulation, model fitting, parameter screens, sensitivity analysis, variability analysis and the construction of Pareto fronts. Models and data are represented as structured tables and can be exported to common data formats for fluxomics and proteomics visualization. Availability and implementation: RBAtools documentation, installation instructions and tutorials are available at https://sysbioinra.github.io/rbatools/. General information about RBA and related software can be found at rba.inrae.fr.
      PubDate: Sat, 22 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad056
      Issue No: Vol. 3, No. 1 (2023)
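
      Note: The abstract above lists the construction of Pareto fronts among the package's high-level functions. As a generic illustration of what a Pareto front is, the sketch below filters a set of candidate solutions so that only non-dominated ones remain, assuming every objective is to be maximized. It does not call RBAtools, and the names `dominates` and `pareto_front` are hypothetical.

```python
from typing import Sequence

Point = tuple[float, ...]


def dominates(q: Point, p: Point) -> bool:
    """True if q is at least as good as p in every objective and
    strictly better in at least one (all objectives maximized)."""
    return (all(qi >= pi for qi, pi in zip(q, p))
            and any(qi > pi for qi, pi in zip(q, p)))


def pareto_front(points: Sequence[Point]) -> list[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Toy trade-off between two objectives, e.g. growth rate and product yield.
    candidates = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (2.0, 2.0), (1.0, 1.0)]
    print(pareto_front(candidates))
```

      The quadratic scan is adequate for a handful of simulated cell states; larger screens would call for a sort-based method.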
       
  • CREPE: a Shiny app for transcription factor cataloguing

    • First page: vbad055
      Abstract: Summary: Transcription factors (TFs) are proteins that directly interpret the genome to regulate gene expression and determine cellular phenotypes. TF identification is a common first step in unraveling gene regulatory networks. We present CREPE, an R Shiny app to catalogue and annotate TFs. CREPE was benchmarked against curated human TF datasets. Next, we use CREPE to explore the TF repertoires of Heliconius erato and Heliconius melpomene butterflies. Availability and implementation: CREPE is available as a Shiny app package on GitHub (github.com/dirostri/CREPE). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 21 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad055
      Issue No: Vol. 3, No. 1 (2023)
       
  • HMMerge: an ensemble method for multiple sequence alignment

    • First page: vbad052
      Abstract: Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution) remains an inadequately solved problem. Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given ‘backbone’ alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new ‘merged’ HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments. Availability and implementation: HMMerge is freely available at https://github.com/MinhyukPark/HMMerge. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 17 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad052
      Issue No: Vol. 3, No. 1 (2023)
       
  • CanIsoNet: a database to study the functional impact of isoform switching events in diseases

    • First page: vbad050
      Abstract: Motivation: Alternative splicing, as an essential regulatory mechanism in normal mammalian cells, is frequently disturbed in cancer and other diseases. Switches in the expression of the most dominant alternative isoforms can alter protein interaction networks of associated genes, giving rise to disease and disease progression. Here, we present CanIsoNet, a database to view, browse and search isoform switching events in diseases. CanIsoNet is the first webserver that incorporates isoform expression data with STRING interaction networks and ClinVar annotations to predict the pathogenic impact of isoform switching events in various diseases. Results: Data in CanIsoNet can be browsed by disease or searched by genes or isoforms in annotation-rich data tables. Various annotations for 11 811 isoforms and 14 357 unique isoform switching events across 31 different disease types are available. The network density score for each disease-specific isoform, PFAM domain IDs of disrupted interactions, domain structure visualization of transcripts and expression data of switched isoforms for each sample are given. Additionally, the genes annotated in ClinVar are highlighted in interactive interaction networks. Availability and implementation: CanIsoNet is freely available at https://www.caniso.net. The source code can be found under a Creative Commons License at https://github.com/kahramanlab/CanIsoNet_Web. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 17 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad050
      Issue No: Vol. 3, No. 1 (2023)
       
  • ACDA: implementation of an augmented drug synergy prediction algorithm

    • First page: vbad051
      Abstract: Motivation: Drug synergy prediction is approached with machine learning techniques using molecular and pharmacological data. The published Cancer Drug Atlas (CDA) predicts a synergy outcome in cell-line models from drug target information, gene mutations and the models’ monotherapy drug sensitivity. We observed low performance of the CDA, 0.339, measured by Pearson correlation of predicted versus measured sensitivity on DrugComb datasets. Results: We augmented the CDA approach by applying random forest regression and optimization via cross-validation hyper-parameter tuning and named it Augmented CDA (ACDA). We benchmarked the ACDA’s performance, which is 68% higher than that of the CDA when trained and validated on the same dataset spanning 10 tissues. We compared the performance of ACDA to one of the winning methods of the DREAM Drug Combination Prediction Challenge, the performance of which was lower than ACDA in 16 out of 19 cases. We further trained the ACDA on Novartis Institutes for BioMedical Research PDX encyclopedia data and generated sensitivity predictions for PDX models. Finally, we developed a novel approach to visualize synergy-prediction data. Availability and implementation: The source code is available at https://github.com/TheJacksonLaboratory/drug-synergy and the software package at PyPI. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 13 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad051
      Issue No: Vol. 3, No. 1 (2023)
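
      Note: The performance figure quoted in the abstract above is a Pearson correlation between predicted and measured values. For readers unfamiliar with the metric, a minimal pure-Python sketch follows; the function name `pearson` and the toy numbers are illustrative and are not taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    predicted = [0.10, 0.35, 0.40, 0.80, 0.95]  # toy model outputs
    measured = [0.20, 0.30, 0.55, 0.70, 0.90]   # toy measured sensitivities
    print(f"r = {pearson(predicted, measured):.3f}")
```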
       
  • nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data

    • First page: vbad048
      Abstract: Motivation: Although machine learning models are commonly used in medical research, many analyses implement a simple partition into training data and hold-out test data, with cross-validation (CV) for tuning of model hyperparameters. Nested CV with embedded feature selection is especially suited to biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P ≫ n). Results: The nestedcv R package implements fully nested k × l-fold CV for lasso and elastic-net regularized linear models via the glmnet package and supports a large array of other machine learning models via the caret framework. Inner CV is used to tune models and outer CV is used to determine model performance without bias. Fast filter functions for feature selection are provided and the package ensures that filters are nested within the outer CV loop to avoid information leakage from performance test sets. Measurement of performance by outer CV is also used to implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model and determine unbiased model accuracy. Availability and implementation: The R package nestedcv is available from CRAN: https://CRAN.R-project.org/package=nestedcv.
      PubDate: Thu, 13 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad048
      Issue No: Vol. 3, No. 1 (2023)
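
      Note: The control flow of nested k × l-fold CV with the feature filter kept inside the outer loop, as described in the abstract above, can be sketched as follows. This is a language-neutral skeleton written in Python rather than R; it does not use the nestedcv package, and `k_fold`, `nested_cv` and the `select` and `score` callables are hypothetical placeholders for a real filter and model.

```python
from typing import Callable, Iterator, Sequence


def k_fold(indices: Sequence[int], k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(indices[i::k]) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_cv(n_samples: int, outer_k: int, inner_k: int,
              params: Sequence, select: Callable, score: Callable) -> list[float]:
    """Outer loop estimates performance; inner loop tunes the hyperparameter.

    select(train_idx) returns the chosen features and only ever sees
    training indices, so held-out samples cannot leak into the filter.
    score(param, features, train_idx, test_idx) fits on train, scores on test.
    """
    outer_scores = []
    for outer_train, outer_test in k_fold(range(n_samples), outer_k):

        def inner_mean(param) -> float:
            scores = []
            for inner_train, inner_val in k_fold(outer_train, inner_k):
                feats = select(inner_train)  # filter refit per inner fold
                scores.append(score(param, feats, inner_train, inner_val))
            return sum(scores) / len(scores)

        best = max(params, key=inner_mean)
        feats = select(outer_train)  # refit the filter on the full outer-train set
        outer_scores.append(score(best, feats, outer_train, outer_test))
    return outer_scores
```

      The point of the structure is that `select` is never called with indices from the fold currently used for scoring, which is the leakage the abstract warns about.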
       
  • Latent disease similarities and therapeutic repurposing possibilities uncovered by multi-modal generative topic modeling of human diseases

    • First page: vbad047
      Abstract: Motivation: Human diseases are characterized by multiple features such as their pathophysiological, molecular and genetic changes. The rapid expansion of such multi-modal disease-omics space provides an opportunity to re-classify diverse human diseases and to uncover their latent molecular similarities, which could be exploited to repurpose a therapeutic target for one disease to another. Results: Herein, we probe this underexplored space by soft-clustering 6955 human diseases by multi-modal generative topic modeling. Focusing on chronic kidney disease and myocardial infarction, two of the most life-threatening diseases, we unveil their previously underrecognized molecular similarities to neoplasia and mental/neurological disorders, and 69 repurposable therapeutic targets for these diseases. Using an edit-distance-based pathway classifier, we also find molecular pathways by which these targets could elicit their clinical effects. Importantly, for 17 of the targets, the evidence for their therapeutic usefulness is retrospectively found in the pre-clinical and clinical space, illustrating the effectiveness of the method and suggesting its broader applications across diverse human diseases. Availability and implementation: The code reported in this article is available at: https://github.com/skozawa170301ktx/MultiModalDiseaseModeling. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 12 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad047
      Issue No: Vol. 3, No. 1 (2023)
       
  • SAINT-Angle: self-attention augmented inception-inside-inception network
           and transfer learning improve protein backbone torsion angle prediction

    • First page: vbad042
      Abstract: Motivation: Protein structure provides insight into how proteins interact with one another as well as their functions in living organisms. Protein backbone torsion angles (ϕ and ψ) prediction is a key sub-problem in predicting protein structures. However, reliable determination of backbone torsion angles using conventional experimental methods is slow and expensive. Therefore, considerable effort is being put into developing computational methods for predicting backbone angles. Results: We present SAINT-Angle, a highly accurate method for predicting protein backbone torsion angles using a self-attention-based deep learning network called SAINT, which was previously developed for the protein secondary structure prediction. We extended and improved the existing SAINT architecture as well as used transfer learning to predict backbone angles. We compared the performance of SAINT-Angle with the state-of-the-art methods through an extensive evaluation study on a collection of benchmark datasets, namely, TEST2016, TEST2018, TEST2020-HQ, CAMEO and CASP. The experimental results suggest that our proposed self-attention-based network, together with transfer learning, has achieved notable improvements over the best alternate methods. Availability and implementation: SAINT-Angle is freely available as an open-source project at https://github.com/bayzidlab/SAINT-Angle. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 05 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad042
      Issue No: Vol. 3, No. 1 (2023)
       
  • AGRN: accurate gene regulatory network inference using ensemble machine
           learning methods

    • First page: vbad032
      Abstract: Motivation: Biological processes are regulated by underlying genes and their interactions that form gene regulatory networks (GRNs). Dysregulation of these GRNs can cause complex diseases such as cancer, Alzheimer’s and diabetes. Hence, accurate GRN inference is critical for elucidating gene function, allowing for the faster identification and prioritization of candidate genes for functional investigation. Several statistical and machine learning-based methods have been developed to infer GRNs based on biological and synthetic datasets. Here, we developed a method named AGRN that infers GRNs by employing an ensemble of machine learning algorithms. Results: From the idea that a single method may not perform well on all datasets, we calculate the gene importance scores using three machine learning methods: random forest, extra tree and support vector regressors. We calculate the importance scores from Shapley Additive Explanations, a recently published method to explain machine learning models. We have found that the importance scores from Shapley values perform better than the traditional importance scoring methods based on almost all the benchmark datasets. We have analyzed the performance of AGRN using the datasets from the DREAM4 and DREAM5 challenges for GRN inference. The proposed method, AGRN, an ensemble machine learning method with Shapley values, outperforms the existing methods both in the DREAM4 and DREAM5 datasets. With improved accuracy, we believe that AGRN inferred GRNs would enhance our mechanistic understanding of biological processes in health and disease. Availability and implementation: https://github.com/DuaaAlawad/AGRN. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Wed, 05 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad032
      Issue No: Vol. 3, No. 1 (2023)
       
  • Mpox Knowledge Graph: a comprehensive representation embedding chemical
           entities and associated biology of Mpox

    • First page: vbad045
      Abstract: Summary: The outbreak of Mpox virus (MPXV) infection in May 2022 was declared a global health emergency by WHO. A total of 84 330 cases have been confirmed as of 5 January 2023 and the numbers are on the rise. The MPXV pathophysiology and its underlying mechanisms are unfortunately not yet understood. Likewise, the knowledge of biochemicals and drugs used against MPXV and their downstream effects is sparse. In this work, using Knowledge Graph (KG) representations we have depicted chemical and biological aspects of MPXV. To achieve this, we have collected and rationally assembled several biological study results, assays, drug candidates and pre-clinical evidence to form a dynamic and comprehensive network. The KG is compliant with FAIR annotations allowing seamless transformation and integration to/with other formats and infrastructures. Availability and implementation: The programmatic scripts for Mpox KG are publicly available at https://github.com/Fraunhofer-ITMP/mpox-kg. It is hosted publicly at https://doi.org/10.18119/N9SG7D. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 03 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad045
      Issue No: Vol. 3, No. 1 (2023)
       
  • Different approaches to Imaging Mass Cytometry data analysis

    • First page: vbad046
      Abstract: Summary: Imaging Mass Cytometry (IMC) is a novel, high multiplexing imaging platform capable of simultaneously detecting and visualizing up to 40 different protein targets. It is a strong asset available for in-depth study of histology and pathophysiology of the tissues. Bearing in mind the robustness of this technique and the high spatial context of the data it gives, it is especially valuable in studying the biology of cancer and tumor microenvironment. IMC-derived data are not classical micrographic images, and due to the characteristics of the data obtained using IMC, the image analysis approach, in this case, can diverge to a certain degree from the classical image analysis pipelines. As the number of publications based on IMC rises, so does the number of available methodologies designed solely for IMC-derived data analysis. This review aims to give a systematic synopsis of all the available classical image analysis tools and pipelines useful for IMC data analysis and an overview of tools developed solely for this purpose, making it easier for researchers to select the most suitable methodologies for a specific type of analysis.
      PubDate: Mon, 03 Apr 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad046
      Issue No: Vol. 3, No. 1 (2023)
       
  • Peak Pair Pruner: a post-processing software to MS-DIAL for peak pair
           validation and ratio quantification of isotopic labeling LC-MS(/MS) data

    • First page: vbad044
      Abstract: Motivation: Isotopic labeling is an essential relative quantification strategy in mass spectrometry-based metabolomics, ideal for studying large cohorts by minimizing common sources of variations in quantitation. MS-DIAL is a free and popular general metabolomics platform that has isotopic labeling data processing capabilities but lacks features provided by other software specialized for isotopic labeling data analysis, such as isotopic pair validation and tabular light-to-heavy peak ratio reporting. Results: We developed Peak Pair Pruner (PPP), a standalone Python program for post-processing of MS-DIAL alignment matrixes. PPP provides these missing features and innovation including isotopic overlap subtraction based on a light-tagged pool sample as quality control. The MS-DIAL+PPP workflow for isotopic labeling-based metabolomics data processing was validated using light and heavy dansylated amino acid standard mixture and metabolite extract from human plasma. Availability and implementation: Peak Pair Pruner is freely available on GitHub: https://github.com/QibinZhangLab/Peak_Pair_Pruner. Raw MS data and .ibf files analyzed are on Metabolomics Workbench with Study ID ST002427. Contact: q_zhang2@uncg.edu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 27 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad044
      Issue No: Vol. 3, No. 1 (2023)
       
  • iEnhancer-ELM: improve enhancer identification by extracting
           position-related multiscale contextual information based on enhancer
           language models

    • First page: vbad043
      Abstract: Motivation: Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences. Results: In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related with their positions via a multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. For a case study, we discover 30 enhancer motifs via a 3-mer-based model, where 12 of the motifs are verified by STREME and JASPAR, demonstrating our model has a potential ability to unveil the biological mechanism of enhancer. Availability and implementation: The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Sat, 25 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad043
      Issue No: Vol. 3, No. 1 (2023)
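The multi-scale k-mer tokenization mentioned in the abstract above can be illustrated with a short Python sketch. It assumes overlapping k-mers with a stride of one base and an arbitrary choice of scales; the function names `kmer_tokens` and `multiscale_tokens` are hypothetical, and the sketch does not reproduce the iEnhancer-ELM code or its attention model.

```python
def kmer_tokens(sequence, k):
    """Return the overlapping k-mers (stride 1) of a DNA sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def multiscale_tokens(sequence, ks=(3, 4, 5, 6)):
    """Map each scale k to its list of k-mer tokens (scales are assumed)."""
    return {k: kmer_tokens(sequence, k) for k in ks}


tokens = multiscale_tokens("ACGTAC", ks=(3, 4))
print(tokens[3])  # ['ACG', 'CGT', 'GTA', 'TAC']
print(tokens[4])  # ['ACGT', 'CGTA', 'GTAC']
```

Each token keeps its position in the list, which is the positional information a language model would then attend over at each scale.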
       
  • EasyCellType: marker-based cell-type annotation by automatically querying
           multiple databases

    • First page: vbad029
      Abstract: Motivation: Cell label annotation is a challenging step in the analysis of single-cell RNA sequencing (scRNA-seq) data, especially for tissue types that are less commonly studied. The accumulation of scRNA-seq studies and biological knowledge leads to several well-maintained cell marker databases. Manually examining the cell marker lists against these databases can be difficult due to the large amount of available information. Additionally, simply overlapping the two lists without considering gene ranking might lead to unreliable results. Thus, an automated method with careful statistical testing is needed to facilitate the usage of these databases. Results: We develop a user-friendly computational tool, EasyCellType, which automatically checks an input marker list obtained by differential expression analysis against the databases and provides annotation recommendations in graphical outcomes. The package provides two statistical tests, gene set enrichment analysis and a modified version of Fisher’s exact test, as well as customized database and tissue type choices. We also provide an interactive shiny application to annotate cells in a user-friendly graphical user interface. The simulation study and real-data applications demonstrate favorable results by the proposed method. Availability and implementation: https://biostatistics.mdanderson.org/shinyapps/EasyCellType/; https://bioconductor.org/packages/devel/bioc/html/EasyCellType.html. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 24 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad029
      Issue No: Vol. 3, No. 1 (2023)
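For orientation, the overlap test that the abstract above builds on can be written in a few lines of Python. This sketch implements the standard one-sided Fisher's exact test (a hypergeometric tail probability), not the modified version provided by EasyCellType, which the abstract does not specify; the function name `fisher_enrichment_p` and the gene counts in the example are made up for illustration.

```python
from math import comb


def fisher_enrichment_p(overlap, n_query, n_reference, n_background):
    """One-sided Fisher's exact test for marker-list overlap.

    Returns P(X >= overlap), where X is the number of reference markers
    found when n_query genes are drawn at random from n_background genes,
    n_reference of which are reference markers.
    """
    max_overlap = min(n_query, n_reference)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_reference, x) * comb(n_background - n_reference, n_query - x)
        for x in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 database markers for a cell
# type, 4 query markers from differential expression, 3 of them shared.
p_value = fisher_enrichment_p(overlap=3, n_query=4, n_reference=5, n_background=20)
print(round(p_value, 4))  # 0.032
```

Because this plain test ignores the ranking of the query genes, it illustrates exactly the limitation the abstract raises about simply overlapping two lists.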
       
  • HTRX: an R package for learning non-contiguous haplotypes associated with
           a phenotype

    • First page: vbad038
      Abstract: Summary: Haplotype Trend Regression with eXtra flexibility (HTRX) is an R package to learn sets of interacting features that explain variance in a phenotype. Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with complex traits and diseases, but finding the true causal signal from a high linkage disequilibrium block is challenging. We focus on the simpler task of quantifying the total variance explainable not just with main effects but also interactions and tagging, using haplotype-based associations. HTRX identifies haplotypes composed of non-contiguous SNPs associated with a phenotype and can naturally be performed on regions with a GWAS hit before or after fine-mapping. To reduce the space and computational complexity when investigating many features, we constrain the search by growing good feature sets using ‘Cumulative HTRX’, and limit the maximum complexity of a feature set. As the computational time scales linearly with the number of SNPs, HTRX has the potential to be applied to large chromosome regions. Availability and implementation: HTRX is implemented in R and is available under GPL-3 licence from CRAN (https://cran.r-project.org/web/packages/HTRX/readme/README.html). The development version is maintained on GitHub (https://github.com/YaolingYang/HTRX). Contact: yaoling.yang@bristol.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 23 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad038
      Issue No: Vol. 3, No. 1 (2023)
       
  • The Epilepsy Ontology: a community-based ontology tailored for semantic
           interoperability and text mining

    • First page: vbad033
      Abstract: Motivation: Epilepsy is a multifaceted complex disorder that requires a precise understanding of the classification, diagnosis, treatment and disease mechanism governing it. Although scattered resources are available on epilepsy, comprehensive and structured knowledge is missing. To promote multidisciplinary knowledge exchange and facilitate advancement in clinical management, especially in pre-clinical research, a disease-specific ontology is necessary. The presented ontology is designed to enable better interconnection between scientific community members in the epilepsy domain. Results: The Epilepsy Ontology (EPIO) is an assembly of structured knowledge on various aspects of epilepsy, developed according to Basic Formal Ontology (BFO) and Open Biological and Biomedical Ontology (OBO) Foundry principles. Concepts and definitions are collected from the latest International League against Epilepsy (ILAE) classification, domain-specific ontologies and scientific literature. This ontology consists of 1879 classes and 28 151 axioms (2171 declaration axioms, 2219 logical axioms) from several aspects of epilepsy. This ontology is intended to be used for data management and text mining purposes. Availability and implementation: The current release of the ontology is publicly available under a Creative Commons 4.0 License and shared via http://purl.obolibrary.org/obo/epso.owl and is a community-based effort assembling various facets of the complex disease. The ontology is also deposited in BioPortal at https://bioportal.bioontology.org/ontologies/EPIO. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 23 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad033
      Issue No: Vol. 3, No. 1 (2023)
       
  • TB-ML—a framework for comparing machine learning approaches to predict
           drug resistance of Mycobacterium tuberculosis

    • First page: vbad040
      Abstract: Motivation: Machine learning (ML) has shown impressive performance in predicting antimicrobial resistance (AMR) from sequence data, including for Mycobacterium tuberculosis, the causative agent of tuberculosis. However, current ML development and publication practices make it difficult for researchers and clinicians to use, test or reproduce published models. Results: We packaged a number of published and unpublished ML models for predicting AMR of M. tuberculosis into Docker containers. Similarly, the pipelines required for pre-processing genomic data into the formats required by the models were also packaged into separate containers. By following a minimal container I/O standard, we ensured as much interoperability as possible. We also created a command-line application, TB-ML, which can be used to easily combine pre-processing and prediction containers into complete pipelines ready for predicting resistance from novel, raw data with a single command. As long as there is adherence to this minimal standard for the container interface, containers produced by researchers holding new models can likewise be included in these pipelines, making benchmark comparisons of different models simple and facilitating faster uptake in the clinic. Availability and implementation: TB-ML contains a simple Docker API written in Python and is available at https://github.com/jodyphelan/tb-ml. Example Docker containers for resistance prediction and corresponding data pre-processing as well as a tutorial on how to create new containers for TB-ML are available at https://tb-ml.github.io/tb-ml-containers/. Contact: jody.phelan@lshtm.ac.uk
      PubDate: Thu, 23 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad040
      Issue No: Vol. 3, No. 1 (2023)
       
  • Computational speed-up of large-scale, single-cell model simulations via a
           fully integrated SBML-based format

    • First page: vbad039
      Abstract: Summary: Large-scale and whole-cell modeling has multiple challenges, including scalable model building and module communication bottlenecks (e.g. between metabolism, gene expression, signaling, etc.). We previously developed an open-source, scalable format for a large-scale mechanistic model of proliferation and death signaling dynamics, but communication bottlenecks between gene expression and protein biochemistry modules remained. Here, we developed two solutions to communication bottlenecks that speed up simulation by ∼4-fold for hybrid stochastic-deterministic simulations and by over 100-fold for fully deterministic simulations. Fully deterministic speed-up facilitates model initialization, parameter estimation and sensitivity analysis tasks. Availability and implementation: Source code is freely available at https://github.com/birtwistlelab/SPARCED/releases/tag/v1.3.0, implemented in Python, and supported on Linux, Windows and MacOS (via Docker).
      PubDate: Thu, 23 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad039
      Issue No: Vol. 3, No. 1 (2023)
       
  • easyPheno: An easy-to-use and easy-to-extend Python framework for
           phenotype prediction using Bayesian optimization

    • First page: vbad035
      Abstract: Summary: Predicting complex traits from genotypic information is a major challenge in various biological domains. With easyPheno, we present a comprehensive Python framework enabling the rigorous training, comparison and analysis of phenotype predictions for a variety of different models, ranging from common genomic selection approaches over classical machine learning and modern deep learning-based techniques. Our framework is easy-to-use, also for non-programming-experts, and includes an automatic hyperparameter search using state-of-the-art Bayesian optimization. Moreover, easyPheno provides various benefits for bioinformaticians developing new prediction models. easyPheno makes it possible to quickly integrate novel models and functionalities in a reliable framework and to benchmark against various integrated prediction models in a comparable setup. In addition, the framework allows the assessment of newly developed prediction models under pre-defined settings using simulated data. We provide detailed documentation with various hands-on tutorials and videos explaining the usage of easyPheno to novice users. Availability and implementation: easyPheno is publicly available at https://github.com/grimmlab/easyPheno and can be easily installed as a Python package via https://pypi.org/project/easypheno/ or using Docker. Comprehensive documentation including various tutorials complemented with videos can be found at https://easypheno.readthedocs.io/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 22 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad035
      Issue No: Vol. 3, No. 1 (2023)
       
  • Identification of a gene expression signature associated with breast
           cancer survival and risk that improves clinical genomic platforms

    • First page: vbad037
      Abstract: Motivation: Modern genomic technologies allow us to perform genome-wide analysis to find gene markers associated with the risk and survival in cancer patients. Accurate risk prediction and patient stratification based on robust gene signatures is a key path forward in personalized treatment and precision medicine. Several authors have proposed the identification of gene signatures to assign risk in patients with breast cancer (BRCA), and some of these signatures have been implemented within commercial platforms in the clinic, such as Oncotype and Prosigna. However, these platforms are black boxes in which the influence of selected genes as survival markers is unclear and where the risk scores provided cannot be clearly related to the standard clinicopathological tumor markers obtained by immunohistochemistry (IHC), which guide clinical and therapeutic decisions in breast cancer. Results: Here, we present a framework to discover a robust list of gene expression markers associated with survival that can be biologically interpreted in terms of the three main biomolecular factors (IHC clinical markers: ER, PR and HER2) that define clinical outcome in BRCA. To test and ensure the reproducibility of the results, we compiled and analyzed two independent datasets with a large number of tumor samples (1024 and 879) that include full genome-wide expression profiles and survival data. Using these two cohorts, we obtained a robust subset of gene survival markers that correlate well with the major IHC clinical markers used in breast cancer. The geneset of survival markers that we identify (which includes 34 genes) significantly improves the risk prediction provided by the genesets included in the commercial platforms: Oncotype (16 genes) and Prosigna (50 genes, i.e. PAM50). Furthermore, some of the genes identified have recently been proposed in the literature as new prognostic markers and may deserve more attention in current clinical trials to improve breast cancer risk prediction. Availability and implementation: All data integrated and analyzed in this research will be available on GitHub (https://github.com/jdelasrivas-lab/breastcancersurvsign), including the R scripts and protocols used for the analyses. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 22 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad037
      Issue No: Vol. 3, No. 1 (2023)
       
  • Federated learning framework integrating REFINED CNN and Deep Regression
           Forests

    • First page: vbad036
      Abstract: Summary: Predictive learning from medical data incurs additional challenges due to concerns over privacy and security of personal data. Federated learning, intentionally structured to preserve a high level of privacy, is emerging as an attractive way to generate cross-silo predictions in medical scenarios. However, the impact of severe population-level heterogeneity on federated learners is not well explored. In this article, we propose a methodology to detect the presence of population heterogeneity in federated settings and propose a solution to handle such heterogeneity by developing a federated version of Deep Regression Forests. Additionally, we demonstrate that the recently conceptualized REpresentation of Features as Images with NEighborhood Dependencies CNN framework can be combined with the proposed Federated Deep Regression Forests to provide improved performance as compared to existing approaches. Availability and implementation: The Python source code for reproducing the main results is available on GitHub: https://github.com/DanielNolte/FederatedDeepRegressionForests. Contact: ranadip.pal@ttu.edu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 22 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad036
      Issue No: Vol. 3, No. 1 (2023)
       
  • Motif elucidation in ChIP-seq datasets with a knockout control

    • First page: vbad031
      Abstract: Summary: Chromatin immunoprecipitation-sequencing is widely used to find transcription factor binding sites, but suffers from various sources of noise. Knocking out the target factor mitigates noise by acting as a negative control. Paired wild-type and knockout (KO) experiments can generate improved motifs but require optimal differential analysis. We introduce peaKO, a computational method to automatically optimize motif analyses with KO controls, which we compare to two other methods. PeaKO often improves elucidation of the target factor and highlights the benefits of KO controls, which far outperform input controls. Availability and implementation: PeaKO is freely available at https://peako.hoffmanlab.org. Contact: michael.hoffman@utoronto.ca
      PubDate: Thu, 16 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad031
      Issue No: Vol. 3, No. 1 (2023)
       
  • Inferring the heritability of bacterial traits in the era of machine
           learning

    • First page: vbad027
      Abstract: Quantification of heritability is a fundamental desideratum in genetics, which allows an assessment of the contribution of additive genetic variation to the variability of a trait of interest. The traditional computational approaches for assessing the heritability of a trait have been developed in the field of quantitative genetics. However, the rise of modern population genomics with large sample sizes has led to the development of several new machine learning-based approaches to inferring heritability. In this article, we systematically summarize recent advances in machine learning which can be used to infer heritability. We focus on an application of these methods to bacterial genomes, where heritability plays a key role in understanding phenotypes such as antibiotic resistance and virulence, which are particularly important due to the rising frequency of antimicrobial resistance. By designing a heritability model incorporating realistic patterns of genome-wide linkage disequilibrium for a frequently recombining bacterial pathogen, we test the performance of a wide spectrum of different inference methods, including also GCTA. In addition to the synthetic data benchmark, we present a comparison of the methods for antibiotic resistance traits for multiple bacterial pathogens. Insights from the benchmarking and real data analyses indicate a highly variable performance of the different methods and suggest that heritability inference would likely benefit from tailoring of the methods to the specific genetic architecture of the target organism. Availability and implementation: The R codes and data used in the numerical experiments are available at: https://github.com/tienmt/her_MLs.
      PubDate: Tue, 14 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad027
      Issue No: Vol. 3, No. 1 (2023)
       
  • scAnnotate: an automated cell-type annotation tool for single-cell
           RNA-sequencing data

    • First page: vbad030
      Abstract: Motivation: Single-cell RNA-sequencing (scRNA-seq) technology enables researchers to investigate a genome at the cellular level with unprecedented resolution. An organism consists of a heterogeneous collection of cell types, each of which plays a distinct role in various biological processes. Hence, the first step of scRNA-seq data analysis is often to distinguish cell types so they can be investigated separately. Researchers have recently developed several automated cell-type annotation tools, requiring neither biological knowledge nor subjective human decisions. Dropout is a crucial characteristic of scRNA-seq data widely used in differential expression analysis. However, no current cell annotation method explicitly utilizes dropout information. Fully utilizing dropout information motivated this work. Results: We present scAnnotate, a cell annotation tool that fully utilizes dropout information. We model every gene’s marginal distribution using a mixture model, which describes both the dropout proportion and the distribution of the non-dropout expression levels. Then, using an ensemble machine learning approach, we combine the mixture models of all genes into a single model for cell-type annotation. This combining approach can avoid estimating numerous parameters in the high-dimensional joint distribution of all genes. Using 14 real scRNA-seq datasets, we demonstrate that scAnnotate is competitive against nine existing annotation methods. Furthermore, because of its distinct modelling strategy, scAnnotate’s misclassified cells differ greatly from competitor methods. This suggests using scAnnotate together with other methods could further improve annotation accuracy. Availability and implementation: We implemented scAnnotate as an R package and made it publicly available from CRAN: https://cran.r-project.org/package=scAnnotate. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 13 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad030
      Issue No: Vol. 3, No. 1 (2023)
       
  • Phylostems: a new graphical tool to investigate temporal signal of
           heterochronous sequences datasets

    • First page: vbad026
      Abstract: Motivation: Molecular tip-dating of phylogenetic trees is a growing discipline that uses DNA sequences sampled at different points in time to co-estimate the timing of evolutionary events with rates of molecular evolution. Importantly, such inferences should only be performed on datasets displaying sufficient temporal signal, a feature important to test prior to any tip-dating inference. For this purpose, the most popular method considered to date has been the ‘root-to-tip regression’, which consists of fitting a linear regression of the number of substitutions accumulated from the root to the tips of a phylogenetic tree as a function of sampling times. The main limitation of the regression method, in its current implementation, lies in the fact that the temporal signal can only be tested at the whole-tree scale (i.e. its root). Results: To overcome this limitation we introduce Phylostems, a new graphical user-friendly tool developed to investigate temporal signal within every clade of a phylogenetic tree. We provide a ‘how to’ guide by running Phylostems on an empirical dataset and supply guidance for results interpretation. Availability and implementation: Phylostems is freely available at https://pvbmt-apps.cirad.fr/apps/phylostems.
      PubDate: Mon, 13 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad026
      Issue No: Vol. 3, No. 1 (2023)
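The root-to-tip regression described in the abstract above amounts to an ordinary least-squares fit of root-to-tip distance against sampling time. The Python sketch below is a generic illustration of that fit, not the Phylostems implementation; the function name `root_to_tip_regression` and the four data points are invented for the example, and a real analysis would read the distances from a rooted tree.

```python
def root_to_tip_regression(sampling_times, distances):
    """Fit distance = rate * time + intercept by ordinary least squares.

    Returns (rate, intercept, r_squared, root_date). root_date is the
    x-intercept, i.e. the time at which the fitted distance is zero; it
    is None when the slope is not positive (no usable temporal signal).
    """
    n = len(sampling_times)
    mean_t = sum(sampling_times) / n
    mean_d = sum(distances) / n
    s_tt = sum((t - mean_t) ** 2 for t in sampling_times)
    s_dd = sum((d - mean_d) ** 2 for d in distances)
    s_td = sum((t - mean_t) * (d - mean_d)
               for t, d in zip(sampling_times, distances))
    rate = s_td / s_tt
    intercept = mean_d - rate * mean_t
    r_squared = (s_td ** 2) / (s_tt * s_dd) if s_dd > 0 else 0.0
    root_date = -intercept / rate if rate > 0 else None
    return rate, intercept, r_squared, root_date


# Made-up tips: sampling years and substitutions per site from the root.
years = [2000, 2005, 2010, 2015]
dists = [0.010, 0.015, 0.020, 0.025]
rate, intercept, r_squared, root_date = root_to_tip_regression(years, dists)
```

A positive slope with a high R² is usually read as evidence of temporal signal. Applying the same fit to the tips below an internal node, with distances measured from that node, gives the kind of clade-level check the abstract motivates.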
       
  • Predicting phenotypes from novel genomic markers using deep learning

    • First page: vbad028
      Abstract: Summary: Genomic selection (GS) models use single nucleotide polymorphism (SNP) markers to predict phenotypes. However, these predictive models face challenges due to the high dimensionality of genome-wide SNP marker data. Thanks to recent breakthroughs in DNA sequencing and decreased sequencing cost, the study of novel genomic variants such as structural variations (SVs) and transposable elements (TEs) has become increasingly prevalent. In this article, we develop a deep convolutional neural network model, NovGMDeep, to predict phenotypes using SV and TE markers for GS. The proposed model is trained and tested on samples of Arabidopsis thaliana and Oryza sativa using k-fold cross-validation. The prediction accuracy is evaluated using Pearson’s Correlation Coefficient (PCC), mean absolute error (MAE) and SD of MAE. The predicted results showed higher correlation when the model is trained with SVs and TEs than with SNPs. NovGMDeep also has higher prediction accuracy compared with conventional statistical models. This work sheds light on the unappreciated function of SVs and TEs in genotype-to-phenotype associations, as well as their extensive significance and value in crop development.
      PubDate: Thu, 09 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad028
      Issue No: Vol. 3, No. 1 (2023)
       
  • ProFeatMap: a highly customizable tool for 2D feature representation of
           protein sets

    • First page: vbad022
      Abstract: Motivation: Studies of sets of proteins are a central point in biology. In particular, the application of omics in the last decades has generated lists of several hundreds or thousands of proteins or genes. However, these lists are often not inspected globally, possibly due to the lack of tools capable of simultaneously visualizing the feature architectures of a large number of proteins. Results: Here, we present ProFeatMap, an intuitive Python-based website. For a given set of proteins, it displays features such as domains, repeats, disorder or post-translational modifications, and their organization along the sequences, in a highly customizable 2D map. Starting from a user-defined protein list of UniProt accession codes, ProFeatMap extracts the most important annotated features available for each protein from one of the well-established databases such as UniProt or InterPro, allocates shapes and colors, potentially depending on quantitative or qualitative data, and sorts the protein list based on homologous feature content. The resulting publication-quality map allows even large protein families to be explored and classified based on shared features. It can help to gain insights, for example into feature redundancy or feature patterns, that were previously overlooked. ProFeatMap is freely available on the web at: https://profeatmap.pythonanywhere.com/. Availability and implementation: Source code is freely accessible at https://github.com/profeatmap/ProFeatMap under the GPL license. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 09 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad022
      Issue No: Vol. 3, No. 1 (2023)
       
  • promor: a comprehensive R package for label-free proteomics data analysis
           and predictive modeling

    • First page: vbad025
      Abstract: Summary: We present promor, a comprehensive, user-friendly R package that streamlines label-free quantification proteomics data analysis and building machine learning-based predictive models with top protein candidates. Availability and implementation: promor is freely available as an open source R package on the Comprehensive R Archive Network (CRAN) (https://CRAN.R-project.org/package=promor) and distributed under the Lesser General Public License (version 2.1 or later). Development version of promor is maintained on GitHub (https://github.com/caranathunge/promor) and additional documentation and tutorials are provided on the package website (https://caranathunge.github.io/promor/). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Tue, 07 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad025
      Issue No: Vol. 3, No. 1 (2023)
       
  • WITCH-NG: efficient and accurate alignment of datasets with sequence
           length heterogeneity

    • First page: vbad024
      Abstract: Summary: Multiple sequence alignment is a basic part of many bioinformatics pipelines, including in phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) by a polynomial time exact algorithm using Smith–Waterman. Our new method, WITCH-NG (i.e. ‘next generation WITCH’) achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG. Availability and implementation: The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 06 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad024
      Issue No: Vol. 3, No. 1 (2023)
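
For readers unfamiliar with the Smith–Waterman algorithm named in the abstract above, the following is a minimal textbook sketch of its local-alignment scoring recurrence. It is not the WITCH-NG code: the function name and the scoring parameters (match, mismatch, linear gap penalty) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Textbook dynamic programme with a linear gap penalty; runs in
    O(len(a) * len(b)) time and keeps only two rows in memory.
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere in the table.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell is filled exactly once, the search is exact and polynomial in the sequence lengths, which is the property the abstract contrasts with a heuristic search.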
       
  • GPTree Cluster: phylogenetic tree cluster generator in the context of
           supertree inference

    • First page: vbad023
      Abstract: Summary: For many years, evolutionary and molecular biologists have been working with phylogenetic supertrees, which are oriented acyclic graph structures. In the standard approaches, supertrees are obtained by concatenating a set of phylogenetic trees defined on different but overlapping sets of taxa (i.e. species). More recent approaches propose alternative solutions for supertree inference. The testing of new metrics for comparing supertrees and adapting clustering algorithms to overlapping phylogenetic trees with different numbers of leaves requires large amounts of data. In this context, designing a new approach and developing a computer program to generate phylogenetic tree clusters with different numbers of overlapping leaves are key elements to advance research on phylogenetic supertrees and evolution. The main objective of the project is to propose a new approach to simulate clusters of phylogenetic trees defined on different, but mutually overlapping, sets of taxa, with biological events. The proposed generator can be used to generate a certain number of clusters of phylogenetic trees in Newick format with a variable number of leaves and with a defined level of overlap between trees in clusters. Availability and implementation: A Python script version 3.7, called GPTree Cluster, which implements the discussed approach, is freely available at: https://github.com/tahiri-lab/GPTree/tree/GPTreeCluster
      PubDate: Fri, 03 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad023
      Issue No: Vol. 3, No. 1 (2023)
       
  • Exploiting parallelization in positional Burrows–Wheeler transform
           (PBWT) algorithms for efficient haplotype matching and compression

    • First page: vbad021
      Abstract: Summary: The positional Burrows–Wheeler transform (PBWT) data structure allows for efficient haplotype data matching and compression. Its performance makes it a powerful tool for bioinformatics. However, existing algorithms do not exploit parallelism due to inner dependencies. We introduce a new method to break the dependencies and show how to fully exploit modern multi-core processors. Availability and implementation: Source code and applications are available at https://github.com/rwk-unil/parallel_pbwt. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 02 Mar 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad021
      Issue No: Vol. 3, No. 1 (2023)
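
The "inner dependencies" mentioned in the abstract above are easiest to see in the basic sequential construction of the PBWT positional prefix arrays. The sketch below follows the standard sequential construction for biallelic (0/1) haplotypes. It is not the parallel method from the article, and the function name and input layout are assumptions.

```python
def pbwt_prefix_arrays(haplotypes):
    """Build the positional prefix arrays of a PBWT, one per site boundary.

    haplotypes: list of equal-length lists of 0/1 alleles (assumed layout).
    arrays[k] lists haplotype indices sorted by their reversed prefixes
    ending just before site k.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))
    arrays = [order[:]]
    for k in range(n):
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        # The ordering at site k + 1 is derived from the ordering at site k,
        # which is the sequential dependency that hinders parallelism.
        order = zeros + ones
        arrays.append(order[:])
    return arrays


# Three haplotypes over two sites.
arrays = pbwt_prefix_arrays([[0, 1], [1, 0], [0, 0]])
```

Each iteration reads the ordering produced by the previous one, so the sites cannot simply be processed independently. That is the obstacle the article sets out to remove.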
       
  • MonoNet: enhancing interpretability in neural networks via monotonic
           features

    • First page: vbad016
      Abstract: Motivation: Being able to interpret and explain the predictions made by a machine learning model is of fundamental importance. Unfortunately, a trade-off between accuracy and interpretability is often observed. As a result, the interest in developing more transparent yet powerful models has grown considerably over the past few years. Interpretable models are especially needed in high-stakes scenarios, such as computational biology and medical informatics, where erroneous or biased models’ predictions can have deleterious consequences for a patient. Furthermore, understanding the inner workings of a model can help increase the trust in the model. Results: We introduce a novel structurally constrained neural network, MonoNet, which is more transparent, while still retaining the same learning capabilities of traditional neural models. MonoNet contains monotonically connected layers that ensure monotonic relationships between (high-level) features and outputs. We show how, by leveraging the monotonic constraint in conjunction with other post hoc strategies, we can interpret our model. To demonstrate our model’s capabilities, we train MonoNet to classify cellular populations in a single-cell proteomic dataset. We also demonstrate MonoNet’s performance in other benchmark datasets in different domains, including non-biological applications (in the Supplementary Material). Our experiments show how our model can achieve good performance, while providing at the same time useful biological insights about the most important biomarkers. We finally carry out an information-theoretical analysis to show how the monotonic constraint actively contributes to the learning process of the model. Availability and implementation: Code and sample data are available at https://github.com/phineasng/mononet. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 23 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad016
      Issue No: Vol. 3, No. 1 (2023)
       
  • Benchmarking of microbiome detection tools on RNA-seq synthetic databases
           according to diverse conditions

    • First page: vbad014
      Abstract: Motivation: Here, we performed a benchmarking analysis of five tools for microbe sequence detection using transcriptomics data (Kraken2, MetaPhlAn2, PathSeq, DRAC and Pandora). We built a synthetic database mimicking real-world structure with tuned conditions accounting for microbe species prevalence, base calling quality and sequence length. Sensitivity and positive predictive value (PPV) parameters, as well as computational requirements, were used for tool ranking. Results: GATK PathSeq showed the highest sensitivity on average and across all scenarios considered. However, the main drawback of this tool was its slowness. Kraken2 was the fastest tool and displayed the second-best sensitivity, though with large variance depending on the species to be classified. There was no significant difference in sensitivity among the other three algorithms. The sensitivity of MetaPhlAn2 and Pandora was affected by sequence number, and that of DRAC by sequence quality and length. Results from this study support the use of Kraken2 for routine microbiome profiling based on its competitive sensitivity and runtime performance. Nonetheless, we strongly recommend complementing it with MetaPhlAn2 for thorough taxonomic analyses. Availability and implementation: https://github.com/fjuradorueda/MIME/ and https://github.com/lola4/DRAC/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 22 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad014
      Issue No: Vol. 3, No. 1 (2023)
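
The two ranking metrics named in the abstract above, sensitivity and positive predictive value (PPV), can be computed as follows. This is a minimal sketch that assumes evaluation at the level of species sets. The abstract does not state the unit of evaluation used in the study, and the function name and example species labels are hypothetical.

```python
def sensitivity_and_ppv(true_species, detected_species):
    """Return (sensitivity, ppv) for one tool on one synthetic sample.

    sensitivity = TP / (TP + FN): share of truly present species detected.
    ppv         = TP / (TP + FP): share of detections that are correct.
    Species-level comparison is an assumption made for illustration.
    """
    truth = set(true_species)
    detected = set(detected_species)
    tp = len(truth & detected)
    fn = len(truth - detected)
    fp = len(detected - truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Made-up example: four species present, three reported, two of them correct.
sens, ppv = sensitivity_and_ppv({"A", "B", "C", "D"}, {"A", "B", "E"})
```

A tool can score well on one metric and poorly on the other, which is why benchmarks of this kind usually report both.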
       
  • recountmethylation enables flexible analysis of public blood DNA
           methylation array data

    • First page: vbad020
      Abstract: Summary: Thousands of DNA methylation (DNAm) array samples from human blood are publicly available on the Gene Expression Omnibus (GEO), but they remain underutilized for experiment planning, replication and cross-study and cross-platform analyses. To facilitate these tasks, we augmented our recountmethylation R/Bioconductor package with 12 537 uniformly processed EPIC and HM450K blood samples on GEO as well as several new features. We subsequently used our updated package in several illustrative analyses, finding (i) study ID bias adjustment increased variation explained by biological and demographic variables, (ii) most variation in autosomal DNAm was explained by genetic ancestry and CD4+ T-cell fractions and (iii) the dependence of power to detect differential methylation on sample size was similar for each of peripheral blood mononuclear cells (PBMC), whole blood and umbilical cord blood. Finally, we used PBMC and whole blood to perform independent validations, and we recovered 38–46% of differentially methylated probes between sexes from two previously published epigenome-wide association studies. Availability and implementation: Source code to reproduce the main results is available on GitHub (repo: recountmethylation_flexible-blood-analysis_manuscript; url: https://github.com/metamaden/recountmethylation_flexible-blood-analysis_manuscript). All data were publicly available and downloaded from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/). Compilations of the analyzed public data can be accessed from the website recount.bio/data (preprocessed HM450K array data: https://recount.bio/data/remethdb_h5se-gm_epic_0-0-2_1589820348/; preprocessed EPIC array data: https://recount.bio/data/remethdb_h5se-gm_epic_0-0-2_1589820348/). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 20 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad020
      Issue No: Vol. 3, No. 1 (2023)
       
  • iSC.MEB: an R package for multi-sample spatial clustering analysis of
           spatial transcriptomics data

    • First page: vbad019
      Abstract: Summary: Emerging spatially resolved transcriptomics (SRT) technologies are powerful in measuring gene expression profiles while retaining tissue spatial localization information and typically provide data from multiple tissue sections. We have previously developed the tool SC.MEB, an empirical Bayes approach for SRT data analysis using a hidden Markov random field. Here, we introduce an extension to SC.MEB, denoted as integrated spatial clustering with hidden Markov random field using empirical Bayes (iSC.MEB), that permits users to simultaneously estimate the batch effect and perform spatial clustering for low-dimensional representations of multiple SRT datasets. We demonstrate that iSC.MEB can provide accurate cell/domain detection results using two SRT datasets. Availability and implementation: iSC.MEB is implemented in an open-source R package, and source code is freely available at https://github.com/XiaoZhangryy/iSC.MEB. Documentation and vignettes are provided on our package website (https://xiaozhangryy.github.io/iSC.MEB/index.html). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 17 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad019
      Issue No: Vol. 3, No. 1 (2023)
       
  • LAVAA: a lightweight association viewer across ailments

    • First page: vbad018
      Abstract: Motivation: Biobank-scale genetic association results over thousands of traits can be difficult to visualize and navigate. Results: We have created LAVAA, a visualization web-application to generate genetic volcano plots for simultaneously considering the P-value, effect size, case counts, trait class and fine-mapping posterior probability at a single-nucleotide polymorphism (SNP) across a range of traits from a large set of genome-wide association studies. We find that user interaction with association results in LAVAA can enrich and enhance the biological interpretation of individual loci. Availability and implementation: LAVAA is available as a stand-alone web service (https://geneviz.aalto.fi/LAVAA/) and will be available in future releases of the finngen.fi website starting with release 10 in late 2023.
      PubDate: Wed, 15 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad018
      Issue No: Vol. 3, No. 1 (2023)
       
  • baseLess: lightweight detection of sequences in raw MinION data

    • First page: vbad017
      Abstract: Summary: With its candybar form factor and low initial investment cost, the MinION brought affordable portable nucleic acid analysis within reach. However, translating the electrical signal it outputs into a sequence of bases still requires mid-tier computer hardware, which remains a caveat when aiming for deployment of many devices at once or usage in remote areas. For applications focusing on detection of a target sequence, such as infectious disease monitoring or species identification, the computational cost of analysis may be reduced by directly detecting the target sequence in the electrical signal instead. Here, we present baseLess, a computational tool that enables such target-detection-only analysis. BaseLess makes use of an array of small neural networks, each of which efficiently detects a fixed-size subsequence of the target sequence directly from the electrical signal. We show that baseLess can accurately determine the identity of reads between three closely related fish species and can classify sequences in mixtures of 20 bacterial species, on an inexpensive single-board computer. Availability and implementation: baseLess and all code used in data preparation and validation are available on GitHub at https://github.com/cvdelannoy/baseLess, under an MIT license. Used validation data and scripts can be found at https://doi.org/10.4121/20261392, under an MIT license. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 15 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad017
      Issue No: Vol. 3, No. 1 (2023)
       
  • DISCO+QR: rooting species trees in the presence of GDL and ILS

    • First page: vbad015
      Abstract: Motivation: Genes evolve under processes such as gene duplication and loss (GDL), so that gene family trees are multi-copy, as well as incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS. Results: We present DISCO+QR, a new approach to rooting species trees that first uses DISCO to address GDL and then uses QR to perform rooting in the presence of ILS. DISCO+QR operates by taking the input gene family trees and decomposing them into single-copy trees using DISCO, and then rooting the given species tree with QR using the information in the single-copy gene trees. We show that the relative accuracy of STRIDE and DISCO+QR depends on the properties of the dataset (number of species, genes, rate of gene duplication, degree of ILS and gene tree estimation error), and that each provides advantages over the other under some conditions. Availability and implementation: DISCO and QR are available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Tue, 07 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad015
      Issue No: Vol. 3, No. 1 (2023)
       
  • NSPA: characterizing the disease association of multiple genetic
           interactions at single-subject resolution

    • First page: vbad010
      Abstract: Motivation: The interaction between genetic variables is one of the major barriers to characterizing the genetic architecture of complex traits. To consider epistasis, network science approaches are increasingly being used in research to elucidate the genetic architecture of complex diseases. Network science approaches associate genetic variables’ disease susceptibility with their topological importance in the network. However, this network only represents genetic interactions and does not describe how these interactions contribute to disease association at the subject scale. We propose the Network-based Subject Portrait Approach (NSPA) and an accompanying feature transformation method to determine the collective risk impact of multiple genetic interactions for each subject. Results: The feature transformation method converts genetic variants of subjects into new values that capture how genetic variables interact with others to contribute to a subject’s disease association. We apply this approach to synthetic and genetic datasets and learn that (1) the disease association can be captured using multiple disjoint sets of genetic interactions and (2) the feature transformation method based on NSPA improves predictive performance compared with using the original genetic variables. Our findings confirm the role of genetic interaction in complex disease and provide a novel approach for gene–disease association studies to identify genetic architecture in the context of epistasis. Availability and implementation: The code for NSPA is available at https://github.com/MIB-Lab/Network-based-Subject-Portrait-Approach. Contact: ting.hu@queensu.ca. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Tue, 07 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad010
      Issue No: Vol. 3, No. 1 (2023)
       
  • G-RANK: an equivariant graph neural network for the scoring of
           protein–protein docking models

    • First page: vbad011
      Abstract: Motivation: Protein complex structure prediction is important for many applications in bioengineering. A widely used method for predicting the structure of protein complexes is computational docking. Although many tools for scoring protein–protein docking models have been developed, it is still a challenge to accurately identify near-native models for unknown protein complexes. A recently proposed model called the geometric vector perceptron–graph neural network (GVP-GNN), a subtype of equivariant graph neural networks, has demonstrated success in various 3D molecular structure modeling tasks. Results: Herein, we present G-RANK, a GVP-GNN-based method for the scoring of protein–protein docking models. When evaluated on two different test datasets, G-RANK achieved a performance competitive with or better than the state-of-the-art scoring functions. We expect G-RANK to be a useful tool for various applications in biological engineering. Availability and implementation: Source code is available at https://github.com/ha01994/grank. Contact: kds@kaist.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 03 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad011
      Issue No: Vol. 3, No. 1 (2023)
       
  • An immune-suppressing protein in human endogenous retroviruses

    • First page: vbad013
      Abstract: Motivation: Retroviruses are important contributors to disease and evolution in vertebrates. Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV). Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions. Results: Some primate ERVs appear to encode an overlooked protein. This protein is homologous to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now. The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4. We found homologs of MC132 in ERVs of apes, monkeys and bushbaby, but not tarsiers, lemurs or non-primates. This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses. Contact: mcfrith@edu.k.u-tokyo.ac.jp
      PubDate: Thu, 02 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad013
      Issue No: Vol. 3, No. 1 (2023)
       
  • Improving classification of correct and incorrect protein–protein
           docking models by augmenting the training set

    • First page: vbad012
      Abstract: Motivation: Protein–protein interactions drive many relevant biological events, such as infection, replication and recognition. To control or engineer such events, we need to access the molecular details of the interaction provided by experimental 3D structures. However, such experiments take time and are expensive; moreover, the current technology cannot keep up with the high discovery rate of new interactions. Computational modeling, like protein–protein docking, can help to fill this gap by generating docking poses. Protein–protein docking generally consists of two parts, sampling and scoring. The sampling is an exhaustive search of the three-dimensional space. The caveat of the sampling is that it generates a large number of incorrect poses, producing a highly unbalanced dataset. This limits the utility of the data to train machine learning classifiers. Results: Using weak supervision, we developed a data augmentation method that we named hAIkal. Using hAIkal, we increased the labeled training data to train several algorithms. We trained and obtained different classifiers; the best classifier has 81% accuracy and 0.51 Matthews’ correlation coefficient on the test set, surpassing the state-of-the-art scoring functions. Availability and implementation: Docking models from Benchmark 5 are available at https://doi.org/10.5281/zenodo.4012018. Processed tabular data are available at https://repository.kaust.edu.sa/handle/10754/666961. Google colab is available at https://colab.research.google.com/drive/1vbVrJcQSf6_C3jOAmZzgQbTpuJ5zC1RP?usp=sharing. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 02 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad012
      Issue No: Vol. 3, No. 1 (2023)
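
The abstract above reports both accuracy and Matthews' correlation coefficient (MCC) because docking pose datasets are highly unbalanced. The sketch below shows, with made-up confusion-matrix counts, why accuracy alone can mislead on such data. The function names and the numbers are assumptions for illustration and are not taken from the article.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all poses that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews' correlation coefficient from confusion-matrix counts.

    Ranges from -1 to 1; returns 0.0 when a marginal total is zero,
    a common convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up unbalanced set: 5 correct poses, 95 incorrect ones.
# A classifier that labels every pose "incorrect" looks accurate but learns nothing.
always_negative = dict(tp=0, tn=95, fp=0, fn=5)
acc = accuracy(**always_negative)
mcc = matthews_corrcoef(**always_negative)
```

In this toy case the accuracy is high while the MCC is zero, which illustrates why a balanced metric is informative when incorrect poses dominate.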
       
  • Snekmer: a scalable pipeline for protein sequence fingerprinting based on
           amino acid recoding

    • First page: vbad005
      Abstract: Motivation: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families. Results: Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from UniProt and demonstrate that our method accurately differentiates these sequences in both operation modes. Availability and implementation: Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 02 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad005
      Issue No: Vol. 3, No. 1 (2023)
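
The amino acid recoding (AAR) idea in the abstract above can be illustrated with a short sketch. The reduced alphabet below is a hypothetical grouping by rough chemical property, invented for this example. It is not one of Snekmer's alphabets, and the function names are likewise assumptions.

```python
from collections import Counter

# Hypothetical reduced alphabet: residues grouped by rough chemical property.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar uncharged
    "b": "KR",      # basic
    "a": "DE",      # acidic
    "s": "GP",      # small / special
}
RECODE_TABLE = {aa: code for code, residues in REDUCED_GROUPS.items() for aa in residues}


def recode(sequence):
    """Map each residue to its group letter; unknown residues become 'x'."""
    return "".join(RECODE_TABLE.get(aa, "x") for aa in sequence.upper())


def kmer_counts(sequence, k):
    """Count overlapping kmers of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Two peptides with no exact 2-mer in common share kmers after recoding.
raw_a, raw_b = kmer_counts("MKV", 2), kmer_counts("LRI", 2)
aar_a, aar_b = kmer_counts(recode("MKV"), 2), kmer_counts(recode("LRI"), 2)
```

Here the raw kmer profiles of the two peptides are disjoint, while their recoded profiles coincide. This is the sense in which recoding can link sequences that exact-match kmers would treat as unrelated.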
       
  • IntLIM 2.0: identifying multi-omic relationships dependent on discrete or
           continuous phenotypic measurements

    • First page: vbad009
      Abstract: Motivation: IntLIM uncovers phenotype-dependent linear associations between two types of analytes (e.g. genes and metabolites) in a multi-omic dataset, which may reflect chemically or biologically relevant relationships. Results: The new IntLIM R package includes newly added support for generalized data types, covariate correction, continuous phenotypic measurements, model validation and unit testing. IntLIM analysis uncovered biologically relevant gene–metabolite associations in two separate datasets, and the run time is improved over baseline R functions by multiple orders of magnitude. Availability and implementation: IntLIM is available as an R package with a detailed vignette (https://github.com/ncats/IntLIM) and as an R Shiny app (see Supplementary Figs S1–S6) (https://intlim.ncats.io/). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 01 Feb 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad009
      Issue No: Vol. 3, No. 1 (2023)
       
  • SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic
           placement

    • First page: vbad008
      Abstract: Summary: Phylogenetic placement is the problem of placing ‘query’ sequences into an existing tree (called a ‘backbone tree’). One of the most accurate phylogenetic placement methods to date is the maximum likelihood-based method pplacer, using RAxML to estimate numeric parameters on the backbone tree and then adding the given query sequence to the edge that maximizes the probability that the resulting tree generates the query sequence. Unfortunately, this way of running pplacer fails to return valid outputs on many moderately large backbone trees and so is limited to backbone trees with at most ∼10 000 leaves. SCAMPP is a technique to enable pplacer to run on larger backbone trees, which operates by finding a small ‘placement subtree’ specific to each query sequence, within which the query sequence is placed using pplacer. That approach matched the scalability and accuracy of APPLES-2, the previous most scalable method. Here, we explore a different aspect of pplacer’s strategy: the technique used to estimate numeric parameters on the backbone tree. We confirm anecdotal evidence that using FastTree instead of RAxML to estimate numeric parameters on the backbone tree enables pplacer to scale to much larger backbone trees, almost (but not quite) matching the scalability of APPLES-2 and pplacer-SCAMPP. We then evaluate the combination of these two techniques, SCAMPP and the use of FastTree. We show that this combined approach, pplacer-SCAMPP-FastTree, has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree and achieves better accuracy than the comparably scalable methods. Availability and implementation: https://github.com/gillichu/PLUSplacer-taxtastic. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 30 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad008
      Issue No: Vol. 3, No. 1 (2023)
       
  • Faltwerk: a library for spatial exploratory data analysis of protein
           structures

    • First page: vbad007
      Abstract: Summary: Proteins are fundamental building blocks of life and are investigated in a broad range of scientific fields, especially in the context of recent progress using in silico structure prediction models and the surge of resulting protein structures in public databases. However, exploratory data analysis of these proteins can be slow because of the need for several methods, ranging from geometric and spatial analysis to visualization. The Python library faltwerk provides an integrated toolkit to perform explorative work with rapid feedback. This toolkit includes support for protein complexes, spatial analysis (point density or spatial autocorrelation), ligand binding site prediction and an intuitive visualization interface based on the grammar of graphics. Availability and implementation: faltwerk is distributed under the permissive BSD-3 open source license. Source code and documentation, including an extensive common-use case tutorial, can be found at github.com/phiweger/faltwerk; binaries are available from the PyPI repository.
      PubDate: Mon, 23 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad007
      Issue No: Vol. 3, No. 1 (2023)
       
  • Pancancer survival prediction using a deep learning architecture with
           multimodal representation and integration

    • First page: vbad006
      Abstract: Motivation: Multi-omics data carry comprehensive signals about disease and are highly desirable for understanding and predicting disease progression, particularly for cancer, a serious disease with a high mortality rate. However, current methods fail to utilize multi-omics data effectively for cancer survival prediction, which significantly limits the accuracy of survival prediction using omics data. Results: In this work, we constructed a deep learning model with multimodal representation and integration to predict the survival of patients using multi-omics data. We first developed an unsupervised learning part to extract high-level feature representations from omics data of different modalities. Then, we used an attention-based method to integrate the feature representations produced by the unsupervised learning part into a single compact vector, and finally we fed the vector into fully connected layers for survival prediction. We used multimodal data to train the model and predict pancancer survival, and the results show that using multimodal data can lead to higher prediction accuracy than using single-modal data. Furthermore, we used the concordance index and 5-fold cross-validation to compare our proposed method with current state-of-the-art methods, and our results show that our model achieves better performance on the majority of cancer types in our testing datasets. Availability and implementation: https://github.com/ZhangqiJiang07/MultimodalSurvivalPrediction. Supplementary information: Supplementary data are available at Bioinformatics online.
      PubDate: Mon, 23 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad006
      Issue No: Vol. 3, No. 1 (2023)
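The abstract above evaluates survival models with the concordance index. As a rough illustration only, the sketch below computes Harrell's C-index in plain Python. It is not the authors' implementation: the function name and the toy inputs are assumptions made for this example, and pairs with tied follow-up times are simply skipped to keep the code short.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable patient pairs ranked correctly.

    times  -- observed follow-up time per patient
    events -- 1 if the event was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times. A pair is comparable only if
        # the earlier time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk failed first: correct order
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ranking, and censored patients only contribute to pairs in which the other patient's event came first.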
       
  • cvlr: finding heterogeneously methylated genomic regions using ONT reads

    • First page: vbac101
      Abstract: Summary: Nanopore reads encode information on the methylation status of cytosines in CpG dinucleotides. The length of the reads makes it comparatively easy to look at patterns consisting of multiple loci; here, we exploit this property to search for regions where one can define subpopulations of molecules based on methylation patterns. As an example, we run our clustering algorithm on known imprinted genes; we also scan chromosome 15 looking for windows corresponding to heterogeneous methylation. Our software can also compute the covariance of methylation across these regions while taking into account the mixture of different types of reads. Availability and implementation: https://github.com/EmanueleRaineri/cvlr. Contact: simon.heath@cnag.crg.eu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 23 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac101
      Issue No: Vol. 3, No. 1 (2023)
       
  • A cloud-based pipeline for analysis of FHIR and long-read data

    • First page: vbac095
      Abstract: Motivation: As genome sequencing becomes cheaper and more accurate, it is becoming increasingly viable to merge this data with electronic health information to inform clinical decisions. Results: In this work, we demonstrate a full pipeline for working with both PacBio sequencing data and clinical FHIR® data, from initial data to tertiary analysis. The electronic health records are stored in FHIR® (Fast Healthcare Interoperability Resource) format, the current leading standard for healthcare data exchange. For the genomic data, we perform variant calling on long-read PacBio HiFi data using Cromwell on Azure. Both data formats are parsed, processed and merged in a single scalable pipeline which securely performs tertiary analyses using cloud-based Jupyter notebooks. We include three example applications: exporting patient information to a database, clustering patients and performing a simple pharmacogenomic study. Availability and implementation: https://github.com/microsoft/genomicsnotebook/tree/main/fhirgenomics. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 20 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac095
      Issue No: Vol. 3, No. 1 (2023)
       
  • OmicsTIDE: interactive exploration of trends in multi-omics data

    • First page: vbac093
      Abstract: Motivation: The increasing amount of data produced by omics technologies has enabled researchers to study phenomena across multiple omics layers. Besides data-driven analysis strategies, interactive visualization tools have been developed for a more transparent analysis. However, most state-of-the-art tools do not reconstruct the impact of a single omics layer on the integration result. Results: We developed a data classification scheme focusing on different aspects of multi-omics datasets for a systemic understanding. Based on this classification, we developed the Omics Trend-comparing Interactive Data Explorer (OmicsTIDE), an interactive visualization tool for the comparison of gene-based quantitative omics data. The tool consists of a computational part that clusters omics datasets to determine trends and an interactive visualization. The trends are visualized as profile plots and are connected by a Sankey diagram that allows for an interactive pairwise trend comparison to discover concordant and discordant trends. Moreover, large-scale omics datasets are broken down into small subsets that can be analyzed functionally using Gene Ontology enrichment within a few analysis steps. We demonstrate the interactive analysis using OmicsTIDE with two case studies focusing on different experimental designs. Availability and implementation: OmicsTIDE is a web tool available via http://omicstide-tuevis.cs.uni-tuebingen.de/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 20 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac093
      Issue No: Vol. 3, No. 1 (2023)
       
  • scMEGA: single-cell multi-omic enhancer-based gene regulatory network
           inference

    • First page: vbad003
      Abstract: Summary: The increasing availability of single-cell multi-omics data allows gene regulation to be characterized quantitatively. Here we describe scMEGA (Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference), which enables an end-to-end analysis of multi-omics data for gene regulatory network inference, including modalities integration, trajectory analysis, enhancer-to-promoter association, network analysis and visualization. This enables the study of complex gene regulation mechanisms for dynamic biological processes, such as cellular differentiation and disease-driven cellular remodeling. We provide a case study on gene regulatory networks controlling myofibroblast activation in human myocardial infarction. Availability and implementation: scMEGA is implemented in R, released under the MIT license and available from https://github.com/CostaLab/scMEGA. Tutorials are available from https://costalab.github.io/scMEGA. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 12 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad003
      Issue No: Vol. 3, No. 1 (2023)
       
  • Deep learning predicts the impact of regulatory variants on
           cell-type-specific enhancers in the brain

    • First page: vbad002
      Abstract: Motivation: Previous studies have shown that the heritability of multiple brain-related traits and disorders is highly enriched in transcriptional enhancer regions. However, these regions often contain many individual variants, while only a subset of them are likely to causally contribute to a trait. Statistical fine-mapping techniques can identify putative causal variants, but their resolution is often limited, especially in regions with multiple variants in high linkage disequilibrium. In these cases, alternative computational methods to estimate the impact of individual variants can aid in variant prioritization. Results: Here, we develop a deep learning pipeline to predict cell-type-specific enhancer activity directly from genomic sequences and quantify the impact of individual genetic variants in these regions. We show that the variants highlighted by our deep learning models are targeted by purifying selection in the human population, likely indicating a functional role. We integrate our deep learning predictions with statistical fine-mapping results for 8 brain-related traits, identifying 63 distinct candidate causal variants predicted to contribute to these traits by modulating enhancer activity, representing 6% of all genome-wide association study signals analyzed. Overall, our study provides a valuable computational method that can prioritize individual variants based on their estimated regulatory impact, but also highlights the limitations of existing methods for variant prioritization and fine-mapping. Availability and implementation: The data underlying this article, nucleotide-level importance scores, and code for running the deep learning pipeline are available at https://github.com/Pandaman-Ryan/AgentBind-brain. Contact: mgymrek@ucsd.edu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Thu, 12 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad002
      Issue No: Vol. 3, No. 1 (2023)
       
  • Applications of transformer-based language models in bioinformatics: a
           survey

    • First page: vbad001
      Abstract: Summary: Transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research from basic sequence analysis to drug discovery. While transformer-based applications in bioinformatics are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, and opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development in transformer-based language models, and inspire novel bioinformatics applications that are unattainable by traditional methods. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Wed, 11 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbad001
      Issue No: Vol. 3, No. 1 (2023)
       
  • Learning from small medical data—robust semi-supervised cancer prognosis
           classifier with Bayesian variational autoencoder

    • First page: vbac100
      Abstract: Motivation: Cancer is one of the world’s leading causes of mortality, and its prognosis is hard to predict due to complicated biological interactions among heterogeneous data types. Numerous challenges, such as censorship, high dimensionality and small sample size, prevent researchers from using deep learning models for precise prediction. Results: We propose a robust Semi-supervised Cancer prognosis classifier with bAyesian variational autoeNcoder (SCAN) as a structured machine-learning framework for cancer prognosis prediction. SCAN incorporates semi-supervised learning for predicting 5-year disease-specific survival and overall survival in breast and non-small cell lung cancer (NSCLC) patients, respectively. SCAN achieved significantly better AUROC scores than all existing benchmarks (81.73% for breast cancer; 80.46% for NSCLC), including our previously proposed bimodal neural network classifiers (77.71% for breast cancer; 78.67% for NSCLC). Independent validation results showed that SCAN still achieved better AUROC scores (74.74% for breast; 72.80% for NSCLC) than the bimodal neural network classifiers (64.13% for breast; 67.07% for NSCLC). SCAN is general and can potentially be trained on more patient data. This lays the foundation for personalized medicine for early cancer risk screening. Availability and implementation: The source codes reproducing the main results are available on GitHub: https://gitfront.io/r/user-4316673/36e8714573f3fbfa0b24690af5d1a9d5ca159cf4/scan/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 09 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac100
      Issue No: Vol. 3, No. 1 (2023)
       
  • GlobeCorr: interactive globe-based visualization for correlation datasets

    • First page: vbac099
      Abstract: Motivation: Increasingly complex omics datasets are being generated, along with associated diverse categories of metadata (environmental, clinical, etc.). Looking at the correlation between these variables can be critical to identify potential confounding factors and novel relationships. To date, some correlation globe software has been developed to aid investigations; however, these tools lack secure, dynamic visualization capability. Results: GlobeCorr.ca is a web-based application designed to provide user-friendly, interactive visualization and analysis of correlation datasets. Users load tabular data listing pairwise variables and their correlation values, and GlobeCorr creates a dynamic visualization using ribbons to represent positive and negative correlations, optionally grouped by domain/category (such as microbiome taxa against other metadata). GlobeCorr runs securely (locally on a user’s computer) and provides a simple method for users to visualize and summarize complex datasets. This tool is applicable to a wide range of disciplines and domains of interest, including the bioinformatics/microbiome and metadata examples provided within. Availability and implementation: See https://GlobeCorr.ca; code is provided under an open source MIT license: https://github.com/brinkmanlab/globecorr.
      PubDate: Fri, 06 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac099
      Issue No: Vol. 3, No. 1 (2023)
       
  • Prediction of antibody binding to SARS-CoV-2 RBDs

    • First page: vbac103
      Abstract: Summary: The ability to predict antibody–antigen binding is essential for computational models of antibody affinity maturation and protein design. While most models aim to predict binding for arbitrary antigens and antibodies, the global impact of SARS-CoV-2 on public health and the availability of associated data suggest that a SARS-CoV-2-specific model would be highly beneficial. In this work, we present a neural network model, trained on ∼315 000 datapoints from deep mutational scanning experiments, that predicts escape fractions of SARS-CoV-2 RBDs binding to arbitrary antibodies. The antibody embeddings within the model constitute an effective sequence space, which correlates with the Hamming distance, suggesting that these embeddings may be useful for downstream tasks such as binding prediction. Indeed, the model achieves Spearman correlation coefficients of 0.46 and 0.52 on two held-out test sets. By comparison, correlation coefficients calculated using existing structure and sequence-based models do not exceed 0.28. The correlation coefficient against dissociation constants of antibodies binding to SARS-CoV-2 RBD variants is 0.46. Additionally, the residue-level escapes are highest in the antibody epitope, correlating well with experimentally measured escapes. We further study the effect of antibody chain use, embedding dimension size and feed-forward and convolutional architectures on the model results. Lastly, we find that the inference time of our model is significantly shorter than that of previous models, suggesting that it could be a useful tool for the accurate and rapid prediction of antibodies binding to SARS-CoV-2 RBDs. Availability and implementation: The model and associated code are available for download at https://github.com/ericzwang/RBD_AB. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 02 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac103
      Issue No: Vol. 3, No. 1 (2023)
       
  • CoDe: a web-based tool for codon deoptimization

    • First page: vbac102
      Abstract: Summary: We have developed a web-based tool, CoDe (Codon Deoptimization), that deoptimizes genetic sequences based on different codon usage biases, ultimately reducing expression of the corresponding protein. The tool can also deoptimize the sequence for a specific region and/or selected amino acid(s). Moreover, CoDe can highlight sites targeted by restriction enzymes in the wild-type and codon-deoptimized sequences. Importantly, our web-based tool has a user-friendly interface with flexible options to download results. Availability and implementation: The web-based tool CoDe is freely available at https://web.iitm.ac.in/bioinfo2/codeop/landing_page.html. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 02 Jan 2023 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac102
      Issue No: Vol. 3, No. 1 (2023)
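To make the idea of codon deoptimization in the entry above concrete, here is a minimal sketch that swaps each codon for the least-used synonymous codon in a supplied usage table. It is an assumption-laden illustration, not how CoDe itself works: the function names are invented for this example and the usage frequencies are made-up toy numbers, not measurements for any organism.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Hypothetical relative usage values, for illustration only.
TOY_USAGE = {"CTG": 0.50, "CTA": 0.04, "TTA": 0.13, "GCG": 0.36, "GCT": 0.16}


def translate(seq):
    """Translate an in-frame, uppercase DNA sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def deoptimize(seq, usage):
    """Replace each codon with the rarest synonymous codon listed in `usage`.

    Codons whose amino acid has no entry in `usage` are left unchanged.
    """
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        # Synonymous codons for which a usage value is available.
        candidates = [c for c, a in CODON_TABLE.items()
                      if a == amino_acid and c in usage]
        out.append(min(candidates, key=usage.get) if candidates else codon)
    return "".join(out)
```

With the toy table, `deoptimize("ATGCTGGCG", TOY_USAGE)` rewrites the leucine and alanine codons while the encoded protein stays the same, which is the property any deoptimization scheme has to preserve.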
       
  • Computational analyses reveal fundamental properties of the AT structure
           related to thrombosis

    • First page: vbac098
      Abstract: Summary: Blood coagulation is a vital process for humans and other species. Following an injury to a blood vessel, a cascade of molecular signals is transmitted, inhibiting and activating more than a dozen coagulation factors and resulting in the formation of a fibrin clot that stops the bleeding. In this process, antithrombin (AT), encoded by the SERPINC1 gene, is a key player regulating the clotting activity and ensuring that it stops at the right time. Mutations to this factor therefore often result in thrombosis, the excessive coagulation that leads to the potentially fatal formation of blood clots that obstruct veins. Although this process is well known, it is still unclear why even single residue substitutions to AT lead to drastically different phenotypes. In this study, to understand the effect of mutations throughout the AT structure, we created a detailed network map of this protein, where each node is an amino acid, and two amino acids are connected if they are in close proximity in the three-dimensional structure. With this simple and intuitive representation and a machine-learning framework trained using genetic information from more than 130 patients, we found that different types of thrombosis have emerging patterns that are readily identifiable. Together, these results demonstrate how clinical features, genetic data and in silico analysis are converging to enhance the diagnosis and treatment of coagulation disorders. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 23 Dec 2022 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac098
      Issue No: Vol. 3, No. 1 (2022)
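The network representation described in the entry above (one node per amino acid, an edge between residues that are close in three dimensions) can be sketched in a few lines. The code below is only an illustration under stated assumptions: it takes one coordinate per residue, for example the C-alpha atom, and uses an 8 Å cutoff. Neither choice comes from the abstract, and the function names are hypothetical.

```python
import math


def contact_network(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff`.

    coords -- one (x, y, z) tuple per residue, in angstroms
    cutoff -- distance threshold; 8.0 is an assumed, commonly used value
    """
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges


def degrees(n_residues, edges):
    """Number of contacts per residue, a simple per-node network feature."""
    counts = [0] * n_residues
    for i, j in edges:
        counts[i] += 1
        counts[j] += 1
    return counts
```

Per-residue features such as the degree could then be attached to mutated positions and passed to a classifier. The all-pairs loop is quadratic in the number of residues, which is acceptable for a single protein chain.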
       
  • Simulation of mass spectrometry-based proteomics data with Synthedia

    • First page: vbac096
      Abstract: Motivation: A large number of experimental and bioinformatic parameters must be set to identify and quantify peptides in mass spectrometry experiments and each of these will impact the results. An ability to simulate raw data with known contents would allow researchers to rapidly explore the effects of varying experimental parameters and systematically investigate downstream processing software. A range of data simulators are available for established data-dependent acquisition methodologies, but these do not extend to the rapidly developing field of data-independent acquisition (DIA) strategies. Results: Here, we present Synthedia, a software package to simulate DIA liquid chromatography-mass spectrometry for bottom-up proteomics experiments. Synthedia can generate datasets with known peptide precursor ions and fragments and allows for the customization of a wide variety of chromatographic and mass spectrometry parameters. Availability and implementation: Synthedia is freely available via the internet and can be used through a graphical website (https://synthedia.org/) or locally via the command line (https://github.com/mgleeming/synthedia/). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 19 Dec 2022 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac096
      Issue No: Vol. 3, No. 1 (2022)
       
  • Nucleotide augmentation for machine learning-guided protein engineering

    • First page: vbac094
      Abstract: Summary: Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing; however, there is a lack of such augmentation techniques for biological sequence data. Towards this end, we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models with benchmark datasets of protein genotype and phenotype, revealing performance on par with or surpassing benchmark models while using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance. Availability and implementation: The code used in this study is publicly available at https://github.com/minotm/NTA. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Fri, 09 Dec 2022 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac094
      Issue No: Vol. 3, No. 1 (2022)
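Synonymous codon substitution, the mechanism behind the augmentation described above, can be illustrated with a short sketch: each codon is replaced by a randomly chosen codon for the same amino acid, so every variant encodes the same protein. This is a simplified stand-in, not the NTA code. The function names, the uniform sampling and the seed handling are all assumptions made for this example.

```python
import random

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def translate(seq):
    """Translate an in-frame, uppercase DNA sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def augment(nt_seq, n_variants, seed=0):
    """Return `n_variants` nucleotide sequences that all encode the same protein.

    Each codon is swapped for a uniformly sampled synonymous codon
    (the original codon may be drawn again).
    """
    if len(nt_seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]
    return [
        "".join(rng.choice(SYNONYMS[CODON_TABLE[c]]) for c in codons)
        for _ in range(n_variants)
    ]
```

Each variant keeps the phenotype label of its parent sequence, so a small labelled dataset can be expanded at the nucleotide level without changing any protein. Methionine and tryptophan have a single codon each and are therefore never altered.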
       
  • Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology
           sequencing reads for downstream trimming

    • First page: vbac085
      Abstract: Motivation: Oxford Nanopore Technologies (ONT) sequencing has become very popular over the past few years and offers a cost-effective solution for many genomic and transcriptomic projects. One distinctive feature of the technology is that the protocol includes the ligation of adapters to both ends of each fragment. Those adapters should then be removed before downstream analyses, either during the basecalling step or by explicit trimming. This basic task may be tricky when the definition of the adapter sequence is not well documented. Results: We have developed a new method to scan a set of ONT reads to see if it contains adapters, without any prior knowledge of the sequence of the potential adapters, and then trim out those adapters. The algorithm is based on approximate k-mers and is able to discover adapter sequences based on their frequency alone. The method was successfully tested on a variety of ONT datasets with different flowcells, sequencing kits and basecallers. Availability and implementation: The resulting software, named Porechop_ABI, is open-source and is available at https://github.com/bonsai-team/Porechop_ABI. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
      PubDate: Mon, 21 Nov 2022 00:00:00 GMT
      DOI: 10.1093/bioadv/vbac085
      Issue No: Vol. 3, No. 1 (2022)
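The frequency-based intuition in the entry above can be sketched as follows: k-mers that belong to an adapter occur at the start of nearly every read, whereas k-mers from the biological insert do not. Note that this sketch counts exact k-mers, a deliberate simplification of the approximate k-mers named in the abstract, and it is not the Porechop_ABI algorithm. The function name, the parameter defaults and the toy reads are assumptions.

```python
from collections import Counter


def start_kmer_counts(reads, k=8, window=20):
    """Count, for each exact k-mer, the number of reads whose first
    `window` bases contain it (each read contributes at most once)."""
    counts = Counter()
    for read in reads:
        prefix = read[:window]
        # A set ensures a k-mer repeated within one read is counted once.
        kmers = {prefix[i:i + k] for i in range(len(prefix) - k + 1)}
        counts.update(kmers)
    return counts


# Toy reads sharing a made-up 12-base adapter-like prefix.
TOY_ADAPTER = "AATGTACTTCGT"
TOY_READS = [
    TOY_ADAPTER + "GGGGCCCCAAAA",
    TOY_ADAPTER + "TTTTAAAACCCC",
    TOY_ADAPTER + "CACACAGAGAGA",
]
```

On the toy reads, k-mers lying entirely inside the shared prefix are found in all three reads, while k-mers that overlap the variable part are each seen once. Real nanopore reads contain sequencing errors, which is why an approximate-matching scheme is needed in practice.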
       
 
JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
 

