Predictive accuracy can be further improved by combining TransFun predictions with sequence similarity-based predictions.
Access the TransFun source code on GitHub at https://github.com/jianlin-cheng/TransFun.
Non-canonical (non-B) DNA refers to genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA plays an important role in basic cellular processes and is associated with genomic instability, gene regulation, and oncogenesis. Experimental methods for identifying non-B DNA are low-throughput and can detect only a limited set of non-B structures, while computational methods rely on the presence of non-B base motifs, which does not conclusively establish that such a structure forms. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used to identify non-B structures.
We develop the first computational pipeline to predict non-B DNA structures from nanopore sequencing data. We formalize non-B detection as a novelty detection problem and develop GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages poor reconstruction of non-B DNA, and optimized Gaussian GoF tests allow P-values to be computed that indicate the presence of non-B structures. Whole-genome nanopore sequencing of NA12878 shows significant differences in DNA translocation time between non-B and B-DNA bases. We demonstrate the effectiveness of our approach through comparisons with novelty detection methods on experimental data and on data synthesized with a new translocation time simulator. Experimental validation suggests that reliable detection of non-B DNA from nanopore sequencing is achievable.
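The idea of turning a novelty score into a P-value can be illustrated with a much simpler stand-in than GoFAE-DND. The sketch below does not use an autoencoder or GoF regularization; it only fits a Gaussian null distribution to scores from ordinary (B-DNA) samples and reports an upper-tail P-value for a new score. The function names and the example scores are hypothetical and are not taken from the ONT-nonb-GoFAE-DND code.

```python
from statistics import NormalDist

def fit_null(reference_scores):
    """Fit a Gaussian null distribution to scores from B-DNA (normal) samples."""
    return NormalDist.from_samples(reference_scores)

def novelty_p_value(score, null):
    """Upper-tail P-value: probability of a score at least this large under the null."""
    return 1.0 - null.cdf(score)

# Hypothetical per-window scores (for example, a summary of translocation times).
reference = [9.8, 10.1, 10.0, 9.9, 10.2]
null = fit_null(reference)

p_typical = novelty_p_value(10.0, null)   # close to the null mean
p_unusual = novelty_p_value(11.0, null)   # far in the upper tail
```

A small P-value flags a window as unlike the B-DNA reference; whether a Gaussian null is adequate for real translocation-time data is an assumption that would need checking.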
The project's source code, ONT-nonb-GoFAE-DND, is hosted on GitHub at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.
Massive datasets, now standard, including whole-genome sequences of various bacterial strains, are a critical and plentiful resource for modern genomic epidemiology and metagenomics. For optimal utilization of these datasets, indexing structures that are both scalable and capable of providing rapid query throughput are essential.
We present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in nine hours, and the resulting index takes 142 gigabytes. By comparison, the best competing tools, Metagraph and Bifrost, were able to index only 11,000 genomes in the same time. In pseudoalignment, these tools were either ten times slower than Themisto or used ten times more memory. Themisto also offers better pseudoalignment quality than previous methods, achieving higher recall on Nanopore reads.
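For readers unfamiliar with the terms, a colored k-mer index maps each k-mer to the set of reference genomes ("colors") that contain it, and pseudoalignment reports the genomes compatible with a read's k-mers. The toy sketch below uses a plain Python dictionary, not Themisto's compressed data structure, and ignores reverse complements. The function names, the toy genomes, and the choice to skip unindexed k-mers are assumptions made for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_index(genomes, k):
    """Map each k-mer to the set of genome ids (colors) that contain it."""
    index = defaultdict(set)
    for color, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(color)
    return index

def pseudoalign(read, index, k):
    """Intersect the color sets of the read's k-mers that occur in the index.

    k-mers absent from the index are skipped (one possible convention).
    Returns an empty set when none of the read's k-mers is indexed.
    """
    result = None
    for kmer in kmers(read, k):
        colors = index.get(kmer)
        if colors is None:
            continue
        result = set(colors) if result is None else result & colors
    return result or set()

# Two hypothetical reference genomes that differ by one base.
genomes = {"A": "ACGTACGGA", "B": "ACGTTCGGA"}
index = build_colored_index(genomes, k=4)

unique_hit = pseudoalign("GTACGG", index, k=4)   # k-mers found only in genome A
shared_hit = pseudoalign("ACGT", index, k=4)     # k-mer present in both genomes
no_hit = pseudoalign("TTTTT", index, k=4)        # no indexed k-mers
```

A dictionary of Python sets is far too large for hundreds of thousands of genomes, which is the scaling problem the index described above is built to address.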
https://github.com/algbio/themisto hosts the documented C++ Themisto package, licensed under GPLv2.
The rapid growth of genomic sequencing data has produced an expanding collection of gene network repositories. Unsupervised network integration methods learn informative representations for each gene, which are then used as features for downstream applications. However, these methods must scale to the growing number of networks and be robust to the uneven distribution of network types across hundreds of gene networks.
To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent and weight each network according to its uniqueness. Gemini then mitigates the uneven network distribution by mixing existing networks to create many new networks. On human protein function prediction, integrating numerous networks from BioGRID with Gemini yields more than a 10% improvement in F1 score, a 15% improvement in micro-AUPRC, and a 63% improvement in macro-AUPRC, whereas the performance of Mashup and BIONIC embeddings deteriorates as more networks are added to the input. Gemini thereby enables memory-efficient and informative network integration for large gene networks and can be used to integrate and analyze networks at scale in other domains.
Gemini is available for access at the GitHub link: https://github.com/MinxZ/Gemini.
Understanding the relationships among cellular types is paramount for effectively transferring experimental data from murine studies to human contexts. The task of aligning cell types, however, is complicated by the biological divergence among species. Current methods focusing solely on one-to-one orthologous genes overlook a significant quantity of evolutionary information held within the intergenic regions between genes, which could aid in species alignment. In some methods, gene relationships are explicitly included to retain relevant information, but this approach isn't without its challenges.
We present TACTiCS, a model for transferring and aligning cell types across species. First, TACTiCS uses a natural language processing model to match genes based on their protein sequences. Next, it employs a neural network to classify cell types within a species. Finally, it uses transfer learning to propagate cell type labels between species. We applied TACTiCS to single-cell RNA sequencing data of the primary motor cortex of human, mouse, and marmoset. On these datasets, our model accurately matches and aligns cell types. It outperforms Seurat and the state-of-the-art method SAMap. In addition, our gene matching method yields better cell type matches than BLAST within our model.
The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post-hoc analyses, which even then often fail to explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called the totally interpretable sequence-to-function model (tiSFM). tiSFM improves upon the performance of standard multilayer convolutional models while using fewer parameters. Additionally, although tiSFM is itself a multilayer neural network, its internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.
We analyze published open chromatin measurements across hematopoietic lineage cell types and show that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B cells and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on the complex task of predicting the change in epigenetic state as a function of developmental transition.
Python scripts for analyzing key findings are included in the source code, available at the link https://github.com/boooooogey/ATAConv.
Nanopore sequencers generate raw electrical signals in real time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, enabling real-time genome analysis. The Read Until feature of nanopore sequencing can eject a strand from the sequencer before it is fully sequenced, which opens opportunities to computationally reduce sequencing time and cost. However, existing methods that use Read Until either (a) require powerful computational resources that may not be available for portable sequencers, or (b) lack scalability for large genomes, which makes them inaccurate or ineffective. We propose RawHash, a novel mechanism that enables accurate and efficient real-time analysis of raw nanopore signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures that signals corresponding to the same DNA content produce the same hash value, regardless of slight variations in the signals. It does so by quantizing the raw signals so that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value.
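The quantize-then-hash idea can be illustrated with a minimal sketch. It is a simplification rather than RawHash's actual procedure: it assumes the signal values have already been normalized to a known range, uses fixed-width buckets, and hashes short windows of consecutive bucket ids. The function names, the range [0, 1], the bucket count, and the window length are all hypothetical choices.

```python
def quantize(signal, lo, hi, n_buckets):
    """Map each signal value to an integer bucket in [0, n_buckets - 1]."""
    width = (hi - lo) / n_buckets
    buckets = []
    for x in signal:
        clipped = min(max(x, lo), hi)                  # clamp outliers into range
        bucket = int((clipped - lo) / width)
        buckets.append(min(bucket, n_buckets - 1))     # x == hi falls in the last bucket
    return buckets

def window_hashes(quantized, window):
    """Hash every run of `window` consecutive bucket ids."""
    return [hash(tuple(quantized[i:i + window]))
            for i in range(len(quantized) - window + 1)]

# Two noisy measurements of the same hypothetical signal, and an unrelated one.
signal_a = [0.10, 0.60, 0.90, 0.30]
signal_b = [0.12, 0.58, 0.88, 0.33]
signal_c = [0.90, 0.10, 0.10, 0.90]

q_a = quantize(signal_a, 0.0, 1.0, 4)
q_b = quantize(signal_b, 0.0, 1.0, 4)
q_c = quantize(signal_c, 0.0, 1.0, 4)

hashes_a = window_hashes(q_a, 3)
hashes_b = window_hashes(q_b, 3)
```

Here `signal_a` and `signal_b` differ slightly in every value yet fall into the same buckets, so their window hashes match exactly and can be looked up in an ordinary hash table. Values that lie close to a bucket boundary can still land in different buckets, which is the main weakness of fixed-width quantization.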