A few words about this project.
PhageAI is an application that serves both as a repository of knowledge about bacteriophages and as a tool for analysing genomes with Artificial Intelligence support.
Machine Learning algorithms can process enormous amounts of data in a relatively short time to find connections and dependencies that are not obvious to humans. Correctly designed AI-based applications can vastly improve and speed up the work of domain experts.
Models based on contextual vectorization of DNA and Deep Neural Networks are particularly effective for analysing genomic data. The system we propose uses the phage sequences uploaded to the database to build a model that predicts, with high probability, whether a bacteriophage is virulent or temperate.
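To make the idea of treating DNA as text more concrete, here is a minimal sketch of splitting a sequence into overlapping k-mers ("words") and computing their relative frequencies. This is only an illustration: k-mer tokenization is an assumption, since the tokenization or embedding method used by PhageAI is not described here, and the function names `kmers` and `kmer_frequencies` and the default `k=6` are hypothetical.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Split a nucleotide sequence into overlapping k-mers (hypothetical helper)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(sequence, k=6):
    """Return the relative frequency of every k-mer found in the sequence."""
    tokens = kmers(sequence, k)
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {kmer: n / total for kmer, n in counts.items()}
```

A token list like this could then be passed to an embedding model, in the same way that words of a sentence are in Natural Language Processing.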
One of the key system modules is the bacteriophage repository, with a clean web interface that allows users to browse, upload and share data with others. The knowledge gathered about bacteriophages is valuable not only on its own but also because it can be used to train ever-improving Machine Learning models.
Detecting whether a phage is virulent or temperate is only one of the first tasks that can be solved with Artificial Intelligence. The combination of Biology, Natural Language Processing and Machine Learning allows us to create algorithms for genomic data processing that could eventually prove effective in a wide range of problems focused on classification and on information extracted from DNA.
PhageAI is an AI-driven software platform that uses advanced Machine Learning and Natural Language Processing techniques for a deeper understanding of bacteriophage genomics.
For AI model training we used 469 manually selected bacteriophages from different species and families. Each of them was represented by a complete nucleotide sequence in FASTA format.
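For readers unfamiliar with FASTA, a record is a header line starting with `>` followed by one or more lines of sequence. A minimal reader might look like the sketch below. The function name `read_fasta` and the example records are hypothetical and are not part of PhageAI.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated and normalised to upper case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, for illustration only.
example = io.StringIO(">phage_A example record\nATGC\nggta\n>phage_B\nTTAA\n")
records = list(read_fasta(example))
```

Each genome in a training set would then be available as a single string, whatever line wrapping the file used.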
Continuous embeddings of DNA sequences, combined with feature ranking by recursive feature elimination, allowed us to prepare optimal datasets for training a new Support Vector Machine model. The result is an accurate lifecycle (virulent or temperate) classifier with more than 98% accuracy on both the training and validation sets.
To confirm that score, the lifecycle classifier was also tested on unseen data provided by Proteon Pharmaceuticals S.A. All 61 samples (49 virulent, 12 temperate) were predicted correctly by the model, in accordance with the experts' lifecycle assumptions.
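The feature-selection and training steps described above can be sketched with scikit-learn. This is a rough illustration under stated assumptions, not the PhageAI pipeline. The data are synthetic stand-ins for embedded genomes, and the feature dimension (100), the number of features kept (20), the linear kernel and the 80/20 split are all assumptions. Only the sample count of 469 comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for 469 embedded genomes with binary lifecycle labels.
X, y = make_classification(
    n_samples=469, n_features=100, n_informative=10,
    n_redundant=0, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# Recursive feature elimination: repeatedly fit a linear SVM and drop the
# 10 lowest-ranked features until 20 remain (both numbers are assumptions).
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=10)
selector.fit(X_train, y_train)

# Train the final classifier on the selected features only.
clf = SVC(kernel="linear")
clf.fit(selector.transform(X_train), y_train)
val_accuracy = clf.score(selector.transform(X_val), y_val)
```

Fitting the selector on the training split alone keeps the validation samples out of the feature-ranking step, so the validation accuracy is not inflated by information leakage.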
The current methodology opens up opportunities for further research in the field of phage classification.