A few words about this project.
PhageAI is an application that serves both as a knowledge repository for bacteriophages and as a tool for analysing genomes with Artificial Intelligence support.
Machine Learning algorithms can process enormous amounts of data in a relatively short time and find connections and dependencies that are not obvious to humans. Correctly designed AI-based applications can vastly improve and speed up the work of domain experts.
Models based on contextual DNA vectorization and Deep Neural Networks are particularly effective for analysing genomic data. The system we propose uses the phage sequences uploaded to the database to build a model that predicts, with high probability, whether a bacteriophage is virulent, temperate or chronic.
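A common first step when treating DNA as text is to split a sequence into overlapping k-mers that play the role of words. The source does not state how PhageAI tokenizes genomes, so the sketch below is only an illustration: the function name `kmer_tokens` and the parameters `k` and `stride` are assumptions, not part of PhageAI.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers ("words").

    Hypothetical helper for illustration; k and stride are assumed values.
    """
    seq = sequence.upper()
    # Slide a window of width k over the sequence, moving `stride` bases at a time.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokens("ATGCGTAC", k=3)
print(tokens)
```

A list of tokens like this could then be fed to any word-embedding method, in the same way sentences are in Natural Language Processing.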
One of the key system modules is the bacteriophage repository, with a clean web interface that lets users browse, upload and share data. The gathered knowledge about bacteriophages is valuable not only on its own but also because it can be used to train ever-improving Machine Learning models.
Detecting virulent or temperate features is only one of the first tasks that can be solved with Artificial Intelligence. The combination of Biology, Natural Language Processing and Machine Learning allows us to create algorithms for genomic data processing that could eventually prove effective across a wide range of problems focused on classification and on information extracted from DNA.
PhageAI is an AI-driven software platform that uses advanced Machine Learning and Natural Language Processing techniques to gain a deeper understanding of bacteriophage genomics.
We developed Phage2Vec, a general-purpose phage language model trained on 17 559 complete bacteriophage sequences. The Machine Learning models for lifecycle prediction were trained on 4 694 manually selected bacteriophages from different species and families. Each sample was represented by a complete nucleotide sequence in FASTA format.
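Since every sample is a complete nucleotide sequence in FASTA format, a minimal reader is sketched below to show what that input looks like. This is not PhageAI's own loader: the function `read_fasta` and the record names `phage_A` and `phage_B` are hypothetical, and the sequences are made-up toy data.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}.

    Hypothetical helper: the record id is the first word after '>'.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; join them and normalise case.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = """>phage_A toy record
ATGCGT
ACGTTA
>phage_B
ggccaa
""".splitlines()

genomes = read_fasta(example)
print(genomes)
```

For real data, an established parser such as Biopython's `SeqIO` is usually a safer choice than a hand-written one.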
Using continuous embeddings of DNA sequences allowed us to prepare datasets for training a new Support Vector Machine model. The result is an accurate lifecycle classifier (virulent, temperate or chronic) with ~98% accuracy on both the training set and the test set (unseen data).
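To make the idea of a continuous embedding concrete, the sketch below turns a sequence into one fixed-length feature vector by averaging the vectors of its k-mers. The source does not describe how Phage2Vec aggregates its vectors, so mean pooling, the function `embed_sequence`, and the two-dimensional `toy_vectors` table are all assumptions chosen for illustration.

```python
from typing import Dict, List


def embed_sequence(sequence: str,
                   kmer_vectors: Dict[str, List[float]],
                   k: int = 3) -> List[float]:
    """Average the vectors of all known k-mers in a sequence (mean pooling).

    Hypothetical helper; k-mers missing from the table are ignored.
    """
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    count = 0
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        vec = kmer_vectors.get(seq[i:i + k])
        if vec is None:
            continue
        for j, value in enumerate(vec):
            total[j] += value
        count += 1
    if count == 0:
        return total  # no known k-mers: return the zero vector
    return [t / count for t in total]


# Toy embedding table (made-up values, not Phage2Vec output).
toy_vectors = {"ATG": [1.0, 0.0], "TGC": [0.0, 1.0]}
features = embed_sequence("ATGC", toy_vectors, k=3)
print(features)
```

Fixed-length vectors of this kind are the sort of input a Support Vector Machine classifier expects, one vector per genome with its lifecycle label.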
To confirm that score, the lifecycle classifier was also tested on further unseen data provided by Proteon Pharmaceuticals S.A. All 61 samples (49 virulent, 12 temperate) were predicted correctly by the model with a 97% confidence level, in accordance with the experts' lifecycle assumptions.
The current methodology opens up opportunities for further research in the field of phage classification.