PhageAI S.A. would like to share more details about the progress of its ongoing scientific grant from The National Centre for Research and Development (NCBR). In 2022 we started a new research and development project to build an AI-based screening platform for bacteriophages. In May 2023, we completed the first three of the five planned stages. The NCBR grant continues until the end of 2023.
**Phage global database**
In the first stage, we built the PhageAI External Data Acquisition (EDA) global database of bacteriophages, including omics data, which covers 18,179 high-quality phage and 10,152,477 amino acid sequences (accessed 11.06.2023). We implemented a set of algorithms for searching, crawling and indexing the data and for executing the gene prediction pipelines, as well as a custom method to score the quality of samples and share them for further research. The global database grows every week by dozens of phage and protein samples.
Then, we trained and fine-tuned a set of Natural Language Processing (NLP) models on the global database of bacteriophages to solve downstream tasks in phage biology. Different language model architectures (e.g. transformers) were benchmarked depending on the data format and type of task (genome- or protein-based). The most accurate models were designated Phage2Vec, a scalable cloud-based technology for phage data vectorization, and included in the PhageAI pipeline, which allows us to transform samples into multidimensional vectors and deliver the knowledge in the format expected by Machine Learning methods.
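To illustrate what "transforming a sample into a multidimensional vector" means, here is a minimal sketch. It is not Phage2Vec: it swaps the language models described above for plain k-mer frequency counting, and the function name `kmer_vector` and its parameters are hypothetical, chosen only for this example.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence: str, k: int = 2, alphabet: str = "ACGT") -> list[float]:
    """Return normalized k-mer frequencies in a fixed, alphabetical order.

    Illustrative stand-in for a learned embedding, not the Phage2Vec method.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows with characters outside the alphabet (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


vector = kmer_vector("ACGTACGT", k=2)
print(len(vector))  # 16 dimensions for k=2 over a 4-letter alphabet
```

Every sequence, regardless of length, ends up as a vector of the same size, which is the property downstream classifiers rely on. A learned model would replace the counting step with an embedding that captures context.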
**Phage taxonomy classification**
In the second stage, we extended our knowledge of the current taxonomy of bacteriophages maintained by the ICTV. At that time (VMR 21), it covered 9 orders, 50 families and 1,652 genera of phages. We designed and implemented domain criteria for selecting phage samples for AI-based taxonomy classification research. These criteria allow us to prepare training, test and validation sets and to use vectorized versions of phage data delivered by Phage2Vec.
As a next step, we worked on a prediction model whose final version supports 7 orders, 31 families and 409 genera. After 5-fold cross-validation, the best classification algorithms achieved the following F1-scores:
* 99.07% on order taxonomy level;
* 99.59% on family taxonomy level;
* 98.36% on genus taxonomy level.
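For readers unfamiliar with the metric, the sketch below shows how an F1-score and a 5-fold split can be computed. It assumes macro averaging (the unweighted mean of per-class F1) and contiguous, unshuffled folds. The post does not state which averaging or splitting strategy was used, so both are assumptions, and the function names are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Toy labels only; these are not PhageAI results.
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

In a cross-validation run, the model is trained on each `train` split, scored on the matching `test` split, and the five scores are averaged. In practice the samples would be shuffled, and usually stratified by class, before splitting.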
We also tested hierarchical classification algorithms for phage taxonomy prediction as an alternative solution for this downstream task.
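One common form of hierarchical classification is the top-down approach, where a classifier at each taxonomy node chooses among that node's children. The sketch below assumes this variant; the post does not say which hierarchical algorithms were tested. The function `predict_lineage`, the classifier callables and the taxon names are all hypothetical placeholders.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], str]


def predict_lineage(
    vector: Vector,
    root_classifier: Classifier,
    child_classifiers: Dict[str, Classifier],
    depth: int = 3,
) -> List[str]:
    """Predict order, then family, then genus, walking down the taxonomy.

    Each step uses the classifier registered for the taxon predicted one
    level above, so the result is always a consistent lineage.
    """
    lineage = [root_classifier(vector)]
    while len(lineage) < depth:
        classifier = child_classifiers.get(lineage[-1])
        if classifier is None:  # no finer-grained model for this taxon
            break
        lineage.append(classifier(vector))
    return lineage


# Toy classifiers with placeholder taxon names (not real ICTV taxa).
root = lambda v: "order_A" if v[0] > 0 else "order_B"
children = {
    "order_A": lambda v: "family_A1",
    "family_A1": lambda v: "genus_A1x",
}
print(predict_lineage([1.0], root, children))
```

Compared with training one flat classifier per level, this design cannot predict a genus that contradicts the predicted family, but an error at the order level propagates to every level below it.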
**Phage structural proteins classification**
In the third stage, we focused on the classification of phage proteins, especially 10 classes of structural proteins. A manually curated dataset was prepared, comprising 8,931,173 samples in the training set and 1,221,304 in the test set. Both sets also included phage proteins with unknown function.
The best Machine Learning prediction model achieved a 98.86% F1-score on the test set.
Furthermore, we implemented a pipeline for phage protein exploration in 2D interactive space. It was used to benchmark how accurately different language models project phage proteins into this space and how they affect the clusters formed by structural and unknown protein classes.
Both phage taxonomy classification and phage structural protein classification were included in the PhageAI platform pipelines.
Currently, we are focusing on the visualization of phage annotation results and preparing a new technology to find the most similar phages, as well as to highlight those with application (therapeutic) potential. All results will be available in a PDF report that can be generated directly from the PhageAI platform.
Measure 1.1 R&D projects of enterprises, Sub-measure 1.1.1 Industrial research and development work implemented by enterprises, Smart Growth Operational Programme 2014-2020