Numerous studies demonstrate frequent mutations in the genome of SARS-CoV-2. Our goal was to statistically link mutations to severe disease outcome.
We found that automated machine learning, such as the method of Tsamardinos and coworkers used here, is a versatile and effective tool for finding salient features in large and noisy databases, such as the fast-growing collection of SARS-CoV-2 genomes.
In this work we used machine learning techniques to select mutation signatures associated with severe SARS-CoV-2 infections. We assigned patients to two major categories (“mild” and “severe”) by consolidating the 179 outcome designations in the GISAID database.
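The consolidation of outcome designations into two categories can be pictured as a simple lookup. The following is a minimal sketch in Python, not the mapping used in the study: the designation strings, the table `OUTCOME_TO_CATEGORY` and the function `categorize_outcome` are hypothetical and chosen only for illustration. The actual assignment of the 179 GISAID designations is described in the full study.

```python
# Hypothetical examples of free-text outcome designations and the category
# each one is assigned to. The study's real table covers 179 designations
# and may group them differently.
OUTCOME_TO_CATEGORY = {
    "asymptomatic": "mild",
    "mild": "mild",
    "outpatient": "mild",
    "not hospitalized": "mild",
    "hospitalized": "severe",
    "intensive care unit": "severe",
    "severe": "severe",
    "deceased": "severe",
}


def categorize_outcome(designation):
    """Return "mild", "severe", or None for an unrecognized designation."""
    # Normalize case and whitespace so that " Not  Hospitalized " still matches.
    key = " ".join(designation.lower().split())
    return OUTCOME_TO_CATEGORY.get(key)


records = ["Deceased", " Asymptomatic ", "Not  Hospitalized", "unknown"]
labels = [categorize_outcome(r) for r in records]
```

Exact matching on a normalized string is used here rather than keyword search, so that a designation such as "not hospitalized" is not mistaken for "hospitalized". Records that map to `None` would be left out of the analysis.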
A protocol combining logistic regression with feature selection algorithms revealed that signatures of about twenty mutations can separate the two groups. The mutation signature is in good agreement with variants well known from previous genome sequencing studies, including the Spike protein variants V1176F and S477N, which co-occur with the D614G mutation and account for a large proportion of fast-spreading SARS-CoV-2 variants. UTR mutations were also selected as part of the best mutation signatures. The mutations identified here also appear in previously published, statistically derived mutation profiles.
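To illustrate how logistic regression and feature selection can work together, the sketch below fits an L1-penalized logistic regression by proximal gradient descent on a tiny, made-up presence/absence matrix. This is not the automated pipeline used in the study. The technique is a plain substitute chosen for brevity, and the mutation names (`mut_A`, `mut_B`, `mut_C`), the toy data and all parameter values are assumptions. The L1 penalty drives the weights of uninformative mutations to exactly zero, so the mutations left with nonzero weights form the "signature".

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, l1=0.05, lr=0.5, epochs=500):
    """Fit weights and intercept; X holds 0/1 mutation flags, y is 1 for severe."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj / n, lr * l1)
             for wj, gj in zip(w, grad_w)]
    return w, b


def predict_severe_probability(x, w, b):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def mutation_signature(w, names):
    # Keep only mutations with nonzero weight, strongest first.
    ranked = sorted(zip(names, w), key=lambda p: -abs(p[1]))
    return [(name, wj) for name, wj in ranked if wj != 0.0]


# Made-up toy data: mut_A tracks severity, mut_B is noise, mut_C never occurs.
names = ["mut_A", "mut_B", "mut_C"]
X = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0],
     [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, intercept = fit_l1_logistic(X, y)
signature = mutation_signature(weights, names)
```

On real data, the penalty strength (`l1` here) would be tuned by cross-validation, which is one of the steps that automated machine learning tools handle for the user. A nonzero weight indicates a statistical association only, not a causal effect.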
An online prediction platform was set up that can assign a probabilistic measure of infection severity to SARS-CoV-2 sequences, including a qualitative index of the strength of the diagnosis. The data confirm that machine learning methods can be conveniently used to select genomic mutations associated with disease severity. One has to be cautious, however: such statistical associations (like common sequence signatures, or marker fingerprints in general) are by no means causal relations unless confirmed by experiments.
We plan to update the prediction server at regular intervals. While this project was underway, more than 100 thousand sequences were deposited in public databases and, importantly, new variants emerged in the UK and in South Africa that are not yet included in the current datasets. In addition to mutations, we also plan to include insertions and deletions, which will hopefully further improve the predictive power of the server.
The study was funded by the Hungarian Ministry for Innovation and Technology (MIT), within the framework of the Bionic thematic programme of the Semmelweis University.
Read the entire study at https://www.biorxiv.org/content/10.1101/2021.04.01.438063v1.full
Access the online portal mentioned above at https://covidoutcome.com/