Viruses are a mysterious and poorly understood force in microbial ecosystems. Researchers know they can infect, kill and manipulate human and bacterial cells in almost every environment, from the oceans to your gut. But scientists still don’t have a complete picture of how viruses affect their surroundings, largely because of their extraordinary diversity and ability to evolve rapidly.
Microbial communities are difficult to study in a laboratory setting. Many microbes are challenging to cultivate, and their natural environments have many more features that influence their success or failure than scientists can replicate in a lab.
So systems biologists like me often sequence all the DNA present in a sample – for example, a fecal sample from a patient – isolate the viral DNA sequences, then annotate the parts of the viral genome that code for proteins. These notes on the location, structure and other features of the genes help researchers understand the functions that viruses may perform in the environment and identify different types of viruses. Researchers annotate viruses by matching viral sequences in a sample to previously annotated sequences available in public databases of viral genetic sequences.
However, scientists are identifying viral sequences in DNA collected from the environment at a rate that far exceeds our ability to annotate those genes. This means that researchers are publishing results about viruses in microbial ecosystems using unacceptably small fractions of the available data.
To improve the ability of researchers to study viruses around the globe, my team and I have developed a new approach to annotate viral sequences using artificial intelligence. Using protein language models similar to large language models like ChatGPT but specific to proteins, we were able to classify viral sequences that had never been seen before. This opens the door for researchers to not only learn more about viruses, but also to address biological questions that are difficult to answer with current techniques.
Annotating viruses with AI
Large language models use relationships between words in large text data sets to provide possible answers to questions they were never explicitly “taught” the answer to. When you ask a chatbot “What is the capital of France?” for example, the model is not looking up the answer in a table of capitals. Rather, it is using its training on huge data sets of documents and information to infer the answer: “Paris is the capital of France.”
Similarly, protein language models are AI algorithms trained to recognize relationships between billions of protein sequences from environments around the world. Through this training, they may be able to understand something about the nature of viral proteins and their functions.
I wondered whether protein language models could answer this question: “Given all annotated viral genetic sequences, what is the function of this new sequence?”
In our proof of concept, we trained neural networks on previously annotated viral protein sequences using pre-trained protein language models, and then used them to predict the annotation of new viral protein sequences. Our approach also allows us to probe what the model is “seeing” in a particular viral sequence that leads to a particular annotation. This helps identify candidate proteins of interest, based on their specific functions or how their genome is organized, and narrows the search space of huge data sets.
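As a rough illustration of the idea of learning from annotated sequences and then labeling new ones, here is a minimal Python sketch. It is not our actual method, and everything in it is an assumption made for illustration: `toy_embedding` is a hypothetical stand-in that counts amino acid frequencies instead of calling a real protein language model, a simple nearest-centroid classifier takes the place of a neural network, and the sequences and labels are invented.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embedding(seq):
    """Hypothetical stand-in for a protein language model embedding.

    Returns the frequency of each of the 20 standard amino acids.
    A real pipeline would use a pre-trained model's vector instead.
    """
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Average a list of equal-length vectors, position by position."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(labeled_sequences):
    """Build one centroid per annotation from (sequence, label) pairs."""
    by_label = {}
    for seq, label in labeled_sequences:
        by_label.setdefault(label, []).append(toy_embedding(seq))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, seq):
    """Return the closest annotation and its similarity score."""
    embedding = toy_embedding(seq)
    scores = {label: cosine(embedding, c) for label, c in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented toy sequences and labels, for illustration only.
TRAINING = [
    ("GGSGGSGGAGGS", "capsid"),
    ("GSGSGGAGSGGS", "capsid"),
    ("KRKRHKKRRHKR", "integrase"),
    ("RKHRKKRRKHKR", "integrase"),
]

MODEL = train(TRAINING)

if __name__ == "__main__":
    for new_sequence in ("KKRRHKRK", "GGSGAS"):
        label, score = predict(MODEL, new_sequence)
        print(new_sequence, label, round(score, 3))
```

The similarity score returned alongside each label gives a crude sense of how strongly a new sequence resembles the annotated examples, which is one simple way to decide which predictions deserve a closer look.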
By identifying more distantly related viral gene functions, protein language models can complement current methods to provide new insights into microbiology. For example, my team and I were able to use our model to discover a previously unknown integrase – a type of protein that can move genetic information in and out of cells – in the globally abundant marine picocyanobacteria Prochlorococcus and Synechococcus. Notably, this integrase may be able to move genes in and out of these bacterial populations in the oceans and enable these microbes to better adapt to changing environments.
Our language model also identified a novel viral capsid protein that is widespread in the world’s oceans. We have produced the first picture of how its genes are organized, showing that there may be different sets of genes, which we believe indicates that this virus serves different functions in its environment.
These preliminary results represent only two of the thousands of annotations our approach has provided.
Analyzing the unknown
Most of the hundreds of thousands of newly discovered viruses are still unclassified. Many viral genetic sequences match protein families that have no known function or have never been seen before. Our work shows that similar protein language models could help study the threat and promise of our planet’s many uncharacterized viruses.
Although our study focused on viruses in the world’s oceans, improved annotation of viral proteins is critical to better understanding the role viruses play in health and disease in the human body. We and other researchers have hypothesized that viral activity in the human gut microbiome may change when you are sick. This means that viruses may help recognize stress in microbial communities.
However, our approach is also limited because it requires high-quality annotations. Researchers are developing newer protein language models that incorporate other “tasks” as part of their training, particularly predicting protein structures to help detect similar proteins, to make them more powerful.
Making all the AI tools available through FAIR Data Principles – data that is findable, accessible, interoperable and reusable – can help researchers at large realize the potential of these new ways of annotating protein sequences, leading to discoveries that benefit human health.
This article is republished from The Conversation, a non-profit, independent news organization that brings you reliable facts and analysis to help you make sense of our complex world. Written by: Libusha Kelly, Albert Einstein College of Medicine
Libusha Kelly receives funding from the National Institutes of Health.