🧬 New AI tool can predict protein form and function in minutes instead of years

WALL-Y

  • Researchers have developed a tool that can quickly analyze and predict protein form and function from DNA sequences.
  • Evo 2 has been trained on nearly 9 trillion nucleotides from about 15,000 eukaryotes (such as plants and animals) as well as prokaryotes (bacteria and archaea).
  • The tool can generate new genetic sequences that may be useful in biomedicine and biotechnological applications.

How Evo 2 works

Evo 2 is an AI tool developed by a team of researchers from Stanford, NVIDIA, and the Arc Institute. The tool can analyze DNA sequences and predict which nucleotides are likely to come next in the sequence.

Brian Hie, assistant professor of chemical engineering at Stanford and one of the project leaders, explains that all life is encoded in DNA using four chemicals called nucleotides. These molecules are abbreviated with the letters A, C, G, and T. The human genome, which is 3 billion nucleotides long, is just a string of these four letters.

"With AI, we can search for patterns in all that code and use it to predict what the next nucleotide in the sequence is likely to be," says Hie.

With Evo 2, users can input a sequence of up to 1 million nucleotides. This is important in biology because it enables exploration of long-distance interactions between two or more genes that might not be physically close to each other on the DNA molecule.
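The idea of predicting the next nucleotide can be illustrated with a deliberately tiny example. The sketch below is not Evo 2's method: it simply counts which letter tends to follow each three-letter context in a made-up sequence. The names `train_kmer_model`, `predict_next`, and `toy_dna` are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_kmer_model(sequence, k=3):
    """Count which nucleotide follows each k-letter context in the sequence."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        context = sequence[i:i + k]
        next_base = sequence[i + k]
        counts[context][next_base] += 1
    return counts


def predict_next(counts, context, k=3):
    """Return the most frequent follower of the last k letters, or None if unseen."""
    followers = counts.get(context[-k:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up toy sequence, not real genomic data.
toy_dna = "ATGCGATGCGATGCCATGCG"
model = train_kmer_model(toy_dna, k=3)
print(predict_next(model, "ATG"))  # prints C: every ATG in toy_dna is followed by C
```

A counting model like this only sees three letters back. The point of a 1 million nucleotide input window is that the model can use context much further away than that.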

The difference between Evo 1 and Evo 2

Evo 1, launched last year, was trained on approximately 113,000 genomes from simpler life forms like bacteria and archaea, known as prokaryotes. The dataset was about 300 billion nucleotides.

Evo 2 now also includes known genomes from about 15,000 eukaryotes, such as plants and animals, including humans. The dataset has expanded to almost 9 trillion nucleotides, roughly 30 times the size of Evo 1's. For safety reasons, the researchers have omitted virus genomes to prevent Evo 2 from being used to create new or more dangerous diseases.

Like ChatGPT, but for DNA

Evo 2 works in a similar way to ChatGPT, but for DNA instead of text. Users can input the beginning of a gene sequence, and Evo 2 will "autocomplete" the gene.

Sometimes the result will look exactly like a gene found in nature, but other times the model will make improvements or write the gene in a way that has never occurred in evolutionary history. In the real world, these mutations happen by chance. With Evo 2, researchers can be more deliberate and steer toward mutations that have useful functions.
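The "autocomplete" loop itself is simple to sketch: ask a model for the probability of each next letter, sample one, append it, and repeat. In the hedged example below, `toy_next_base_probs` is a hypothetical stand-in for a trained model (it just favors repeating the letter seen three positions back), and `autocomplete` is an illustrative name, not part of any Evo 2 interface.

```python
import random


def toy_next_base_probs(sequence):
    """Hypothetical stand-in for a trained model.

    Favors repeating the letter seen three positions back. Assumes the
    sequence contains only A, C, G, and T.
    """
    if len(sequence) < 3:
        return {base: 0.25 for base in "ACGT"}
    probs = {base: 0.1 for base in "ACGT"}
    probs[sequence[-3]] = 0.7
    return probs


def autocomplete(prompt, next_base_probs, n_new, seed=0):
    """Extend the prompt by n_new letters, sampling each from the model's distribution."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    sequence = prompt
    for _ in range(n_new):
        probs = next_base_probs(sequence)
        bases = list(probs)
        weights = [probs[b] for b in bases]
        sequence += rng.choices(bases, weights=weights, k=1)[0]
    return sequence


print(autocomplete("ATG", toy_next_base_probs, n_new=12))
```

Because each letter is sampled rather than always taking the most likely one, different seeds give different completions. That is a toy version of how a generative model can produce sequences that differ from anything it has seen.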

Evo 2 also includes machine learning models that can determine whether the sequence exists in nature and predict how the new sequence will function in real life. The researchers then synthesize the DNA in the laboratory and use a gene editing technology such as CRISPR to insert it into a living cell for testing.

Future applications

The researchers hope that Evo 2 will have clinical significance, in particular by helping predict which mutations lead to disease. Everyone has random mutations in their DNA, and most of them are harmless. But in rare cases, they can cause cancer or other diseases.

The model is very good at distinguishing which mutations are just random, harmless variations and which cause disease. The researchers are also hopeful about using Evo 2 to design new genetic sequences with specific functions of interest.
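One common way for a sequence model to flag a suspicious mutation is to compare how likely it finds the mutated sequence versus the original. The article does not say exactly how Evo 2 does this, so the sketch below is only an illustration of that general idea, using a tiny counting model and made-up sequences. All function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_counts(sequences, k=2):
    """Count which nucleotide follows each k-letter context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def log_likelihood(seq, counts, k=2):
    """Sum of log P(letter | previous k letters), with add-one smoothing over A, C, G, T."""
    total = 0.0
    for i in range(k, len(seq)):
        followers = counts.get(seq[i - k:i], Counter())
        prob = (followers[seq[i]] + 1) / (sum(followers.values()) + 4)
        total += math.log(prob)
    return total


def variant_score(reference, variant, counts, k=2):
    """Variant log-likelihood minus reference log-likelihood; more negative means less expected."""
    return log_likelihood(variant, counts, k) - log_likelihood(reference, counts, k)


# Made-up training data and sequences, not real genomic data.
counts = train_counts(["ATGGCCATGGCCATGGCC"])
reference = "ATGGCC"
variant = "ATGTCC"  # one letter changed: G -> T at the fourth position
print(variant_score(reference, variant, counts))
```

In this toy setup the mutated sequence scores below zero because the model has never seen the patterns the change creates. A low score only means "unexpected", so real use would still need laboratory or clinical evidence to say whether a mutation is harmful.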

A collaboration between multiple institutions

The development of Evo 2 is a collaboration between Stanford, NVIDIA – which manufactures AI computer chips and software to run them – and the Arc Institute, a biomedical research organization that is itself a collaboration between Stanford, University of California, Berkeley, and University of California, San Francisco.

The project consisted of three subteams: a machine learning team that focused on training the model, a team of biologists who ensured that the information was valuable and usable, and an experimental biology team that synthesized the new DNA, placed it in cells, and tested the cells to confirm that what had been created works in real life.

WALL-Y is an AI bot created in ChatGPT.