A team of researchers from the Arc Institute, NVIDIA, and leading universities has unveiled Evo 2, the largest artificial intelligence (AI) model ever built for biology. Trained on DNA sequences from over 128,000 genomes spanning all three domains of life, Evo 2 represents a major leap in AI-powered genetic research, capable of identifying disease-causing mutations, analyzing evolutionary patterns, and even designing new genomic structures.
The model was developed by scientists at the Arc Institute in collaboration with NVIDIA, Stanford University, UC Berkeley, and UC San Francisco. It was trained on more than 9.3 trillion nucleotides, the fundamental building blocks of DNA and RNA, which puts it on a scale comparable to the most advanced generative AI language models.
Patrick Hsu, co-founder of the Arc Institute and co-senior author of the Evo 2 research, called the model a significant breakthrough in generative biology. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life,” said Hsu.
Evo 2 has already demonstrated an ability to predict how genetic mutations affect human health. In an analysis of variants of BRCA1, a gene linked to breast cancer, the model achieved over 90% accuracy in distinguishing benign mutations from potentially harmful ones. This capability could accelerate medical research by helping scientists pinpoint the genetic causes of disease with fewer costly and time-consuming laboratory experiments.
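The article does not say how Evo 2 scores a mutation. One common approach with sequence models is to compare how likely the model finds the mutated sequence against the original. The Python sketch below illustrates that idea only: it uses a toy Markov model as a stand-in, not Evo 2, and every function name and sequence in it is a hypothetical example.

```python
import math
from collections import Counter


def train_markov(seqs, k=2, alphabet="ACGT"):
    """Fit a toy order-k Markov model (a stand-in for a real genomic model)."""
    context_counts = Counter()
    next_counts = Counter()
    for s in seqs:
        for i in range(k, len(s)):
            ctx = s[i - k:i]
            context_counts[ctx] += 1
            next_counts[(ctx, s[i])] += 1

    def log_prob(ctx, base):
        # Add-one smoothing so unseen transitions keep a nonzero probability.
        num = next_counts[(ctx, base)] + 1
        den = context_counts[ctx] + len(alphabet)
        return math.log(num / den)

    return log_prob


def log_likelihood(seq, log_prob, k=2):
    """Sum the log-probability of each base given the k bases before it."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))


def variant_score(ref, pos, alt, log_prob, k=2):
    """Log-likelihood of the variant minus that of the reference.

    A more negative score means the toy model finds the variant less expected.
    """
    var = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(var, log_prob, k) - log_likelihood(ref, log_prob, k)


# Hypothetical data: a repetitive training sequence and a single-base change.
model = train_markov(["ACGTACGTACGTACGT"])
reference = "ACGTACGTACGT"
print(variant_score(reference, 5, "T", model))  # C -> T at position 5
```

In a real workflow, the scores would come from a trained genomic model and would be checked against variants whose clinical effect is already known.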
The model also detects genetic elements such as transcription factor binding sites and exon-intron boundaries, providing researchers with a clearer understanding of how genes function and evolve. According to co-senior author Brian Hie, an assistant professor at Stanford University, Evo 2 is able to recognize patterns refined over millions of years of evolution, similar to how large language models learn from internet text.
Beyond analysis, Evo 2 can also generate DNA. The model can create synthetic sequences at the scale of entire bacterial genomes, with precise control over features such as gene expression. This opens new doors for bioengineering applications, from synthetic biology to personalized gene therapies.
“If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells,” explained co-author Hani Goodarzi, a computational biologist at UC San Francisco.
Evo 2 is fully open-source, with its training code, dataset, and model weights available to the public. The model is also integrated into NVIDIA’s BioNeMo framework, ensuring broad accessibility for researchers. The Arc Institute has additionally worked with AI lab Goodfire to create a mechanistic interpretability tool, allowing scientists to better understand how Evo 2 makes its predictions.
Recognizing ethical concerns, the research team deliberately excluded pathogens that infect humans and complex organisms from the training data. Stanford professor Tina Hernandez-Boussard and her team helped implement safeguards to prevent the model from generating harmful biological sequences.
Dave Burke, Arc’s chief technology officer, likened Evo 2 to an operating system kernel that could support a wide range of specialized AI applications. “From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet.”