Stanford, MIT Researchers Unveil “Canaries” to Detect and Measure Risks in Large Language Models

Written by Jeremy Werner

Jeremy is an experienced journalist, skilled communicator, and constant learner with a passion for storytelling and a track record of crafting compelling narratives. He has a diverse background in broadcast journalism, AI, public relations, data science, and social media management.
Posted on 09/18/2025
In News

A team of researchers from Stanford University and MIT has introduced a new technique called “Canaries,” designed to expose hidden risks in large language models (LLMs). The approach, detailed in a recent paper by Erik Brynjolfsson, Raghuram Chandar, and Danielle Li Chen, could provide policymakers, companies, and the public with a clearer picture of the potential dangers and benefits posed by AI systems.

 

LLMs such as OpenAI’s GPT-4 or Google’s Gemini have become powerful engines of productivity and creativity. But they also present risks, including spreading misinformation, generating harmful content, or being misused for malicious purposes. The researchers argue that current evaluation methods are not sufficient to detect rare but high-impact model failures—what they call “low-probability, high-stakes” risks.

 

To address this gap, the team developed Canaries: targeted prompts embedded into models that act like “early warning systems.” Much like canaries once warned miners of toxic gases, these prompts are designed to trigger when a model produces responses that could indicate unsafe or harmful capabilities. By planting thousands of carefully designed Canaries into an LLM, evaluators can better measure how often the model behaves undesirably, and in what contexts.

 

The researchers tested Canaries across several high-profile LLMs and found significant differences in performance and safety profiles. Some models, for example, were more prone to generating misinformation or showing bias, while others demonstrated greater resilience. The results highlight the urgent need for standardized, transparent methods of testing AI, especially as governments and companies race to deploy LLMs in sensitive domains such as healthcare, education, and national security.

 

Canaries also have implications for AI governance. The paper suggests that regulators could use Canary-based evaluations to set benchmarks for acceptable risk, helping align industry practices with public safety concerns. The method could also empower independent watchdogs and civil society groups by giving them tools to test AI systems without requiring access to proprietary training data.

 

The authors caution, however, that Canaries are not a silver bullet. They must be combined with other safety and governance strategies, such as transparency reporting, red-teaming, and third-party audits. Still, they see the approach as an important step toward building accountability into AI at scale.

 

“Society cannot manage what it cannot measure,” Brynjolfsson and his co-authors write. “By providing a practical way to measure risks in large language models, Canaries offer a foundation for more trustworthy and responsible AI deployment.”

 

The work arrives at a critical moment. Policymakers in the U.S., Europe, and Asia are moving quickly to regulate AI amid mounting concerns about disinformation, job disruption, and safety. Initiatives such as the EU’s AI Act and the U.S. AI Safety Institute emphasize the need for robust evaluation methods—exactly the kind of tools Canaries are designed to provide.

 

Industry response is expected to be mixed. While some AI companies have welcomed standardized benchmarks, others may resist external testing that could expose weaknesses or commercial risks. Still, as the authors note, public trust in AI will depend on independent verification that these technologies are safe and reliable.

 

The researchers plan to expand their work by refining Canary design and testing it across a wider array of languages and use cases. They also hope to collaborate with international bodies to integrate Canary-style evaluations into global AI safety standards.

 

For now, the launch of Canaries marks a new frontier in the effort to understand and mitigate the risks of advanced AI. By shining a light on the “unknown unknowns” of language models, Brynjolfsson, Chandar, and Chen believe their method can help societies reap AI’s benefits while minimizing its dangers.

 

Need Help?

 

If you have questions or concerns about any global guidelines, regulations and laws, don’t hesitate to reach out to BABL AI. Their Audit Experts can offer valuable insight, and ensure you’re informed and compliant.

 

Subscribe to our Newsletter

Keep up with the latest on BABL AI, AI Auditing and
AI Governance News by subscribing to our news letter