A new study published in Nature introduces what researchers call one of the most difficult benchmarks yet for measuring artificial intelligence capabilities, aiming to address growing concerns that existing tests no longer reflect the rapid progress of advanced AI models.
The paper, titled “A benchmark of expert-level academic questions to assess AI capabilities,” introduces Humanity’s Last Exam (HLE), a large-scale evaluation designed to test AI systems using questions that challenge even human experts. Developed by the Center for AI Safety, Scale AI, and a global consortium of contributors, the benchmark includes 2,500 closed-ended questions spanning dozens of academic fields, including mathematics, natural sciences, and the humanities.
Researchers say the new benchmark was created because widely used evaluations such as Massive Multitask Language Understanding (MMLU) are no longer sufficient for measuring the frontier of AI performance. According to the study, leading large language models now achieve more than 90 percent accuracy on some existing benchmarks, making it difficult for researchers, policymakers, and the public to understand where real capability limits still exist.
Humanity’s Last Exam aims to fill that gap by focusing on expert-level questions that require deeper reasoning rather than simple information retrieval. Each question is designed to have a clear, verifiable answer while resisting easy solutions that could be obtained through basic internet searches. The benchmark includes both multiple-choice and short-answer formats, allowing automated grading while maintaining rigorous academic standards.
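The closed-ended design described above is what makes automated grading possible: a response counts as correct only if it matches the verified reference answer. As a rough illustration (the schema and normalization here are hypothetical, not HLE's actual grading pipeline), exact-match scoring can be sketched as:

```python
# Minimal sketch of exact-match grading for closed-ended questions.
# The normalization rules here are illustrative, not HLE's actual pipeline.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences don't fail an otherwise correct answer."""
    return " ".join(text.lower().split())

def grade(prediction: str, reference: str) -> bool:
    """A response is correct only if it matches the reference
    answer after normalization."""
    return normalize(prediction) == normalize(reference)

print(grade("  Paris ", "paris"))  # formatting differences tolerated
print(grade("B", "C"))             # wrong multiple-choice letter fails
```

The point of this design choice is that correctness never depends on a human judge or a second model, which keeps large-scale evaluation cheap and reproducible.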
The authors describe the project as a global effort involving subject-matter experts who contributed questions intended to reflect the frontier of human knowledge. The benchmark is also multimodal, meaning some questions require interpreting images or diagrams alongside text rather than relying solely on written prompts.
Early results outlined in the Nature paper suggest that even state-of-the-art AI systems struggle with the test. Researchers reported low accuracy rates and poor calibration among leading large language models, indicating a significant gap between current AI performance and expert human-level understanding on complex academic tasks. Calibration refers to how well a model’s confidence aligns with the correctness of its answers, an important factor for assessing reliability in high-stakes environments.
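Calibration is typically summarized with a metric such as expected calibration error: predictions are grouped into confidence bins, and the gap between each bin's average stated confidence and its actual accuracy is averaged, weighted by bin size. The sketch below illustrates the idea with made-up data (the numbers are hypothetical, not results from the paper):

```python
# Minimal sketch of expected calibration error (ECE) over confidence bins.
# The data below is invented for illustration only.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to a half-open bin (lo, hi];
        # confidence 0.0 falls into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: high stated confidence, mostly wrong answers.
confs = [0.95, 0.9, 0.85, 0.9, 0.8]
right = [1, 0, 0, 0, 1]
print(round(expected_calibration_error(confs, right), 3))
```

A well-calibrated model drives this number toward zero; the paper's finding of poor calibration means frontier models tend to state high confidence even on questions they answer incorrectly.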
The authors argue that this gap is important for both scientific progress and policy discussions. As AI systems become increasingly integrated into research, education, and professional decision-making, having realistic measures of capability is essential for understanding risks and setting expectations. They suggest that benchmarks like HLE could help policymakers and developers make more informed decisions about deployment, safety measures, and governance frameworks.
The study also reflects a broader trend in AI evaluation research, where developers and independent organizations are racing to design tests that keep pace with rapidly advancing models. Existing benchmarks often become saturated quickly, meaning that incremental improvements in models can appear larger than they actually are because the tests no longer present meaningful challenges.
By publicly releasing Humanity’s Last Exam at lastexam.ai, the researchers say they hope to encourage transparency and open research collaboration. The benchmark is intended to evolve over time as AI capabilities change, helping the research community maintain a clearer picture of progress at the cutting edge.
The paper’s publication in Nature underscores growing interest in establishing standardized, high-quality evaluations for AI systems at a time when global debates over regulation, safety, and trust are intensifying. Experts increasingly warn that understanding what AI systems can and cannot do is critical as governments and industries explore how to integrate the technology into sensitive domains.
While the results show that AI still falls short of expert-level performance on many complex tasks, the authors note that progress in large language models continues at a rapid pace. They suggest that maintaining rigorous and evolving benchmarks will be essential for tracking future advancements and ensuring that public discussions about AI capabilities remain grounded in measurable evidence rather than hype.
Ultimately, Humanity’s Last Exam represents an effort to redefine how AI progress is measured — not by how well systems perform on simplified tests, but by how close they come to matching the depth and rigor of human expert reasoning.
Need Help?
If you’re wondering how AI policies, or any government’s AI bill or regulation, could impact you, don’t hesitate to reach out to BABL AI. Their Audit Experts are ready to provide valuable assistance while answering your questions and concerns.