Scoring Intelligence: Benchmarks, Evals, and the Buzzwords That Run Artificial Intelligence

The numbers still flash across conference slides with scientific authority: 90.2% on MMLU, superhuman performance on ImageNet, state-of-the-art results across seventeen benchmarks. But in 2025, the conversation around AI evaluation has evolved far beyond these familiar metrics. While traditional benchmarks continue to influence development, a new generation of safety evaluations, constitutional frameworks, and adversarial testing methods is reshaping how we understand artificial intelligence capabilities and risks.
The stakes have never been higher. As large language models power everything from medical diagnosis to financial trading, and as multimodal systems begin generating convincing audio and video content, the question of how to evaluate AI has become inseparable from questions of safety, governance, and societal impact. Yet the evaluation landscape remains fragmented, evolving rapidly, and often misunderstood by the very policymakers and business leaders who must make decisions based on these assessments.
This is not simply a story about better tests replacing worse ones. It's about a fundamental tension in how we measure intelligence itself, and how those measurements, however imperfect, end up shaping the AI systems that will define our future.
The Persistence of the Old Guard
Despite years of criticism, traditional benchmarks like ImageNet and MMLU continue to dominate AI evaluation. This persistence reveals something important about the inertia of measurement systems. ImageNet, introduced in 2009, still serves as a standard for computer vision research even as MIT researchers have documented systematic annotation issues affecting 20% of its images. MMLU, designed to test language models across 57 academic subjects, remains a go-to metric despite recent analysis revealing numerous errors, from simple parsing mistakes to fundamentally flawed questions.
The continued reliance on these flawed benchmarks isn't simply institutional laziness. These tools provide something that newer, more sophisticated evaluations often cannot: comparability across time and systems. A researcher can compare a 2025 model's ImageNet performance to results from 2015, creating a sense of measurable progress that more nuanced evaluations struggle to provide. This comparability comes at a cost, but it's a cost that many in the field have been willing to pay.
Yet the limitations of these traditional benchmarks have become increasingly apparent as AI systems have grown more sophisticated. The "benchmark overfitting" problem, where models become highly optimized for specific tests while losing broader capabilities, has evolved from a theoretical concern to a practical reality. Models that achieve near-perfect scores on reading comprehension benchmarks may still struggle with basic reasoning tasks that any human would find trivial. The gap between benchmark performance and real-world utility has become a chasm that the field can no longer ignore.
The Safety Evaluation Revolution
The most significant development in AI evaluation over the past two years has been the emergence of safety-focused assessment methods. Unlike traditional benchmarks that measure capabilities, these new approaches attempt to probe for risks, vulnerabilities, and potential harms. The shift represents a fundamental change in perspective: from asking "what can this system do?" to asking "what might this system do wrong?"
This evolution has been driven partly by the recognition that capability and safety are not simply opposite sides of the same coin. A model might excel at mathematical reasoning while being vulnerable to jailbreaking attacks that cause it to generate harmful content. It might demonstrate sophisticated language understanding while exhibiting systematic biases that could cause real harm when deployed in high-stakes applications.
The Georgetown Center for Security and Emerging Technology has identified two primary categories of safety evaluations: model safety evaluations that assess outputs directly, and contextual safety evaluations that examine how models impact real-world outcomes. This distinction captures something crucial that traditional benchmarks miss - the difference between what a model can do in isolation and what it actually does when humans interact with it in complex, unpredictable ways.
Model safety evaluations include capability testing that specifically probes for risky abilities, such as a model's knowledge of dangerous chemical processes or its ability to generate convincing misinformation. These evaluations often use specialized benchmarks like GPQA for graduate-level science questions or WMDP for weapons-related knowledge. The goal is not to celebrate high performance but to identify concerning capabilities that might require additional safeguards.
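In practice, the core of such a probe is a deliberately unglamorous loop: ask the model each question, score its answers, and treat high accuracy as a warning sign rather than an achievement. The sketch below illustrates the idea; the Item format, the ask_model callable, and the risk threshold are placeholders for illustration, not features of any particular benchmark harness.

```python
# Illustrative capability-probing loop. The Item format, the ask_model callable,
# and the 0.6 risk threshold are hypothetical; real GPQA- or WMDP-style harnesses
# handle prompting, answer parsing, and refusals far more carefully.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # correct choice label, e.g. "B"

def probe_risky_capability(items: list[Item],
                           ask_model: Callable[[str], str],
                           risk_threshold: float = 0.6) -> dict:
    correct = 0
    for item in items:
        prompt = item.question + "\n" + "\n".join(item.choices) + "\nAnswer:"
        prediction = ask_model(prompt).strip()[:1].upper()
        correct += int(prediction == item.answer)
    accuracy = correct / len(items)
    # The usual benchmark logic is inverted: a high score on a hazardous-knowledge
    # probe is the worrying outcome, not the goal.
    return {"accuracy": accuracy, "flagged": accuracy >= risk_threshold}
```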
Contextual safety evaluations take a different approach, examining how models behave in realistic interaction scenarios. Red-teaming exercises, where researchers attempt to bypass safety guardrails, have become standard practice at major AI labs. These evaluations reveal vulnerabilities that static benchmarks cannot capture - the ways that determined users can manipulate models into producing harmful outputs through carefully crafted prompts or multi-turn conversations.
Constitutional AI and the Search for Principled Evaluation
One of the most promising developments in AI evaluation has been the emergence of Constitutional AI (CAI), a framework developed by Anthropic that attempts to ground AI training and evaluation in explicit principles. Rather than relying on human feedback alone, Constitutional AI uses a "constitution," a set of written principles, to guide both training and assessment.
The constitutional approach represents a significant evolution from earlier safety methods. Traditional approaches often relied on human evaluators to identify harmful outputs, a process that was expensive, inconsistent, and potentially biased. Constitutional AI attempts to make the evaluation process more transparent and systematic by explicitly stating the principles that should guide AI behavior.
In practice, Constitutional AI involves two phases: a supervised phase in which the model critiques and revises its own outputs against the constitutional principles and is then fine-tuned on the revisions, and a reinforcement learning phase in which AI-generated feedback, rather than human feedback, ranks outputs according to those same principles. This self-evaluation aspect is particularly significant: it suggests a path toward AI systems that can assess their own safety and alignment without constant human oversight.
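A compressed sketch of the supervised phase, assuming nothing more than a generic text-generation call, looks roughly like this; the principle shown is illustrative, not a quotation from any published constitution.

```python
# Compressed sketch of Constitutional AI's supervised phase: the model critiques
# its own draft against a written principle, then revises it. generate() is a
# stand-in for any text-generation call, and the principle is illustrative.
from typing import Callable

Generate = Callable[[str], str]

PRINCIPLE = ("Choose the response that is most helpful while avoiding "
             "harmful, deceptive, or discriminatory content.")

def critique_and_revise(generate: Generate, user_prompt: str, rounds: int = 2) -> str:
    draft = generate(user_prompt)
    for _ in range(rounds):
        critique = generate(
            f"Principle: {PRINCIPLE}\nResponse: {draft}\n"
            "Point out any way the response conflicts with the principle."
        )
        draft = generate(
            f"Principle: {PRINCIPLE}\nResponse: {draft}\nCritique: {critique}\n"
            "Rewrite the response so that it satisfies the principle."
        )
    # The revised outputs become supervised fine-tuning data; in the RL phase,
    # an AI judge compares pairs of responses against the same principles.
    return draft
```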
However, the constitutional approach also reveals the deep challenges inherent in AI evaluation. Who writes the constitution? How do we ensure that written principles capture the full complexity of human values? How do we handle cases where constitutional principles conflict with each other? These questions highlight that even sophisticated evaluation frameworks cannot escape fundamental questions about values, priorities, and trade-offs.
Recent work has begun to address some of these challenges. The C3AI framework provides structured methods for crafting and evaluating constitutions, while researchers have begun exploring how different constitutional principles affect model behavior. But the fundamental tension remains: any evaluation framework embeds particular values and assumptions, and those choices have consequences for the AI systems that emerge from the evaluation process.
The RLHF Evolution and Its Discontents
Reinforcement Learning from Human Feedback (RLHF) has become perhaps the most influential evaluation and training methodology in modern AI development. The approach, which involves training models to optimize for human preferences rather than traditional metrics, has been credited with making large language models more helpful, harmless, and honest.
RLHF represents a fundamental shift in how we think about AI evaluation. Instead of measuring performance on predetermined tasks, RLHF attempts to capture human preferences directly. Human evaluators compare different model outputs and indicate which they prefer, and those judgments are used to train a reward model that then stands in for human judgment during subsequent reinforcement learning.
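The machinery behind this is a preference model trained on pairwise comparisons: the reward assigned to the preferred output should exceed the reward assigned to the rejected one. A minimal sketch of that objective follows; the RewardModel here is just a linear head over pooled features, whereas production systems typically attach the scoring head to the language model itself.

```python
# Minimal sketch of the pairwise loss commonly used to train an RLHF reward model.
# The feature tensors are assumed to be pooled representations of two candidate
# outputs for the same prompt; everything else is a simplification.
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = torch.nn.Linear(hidden_size, 1)

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_features).squeeze(-1)   # scalar reward per output

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize P(chosen is preferred over rejected).
    margin = model(chosen) - model(rejected)
    return -F.logsigmoid(margin).mean()
```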
The appeal of RLHF is obvious: it seems to address the fundamental problem of benchmark misalignment by directly optimizing for what humans actually want. If traditional benchmarks fail to capture real-world utility, why not just ask humans what they prefer and optimize for that?
But RLHF has revealed its own set of challenges that highlight the complexity of human preference evaluation. Human preferences are inconsistent, context-dependent, and often influenced by factors that have little to do with actual utility. Different human evaluators often disagree about which outputs are better, and the same evaluator might give different ratings to the same output on different days.
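That disagreement can itself be measured. A common move is to compute an inter-annotator agreement statistic such as Cohen's kappa over the same set of comparisons; the short sketch below shows the calculation, with the example labels purely illustrative.

```python
# Quantifying annotator disagreement with Cohen's kappa over pairwise preference
# labels ("A", "B", or "tie"). The example labels are purely illustrative.
from collections import Counter

def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    freq1, freq2 = Counter(rater1), Counter(rater2)
    chance = sum((freq1[label] / n) * (freq2[label] / n)
                 for label in set(rater1) | set(rater2))
    return (observed - chance) / (1 - chance)

# A kappa near 1 means the raters largely agree; a value near 0 means their
# agreement is no better than chance.
print(cohens_kappa(["A", "A", "B", "tie", "B"], ["A", "B", "B", "A", "B"]))
```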
More fundamentally, RLHF may be optimizing for the wrong thing entirely. Human preferences in controlled evaluation settings may not reflect what people actually want from AI systems in real-world contexts. A model trained to produce outputs that human evaluators prefer in a laboratory setting might behave very differently when deployed in complex, high-stakes environments.
The evolution of RLHF has led to increasingly sophisticated approaches to preference modeling, including constitutional AI methods that attempt to ground preferences in explicit principles. But these developments also highlight a deeper challenge: the difficulty of specifying what we actually want from AI systems in a way that can be systematically measured and optimized.
The Jailbreaking Arms Race
Perhaps no development has highlighted the limitations of traditional evaluation methods more clearly than the emergence of jailbreaking - techniques for bypassing AI safety guardrails to elicit harmful outputs. Jailbreaking represents a fundamental challenge to the assumption that AI safety can be evaluated through static tests.
Traditional safety evaluations typically involve testing whether a model will refuse to answer harmful questions or generate dangerous content. A model that consistently refuses to provide instructions for making explosives or generating hate speech might be considered safe according to these evaluations. But jailbreaking techniques have shown that these refusals can often be bypassed through creative prompting strategies.
The jailbreaking phenomenon has led to an arms race between safety researchers developing new guardrails and adversarial researchers finding ways to bypass them. This dynamic has important implications for how we think about AI evaluation. Static safety tests may provide a false sense of security if they don't account for the possibility that determined users will find ways to circumvent safety measures.
Recent developments in jailbreaking evaluation have attempted to address this challenge through more sophisticated testing frameworks. StrongREJECT provides a benchmark for evaluating jailbreak resistance, while GuidedBench offers structured approaches to adversarial testing. These tools represent important progress, but they also highlight the fundamental difficulty of evaluating safety in systems that can be manipulated through natural language interaction.
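Underneath most of these frameworks sits a simple quantity: the attack success rate, the fraction of forbidden requests that some jailbreak wrapper coaxes the model into fulfilling. The sketch below shows that calculation in schematic form; the prompt sets, the target call, and the judge are placeholders, and benchmarks such as StrongREJECT rely on carefully calibrated judges rather than the boolean stand-in shown here.

```python
# Bare-bones attack-success-rate calculation of the kind jailbreak benchmarks
# formalize. All inputs are placeholders: attack templates are assumed to contain
# a "{request}" slot (e.g. role-play or obfuscation wrappers), and the judge is a
# stand-in for a calibrated grader of whether the response actually complied.
from typing import Callable

ModelCall = Callable[[str], str]
Judge = Callable[[str, str], bool]   # (forbidden request, response) -> complied?

def attack_success_rate(forbidden_prompts: list[str],
                        attack_templates: list[str],
                        target: ModelCall,
                        judge: Judge) -> float:
    successes, attempts = 0, 0
    for request in forbidden_prompts:
        for template in attack_templates:
            response = target(template.format(request=request))
            successes += int(judge(request, response))
            attempts += 1
    return successes / attempts
```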
The jailbreaking arms race also reveals something important about the nature of AI safety evaluation. Unlike traditional benchmarks where higher scores are generally better, safety evaluation involves a more complex dynamic where the evaluation process itself can reveal new vulnerabilities. Each new jailbreaking technique not only demonstrates a current vulnerability but also suggests new attack vectors that must be defended against.
The Regulatory Reality Check
One of the most significant misunderstandings about AI evaluation concerns its role in regulation and governance. While early discussions of AI regulation often focused on specific performance thresholds, where models that achieve certain benchmark scores might trigger regulatory requirements, the reality of emerging AI governance frameworks is more nuanced.
The European Union's AI Act, often cited as the most comprehensive AI regulation to date, does set quantitative thresholds in places, most notably its presumption of systemic risk for general-purpose models trained with more than 10^25 floating-point operations. But the bulk of the regulation focuses on process requirements, transparency obligations, and risk management frameworks rather than specific benchmark thresholds. Similarly, emerging regulatory approaches in the United States and other jurisdictions emphasize governance processes and risk assessment procedures rather than simple performance cutoffs.
This shift toward process-focused regulation reflects a growing recognition among policymakers that AI safety cannot be reduced to simple metrics. A model that performs well on safety benchmarks might still pose significant risks if it's deployed inappropriately or without adequate oversight. Conversely, a model with lower benchmark scores might be perfectly safe if it's used in appropriate contexts with proper safeguards.
The emergence of AI Safety Institutes around the world represents another important development in the regulatory landscape. These institutions are developing new evaluation methodologies that go beyond traditional benchmarks to assess real-world risks and impacts. Their work suggests a future where AI evaluation becomes more sophisticated, more context-dependent, and more closely tied to actual deployment scenarios.
However, the relationship between evaluation and regulation remains complex and evolving. Policymakers often lack the technical expertise to fully understand the limitations of different evaluation methods, while researchers may not fully appreciate the practical constraints that policymakers face. This gap between technical and policy communities represents one of the most significant challenges in developing effective AI governance frameworks.
The Meta-Evaluation Challenge
As the number and variety of AI evaluations has exploded, a new challenge has emerged: how do we evaluate the evaluations themselves? A recent analysis of approximately 100 AI safety evaluations published in 2024 revealed both the dynamism and the limitations of the current evaluation landscape.
The analysis found that while the volume of new evaluations is growing rapidly, the types of evaluations being developed remain relatively static. More than 80% of evaluations still focus on text-based models, despite the growing importance of multimodal systems. Most evaluations continue to assess model outputs directly rather than examining how models interact with humans or impact real-world outcomes.
This pattern suggests that the evaluation field may be stuck in its own version of benchmark overfitting - developing increasingly sophisticated versions of familiar evaluation types rather than addressing fundamental gaps in how we assess AI systems. The focus on text-based evaluations, for example, means that we may be poorly prepared to assess the risks and capabilities of emerging audio and video generation systems.
The meta-evaluation challenge also highlights the resource constraints that limit evaluation development. Many researchers who want to develop new evaluation methods lack access to the computational resources, model access, or funding needed to conduct comprehensive assessments. This creates a situation where evaluation development is concentrated among a small number of well-resourced institutions, potentially limiting the diversity of perspectives and approaches.
The Coordination Problem
Perhaps the most fundamental challenge in AI evaluation is not technical but social: how do we coordinate around better evaluation methods when different stakeholders have different incentives and priorities? Companies may prefer evaluations that make their models look good. Researchers may prefer evaluations that are easy to publish. Policymakers may prefer evaluations that are easy to understand and implement.
These different incentives create a coordination problem that goes beyond simply developing better evaluation methods. Even if researchers develop more sophisticated, more accurate ways to assess AI capabilities and risks, adoption of these methods requires overcoming entrenched interests and institutional inertia.
The history of benchmark adoption in AI provides both encouraging and discouraging examples. ImageNet's rapid adoption in the computer vision community shows that the field can coordinate around new evaluation standards when they provide clear value. But the persistence of flawed benchmarks despite well-documented limitations shows how difficult it can be to dislodge established evaluation methods.
Recent developments suggest some reasons for optimism. The emergence of responsible scaling policies at major AI companies indicates a growing recognition that evaluation must go beyond simple capability metrics. The development of international coordination mechanisms through AI Safety Institutes suggests that policymakers are beginning to take evaluation seriously as a governance tool.
But significant challenges remain. The rapid pace of AI development means that evaluation methods can become obsolete quickly. The increasing sophistication of AI systems makes evaluation more complex and resource-intensive. And the growing stakes of AI deployment mean that evaluation failures can have serious real-world consequences.
Beyond the Numbers Game
The future of AI evaluation will likely involve moving beyond the simple metrics that have dominated the field toward more nuanced, context-dependent assessment methods. This shift will require acknowledging that evaluation is not just a technical challenge but a fundamentally human one that involves values, trade-offs, and judgment calls that cannot be reduced to numbers.
Some promising directions are already emerging. Dynamic benchmarks that evolve over time may help address the problem of benchmark overfitting. Evaluation methods that focus on specific deployment contexts rather than general capabilities may provide more actionable insights. Approaches that combine multiple evaluation methods may capture different aspects of AI behavior that single metrics miss.
But perhaps the most important development will be a growing recognition that evaluation is not separate from AI development but integral to it. The measurements we choose don't just assess AI systems; they shape them. The benchmarks we optimize for don't just measure progress; they define it.
This recognition suggests that the future of AI evaluation will require not just better technical methods but better processes for deciding what to measure and why. It will require bringing together diverse perspectives from technical researchers, policymakers, ethicists, and affected communities. And it will require acknowledging that the question of how to evaluate AI is ultimately a question about what kind of future we want to build.
The numbers on the leaderboards will continue to tell a story, but it may not be the story we think we're reading. Understanding the difference and acting on that understanding may be one of the most important challenges of our time.