Study identifies weaknesses in how AI systems are evaluated

**Press Release:**

A recent study led by the Oxford Internet Institute (OII) at the University of Oxford has uncovered deficiencies in how AI systems are evaluated. The study involved a team of 42 researchers from institutions such as EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University.

The extensive analysis of AI benchmarks, which are standardized assessments used for comparing and ranking AI models, revealed a lack of scientific rigor in many of these tests. The study emphasizes the necessity for clearer definitions and more robust scientific standards to ensure accurate evaluations of large language models (LLMs).

The research paper, titled “Measuring What Matters: Construct Validity in Large Language Model Benchmarks”, has been accepted for publication in the upcoming NeurIPS conference proceedings. It scrutinizes 445 AI benchmarks and highlights issues such as ambiguous definitions and inadequate analytical approaches that make it difficult to draw dependable conclusions about AI advancements, capabilities, and safety.

Lead author Andrew Bean emphasized the critical role benchmarks play in shaping the development, deployment, and regulation of AI systems. Without well-defined parameters and reliable measurement techniques, assessing genuine progress in AI becomes challenging.

The study underscores that if benchmarks lack scientific validity, they could mislead developers and regulators about the true capabilities and safety levels of AI systems. To address these issues, the researchers offer eight recommendations based on established methodologies from fields like psychometrics and medicine to enhance the credibility of AI benchmarks.

Furthermore, they introduce a Construct Validity Checklist as a practical tool for researchers, developers, and regulators to assess the design integrity of AI benchmarks before relying on their outcomes.

For further details and inquiries:

Contact:
Anthea Milnes
Head of Communications
Sara Spinks / Veena McCoole
Media and Communications Manager
Phone: +44 (0)1865 280527
Mobile: +44 (0)7551 345493
Email: press@oii.ox.ac.uk

About:
The Oxford Internet Institute (OII) has been a pioneer in investigating the societal impacts of emerging technologies over the past 25 years. Through interdisciplinary research and education initiatives, the OII examines the challenges and opportunities posed by transformative innovations such as artificial intelligence, machine learning, digital platforms, and autonomous agents.

Oxford University continues to lead globally in research excellence and innovation, doing so for the tenth consecutive year according to the Times Higher Education World University Rankings in 2025. The institution’s success is rooted in its groundbreaking research and distinctive educational offerings, which attract top talent from around the world.
