A new analysis by the Oxford Internet Institute reviews 445 benchmarks commonly used to rate artificial-intelligence systems and finds that many oversell performance and lack scientific rigor. The paper faults top tests for vague definitions, data reuse, and weak statistical methods, arguing that claims such as “Ph.D.-level intelligence” should be treated skeptically. The authors spotlight gaps in “construct validity” (whether a benchmark actually measures what it purports to measure) and cite GSM8K as a case where right answers don’t necessarily indicate reasoning ability. The team offers eight recommendations and a checklist to improve transparency and comparability. Industry researchers at METR and Anthropic have urged similar statistical safeguards, while OpenAI and the Center for AI Safety are pushing new, job-focused evaluations to better reflect real-world capability.
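
The statistical safeguards at issue largely amount to reporting uncertainty alongside a headline score. As a rough illustration only (not code from the paper; the model names, scores, and test-set size here are assumptions chosen for the example), the Python sketch below computes Wilson 95% confidence intervals for two models’ accuracies on the same test set and checks whether the intervals overlap, which would suggest the reported ranking is not statistically settled.

    import math

    def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval (default 95%) for a benchmark accuracy of correct/total."""
        if total <= 0:
            raise ValueError("total must be positive")
        p = correct / total
        denom = 1 + z**2 / total
        centre = (p + z**2 / (2 * total)) / denom
        margin = (z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))) / denom
        return centre - margin, centre + margin

    # Two hypothetical models scored on the same 1,319-item test set
    # (roughly the size of GSM8K's test split); the raw counts are invented.
    a_lo, a_hi = wilson_interval(1122, 1319)   # "Model A": ~85.1% accuracy
    b_lo, b_hi = wilson_interval(1103, 1319)   # "Model B": ~83.6% accuracy

    print(f"Model A: [{a_lo:.3f}, {a_hi:.3f}]")
    print(f"Model B: [{b_lo:.3f}, {b_hi:.3f}]")
    # Overlapping intervals mean the leaderboard gap may be noise, not capability.
    print("Intervals overlap:", a_lo <= b_hi and b_lo <= a_hi)

In this invented example the two intervals overlap, so a leaderboard that ranks Model A above Model B on the point estimate alone would be overstating what the data support.
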
Related articles:
– Beyond the Imitation Game Benchmark (BIG-bench)
– Measuring Massive Multitask Language Understanding (MMLU)
– Training Verifiers to Solve Math Word Problems (introducing GSM8K)
– Dynabench: Rethinking Benchmarking in NLP
– OpenAI Evals