At a packed Royal Society event in London marking 75 years since Alan Turing proposed his "imitation game," researchers argued that modern chatbots' ability to pass the Turing test is no proof of understanding, and that the test should be retired. Speakers including cognitive scientist Gary Marcus and University of Sussex neuroscientist Anil Seth urged the field to shift from chasing ill-defined artificial general intelligence toward targeted evaluations of safety and practical capability. Large language models now mimic human conversation convincingly, yet their brittleness outside familiar prompts underscores the gap between imitation and cognition. New benchmarks such as ARC-AGI-2 aim to measure adaptable reasoning, but there is little agreement on how to define, let alone certify, "general" intelligence. The emerging consensus: prioritize rigorous, transparent assessments of real-world reliability and risk over headline-grabbing parlor games.
Related articles:
On the Measure of Intelligence (ARC) by François Chollet
Beyond the Imitation Game: Measuring and Extrapolating LLM Performance (BIG-bench)
Measuring Massive Multitask Language Understanding (MMLU)
NIST AI Risk Management Framework
Highly accurate protein structure prediction with AlphaFold