New “reasoning” upgrades to prominent chatbots are producing more wrong answers, raising doubts about claims that AI accuracy will steadily improve. OpenAI’s latest technical report shows its new o3 and o4-mini models hallucinate 33% and 48% of the time, respectively, on a benchmark of factual questions about people, versus 16% for the prior o1 model. Vectara’s leaderboard finds similar trends across vendors, though experts caution that summary-based tests conflate benign and harmful errors and don’t generalize to all tasks. Linguist Emily Bender argues the term “hallucination” masks systemic limitations of LLMs, while Princeton’s Arvind Narayanan says users should reserve AI for jobs where verifying the output is cheaper than doing the work directly. The takeaway for businesses: expect persistent error rates, invest in guardrails and verification, and avoid deploying chatbots as authoritative sources of fact.
Related articles:
Hallucination (artificial intelligence)
Large language model
Retrieval-Augmented Generation for Knowledge-Intensive NLP
AI Risk Management Framework
Self-Consistency Improves Chain of Thought Reasoning in Language Models
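The “guardrails and verification” takeaway lends itself to a small illustration. The sketch below is a minimal, hypothetical verification guardrail in the spirit of the self-consistency paper listed above: sample a model several times on the same question and accept an answer only when a clear majority agrees, otherwise route it for human review. The `ask` callable, the agreement threshold, and the toy `flaky_model` are assumptions for illustration, not any specific vendor’s API.

```python
from collections import Counter
from typing import Callable, Optional, Tuple
import random


def self_consistency_answer(ask: Callable[[str], str], question: str,
                            n_samples: int = 5,
                            min_agreement: float = 0.6) -> Tuple[Optional[str], float]:
    """Sample the model several times and accept an answer only when a
    clear majority agrees; otherwise return None to signal "verify by hand".

    `ask` is any function that sends a prompt to a model and returns a short
    answer string (hypothetical -- wire it to whatever client you actually use).
    """
    answers = [ask(question).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    if agreement >= min_agreement:
        return top_answer, agreement
    return None, agreement  # below threshold: flag for human review


def flaky_model(prompt: str) -> str:
    # Toy stand-in for a model that is sometimes wrong, just to show the flow.
    return random.choice(["paris", "paris", "paris", "lyon"])


if __name__ == "__main__":
    answer, agreement = self_consistency_answer(flaky_model, "Capital of France?")
    print(answer, f"(agreement {agreement:.0%})")
```

Agreement among samples does not guarantee truth, since a model can be consistently wrong, so a check like this is only a cheap first filter, not a substitute for the kind of human verification Narayanan recommends.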