Stanford researchers unveiled QuantiPhy, a benchmark and training framework showing that today's vision-language models often guess rather than reason when estimating real-world quantities such as size, speed, and distance. In tests spanning thousands of videos, the models leaned on memorized facts, struggled with counterfactuals, and frequently performed no better than chance: shortcomings that could hinder robotics and autonomous vehicles. The team reports that end-to-end learning outperformed hand-engineered, step-by-step prompts, suggesting that explicit human reasoning templates can impede quantitative learning. The researchers aim to extend QuantiPhy to richer 3D, multi-camera scenarios and more complex dynamics, with potential payoffs in safer autonomy, industrial automation, and surgical robotics.
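
To make the evaluation concrete: benchmarks for numeric estimation typically score a model's answer by how close it lands to ground truth rather than by exact match. The article does not include QuantiPhy's actual scoring code, so the sketch below is a minimal illustration of one common approach, a relative-error tolerance; the function names, the 25% threshold, and the sample numbers are all assumptions for illustration, not details from the paper.

```python
# Illustrative only: QuantiPhy's real scoring rule is not given in the
# article. This sketch assumes a common convention for grading numeric
# estimates: a prediction counts as correct if it falls within a chosen
# relative-error tolerance of the ground-truth value.

def within_tolerance(pred: float, truth: float, rel_tol: float = 0.25) -> bool:
    """Hypothetical scoring rule: |pred - truth| / |truth| <= rel_tol."""
    if truth == 0:
        return pred == 0
    return abs(pred - truth) / abs(truth) <= rel_tol

def accuracy(preds: list[float], truths: list[float], rel_tol: float = 0.25) -> float:
    """Fraction of estimates within tolerance across a set of clips."""
    hits = sum(within_tolerance(p, t, rel_tol) for p, t in zip(preds, truths))
    return hits / len(preds)

# Made-up example: speed estimates in m/s for three video clips.
# 12.0 vs 10.0 and 3.1 vs 3.0 are within 25%; 40.0 vs 90.0 is not.
print(accuracy([12.0, 3.1, 40.0], [10.0, 3.0, 90.0]))  # -> 0.666...
```

Under a metric like this, "no better than chance" means the model's hit rate matches what you would get by sampling answers from the task's overall answer distribution, which is why memorized typical values can masquerade as reasoning until counterfactual cases expose them.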





























