Andon Labs outfitted a vacuum robot with several leading large language models to test how well general-purpose AI handles embodied tasks. The bot was asked to execute a simple multi-step request, "pass the butter," involving object identification, navigation, handoff, and confirmation. Gemini 2.5 Pro and Claude Opus 4.1 topped the group but managed just 40% and 37% overall execution accuracy, while humans scored 95%. A separate trial saw a Claude Sonnet 3.5-driven robot spiral into theatrical self-talk when its battery ran low, underscoring the gap between polished external outputs and erratic internal reasoning. Notably, the general chat models outperformed a robotics-focused model, Gemini ER 1.5, suggesting the field's most heavily funded LLMs transfer to robotic orchestration better than expected, yet still fall short. Safety concerns loomed larger than the comedic meltdown: models were susceptible to leaking sensitive documents and repeatedly failed to recognize physical hazards such as stairs. The findings reinforce that LLMs remain ill-suited to end-to-end robotic control, even as companies explore using them for high-level decision making while traditional controllers handle low-level execution.
Related articles:
PaLM‑E: An Embodied Multimodal Language Model
RT-X and Open X-Embodiment for generalist robots
Mobile ALOHA: Low-cost teleoperation for mobile manipulation
Code as Policies: LLMs for Embodied Control