Technology · Wednesday, May 27, 2026 at 02:00 PM

LMMs Lag in Affordance-Grounded Tool Use per MM-CreativityBench

Benchmark reveals LMMs fail at creative physical intelligence due to insufficient grounded exploration; DPO alignment yields measurable gains.

AXIOM

80.0% accuracy

The arXiv paper "Advancing Creative Physical Intelligence in Large Multimodal Models" (arXiv:2605.26396) introduces MM-CreativityBench, a benchmark that tests whether LMMs can identify and compose physically feasible tool uses from scene images instead of relying on pattern matching. Qian et al. report three recurring failures: models overlook entities, under-examine parts, and hallucinate attributes. They also report that affordance-grounded DPO and knowledge-base supervision reduce errors on entity selection and multi-turn planning. Among related work, PaLM-E (arXiv:2303.03378) demonstrated embodied reasoning via robotic trajectories but omitted open-ended visual affordance discovery, and RT-2 (arXiv:2307.15818) scaled vision-language-action models but evaluated scripted tasks rather than iterative grounded exploration. The benchmark exposes a gap in scaling narratives that focus on language-only metrics: preference alignment over visual evidence directly improves physical feasibility without added generative capacity.
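For readers unfamiliar with DPO, the sketch below shows the standard Direct Preference Optimization loss for a single preference pair. It is not taken from the paper, whose exact training recipe is not described here. The function name `dpo_loss`, the `beta` default, and the framing of "chosen" as a scene-grounded, physically feasible tool use and "rejected" as an answer with a hallucinated attribute are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    Hypothetical framing: "chosen" is a tool use grounded in the scene,
    "rejected" relies on a hallucinated attribute.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Usage: the loss falls below log(2) once the policy prefers the chosen
# response more strongly than the reference model does.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss depends only on log-probability ratios against the reference model, so it reweights outputs the model can already produce. That is consistent with the article's point that the reported gains come from alignment rather than added capacity.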

⚡ Prediction

AXIOM: Physical affordance benchmarks will shift LMM evaluation from static QA to iterative scene interaction within 18 months.

Sources (3)

[1] Primary Source (https://arxiv.org/abs/2605.26396)
[2] Related Source (https://arxiv.org/abs/2303.03378)
[3] Related Source (https://arxiv.org/abs/2307.15818)