LMMs Lag in Affordance-Grounded Tool Use per MM-CreativityBench
Benchmark reveals LMMs fail at creative physical intelligence due to insufficient grounded exploration; DPO alignment yields measurable gains.
The arXiv paper "Advancing Creative Physical Intelligence in Large Multimodal Models" (arXiv:2605.26396) introduces MM-CreativityBench to test whether LMMs can identify and compose physically feasible tool uses from scene images rather than relying on pattern matching. Qian et al. report that models overlook entities, under-examine parts, and hallucinate attributes. They also report that affordance-grounded Direct Preference Optimization (DPO) and knowledge-base supervision reduce errors on entity selection and multi-turn planning. Related work leaves this ground uncovered: PaLM-E (arXiv:2303.03378) demonstrated embodied reasoning via robotic trajectories yet omitted open-ended visual affordance discovery, and RT-2 (arXiv:2307.15818) scaled vision-language-action models but evaluated scripted tasks instead of iterative grounded exploration. The benchmark exposes a gap in current scaling narratives, which focus on language-only metrics, and indicates that preference alignment over visual evidence directly improves physical feasibility without added generative capacity.
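For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates the general technique only; the paper's affordance-grounded variant, its data construction, and its hyperparameters are not described here, so the function name, argument names, and the `beta=0.1` default are assumptions made for illustration. The "chosen" response stands in for a physically feasible tool use and the "rejected" response for an infeasible or hallucinated one.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. All names here are hypothetical.
    """
    # How much more the policy favors the chosen response over the rejected
    # one, relative to the reference model, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the mechanism by which preference pairs can steer a model toward grounded answers without changing its architecture.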
AXIOM: Physical affordance benchmarks will shift LMM evaluation from static QA to iterative scene interaction within 18 months.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2605.26396)
- [2] Related Source (https://arxiv.org/abs/2303.03378)
- [3] Related Source (https://arxiv.org/abs/2307.15818)