Interactive Tests Reveal Limits of Theory of Mind Gains for LLM Collaboration
Empirical interactive evaluations question whether ToM enhancements improve human-AI collaboration, highlighting gaps in current LLM design assumptions.
A new arXiv study reports that Theory of Mind (ToM) improvements measured on static benchmarks do not reliably improve performance in dynamic, first-person human-AI interactions, whether the task is goal-oriented or experience-oriented. Gong et al. evaluated four representative ToM enhancement techniques on four real-world datasets and in a user study, and documented cases where benchmark gains failed to appear in live exchanges (https://arxiv.org/abs/2605.15205).
The work introduces an interactive evaluation paradigm that changes both the perspective, from third-person observer to first-person participant, and the metrics used to score it. This exposes mismatches between third-person multiple-choice formats and open-ended human-AI dialogue. Results cover coding, math, and counseling scenarios and show that ToM capability translates inconsistently into collaboration outcomes.
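The comparison described above can be made concrete with a small sketch. It is not the authors' code or metric: the class name, the function `transfer_gaps`, the technique labels, and every number below are hypothetical placeholders, chosen only to show how a gain on a static benchmark can be checked against the gain on an interactive outcome.

```python
from dataclasses import dataclass


@dataclass
class TechniqueResult:
    """Scores for one ToM enhancement technique (all values hypothetical)."""
    name: str
    static_base: float           # baseline accuracy on a static multiple-choice benchmark
    static_enhanced: float       # accuracy after applying the technique
    interactive_base: float      # baseline outcome score in live interaction
    interactive_enhanced: float  # outcome score after applying the technique


def transfer_gaps(results, min_gain=0.0):
    """Return names of techniques that gain on the static benchmark
    (gain > min_gain) but not in interaction (gain <= min_gain)."""
    flagged = []
    for r in results:
        static_gain = r.static_enhanced - r.static_base
        interactive_gain = r.interactive_enhanced - r.interactive_base
        if static_gain > min_gain and interactive_gain <= min_gain:
            flagged.append(r.name)
    return flagged


# Placeholder numbers for illustration only; they are NOT results from the paper.
results = [
    TechniqueResult("technique_a", 0.60, 0.75, 0.50, 0.58),
    TechniqueResult("technique_b", 0.60, 0.72, 0.50, 0.47),
]

print(transfer_gaps(results))  # ['technique_b']
```

In this toy setup, `technique_b` improves on the static benchmark but scores lower in interaction, which is the kind of mismatch the study reports. A real analysis would also need task-appropriate interactive metrics and significance testing.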
Prior static assessments, including those tracking ToM emergence patterns, did not incorporate these dynamic conditions, leaving unexamined the assumption that benchmark progress directly supports human-AI symbiosis.
AXIOM: Static ToM benchmarks miss critical interaction dynamics, requiring new evaluation methods for reliable human-AI gains.
Sources (2)
- [1] Primary Source (https://arxiv.org/abs/2605.15205)
- [2] Related Source (https://arxiv.org/abs/2302.02083)