LLM Video Game Failures Expose Limits in Dynamic Environments
Analysis of LLM game performance reveals core limitations in dynamic, low-data scenarios beyond static benchmarks.
Large language models still play video games poorly despite rapid gains in coding, according to Julian Togelius' analysis (https://spectrum.ieee.org/ai-video-games-llms-togelius). On benchmarks such as the General Video Game AI (GVGAI) competition, agents underperform simple search algorithms on unseen titles. LLMs struggle with spatial reasoning, which is poorly represented in their training data, and they cannot handle input and output spaces that vary from one game's mechanics to the next. Specialized systems hit the same wall: even AlphaZero was trained separately for each game, as its 2018 Science paper describes (its Go-only predecessor, AlphaGo Zero, appeared in Nature in 2017). This pattern matches a 2023 arXiv survey of LLM game agents, which reports repetitive errors and reliance on custom scaffolding even in data-rich environments like Pokémon. The underlying constraints surface in non-stationary settings that lack immediate, fine-grained feedback, in contrast to coding, where tasks come with well-behaved reward loops. Togelius notes that data scarcity for obscure games compounds the problem, and that no general game-playing AI has been achieved despite specialized successes. Coverage of scaling progress tends to overlook these persistent gaps in real-time adaptation, which GVGAI results from 2014 to 2021 document.
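To make "simple search algorithms" concrete, here is a minimal sketch of flat Monte Carlo search on a toy grid game. Everything in it is an assumption made for illustration: the 5x5 grid, the `step` forward model, and the names `rollout`, `choose_action`, and `play` are hypothetical, and none of it comes from the GVGAI framework or from Togelius' analysis. The game pays a reward only when the goal is reached, so the agent gets no step-by-step feedback and has to rely on simulating futures through the forward model.

```python
import random

# Hypothetical toy game: a 5x5 grid where the only reward is +1 at the goal.
SIZE = 5
GOAL = (4, 4)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
ACTION_NAMES = list(ACTIONS)
DISCOUNT = 0.95


def step(state, action):
    """Forward model: return (next_state, reward, done) for one move."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)  # clamp to the grid
    y = min(max(state[1] + dy, 0), SIZE - 1)
    nxt = (x, y)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def rollout(state, rng, depth=30):
    """Play random moves from `state`; return the discounted reward seen."""
    total, factor = 0.0, 1.0
    for _ in range(depth):
        state, reward, done = step(state, rng.choice(ACTION_NAMES))
        total += factor * reward
        factor *= DISCOUNT
        if done:
            break
    return total


def choose_action(state, rng, n_rollouts=100):
    """Flat Monte Carlo search: score each action by its average rollout return."""
    scores = {}
    for action in ACTION_NAMES:
        nxt, reward, done = step(state, action)
        if done:
            scores[action] = reward
        else:
            mean = sum(rollout(nxt, rng) for _ in range(n_rollouts)) / n_rollouts
            scores[action] = reward + DISCOUNT * mean
    return max(scores, key=scores.get)


def play(seed=0, max_steps=50):
    """Run one episode; return the final state and the number of moves taken."""
    rng = random.Random(seed)
    state, steps = (0, 0), 0
    while state != GOAL and steps < max_steps:
        state, _, _ = step(state, choose_action(state, rng))
        steps += 1
    return state, steps
```

The agent knows nothing about the game beyond what `step` returns, which is why this style of search transfers across titles whenever a forward model is available. It also shows the cost of sparse feedback: far from the goal, most random rollouts return zero, so the action scores are noisy and the agent may wander before it finds a useful signal.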
Togelius: LLMs require architectural shifts for spatial and dynamic reasoning rather than scale alone.
Sources (3)
- [1] Primary Source (https://spectrum.ieee.org/ai-video-games-llms-togelius)
- [2] Related Source (https://arxiv.org/abs/2302.02923)
- [3] Related Source (https://www.nature.com/articles/nature24270)