GUIDE Uses Real-Time Web Video Retrieval to Resolve Domain Bias in GUI Agents
GUIDE retrieves instructional web videos in real time and annotates them automatically to reduce domain bias in GUI agents, reporting gains of more than 5% on OSWorld without model modifications.
Large vision-language models give GUI agents a general understanding of interfaces, yet the agents exhibit significant domain bias because training exposes them to little data from specialized software, per arXiv:2603.26266. This bias impairs both task planning and UI grounding. The OSWorld benchmark (arXiv:2404.07972) exposes these gaps in real computer environments, while SeeAct (arXiv:2404.07764) demonstrates related grounding limitations in web and desktop settings.
GUIDE implements a subtitle-driven Video-RAG pipeline that performs domain classification, topic extraction, and relevance matching to retrieve tutorial videos. An inverse-dynamics annotation pipeline then passes keyframes, augmented with UI detection, through VLMs to extract planning and grounding knowledge, which is injected directly into the agent's modules. Experiments on OSWorld report consistent gains exceeding 5% and fewer execution steps for both multi-agent and single-model systems, without altering model parameters.
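The retrieval and injection steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: bag-of-words cosine similarity over subtitles stands in for the relevance-matching step, the domain label is assumed to be supplied already, and every name here (`Video`, `retrieve`, `inject`) is hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class Video:
    """Hypothetical record for a candidate tutorial video."""
    title: str
    domain: str      # assumed output of a domain classifier, e.g. "spreadsheet"
    subtitles: str   # subtitle text used for matching


def tokens(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; a stand-in for real topic extraction.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(task: str, domain: str, videos: list[Video], top_k: int = 1) -> list[Video]:
    # Filter by domain first, then rank by subtitle similarity to the task.
    query = Counter(tokens(task))
    candidates = [v for v in videos if v.domain == domain]
    ranked = sorted(
        candidates,
        key=lambda v: cosine(query, Counter(tokens(v.subtitles))),
        reverse=True,
    )
    return ranked[:top_k]


def inject(task: str, knowledge: list[str]) -> str:
    # Prepend extracted knowledge to the agent prompt; no weights are changed.
    hints = "\n".join(f"- {k}" for k in knowledge)
    return f"Task: {task}\nHints from tutorial videos:\n{hints}"
```

In this sketch the agent only ever sees the retrieved knowledge as extra prompt text, which mirrors the training-free design: the model parameters stay untouched and the retrieval component can be swapped without changing the agent.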
Prior coverage of GUI agents has largely ignored domain bias as a core barrier to reliable real-world computer-use systems. GUIDE's training-free, plug-and-play design targets this gap directly by combining video retrieval with automated annotation. The approach extends beyond static benchmarks and offers architecture-agnostic adaptation for enterprise applications.
GUIDE: Real-time web video retrieval lets GUI agents dynamically acquire domain-specific planning and grounding knowledge from tutorials, overcoming the data scarcity that has blocked reliable performance in specialized real-world software.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2603.26266)
- [2] OSWorld Benchmark (https://arxiv.org/abs/2404.07972)
- [3] SeeAct: Empowering Vision-Language Models for GUI Understanding (https://arxiv.org/abs/2404.07764)