technology
Monday, March 30, 2026 at 11:13 PM

GUIDE Uses Real-Time Web Video Retrieval to Resolve Domain Bias in GUI Agents

GUIDE retrieves instructional web videos in real time and annotates them automatically to reduce domain bias in GUI agents, reporting gains of over 5% on OSWorld without modifying model parameters.

AXIOM


Large vision-language models give GUI agents a general understanding of interfaces, but limited exposure to specialized software data during training leaves them with significant domain bias, according to arXiv:2603.26266. This bias impairs both task planning and UI grounding. The OSWorld benchmark (arXiv:2404.07972) exposes these gaps in real computer environments, while SeeAct (arXiv:2404.07764) demonstrates related grounding limitations in web and desktop settings.

GUIDE implements a subtitle-driven Video-RAG pipeline that performs domain classification, topic extraction, and relevance matching to retrieve tutorial videos. An inverse-dynamics annotation pipeline then passes keyframes with UI detection through VLMs to extract planning and grounding knowledge, which is injected directly into the agent's modules. Experiments on OSWorld report consistent gains exceeding 5 percent and fewer execution steps for both multi-agent and single-model systems, without altering model parameters.
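The retrieval stage described above (domain classification, topic extraction, relevance matching over subtitles) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `DOMAIN_KEYWORDS` table, the `Video` type, and the Jaccard-overlap score are hypothetical stand-ins, not GUIDE's actual classifiers or matching method, which the article does not specify.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword table; a stand-in for whatever domain classifier is used.
DOMAIN_KEYWORDS = {
    "spreadsheet": {"cell", "column", "row", "formula", "sheet"},
    "image_editing": {"layer", "brush", "crop", "canvas", "filter"},
}

STOPWORDS = {"a", "an", "the", "to", "in", "of", "and", "on", "for", "how"}


@dataclass
class Video:
    """Hypothetical record for a candidate tutorial video."""
    title: str
    subtitles: str


def tokenize(text):
    """Topic extraction stand-in: lowercase words minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def classify_domain(tokens):
    """Return the domain with the most keyword hits, or None if nothing matches."""
    best = max(DOMAIN_KEYWORDS, key=lambda d: len(tokens & DOMAIN_KEYWORDS[d]))
    return best if tokens & DOMAIN_KEYWORDS[best] else None


def relevance(task_tokens, video_tokens):
    """Relevance matching stand-in: Jaccard overlap between token sets."""
    union = task_tokens | video_tokens
    return len(task_tokens & video_tokens) / len(union) if union else 0.0


def retrieve(task, videos, top_k=1):
    """Keep videos in the task's domain, then rank them by subtitle relevance."""
    task_tokens = tokenize(task)
    domain = classify_domain(task_tokens)
    scored = []
    for video in videos:
        video_tokens = tokenize(video.title + " " + video.subtitles)
        if domain is not None and classify_domain(video_tokens) != domain:
            continue  # skip videos classified into a different domain
        scored.append((relevance(task_tokens, video_tokens), video))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video for _, video in scored[:top_k]]
```

A real system would presumably replace the keyword table and token overlap with learned classifiers or embedding similarity; the sketch only shows how the three stages could chain together.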

Prior coverage of GUI agents has largely overlooked domain bias as a core barrier to reliable real-world computer-use systems. GUIDE's training-free, plug-and-play design targets this gap directly by combining video retrieval with automated annotation. The approach extends beyond static benchmarks and offers architecture-agnostic adaptation for enterprise applications.
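One way to picture a training-free, plug-and-play design is as prompt augmentation: knowledge extracted from a tutorial is added to the inputs of the planning and grounding modules, and no model weights change. The sketch below assumes exactly that. The `VideoKnowledge` container and both prompt builders are hypothetical names for illustration; the article does not state how GUIDE formats or delivers the injected knowledge.

```python
from dataclasses import dataclass, field


@dataclass
class VideoKnowledge:
    """Hypothetical container for knowledge annotated from a tutorial video."""
    planning_steps: list = field(default_factory=list)   # ordered step descriptions
    grounding_hints: dict = field(default_factory=dict)  # UI element name -> visual hint


def build_planner_prompt(task, knowledge):
    """Prepend tutorial-derived steps to the planner's task prompt."""
    lines = [f"Task: {task}"]
    if knowledge.planning_steps:
        lines.append("Reference steps from a tutorial video:")
        lines += [f"{i}. {step}" for i, step in enumerate(knowledge.planning_steps, 1)]
    return "\n".join(lines)


def build_grounder_prompt(target, knowledge):
    """Attach a visual hint for the target element, if one was extracted."""
    prompt = f"Locate UI element: {target}"
    hint = knowledge.grounding_hints.get(target)
    return f"{prompt}\nHint: {hint}" if hint else prompt
```

Because the knowledge only changes the text passed to each module, the same wrapper could in principle sit in front of a multi-agent system or a single model, which is the sense in which such a design is architecture-agnostic.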

⚡ Prediction

GUIDE: Real-time web video retrieval lets GUI agents dynamically acquire domain-specific planning and grounding knowledge from tutorials, overcoming the data scarcity that has blocked reliable performance in specialized real-world software.

Sources (3)

[1] Primary Source (https://arxiv.org/abs/2603.26266)
[2] OSWorld Benchmark (https://arxiv.org/abs/2404.07972)
[3] SeeAct: Empowering Vision-Language Models for GUI Understanding (https://arxiv.org/abs/2404.07764)