RiskWebWorld Reveals GUI Agent Gaps in E-Commerce Risk Tasks
RiskWebWorld benchmark shows top GUI models at 49.1% success on realistic e-commerce risk tasks, exposing gaps missed by prior benign benchmarks and demonstrating 16.2% gains via RL.
RiskWebWorld presents 1,513 tasks drawn from production risk-control pipelines across 8 domains, testing GUI agents on uncooperative websites and partial environmental hijacking. The primary source (arxiv.org/abs/2604.13531) reports that top generalist models reach 49.1% success, while specialized open-weights models approach total failure. Prior benchmarks such as WebArena (arxiv.org/abs/2307.13854) and Mind2Web (arxiv.org/abs/2306.13304) focused on benign consumer flows and omitted these high-stakes investigative conditions, leaving unexamined the long-horizon planning failures that RiskWebWorld now quantifies.
The capability gap indicates that foundation-model scale matters more than zero-shot interface grounding for professional tasks, consistent with scaling patterns in AgentBench evaluations (arxiv.org/abs/2308.03688). Coverage of the original paper did not address how mainstream consumer benchmarks systematically understate real-world friction in risk domains, where sites actively resist automation. The benchmark's Gymnasium-compliant infrastructure decouples the agent policy from the environment mechanics, enabling reproducible RL training that lifts open-source models by 16.2%.
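To make the "policy decoupled from mechanics" idea concrete, here is a minimal sketch in Python. It is not the RiskWebWorld API: the class `ToyRiskTaskEnv`, the action names, the `block_prob` parameter, and both policies are hypothetical, invented only for illustration. The sketch borrows just the Gymnasium interface convention, in which `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`, and it uses only the standard library so that it runs without installing anything.

```python
import random
from typing import Any, Callable, Optional

Observation = dict[str, Any]
Policy = Callable[[Observation], str]


class ToyRiskTaskEnv:
    """Hypothetical toy environment; NOT the RiskWebWorld API.

    Only the reset()/step() return shapes follow the Gymnasium convention.
    """

    def __init__(self, block_prob: float = 0.3, max_steps: int = 5) -> None:
        self.block_prob = block_prob  # chance the simulated site refuses an action
        self.max_steps = max_steps    # step budget before the episode is truncated
        self._rng = random.Random(0)
        self._steps = 0

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        if seed is not None:
            self._rng = random.Random(seed)  # reseed for reproducible episodes
        self._steps = 0
        return {"page": "home", "blocked": False}, {}

    def step(self, action: str):
        self._steps += 1
        # An "uncooperative" site: the action may be refused at random.
        blocked = self._rng.random() < self.block_prob
        solved = action == "open_evidence" and not blocked
        terminated = solved
        truncated = (not solved) and self._steps >= self.max_steps
        obs = {"page": "evidence" if solved else "home", "blocked": blocked}
        reward = 1.0 if solved else 0.0
        return obs, reward, terminated, truncated, {}


def persistent_policy(obs: Observation) -> str:
    """Hypothetical policy: keep retrying the task-relevant action."""
    return "open_evidence"


def idle_policy(obs: Observation) -> str:
    """Hypothetical policy: never attempts the task."""
    return "wait"


def run_episode(env: ToyRiskTaskEnv, policy: Policy, seed: int) -> bool:
    """Roll out one episode; the loop knows nothing about env internals."""
    obs, _ = env.reset(seed=seed)
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        if terminated or truncated:
            return reward > 0


def success_rate(env: ToyRiskTaskEnv, policy: Policy, n_episodes: int) -> float:
    """Fraction of seeded episodes that end in success."""
    wins = sum(run_episode(env, policy, seed) for seed in range(n_episodes))
    return wins / n_episodes
```

Calling `success_rate(ToyRiskTaskEnv(block_prob=0.3), persistent_policy, 100)` evaluates one policy against one environment configuration. Because the rollout loop touches only `reset()` and `step()`, the same loop can score a different policy or a harsher environment without modification. That separation is what makes evaluation and RL training repeatable across models.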
RiskWebWorld supplies the missing rigorous evaluation layer for safe adoption of GUI agents in financial risk management, showing that current agents remain inadequate for production deployment without targeted training on adversarial environments. The results fit the broader pattern that success on benign tasks does not transfer to uncooperative settings, positioning this benchmark as a required testbed for robust digital workers.
AXIOM: Top models reach only 49.1% success on RiskWebWorld's uncooperative risk tasks; RL training delivers 16.2% gains, indicating that domain-specific benchmarks are required before safe high-stakes deployment.
Sources (3)
- [1] RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management (https://arxiv.org/abs/2604.13531)
- [2] WebArena: A Realistic Web Environment for Building Autonomous Agents (https://arxiv.org/abs/2307.13854)
- [3] Mind2Web: Towards a Generalist Agent for the Web (https://arxiv.org/abs/2306.13304)