WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Abstract
Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use), improving the success rate on WorldGUI tasks by 14.9%. This improvement underscores the effectiveness of our critical-thinking-based framework for GUI automation.
Community
In this paper, we take the first step toward comprehensive GUI agent evaluation by introducing a new benchmark, WorldGUI. In addition to standard static testing, we incorporate dynamic testing states so that WorldGUI effectively captures the complexity and dynamism of real-world GUI environments. Furthermore, to enhance GUI automation, we propose a novel agent framework, GUI-Thinker, built on a critical-thinking philosophy and comprising five core components. This framework enables the agent to detect uncommon states on the fly and adjust its plans or actions accordingly. Finally, we evaluate the latest computer-use agent, Claude-3.5 (Computer Use), on the WorldGUI benchmark, demonstrating the effectiveness of GUI-Thinker across a variety of GUI tasks.
The project page is available at https://showlab.github.io/WorldGUI/. The code and datasets are coming soon.
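Since the GUI-Thinker code has not yet been released, the sketch below is only a guess at the general shape of a critique-driven agent loop as described above: a planner drafts steps, an actor grounds and executes them, and a critic verifies each outcome, triggering re-planning whenever the observed GUI state diverges from what the plan assumed. Every name here (Planner, Actor, Critic, Verdict, revise_plan, and so on) is an assumption made for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a critique-driven GUI agent loop in the spirit
# of GUI-Thinker. All interfaces below are illustrative assumptions.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Verdict:
    success: bool
    feedback: str = ""      # critic's explanation, fed back to the planner

class Planner(Protocol):
    def make_plan(self, task: str, screenshot: bytes) -> list[str]: ...
    def revise_plan(self, task: str, screenshot: bytes,
                    done: list[str], feedback: str) -> list[str]: ...

class Actor(Protocol):
    def ground(self, step: str, screenshot: bytes) -> dict: ...   # step -> concrete action
    def execute(self, action: dict) -> bytes: ...                 # act, return new screenshot

class Critic(Protocol):
    def review(self, step: str, screenshot: bytes) -> Verdict: ...

def run_agent(task: str, screenshot: bytes,
              planner: Planner, actor: Actor, critic: Critic,
              max_steps: int = 50) -> bool:
    """Plan, act, and critique in a loop, re-planning from the *observed*
    state whenever a step fails (e.g. the target app was never opened)."""
    plan = planner.make_plan(task, screenshot)
    done: list[str] = []
    for _ in range(max_steps):
        if not plan:
            return True                      # all planned steps verified
        step = plan[0]
        action = actor.ground(step, screenshot)
        screenshot = actor.execute(action)   # act, then re-observe the GUI
        verdict = critic.review(step, screenshot)
        if verdict.success:
            done.append(step)
            plan.pop(0)
        else:
            # The environment was not in the assumed state: rebuild the
            # plan from the actual screenshot and the critic's feedback.
            plan = planner.revise_plan(task, screenshot, done, verdict.feedback)
    return False                             # step budget exhausted
```

The key departure from a plain plan-then-execute agent in this sketch is that the critic's verdict can invalidate the remaining plan, so the planner always works from the screen as it actually is rather than as the original plan assumed, which is exactly the failure mode that WorldGUI's varied initial states are designed to expose.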
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (2024)
- GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration (2025)
- GUI Agents: A Survey (2024)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2024)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents (2025)
- A3: Android Agent Arena for Mobile GUI Agents (2025)
- ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks (2025)