WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Abstract
Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use), improving the success rate on WorldGUI tasks by 14.9%. This improvement underscores the effectiveness of our critical-thinking-based framework for GUI automation.
Community
In this paper, we take the first step toward comprehensive GUI agent evaluation by introducing a new benchmark, WorldGUI. In addition to standard static testing, we incorporate dynamic testing states so that WorldGUI effectively captures the complexity and dynamism of real-world GUI environments. Furthermore, to enhance GUI automation, we propose a novel agent framework, GUI-Thinker, built on a critical-thinking philosophy and comprising five core components. This framework enables the agent to detect uncommon states on the fly and adjust its plans or actions accordingly. Finally, we evaluate the latest computer-use agent, Claude-3.5 (Computer Use), on the WorldGUI benchmark, demonstrating the effectiveness of GUI-Thinker across a variety of GUI tasks.
The project page is available at https://showlab.github.io/WorldGUI/. The code and datasets are coming soon.
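Since the GUI-Thinker code has not yet been released, the sketch below is only a guess at the general shape of a critique-driven agent loop as described above: a planner drafts steps, an actor grounds and executes them, and a critic verifies each outcome, triggering re-planning whenever the observed GUI state diverges from what the plan assumed. Every name here (Planner, Actor, Critic, Verdict, revise_plan, and so on) is an assumption made for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a critique-driven GUI agent loop in the spirit
# of GUI-Thinker. All interfaces below are illustrative assumptions.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Verdict:
    success: bool
    feedback: str = ""      # critic's explanation, fed back to the planner

class Planner(Protocol):
    def make_plan(self, task: str, screenshot: bytes) -> list[str]: ...
    def revise_plan(self, task: str, screenshot: bytes,
                    done: list[str], feedback: str) -> list[str]: ...

class Actor(Protocol):
    def ground(self, step: str, screenshot: bytes) -> dict: ...   # step -> concrete action
    def execute(self, action: dict) -> bytes: ...                 # act, return new screenshot

class Critic(Protocol):
    def review(self, step: str, screenshot: bytes) -> Verdict: ...

def run_agent(task: str, screenshot: bytes,
              planner: Planner, actor: Actor, critic: Critic,
              max_steps: int = 50) -> bool:
    """Plan, act, and critique in a loop, re-planning from the *observed*
    state whenever a step fails (e.g. the target app was never opened)."""
    plan = planner.make_plan(task, screenshot)
    done: list[str] = []
    for _ in range(max_steps):
        if not plan:
            return True                      # all planned steps verified
        step = plan[0]
        action = actor.ground(step, screenshot)
        screenshot = actor.execute(action)   # act, then re-observe the GUI
        verdict = critic.review(step, screenshot)
        if verdict.success:
            done.append(step)
            plan.pop(0)
        else:
            # The environment was not in the assumed state: rebuild the
            # plan from the actual screenshot and the critic's feedback.
            plan = planner.revise_plan(task, screenshot, done, verdict.feedback)
    return False                             # step budget exhausted
```

The key departure from a plain plan-then-execute agent in this sketch is that the critic's verdict can invalidate the remaining plan, so the planner always works from the screen as it actually is rather than as the original plan assumed, which is exactly the failure mode that WorldGUI's varied initial states are designed to expose.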
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (2024)
- GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration (2025)
- GUI Agents: A Survey (2024)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2024)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents (2025)
- A3: Android Agent Arena for Mobile GUI Agents (2025)
- ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks (2025)