Papers
arxiv:2502.08047

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Published on Feb 12
· Submitted by hhenryz on Feb 13

Abstract

Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state (such as the target software not being open, or the interface not being in its default state) often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.

Community

Paper author Paper submitter

In this paper, we take the first step toward comprehensive GUI agent evaluation by introducing a new benchmark, WorldGUI. In addition to the standard static testing process, we incorporate dynamic testing states to ensure that WorldGUI captures the complexity and dynamism of real-world GUI environments. Furthermore, to enhance GUI automation, we propose a novel agent framework, GUI-Thinker, built upon a critical-thinking philosophy and comprising five core components. This framework enables the agent to dynamically identify uncommon states and adjust its plans or actions accordingly. Finally, we evaluate the latest computer-use agent, Claude-3.5 (Computer Use), on our WorldGUI benchmark, demonstrating the effectiveness of GUI-Thinker across a variety of GUI tasks.
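The core idea of a critique mechanism, detecting an uncommon initial state and revising the plan before acting, can be sketched in a few lines. This is a hypothetical illustration only: the class, method, and field names below (`GUIState`, `Agent`, `critique`, `draft_plan`) are ours, not from the paper, and a real agent would query a planner model rather than hard-coded rules.

```python
# Hypothetical sketch of a critique-style planning loop; names are
# illustrative and do not come from the GUI-Thinker paper.
from dataclasses import dataclass, field


@dataclass
class GUIState:
    """Simplified stand-in for an observed desktop GUI state."""
    app_open: bool
    in_default_view: bool


@dataclass
class Agent:
    plan: list = field(default_factory=list)

    def draft_plan(self, task: str) -> list:
        # A real agent would call a planner model here.
        return [f"do: {task}"]

    def critique(self, state: GUIState) -> list:
        """Prepend corrective steps when the initial state is uncommon."""
        fixes = []
        if not state.app_open:
            fixes.append("open target application")
        if not state.in_default_view:
            fixes.append("reset interface to default view")
        return fixes

    def run(self, task: str, state: GUIState) -> list:
        # Critique the initial state first, then execute the task plan.
        self.plan = self.critique(state) + self.draft_plan(task)
        return self.plan


agent = Agent()
print(agent.run("insert a slide", GUIState(app_open=False, in_default_view=True)))
# → ['open target application', 'do: insert a slide']
```

The point of the sketch is the ordering: the critique step runs before the task plan, so a non-default initial state (the motivating failure mode in the benchmark) is repaired rather than causing a planning error.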

Paper author Paper submitter

The project page is available at https://showlab.github.io/WorldGUI/. The code and datasets are coming soon.


