OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Overview

We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.

Quick Start

OS-Genesis-7B-AC is a mobile action model finetuned from Qwen2-VL-7B-Instruct.

OS-Genesis AC Family Models

The table below lists the OS-Genesis AC family models evaluated on the AndroidControl benchmark.

| Model Name | Base Model | Training Data | HF Link |
|---|---|---|---|
| OS-Genesis-4B-AC | InternVL2-4B | OS-Genesis-ac-training-data | 🤗 link |
| OS-Genesis-7B-AC | Qwen2-VL-7B-Instruct | OS-Genesis-ac-training-data | 🤗 link |
| OS-Genesis-8B-AC | InternVL2-8B | OS-Genesis-ac-training-data | 🤗 link |

Inference Example

First, ensure that the necessary dependencies are installed:

pip install transformers
pip install qwen-vl-utils

To evaluate on the AndroidControl benchmark, please refer to the evaluation code.

Inference code example:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Genesis-7B-AC", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

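# The text prompt below is a template: replace {high_level_instruction},
# {action_history}, and {a11y_tree} with real values before running inference.
messages = [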
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "You are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n  Please generate the low-level thought and action for the next step."},
        ],
    }
]


# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
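
The prompt in the example above contains the placeholders {high_level_instruction}, {action_history}, and {a11y_tree}. Because the string is not formatted, they reach the model literally unless they are filled in first. The following is a minimal sketch of one way to fill them, assuming plain string formatting is sufficient. The helper names build_prompt and build_messages and the sample values are hypothetical and are not part of the OS-Genesis release; the expected format of the action history and accessibility tree should be taken from the evaluation code.

```python
# Prompt template copied verbatim from the inference example above.
PROMPT_TEMPLATE = (
    "You are a GUI task expert, I will provide you with a high-level instruction, "
    "an action history, a screenshot with its corresponding accessibility tree.\n"
    " High-level instruction: {high_level_instruction}\n"
    " Action history: {action_history}\n"
    " Accessibility tree: {a11y_tree}\n"
    "  Please generate the low-level thought and action for the next step."
)


def build_prompt(high_level_instruction: str, action_history: str, a11y_tree: str) -> str:
    """Fill the three placeholders of the template (hypothetical helper)."""
    # Only the template is parsed by str.format, so braces that appear
    # inside the supplied values are inserted unchanged.
    return PROMPT_TEMPLATE.format(
        high_level_instruction=high_level_instruction,
        action_history=action_history,
        a11y_tree=a11y_tree,
    )


def build_messages(image_path: str, prompt: str) -> list:
    """Wrap a screenshot path and a filled prompt in the chat format used above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Illustrative values only; real inputs come from the benchmark episode.
prompt = build_prompt(
    high_level_instruction="Open the Settings app",
    action_history="None",
    a11y_tree="[]",
)
messages = build_messages("./screenshot.png", prompt)
```

The resulting messages list can be passed to processor.apply_chat_template and process_vision_info exactly as in the example above.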

Citation

If you find this repository helpful, feel free to cite our paper:

@article{sun2024genesis,
  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
  author={Sun, Qiushi and Cheng, Kanzhi and Ding, Zichen and Jin, Chuanyang and Wang, Yian and Xu, Fangzhi and Wu, Zhenyu and Jia, Chengyou and Chen, Liheng and Liu, Zhoumianze and others},
  journal={arXiv preprint arXiv:2412.19723},
  year={2024}
}