Add link to paper
This PR ensures the model can be found at https://huggingface.co/papers/2501.12326.
README.md
CHANGED
@@ -9,7 +9,6 @@ tags:
|
|
9 |
library_name: transformers
|
10 |
---
|
11 |
|
12 |
-
|
13 |
# UI-TARS-7B-SFT
|
14 |
[UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) |
|
15 |
[UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) |
|
@@ -19,7 +18,9 @@ library_name: transformers
|
|
19 |
## Introduction
|
20 |
|
21 |
UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
|
22 |
-
|
23 |
<p align="center">
|
24 |
<img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS-vs-Previous-SOTA.png?raw=true" width="90%"/>
|
25 |
<p>
|
@@ -27,10 +28,6 @@ UI-TARS is a next-generation native GUI agent model designed to interact seamles
|
|
27 |
<img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS.png?raw=true" width="90%"/>
|
28 |
<p>
|
29 |
|
30 |
-
<!-- ![Local Image](figures/UI-TARS-vs-Previous-SOTA.png) -->
|
31 |
-
|
32 |
-
This repository contains the model for the paper [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://huggingface.co/papers/2501.12326).
|
33 |
-
|
34 |
Code: https://github.com/bytedance/UI-TARS
|
35 |
|
36 |
## Performance
|
@@ -69,35 +66,6 @@ Code: https://github.com/bytedance/UI-TARS
|
|
69 |
| **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
|
70 |
|
71 |
|
72 |
-
- **ScreenSpot**
|
73 |
-
|
74 |
-
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
75 |
-
|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
|
76 |
-
| **Agent Framework** | | | | | | | |
|
77 |
-
| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
|
78 |
-
| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
|
79 |
-
| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
|
80 |
-
| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
|
81 |
-
| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
|
82 |
-
| **Agent Model** | | | | | | | |
|
83 |
-
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
|
84 |
-
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
|
85 |
-
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
|
86 |
-
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
|
87 |
-
| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
|
88 |
-
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
|
89 |
-
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
|
90 |
-
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
|
91 |
-
| Claude Computer Use | - | - | - | - | - | - | **83.0** |
|
92 |
-
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
|
93 |
-
| Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
|
94 |
-
| Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
|
95 |
-
| **Our Model** | | | | | | | |
|
96 |
-
| **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
|
97 |
-
| **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
|
98 |
-
| **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
|
99 |
-
|
100 |
-
|
101 |
- **ScreenSpot v2**
|
102 |
|
103 |
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
@@ -116,49 +84,6 @@ Code: https://github.com/bytedance/UI-TARS
|
|
116 |
| **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
|
117 |
|
118 |
|
119 |
-
**Offline Agent Capability Evaluation**
|
120 |
-
- **Multimodal Mind2Web**
|
121 |
-
|
122 |
-
| Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
|
123 |
-
|--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
|
124 |
-
| **Agent Framework** | | | | | | | | | |
|
125 |
-
| GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
|
126 |
-
| GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
|
127 |
-
| GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
|
128 |
-
| GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
|
129 |
-
| **Agent Model** | | | | | | | | | |
|
130 |
-
| GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
|
131 |
-
| GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
|
132 |
-
| GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
|
133 |
-
| GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
|
134 |
-
| Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
|
135 |
-
| Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
|
136 |
-
| CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
|
137 |
-
| Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
|
138 |
-
| **Our Model** | | | | | | | | | |
|
139 |
-
| **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
|
140 |
-
| **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
|
141 |
-
| **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
|
142 |
-
|
143 |
-
|
144 |
-
- **Android Control and GUI Odyssey**
|
145 |
-
|
146 |
-
| Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
|
147 |
-
|---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
|
148 |
-
| Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
|
149 |
-
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
|
150 |
-
| SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
|
151 |
-
| InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
|
152 |
-
| Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
|
153 |
-
| Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
|
154 |
-
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
|
155 |
-
| OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
|
156 |
-
| Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
|
157 |
-
| Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
|
158 |
-
| **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
|
159 |
-
| **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
|
160 |
-
| **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
|
161 |
-
|
162 |
**Online Agent Capability Evaluation**
|
163 |
|
164 |
| Method | OSWorld (Online) | AndroidWorld (Online) |
|
|
|
9 |
library_name: transformers
|
10 |
---
|
11 |
|
|
|
12 |
# UI-TARS-7B-SFT
|
13 |
[UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) |
|
14 |
[UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) |
|
|
|
18 |
## Introduction
|
19 |
|
20 |
UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
|
21 |
+
|
22 |
+
This repository contains the model for the paper [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://huggingface.co/papers/2501.12326).
|
23 |
+
|
24 |
<p align="center">
|
25 |
<img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS-vs-Previous-SOTA.png?raw=true" width="90%"/>
|
26 |
<p>
|
|
|
28 |
<img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS.png?raw=true" width="90%"/>
|
29 |
<p>
|
30 |
|
31 |
Code: https://github.com/bytedance/UI-TARS
|
32 |
|
33 |
## Performance
|
|
|
66 |
| **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
|
67 |
|
68 |
|
69 |
- **ScreenSpot v2**
|
70 |
|
71 |
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
|
|
84 |
| **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
|
85 |
|
86 |
|
87 |
**Online Agent Capability Evaluation**
|
88 |
|
89 |
| Method | OSWorld (Online) | AndroidWorld (Online) |
|