Add link to paper

#3
by nielsr - opened
Files changed (1)
  1. README.md +267 -73
README.md CHANGED
@@ -9,7 +9,6 @@ tags:
9
  library_name: transformers
10
  ---
11
 
12
-
13
  # UI-TARS-72B-DPO
14
  [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)  | 
15
  [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)  | 
@@ -69,35 +68,6 @@ Code: https://github.com/bytedance/UI-TARS
69
  | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
70
 
71
 
72
- - **ScreenSpot**
73
-
74
- | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
75
- |--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
76
- | **Agent Framework** | | | | | | | |
77
- | GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
78
- | GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
79
- | GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
80
- | GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
81
- | GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
82
- | **Agent Model** | | | | | | | |
83
- | GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
84
- | GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
85
- | CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
86
- | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
87
- | Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
88
- | UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
89
- | Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
90
- | OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
91
- | Claude Computer Use | - | - | - | - | - | - | **83.0** |
92
- | Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
93
- | Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
94
- | Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
95
- | **Our Model** | | | | | | | |
96
- | **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
97
- | **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
98
- | **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
99
-
100
-
101
  - **ScreenSpot v2**
102
 
103
  | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
@@ -116,49 +86,6 @@ Code: https://github.com/bytedance/UI-TARS
116
  | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
117
 
118
 
119
- **Offline Agent Capability Evaluation**
120
- - **Multimodal Mind2Web**
121
-
122
- | Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
123
- |--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
124
- | **Agent Framework** | | | | | | | | | |
125
- | GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
126
- | GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
127
- | GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
128
- | GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
129
- | **Agent Model** | | | | | | | | | |
130
- | GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
131
- | GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
132
- | GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
133
- | GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
134
- | Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
135
- | Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
136
- | CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
137
- | Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
138
- | **Our Model** | | | | | | | | | |
139
- | **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
140
- | **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
141
- | **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
142
-
143
-
144
- - **Android Control and GUI Odyssey**
145
-
146
- | Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
147
- |---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
148
- | Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
149
- | GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
150
- | SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
151
- | InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
152
- | Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
153
- | Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
154
- | OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
155
- | OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
156
- | Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
157
- | Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
158
- | **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
159
- | **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
160
- | **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
161
-
162
  **Online Agent Capability Evaluation**
163
 
164
  | Method | OSWorld (Online) | AndroidWorld (Online) |
@@ -182,6 +109,273 @@ Code: https://github.com/bytedance/UI-TARS
182
  | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
183
  | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
112
+ ## Deployment
113
+
114
+ ### Cloud Deployment
115
+ We recommend using HuggingFace Inference Endpoints for fast deployment.
116
+ We provide two deployment guides:
117
+
118
+ English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
119
+
120
+ Chinese version: [GUI模型部署教程](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
121
+
122
+ ### Local Deployment [Transformers]
123
+ We follow the same approach as Qwen2-VL; check this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for more details. A minimal inference sketch is shown below.
124
+
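+ The following is only a minimal sketch of local inference, following the Qwen2-VL usage pattern from the tutorial above. It assumes the `qwen_vl_utils` helper package referenced in that tutorial, and the screenshot path and instruction are placeholders; prepend one of the prompt templates from the sections below for agent-style use.
+ 
+ ```python
+ # Sketch only: follows the Qwen2-VL usage pattern; adjust model size, dtype, and devices to your hardware.
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info  # helper from the Qwen2-VL tutorial
+ 
+ model_id = "bytedance-research/UI-TARS-72B-DPO"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ processor = AutoProcessor.from_pretrained(model_id)
+ 
+ # A single screenshot plus an instruction (both values are placeholders for illustration).
+ messages = [{"role": "user", "content": [
+     {"type": "image", "image": "screenshot.png"},
+     {"type": "text", "text": "search for today's weather"},
+ ]}]
+ 
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
+                    padding=True, return_tensors="pt").to(model.device)
+ 
+ output_ids = model.generate(**inputs, max_new_tokens=128)
+ trimmed = output_ids[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```
+ 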
125
+ ### Local Deployment [vLLM]
126
+ We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
127
+ ```bash
128
+ pip install -U transformers
129
+ VLLM_VERSION=0.6.6
130
+ CUDA_VERSION=cu124
131
+ pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
132
+
133
+ ```
134
+ #### Download the Model
135
+ We provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. To achieve the best performance, we recommend using the **7B-DPO** or **72B-DPO** model, depending on your GPU configuration; a download sketch follows the list below:
136
+
137
+ - [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)
138
+ - [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
139
+ - [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
140
+ - [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
141
+ - [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
142
+
143
+
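+ If you prefer to fetch the weights ahead of time, here is a small sketch using `huggingface_hub`; the repository and local directory below are just examples, so pick the checkpoint that fits your GPUs.
+ 
+ ```python
+ # Sketch: pre-download a checkpoint with huggingface_hub and reuse the local path with vLLM.
+ from huggingface_hub import snapshot_download
+ 
+ local_path = snapshot_download(
+     repo_id="bytedance-research/UI-TARS-7B-DPO",
+     local_dir="./UI-TARS-7B-DPO",  # example target directory
+ )
+ print(local_path)  # pass this path to vLLM via --model
+ ```
+ 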
144
+ #### Start an OpenAI API Service
145
+ Run the command below to start an OpenAI-compatible API service:
146
+
147
+ ```bash
148
+ python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
149
+ ```
150
+
151
+ Then you can use the chat API as shown below, with the GUI prompt (choose mobile or computer) and a base64-encoded local screenshot (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details). You can also use the model in [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop):
152
+ ```python
153
+ import base64
154
+ from openai import OpenAI
155
+
156
+
157
+ instruction = "search for today's weather"
158
+ screenshot_path = "screenshot.png"
159
+ client = OpenAI(
160
+ base_url="http://127.0.0.1:8000/v1",
161
+ api_key="empty",
162
+ )
163
+
164
+ ## Below is the prompt for mobile
165
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
166
+
167
+ ## Output Format
168
+ ```\nThought: ...
169
+ Action: ...\n```
170
+
171
+ ## Action Space
172
+
173
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
174
+ left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
175
+ right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
176
+ drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
177
+ hotkey(key='')
178
+ type(content='') #If you want to submit your input, use \"\
179
+ \" at the end of `content`.
180
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
181
+ wait() #Sleep for 5s and take a screenshot to check for any changes.
182
+ finished()
183
+ call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
184
+
185
+
186
+ ## Note
187
+ - Use Chinese in `Thought` part.
188
+ - Summarize your next action (with its target element) in one sentence in `Thought` part.
189
+
190
+ ## User Instruction
191
+ """
192
+
193
+ with open(screenshot_path, "rb") as image_file:
194
+ encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
195
+ response = client.chat.completions.create(
196
+ model="ui-tars",
197
+ messages=[
198
+ {
199
+ "role": "user",
200
+ "content": [
201
+ {"type": "text", "text": prompt + instruction},
202
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
203
+ ],
204
+ },
205
+ ],
206
+ frequency_penalty=1,
207
+ max_tokens=128,
208
+ )
209
+ print(response.choices[0].message.content)
210
+ ```
211
+
212
+ For single-step grounding tasks, or for inference on grounding datasets such as SeeClick, refer to the following script:
213
+ ```python
214
+ import base64
215
+ from openai import OpenAI
216
+
217
+
218
+ instruction = "search for today's weather"
219
+ screenshot_path = "screenshot.png"
220
+ client = OpenAI(
221
+ base_url="http://127.0.0.1:8000/v1",
222
+ api_key="empty",
223
+ )
224
+
225
+ ## Below is the prompt for grounding
226
+ prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """
227
+
228
+ with open(screenshot_path, "rb") as image_file:
229
+ encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
230
+ response = client.chat.completions.create(
231
+ model="ui-tars",
232
+ messages=[
233
+ {
234
+ "role": "user",
235
+ "content": [
236
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
237
+ {"type": "text", "text": prompt + instruction}
238
+ ],
239
+ },
240
+ ],
241
+ frequency_penalty=1,
242
+ max_tokens=128,
243
+ )
244
+ print(response.choices[0].message.content)
245
+ ```
246
+
247
+ ### Prompt Templates
248
+ We currently provide two prompt templates for stable running and performance: one for mobile scenarios and one for desktop (computer) scenarios. A sketch for parsing the model's `Thought`/`Action` output follows the two templates.
249
+ - Prompt template for mobile:
250
+ ```python
251
+ ## Below is the prompt for mobile
252
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
253
+
254
+ ## Output Format
255
+ ```\nThought: ...
256
+ Action: ...\n```
257
+
258
+ ## Action Space
259
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
260
+ long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
261
+ type(content='')
262
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
263
+ press_home()
264
+ press_back()
265
+ finished(content='') # Submit the task regardless of whether it succeeds or fails.
266
+
267
+ ## Note
268
+ - Use English in `Thought` part.
269
+
270
+ - Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
271
+
272
+ ## User Instruction
273
+ """
274
+ ```
275
+
276
+ - Prompt template for computer:
277
+ ```python
278
+ ## Below is the prompt for computer
279
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
280
+
281
+ ## Output Format
282
+ ```\nThought: ...
283
+ Action: ...\n```
284
+
285
+ ## Action Space
286
+
287
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
288
+ left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
289
+ right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
290
+ drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
291
+ hotkey(key='')
292
+ type(content='') #If you want to submit your input, use \"\
293
+ \" at the end of `content`.
294
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
295
+ wait() #Sleep for 5s and take a screenshot to check for any changes.
296
+ finished()
297
+ call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
298
+
299
+
300
+ ## Note
301
+ - Use Chinese in `Thought` part.
302
+ - Summarize your next action (with its target element) in one sentence in `Thought` part.
303
+
304
+ ## User Instruction
305
+ """
306
+ ```
307
+
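+ The templates above make the model reply in a fixed `Thought: ... / Action: ...` format, with coordinates wrapped in `<|box_start|>(x,y)<|box_end|>`. Below is a small, illustrative parser; the regular expressions are our own sketch, not an official utility.
+ 
+ ```python
+ # Sketch: parse a "Thought/Action" reply and pull out box coordinates; patterns are illustrative only.
+ import re
+ 
+ def parse_response(text: str):
+     thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S)
+     action = re.search(r"Action:\s*(.*)", text, re.S)
+     # Coordinates appear as <|box_start|>(x,y)<|box_end|>; collect every (x, y) pair.
+     boxes = [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", text)]
+     return {
+         "thought": thought.group(1).strip() if thought else None,
+         "action": action.group(1).strip() if action else None,
+         "boxes": boxes,
+     }
+ 
+ example = "Thought: Click the search box.\nAction: click(start_box='<|box_start|>(235,512)<|box_end|>')"
+ print(parse_response(example))
+ # {'thought': 'Click the search box.', 'action': "click(start_box='<|box_start|>(235,512)<|box_end|>')", 'boxes': [(235, 512)]}
+ ```
+ 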
308
+ ### Local Deployment [Ollama]
309
+ <!-- Ollama can deploy the model via gguf format. Bugs exist for safetensors. --> Ollama support is coming soon. Please be patient and wait~ 😊
310
+ <!-- #### Get the model in GGUF format
311
+ We provide 2B and 7B model in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:
312
+
313
+ 2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf
314
+
315
+ 7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf
316
+
317
+ Users can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
318
+
319
+ ```bash
320
+ python3 convert_hf_to_gguf.py <path to your model>
321
+ ```
322
+
323
+ The GGUF file will be generated under the path provided.
324
+
325
+ #### Deploy GGUF model
326
+ We deploy the model by following Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).
327
+
328
+ ```bash
329
+ # Create Modelfile, Windows users can just create a file named Modelfile
330
+ echo "FROM ./path/to/model.gguf" > Modelfile
331
+
332
+ # Create model in Ollama
333
+ ollama create ui-tars -f Modelfile
334
+
335
+ # Run the model
336
+ ollama run ui-tars
337
+
338
+ ```
339
+
340
+ Test script is same as vLLM except two changes:
341
+
342
+ ```python
343
+ ...
344
+ client = OpenAI(
345
+ base_url="http://127.0.0.1:11434/v1/",
346
+ ...
347
+ )
348
+ ...
349
+ response = client.chat.completions.create(
350
+ model="ui-tars" # the name we create via Ollama cli
351
+ ...
352
+ )
353
+
354
+ ``` -->
355
+
356
+ ### Explanation of Inference Results
357
+
358
+ #### Coordinate Mapping
359
+ The model generates a 2D coordinate output that represents relative positions. To convert these values to coordinates relative to the image, divide each component by 1000 to obtain values in the range [0, 1]. The absolute coordinates required by the action can then be calculated by:
360
+ - X absolute = X relative × image width
361
+ - Y absolute = Y relative × image height
362
+
363
+ For example, given a screen size of 1920 × 1080 and a model output of (235, 512): the absolute X is `round(1920*235/1000)=451`, the absolute Y is `round(1080*512/1000)=553`, so the absolute coordinate is (451, 553).
364
+
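+ The same conversion as a tiny helper (a sketch; the screen size and model output below are the example values above):
+ 
+ ```python
+ # Sketch: map the model's 0-1000 relative coordinates to absolute pixels.
+ def to_absolute(x_rel: int, y_rel: int, width: int, height: int) -> tuple[int, int]:
+     return round(width * x_rel / 1000), round(height * y_rel / 1000)
+ 
+ print(to_absolute(235, 512, 1920, 1080))  # -> (451, 553)
+ ```
+ 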
365
+ ## Use in desktop and web automation
366
+
367
+ To experience the UI-TARS agent on desktop, refer to [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). We recommend using the **7B/72B DPO models** on desktop.
368
+
369
+ [Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details on setting up the model.
370
+
371
+ ## License
372
+
373
+ UI-TARS is licensed under the Apache License 2.0.
374
+
375
+ ## Acknowledgements
376
+ This project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.
377
+
378
+ Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.
379
 
380
  ## Citation
381
  If you find our paper and model useful in your research, feel free to give us a cite.