nielsr HF staff committed on
Commit
4c3d41a
·
verified ·
1 Parent(s): dc29cf2

Add link to paper

This PR ensures the model card is linked to the corresponding paper on the Hugging Face Hub.

Files changed (1)
  1. README.md +267 -73
README.md CHANGED
@@ -9,7 +9,6 @@ tags:
9
  library_name: transformers
10
  ---
11
 
12
-
13
  # UI-TARS-72B-DPO
14
  [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)  | 
15
  [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)  | 
@@ -69,35 +68,6 @@ Code: https://github.com/bytedance/UI-TARS
69
  | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
70
 
71
 
72
- - **ScreenSpot**
73
-
74
- | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
75
- |--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
76
- | **Agent Framework** | | | | | | | |
77
- | GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
78
- | GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
79
- | GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
80
- | GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
81
- | GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
82
- | **Agent Model** | | | | | | | |
83
- | GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
84
- | GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
85
- | CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
86
- | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
87
- | Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
88
- | UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
89
- | Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
90
- | OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
91
- | Claude Computer Use | - | - | - | - | - | - | **83.0** |
92
- | Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
93
- | Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
94
- | Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
95
- | **Our Model** | | | | | | | |
96
- | **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
97
- | **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
98
- | **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
99
-
100
-
101
  - **ScreenSpot v2**
102
 
103
  | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
@@ -116,49 +86,6 @@ Code: https://github.com/bytedance/UI-TARS
116
  | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
117
 
118
 
119
- **Offline Agent Capability Evaluation**
120
- - **Multimodal Mind2Web**
121
-
122
- | Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
123
- |--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
124
- | **Agent Framework** | | | | | | | | | |
125
- | GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
126
- | GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
127
- | GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
128
- | GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
129
- | **Agent Model** | | | | | | | | | |
130
- | GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
131
- | GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
132
- | GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
133
- | GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
134
- | Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
135
- | Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
136
- | CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
137
- | Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
138
- | **Our Model** | | | | | | | | | |
139
- | **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
140
- | **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
141
- | **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
142
-
143
-
144
- - **Android Control and GUI Odyssey**
145
-
146
- | Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
147
- |---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
148
- | Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
149
- | GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
150
- | SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
151
- | InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
152
- | Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
153
- | Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
154
- | OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
155
- | OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
156
- | Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
157
- | Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
158
- | **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
159
- | **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
160
- | **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
161
-
162
  **Online Agent Capability Evaluation**
163
 
164
  | Method | OSWorld (Online) | AndroidWorld (Online) |
@@ -182,6 +109,273 @@ Code: https://github.com/bytedance/UI-TARS
182
  | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
183
  | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
184
 
185
 
186
  ## Citation
187
  If you find our paper and model useful in your research, please cite us.
 
9
  library_name: transformers
10
  ---
11
 
 
12
  # UI-TARS-72B-DPO
13
  [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)  | 
14
  [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)  | 
 
68
  | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
69
 
70
 
71
  - **ScreenSpot v2**
72
 
73
  | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
 
86
  | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
87
 
88
 
89
  **Online Agent Capability Evaluation**
90
 
91
  | Method | OSWorld (Online) | AndroidWorld (Online) |
 
109
  | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
110
  | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
111
 
112
+ ## Deployment
113
+
114
+ ### Cloud Deployment
115
+ We recommend using HuggingFace Inference Endpoints for fast deployment.
116
+ We provide two guides for reference:
117
+
118
+ English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
119
+
120
+ Chinese version: [GUI Model Deployment Tutorial](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
121
+
122
+ ### Local Deployment [Transformers]
123
+ Local inference with Transformers works the same way as for Qwen2-VL; see this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for more details.
124
+
125
+ ### Local Deployment [vLLM]
126
+ We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
127
+ ```bash
128
+ pip install -U transformers
129
+ VLLM_VERSION=0.6.6
130
+ CUDA_VERSION=cu124
131
+ pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
132
+
133
+ ```
134
+ #### Download the Model
135
+ We provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. For the best performance, we recommend the **7B-DPO** or **72B-DPO** model, depending on your GPU configuration:
136
+
137
+ - [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)
138
+ - [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
139
+ - [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
140
+ - [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
141
+ - [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
142
+
143
+
144
+ #### Start an OpenAI API Service
145
+ Run the command below to start an OpenAI-compatible API service:
146
+
147
+ ```bash
148
+ python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
149
+ ```
150
+
151
+ Then call the chat API as shown below, using a GUI prompt (mobile or computer) and a base64-encoded local image (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details). You can also use the service from [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop):
152
+ ```python
153
+ import base64
154
+ from openai import OpenAI
155
+
156
+
157
+ instruction = "search for today's weather"
158
+ screenshot_path = "screenshot.png"
159
+ client = OpenAI(
160
+ base_url="http://127.0.0.1:8000/v1",
161
+ api_key="empty",
162
+ )
163
+
164
+ ## Below is the prompt for computer
165
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
166
+
167
+ ## Output Format
168
+ ```\nThought: ...
169
+ Action: ...\n```
170
+
171
+ ## Action Space
172
+
173
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
174
+ left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
175
+ right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
176
+ drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
177
+ hotkey(key='')
178
+ type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
180
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
181
+ wait() #Sleep for 5s and take a screenshot to check for any changes.
182
+ finished()
183
+ call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
184
+
185
+
186
+ ## Note
187
+ - Use Chinese in `Thought` part.
188
+ - Summarize your next action (with its target element) in one sentence in `Thought` part.
189
+
190
+ ## User Instruction
191
+ """
192
+
193
+ with open(screenshot_path, "rb") as image_file:
194
+     encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
195
+ response = client.chat.completions.create(
196
+ model="ui-tars",
197
+ messages=[
198
+ {
199
+ "role": "user",
200
+ "content": [
201
+ {"type": "text", "text": prompt + instruction},
202
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
203
+ ],
204
+ },
205
+ ],
206
+ frequency_penalty=1,
207
+ max_tokens=128,
208
+ )
209
+ print(response.choices[0].message.content)
210
+ ```
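The reply printed above follows the `Thought: ... Action: ...` layout requested in the prompt's Output Format section. A minimal sketch of splitting such a reply into its two parts is shown here; the helper name `parse_response` and the sample reply are hypothetical and only illustrate the layout, they are not part of UI-TARS.

```python
import re


def parse_response(text: str) -> dict:
    """Split a 'Thought: ... Action: ...' reply into thought and action.

    Hypothetical helper: it assumes the reply follows the Output Format
    described in the prompt. If no match is found, the whole reply is
    returned as the action and the thought is None.
    """
    match = re.search(
        r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)",
        text,
        re.DOTALL,
    )
    if match is None:
        return {"thought": None, "action": text.strip()}
    return {
        "thought": match.group("thought"),
        "action": match.group("action").strip(),
    }


# Made-up reply, used only to exercise the parser.
sample = (
    "Thought: Click the search box.\n"
    "Action: click(start_box='<|box_start|>(235,512)<|box_end|>')"
)
parsed = parse_response(sample)
print(parsed["thought"])
print(parsed["action"])
```

In practice the `text` argument would be `response.choices[0].message.content` from the script above.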
211
+
212
+ For a single-step grounding task, or for inference on a grounding dataset such as SeeClick, refer to the following script:
213
+ ```python
214
+ import base64
215
+ from openai import OpenAI
216
+
217
+
218
+ instruction = "search for today's weather"
219
+ screenshot_path = "screenshot.png"
220
+ client = OpenAI(
221
+ base_url="http://127.0.0.1:8000/v1",
222
+ api_key="empty",
223
+ )
224
+
225
+ ## Below is the prompt for grounding
226
+ prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """
227
+
228
+ with open(screenshot_path, "rb") as image_file:
229
+     encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
230
+ response = client.chat.completions.create(
231
+ model="ui-tars",
232
+ messages=[
233
+ {
234
+ "role": "user",
235
+ "content": [
236
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
237
+ {"type": "text", "text": prompt + instruction}
238
+ ],
239
+ },
240
+ ],
241
+ frequency_penalty=1,
242
+ max_tokens=128,
243
+ )
244
+ print(response.choices[0].message.content)
245
+ ```
246
+
247
+ ### Prompt Templates
248
+ We currently provide two prompt templates for stable operation and good performance: one for mobile and one for personal computers.
249
+ - Prompt template for mobile:
250
+ ```python
251
+ ## Below is the prompt for mobile
252
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
253
+
254
+ ## Output Format
255
+ ```\nThought: ...
256
+ Action: ...\n```
257
+
258
+ ## Action Space
259
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
260
+ long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
261
+ type(content='')
262
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
263
+ press_home()
264
+ press_back()
265
+ finished(content='') # Submit the task regardless of whether it succeeds or fails.
266
+
267
+ ## Note
268
+ - Use English in `Thought` part.
269
+
270
+ - Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
271
+
272
+ ## User Instruction
273
+ """
274
+ ```
275
+
276
+ - Prompt template for computer:
277
+ ```python
278
+ ## Below is the prompt for computer
279
+ prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
280
+
281
+ ## Output Format
282
+ ```\nThought: ...
283
+ Action: ...\n```
284
+
285
+ ## Action Space
286
+
287
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
288
+ left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
289
+ right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
290
+ drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
291
+ hotkey(key='')
292
+ type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
294
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
295
+ wait() #Sleep for 5s and take a screenshot to check for any changes.
296
+ finished()
297
+ call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
298
+
299
+
300
+ ## Note
301
+ - Use Chinese in `Thought` part.
302
+ - Summarize your next action (with its target element) in one sentence in `Thought` part.
303
+
304
+ ## User Instruction
305
+ """
306
+ ```
307
+
308
+ ### Local Deployment [Ollama]
309
+ <!-- Ollama can deploy the model via gguf format. Bugs exist for safetensors. -->Ollama support is coming soon. Thank you for your patience 😊
310
+ <!-- #### Get the model in GGUF format
311
+ We provide 2B and 7B model in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:
312
+
313
+ 2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf
314
+
315
+ 7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf
316
+
317
+ Users can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
318
+
319
+ ```bash
320
+ python3 convert_hf_to_gguf.py <path to your model>
321
+ ```
322
+
323
+ The GGUF file will be generated under the path provided.
324
+
325
+ #### Deploy GGUF model
326
+ We deploy the model by following Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).
327
+
328
+ ```bash
329
+ # Create Modelfile, Windows users can just create a file named Modelfile
330
+ echo "FROM ./path/to/model.gguf" > Modelfile
331
+
332
+ # Create model in Ollama
333
+ ollama create ui-tars -f Modelfile
334
+
335
+ # Run the model
336
+ ollama run ui-tars
337
+
338
+ ```
339
+
340
+ The test script is the same as for vLLM, except for two changes:
341
+
342
+ ```python
343
+ ...
344
+ client = OpenAI(
345
+ base_url="http://127.0.0.1:11434/v1/",
346
+ ...
347
+ )
348
+ ...
349
+ response = client.chat.completions.create(
350
+ model="ui-tars" # the name we create via Ollama cli
351
+ ...
352
+ )
353
+
354
+ ``` -->
355
+
356
+ ### Explanation of Inference Results
357
+
358
+ #### Coordinate Mapping
359
+ The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can be calculated by:
360
+ - X absolute = X relative × image width
361
+ - Y absolute = Y relative × image height
362
+
363
+ For example, given a screen size of 1920 × 1080 and a model output of (235, 512), the absolute X is `round(1920*235/1000)=451` and the absolute Y is `round(1080*512/1000)=553`, so the absolute coordinate is (451, 553).
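The mapping described above can be sketched in a few lines of Python. The function names `extract_point` and `to_absolute` are hypothetical, and the regular expression assumes the point appears as `(x,y)` in the model output, as in the `start_box` arguments of the action space.

```python
import re


def extract_point(action: str) -> tuple[int, int]:
    """Return the first '(x,y)' pair found in an action string.

    Hypothetical helper: assumes integer coordinates written as '(x,y)'.
    """
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", action)
    if match is None:
        raise ValueError(f"no coordinate found in: {action!r}")
    return int(match.group(1)), int(match.group(2))


def to_absolute(point: tuple[int, int], width: int, height: int) -> tuple[int, int]:
    """Scale a model coordinate (0..1000 range) to pixel coordinates."""
    x_rel, y_rel = point
    return round(width * x_rel / 1000), round(height * y_rel / 1000)


point = extract_point("click(start_box='<|box_start|>(235,512)<|box_end|>')")
print(to_absolute(point, 1920, 1080))  # (451, 553), matching the example above
```

Note that Python's `round` rounds exact halves to the nearest even integer, which can differ by one pixel from round-half-up in rare cases.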
364
+
365
+ ## Use in desktop and web automation
366
+
367
+ To try the UI-TARS agent on the desktop, see [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). We recommend using the **7B/72B DPO model** on desktop.
368
+
369
+ [Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details about setting up the model.
370
+
371
+ ## License
372
+
373
+ UI-TARS is licensed under the Apache License 2.0.
374
+
375
+ ## Acknowledgements
376
+ This project builds upon and extends the capabilities of Qwen-2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen-2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.
377
+
378
+ Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.
379
 
380
  ## Citation
381
  If you find our paper and model useful in your research, please cite us.