Add link to paper #3
opened by nielsr (HF staff)

README.md CHANGED
@@ -9,7 +9,6 @@ tags:
 library_name: transformers
 ---
 
-
 # UI-TARS-72B-DPO
 [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) |
 [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) |
@@ -69,35 +68,6 @@ Code: https://github.com/bytedance/UI-TARS
 | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
 
 
-- **ScreenSpot**
-
-| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
-|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
-| **Agent Framework** | | | | | | | |
-| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
-| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
-| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
-| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
-| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
-| **Agent Model** | | | | | | | |
-| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
-| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
-| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
-| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
-| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
-| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
-| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
-| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
-| Claude Computer Use | - | - | - | - | - | - | **83.0** |
-| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
-| Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
-| Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
-| **Our Model** | | | | | | | |
-| **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
-| **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
-| **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
-
-
 - **ScreenSpot v2**
 
 | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
@@ -116,49 +86,6 @@ Code: https://github.com/bytedance/UI-TARS
 | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
 
 
-**Offline Agent Capability Evaluation**
-- **Multimodal Mind2Web**
-
-| Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
-|--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
-| **Agent Framework** | | | | | | | | | |
-| GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
-| GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
-| GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
-| GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
-| **Agent Model** | | | | | | | | | |
-| GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
-| GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
-| GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
-| GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
-| Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
-| Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
-| CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
-| Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
-| **Our Model** | | | | | | | | | |
-| **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
-| **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
-| **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
-
-
-- **Android Control and GUI Odyssey**
-
-| Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
-|---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
-| Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
-| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
-| SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
-| InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
-| Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
-| Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
-| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
-| OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
-| Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
-| Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
-| **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
-| **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
-| **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
-
 **Online Agent Capability Evaluation**
 
 | Method | OSWorld (Online) | AndroidWorld (Online) |
@@ -182,6 +109,273 @@ Code: https://github.com/bytedance/UI-TARS
 | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
 | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
 
+## Deployment
+
+### Cloud Deployment
+We recommend using HuggingFace Inference Endpoints for fast deployment.
+We provide two documents for users to refer to:
+
+English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
+
+Chinese version: [GUI模型部署教程](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
+
+### Local Deployment [Transformers]
+We follow the same approach as Qwen2-VL; check this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for more details.
+
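The linked tutorial covers the full workflow. Purely as a minimal sketch, not taken from the upstream README, and assuming the UI-TARS checkpoints load through the standard Qwen2-VL classes in `transformers` (the model ID, image path, instruction, and generation settings below are placeholders), a local call might look like this:

```python
# Hedged sketch only: assumes the UI-TARS checkpoints work with the Qwen2-VL
# classes in transformers, as the Qwen2-VL tutorial linked above describes.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image

model_id = "bytedance-research/UI-TARS-7B-DPO"  # any checkpoint listed under "Download the Model"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    # For agent tasks, prepend one of the prompt templates from "Prompt Templates" below.
    {"type": "text", "text": "search for today's weather"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

For serving, the vLLM route described next is the recommended path.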
+### Local Deployment [vLLM]
+We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
+```bash
+pip install -U transformers
+VLLM_VERSION=0.6.6
+CUDA_VERSION=cu124
+pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
+
+```
+#### Download the Model
+We provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. To achieve the best performance, we recommend using the **7B-DPO** or **72B-DPO** model (depending on your GPU configuration):
+
+- [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)
+- [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
+- [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
+- [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
+- [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
+
+
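If you prefer to fetch a checkpoint ahead of time and point vLLM at a local path, the snippet below is a minimal sketch using `huggingface_hub` (the 7B-DPO repository is chosen arbitrarily from the list above):

```python
# Hedged sketch: pre-download one of the checkpoints listed above, then pass the
# returned local directory to vLLM via --model.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bytedance-research/UI-TARS-7B-DPO")
print(local_dir)
```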
+#### Start an OpenAI API Service
+Run the command below to start an OpenAI-compatible API service:
+
+```bash
+python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
+```
+
+Then you can use the chat API as below, with the GUI prompt (choose from mobile or computer) and a base64-encoded local image (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details). You can also use it in [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop):
+```python
+import base64
+from openai import OpenAI
+
+
+instruction = "search for today's weather"
+screenshot_path = "screenshot.png"
+client = OpenAI(
+    base_url="http://127.0.0.1:8000/v1",
+    api_key="empty",
+)
+
+## Below is the prompt for mobile
+prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
+
+## Output Format
+```\nThought: ...
+Action: ...\n```
+
+## Action Space
+
+click(start_box='<|box_start|>(x1,y1)<|box_end|>')
+left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
+right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
+drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
+hotkey(key='')
+type(content='') #If you want to submit your input, use \"\
+\" at the end of `content`.
+scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
+wait() #Sleep for 5s and take a screenshot to check for any changes.
+finished()
+call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
+
+
+## Note
+- Use Chinese in `Thought` part.
+- Summarize your next action (with its target element) in one sentence in `Thought` part.
+
+## User Instruction
+"""
+
+with open(screenshot_path, "rb") as image_file:
+    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
+response = client.chat.completions.create(
+    model="ui-tars",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": prompt + instruction},
+                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
+            ],
+        },
+    ],
+    frequency_penalty=1,
+    max_tokens=128,
+)
+print(response.choices[0].message.content)
+```
+
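The completion comes back as plain text in the `Thought: ... Action: ...` format defined by the prompt above. As a rough sketch (not an official parser; the regular expressions are assumptions about that output shape), the two parts can be separated like this:

```python
import re

def split_thought_action(text: str) -> tuple[str, str]:
    """Split a 'Thought: ... Action: ...' completion into its two parts."""
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S)
    action = re.search(r"Action:\s*(.*)", text, re.S)
    return (
        thought.group(1).strip() if thought else "",
        action.group(1).strip() if action else text.strip(),
    )

# Example with the response from the script above:
# thought, action = split_thought_action(response.choices[0].message.content)
# action might look like: click(start_box='<|box_start|>(235,512)<|box_end|>')
```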
+For a single-step grounding task, or for inference on a grounding dataset such as SeeClick, kindly refer to the following script:
+```python
+import base64
+from openai import OpenAI
+
+
+instruction = "search for today's weather"
+screenshot_path = "screenshot.png"
+client = OpenAI(
+    base_url="http://127.0.0.1:8000/v1",
+    api_key="empty",
+)
+
+## Below is the prompt for mobile
+prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """
+
+with open(screenshot_path, "rb") as image_file:
+    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
+response = client.chat.completions.create(
+    model="ui-tars",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
+                {"type": "text", "text": prompt + instruction}
+            ],
+        },
+    ],
+    frequency_penalty=1,
+    max_tokens=128,
+)
+print(response.choices[0].message.content)
+```
+
+### Prompt Templates
+We currently provide two prompt templates for stable running and performance: one for the mobile scenario and one for the personal computer scenario.
+- Prompt template for mobile:
+```python
+## Below is the prompt for mobile
+prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
+
+## Output Format
+```\nThought: ...
+Action: ...\n```
+
+## Action Space
+click(start_box='<|box_start|>(x1,y1)<|box_end|>')
+long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
+type(content='')
+scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
+press_home()
+press_back()
+finished(content='') # Submit the task regardless of whether it succeeds or fails.
+
+## Note
+- Use English in `Thought` part.
+
+- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
+
+## User Instruction
+"""
+```
+
+- Prompt template for computer:
+```python
+## Below is the prompt for computer
+prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
+
+## Output Format
+```\nThought: ...
+Action: ...\n```
+
+## Action Space
+
+click(start_box='<|box_start|>(x1,y1)<|box_end|>')
+left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
+right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
+drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
+hotkey(key='')
+type(content='') #If you want to submit your input, use \"\
+\" at the end of `content`.
+scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
+wait() #Sleep for 5s and take a screenshot to check for any changes.
+finished()
+call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
+
+
+## Note
+- Use Chinese in `Thought` part.
+- Summarize your next action (with its target element) in one sentence in `Thought` part.
+
+## User Instruction
+"""
+```
+
+### Local Deployment [Ollama]
+<!-- Ollama can deploy the model via the GGUF format. Bugs exist for safetensors. --> Ollama support is coming soon. Please be patient and wait~ 😊
+<!-- #### Get the model in GGUF format
+We provide 2B and 7B models in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:
+
+2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf
+
+7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf
+
+Users can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
+
+```bash
+python3 convert_hf_to_gguf.py <path to your model>
+```
+
+The GGUF file will be generated under the path provided.
+
+#### Deploy the GGUF model
+We deploy the model by following the Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).
+
+```bash
+# Create Modelfile, Windows users can just create a file named Modelfile
+echo "FROM ./path/to/model.gguf" > Modelfile
+
+# Create model in Ollama
+ollama create ui-tars -f Modelfile
+
+# Run the model
+ollama run ui-tars
+
+```
+
+The test script is the same as the vLLM one except for two changes:
+
+```python
+...
+client = OpenAI(
+    base_url="http://127.0.0.1:11434/v1/",
+    ...
+)
+...
+response = client.chat.completions.create(
+    model="ui-tars"  # the name we created via the Ollama CLI
+    ...
+)
+
+``` -->
+
+### Explanation of Inference Results
+
+#### Coordinate Mapping
+The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0, 1]. The absolute coordinates required by the action can then be calculated as:
+- X absolute = X relative × image width
+- Y absolute = Y relative × image height
+
+For example, given a screen size of 1920 × 1080 and a model output of (235, 512): the absolute X is `round(1920*235/1000)=451`, the absolute Y is `round(1080*512/1000)=553`, so the absolute coordinate is (451, 553).
+
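As a small illustration (not part of the upstream README), the same arithmetic wrapped in a helper; the example call reproduces the numbers worked out above:

```python
def to_absolute(x_rel: int, y_rel: int, width: int, height: int) -> tuple[int, int]:
    """Map a model coordinate in [0, 1000] to absolute pixel coordinates."""
    return round(width * x_rel / 1000), round(height * y_rel / 1000)

print(to_absolute(235, 512, 1920, 1080))  # -> (451, 553), as in the example above
```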
+## Use in desktop and web automation
+
+To experience the UI-TARS agent on desktop, you may refer to [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). We recommend using the **7B/72B DPO models** on desktop.
+
+[Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details about setting up the model.
+
+## License
+
+UI-TARS is licensed under the Apache License 2.0.
+
+## Acknowledgements
+This project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model that serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.
+
+Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.
 
 ## Citation
 If you find our paper and model useful in your research, feel free to give us a cite.