Fix typos and inference script

#3
by alvarobartt - opened
Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -102,7 +102,7 @@ Magma is a multimodal agentic AI model that can generate text based on the input

  ### Highlights
  * **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
- * **Versatile Capabilities:** Magma as a single model not only posseesses generic image and videos understanding ability, but alse generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
+ * **Versatile Capabilities:** Magma as a single model not only possesses generic image and videos understanding ability, but also generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
  * **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
  * **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, making it strong generalization ability and suitable for real-world applications!

@@ -125,15 +125,21 @@ The model is developed by Microsoft and is funded by Microsoft Research. The mod

  <!-- {{ get_started_code | default("[More Information Needed]", true)}} -->

- Use the code below to get started with the model.
+ To get started with the model, you first need to make sure that `transformers` and `torch` are installed, as well as installing the following dependencies:
+
+ ```bash
+ pip install torchvision Pillow open_clip_torch
+ ```
+
+ Then you can run the following code:

  ```python
  import torch
  from PIL import Image
+ from io import BytesIO
  import requests

- from transformers import AutoModelForCausalLM
- from transformers import AutoProcessor
+ from transformers import AutoModelForCausalLM, AutoProcessor

  # Load the model and processor
  model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
@@ -141,11 +147,12 @@ processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_cod
  model.to("cuda")

  # Inference
- url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
- image = Image.open(requests.get(url, stream=True).raw)
+ url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
+ image = Image.open(BytesIO(requests.get(url, stream=True).content))
+ image = image.convert("RGB")

  convs = [
-     {"role": "system", "content": "You are agent that can see, talk and act."},
+     {"role": "system", "content": "You are agent that can see, talk and act."},
      {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
  ]
  prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
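
A note on why the image-loading lines change: with `stream=True`, `requests` exposes `.raw`, the undecoded socket-level stream, which Pillow can fail to read when the server applies content-encoding such as gzip; `.content` buffers the fully decoded bytes, and wrapping them in `BytesIO` gives `Image.open` the seekable file-like object it expects. Note also that Pillow mode names are case-sensitive, so the conversion must be `"RGB"`. A minimal sketch of the two patterns, using the URL from the diff:

```python
from io import BytesIO

import requests
from PIL import Image

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"

# Fragile: `.raw` is the raw HTTP stream; nothing decompresses it if the
# server sent a content-encoded (e.g. gzip) response, so Image.open can fail.
# image = Image.open(requests.get(url, stream=True).raw)

# Robust: `.content` is the fully downloaded, decoded payload; BytesIO makes
# it a seekable file-like object for Pillow (stream=True is unnecessary here).
image = Image.open(BytesIO(requests.get(url).content))
image = image.convert("RGB")  # Pillow modes are case-sensitive: "RGB", not "rgb"
```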
@@ -159,7 +166,6 @@ with torch.inference_mode():

  generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
  response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
-
  print(response)
  ```

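For reference, a consolidated sketch of the quickstart as it reads once this change is applied, assembled from the hunks above. Two caveats: the `AutoProcessor.from_pretrained` line is completed from the truncated hunk-header context (assuming it mirrors the `trust_remote_code=True` on the model line), and the lines between the last two hunks (building `inputs` and calling `model.generate` inside `with torch.inference_mode():`, per the hunk header) are not shown in the diff, so they are left elided rather than reconstructed:

```python
import torch
from PIL import Image
from io import BytesIO
import requests

from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
# Completed from truncated hunk-header context; trust_remote_code=True assumed
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")

# Inference
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(BytesIO(requests.get(url, stream=True).content))
image = image.convert("RGB")

convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)

# ... lines not shown in the diff: the processor call that builds `inputs` and the
# `model.generate` call inside `with torch.inference_mode():` ...

generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```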