Create README.md
README.md
ADDED
@@ -0,0 +1,89 @@
# Llama-3.2-11B-Vision-Instruct

This model is based on Meta's Llama-3.2-11B-Vision-Instruct and is fine-tuned for multimodal generation.

## Model Description

This vision-language model generates text from an image and a text prompt. It is built on the Llama 3.2 architecture and has been instruction-tuned for a variety of tasks, including:

* **Image captioning:** Generating descriptive captions for images.
* **Visual question answering:** Answering questions about the content of images.
* **Image-based dialogue:** Engaging in conversations based on visual input.

## Intended Uses & Limitations

This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information and should not be used for making critical decisions.

**Limitations:**

* The model may not always accurately interpret the content of images.
* It may be biased towards certain types of images or concepts.
* It may generate inappropriate or offensive content.

## How to Use

Here's an example of how to use this model in Python with the `transformers` and `gradio` libraries:

```python
import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Use GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Function to generate model response
def predict(message, image):
    # One user turn: an image placeholder followed by the text prompt
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # The chat template already adds the BOS token, so don't add special tokens again
    inputs = processor(
        image, input_text, add_special_tokens=False, return_tensors="pt"
    ).to(device)
    response = model.generate(**inputs, max_new_tokens=100)
    # generate() returns prompt + completion; decode only the newly generated tokens
    new_tokens = response[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send")
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image")
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    def respond(message, image, history):
        # Append the new (user, assistant) pair to the chat history
        history = history + [(message, "")]
        response = predict(message, image)
        history[-1] = (message, response)
        return history

    # Event listeners must be registered inside the Blocks context
    submit_button.click(
        fn=respond,
        inputs=[text_input, image_input, chatbot],
        outputs=chatbot
    )

demo.launch()
```

This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
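
The `predict` function always sends an image placeholder and ignores earlier turns, so clicking Send without an image, or asking a follow-up question, is not handled. The sketch below shows one way the message list could be built for those cases. `build_messages`, its `has_image` flag, and its `history` parameter are hypothetical names introduced for illustration. It also assumes a single image shared by the whole conversation and a chat template that accepts assistant turns in the same list-of-parts format as user turns.

```python
def build_messages(message, has_image, history=None):
    """Build a chat-template message list (hypothetical helper).

    `history` is a list of (user_text, assistant_text) pairs, matching how
    the Gradio example stores the conversation. The image placeholder is
    attached to the first user turn only (assumption: one image per chat).
    """
    turns = list(history or []) + [(message, None)]
    messages = []
    for index, (user_text, assistant_text) in enumerate(turns):
        content = []
        if has_image and index == 0:
            content.append({"type": "image"})
        content.append({"type": "text", "text": user_text})
        messages.append({"role": "user", "content": content})
        # The newest turn has no assistant reply yet, so it is skipped here
        if assistant_text:
            messages.append({
                "role": "assistant",
                "content": [{"type": "text", "text": assistant_text}],
            })
    return messages


if __name__ == "__main__":
    print(build_messages("What is in this picture?", has_image=True))
```

The returned list could then be passed to `processor.apply_chat_template` in place of the hard-coded `messages` in `predict`, with `has_image` set to `image is not None`.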

## More Information

For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).

## License

This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).