ruslanmv committed · verified
Commit e2fa1a5 · Parent(s): 9d2abe2

Create README.md

README.md ADDED (+89 lines)
# Llama-3.2-11B-Vision-Instruct

This model is based on Meta's Llama-3.2-11B-Vision-Instruct and has been fine-tuned for multimodal generation.

## Model Description

This is a vision-language model that generates text from a given image and text prompt. It is built on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including:

* **Image captioning:** generating descriptive captions for images.
* **Visual question answering:** answering questions about the content of images.
* **Image-based dialogue:** engaging in conversations grounded in visual input.

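All of these tasks use the same chat-style prompt format: a user turn that pairs an image placeholder with a text prompt. As a rough sketch (`build_vqa_messages` is an illustrative helper, not part of the model's API), a visual-question-answering turn looks like:

```python
# Hedged sketch: build_vqa_messages is an illustrative helper, not part of
# the model's API. The {"type": "image"} placeholder is later expanded into
# the model's special image tokens by the processor's chat template.
def build_vqa_messages(question):
    return [{
        "role": "user",
        "content": [
            {"type": "image"},                   # placeholder for the uploaded image
            {"type": "text", "text": question},  # the actual question
        ],
    }]

messages = build_vqa_messages("How many people are in this photo?")
print(messages[0]["content"][1]["text"])  # How many people are in this photo?
```

Captioning and dialogue use the same structure; only the text prompt changes.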
## Intended Uses & Limitations

This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information and should not be relied on for critical decisions.

**Limitations:**

* The model may not always interpret the content of images accurately.
* It may be biased toward certain types of images or concepts.
* It may generate inappropriate or offensive content.

## How to Use

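Loading an 11B-parameter model in bfloat16 requires substantial GPU memory. As a back-of-envelope estimate (an approximation that ignores activations, the KV cache, and vision-tower overhead):

```python
# Rough memory estimate for the bfloat16 weights alone (2 bytes/parameter).
params = 11e9          # ~11B parameters
bytes_per_param = 2    # bfloat16
gib = params * bytes_per_param / 1024**3
print(round(gib))      # ~20 GiB of weight memory, before activations
```

With `device_map="auto"`, `transformers` will spread the weights across available devices if a single GPU is not large enough.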
Here's an example of how to use this model in Python with the `transformers` library, wrapped in a small Gradio demo:

```python
import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Use GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a model response from a text message and an image
def predict(message, image):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": message},
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(device)
    response = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(response[0], skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send")
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image")
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    def respond(message, image, history):
        history = history + [(message, "")]  # show the user message immediately
        response = predict(message, image)
        history[-1] = (message, response)    # fill in the model's reply
        return history

    submit_button.click(
        fn=respond,
        inputs=[text_input, image_input, chatbot],
        outputs=chatbot,
    )

demo.launch()
```

This code provides a simple Gradio interface for interacting with the model: upload an image, type a message, and the model generates a response conditioned on both inputs.
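Note that `model.generate` returns the prompt ids followed by the newly generated ids, so the decoded string above includes the prompt text. A small sketch (`strip_prompt_ids` is an illustrative helper, not part of `transformers`) of slicing the prompt off before decoding:

```python
# Illustrative helper: generate() output = prompt ids + new ids, so slice
# the prompt off if you only want the model's reply.
def strip_prompt_ids(sequence, prompt_len):
    return sequence[prompt_len:]

full_sequence = [10, 11, 12, 13, 99, 98, 97]  # 4 prompt ids + 3 generated ids
print(strip_prompt_ids(full_sequence, prompt_len=4))  # [99, 98, 97]
```

In `predict` above, the prompt length is `inputs["input_ids"].shape[-1]`, so the reply alone is `processor.decode(response[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`.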

## More Information

For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).

## License

This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).