ruslanmv
/

Llama-3.2-11B-Vision-Instruct

Safetensors

English

mllama

ruslanmv committed on Oct 2, 2024

Commit

e2fa1a5

1 Parent(s): 9d2abe2

Create README.md

Files changed (1)

README.md +89 -0

README.md ADDED

	@@ -0,0 +1,89 @@

+# Llama-3.2-11B-Vision-Instruct
+This model is based on Meta's Llama-3.2-11B-Vision-Instruct and is fine-tuned for multimodal generation.
+## Model Description
+This model is a vision-language model capable of generating text from a given image and text prompt. It's based on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including:
+* **Image captioning:** Generating descriptive captions for images.
+* **Visual question answering:** Answering questions about the content of images.
+* **Image-based dialogue:** Engaging in conversations based on visual input.
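The three tasks above differ only in the prompt that accompanies the image. As a minimal sketch, the helper below builds the message list in the same shape the How to Use example passes to `processor.apply_chat_template`. The name `build_messages`, the sample prompts, and the choice to attach the image placeholder to the first user turn are illustrative assumptions, not part of the model's API.

```python
def build_messages(prompt, history=None):
    """Build a chat message list with a single image placeholder.

    `history` is an optional list of (user_text, assistant_text) pairs
    from earlier turns about the same image (hypothetical convention).
    """
    turns = list(history or []) + [(prompt, None)]
    messages = []
    for index, (user_text, assistant_text) in enumerate(turns):
        user_content = [{"type": "text", "text": user_text}]
        if index == 0:
            # Assumption: the image placeholder belongs to the first user turn only.
            user_content.insert(0, {"type": "image"})
        messages.append({"role": "user", "content": user_content})
        if assistant_text is not None:
            messages.append({
                "role": "assistant",
                "content": [{"type": "text", "text": assistant_text}],
            })
    return messages


# Image captioning: a single descriptive request.
caption_request = build_messages("Describe this image in one sentence.")

# Visual question answering: a specific question about the image.
vqa_request = build_messages("How many people are in the picture?")

# Image-based dialogue: earlier turns are replayed before the new question.
dialogue_request = build_messages(
    "What breed might it be?",
    history=[("What animal is this?", "It looks like a dog.")],
)
```

Each result can be handed to `processor.apply_chat_template(..., add_generation_prompt=True)` in place of the hard-coded list in the example.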
+## Intended Uses & Limitations
+This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information, and should not be used for making critical decisions.
+**Limitations:**
+* The model may not always accurately interpret the content of images.
+* It may be biased towards certain types of images or concepts.
+* It may generate inappropriate or offensive content.
+## How to Use
+Here's an example of how to use this model in Python with the `transformers` library, wrapped in a small Gradio chat interface:
+```python
+import gradio as gr
+import torch
+from transformers import AutoProcessor, MllamaForConditionalGeneration
+# Use GPU if available, otherwise CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load the model and processor
+model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
+processor = AutoProcessor.from_pretrained(model_name)
+model = MllamaForConditionalGeneration.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+# Function to generate model response
+def predict(message, image):
+    messages = [{"role": "user", "content": [
+        {"type": "image"},
+        {"type": "text", "text": message}
+    ]}]
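+    # The chat template already adds the begin-of-text token, so the processor must not add it again
+    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
+    output = model.generate(**inputs, max_new_tokens=100)
+    # Decode only the newly generated tokens, not the echoed prompt
+    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
+    return processor.decode(new_tokens, skip_special_tokens=True)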
+# Gradio interface
+with gr.Blocks() as demo:
+    gr.Markdown("# Simple Multimodal Chatbot")
+    with gr.Row():
+        with gr.Column():  # Message input on the left
+            text_input = gr.Textbox(label="Message")
+            submit_button = gr.Button("Send")
+        with gr.Column():  # Image input on the right
+            image_input = gr.Image(type="pil", label="Upload an Image")
+    chatbot = gr.Chatbot()  # Chatbot output at the bottom
+    def respond(message, image, history):
+        history = history + [(message, "")]
+        response = predict(message, image)
+        history[-1] = (message, response)
+        return history
+    submit_button.click(
+        fn=respond,
+        inputs=[text_input, image_input, chatbot],
+        outputs=chatbot
+    )
+demo.launch()
+```
+This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
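The interface above calls the model even when no image has been uploaded or the message is empty, which is likely to raise an error inside the processor. A minimal sketch of a guard is shown below. `make_respond`, `fake_predict`, and the error messages are hypothetical names introduced for illustration, and the history is assumed to use the same `(message, response)` pair format as the example.

```python
def make_respond(predict_fn):
    """Wrap a predict(message, image) callable with basic input checks."""

    def respond(message, image, history):
        history = list(history or [])
        message = (message or "").strip()
        if not message:
            # Nothing to send: leave the chat unchanged.
            return history
        if image is None:
            # The prompt contains an image slot, so an image is required.
            return history + [(message, "Please upload an image first.")]
        try:
            reply = predict_fn(message, image)
        except Exception as error:
            # Surface the failure in the chat instead of crashing the app.
            reply = f"Generation failed: {error}"
        return history + [(message, reply)]

    return respond


# Stand-in for the real predict() so the sketch runs without the model.
def fake_predict(message, image):
    return f"echo: {message}"


respond = make_respond(fake_predict)
print(respond("What is in this picture?", None, []))
```

In the real app, `make_respond(predict)` would replace the inner `respond` function passed to `submit_button.click`.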
+## More Information
+For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).
+## License
+This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).