language:
  - en
  - de
  - fr
  - it
  - pt
  - hi
  - es
  - th
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-3

This repository is a pre-release checkpoint for Llama 3.2 11B Vision Instruct.

It contains two versions of the model: one for use with transformers, and one for the original llama3 codebase (in the original directory).

Inference with transformers

Please install the in-progress development wheel from https://huggingface.co/nltpt/transformers/tree/main.
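After installing the wheel, a quick sanity check can confirm that the build exposes the Mllama model class used in the snippet further down. This is a minimal sketch, not part of the official instructions; the helper name `mllama_available` is hypothetical.

```python
import importlib.util


def mllama_available() -> bool:
    """Return True if the installed transformers build exposes the Mllama class.

    Hypothetical helper for illustration; it only checks that the name exists
    and does not load any weights.
    """
    # If transformers is not installed at all, there is nothing to check.
    if importlib.util.find_spec("transformers") is None:
        return False
    try:
        import transformers

        return hasattr(transformers, "MllamaForConditionalGeneration")
    except Exception:
        # Any import-time failure is treated as "not available".
        return False


print("Mllama support detected:", mllama_available())
```

If this prints `False`, the environment is probably still using a transformers build without Mllama support.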

This is an example inference snippet (API subject to change):

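import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Load the model in bfloat16 and let device_map="auto" place it on the available devices.
model_id = "nltpt/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# A single user turn: one image placeholder followed by the text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe image in two sentences"}
        ]
    }
]
# Render the conversation into a prompt string that ends with the assistant header.
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Download the example image.
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the prompt, preprocess the image, and move the tensors to the model's device.
inputs = processor(text=text, images=raw_image, return_tensors="pt").to(model.device)
# Greedy decoding; max_new_tokens=25 caps the reply, so the text may stop mid-sentence.
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
# Decodes prompt and reply together, including special tokens.
print(processor.decode(output[0]))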

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe image in two sentences<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The image depicts a serene lake scene, featuring a long wooden dock extending into the calm water, with a dense forest of trees
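The decoded string above contains the prompt and the chat special tokens as well as the reply. If only the assistant's text is needed, one option is to split on the assistant header. The sketch below assumes the token layout shown in the output above; the helper name `extract_assistant_reply` is hypothetical and not part of transformers.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text that follows the last assistant header in `decoded`."""
    _, header, reply = decoded.rpartition(ASSISTANT_HEADER)
    if not header:
        raise ValueError("no assistant header found in decoded text")
    # Drop the end-of-turn marker if generation reached it, then trim whitespace.
    return reply.split(END_OF_TURN, 1)[0].strip()


# Shortened copy of the decoded output shown above.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<|image|>Describe image in two sentences<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "The image depicts a serene lake scene"
)
print(extract_assistant_reply(decoded))
```

An alternative that avoids string parsing is to slice off the prompt tokens before decoding (the prompt length is `inputs["input_ids"].shape[-1]`) and pass `skip_special_tokens=True` to `processor.decode`.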

Running the original checkpoints

Installing the llama-models package provides three executable scripts:

example_chat_completion
example_text_completion
multimodal_example_chat_completion

You can invoke them with torchrun as follows:

CHECKPOINT_DIR=~/.llama/checkpoints/Llama-3.2-11B-Vision-Instruct/

torchrun `which multimodal_example_chat_completion` "$CHECKPOINT_DIR"
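The same invocation can be assembled from Python, for example when launching the script from a larger program. This is a minimal sketch under the assumption that `torchrun` and the example script are on `PATH`; the helper name `build_torchrun_command` and its `which` parameter are hypothetical.

```python
import os
import shutil


def build_torchrun_command(script_name, checkpoint_dir, which=shutil.which):
    """Return the argv list equivalent to: torchrun `which <script>` "<dir>".

    `which` is injectable so the lookup can be replaced in tests.
    """
    script_path = which(script_name)
    if script_path is None:
        raise FileNotFoundError(
            f"{script_name!r} was not found on PATH; is the package installed?"
        )
    # Expand "~" the way the shell does for the CHECKPOINT_DIR assignment.
    return ["torchrun", script_path, os.path.expanduser(checkpoint_dir)]


# Example usage (not executed here because it starts a real run):
#   import subprocess
#   cmd = build_torchrun_command(
#       "multimodal_example_chat_completion",
#       "~/.llama/checkpoints/Llama-3.2-11B-Vision-Instruct/",
#   )
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if the checkpoint path contains spaces.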

To read the scripts' source code, locate the package directory with something like:

PACKAGE_DIR=$(pip show -f llama-models | grep Location | awk '{ print $2 }')

echo "Scripts are in the directory: $PACKAGE_DIR/llama-models/scripts/"
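The lookup can also be done without parsing `pip show` output, using the standard library's `importlib.metadata`. This is a rough sketch: the helper name `find_scripts_dir` is hypothetical, and the `llama-models/scripts` subdirectory is copied from the `echo` line above rather than verified against the installed package.

```python
from importlib import metadata


def find_scripts_dir(dist_name="llama-models", subdir="llama-models/scripts"):
    """Return the expected scripts path for an installed distribution.

    Returns None when the distribution is not installed. The default `subdir`
    is taken from the README text and is an assumption.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # locate_file resolves a path relative to the distribution's install location.
    return dist.locate_file(subdir)


print("Scripts are in the directory:", find_scripts_dir())
```

The function prints `None` for the path if llama-models is not installed in the current environment.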