Model Information

A converted version of meta-llama/Llama-3.2-11B-Vision-Instruct to OpenVINO Intermediate Representation (IR) for inference on CPU devices.

The model consists of 2 parts:

  • Image Encoder, as openvino_vision_encoder.bin, which encodes input images into the LLM cross-attention states space;
  • Language Model, as openvino_language_model.bin, which generates the answer based on the cross-attention states provided by the Image Encoder and the input tokens.

Then, to reduce memory consumption, weight compression was applied using the Neural Network Compression Framework (NNCF), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

Note: the compressed model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, produced with the following settings:

  • 4 bits (INT4)
  • group size = 64
  • asymmetric quantization
  • AWQ method
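As a rough illustration of what these settings mean, here is a pure-Python sketch of asymmetric 4-bit group-wise quantization (not NNCF's actual implementation): each run of 64 weights shares one scale and zero point, and values are rounded to 16 levels.

```python
# Illustrative sketch of INT4 asymmetric group-wise quantization (group size 64).
# This is NOT NNCF internals, only the underlying arithmetic.
import numpy as np

def quantize_int4_asym(weights: np.ndarray, group_size: int = 64):
    """Quantize a flat float array to unsigned 4-bit codes, per group of 64."""
    assert weights.size % group_size == 0
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    # 16 levels [0, 15]; per-group scale and zero point (asymmetric scheme)
    scale = np.maximum(w_max - w_min, 1e-9) / 15.0
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, 15)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
q, s, z = quantize_int4_asym(w)
w_hat = dequantize(q, s, z)
# Reconstruction error is bounded by roughly one quantization step per group
print(float(np.abs(w - w_hat).max()))
```

A smaller group size (here 64) means more scales/zero points to store but a tighter fit per group, which is the accuracy/size trade-off the gs64 setting controls.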

Finally, an INT8-quantized version of the Image Encoder only can be found as openvino_vision_encoder_int8.bin/.xml.
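To see why 4-bit compression matters for memory, here is a back-of-envelope estimate. The parameter count and the 32-bit per-group overhead (fp16 scale plus zero point) are illustrative assumptions, not measured figures for this checkpoint.

```python
# Rough storage estimate: packed weights plus per-group scale/zero-point overhead.
# 10.7e9 is a hypothetical language-model weight count used only for illustration.
def weight_bytes(n_params, bits, group_size=None, overhead_bits=32):
    """Approximate checkpoint size in bytes for a given weight precision."""
    total_bits = n_params * bits
    if group_size:
        # each group of weights carries an fp16 scale and zero point (~32 bits)
        total_bits += (n_params / group_size) * overhead_bits
    return total_bits / 8

n = 10.7e9  # assumed LLM weight count (illustrative)
fp16 = weight_bytes(n, 16)
int4 = weight_bytes(n, 4, group_size=64)
print(f"FP16 ~ {fp16/1e9:.1f} GB, INT4(gs=64) ~ {int4/1e9:.1f} GB, "
      f"ratio ~ {fp16/int4:.1f}x")
```

Even with the per-group metadata, the 4-bit representation is roughly 3.5x smaller than fp16, which is what makes CPU inference of an 11B multimodal model practical.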

Replication Recipe

Step 1 Install Requirements

I suggest installing the requirements into a dedicated Python virtualenv or a conda environment.

pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

Step 2 Convert the model to OpenVINO Intermediate Representation (IR)

from pathlib import Path
from ov_mllama_helper import convert_mllama
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"
convert_mllama(model_id, model_dir)

Step 3 INT4 Compression

from ov_mllama_compression import compress
from ov_mllama_compression import compression_widgets_helper
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)

Step 4 INT8 Image Encoder Optimization

from ov_mllama_compression import vision_encoder_selection_widget
# `device` is assumed to be a device-selection widget created earlier in the
# notebook; when running as a script, pass a device string such as "CPU" instead
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc
from data_preprocessing import prepare_dataset_vision
processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()
fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()
vision_encoder_path = int8_vision_encoder_path
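The smooth_quant_alpha=0.6 passed above controls SmoothQuant-style scale migration: per-channel activation outliers are partially shifted into the weights before quantization, which keeps activations easier to quantize without changing the layer's output. A numpy sketch of the underlying idea (not NNCF's implementation) follows.

```python
# Illustrative SmoothQuant idea: divide activations and multiply weights by a
# per-channel scale s_j = |X_j|^alpha / |W_j|^(1-alpha), leaving X @ W unchanged.
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.6):
    """Per-channel migration scales balancing activation vs weight ranges."""
    return act_absmax**alpha / w_absmax**(1 - alpha)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
X[:, 0] *= 50.0                      # channel 0 carries activation outliers
W = rng.normal(size=(4, 3))

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1), alpha=0.6)
X_s, W_s = X / s, W * s[:, None]     # migrate the outlier range into the weights

print(np.allclose(X @ W, X_s @ W_s))           # the matmul result is preserved
print(np.abs(X_s).max() < np.abs(X).max())     # activation range has shrunk
```

Alpha trades off how much of the difficulty moves to the weights: alpha=1 pushes everything, alpha=0 nothing; 0.6 leans slightly toward smoothing the activations.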

License

Llama 3.2 Community License

Disclaimer

This quantized model comes with no warranty. It has been developed for research purposes only.
