|
DEPLOY_TEXT = f""" |
|
|
|
Having a table full of powerful models is nice and all, but at the end of the day you have to be able to use them for something. Below you will find sample code to help you load models and perform inference.
|
|
|
|
|
## Inference with Gaudi 2 |
|
Habana's SDK, Intel Gaudi Software, supports PyTorch and DeepSpeed for accelerating LLM training and inference. The Intel Gaudi Software graph compiler optimizes the execution of the operations accumulated in the graph, e.g. operator fusion, data layout management, parallelization, pipelining, memory management, and graph-level optimizations.
|
|
|
Optimum Habana provides convenient functionality for various tasks. Below you'll find the command-line snippet you would run to perform inference on Gaudi with meta-llama/Llama-2-7b-hf.
|
|
|
The "run_generation.py" script below can be found [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation) |
|
|
|
```bash
python run_generation.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --batch_size 2 \
    --prompt "Hello world" "How are you?"
```
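If you'd rather call the same stack from Python than use the example script, here is a minimal sketch. It assumes a Gaudi machine with `optimum-habana` and the Habana PyTorch bridge (`habana_frameworks`) installed; `adapt_transformers_to_gaudi` patches transformers with Gaudi-optimized model implementations, and wrapping the model in an HPU graph is optional.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi
import habana_frameworks.torch as ht

# Patch transformers so models use Gaudi-optimized implementations
adapt_transformers_to_gaudi()

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

# Optional: capture the forward pass as an HPU graph to reduce host overhead
model = ht.hpu.wrap_in_hpu_graph(model)

inputs = tokenizer("Hello world", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```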
|
|
|
## Inference with Intel Extension for Transformers
|
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM workloads everywhere, with optimal performance for Transformer-based models on various Intel platforms, including Intel Gaudi 2, Intel CPUs, and Intel GPUs.
|
|
|
### INT4 Inference (CPU) |
|
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "When winter becomes spring, the flowers..."

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit=True quantizes the weights to INT4 at load time
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
|
### INT4 Inference (GPU) |
|
```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "When winter becomes spring, the flowers..."
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)

# Apply IPEX transformer optimizations with weight-only quantization (woq) on the XPU
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)
```
|
|
|
## Intel Extension for PyTorch
|
Intel® Extension for PyTorch extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs, as well as Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch provides easy GPU acceleration for Intel discrete GPUs through the PyTorch `xpu` device.
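As a quick illustration of the `xpu` device, here is a minimal sketch, assuming an XPU build of Intel® Extension for PyTorch is installed on a machine with an Intel discrete GPU; the toy model is just a stand-in for any PyTorch module.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

# A toy model standing in for any PyTorch module
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
model = model.to("xpu")
data = torch.randn(1, 128).to("xpu")

# Apply IPEX operator and memory-layout optimizations for inference
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.xpu.amp.autocast(dtype=torch.bfloat16):
    output = model(data)
```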
|
|
|
There are a few flavors of Intel® Extension for PyTorch that can be leveraged for inference. For detailed documentation, visit https://intel.github.io/intel-extension-for-pytorch/#introduction
|
|
|
### IPEX with Optimum Intel (no quantization) |
|
Requires installing/updating optimum: `pip install --upgrade-strategy eager optimum[ipex]`
|
```python
from optimum.intel import IPEXModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # any causal LM on the Hub; gpt2 is used here for illustration
model = IPEXModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("A fisherman at sea...")
```
|
|
|
### IPEX with Stock PyTorch (Mixed Precision)
```python
import torch
import intel_extension_for_pytorch as ipex
import transformers

model_name_or_path = "gpt2"  # any causal LM on the Hub; gpt2 is used here for illustration
model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)

dtype = torch.float  # or torch.bfloat16 for mixed precision on supported CPUs
model = ipex.llm.optimize(model, dtype=dtype)

# generation inference loop
inputs = tokenizer("A fisherman at sea...", return_tensors="pt").input_ids
with torch.inference_mode():
    outputs = model.generate(inputs, max_new_tokens=50)
```
|
|
|
## OpenVINO Toolkit
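OpenVINO is Intel's open-source toolkit for optimizing and deploying deep learning inference. Through Optimum Intel, `OVModelForCausalLM` loads an OpenVINO model and exposes the familiar transformers generation API: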
|
|
|
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("In the spring, beautiful flowers bloom...")
```
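If a checkpoint hasn't already been converted to OpenVINO's IR format (as `helenai/gpt2-ov` has), Optimum Intel can export it on the fly by passing `export=True`:

```python
from optimum.intel import OVModelForCausalLM

# Convert a PyTorch checkpoint to OpenVINO IR at load time
model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
model.save_pretrained("gpt2-ov")  # save the converted model for reuse
```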
|
|
|
|
|
""" |