|
---
license: llama3.1
---
|
|
|
## Introduction |
|
This is a vLLM-compatible fp8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-405B-Instruct](https://huggingface.co./meta-llama/Meta-Llama-3.1-405B-Instruct).
|
For details on the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
|
|
|
## Quickstart |
|
|
|
To run this fp8 model with the vLLM framework, follow the steps below.
|
|
|
### Model Preparation
|
1. Build the ROCm vLLM Docker image from this [Dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm), then launch a vLLM Docker container:
|
|
|
```sh |
|
docker build -f Dockerfile.rocm -t vllm_test . |
|
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest |
|
``` |
|
|
|
2. Clone the baseline [Meta-Llama-3.1-405B-Instruct](https://huggingface.co./meta-llama/Meta-Llama-3.1-405B-Instruct) model, for example with the Hugging Face CLI as shown below.
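One way to do this, assuming `huggingface_hub` is installed inside the container and your token has been granted access to the gated Meta-Llama repository:

```sh
# download the baseline weights into the local Hugging Face cache
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct
```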
|
3. Clone this [fp8 model](https://huggingface.co./amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm) and, inside the cloned folder, run the following to merge the split llama-*.safetensors shards into a single llama.safetensors:
|
|
|
```sh |
|
python merge.py |
|
``` |
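`merge.py` ships with this repository. Conceptually, the merge loads every `llama-*.safetensors` shard and re-saves the combined tensor dictionary; the sketch below only illustrates that idea (it assumes the shards fit in host memory and is not the shipped `merge.py`):

```python
# illustrative sketch only -- use the provided merge.py for the actual merge
import glob

from safetensors.torch import load_file, save_file

merged = {}
for shard in sorted(glob.glob("llama-*.safetensors")):
    # each shard holds a disjoint subset of the model's tensors
    merged.update(load_file(shard))

save_file(merged, "llama.safetensors")
```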
|
|
|
4. Once the merged llama.safetensors has been created, copy it together with llama.json into the local snapshot directory of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co./meta-llama/Meta-Llama-3.1-405B-Instruct) using the commands below. The snapshot commit hash (069992c75aed59df00ec06c17177e76c63296a26 here) may differ on your system.
|
```sh |
|
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/. |
|
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/. |
|
``` |
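If you are unsure which snapshot commit hash you have locally, you can list the snapshots directory first (assuming the baseline model was saved under `~/models--meta-llama--Meta-Llama-3.1-405B-Instruct`, as in the copy commands above):

```sh
ls ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/
```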
|
|
|
### Running the fp8 model
|
|
|
```sh |
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py |
|
``` |
|
|
|
```python |
|
# run_vllm_fp8.py |
|
from vllm import LLM, SamplingParams |
|
prompt = "Write me an essay about bear and knight" |
|
|
|
model_name="models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/" |
|
tp=8 # 8 GPUs |
|
|
|
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors") |
|
sampling_params = SamplingParams( |
|
top_k=1,  # greedy decoding: only the most likely token is considered
|
ignore_eos=True, |
|
max_tokens=200, |
|
) |
|
result = model.generate(prompt, sampling_params=sampling_params) |
|
print(result) |
|
``` |
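`model.generate` returns a list of `RequestOutput` objects; to print only the generated text rather than the full objects, you can do, for example:

```python
# print just the generated continuation for each prompt
for output in result:
    print(output.outputs[0].text)
```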
|
### Running the fp16 model (for comparison)
|
|
|
```sh |
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py |
|
``` |
|
|
|
```python |
|
# run_vllm_fp16.py |
|
from vllm import LLM, SamplingParams |
|
prompt = "Write me an essay about bear and knight" |
|
|
|
model_name="models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/" |
|
tp=8 # 8 GPUs |
|
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16") |
|
sampling_params = SamplingParams( |
|
top_k=1,  # greedy decoding: only the most likely token is considered
|
ignore_eos=True, |
|
max_tokens=200, |
|
) |
|
result = model.generate(prompt, sampling_params=sampling_params) |
|
print(result) |
|
``` |
|
## fp8 GEMM tuning

Will be updated soon.
|
|
|
## License
|
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved. |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
you may not use this file except in compliance with the License. |
|
You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
See the License for the specific language governing permissions and |
|
limitations under the License. |