|
--- |
|
base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0 |
|
inference: false |
|
language: |
|
- ko |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# EEVE-Korean-Instruct-10.8B-v1.0-AWQ |
|
- Model creator: [Yanolja](https://huggingface.co./yanolja) |
|
- Original model: [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co./yanolja/EEVE-Korean-Instruct-10.8B-v1.0) |
|
|
|
<!-- description start --> |
|
## Description |
|
|
|
This repo contains AWQ model files for [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co./yanolja/EEVE-Korean-Instruct-10.8B-v1.0). |
|
|
|
|
|
### About AWQ |
|
|
|
AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with quality equivalent to or better than the most commonly used GPTQ settings.
|
|
|
It is supported by: |
|
|
|
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui) - using the AutoAWQ loader
|
- [vLLM](https://github.com/vllm-project/vllm) - Llama and Mistral models only |
|
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) |
|
- [Transformers](https://huggingface.co./docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers (see the sketch after this list)
|
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code |
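
For the Transformers route, here is a minimal sketch of loading this repo directly from Python. It assumes a CUDA GPU and the `autoawq` and `accelerate` packages are installed; the greeting prompt is only a placeholder.

```python
# Minimal sketch: load this repo's AWQ files with Transformers (>= 4.35.0).
# Assumes a CUDA GPU and the `autoawq` and `accelerate` packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run in FP16
    device_map="auto",
)

# Placeholder prompt for a quick smoke test.
prompt = "안녕하세요!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```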
|
|
|
<!-- description end --> |
|
|
|
<!-- README_AWQ.md-use-from-vllm start --> |
|
## Using the OpenAI Chat API with vLLM
|
|
|
Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/). |
|
|
|
- Please ensure you are using vLLM version 0.2 or later. |
|
- When using vLLM as a server, pass the `--quantization awq` parameter. |
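
For reference, a recent vLLM release can typically be installed from PyPI (assuming a Linux environment with a supported CUDA toolkit; see the documentation linked above for details):

```shell
pip install "vllm>=0.2.0"
```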
|
|
|
#### Start the OpenAI-Compatible Server: |
|
- vLLM can be deployed as a server that implements the OpenAI API protocol, allowing it to serve as a drop-in replacement for applications that use the OpenAI API.
|
|
|
```shell
python3 -m vllm.entrypoints.openai.api_server --model Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ --quantization awq --dtype half
```
|
- `--model`: the Hugging Face model path

- `--quantization`: `awq`

- `--dtype`: `half` for FP16, recommended for AWQ quantization
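
Once the server is running, one quick sanity check is to list the served models via the OpenAI-compatible `/v1/models` route (assuming the default host and port from the command above):

```shell
curl http://localhost:8000/v1/models
```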
|
|
|
#### Querying the model using the OpenAI Chat API:
|
- You can use the create chat completion endpoint to communicate with the model in a chat-like interface: |
|
|
|
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
    "messages": [
      {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
      {"role": "user", "content": "관중은 슬퍼서 눈물이 나면 어떻게 하나요?"}
    ]
  }'
```
|
|
|
#### Python Client Example: |
|
- Using the `openai` Python package, you can also communicate with the model in a chat-like manner:
|
|
|
```python
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"  # placeholder; vLLM ignores the key unless one is configured
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
    messages=[
        {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
        {"role": "user", "content": "관중은 슬퍼서 눈물이 나면 어떻게 하나요?"},
    ],
)
print("Chat response:", chat_response)
```
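
#### Streaming Example:
- The same endpoint also supports streaming. A brief sketch, reusing the `client` from above and the `openai` package's `stream` flag:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
    messages=[
        {"role": "user", "content": "관중은 슬퍼서 눈물이 나면 어떻게 하나요?"},
    ],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```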
|
<!-- README_AWQ.md-use-from-vllm end -->