Phi-3-mini-128k-instruct-int4

Description

Phi-3-mini-128k-instruct-int4 is an int4 (4-bit, weight-only) quantized version of microsoft/Phi-3-mini-128k-instruct, quantized with group_size 128.
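
To make "int4 with group_size 128" concrete, here is a minimal sketch of group-wise 4-bit quantization using plain round-to-nearest. It is not the AutoRound algorithm, which tunes the rounding, and every function name in it is hypothetical. It only illustrates that each group of 128 weights shares one scale and one zero point. The storage estimate assumes one FP16 scale and one 4-bit zero point per group, which is an assumption about the packed format rather than something stated in this card.

```python
import random

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** bits - 1                      # 15 for int4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0     # fall back to 1.0 for a constant group
    zero_point = round(-w_min / scale)
    # Map each weight to an integer code and clamp it into [0, qmax].
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point

def dequantize_group(codes, scale, zero_point):
    """Recover approximate float weights from the integer codes."""
    return [(c - zero_point) * scale for c in codes]

def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # toy weights, not real model data
groups = quantize_row(row, group_size=128, bits=4)

# Per-group reconstruction error and the group's quantization step (scale).
errors = []
for index, (codes, scale, zero_point) in enumerate(groups):
    original = row[index * 128:(index + 1) * 128]
    restored = dequantize_group(codes, scale, zero_point)
    errors.append((max(abs(a - b) for a, b in zip(original, restored)), scale))

# Assumed storage cost: 4 bits per weight plus one FP16 scale (16 bits)
# and one 4-bit zero point shared by every 128 weights.
bits_per_weight = 4 + (16 + 4) / 128

print(len(groups), bits_per_weight)
```

A smaller group size gives each scale fewer weights to cover, which usually lowers the rounding error but raises the per-weight storage overhead.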

The model was quantized using AutoRound (Advanced Weight-Only Quantization Algorithm for LLMs), released by Intel.

You can find more details in the AutoRound GitHub repository.

Training details

Clone the AutoRound repository

git clone https://github.com/intel/auto-round

Enter the examples/language-modeling folder and install the requirements

cd auto-round/examples/language-modeling
pip install -r requirements.txt

Install FlashAttention-2

pip install flash_attn==2.5.8

Here is a simplified quantization command. To save memory during quantization, we set the batch size to 1 and accumulate gradients over 8 steps.

python main.py \
  --model_name "microsoft/Phi-3-mini-128k-instruct" \
  --bits 4 \
  --group_size 128 \
  --train_bs 1 \
  --gradient_accumulate_steps 8 \
  --deployment_device 'gpu' \
  --output_dir "./save_ckpt" 
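
The combination of --train_bs 1 and --gradient_accumulate_steps 8 can be read as trading memory for time: only one sample is processed at once, but the update reflects 8 samples. The sketch below illustrates the general technique of gradient accumulation on a hypothetical one-parameter model. How AutoRound applies these flags internally is not shown in this card, so treat the scaling used here as an assumption.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Eight toy (x, y) samples; purely illustrative data.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w = 0.5
train_bs = 1
accumulate_steps = 8

# Reference: the gradient computed on one batch holding all 8 samples.
full_batch_grad = grad_mse(w, data)

# Accumulation: 8 micro-batches of size 1, each contribution divided by
# the number of accumulation steps before being summed.
accumulated_grad = 0.0
for sample in data:
    accumulated_grad += grad_mse(w, [sample]) / accumulate_steps

effective_batch_size = train_bs * accumulate_steps

print(full_batch_grad, accumulated_grad, effective_batch_size)
```

Under this scaling, the accumulated gradient matches the full-batch gradient, while peak memory is set by the micro-batch of size 1.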

Model inference

Install the necessary packages

pip install auto_gptq
pip install optimum
pip install -U accelerate bitsandbytes datasets peft transformers

Example code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "ssuncheol/Phi-3-mini-128k-instruct-int4", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("ssuncheol/Phi-3-mini-128k-instruct-int4")

messages = [
    {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
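
The generation arguments above set do_sample to False, which selects greedy decoding: at every step the token with the highest score is chosen, so repeated runs should give the same text, and the temperature value is not expected to have any effect. The following sketch contrasts greedy selection with sampling on hypothetical toy logits. The helper names are illustrative and are not part of transformers.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; the temperature rescales the logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy_pick(logits):
    """Greedy decoding step: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Sampling step: draw an index according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [1.2, 3.4, 0.3, 2.9]            # hypothetical scores for 4 tokens

# Greedy selection is deterministic: every call returns the same index.
greedy_choices = {greedy_pick(toy_logits) for _ in range(5)}

# Sampling with temperature 1.0 visits several different indices.
rng = random.Random(0)
sampled_choices = {sample_pick(toy_logits, 1.0, rng) for _ in range(200)}

print(greedy_choices, sorted(sampled_choices))
```

Greedy decoding is a reasonable default for checking that a quantized model still behaves like the original, because differences in output cannot be blamed on sampling randomness.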

License

The model is licensed under the MIT license.
