---
language:
- en
- zh
license: apache-2.0
tags:
- text-generation-inference
- transformers
- Chinese
- unsloth
- llama
- trl
base_model: waylandzhang/Llama-3-8b-Chinese-Novel-4bit-lesson-v0.1
---
# Uploaded model
- Developed by: waylandzhang
- License: apache-2.0
- Finetuned from model: unsloth/llama-3-8b-bnb-4bit
Teaching-purpose model. This model exists only to accompany my video tutorials :D
QLoRA (4bit)
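QLoRA keeps the frozen base weights quantized to 4-bit (bitsandbytes NF4) and trains only the small LoRA adapter matrices on top, which is what lets an 8B model be fine-tuned on a single consumer GPU.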
## Params to replicate training
### PEFT Config
```python
# Attach LoRA adapters with unsloth's get_peft_model (kwargs as used in training)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    use_rslora=False,   # Rank-stabilized LoRA (off)
    loftq_config=None,  # LoftQ (off)
)
```
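With `r=8` and `lora_alpha=16`, each LoRA update is scaled by `lora_alpha / r = 2` before being added to the frozen base weights; rank-stabilized LoRA and LoftQ are both left disabled here.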
### Training args
```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # set to 4 to avoid issues with GPTQ Quantization
    warmup_steps=5,
    max_steps=300,  # Fine-tune iterations
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    evaluation_strategy="steps",
    prediction_loss_only=True,
    eval_accumulation_steps=1,
    eval_steps=10,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # instead of "linear"
    seed=1337,
    output_dir="wayland-files/models",
    report_to="wandb",  # Log report to W&B
)
```
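To run the fine-tune, these arguments plug into TRL's `SFTTrainer` together with the PEFT-wrapped model above. A minimal sketch, assuming a dataset whose `text` column already contains Alpaca-formatted prompts; the dataset name is a placeholder:

```python
from trl import SFTTrainer
from datasets import load_dataset

# Placeholder dataset name; substitute your own pre-formatted corpus.
dataset = load_dataset("your-username/your-novel-dataset", split="train")
split = dataset.train_test_split(test_size=0.1, seed=1337)

trainer = SFTTrainer(
    model=model,                  # PEFT-wrapped model from above
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],   # required since evaluation_strategy="steps"
    dataset_text_field="text",    # column holding the formatted prompts
    max_seq_length=4096,
    args=args,                    # the TrainingArguments defined above
)
trainer.train()
```

With `per_device_train_batch_size=2` and `gradient_accumulation_steps=4`, the effective batch size is 8 sequences per optimizer step.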
## Inference Code
```python
from unsloth import FastLanguageModel
import os
import torch

max_seq_length = 4096  # 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="waylandzhang/Llama-3-8b-Chinese-Novel-4bit-lesson-v0.1",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
FastLanguageModel.for_inference(model)  # unsloth's inference mode gives ~2x faster generation

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "给你一段话,帮我继续写下去。",  # instruction: "Here is a passage; continue writing it."
            "小明在西安城墙上",            # input: "Xiao Ming is on the Xi'an city wall"
            "",                            # output - leave empty to generate / fill in to continue from
        )
    ], return_tensors="pt").to("cuda")

# Option 1: generate the full text, then decode only the newly generated tokens
# outputs = model.generate(**inputs, max_new_tokens=500, use_cache=True)
# print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

# Option 2: stream tokens to stdout as they are generated
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=500)
```
This llama model was trained 2x faster with Unsloth and Huggingface's TRL library.