---
language:
- en
- zh
license: apache-2.0
tags:
- text-generation-inference
- transformers
- Chinese
- unsloth
- llama
- trl
base_model: waylandzhang/Llama-3-8b-Chinese-Novel-4bit-lesson-v0.1
---
# Uploaded model
- Developed by: waylandzhang
- License: apache-2.0
- Finetuned from model: unsloth/llama-3-8b-bnb-4bit
Teaching-purpose model. This model exists only to accompany my video tutorials :D
QLoRA (4bit)
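QLoRA keeps the frozen base weights quantized to 4-bit (bitsandbytes NF4) and trains only the small LoRA adapter matrices on top, which is what lets an 8B model be fine-tuned on a single consumer GPU.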
## Params to replicate training
### PEFT Config
```python
# Attach LoRA adapters with unsloth's get_peft_model (kwargs as used in training)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    use_rslora=False,   # Rank-stabilized LoRA (off)
    loftq_config=None,  # LoftQ (off)
)
```
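With `r=8` and `lora_alpha=16`, each LoRA update is scaled by `lora_alpha / r = 2` before being added to the frozen base weights; rank-stabilized LoRA and LoftQ are both left disabled here.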
### Training args
```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # set to 4 to avoid issues with GPTQ Quantization
    warmup_steps=5,
    max_steps=300,  # Fine-tune iterations
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    evaluation_strategy="steps",
    prediction_loss_only=True,
    eval_accumulation_steps=1,
    eval_steps=10,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # instead of "linear"
    seed=1337,
    output_dir="wayland-files/models",
    report_to="wandb",  # Log report to W&B
)
```
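To run the fine-tune, these arguments plug into TRL's `SFTTrainer` together with the PEFT-wrapped model above. A minimal sketch, assuming a dataset whose `text` column already contains Alpaca-formatted prompts; the dataset name is a placeholder:

```python
from trl import SFTTrainer
from datasets import load_dataset

# Placeholder dataset name; substitute your own pre-formatted corpus.
dataset = load_dataset("your-username/your-novel-dataset", split="train")
split = dataset.train_test_split(test_size=0.1, seed=1337)

trainer = SFTTrainer(
    model=model,                  # PEFT-wrapped model from above
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],   # required since evaluation_strategy="steps"
    dataset_text_field="text",    # column holding the formatted prompts
    max_seq_length=4096,
    args=args,                    # the TrainingArguments defined above
)
trainer.train()
```

With `per_device_train_batch_size=2` and `gradient_accumulation_steps=4`, the effective batch size is 8 sequences per optimizer step.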
## Inference Code
```python
from unsloth import FastLanguageModel
import os
import torch

max_seq_length = 4096  # 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="waylandzhang/Llama-3-8b-Chinese-Novel-4bit-lesson-v0.1",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
FastLanguageModel.for_inference(model)  # unsloth's inference mode gives ~2x faster generation

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "给你一段话,帮我继续写下去。",  # instruction: "Here is a passage; continue writing it."
            "小明在西安城墙上",            # input: "Xiao Ming is on the Xi'an city wall"
            "",                            # output - leave empty to generate / fill in to continue from
        )
    ], return_tensors="pt").to("cuda")

# Option 1: generate the full text, then decode only the newly generated tokens
# outputs = model.generate(**inputs, max_new_tokens=500, use_cache=True)
# print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

# Option 2: stream tokens to stdout as they are generated
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=500)
```
This llama model was trained 2x faster with Unsloth and Huggingface's TRL library.