---
language:
- en
- zh
license: apache-2.0
tags:
- text-generation-inference
- transformers
- Chinese
- unsloth
- llama
- trl
base_model: unsloth/llama-3-8b-bnb-4bit
---

# Uploaded model

- **Developed by:** waylandzhang
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3-8b-bnb-4bit

This is a teaching-purpose model; it only exists to accompany my tutorial videos. :D

**QLoRA (4bit)**

Params to replicate training (a complete end-to-end sketch is given at the bottom of this card):

Peft Config

```
r=8,
target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
],
lora_alpha=16,
lora_dropout=0,
bias="none",
random_state=3407,
use_rslora=False,   # Rank-stabilized LoRA
loftq_config=None,  # LoftQ
```

Training args

```
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
gradient_accumulation_steps=4,  # set to 4 to avoid issues with GPTQ quantization
warmup_steps=5,
max_steps=300,  # fine-tune iterations
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
evaluation_strategy="steps",
prediction_loss_only=True,
eval_accumulation_steps=1,
eval_steps=10,
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="cosine",  # instead of "linear"
seed=1337,
output_dir="wayland-files/models",
report_to="wandb",  # log report to W&B
```

**Inference Code**

```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # 2048
dtype = None  # None = auto-detect (bfloat16 if supported, otherwise float16)
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="waylandzhang/Llama-3-8b-Chinese-Novel-4bit-lesson-v0.1",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="cuda",
    attn_implementation="flash_attention_2",
)
FastLanguageModel.for_inference(model)  # Unsloth's inference mode gives ~2x faster generation

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "给你一段话,帮我继续写下去。",  # task instruction ("Here is a passage; continue writing it for me.")
            "小明在西安城墙上",              # user input ("Xiao Ming is on the Xi'an city wall")
            "",                              # output - leave blank to generate / fill in to force a prefix
        )
    ],
    return_tensors="pt",
).to("cuda")

# Option 1: generate the full output, then decode only the newly generated tokens
# outputs = model.generate(**inputs, max_new_tokens=500, use_cache=True)
# print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

# Option 2: stream tokens to stdout as they are generated
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=500)
```

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
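**Training replication sketch**

For convenience, here is a minimal end-to-end sketch of how the Peft Config and Training args above could be wired together with Unsloth and TRL's `SFTTrainer`. This is an assumption-based reconstruction, not the exact script used for this model: the dataset file `your_dataset.jsonl` and its `"text"` field are placeholders (the original training data is not published here), the eval-related arguments are omitted because they would require an eval split, and the `SFTTrainer` signature shown (`dataset_text_field` and `max_seq_length` passed directly) matches older TRL releases.

```python
import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

max_seq_length = 4096

# Load the 4-bit quantized base model (QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

# Attach LoRA adapters using the Peft Config listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Placeholder dataset: one pre-formatted Alpaca-style prompt per row in a "text" field.
dataset = load_dataset("json", data_files="your_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=300,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=1337,
        output_dir="wayland-files/models",
    ),
)
trainer.train()
```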