See axolotl config
axolotl version: 0.8.0.dev0
base_model: FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
hub_model_id: downquark/v12_qwen_datav5_lora
hub_strategy: "checkpoint"
push_dataset_to_hub:
hf_use_auth_token: true
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true
load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
  - path: downquark/dataset_llm_finetune
    type: input_output
    revision: dataset_v5_2025_2_23_qwen
    train_on_split: train
# A list of one or more datasets to eval the model with.
# You can use either test_datasets, or val_set_size, but not both.
test_datasets:
  - path: /workspace/test.jsonl
    ds_type: json
    type: input_output
    split: train
    data_files:
      - /workspace/test.jsonl
shuffle_merged_datasets: true
dataset_exact_deduplication: true
dataset_prepared_path: /workspace/data/last_run_prepared
val_set_size: 0.0
output_dir: /workspace/data/out
sequence_len: 2048 # magnum-v4: 32768
sample_packing: true
pad_to_sequence_len: true
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 128
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
peft_use_rslora: true
# unsloth_lora_mlp: true
# unsloth_lora_qkv: true
# unsloth_lora_o: true
# unsloth_cross_entropy_loss: true
# unsloth_rms_norm: true
# unsloth_rope: true
wandb_project: llm_finetune
wandb_entity:
wandb_watch:
wandb_name: v12_qwen_datav5_lora
wandb_log_model:
# LIMO: https://github.com/GAIR-NLP/LIMO/blob/main/train/examples/train_limo.yaml
#
# Critique fine-tuning: ... we select the best-performing checkpoint after training on the entire
# dataset for 1 epoch. We maintain consistent hyperparameters across all experiments with a learning rate of 5e-6,
# a cosine decay learning schedule with a warm-up ratio of 0.1, and a global batch size of 512.
#
# memory requirement table: https://www.reddit.com/r/LocalLLaMA/comments/18o5u0k/helpful_vram_requirement_table_for_qlora_lora_and/
# https://unsloth.ai/blog/mistral-benchmark
# mistral 7B lora 19.3GB (r16 a16, batch_size 4: 16GB, proj layers, seq 2048)
#
gradient_accumulation_steps: 4 # LIMO: 1, magnum-v4: 2
micro_batch_size: 1 # LIMO: 1, Critique Fine-Tuning: 512, magnum-v4: 1
num_epochs: 2 # LIMO: 15, Critique Fine-Tuning: 1, magnum-v4: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine # LIMO, Critique Fine-Tuning, magnum-v4: cosine
learning_rate: 5.0e-6 # LIMO and Critique Fine-Tuning: 5.0e-6, magnum-v4: 1.0e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 40
evals_per_epoch: 6
eval_batch_size: 1
eval_sample_packing: false
eval_max_new_tokens: 2048
saves_per_epoch: 3
debug:
deepspeed: deepspeed_configs/zero1.json
# deepspeed: ./deepspeed_configs/zero3_bf16.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: <pad>
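
The config sets peft_use_rslora: true with lora_r: 64 and lora_alpha: 128. Under rank-stabilized LoRA the low-rank update is scaled by lora_alpha / sqrt(lora_r) instead of the conventional lora_alpha / lora_r. The sketch below works out both factors for these values; the helper name lora_scaling is illustrative and is not part of axolotl or PEFT.

```python
import math


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    # rsLoRA divides by sqrt(r); conventional LoRA divides by r.
    return alpha / math.sqrt(r) if use_rslora else alpha / r


LORA_R = 64       # lora_r from the config
LORA_ALPHA = 128  # lora_alpha from the config

standard = lora_scaling(LORA_ALPHA, LORA_R, use_rslora=False)
rank_stabilized = lora_scaling(LORA_ALPHA, LORA_R, use_rslora=True)
print(f"alpha/r = {standard}, alpha/sqrt(r) = {rank_stabilized}")
```

For this config the rank-stabilized factor is 8 times the conventional one, which may be worth keeping in mind when comparing the learning rate here with recipes that use plain LoRA.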
v12_qwen_datav5_lora
This model is a fine-tuned version of FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview on the downquark/dataset_llm_finetune dataset. It achieves the following results on the evaluation set:
- Loss: 0.9681
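
The card does not include usage code. Below is a minimal, untested sketch of how an adapter like this one is typically attached to its base model with PEFT. It assumes that the adapter weights sit at the root of the downquark/v12_qwen_datav5_lora repository, that transformers, peft, bitsandbytes and accelerate are installed, and that the base model's tokenizer is suitable (the training config set pad_token to <pad>, which the base tokenizer may not define). The function name load_adapter_model is hypothetical.

```python
BASE_MODEL_ID = "FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview"
ADAPTER_ID = "downquark/v12_qwen_datav5_lora"  # hub_model_id from the config


def load_adapter_model(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model in 4-bit and attach the LoRA adapter (sketch only)."""
    # Imported lazily so this file can be read or imported without the GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Mirrors load_in_4bit: true from the training config; the compute dtype
    # is an assumption, not something the card states.
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=quant, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return model, tokenizer
```

Calling load_adapter_model() downloads a 32B-parameter base model, so it needs substantial disk space and GPU memory even in 4-bit.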
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: OptimizerNames.ADAMW_BNB (adamw_bnb_8bit) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 40
- num_epochs: 2.0
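
To make the listed hyperparameters concrete, the sketch below reproduces the reported total_train_batch_size and traces the general shape of a linear-warmup, cosine-decay schedule. It is an illustration under stated assumptions, not the trainer's exact code: TOTAL_STEPS is taken from the last step logged in the training results, num_processes=1 is chosen only because it reproduces the reported total of 4, and the function names are hypothetical.

```python
import math

BASE_LR = 5e-6       # learning_rate
WARMUP_STEPS = 40    # lr_scheduler_warmup_steps
TOTAL_STEPS = 1404   # assumption: last step logged in the training results


def effective_batch_size(micro_batch: int, grad_accum: int, num_processes: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return micro_batch * grad_accum * num_processes


def lr_at(step: int, base_lr: float = BASE_LR,
          warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(micro_batch=1, grad_accum=4))
for s in (0, 20, 40, 702, 1404):
    print(s, lr_at(s))
```

Under these assumptions the learning rate peaks at step 40, which is less than 3% of the run, and decays smoothly for the remaining steps.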
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 2.0202 | 0.0014 | 1 | 2.0575 |
| 1.0964 | 0.1666 | 117 | 1.1130 |
| 1.0717 | 0.3332 | 234 | 1.0510 |
| 0.9258 | 0.4998 | 351 | 1.0206 |
| 1.0243 | 0.6664 | 468 | 1.0057 |
| 1.0016 | 0.8330 | 585 | 0.9957 |
| 1.0392 | 0.9996 | 702 | 0.9826 |
| 0.9533 | 1.1652 | 819 | 0.9817 |
| 0.9055 | 1.3318 | 936 | 0.9747 |
| 0.9562 | 1.4984 | 1053 | 0.9713 |
| 0.8825 | 1.6650 | 1170 | 0.9691 |
| 0.8486 | 1.8316 | 1287 | 0.9682 |
| 0.9038 | 1.9982 | 1404 | 0.9681 |
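
The validation losses above can be read as perplexities, assuming they are mean per-token cross-entropy values in nats (the usual convention for causal language model training, though the card does not state it). The sketch below converts the first and last values and measures how much the second epoch contributed; the variable and function names are illustrative.

```python
import math

# (step, validation loss) pairs copied from the training results table
VAL_LOSS = [
    (1, 2.0575), (117, 1.1130), (234, 1.0510), (351, 1.0206),
    (468, 1.0057), (585, 0.9957), (702, 0.9826), (819, 0.9817),
    (936, 0.9747), (1053, 0.9713), (1170, 0.9691), (1287, 0.9682),
    (1404, 0.9681),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


first_ppl = perplexity(VAL_LOSS[0][1])
final_ppl = perplexity(VAL_LOSS[-1][1])

# Improvement between the end of epoch 1 (step 702) and the final evaluation.
second_epoch_gain = dict(VAL_LOSS)[702] - VAL_LOSS[-1][1]

print(f"perplexity: {first_ppl:.3f} -> {final_ppl:.3f}")
print(f"validation loss gained in epoch 2: {second_epoch_gain:.4f}")
```

Read this way, the final validation perplexity is roughly 2.63, and the second epoch lowers validation loss by only about 0.0145, with the last two evaluations nearly identical.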
Framework versions
- PEFT 0.14.0
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0