See axolotl config
axolotl version: 0.6.0
unsloth_lora_mlp: true
unsloth_lora_qkv: true
unsloth_lora_o: true
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
# This can also be a relative path to a model on disk
base_model: meta-llama/Llama-3.1-8B-Instruct
# Corresponding tokenizer for the model; AutoTokenizer is a good choice
tokenizer_type: AutoTokenizer
# How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
val_set_size: 0.10
# Whether you are training a 4-bit GPTQ quantized model
# gptq: false
# This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
load_in_8bit: false
# Use bitsandbytes 4 bit
load_in_4bit: true
# Limit the memory for all available GPUs to this amount (if an integer, expressed in gigabytes); default: unset
gpu_memory_limit: 24
# Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
lora_on_cpu: true
# A list of one or more datasets to finetune the model with
datasets:
  - path: ./ts-8k.jsonl
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles_to_train: ["assistant"]
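# A record in ./ts-8k.jsonl is expected to be a single JSON line shaped roughly like the
# following (hypothetical example; with roles_to_train above, only assistant turns are
# used as training labels):
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}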
# If false, the datasets will not be shuffled and will keep their original order in `datasets`.
# The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
shuffle_merged_datasets: true
# The name of the chat template to use for training, following values are supported:
# - tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default value.
# - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py
# - tokenizer_default_fallback_*: where * is the name of the chat template to fallback to. E.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not available in the tokenizer.
# - jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field.
# The selected chat template will be saved to the tokenizer_config.json for easier inferencing
# Note: It is recommended to set train_on_inputs to true when using a chat template that is different from the model's default chat template.
chat_template: tokenizer_default
# Axolotl attempts to save the prepared dataset as Arrow files after packing the data together so
# subsequent training runs load faster; relative path
dataset_prepared_path: data/last_run_prepared
# push checkpoints to hub
#hub_model_id: # private repo path to push finetuned model
# how to push checkpoints to hub
# https://huggingface.co./docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
#hub_strategy:
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
# Required to be true when used in combination with `push_dataset_to_hub`
#hf_use_auth_token: # boolean
# Num shards for whole dataset
#dataset_shard_num:
# Index of shard to use for whole dataset
#dataset_shard_idx:
# The maximum length of an input to train with; it should not exceed the
# base model's context window
sequence_len: 1024
# Pad inputs so each step uses constant sized buffers
# This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
pad_to_sequence_len: true
# Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommended to set to 'true'
sample_packing: true
# Set to 'false' if getting errors during eval with sample_packing on.
eval_sample_packing: false
# You can set these packing optimizations AFTER starting a training run at least once.
# The trainer will provide recommended values for these settings.
# sample_packing_eff_est:
# total_num_tokens:
# Increasing the following values helps with packing, but usually only slightly (<1%).
# The number of samples packed at a time.
# sample_packing_group_size: 100000
# The number of samples which can be packed into one sequence. Increase if using a large sequence_len with many short samples.
# sample_packing_bin_size: 200
# whether to concatenate samples during pretraining
# pretraining_sample_concatenation:
# Use batch flattening for speedups when not using sample_packing
# batch_flattening:
# Passed through to transformers when loading the model when launched without accelerate
# Use `sequential` when training w/ model parallelism to limit memory
# device_map:
# Defines the max memory usage per gpu on the system. Passed through to transformers when loading the model.
# max_memory:
# If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
adapter: qlora
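# With `qlora` here plus `load_in_4bit` above, the frozen base weights stay 4-bit quantized
# while only the higher-precision LoRA adapter weights are trained.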
# If you already have a lora model trained that you want to load, put that here.
# This means after training, if you want to test the model, you should set this to the value of `output_dir`.
# Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`.
# lora_model_dir:
# LoRA hyperparameters
# For more details about the following options, see:
# https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
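# With r=8 and alpha=16 the effective LoRA scaling is alpha/r = 2, i.e. each targeted
# projection W is used as W + (alpha/r) * B @ A, where A is (r x d_in) and B is (d_out x r).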
lora_target_linear: # If true, will target all linear modules
peft_layers_to_transform: # The layer indices to transform, otherwise, apply to all layers
# If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
# For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
# `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
# https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
#lora_modules_to_save:
# - embed_tokens
# - lm_head
#lora_fan_in_fan_out: false
# LoRA+ hyperparameters
# For more details about the following options, see:
# https://arxiv.org/abs/2402.12354 and `src/axolotl/core/train_builder.py`
#loraplus_lr_ratio: # loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
#loraplus_lr_embedding: # loraplus learning rate for lora embedding layers. Default value is 1e-6.
#peft:
# Configuration options for loftq initialization for LoRA
# https://huggingface.co./docs/peft/developer_guides/quantization#loftq-initialization
# loftq_config:
# loftq_bits: 4 # typically 4 bits
# ReLoRA configuration
# Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
#relora_steps: # Number of steps per ReLoRA restart
#relora_warmup_steps: # Number of per-restart warmup steps
#relora_anneal_steps: # Number of anneal steps for each relora cycle
#relora_prune_ratio: # threshold for optimizer magnitude when pruning
#relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
# wandb configuration if you're using it
# Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
# wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: fabricator # Your wandb project name
wandb_entity: blueanode # A wandb Team name if using a Team
wandb_watch:
wandb_name: vast-finetune-r1 # Set the name of your wandb run
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: checkpoint # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
# mlflow configuration if you're using it
#mlflow_tracking_uri: # URI to mlflow
#mlflow_experiment_name: # Your experiment name
#mlflow_run_name: # Your run name
#hf_mlflow_log_artifacts: # set to true to copy each saved checkpoint on each save to mlflow artifact registry
# Where to save the finetuned model / adapter checkpoints to
output_dir: ./vast-finetune-r1
# Whether to use torch.compile and which backend to use
# setting to `auto` will enable torch compile when torch>=2.5.1
torch_compile: # Optional[Union[Literal["auto"], bool]]
torch_compile_backend: # Optional[str]
# Training hyperparameters
# If greater than 1, gradients are accumulated over the given number of micro-batches and the optimizer step is deferred until then.
gradient_accumulation_steps: 1
# The number of samples to include in each batch. This is the number of samples sent to each GPU.
# Batch size per gpu = micro_batch_size * gradient_accumulation_steps
micro_batch_size: 2
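# With micro_batch_size=2 and gradient_accumulation_steps=1, each GPU sees an effective batch
# of 2 samples per optimizer step (multiply by the number of GPUs for the global batch size).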
eval_batch_size:
num_epochs: 8
warmup_steps: 100 # cannot use with warmup_ratio
learning_rate: 0.00003
lr_quadratic_warmup:
logging_steps:
eval_steps: # Leave empty to eval at each epoch, integers for every N steps. decimal for fraction of total steps
evals_per_epoch: 4 # number of times per epoch to run evals, mutually exclusive with eval_steps
save_strategy: # Set to `"no"` to skip checkpoint saves
save_steps: # Leave empty to save at each epoch
# saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
save_total_limit: 2 # Checkpoints saved at a time
# Maximum number of iterations to train for. It takes precedence over num_epochs, which means that
# if both are set, num_epochs is not guaranteed.
# e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
# max_steps:
eval_table_size: 8 # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
eval_max_new_tokens: 256 # Total number of tokens generated for predictions sent to wandb. Default is 128
#eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", "chrf", "perplexity"]
profiler_steps: # enable the pytorch profiler to capture the first N steps of training to the output_dir.
# see https://pytorch.org/blog/understanding-gpu-memory-1/ for more information
# snapshots can be visualized @ https://pytorch.org/memory_viz
#loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
#loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
# Save model as safetensors (require safetensors package)
# save_safetensors:
# Whether to mask out or include the human's prompt from the training labels
train_on_inputs: false
bf16: auto
fp16:
tf32: false
# Group similarly sized data to minimize padding.
# May be slower to start, as it must download and sort the entire dataset.
# Note that training loss may have an oscillating pattern with this enabled.
group_by_length: false
# Whether to use gradient checkpointing https://huggingface.co./docs/transformers/v4.18.0/en/performance#gradient-checkpointing
gradient_checkpointing: false
# additional kwargs to pass to the trainer for gradient checkpointing
# gradient_checkpointing_kwargs:
# use_reentrant: true
# Stop training after this many evaluation losses have increased in a row
# https://huggingface.co./transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
# early_stopping_patience: 3
# Specify a scheduler and kwargs to use with the optimizer
#lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
lr_scheduler_kwargs:
cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
# For one_cycle optim
lr_div_factor: # Learning rate div factor
# Specify optimizer
# Valid values are driven by the Transformers OptimizerNames class, see:
# https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
#
# Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
# torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
# in the examples/ for your model and fine-tuning use case.
#
# Valid values for 'optimizer' include:
# - adamw_hf
# - adamw_torch
# - adamw_torch_fused
# - adamw_torch_xla
# - adamw_apex_fused
# - adopt_adamw (an EXPERIMENTAL optimizer, only for torch version >= 2.5.1)
# - adafactor
# - adamw_anyprecision
# - sgd
# - adagrad
# - adamw_bnb_8bit
# - lion_8bit
# - lion_32bit
# - paged_adamw_32bit
# - paged_adamw_8bit
# - paged_lion_32bit
# - paged_lion_8bit
# - galore_adamw
# - galore_adamw_8bit
# - galore_adafactor
# - galore_adamw_layerwise
# - galore_adamw_8bit_layerwise
# - galore_adafactor_layerwise
optimizer: paged_adamw_32bit
lr_scheduler: cosine
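# With warmup_steps=100 and the default HF cosine schedule, the LR ramps linearly from 0 to 3e-5
# over the first 100 steps and then decays as 0.5 * (1 + cos(pi * progress)) toward 0.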
# Dictionary of arguments to pass to the optimizer
optim_args:
# For Galore Optimizers the following optim_args are available
# rank: # type: int
# update_proj_gap # type: int
# scale # type: float
# proj_type: # type: str, default = std
# The target modules to optimize, i.e. the module names that you would like to train. Currently this is used only by the GaLore optimizers.
optim_target_modules:
# - self_attn # for llama
# - mlp
# Specify weight decay
weight_decay:
# adamw hyperparams
adam_beta1:
adam_beta2:
adam_epsilon:
# Gradient clipping max norm
max_grad_norm:
# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
neftune_noise_alpha:
# Whether to use BetterTransformer
flash_optimum:
# Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
xformers_attention:
# Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
flash_attention:
flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
# Whether to use scaled-dot-product attention
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
sdp_attention:
# Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
s2_attention:
# Resume from a specific checkpoint dir
resume_from_checkpoint:
# If resume_from_checkpoint isn't set, enable this to resume training from the most recent checkpoint in output_dir.
# Be careful with this being turned on between different models.
auto_resume_from_checkpoints: true
# Don't mess with this, it's here for accelerate and torchrun
local_rank:
# Add or change special tokens.
# If you add tokens here, you don't need to add them to the `tokens` list.
special_tokens:
  # bos_token: "<s>"
  # eos_token: "</s>"
  # unk_token: "<unk>"
  pad_token: "<|end_of_text|>"
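# Note: Llama 3 tokenizers do not define a pad token by default, so an existing special token is reused here for padding.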
# Add extra tokens.
tokens:
# FSDP
fsdp:
fsdp_config:
# Deepspeed config path. e.g., deepspeed_configs/zero3.json
deepspeed:
# Advanced DDP Arguments
ddp_timeout:
ddp_bucket_cap_mb:
ddp_broadcast_buffers:
# Path to torch distx for optim 'adamw_anyprecision'
torchdistx_path:
# Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
pretraining_dataset:
# Debug mode
debug:
# Seed
seed:
# Allow overwrite yml config using from cli
strict:
vast-finetune-r1
This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct on the ./ts-8k.jsonl dataset. It achieves the following results on the evaluation set:
- Loss: 1.1386
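For quick testing, the sketch below shows one way to run inference with the QLoRA adapter produced by this run. It is a minimal, illustrative example: the adapter is assumed to be available locally in ./vast-finetune-r1 (the output_dir from the config above), and the prompt is a placeholder.

```python
# Minimal inference sketch (illustrative; paths and prompt are assumptions).
# Loads the 4-bit base model, applies the QLoRA adapter, and generates a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_dir = "./vast-finetune-r1"  # output_dir from the axolotl config

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir)

messages = [{"role": "user", "content": "Hello!"}]  # placeholder prompt
device = next(model.parameters()).device
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```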
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- optimizer: PAGED_ADAMW (paged_adamw_32bit) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 8
Training results
| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0006 | 1    | 1.4578          |
| 0.8797        | 0.2505 | 414  | 1.0716          |
| 1.1073        | 0.5009 | 828  | 1.0543          |
| 0.9352        | 0.7514 | 1242 | 1.0344          |
| 1.0419        | 2.0024 | 1656 | 1.0315          |
| 0.9242        | 2.5030 | 2070 | 1.0270          |
| 0.8121        | 3.0024 | 2484 | 1.0251          |
| 0.7811        | 3.5030 | 2898 | 1.0463          |
| 0.8205        | 4.0048 | 3312 | 1.0431          |
| 0.7505        | 4.5054 | 3726 | 1.0653          |
| 0.6997        | 5.0085 | 4140 | 1.0701          |
| 0.78          | 5.5091 | 4554 | 1.0947          |
| 0.6445        | 6.0085 | 4968 | 1.1057          |
| 0.6848        | 6.5091 | 5382 | 1.1273          |
| 0.6173        | 7.0109 | 5796 | 1.1262          |
| 0.6861        | 7.5115 | 6210 | 1.1386          |
Framework versions
- PEFT 0.14.0
- Transformers 4.47.1
- Pytorch 2.3.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0
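If a standalone merged model is needed (see the lora_on_cpu note in the config above), the following is a rough sketch of merging the adapter into the full-precision base weights with PEFT; the paths are assumptions and the merge is done on CPU to avoid exhausting GPU memory.

```python
# Sketch: merge the LoRA adapter into the base model and save the result.
# Paths are illustrative; merging in bf16 on CPU sidesteps GPU OOM during the merge.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cpu"
)
merged = PeftModel.from_pretrained(base, "./vast-finetune-r1").merge_and_unload()
merged.save_pretrained("./vast-finetune-r1/merged", safe_serialization=True)
AutoTokenizer.from_pretrained("./vast-finetune-r1").save_pretrained("./vast-finetune-r1/merged")
```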