See axolotl config

axolotl version: 0.5.2

base_model: mistralai/Mistral-7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_config: Open-Orca/Mistral-7B-OpenOrca
tokenizer_type: AutoTokenizer
tokenizer_use_fast: false
resize_token_embeddings_to_32x: false

flash_attention: true
xformers_attention:

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: skymizer/open-orca-conversations
    type: chat_template
    field_messages: messages
    train_on_split: train

test_datasets:
  - path: skymizer/open-orca-conversations
    type: chat_template
    field_messages: messages
    split: test

hf_use_auth_token: true
dataset_prepared_path: /mnt/home/model-team/dataset/pretokenized/mistral-open-orca
output_dir: /mnt/home/model-team/models/mistral-7B-v0.1-open-orca-q-sparse

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

eval_sample_packing: false
# eval_causal_lm_metrics: ["perplexity"]

wandb_project: "axolotl_q_sparse_sft"
wandb_entity:
wandb_watch:
wandb_name: "mistral-7B-v0.1-open-orca-q-sparse"
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
eval_batch_size: 
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.000005
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 0.000001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

hub_model_id: "skymizer/mistral-7B-v0.1-open-orca-q-sparse"

save_strategy: "steps"
save_steps: 500

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1

warmup_ratio: 0.03
eval_steps: 500
eval_table_size:
eval_max_new_tokens: 2048
debug:
deepspeed: /root/train/axolotl/deepspeed_configs/zero3_bf16.json
fsdp:
fsdp_config:
seed: 42

mistral-7B-v0.1-open-orca-q-sparse

topk_sparsity all = 0.5 with relu2

This model is a fine-tuned version of mistralai/Mistral-7B-v0.1 on the None dataset. It achieves the following results on the evaluation set:

Loss: 1.8081

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 182
num_epochs: 1

Training results

Training Loss	Epoch	Step	Validation Loss
11.1181	0.0002	1	11.1302
3.3048	0.0824	500	3.2701
2.9251	0.1648	1000	2.8377
2.6088	0.2472	1500	2.5340
2.3853	0.3296	2000	2.3155
2.2076	0.4120	2500	2.1627
2.0993	0.4944	3000	2.0499
2.0122	0.5768	3500	1.9705
1.9029	0.6592	4000	1.9069
1.8822	0.7416	4500	1.8588
1.941	0.8240	5000	1.8296
1.9377	0.9064	5500	1.8117
1.8411	0.9888	6000	1.8081

Framework versions

Transformers 4.46.3
Pytorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.20.3

skymizer
/

mistral-7B-v0.1-open-orca-q-sparse

mistral-7B-v0.1-open-orca-q-sparse

Model description

Intended uses & limitations

Training and evaluation data

Training procedure

Training hyperparameters

Training results

Framework versions

Model tree for skymizer/mistral-7B-v0.1-open-orca-q-sparse

Evaluation results