---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---

<img src="image_27.png" alt="A beautiful witch writing a book with a quill">
<sub>Image by CalamitousFelicitousness</sub>

---

# This repo contains an EXL2 quantization of the original model: [allura-org/G2-9B-Sugarquill-v0](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0)

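A minimal loading sketch for the EXL2 weights, assuming exllamav2's dynamic-generator API; the local path and prompt are placeholders:

```python
# Minimal sketch: loading this EXL2 quant with exllamav2
# (pattern follows exllamav2's dynamic-generator examples;
# the model path below is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/G2-9B-Sugarquill-v0-exl2")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated as layers load
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="The lighthouse keeper", max_new_tokens=200))
```
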
# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even as-is it's already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities either.
It should be usable both for RP and raw-completion storywriting.
I originally planned to use this in a merge, but I feel like this model is interesting enough to be released on its own as well.

Model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.

GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen.
Thanks to Kearm for helping set up LLaMA-Factory on that node, and to Featherless for providing it for EVA-Qwen2.5 (and, unknowingly, this model, lol) training.

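As a rough sanity check on those numbers (my own back-of-the-envelope, assuming all 8 GPUs ran data-parallel with the config below): the effective batch size is 1 × 8 × 8 = 64 packed sequences of 8192 tokens, i.e. ~524k tokens per optimizer step, so ~18.7M tokens works out to roughly 36 steps per epoch, or about 71 steps for the full 2-epoch run.
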
**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn><eos>
```

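If you'd rather not assemble the tags by hand, here's a minimal sketch using transformers' chat template (assuming the tokenizer from the original repo; the message text is just an example):

```python
# Minimal sketch: building the Gemma instruct prompt via the tokenizer's
# chat template instead of writing the turn tags manually.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allura-org/G2-9B-Sugarquill-v0")
messages = [{"role": "user", "content": "Write a short scene set in a lighthouse."}]

# add_generation_prompt=True appends the `<start_of_turn>model` header
# so generation continues as the model's response.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```
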
**Training config**
<details><summary>See LLaMA-Factory config</summary>

```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative - n first layers to train
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct, if training on base - use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json which is required for HQQ on multigpu

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fallback to eager
enable_liger_kernel: true # Pretty much must have if it works
#use_unsloth: true # May not work with multigpu idk
#use_adam_mini: true # Comment optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false
```

</details>
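
A couple of notes on the config, as my own gloss rather than anything from the original card: it should be launchable with LLaMA-Factory's CLI, e.g. `llamafactory-cli train sugarquill.yaml` (the filename is a placeholder). Also, `use_rslora: true` switches the adapter scaling factor from α/r to α/√r, so with `lora_rank: 64` and `lora_alpha: 32` the effective scale is 32/√64 = 4 instead of 0.5, which is the rank stabilization that rsLoRA is named for.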