kyryl-georgian committed
Commit 360860b
1 parent: b4ab127

Upload folder using huggingface_hub

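The commit message above is the default that `huggingface_hub` writes when a folder is pushed with `upload_folder`. Below is a minimal sketch of the kind of call that produces such a commit; the local path and repo id are placeholders, not taken from this commit.

```python
# Sketch only: the kind of call that yields an "Upload folder using huggingface_hub"
# commit. The folder path and repo id below are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="path/to/trainer_output_dir",   # local folder with README.md, adapter files, logs
    repo_id="your-username/your-adapter-repo",  # placeholder repo id
    repo_type="model",
)
```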
README.md ADDED
@@ -0,0 +1,204 @@
+ ---
+ library_name: peft
+ base_model: google/flan-t5-base
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+
+
+ ### Framework versions
+
+ - PEFT 0.7.1
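The card's "How to Get Started with the Model" section is still a placeholder. Below is a minimal, hedged sketch of how an adapter described by the `adapter_config.json` in this commit (base model `google/flan-t5-base`, task type `SEQ_2_SEQ_LM`) is typically loaded with `peft` and `transformers`; the adapter path and the example prompt are placeholders, since the card states neither the repo id nor the fine-tuning task.

```python
# Hedged sketch, not taken from the model card: load the LoRA adapter on top of
# google/flan-t5-base. ADAPTER_PATH is a placeholder for this repo's id or a
# local clone of it.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "google/flan-t5-base"      # from adapter_config.json
ADAPTER_PATH = "path/to/this/adapter"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)  # reads adapter_config.json + adapter_model.safetensors
model.eval()

# The fine-tuning task is not documented, so this prompt is only illustrative.
inputs = tokenizer("Summarize: PEFT adapters add small trainable matrices to a frozen base model.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```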
adapter_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "google/flan-t5-base",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_dropout": 0.2,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v",
+     "q"
+   ],
+   "task_type": "SEQ_2_SEQ_LM"
+ }
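For readers reconstructing the training setup, the configuration above corresponds roughly to the `peft.LoraConfig` sketched below. This is an illustration, not the author's training script; only the fields that appear in `adapter_config.json` are grounded, the rest are library defaults.

```python
# Illustrative reconstruction of the LoRA setup implied by adapter_config.json.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # LoRA rank
    lora_alpha=32,              # effective scaling = lora_alpha / r = 2
    lora_dropout=0.2,
    bias="none",
    target_modules=["q", "v"],  # T5 attention query/value projections
)

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # roughly 1.8M trainable out of ~250M total
```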
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9b3296fcff514fee3ff0d0fd95872e9945028e0a5171922fe4e120f7bc1da6fb
+ size 7098016
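The 7,098,016-byte adapter is consistent with the LoRA settings above. Assuming flan-t5-base's published dimensions (d_model = 768; 12 encoder self-attention, 12 decoder self-attention, and 12 decoder cross-attention blocks, each contributing one `q` and one `v` projection, i.e. 72 adapted linear layers), the fp32 parameter count works out as roughly:

$$
72 \times \underbrace{16 \times (768 + 768)}_{r\,(d_{\mathrm{in}} + d_{\mathrm{out}})} = 1{,}769{,}472 \ \text{parameters}
\;\Rightarrow\; 1{,}769{,}472 \times 4 \ \text{bytes} \approx 7.08\ \text{MB},
$$

with the remaining ~20 KB plausibly taken up by the safetensors header and tensor metadata.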
all_results.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "epoch": 10.0,
+   "eval_loss": 0.05540305748581886,
+   "eval_runtime": 27.2595,
+   "eval_samples_per_second": 288.266,
+   "eval_steps_per_second": 18.049,
+   "train_loss": 0.09247852528257068,
+   "train_runtime": 8790.0948,
+   "train_samples_per_second": 80.453,
+   "train_steps_per_second": 5.028
+ }
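If `eval_loss` here is the usual mean token-level cross-entropy in nats (the standard behaviour of the Trainer's evaluation loop, though the card does not confirm it), the implied validation perplexity is

$$
\mathrm{PPL} = e^{\,0.0554} \approx 1.057 .
$$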
eval_results.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "epoch": 10.0,
+   "eval_loss": 0.05540305748581886,
+   "eval_runtime": 27.2595,
+   "eval_samples_per_second": 288.266,
+   "eval_steps_per_second": 18.049
+ }
train_results.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "epoch": 10.0,
+   "train_loss": 0.09247852528257068,
+   "train_runtime": 8790.0948,
+   "train_samples_per_second": 80.453,
+   "train_steps_per_second": 5.028
+ }
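As a sanity check, these throughput figures are internally consistent with the step count and batch size recorded in `trainer_state.json` below (44,200 optimizer steps, `train_batch_size` 16, 10 epochs):

$$
\begin{aligned}
8790.09\ \text{s} \times 5.028\ \text{steps/s} &\approx 44{,}200\ \text{steps},\\
80.453\ \text{samples/s} \,/\, 5.028\ \text{steps/s} &\approx 16\ \text{samples per step},\\
44{,}200 \times 16 \,/\, 10\ \text{epochs} &\approx 70{,}720\ \text{training examples per epoch}.
\end{aligned}
$$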
trainer_state.json ADDED
@@ -0,0 +1,1350 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 10.0,
5
+ "eval_steps": 500,
6
+ "global_step": 44200,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.11,
13
+ "grad_norm": 0.4132317900657654,
14
+ "learning_rate": 0.0009886877828054299,
15
+ "loss": 0.285,
16
+ "step": 500
17
+ },
18
+ {
19
+ "epoch": 0.11,
20
+ "eval_loss": 0.12178385257720947,
21
+ "eval_runtime": 27.3061,
22
+ "eval_samples_per_second": 287.774,
23
+ "eval_steps_per_second": 18.018,
24
+ "step": 500
25
+ },
26
+ {
27
+ "epoch": 0.23,
28
+ "grad_norm": 0.4537642300128937,
29
+ "learning_rate": 0.0009773755656108597,
30
+ "loss": 0.1782,
31
+ "step": 1000
32
+ },
33
+ {
34
+ "epoch": 0.23,
35
+ "eval_loss": 0.10424336045980453,
36
+ "eval_runtime": 27.2202,
37
+ "eval_samples_per_second": 288.683,
38
+ "eval_steps_per_second": 18.075,
39
+ "step": 1000
40
+ },
41
+ {
42
+ "epoch": 0.34,
43
+ "grad_norm": 0.4927336871623993,
44
+ "learning_rate": 0.0009660633484162896,
45
+ "loss": 0.1623,
46
+ "step": 1500
47
+ },
48
+ {
49
+ "epoch": 0.34,
50
+ "eval_loss": 0.1000167727470398,
51
+ "eval_runtime": 27.2235,
52
+ "eval_samples_per_second": 288.648,
53
+ "eval_steps_per_second": 18.073,
54
+ "step": 1500
55
+ },
56
+ {
57
+ "epoch": 0.45,
58
+ "grad_norm": 0.3784632682800293,
59
+ "learning_rate": 0.0009547511312217196,
60
+ "loss": 0.1487,
61
+ "step": 2000
62
+ },
63
+ {
64
+ "epoch": 0.45,
65
+ "eval_loss": 0.10203839093446732,
66
+ "eval_runtime": 27.2422,
67
+ "eval_samples_per_second": 288.45,
68
+ "eval_steps_per_second": 18.06,
69
+ "step": 2000
70
+ },
71
+ {
72
+ "epoch": 0.57,
73
+ "grad_norm": 0.4409666955471039,
74
+ "learning_rate": 0.0009434389140271493,
75
+ "loss": 0.1419,
76
+ "step": 2500
77
+ },
78
+ {
79
+ "epoch": 0.57,
80
+ "eval_loss": 0.0881214290857315,
81
+ "eval_runtime": 27.2438,
82
+ "eval_samples_per_second": 288.432,
83
+ "eval_steps_per_second": 18.059,
84
+ "step": 2500
85
+ },
86
+ {
87
+ "epoch": 0.68,
88
+ "grad_norm": 0.33156296610832214,
89
+ "learning_rate": 0.0009321266968325792,
90
+ "loss": 0.1371,
91
+ "step": 3000
92
+ },
93
+ {
94
+ "epoch": 0.68,
95
+ "eval_loss": 0.08762603253126144,
96
+ "eval_runtime": 27.2331,
97
+ "eval_samples_per_second": 288.546,
98
+ "eval_steps_per_second": 18.066,
99
+ "step": 3000
100
+ },
101
+ {
102
+ "epoch": 0.79,
103
+ "grad_norm": 0.26063305139541626,
104
+ "learning_rate": 0.000920814479638009,
105
+ "loss": 0.1366,
106
+ "step": 3500
107
+ },
108
+ {
109
+ "epoch": 0.79,
110
+ "eval_loss": 0.08680489659309387,
111
+ "eval_runtime": 27.2296,
112
+ "eval_samples_per_second": 288.583,
113
+ "eval_steps_per_second": 18.069,
114
+ "step": 3500
115
+ },
116
+ {
117
+ "epoch": 0.9,
118
+ "grad_norm": 0.6152302622795105,
119
+ "learning_rate": 0.0009095022624434389,
120
+ "loss": 0.1288,
121
+ "step": 4000
122
+ },
123
+ {
124
+ "epoch": 0.9,
125
+ "eval_loss": 0.08397215604782104,
126
+ "eval_runtime": 27.259,
127
+ "eval_samples_per_second": 288.272,
128
+ "eval_steps_per_second": 18.049,
129
+ "step": 4000
130
+ },
131
+ {
132
+ "epoch": 1.02,
133
+ "grad_norm": 0.20703616738319397,
134
+ "learning_rate": 0.0008981900452488689,
135
+ "loss": 0.1329,
136
+ "step": 4500
137
+ },
138
+ {
139
+ "epoch": 1.02,
140
+ "eval_loss": 0.0818256288766861,
141
+ "eval_runtime": 27.2728,
142
+ "eval_samples_per_second": 288.126,
143
+ "eval_steps_per_second": 18.04,
144
+ "step": 4500
145
+ },
146
+ {
147
+ "epoch": 1.13,
148
+ "grad_norm": 0.47000011801719666,
149
+ "learning_rate": 0.0008868778280542986,
150
+ "loss": 0.1221,
151
+ "step": 5000
152
+ },
153
+ {
154
+ "epoch": 1.13,
155
+ "eval_loss": 0.08076580613851547,
156
+ "eval_runtime": 27.2281,
157
+ "eval_samples_per_second": 288.599,
158
+ "eval_steps_per_second": 18.07,
159
+ "step": 5000
160
+ },
161
+ {
162
+ "epoch": 1.24,
163
+ "grad_norm": 0.3874566853046417,
164
+ "learning_rate": 0.0008755656108597285,
165
+ "loss": 0.1202,
166
+ "step": 5500
167
+ },
168
+ {
169
+ "epoch": 1.24,
170
+ "eval_loss": 0.08285848051309586,
171
+ "eval_runtime": 27.2548,
172
+ "eval_samples_per_second": 288.316,
173
+ "eval_steps_per_second": 18.052,
174
+ "step": 5500
175
+ },
176
+ {
177
+ "epoch": 1.36,
178
+ "grad_norm": 0.35874509811401367,
179
+ "learning_rate": 0.0008642533936651585,
180
+ "loss": 0.1186,
181
+ "step": 6000
182
+ },
183
+ {
184
+ "epoch": 1.36,
185
+ "eval_loss": 0.081941157579422,
186
+ "eval_runtime": 27.2605,
187
+ "eval_samples_per_second": 288.256,
188
+ "eval_steps_per_second": 18.048,
189
+ "step": 6000
190
+ },
191
+ {
192
+ "epoch": 1.47,
193
+ "grad_norm": 0.34497329592704773,
194
+ "learning_rate": 0.0008529411764705882,
195
+ "loss": 0.1163,
196
+ "step": 6500
197
+ },
198
+ {
199
+ "epoch": 1.47,
200
+ "eval_loss": 0.07750312983989716,
201
+ "eval_runtime": 27.2737,
202
+ "eval_samples_per_second": 288.116,
203
+ "eval_steps_per_second": 18.039,
204
+ "step": 6500
205
+ },
206
+ {
207
+ "epoch": 1.58,
208
+ "grad_norm": 0.34537777304649353,
209
+ "learning_rate": 0.0008416289592760181,
210
+ "loss": 0.1213,
211
+ "step": 7000
212
+ },
213
+ {
214
+ "epoch": 1.58,
215
+ "eval_loss": 0.0756232738494873,
216
+ "eval_runtime": 27.2421,
217
+ "eval_samples_per_second": 288.451,
218
+ "eval_steps_per_second": 18.06,
219
+ "step": 7000
220
+ },
221
+ {
222
+ "epoch": 1.7,
223
+ "grad_norm": 0.31827136874198914,
224
+ "learning_rate": 0.000830316742081448,
225
+ "loss": 0.1169,
226
+ "step": 7500
227
+ },
228
+ {
229
+ "epoch": 1.7,
230
+ "eval_loss": 0.07395777106285095,
231
+ "eval_runtime": 27.2385,
232
+ "eval_samples_per_second": 288.489,
233
+ "eval_steps_per_second": 18.063,
234
+ "step": 7500
235
+ },
236
+ {
237
+ "epoch": 1.81,
238
+ "grad_norm": 0.43441176414489746,
239
+ "learning_rate": 0.0008190045248868778,
240
+ "loss": 0.1151,
241
+ "step": 8000
242
+ },
243
+ {
244
+ "epoch": 1.81,
245
+ "eval_loss": 0.08032752573490143,
246
+ "eval_runtime": 27.2392,
247
+ "eval_samples_per_second": 288.481,
248
+ "eval_steps_per_second": 18.062,
249
+ "step": 8000
250
+ },
251
+ {
252
+ "epoch": 1.92,
253
+ "grad_norm": 0.403935968875885,
254
+ "learning_rate": 0.0008076923076923078,
255
+ "loss": 0.1162,
256
+ "step": 8500
257
+ },
258
+ {
259
+ "epoch": 1.92,
260
+ "eval_loss": 0.07349220663309097,
261
+ "eval_runtime": 27.2543,
262
+ "eval_samples_per_second": 288.322,
263
+ "eval_steps_per_second": 18.052,
264
+ "step": 8500
265
+ },
266
+ {
267
+ "epoch": 2.04,
268
+ "grad_norm": 0.2286953330039978,
269
+ "learning_rate": 0.0007963800904977375,
270
+ "loss": 0.1157,
271
+ "step": 9000
272
+ },
273
+ {
274
+ "epoch": 2.04,
275
+ "eval_loss": 0.07655400782823563,
276
+ "eval_runtime": 27.2562,
277
+ "eval_samples_per_second": 288.301,
278
+ "eval_steps_per_second": 18.051,
279
+ "step": 9000
280
+ },
281
+ {
282
+ "epoch": 2.15,
283
+ "grad_norm": 0.2294893115758896,
284
+ "learning_rate": 0.0007850678733031674,
285
+ "loss": 0.1121,
286
+ "step": 9500
287
+ },
288
+ {
289
+ "epoch": 2.15,
290
+ "eval_loss": 0.07088885456323624,
291
+ "eval_runtime": 27.2467,
292
+ "eval_samples_per_second": 288.402,
293
+ "eval_steps_per_second": 18.057,
294
+ "step": 9500
295
+ },
296
+ {
297
+ "epoch": 2.26,
298
+ "grad_norm": 0.44981372356414795,
299
+ "learning_rate": 0.0007737556561085974,
300
+ "loss": 0.1073,
301
+ "step": 10000
302
+ },
303
+ {
304
+ "epoch": 2.26,
305
+ "eval_loss": 0.07331141084432602,
306
+ "eval_runtime": 27.2427,
307
+ "eval_samples_per_second": 288.445,
308
+ "eval_steps_per_second": 18.06,
309
+ "step": 10000
310
+ },
311
+ {
312
+ "epoch": 2.38,
313
+ "grad_norm": 0.4742676019668579,
314
+ "learning_rate": 0.0007624434389140271,
315
+ "loss": 0.1063,
316
+ "step": 10500
317
+ },
318
+ {
319
+ "epoch": 2.38,
320
+ "eval_loss": 0.07561534643173218,
321
+ "eval_runtime": 27.2233,
322
+ "eval_samples_per_second": 288.65,
323
+ "eval_steps_per_second": 18.073,
324
+ "step": 10500
325
+ },
326
+ {
327
+ "epoch": 2.49,
328
+ "grad_norm": 0.4676545262336731,
329
+ "learning_rate": 0.0007511312217194571,
330
+ "loss": 0.1109,
331
+ "step": 11000
332
+ },
333
+ {
334
+ "epoch": 2.49,
335
+ "eval_loss": 0.07197986543178558,
336
+ "eval_runtime": 27.2211,
337
+ "eval_samples_per_second": 288.673,
338
+ "eval_steps_per_second": 18.074,
339
+ "step": 11000
340
+ },
341
+ {
342
+ "epoch": 2.6,
343
+ "grad_norm": 0.5688673257827759,
344
+ "learning_rate": 0.0007398190045248869,
345
+ "loss": 0.1072,
346
+ "step": 11500
347
+ },
348
+ {
349
+ "epoch": 2.6,
350
+ "eval_loss": 0.07261210680007935,
351
+ "eval_runtime": 27.2328,
352
+ "eval_samples_per_second": 288.549,
353
+ "eval_steps_per_second": 18.066,
354
+ "step": 11500
355
+ },
356
+ {
357
+ "epoch": 2.71,
358
+ "grad_norm": 0.24911655485630035,
359
+ "learning_rate": 0.0007285067873303167,
360
+ "loss": 0.1055,
361
+ "step": 12000
362
+ },
363
+ {
364
+ "epoch": 2.71,
365
+ "eval_loss": 0.06898481398820877,
366
+ "eval_runtime": 27.2378,
367
+ "eval_samples_per_second": 288.496,
368
+ "eval_steps_per_second": 18.063,
369
+ "step": 12000
370
+ },
371
+ {
372
+ "epoch": 2.83,
373
+ "grad_norm": 0.4301845133304596,
374
+ "learning_rate": 0.0007171945701357467,
375
+ "loss": 0.1004,
376
+ "step": 12500
377
+ },
378
+ {
379
+ "epoch": 2.83,
380
+ "eval_loss": 0.06929654628038406,
381
+ "eval_runtime": 27.2358,
382
+ "eval_samples_per_second": 288.517,
383
+ "eval_steps_per_second": 18.064,
384
+ "step": 12500
385
+ },
386
+ {
387
+ "epoch": 2.94,
388
+ "grad_norm": 0.4303476810455322,
389
+ "learning_rate": 0.0007058823529411765,
390
+ "loss": 0.0995,
391
+ "step": 13000
392
+ },
393
+ {
394
+ "epoch": 2.94,
395
+ "eval_loss": 0.06872580200433731,
396
+ "eval_runtime": 27.2296,
397
+ "eval_samples_per_second": 288.583,
398
+ "eval_steps_per_second": 18.069,
399
+ "step": 13000
400
+ },
401
+ {
402
+ "epoch": 3.05,
403
+ "grad_norm": 0.3978405296802521,
404
+ "learning_rate": 0.0006945701357466064,
405
+ "loss": 0.0999,
406
+ "step": 13500
407
+ },
408
+ {
409
+ "epoch": 3.05,
410
+ "eval_loss": 0.06932587921619415,
411
+ "eval_runtime": 27.2271,
412
+ "eval_samples_per_second": 288.609,
413
+ "eval_steps_per_second": 18.07,
414
+ "step": 13500
415
+ },
416
+ {
417
+ "epoch": 3.17,
418
+ "grad_norm": 0.26857316493988037,
419
+ "learning_rate": 0.0006832579185520362,
420
+ "loss": 0.0959,
421
+ "step": 14000
422
+ },
423
+ {
424
+ "epoch": 3.17,
425
+ "eval_loss": 0.07186341285705566,
426
+ "eval_runtime": 27.231,
427
+ "eval_samples_per_second": 288.569,
428
+ "eval_steps_per_second": 18.068,
429
+ "step": 14000
430
+ },
431
+ {
432
+ "epoch": 3.28,
433
+ "grad_norm": 0.4276795983314514,
434
+ "learning_rate": 0.0006719457013574661,
435
+ "loss": 0.0982,
436
+ "step": 14500
437
+ },
438
+ {
439
+ "epoch": 3.28,
440
+ "eval_loss": 0.07152236998081207,
441
+ "eval_runtime": 27.2215,
442
+ "eval_samples_per_second": 288.669,
443
+ "eval_steps_per_second": 18.074,
444
+ "step": 14500
445
+ },
446
+ {
447
+ "epoch": 3.39,
448
+ "grad_norm": 0.41015538573265076,
449
+ "learning_rate": 0.000660633484162896,
450
+ "loss": 0.0969,
451
+ "step": 15000
452
+ },
453
+ {
454
+ "epoch": 3.39,
455
+ "eval_loss": 0.07108399271965027,
456
+ "eval_runtime": 27.2286,
457
+ "eval_samples_per_second": 288.594,
458
+ "eval_steps_per_second": 18.069,
459
+ "step": 15000
460
+ },
461
+ {
462
+ "epoch": 3.51,
463
+ "grad_norm": 0.180690735578537,
464
+ "learning_rate": 0.0006493212669683258,
465
+ "loss": 0.0995,
466
+ "step": 15500
467
+ },
468
+ {
469
+ "epoch": 3.51,
470
+ "eval_loss": 0.06466764211654663,
471
+ "eval_runtime": 27.2483,
472
+ "eval_samples_per_second": 288.385,
473
+ "eval_steps_per_second": 18.056,
474
+ "step": 15500
475
+ },
476
+ {
477
+ "epoch": 3.62,
478
+ "grad_norm": 0.2916184067726135,
479
+ "learning_rate": 0.0006380090497737556,
480
+ "loss": 0.0962,
481
+ "step": 16000
482
+ },
483
+ {
484
+ "epoch": 3.62,
485
+ "eval_loss": 0.06967472285032272,
486
+ "eval_runtime": 27.2534,
487
+ "eval_samples_per_second": 288.331,
488
+ "eval_steps_per_second": 18.053,
489
+ "step": 16000
490
+ },
491
+ {
492
+ "epoch": 3.73,
493
+ "grad_norm": 0.444690465927124,
494
+ "learning_rate": 0.0006266968325791855,
495
+ "loss": 0.0959,
496
+ "step": 16500
497
+ },
498
+ {
499
+ "epoch": 3.73,
500
+ "eval_loss": 0.06753501296043396,
501
+ "eval_runtime": 27.2523,
502
+ "eval_samples_per_second": 288.343,
503
+ "eval_steps_per_second": 18.054,
504
+ "step": 16500
505
+ },
506
+ {
507
+ "epoch": 3.85,
508
+ "grad_norm": 0.3559369146823883,
509
+ "learning_rate": 0.0006153846153846154,
510
+ "loss": 0.0949,
511
+ "step": 17000
512
+ },
513
+ {
514
+ "epoch": 3.85,
515
+ "eval_loss": 0.06987947970628738,
516
+ "eval_runtime": 27.2447,
517
+ "eval_samples_per_second": 288.423,
518
+ "eval_steps_per_second": 18.059,
519
+ "step": 17000
520
+ },
521
+ {
522
+ "epoch": 3.96,
523
+ "grad_norm": 0.3376706838607788,
524
+ "learning_rate": 0.0006040723981900453,
525
+ "loss": 0.096,
526
+ "step": 17500
527
+ },
528
+ {
529
+ "epoch": 3.96,
530
+ "eval_loss": 0.06431511789560318,
531
+ "eval_runtime": 27.2316,
532
+ "eval_samples_per_second": 288.562,
533
+ "eval_steps_per_second": 18.067,
534
+ "step": 17500
535
+ },
536
+ {
537
+ "epoch": 4.07,
538
+ "grad_norm": 0.4778081476688385,
539
+ "learning_rate": 0.0005927601809954751,
540
+ "loss": 0.0916,
541
+ "step": 18000
542
+ },
543
+ {
544
+ "epoch": 4.07,
545
+ "eval_loss": 0.06719387322664261,
546
+ "eval_runtime": 27.1935,
547
+ "eval_samples_per_second": 288.967,
548
+ "eval_steps_per_second": 18.093,
549
+ "step": 18000
550
+ },
551
+ {
552
+ "epoch": 4.19,
553
+ "grad_norm": 0.6138429641723633,
554
+ "learning_rate": 0.000581447963800905,
555
+ "loss": 0.0887,
556
+ "step": 18500
557
+ },
558
+ {
559
+ "epoch": 4.19,
560
+ "eval_loss": 0.06378566473722458,
561
+ "eval_runtime": 27.2223,
562
+ "eval_samples_per_second": 288.661,
563
+ "eval_steps_per_second": 18.073,
564
+ "step": 18500
565
+ },
566
+ {
567
+ "epoch": 4.3,
568
+ "grad_norm": 0.48502928018569946,
569
+ "learning_rate": 0.0005701357466063349,
570
+ "loss": 0.0902,
571
+ "step": 19000
572
+ },
573
+ {
574
+ "epoch": 4.3,
575
+ "eval_loss": 0.06466159969568253,
576
+ "eval_runtime": 27.224,
577
+ "eval_samples_per_second": 288.642,
578
+ "eval_steps_per_second": 18.072,
579
+ "step": 19000
580
+ },
581
+ {
582
+ "epoch": 4.41,
583
+ "grad_norm": 0.28751033544540405,
584
+ "learning_rate": 0.0005588235294117647,
585
+ "loss": 0.089,
586
+ "step": 19500
587
+ },
588
+ {
589
+ "epoch": 4.41,
590
+ "eval_loss": 0.06292453408241272,
591
+ "eval_runtime": 27.2238,
592
+ "eval_samples_per_second": 288.644,
593
+ "eval_steps_per_second": 18.072,
594
+ "step": 19500
595
+ },
596
+ {
597
+ "epoch": 4.52,
598
+ "grad_norm": 0.2429145723581314,
599
+ "learning_rate": 0.0005475113122171947,
600
+ "loss": 0.0881,
601
+ "step": 20000
602
+ },
603
+ {
604
+ "epoch": 4.52,
605
+ "eval_loss": 0.0646950751543045,
606
+ "eval_runtime": 27.2322,
607
+ "eval_samples_per_second": 288.555,
608
+ "eval_steps_per_second": 18.067,
609
+ "step": 20000
610
+ },
611
+ {
612
+ "epoch": 4.64,
613
+ "grad_norm": 0.13486433029174805,
614
+ "learning_rate": 0.0005361990950226244,
615
+ "loss": 0.0875,
616
+ "step": 20500
617
+ },
618
+ {
619
+ "epoch": 4.64,
620
+ "eval_loss": 0.06334567815065384,
621
+ "eval_runtime": 27.2229,
622
+ "eval_samples_per_second": 288.654,
623
+ "eval_steps_per_second": 18.073,
624
+ "step": 20500
625
+ },
626
+ {
627
+ "epoch": 4.75,
628
+ "grad_norm": 0.2922358512878418,
629
+ "learning_rate": 0.0005248868778280543,
630
+ "loss": 0.0894,
631
+ "step": 21000
632
+ },
633
+ {
634
+ "epoch": 4.75,
635
+ "eval_loss": 0.06537148356437683,
636
+ "eval_runtime": 27.2312,
637
+ "eval_samples_per_second": 288.566,
638
+ "eval_steps_per_second": 18.068,
639
+ "step": 21000
640
+ },
641
+ {
642
+ "epoch": 4.86,
643
+ "grad_norm": 0.22684411704540253,
644
+ "learning_rate": 0.0005135746606334842,
645
+ "loss": 0.0901,
646
+ "step": 21500
647
+ },
648
+ {
649
+ "epoch": 4.86,
650
+ "eval_loss": 0.06314302235841751,
651
+ "eval_runtime": 27.237,
652
+ "eval_samples_per_second": 288.504,
653
+ "eval_steps_per_second": 18.064,
654
+ "step": 21500
655
+ },
656
+ {
657
+ "epoch": 4.98,
658
+ "grad_norm": 0.641290545463562,
659
+ "learning_rate": 0.000502262443438914,
660
+ "loss": 0.0898,
661
+ "step": 22000
662
+ },
663
+ {
664
+ "epoch": 4.98,
665
+ "eval_loss": 0.06266883760690689,
666
+ "eval_runtime": 27.2238,
667
+ "eval_samples_per_second": 288.645,
668
+ "eval_steps_per_second": 18.072,
669
+ "step": 22000
670
+ },
671
+ {
672
+ "epoch": 5.09,
673
+ "grad_norm": 0.31225764751434326,
674
+ "learning_rate": 0.0004909502262443439,
675
+ "loss": 0.0813,
676
+ "step": 22500
677
+ },
678
+ {
679
+ "epoch": 5.09,
680
+ "eval_loss": 0.06273192167282104,
681
+ "eval_runtime": 27.2277,
682
+ "eval_samples_per_second": 288.603,
683
+ "eval_steps_per_second": 18.07,
684
+ "step": 22500
685
+ },
686
+ {
687
+ "epoch": 5.2,
688
+ "grad_norm": 0.44664525985717773,
689
+ "learning_rate": 0.0004796380090497738,
690
+ "loss": 0.083,
691
+ "step": 23000
692
+ },
693
+ {
694
+ "epoch": 5.2,
695
+ "eval_loss": 0.06290117651224136,
696
+ "eval_runtime": 27.2049,
697
+ "eval_samples_per_second": 288.845,
698
+ "eval_steps_per_second": 18.085,
699
+ "step": 23000
700
+ },
701
+ {
702
+ "epoch": 5.32,
703
+ "grad_norm": 0.1560264378786087,
704
+ "learning_rate": 0.00046832579185520365,
705
+ "loss": 0.0833,
706
+ "step": 23500
707
+ },
708
+ {
709
+ "epoch": 5.32,
710
+ "eval_loss": 0.06229640915989876,
711
+ "eval_runtime": 27.2246,
712
+ "eval_samples_per_second": 288.636,
713
+ "eval_steps_per_second": 18.072,
714
+ "step": 23500
715
+ },
716
+ {
717
+ "epoch": 5.43,
718
+ "grad_norm": 0.11389543116092682,
719
+ "learning_rate": 0.00045701357466063346,
720
+ "loss": 0.083,
721
+ "step": 24000
722
+ },
723
+ {
724
+ "epoch": 5.43,
725
+ "eval_loss": 0.06498704105615616,
726
+ "eval_runtime": 27.2302,
727
+ "eval_samples_per_second": 288.576,
728
+ "eval_steps_per_second": 18.068,
729
+ "step": 24000
730
+ },
731
+ {
732
+ "epoch": 5.54,
733
+ "grad_norm": 0.6757131814956665,
734
+ "learning_rate": 0.0004457013574660634,
735
+ "loss": 0.0825,
736
+ "step": 24500
737
+ },
738
+ {
739
+ "epoch": 5.54,
740
+ "eval_loss": 0.06173526123166084,
741
+ "eval_runtime": 27.2094,
742
+ "eval_samples_per_second": 288.798,
743
+ "eval_steps_per_second": 18.082,
744
+ "step": 24500
745
+ },
746
+ {
747
+ "epoch": 5.66,
748
+ "grad_norm": 0.2726614475250244,
749
+ "learning_rate": 0.00043438914027149324,
750
+ "loss": 0.0829,
751
+ "step": 25000
752
+ },
753
+ {
754
+ "epoch": 5.66,
755
+ "eval_loss": 0.060302384197711945,
756
+ "eval_runtime": 27.2049,
757
+ "eval_samples_per_second": 288.845,
758
+ "eval_steps_per_second": 18.085,
759
+ "step": 25000
760
+ },
761
+ {
762
+ "epoch": 5.77,
763
+ "grad_norm": 0.8743285536766052,
764
+ "learning_rate": 0.0004230769230769231,
765
+ "loss": 0.0818,
766
+ "step": 25500
767
+ },
768
+ {
769
+ "epoch": 5.77,
770
+ "eval_loss": 0.062085919082164764,
771
+ "eval_runtime": 27.2011,
772
+ "eval_samples_per_second": 288.885,
773
+ "eval_steps_per_second": 18.087,
774
+ "step": 25500
775
+ },
776
+ {
777
+ "epoch": 5.88,
778
+ "grad_norm": 0.2872491478919983,
779
+ "learning_rate": 0.0004117647058823529,
780
+ "loss": 0.0807,
781
+ "step": 26000
782
+ },
783
+ {
784
+ "epoch": 5.88,
785
+ "eval_loss": 0.059214599430561066,
786
+ "eval_runtime": 27.2158,
787
+ "eval_samples_per_second": 288.73,
788
+ "eval_steps_per_second": 18.078,
789
+ "step": 26000
790
+ },
791
+ {
792
+ "epoch": 6.0,
793
+ "grad_norm": 0.5603688955307007,
794
+ "learning_rate": 0.0004004524886877828,
795
+ "loss": 0.082,
796
+ "step": 26500
797
+ },
798
+ {
799
+ "epoch": 6.0,
800
+ "eval_loss": 0.05830477178096771,
801
+ "eval_runtime": 27.2102,
802
+ "eval_samples_per_second": 288.788,
803
+ "eval_steps_per_second": 18.081,
804
+ "step": 26500
805
+ },
806
+ {
807
+ "epoch": 6.11,
808
+ "grad_norm": 0.4404628574848175,
809
+ "learning_rate": 0.0003891402714932127,
810
+ "loss": 0.0763,
811
+ "step": 27000
812
+ },
813
+ {
814
+ "epoch": 6.11,
815
+ "eval_loss": 0.05895010381937027,
816
+ "eval_runtime": 27.2169,
817
+ "eval_samples_per_second": 288.718,
818
+ "eval_steps_per_second": 18.077,
819
+ "step": 27000
820
+ },
821
+ {
822
+ "epoch": 6.22,
823
+ "grad_norm": 0.27021318674087524,
824
+ "learning_rate": 0.00037782805429864254,
825
+ "loss": 0.0781,
826
+ "step": 27500
827
+ },
828
+ {
829
+ "epoch": 6.22,
830
+ "eval_loss": 0.06117743253707886,
831
+ "eval_runtime": 27.2077,
832
+ "eval_samples_per_second": 288.815,
833
+ "eval_steps_per_second": 18.083,
834
+ "step": 27500
835
+ },
836
+ {
837
+ "epoch": 6.33,
838
+ "grad_norm": 0.5952714681625366,
839
+ "learning_rate": 0.0003665158371040724,
840
+ "loss": 0.077,
841
+ "step": 28000
842
+ },
843
+ {
844
+ "epoch": 6.33,
845
+ "eval_loss": 0.06172608584165573,
846
+ "eval_runtime": 27.2143,
847
+ "eval_samples_per_second": 288.745,
848
+ "eval_steps_per_second": 18.079,
849
+ "step": 28000
850
+ },
851
+ {
852
+ "epoch": 6.45,
853
+ "grad_norm": 0.11397124826908112,
854
+ "learning_rate": 0.00035520361990950226,
855
+ "loss": 0.0763,
856
+ "step": 28500
857
+ },
858
+ {
859
+ "epoch": 6.45,
860
+ "eval_loss": 0.06007913500070572,
861
+ "eval_runtime": 27.1971,
862
+ "eval_samples_per_second": 288.928,
863
+ "eval_steps_per_second": 18.09,
864
+ "step": 28500
865
+ },
866
+ {
867
+ "epoch": 6.56,
868
+ "grad_norm": 0.18584699928760529,
869
+ "learning_rate": 0.0003438914027149321,
870
+ "loss": 0.0741,
871
+ "step": 29000
872
+ },
873
+ {
874
+ "epoch": 6.56,
875
+ "eval_loss": 0.05769050493836403,
876
+ "eval_runtime": 27.1856,
877
+ "eval_samples_per_second": 289.05,
878
+ "eval_steps_per_second": 18.098,
879
+ "step": 29000
880
+ },
881
+ {
882
+ "epoch": 6.67,
883
+ "grad_norm": 0.26046234369277954,
884
+ "learning_rate": 0.000332579185520362,
885
+ "loss": 0.0746,
886
+ "step": 29500
887
+ },
888
+ {
889
+ "epoch": 6.67,
890
+ "eval_loss": 0.05827530845999718,
891
+ "eval_runtime": 27.1863,
892
+ "eval_samples_per_second": 289.043,
893
+ "eval_steps_per_second": 18.097,
894
+ "step": 29500
895
+ },
896
+ {
897
+ "epoch": 6.79,
898
+ "grad_norm": 0.12222661823034286,
899
+ "learning_rate": 0.0003212669683257919,
900
+ "loss": 0.0735,
901
+ "step": 30000
902
+ },
903
+ {
904
+ "epoch": 6.79,
905
+ "eval_loss": 0.05913107842206955,
906
+ "eval_runtime": 27.1918,
907
+ "eval_samples_per_second": 288.984,
908
+ "eval_steps_per_second": 18.094,
909
+ "step": 30000
910
+ },
911
+ {
912
+ "epoch": 6.9,
913
+ "grad_norm": 0.28610703349113464,
914
+ "learning_rate": 0.0003099547511312217,
915
+ "loss": 0.0726,
916
+ "step": 30500
917
+ },
918
+ {
919
+ "epoch": 6.9,
920
+ "eval_loss": 0.05818793550133705,
921
+ "eval_runtime": 27.2089,
922
+ "eval_samples_per_second": 288.803,
923
+ "eval_steps_per_second": 18.082,
924
+ "step": 30500
925
+ },
926
+ {
927
+ "epoch": 7.01,
928
+ "grad_norm": 0.3682945966720581,
929
+ "learning_rate": 0.00029864253393665157,
930
+ "loss": 0.0741,
931
+ "step": 31000
932
+ },
933
+ {
934
+ "epoch": 7.01,
935
+ "eval_loss": 0.05868174880743027,
936
+ "eval_runtime": 27.1916,
937
+ "eval_samples_per_second": 288.986,
938
+ "eval_steps_per_second": 18.094,
939
+ "step": 31000
940
+ },
941
+ {
942
+ "epoch": 7.13,
943
+ "grad_norm": 0.16477471590042114,
944
+ "learning_rate": 0.00028733031674208143,
945
+ "loss": 0.0715,
946
+ "step": 31500
947
+ },
948
+ {
949
+ "epoch": 7.13,
950
+ "eval_loss": 0.05955735221505165,
951
+ "eval_runtime": 27.1945,
952
+ "eval_samples_per_second": 288.955,
953
+ "eval_steps_per_second": 18.092,
954
+ "step": 31500
955
+ },
956
+ {
957
+ "epoch": 7.24,
958
+ "grad_norm": 0.24769556522369385,
959
+ "learning_rate": 0.00027601809954751135,
960
+ "loss": 0.07,
961
+ "step": 32000
962
+ },
963
+ {
964
+ "epoch": 7.24,
965
+ "eval_loss": 0.057150740176439285,
966
+ "eval_runtime": 27.1825,
967
+ "eval_samples_per_second": 289.083,
968
+ "eval_steps_per_second": 18.1,
969
+ "step": 32000
970
+ },
971
+ {
972
+ "epoch": 7.35,
973
+ "grad_norm": 0.3199273347854614,
974
+ "learning_rate": 0.0002647058823529412,
975
+ "loss": 0.0686,
976
+ "step": 32500
977
+ },
978
+ {
979
+ "epoch": 7.35,
980
+ "eval_loss": 0.05786846950650215,
981
+ "eval_runtime": 27.2001,
982
+ "eval_samples_per_second": 288.896,
983
+ "eval_steps_per_second": 18.088,
984
+ "step": 32500
985
+ },
986
+ {
987
+ "epoch": 7.47,
988
+ "grad_norm": 0.3163066804409027,
989
+ "learning_rate": 0.000253393665158371,
990
+ "loss": 0.0703,
991
+ "step": 33000
992
+ },
993
+ {
994
+ "epoch": 7.47,
995
+ "eval_loss": 0.05759541690349579,
996
+ "eval_runtime": 27.1994,
997
+ "eval_samples_per_second": 288.904,
998
+ "eval_steps_per_second": 18.089,
999
+ "step": 33000
1000
+ },
1001
+ {
1002
+ "epoch": 7.58,
1003
+ "grad_norm": 0.4390794336795807,
1004
+ "learning_rate": 0.0002420814479638009,
1005
+ "loss": 0.0694,
1006
+ "step": 33500
1007
+ },
1008
+ {
1009
+ "epoch": 7.58,
1010
+ "eval_loss": 0.06044788658618927,
1011
+ "eval_runtime": 27.2196,
1012
+ "eval_samples_per_second": 288.689,
1013
+ "eval_steps_per_second": 18.075,
1014
+ "step": 33500
1015
+ },
1016
+ {
1017
+ "epoch": 7.69,
1018
+ "grad_norm": 0.19777078926563263,
1019
+ "learning_rate": 0.0002307692307692308,
1020
+ "loss": 0.0683,
1021
+ "step": 34000
1022
+ },
1023
+ {
1024
+ "epoch": 7.69,
1025
+ "eval_loss": 0.05697755515575409,
1026
+ "eval_runtime": 27.2282,
1027
+ "eval_samples_per_second": 288.598,
1028
+ "eval_steps_per_second": 18.069,
1029
+ "step": 34000
1030
+ },
1031
+ {
1032
+ "epoch": 7.81,
1033
+ "grad_norm": 0.418797105550766,
1034
+ "learning_rate": 0.00021945701357466062,
1035
+ "loss": 0.0712,
1036
+ "step": 34500
1037
+ },
1038
+ {
1039
+ "epoch": 7.81,
1040
+ "eval_loss": 0.05598929896950722,
1041
+ "eval_runtime": 27.2203,
1042
+ "eval_samples_per_second": 288.682,
1043
+ "eval_steps_per_second": 18.075,
1044
+ "step": 34500
1045
+ },
1046
+ {
1047
+ "epoch": 7.92,
1048
+ "grad_norm": 0.4459814727306366,
1049
+ "learning_rate": 0.0002081447963800905,
1050
+ "loss": 0.0672,
1051
+ "step": 35000
1052
+ },
1053
+ {
1054
+ "epoch": 7.92,
1055
+ "eval_loss": 0.05849257484078407,
1056
+ "eval_runtime": 27.2068,
1057
+ "eval_samples_per_second": 288.825,
1058
+ "eval_steps_per_second": 18.084,
1059
+ "step": 35000
1060
+ },
1061
+ {
1062
+ "epoch": 8.03,
1063
+ "grad_norm": 0.2313721477985382,
1064
+ "learning_rate": 0.00019683257918552037,
1065
+ "loss": 0.0675,
1066
+ "step": 35500
1067
+ },
1068
+ {
1069
+ "epoch": 8.03,
1070
+ "eval_loss": 0.05674152076244354,
1071
+ "eval_runtime": 27.2055,
1072
+ "eval_samples_per_second": 288.839,
1073
+ "eval_steps_per_second": 18.085,
1074
+ "step": 35500
1075
+ },
1076
+ {
1077
+ "epoch": 8.14,
1078
+ "grad_norm": 0.2439548671245575,
1079
+ "learning_rate": 0.00018552036199095024,
1080
+ "loss": 0.0651,
1081
+ "step": 36000
1082
+ },
1083
+ {
1084
+ "epoch": 8.14,
1085
+ "eval_loss": 0.05658886954188347,
1086
+ "eval_runtime": 27.2233,
1087
+ "eval_samples_per_second": 288.65,
1088
+ "eval_steps_per_second": 18.073,
1089
+ "step": 36000
1090
+ },
1091
+ {
1092
+ "epoch": 8.26,
1093
+ "grad_norm": 0.3285837471485138,
1094
+ "learning_rate": 0.0001742081447963801,
1095
+ "loss": 0.0648,
1096
+ "step": 36500
1097
+ },
1098
+ {
1099
+ "epoch": 8.26,
1100
+ "eval_loss": 0.05789176747202873,
1101
+ "eval_runtime": 27.2295,
1102
+ "eval_samples_per_second": 288.584,
1103
+ "eval_steps_per_second": 18.069,
1104
+ "step": 36500
1105
+ },
1106
+ {
1107
+ "epoch": 8.37,
1108
+ "grad_norm": 0.3167458772659302,
1109
+ "learning_rate": 0.00016289592760180996,
1110
+ "loss": 0.067,
1111
+ "step": 37000
1112
+ },
1113
+ {
1114
+ "epoch": 8.37,
1115
+ "eval_loss": 0.05568605288863182,
1116
+ "eval_runtime": 27.2118,
1117
+ "eval_samples_per_second": 288.772,
1118
+ "eval_steps_per_second": 18.08,
1119
+ "step": 37000
1120
+ },
1121
+ {
1122
+ "epoch": 8.48,
1123
+ "grad_norm": 0.1530727595090866,
1124
+ "learning_rate": 0.00015158371040723982,
1125
+ "loss": 0.0651,
1126
+ "step": 37500
1127
+ },
1128
+ {
1129
+ "epoch": 8.48,
1130
+ "eval_loss": 0.057902004569768906,
1131
+ "eval_runtime": 27.219,
1132
+ "eval_samples_per_second": 288.695,
1133
+ "eval_steps_per_second": 18.076,
1134
+ "step": 37500
1135
+ },
1136
+ {
1137
+ "epoch": 8.6,
1138
+ "grad_norm": 0.21044595539569855,
1139
+ "learning_rate": 0.00014027149321266968,
1140
+ "loss": 0.0666,
1141
+ "step": 38000
1142
+ },
1143
+ {
1144
+ "epoch": 8.6,
1145
+ "eval_loss": 0.05458011105656624,
1146
+ "eval_runtime": 27.2446,
1147
+ "eval_samples_per_second": 288.424,
1148
+ "eval_steps_per_second": 18.059,
1149
+ "step": 38000
1150
+ },
1151
+ {
1152
+ "epoch": 8.71,
1153
+ "grad_norm": 0.23161017894744873,
1154
+ "learning_rate": 0.00012895927601809957,
1155
+ "loss": 0.0635,
1156
+ "step": 38500
1157
+ },
1158
+ {
1159
+ "epoch": 8.71,
1160
+ "eval_loss": 0.056671272963285446,
1161
+ "eval_runtime": 27.2141,
1162
+ "eval_samples_per_second": 288.748,
1163
+ "eval_steps_per_second": 18.079,
1164
+ "step": 38500
1165
+ },
1166
+ {
1167
+ "epoch": 8.82,
1168
+ "grad_norm": 0.14228539168834686,
1169
+ "learning_rate": 0.00011764705882352942,
1170
+ "loss": 0.0622,
1171
+ "step": 39000
1172
+ },
1173
+ {
1174
+ "epoch": 8.82,
1175
+ "eval_loss": 0.05409713461995125,
1176
+ "eval_runtime": 27.229,
1177
+ "eval_samples_per_second": 288.59,
1178
+ "eval_steps_per_second": 18.069,
1179
+ "step": 39000
1180
+ },
1181
+ {
1182
+ "epoch": 8.94,
1183
+ "grad_norm": 0.19111554324626923,
1184
+ "learning_rate": 0.00010633484162895928,
1185
+ "loss": 0.0645,
1186
+ "step": 39500
1187
+ },
1188
+ {
1189
+ "epoch": 8.94,
1190
+ "eval_loss": 0.05430610105395317,
1191
+ "eval_runtime": 27.2287,
1192
+ "eval_samples_per_second": 288.592,
1193
+ "eval_steps_per_second": 18.069,
1194
+ "step": 39500
1195
+ },
1196
+ {
1197
+ "epoch": 9.05,
1198
+ "grad_norm": 0.1508806049823761,
1199
+ "learning_rate": 9.502262443438914e-05,
1200
+ "loss": 0.0631,
1201
+ "step": 40000
1202
+ },
1203
+ {
1204
+ "epoch": 9.05,
1205
+ "eval_loss": 0.05481436848640442,
1206
+ "eval_runtime": 27.2111,
1207
+ "eval_samples_per_second": 288.78,
1208
+ "eval_steps_per_second": 18.081,
1209
+ "step": 40000
1210
+ },
1211
+ {
1212
+ "epoch": 9.16,
1213
+ "grad_norm": 0.26917019486427307,
1214
+ "learning_rate": 8.3710407239819e-05,
1215
+ "loss": 0.063,
1216
+ "step": 40500
1217
+ },
1218
+ {
1219
+ "epoch": 9.16,
1220
+ "eval_loss": 0.056788042187690735,
1221
+ "eval_runtime": 27.2329,
1222
+ "eval_samples_per_second": 288.548,
1223
+ "eval_steps_per_second": 18.066,
1224
+ "step": 40500
1225
+ },
1226
+ {
1227
+ "epoch": 9.28,
1228
+ "grad_norm": 0.26919251680374146,
1229
+ "learning_rate": 7.239819004524887e-05,
1230
+ "loss": 0.0614,
1231
+ "step": 41000
1232
+ },
1233
+ {
1234
+ "epoch": 9.28,
1235
+ "eval_loss": 0.056851934641599655,
1236
+ "eval_runtime": 27.2442,
1237
+ "eval_samples_per_second": 288.428,
1238
+ "eval_steps_per_second": 18.059,
1239
+ "step": 41000
1240
+ },
1241
+ {
1242
+ "epoch": 9.39,
1243
+ "grad_norm": 0.222616046667099,
1244
+ "learning_rate": 6.108597285067873e-05,
1245
+ "loss": 0.0588,
1246
+ "step": 41500
1247
+ },
1248
+ {
1249
+ "epoch": 9.39,
1250
+ "eval_loss": 0.05487231910228729,
1251
+ "eval_runtime": 27.23,
1252
+ "eval_samples_per_second": 288.579,
1253
+ "eval_steps_per_second": 18.068,
1254
+ "step": 41500
1255
+ },
1256
+ {
1257
+ "epoch": 9.5,
1258
+ "grad_norm": 0.2073131799697876,
1259
+ "learning_rate": 4.9773755656108595e-05,
1260
+ "loss": 0.0616,
1261
+ "step": 42000
1262
+ },
1263
+ {
1264
+ "epoch": 9.5,
1265
+ "eval_loss": 0.05528046563267708,
1266
+ "eval_runtime": 27.2327,
1267
+ "eval_samples_per_second": 288.55,
1268
+ "eval_steps_per_second": 18.067,
1269
+ "step": 42000
1270
+ },
1271
+ {
1272
+ "epoch": 9.62,
1273
+ "grad_norm": 0.19287574291229248,
1274
+ "learning_rate": 3.846153846153846e-05,
1275
+ "loss": 0.0609,
1276
+ "step": 42500
1277
+ },
1278
+ {
1279
+ "epoch": 9.62,
1280
+ "eval_loss": 0.055462516844272614,
1281
+ "eval_runtime": 27.2342,
1282
+ "eval_samples_per_second": 288.535,
1283
+ "eval_steps_per_second": 18.066,
1284
+ "step": 42500
1285
+ },
1286
+ {
1287
+ "epoch": 9.73,
1288
+ "grad_norm": 0.11690975725650787,
1289
+ "learning_rate": 2.7149321266968327e-05,
1290
+ "loss": 0.0612,
1291
+ "step": 43000
1292
+ },
1293
+ {
1294
+ "epoch": 9.73,
1295
+ "eval_loss": 0.055802907794713974,
1296
+ "eval_runtime": 27.263,
1297
+ "eval_samples_per_second": 288.23,
1298
+ "eval_steps_per_second": 18.046,
1299
+ "step": 43000
1300
+ },
1301
+ {
1302
+ "epoch": 9.84,
1303
+ "grad_norm": 0.19802606105804443,
1304
+ "learning_rate": 1.583710407239819e-05,
1305
+ "loss": 0.0588,
1306
+ "step": 43500
1307
+ },
1308
+ {
1309
+ "epoch": 9.84,
1310
+ "eval_loss": 0.05586336553096771,
1311
+ "eval_runtime": 27.2516,
1312
+ "eval_samples_per_second": 288.35,
1313
+ "eval_steps_per_second": 18.054,
1314
+ "step": 43500
1315
+ },
1316
+ {
1317
+ "epoch": 9.95,
1318
+ "grad_norm": 0.29080289602279663,
1319
+ "learning_rate": 4.5248868778280546e-06,
1320
+ "loss": 0.0622,
1321
+ "step": 44000
1322
+ },
1323
+ {
1324
+ "epoch": 9.95,
1325
+ "eval_loss": 0.05555348098278046,
1326
+ "eval_runtime": 27.2559,
1327
+ "eval_samples_per_second": 288.305,
1328
+ "eval_steps_per_second": 18.051,
1329
+ "step": 44000
1330
+ },
1331
+ {
1332
+ "epoch": 10.0,
1333
+ "step": 44200,
1334
+ "total_flos": 8.389179359649792e+16,
1335
+ "train_loss": 0.09247852528257068,
1336
+ "train_runtime": 8790.0948,
1337
+ "train_samples_per_second": 80.453,
1338
+ "train_steps_per_second": 5.028
1339
+ }
1340
+ ],
1341
+ "logging_steps": 500,
1342
+ "max_steps": 44200,
1343
+ "num_input_tokens_seen": 0,
1344
+ "num_train_epochs": 10,
1345
+ "save_steps": 500,
1346
+ "total_flos": 8.389179359649792e+16,
1347
+ "train_batch_size": 16,
1348
+ "trial_name": null,
1349
+ "trial_params": null
1350
+ }
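The log above is long but regular: every 500 steps there is one entry with a `loss` key (training) and one with an `eval_loss` key (evaluation), and the `learning_rate` column is consistent with a 1e-3 peak decayed linearly over the full 44,200 steps. A small sketch, not part of this commit, for turning `log_history` into loss curves:

```python
# Sketch: plot train/eval loss from trainer_state.json (requires matplotlib).
import json
import matplotlib.pyplot as plt

with open("trainer_state.json") as f:
    history = json.load(f)["log_history"]

train = [(e["step"], e["loss"]) for e in history if "loss" in e]
evals = [(e["step"], e["eval_loss"]) for e in history if "eval_loss" in e]

plt.plot(*zip(*train), label="train loss")
plt.plot(*zip(*evals), label="eval loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
```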
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:234fbc7bf0d159d3d95c13453a8fd74105a86470e2dc26447e416696e864f884
+ size 5048
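`training_args.bin` is the pickled `TrainingArguments` object the `Trainer` saved alongside the adapter; the hyperparameters the model card leaves blank (learning rate, batch size, scheduler, precision) can usually be recovered from it. A hedged sketch, assuming a compatible `transformers` version is installed:

```python
# Sketch: inspect the pickled TrainingArguments in training_args.bin.
# The file is a pickle, so only load it from a source you trust; recent
# torch versions also require weights_only=False for non-tensor pickles.
import torch

args = torch.load("training_args.bin", weights_only=False)
print(args.learning_rate, args.per_device_train_batch_size, args.num_train_epochs)
print(args.lr_scheduler_type, args.fp16, args.bf16)
```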