bigscience-bot committed on
Commit
d0b4f1a
1 Parent(s): adb6c42
Files changed (1)
  1. logs/main_log.txt +458 -0
logs/main_log.txt CHANGED
@@ -42181,3 +42181,461 @@ time (ms)
42181
  time (ms)
42182
  [2021-09-24 23:07:37] PULSE: tr8-104B is waiting for the previous job to finish before scheduling a new one using the dependency mechanism (1165978_[1-10%1] on 'gpu_p13' partition)
42183
  [2021-09-24 23:07:37] PULSE: tr8-104B is running for 17:15:26 since 2021-09-24T05:52:11 (1162855_1 on 'gpu_p13' partition (r6i4n[5,7],r6i5n[2,7-8],r6i6n[0,2,6],r7i2n[4-5],r7i6n[2-4],r7i7n[7-8],r8i0n[2-3,5-8],r8i1n[0,2-4],r8i2n8,r8i3n[0-2],r8i5n[3-4],r8i7n[3-8],r9i0n[0-2],r9i1n[0-3],r9i2n[3-5,8],r9i3n[0-1,7-8],r9i4n[0-2],r9i5n[3-8],r9i6n[0,7-8])
42184
+ iteration 5253/ 159576 | consumed samples: 134016 | elapsed time per iteration (ms): 15553.2 | learning rate: 3.709E-05 | global batch size: 48 | lm loss: 6.395989E+00 | loss scale: 4096.0 | grad norm: 75934.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42185
+ time (ms)
42186
+ iteration 5254/ 159576 | consumed samples: 134064 | elapsed time per iteration (ms): 15521.6 | learning rate: 3.710E-05 | global batch size: 48 | lm loss: 6.388237E+00 | loss scale: 4096.0 | grad norm: 85225.047 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42187
+ time (ms)
42188
+ iteration 5255/ 159576 | consumed samples: 134112 | elapsed time per iteration (ms): 15886.3 | learning rate: 3.711E-05 | global batch size: 48 | lm loss: 6.348703E+00 | loss scale: 4096.0 | grad norm: 72802.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42189
+ time (ms)
42190
+ iteration 5256/ 159576 | consumed samples: 134160 | elapsed time per iteration (ms): 15520.3 | learning rate: 3.713E-05 | global batch size: 48 | lm loss: 6.321572E+00 | loss scale: 4096.0 | grad norm: 73245.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42191
+ time (ms)
42192
+ iteration 5257/ 159576 | consumed samples: 134208 | elapsed time per iteration (ms): 15443.7 | learning rate: 3.714E-05 | global batch size: 48 | lm loss: 6.335665E+00 | loss scale: 4096.0 | grad norm: 58798.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42193
+ time (ms)
42194
+ iteration 5258/ 159576 | consumed samples: 134256 | elapsed time per iteration (ms): 15427.0 | learning rate: 3.715E-05 | global batch size: 48 | lm loss: 6.319070E+00 | loss scale: 4096.0 | grad norm: 66591.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42195
+ time (ms)
42196
+ iteration 5259/ 159576 | consumed samples: 134304 | elapsed time per iteration (ms): 15760.6 | learning rate: 3.717E-05 | global batch size: 48 | lm loss: 6.229961E+00 | loss scale: 4096.0 | grad norm: 78411.623 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42197
+ time (ms)
42198
+ iteration 5260/ 159576 | consumed samples: 134352 | elapsed time per iteration (ms): 15544.0 | learning rate: 3.718E-05 | global batch size: 48 | lm loss: 6.379896E+00 | loss scale: 4096.0 | grad norm: 82294.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42199
+ time (ms)
42200
+ iteration 5261/ 159576 | consumed samples: 134400 | elapsed time per iteration (ms): 15397.8 | learning rate: 3.719E-05 | global batch size: 48 | lm loss: 6.233184E+00 | loss scale: 4096.0 | grad norm: 65525.586 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42201
+ time (ms)
42202
+ iteration 5262/ 159576 | consumed samples: 134448 | elapsed time per iteration (ms): 15498.3 | learning rate: 3.721E-05 | global batch size: 48 | lm loss: 6.326461E+00 | loss scale: 4096.0 | grad norm: 101232.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42203
+ time (ms)
42204
+ iteration 5263/ 159576 | consumed samples: 134496 | elapsed time per iteration (ms): 15834.8 | learning rate: 3.722E-05 | global batch size: 48 | lm loss: 6.351873E+00 | loss scale: 4096.0 | grad norm: 82652.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42205
+ time (ms)
42206
+ iteration 5264/ 159576 | consumed samples: 134544 | elapsed time per iteration (ms): 15450.4 | learning rate: 3.723E-05 | global batch size: 48 | lm loss: 6.411518E+00 | loss scale: 4096.0 | grad norm: 79704.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42207
+ time (ms)
42208
+ iteration 5265/ 159576 | consumed samples: 134592 | elapsed time per iteration (ms): 15408.5 | learning rate: 3.725E-05 | global batch size: 48 | lm loss: 6.324855E+00 | loss scale: 4096.0 | grad norm: 96783.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42209
+ time (ms)
42210
+ iteration 5266/ 159576 | consumed samples: 134640 | elapsed time per iteration (ms): 15369.4 | learning rate: 3.726E-05 | global batch size: 48 | lm loss: 6.351592E+00 | loss scale: 4096.0 | grad norm: 96231.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42211
+ time (ms)
42212
+ iteration 5267/ 159576 | consumed samples: 134688 | elapsed time per iteration (ms): 15643.8 | learning rate: 3.727E-05 | global batch size: 48 | lm loss: 6.439371E+00 | loss scale: 4096.0 | grad norm: 86165.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42213
+ time (ms)
42214
+ iteration 5268/ 159576 | consumed samples: 134736 | elapsed time per iteration (ms): 15428.0 | learning rate: 3.729E-05 | global batch size: 48 | lm loss: 6.282881E+00 | loss scale: 4096.0 | grad norm: 95370.085 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42215
+ time (ms)
42216
+ iteration 5269/ 159576 | consumed samples: 134784 | elapsed time per iteration (ms): 15422.7 | learning rate: 3.730E-05 | global batch size: 48 | lm loss: 6.489480E+00 | loss scale: 4096.0 | grad norm: 77407.640 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42217
+ time (ms)
42218
+ iteration 5270/ 159576 | consumed samples: 134832 | elapsed time per iteration (ms): 15384.0 | learning rate: 3.731E-05 | global batch size: 48 | lm loss: 6.382200E+00 | loss scale: 4096.0 | grad norm: 66716.315 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42219
+ time (ms)
42220
+ iteration 5271/ 159576 | consumed samples: 134880 | elapsed time per iteration (ms): 15581.8 | learning rate: 3.733E-05 | global batch size: 48 | lm loss: 6.409722E+00 | loss scale: 4096.0 | grad norm: 68218.526 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42221
+ time (ms)
42222
+ iteration 5272/ 159576 | consumed samples: 134928 | elapsed time per iteration (ms): 15395.7 | learning rate: 3.734E-05 | global batch size: 48 | lm loss: 6.493249E+00 | loss scale: 4096.0 | grad norm: 71580.496 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42223
+ time (ms)
42224
+ iteration 5273/ 159576 | consumed samples: 134976 | elapsed time per iteration (ms): 15402.4 | learning rate: 3.735E-05 | global batch size: 48 | lm loss: 6.376624E+00 | loss scale: 4096.0 | grad norm: 85075.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42225
+ time (ms)
42226
+ iteration 5274/ 159576 | consumed samples: 135024 | elapsed time per iteration (ms): 15424.2 | learning rate: 3.737E-05 | global batch size: 48 | lm loss: 6.441435E+00 | loss scale: 4096.0 | grad norm: 75286.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42227
+ time (ms)
42228
+ iteration 5275/ 159576 | consumed samples: 135072 | elapsed time per iteration (ms): 15616.5 | learning rate: 3.738E-05 | global batch size: 48 | lm loss: 6.428281E+00 | loss scale: 4096.0 | grad norm: 71317.497 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42229
+ time (ms)
42230
+ iteration 5276/ 159576 | consumed samples: 135120 | elapsed time per iteration (ms): 15383.8 | learning rate: 3.739E-05 | global batch size: 48 | lm loss: 6.324539E+00 | loss scale: 4096.0 | grad norm: 70509.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42231
+ time (ms)
42232
+ iteration 5277/ 159576 | consumed samples: 135168 | elapsed time per iteration (ms): 15404.4 | learning rate: 3.741E-05 | global batch size: 48 | lm loss: 6.396560E+00 | loss scale: 4096.0 | grad norm: 68223.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42233
+ time (ms)
42234
+ iteration 5278/ 159576 | consumed samples: 135216 | elapsed time per iteration (ms): 15464.0 | learning rate: 3.742E-05 | global batch size: 48 | lm loss: 6.403405E+00 | loss scale: 4096.0 | grad norm: 74828.040 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42235
+ time (ms)
42236
+ iteration 5279/ 159576 | consumed samples: 135264 | elapsed time per iteration (ms): 15572.0 | learning rate: 3.743E-05 | global batch size: 48 | lm loss: 6.340907E+00 | loss scale: 4096.0 | grad norm: 103719.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42237
+ time (ms)
42238
+ iteration 5280/ 159576 | consumed samples: 135312 | elapsed time per iteration (ms): 15390.1 | learning rate: 3.745E-05 | global batch size: 48 | lm loss: 6.465801E+00 | loss scale: 4096.0 | grad norm: 71954.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42239
+ time (ms)
42240
+ iteration 5281/ 159576 | consumed samples: 135360 | elapsed time per iteration (ms): 15379.3 | learning rate: 3.746E-05 | global batch size: 48 | lm loss: 6.481463E+00 | loss scale: 4096.0 | grad norm: 64156.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42241
+ time (ms)
42242
+ iteration 5282/ 159576 | consumed samples: 135408 | elapsed time per iteration (ms): 15880.0 | learning rate: 3.747E-05 | global batch size: 48 | lm loss: 6.324627E+00 | loss scale: 4096.0 | grad norm: 77974.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42243
+ time (ms)
42244
+ iteration 5283/ 159576 | consumed samples: 135456 | elapsed time per iteration (ms): 15461.2 | learning rate: 3.749E-05 | global batch size: 48 | lm loss: 6.278036E+00 | loss scale: 4096.0 | grad norm: 78417.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42245
+ time (ms)
42246
+ iteration 5284/ 159576 | consumed samples: 135504 | elapsed time per iteration (ms): 15434.3 | learning rate: 3.750E-05 | global batch size: 48 | lm loss: 6.470399E+00 | loss scale: 4096.0 | grad norm: 70677.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42247
+ time (ms)
42248
+ iteration 5285/ 159576 | consumed samples: 135552 | elapsed time per iteration (ms): 15453.3 | learning rate: 3.751E-05 | global batch size: 48 | lm loss: 6.465354E+00 | loss scale: 4096.0 | grad norm: 72699.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42249
+ time (ms)
42250
+ iteration 5286/ 159576 | consumed samples: 135600 | elapsed time per iteration (ms): 15799.4 | learning rate: 3.753E-05 | global batch size: 48 | lm loss: 6.366466E+00 | loss scale: 4096.0 | grad norm: 87890.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42251
+ time (ms)
42252
+ iteration 5287/ 159576 | consumed samples: 135648 | elapsed time per iteration (ms): 15462.6 | learning rate: 3.754E-05 | global batch size: 48 | lm loss: 6.450302E+00 | loss scale: 4096.0 | grad norm: 65500.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42253
+ time (ms)
42254
+ iteration 5288/ 159576 | consumed samples: 135696 | elapsed time per iteration (ms): 15449.3 | learning rate: 3.755E-05 | global batch size: 48 | lm loss: 6.211058E+00 | loss scale: 4096.0 | grad norm: 91309.432 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42255
+ time (ms)
42256
+ iteration 5289/ 159576 | consumed samples: 135744 | elapsed time per iteration (ms): 15440.0 | learning rate: 3.757E-05 | global batch size: 48 | lm loss: 6.439297E+00 | loss scale: 4096.0 | grad norm: 78139.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42257
+ time (ms)
42258
+ iteration 5290/ 159576 | consumed samples: 135792 | elapsed time per iteration (ms): 15759.6 | learning rate: 3.758E-05 | global batch size: 48 | lm loss: 6.295393E+00 | loss scale: 4096.0 | grad norm: 67343.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42259
+ time (ms)
42260
+ iteration 5291/ 159576 | consumed samples: 135840 | elapsed time per iteration (ms): 15513.6 | learning rate: 3.759E-05 | global batch size: 48 | lm loss: 6.403075E+00 | loss scale: 4096.0 | grad norm: 88227.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42261
+ time (ms)
42262
+ iteration 5292/ 159576 | consumed samples: 135888 | elapsed time per iteration (ms): 15421.3 | learning rate: 3.761E-05 | global batch size: 48 | lm loss: 6.414333E+00 | loss scale: 4096.0 | grad norm: 78788.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42263
+ time (ms)
42264
+ iteration 5293/ 159576 | consumed samples: 135936 | elapsed time per iteration (ms): 15345.3 | learning rate: 3.762E-05 | global batch size: 48 | lm loss: 6.292488E+00 | loss scale: 4096.0 | grad norm: 59708.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42265
+ time (ms)
42266
+ iteration 5294/ 159576 | consumed samples: 135984 | elapsed time per iteration (ms): 16027.7 | learning rate: 3.763E-05 | global batch size: 48 | lm loss: 6.385753E+00 | loss scale: 4096.0 | grad norm: 102775.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42267
+ time (ms)
42268
+ iteration 5295/ 159576 | consumed samples: 136032 | elapsed time per iteration (ms): 15461.5 | learning rate: 3.765E-05 | global batch size: 48 | lm loss: 6.324437E+00 | loss scale: 4096.0 | grad norm: 71697.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42269
+ time (ms)
42270
+ iteration 5296/ 159576 | consumed samples: 136080 | elapsed time per iteration (ms): 15433.9 | learning rate: 3.766E-05 | global batch size: 48 | lm loss: 6.384956E+00 | loss scale: 4096.0 | grad norm: 102953.672 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42271
+ time (ms)
42272
+ iteration 5297/ 159576 | consumed samples: 136128 | elapsed time per iteration (ms): 15429.7 | learning rate: 3.767E-05 | global batch size: 48 | lm loss: 6.436825E+00 | loss scale: 4096.0 | grad norm: 75031.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42273
+ time (ms)
42274
+ iteration 5298/ 159576 | consumed samples: 136176 | elapsed time per iteration (ms): 15818.4 | learning rate: 3.769E-05 | global batch size: 48 | lm loss: 6.482272E+00 | loss scale: 4096.0 | grad norm: 65276.986 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42275
+ time (ms)
42276
+ iteration 5299/ 159576 | consumed samples: 136224 | elapsed time per iteration (ms): 15441.5 | learning rate: 3.770E-05 | global batch size: 48 | lm loss: 6.589076E+00 | loss scale: 4096.0 | grad norm: 121561.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42277
+ time (ms)
42278
+ iteration 5300/ 159576 | consumed samples: 136272 | elapsed time per iteration (ms): 15422.2 | learning rate: 3.771E-05 | global batch size: 48 | lm loss: 6.405668E+00 | loss scale: 4096.0 | grad norm: 62093.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42279
+ time (ms)
42280
+ iteration 5301/ 159576 | consumed samples: 136320 | elapsed time per iteration (ms): 15355.0 | learning rate: 3.773E-05 | global batch size: 48 | lm loss: 6.390646E+00 | loss scale: 4096.0 | grad norm: 56038.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42281
+ time (ms)
42282
+ iteration 5302/ 159576 | consumed samples: 136368 | elapsed time per iteration (ms): 15565.3 | learning rate: 3.774E-05 | global batch size: 48 | lm loss: 6.410752E+00 | loss scale: 4096.0 | grad norm: 64581.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42283
+ time (ms)
42284
+ iteration 5303/ 159576 | consumed samples: 136416 | elapsed time per iteration (ms): 15422.3 | learning rate: 3.775E-05 | global batch size: 48 | lm loss: 6.448494E+00 | loss scale: 4096.0 | grad norm: 77740.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42285
+ time (ms)
42286
+ iteration 5304/ 159576 | consumed samples: 136464 | elapsed time per iteration (ms): 15454.6 | learning rate: 3.777E-05 | global batch size: 48 | lm loss: 6.436998E+00 | loss scale: 4096.0 | grad norm: 86587.477 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42287
+ time (ms)
42288
+ iteration 5305/ 159576 | consumed samples: 136512 | elapsed time per iteration (ms): 15410.7 | learning rate: 3.778E-05 | global batch size: 48 | lm loss: 6.360906E+00 | loss scale: 4096.0 | grad norm: 102483.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42289
+ time (ms)
42290
+ iteration 5306/ 159576 | consumed samples: 136560 | elapsed time per iteration (ms): 15590.5 | learning rate: 3.779E-05 | global batch size: 48 | lm loss: 6.449046E+00 | loss scale: 4096.0 | grad norm: 63898.529 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42291
+ time (ms)
42292
+ iteration 5307/ 159576 | consumed samples: 136608 | elapsed time per iteration (ms): 15506.8 | learning rate: 3.781E-05 | global batch size: 48 | lm loss: 6.467348E+00 | loss scale: 4096.0 | grad norm: 66863.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42293
+ time (ms)
42294
+ iteration 5308/ 159576 | consumed samples: 136656 | elapsed time per iteration (ms): 15351.0 | learning rate: 3.782E-05 | global batch size: 48 | lm loss: 6.301440E+00 | loss scale: 4096.0 | grad norm: 66038.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42295
+ time (ms)
42296
+ iteration 5309/ 159576 | consumed samples: 136704 | elapsed time per iteration (ms): 15547.1 | learning rate: 3.783E-05 | global batch size: 48 | lm loss: 6.314401E+00 | loss scale: 4096.0 | grad norm: 100622.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42297
+ time (ms)
42298
+ iteration 5310/ 159576 | consumed samples: 136752 | elapsed time per iteration (ms): 15714.1 | learning rate: 3.785E-05 | global batch size: 48 | lm loss: 6.474138E+00 | loss scale: 4096.0 | grad norm: 100713.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42299
+ time (ms)
42300
+ iteration 5311/ 159576 | consumed samples: 136800 | elapsed time per iteration (ms): 15441.4 | learning rate: 3.786E-05 | global batch size: 48 | lm loss: 6.429978E+00 | loss scale: 4096.0 | grad norm: 73118.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42301
+ time (ms)
42302
+ iteration 5312/ 159576 | consumed samples: 136848 | elapsed time per iteration (ms): 15448.2 | learning rate: 3.787E-05 | global batch size: 48 | lm loss: 6.322928E+00 | loss scale: 4096.0 | grad norm: 79244.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42303
+ time (ms)
42304
+ iteration 5313/ 159576 | consumed samples: 136896 | elapsed time per iteration (ms): 15801.3 | learning rate: 3.789E-05 | global batch size: 48 | lm loss: 6.536728E+00 | loss scale: 4096.0 | grad norm: 80004.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42305
+ time (ms)
42306
+ iteration 5314/ 159576 | consumed samples: 136944 | elapsed time per iteration (ms): 15420.7 | learning rate: 3.790E-05 | global batch size: 48 | lm loss: 6.358313E+00 | loss scale: 4096.0 | grad norm: 73656.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42307
+ time (ms)
42308
+ iteration 5315/ 159576 | consumed samples: 136992 | elapsed time per iteration (ms): 15430.5 | learning rate: 3.791E-05 | global batch size: 48 | lm loss: 6.285139E+00 | loss scale: 4096.0 | grad norm: 72555.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42309
+ time (ms)
42310
+ iteration 5316/ 159576 | consumed samples: 137040 | elapsed time per iteration (ms): 15418.3 | learning rate: 3.793E-05 | global batch size: 48 | lm loss: 6.355993E+00 | loss scale: 4096.0 | grad norm: 89604.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42311
+ time (ms)
42312
+ iteration 5317/ 159576 | consumed samples: 137088 | elapsed time per iteration (ms): 15767.6 | learning rate: 3.794E-05 | global batch size: 48 | lm loss: 6.370296E+00 | loss scale: 4096.0 | grad norm: 68760.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42313
+ time (ms)
42314
+ iteration 5318/ 159576 | consumed samples: 137136 | elapsed time per iteration (ms): 15469.0 | learning rate: 3.795E-05 | global batch size: 48 | lm loss: 6.401207E+00 | loss scale: 4096.0 | grad norm: 64825.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42315
+ time (ms)
42316
+ iteration 5319/ 159576 | consumed samples: 137184 | elapsed time per iteration (ms): 15469.4 | learning rate: 3.797E-05 | global batch size: 48 | lm loss: 6.433188E+00 | loss scale: 4096.0 | grad norm: 75954.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42317
+ time (ms)
42318
+ iteration 5320/ 159576 | consumed samples: 137232 | elapsed time per iteration (ms): 15484.0 | learning rate: 3.798E-05 | global batch size: 48 | lm loss: 6.422481E+00 | loss scale: 4096.0 | grad norm: 85143.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42319
+ time (ms)
42320
+ iteration 5321/ 159576 | consumed samples: 137280 | elapsed time per iteration (ms): 15773.2 | learning rate: 3.799E-05 | global batch size: 48 | lm loss: 6.394318E+00 | loss scale: 4096.0 | grad norm: 81431.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42321
+ time (ms)
42322
+ iteration 5322/ 159576 | consumed samples: 137328 | elapsed time per iteration (ms): 15339.5 | learning rate: 3.801E-05 | global batch size: 48 | lm loss: 6.498918E+00 | loss scale: 4096.0 | grad norm: 76418.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42323
+ time (ms)
42324
+ iteration 5323/ 159576 | consumed samples: 137376 | elapsed time per iteration (ms): 15420.7 | learning rate: 3.802E-05 | global batch size: 48 | lm loss: 6.518599E+00 | loss scale: 4096.0 | grad norm: 71705.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42325
+ time (ms)
42326
+ iteration 5324/ 159576 | consumed samples: 137424 | elapsed time per iteration (ms): 15420.3 | learning rate: 3.803E-05 | global batch size: 48 | lm loss: 6.429631E+00 | loss scale: 4096.0 | grad norm: 57358.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42327
+ time (ms)
42328
+ iteration 5325/ 159576 | consumed samples: 137472 | elapsed time per iteration (ms): 15903.1 | learning rate: 3.805E-05 | global batch size: 48 | lm loss: 6.407781E+00 | loss scale: 4096.0 | grad norm: 91506.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42329
+ time (ms)
42330
+ iteration 5326/ 159576 | consumed samples: 137520 | elapsed time per iteration (ms): 15425.4 | learning rate: 3.806E-05 | global batch size: 48 | lm loss: 6.399868E+00 | loss scale: 4096.0 | grad norm: 68843.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42331
+ time (ms)
42332
+ iteration 5327/ 159576 | consumed samples: 137568 | elapsed time per iteration (ms): 15444.3 | learning rate: 3.807E-05 | global batch size: 48 | lm loss: 6.412372E+00 | loss scale: 4096.0 | grad norm: 67149.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42333
+ time (ms)
42334
+ iteration 5328/ 159576 | consumed samples: 137616 | elapsed time per iteration (ms): 15406.6 | learning rate: 3.809E-05 | global batch size: 48 | lm loss: 6.430699E+00 | loss scale: 4096.0 | grad norm: 102742.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42335
+ time (ms)
42336
+ iteration 5329/ 159576 | consumed samples: 137664 | elapsed time per iteration (ms): 15722.7 | learning rate: 3.810E-05 | global batch size: 48 | lm loss: 6.415520E+00 | loss scale: 4096.0 | grad norm: 73301.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42337
+ time (ms)
42338
+ iteration 5330/ 159576 | consumed samples: 137712 | elapsed time per iteration (ms): 15405.0 | learning rate: 3.811E-05 | global batch size: 48 | lm loss: 6.359590E+00 | loss scale: 4096.0 | grad norm: 70222.523 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42339
+ time (ms)
42340
+ iteration 5331/ 159576 | consumed samples: 137760 | elapsed time per iteration (ms): 15374.6 | learning rate: 3.813E-05 | global batch size: 48 | lm loss: 6.443409E+00 | loss scale: 4096.0 | grad norm: 79619.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42341
+ time (ms)
42342
+ iteration 5332/ 159576 | consumed samples: 137808 | elapsed time per iteration (ms): 15404.3 | learning rate: 3.814E-05 | global batch size: 48 | lm loss: 6.412749E+00 | loss scale: 4096.0 | grad norm: 110889.514 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42343
+ time (ms)
42344
+ iteration 5333/ 159576 | consumed samples: 137856 | elapsed time per iteration (ms): 15590.4 | learning rate: 3.815E-05 | global batch size: 48 | lm loss: 6.492513E+00 | loss scale: 4096.0 | grad norm: 80255.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42345
+ time (ms)
42346
+ iteration 5334/ 159576 | consumed samples: 137904 | elapsed time per iteration (ms): 15436.5 | learning rate: 3.817E-05 | global batch size: 48 | lm loss: 6.400149E+00 | loss scale: 4096.0 | grad norm: 69554.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42347
+ time (ms)
42348
+ iteration 5335/ 159576 | consumed samples: 137952 | elapsed time per iteration (ms): 15422.0 | learning rate: 3.818E-05 | global batch size: 48 | lm loss: 6.473186E+00 | loss scale: 4096.0 | grad norm: 96185.543 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42349
+ time (ms)
42350
+ iteration 5336/ 159576 | consumed samples: 138000 | elapsed time per iteration (ms): 15442.7 | learning rate: 3.819E-05 | global batch size: 48 | lm loss: 6.552884E+00 | loss scale: 4096.0 | grad norm: 73254.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42351
+ time (ms)
42352
+ iteration 5337/ 159576 | consumed samples: 138048 | elapsed time per iteration (ms): 15634.6 | learning rate: 3.821E-05 | global batch size: 48 | lm loss: 6.365612E+00 | loss scale: 4096.0 | grad norm: 57539.381 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42353
+ time (ms)
42354
+ iteration 5338/ 159576 | consumed samples: 138096 | elapsed time per iteration (ms): 15386.8 | learning rate: 3.822E-05 | global batch size: 48 | lm loss: 6.445109E+00 | loss scale: 4096.0 | grad norm: 67382.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42355
+ time (ms)
42356
+ iteration 5339/ 159576 | consumed samples: 138144 | elapsed time per iteration (ms): 15470.1 | learning rate: 3.823E-05 | global batch size: 48 | lm loss: 6.353713E+00 | loss scale: 4096.0 | grad norm: 110272.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42357
+ time (ms)
42358
+ iteration 5340/ 159576 | consumed samples: 138192 | elapsed time per iteration (ms): 15791.0 | learning rate: 3.825E-05 | global batch size: 48 | lm loss: 6.413539E+00 | loss scale: 4096.0 | grad norm: 72349.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42359
+ time (ms)
42360
+ iteration 5341/ 159576 | consumed samples: 138240 | elapsed time per iteration (ms): 15411.4 | learning rate: 3.826E-05 | global batch size: 48 | lm loss: 6.347322E+00 | loss scale: 4096.0 | grad norm: 61859.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42361
+ time (ms)
42362
+ iteration 5342/ 159576 | consumed samples: 138288 | elapsed time per iteration (ms): 15471.9 | learning rate: 3.827E-05 | global batch size: 48 | lm loss: 6.298682E+00 | loss scale: 4096.0 | grad norm: 78125.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42363
+ time (ms)
42364
+ iteration 5343/ 159576 | consumed samples: 138336 | elapsed time per iteration (ms): 15450.5 | learning rate: 3.829E-05 | global batch size: 48 | lm loss: 6.346509E+00 | loss scale: 4096.0 | grad norm: 76921.340 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42365
+ time (ms)
42366
+ iteration 5344/ 159576 | consumed samples: 138384 | elapsed time per iteration (ms): 15797.4 | learning rate: 3.830E-05 | global batch size: 48 | lm loss: 6.464560E+00 | loss scale: 4096.0 | grad norm: 73833.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42367
+ time (ms)
42368
+ iteration 5345/ 159576 | consumed samples: 138432 | elapsed time per iteration (ms): 15447.3 | learning rate: 3.831E-05 | global batch size: 48 | lm loss: 6.491942E+00 | loss scale: 4096.0 | grad norm: 58609.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42369
+ time (ms)
42370
+ iteration 5346/ 159576 | consumed samples: 138480 | elapsed time per iteration (ms): 15470.6 | learning rate: 3.833E-05 | global batch size: 48 | lm loss: 6.408776E+00 | loss scale: 4096.0 | grad norm: 61084.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42371
+ time (ms)
42372
+ iteration 5347/ 159576 | consumed samples: 138528 | elapsed time per iteration (ms): 15595.7 | learning rate: 3.834E-05 | global batch size: 48 | lm loss: 6.317072E+00 | loss scale: 4096.0 | grad norm: 79107.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42373
+ time (ms)
42374
+ iteration 5348/ 159576 | consumed samples: 138576 | elapsed time per iteration (ms): 15857.5 | learning rate: 3.835E-05 | global batch size: 48 | lm loss: 6.342214E+00 | loss scale: 4096.0 | grad norm: 82396.508 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42375
+ time (ms)
42376
+ iteration 5349/ 159576 | consumed samples: 138624 | elapsed time per iteration (ms): 15501.3 | learning rate: 3.837E-05 | global batch size: 48 | lm loss: 6.416060E+00 | loss scale: 4096.0 | grad norm: 58909.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42377
+ time (ms)
42378
+ iteration 5350/ 159576 | consumed samples: 138672 | elapsed time per iteration (ms): 15334.9 | learning rate: 3.838E-05 | global batch size: 48 | lm loss: 6.348287E+00 | loss scale: 4096.0 | grad norm: 54069.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42379
+ time (ms)
42380
+ iteration 5351/ 159576 | consumed samples: 138720 | elapsed time per iteration (ms): 15454.2 | learning rate: 3.839E-05 | global batch size: 48 | lm loss: 6.456007E+00 | loss scale: 4096.0 | grad norm: 61307.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42381
+ time (ms)
42382
+ iteration 5352/ 159576 | consumed samples: 138768 | elapsed time per iteration (ms): 15972.1 | learning rate: 3.841E-05 | global batch size: 48 | lm loss: 6.276731E+00 | loss scale: 4096.0 | grad norm: 62789.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42383
+ time (ms)
42384
+ iteration 5353/ 159576 | consumed samples: 138816 | elapsed time per iteration (ms): 15447.0 | learning rate: 3.842E-05 | global batch size: 48 | lm loss: 6.443192E+00 | loss scale: 4096.0 | grad norm: 75454.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42385
+ time (ms)
42386
+ iteration 5354/ 159576 | consumed samples: 138864 | elapsed time per iteration (ms): 15426.1 | learning rate: 3.843E-05 | global batch size: 48 | lm loss: 6.301665E+00 | loss scale: 4096.0 | grad norm: 66381.021 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42387
+ time (ms)
42388
+ iteration 5355/ 159576 | consumed samples: 138912 | elapsed time per iteration (ms): 15465.4 | learning rate: 3.845E-05 | global batch size: 48 | lm loss: 6.453572E+00 | loss scale: 4096.0 | grad norm: 63236.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42389
+ time (ms)
42390
+ iteration 5356/ 159576 | consumed samples: 138960 | elapsed time per iteration (ms): 15595.7 | learning rate: 3.846E-05 | global batch size: 48 | lm loss: 6.391494E+00 | loss scale: 4096.0 | grad norm: 78457.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42391
+ time (ms)
42392
+ iteration 5357/ 159576 | consumed samples: 139008 | elapsed time per iteration (ms): 15508.4 | learning rate: 3.847E-05 | global batch size: 48 | lm loss: 6.379974E+00 | loss scale: 4096.0 | grad norm: 85282.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42393
+ time (ms)
42394
+ iteration 5358/ 159576 | consumed samples: 139056 | elapsed time per iteration (ms): 15495.7 | learning rate: 3.849E-05 | global batch size: 48 | lm loss: 6.517261E+00 | loss scale: 4096.0 | grad norm: 75329.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42395
+ time (ms)
42396
+ iteration 5359/ 159576 | consumed samples: 139104 | elapsed time per iteration (ms): 15455.1 | learning rate: 3.850E-05 | global batch size: 48 | lm loss: 6.311386E+00 | loss scale: 4096.0 | grad norm: 74599.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42397
+ time (ms)
42398
+ iteration 5360/ 159576 | consumed samples: 139152 | elapsed time per iteration (ms): 15693.4 | learning rate: 3.851E-05 | global batch size: 48 | lm loss: 6.481428E+00 | loss scale: 4096.0 | grad norm: 77215.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42399
+ time (ms)
42400
+ iteration 5361/ 159576 | consumed samples: 139200 | elapsed time per iteration (ms): 15475.6 | learning rate: 3.853E-05 | global batch size: 48 | lm loss: 6.331719E+00 | loss scale: 4096.0 | grad norm: 60279.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42401
+ time (ms)
42402
+ iteration 5362/ 159576 | consumed samples: 139248 | elapsed time per iteration (ms): 15551.6 | learning rate: 3.854E-05 | global batch size: 48 | lm loss: 6.506707E+00 | loss scale: 4096.0 | grad norm: 57442.387 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42403
+ time (ms)
42404
+ iteration 5363/ 159576 | consumed samples: 139296 | elapsed time per iteration (ms): 15525.0 | learning rate: 3.855E-05 | global batch size: 48 | lm loss: 6.283090E+00 | loss scale: 4096.0 | grad norm: 69167.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42405
+ time (ms)
42406
+ iteration 5364/ 159576 | consumed samples: 139344 | elapsed time per iteration (ms): 15703.9 | learning rate: 3.857E-05 | global batch size: 48 | lm loss: 6.344968E+00 | loss scale: 4096.0 | grad norm: 66351.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42407
+ time (ms)
42408
+ iteration 5365/ 159576 | consumed samples: 139392 | elapsed time per iteration (ms): 15511.9 | learning rate: 3.858E-05 | global batch size: 48 | lm loss: 6.402239E+00 | loss scale: 4096.0 | grad norm: 69893.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42409
+ time (ms)
42410
+ iteration 5366/ 159576 | consumed samples: 139440 | elapsed time per iteration (ms): 15507.6 | learning rate: 3.859E-05 | global batch size: 48 | lm loss: 6.510591E+00 | loss scale: 4096.0 | grad norm: 73294.922 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42411
+ time (ms)
42412
+ iteration 5367/ 159576 | consumed samples: 139488 | elapsed time per iteration (ms): 15841.0 | learning rate: 3.861E-05 | global batch size: 48 | lm loss: 6.292207E+00 | loss scale: 4096.0 | grad norm: 69220.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42413
+ time (ms)
42414
+ iteration 5368/ 159576 | consumed samples: 139536 | elapsed time per iteration (ms): 15748.2 | learning rate: 3.862E-05 | global batch size: 48 | lm loss: 6.492587E+00 | loss scale: 4096.0 | grad norm: 78294.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42415
+ time (ms)
42416
+ iteration 5369/ 159576 | consumed samples: 139584 | elapsed time per iteration (ms): 15492.3 | learning rate: 3.863E-05 | global batch size: 48 | lm loss: 6.493845E+00 | loss scale: 4096.0 | grad norm: 94517.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42417
+ time (ms)
42418
+ iteration 5370/ 159576 | consumed samples: 139632 | elapsed time per iteration (ms): 15493.8 | learning rate: 3.864E-05 | global batch size: 48 | lm loss: 6.430061E+00 | loss scale: 4096.0 | grad norm: 77523.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42419
+ time (ms)
42420
+ iteration 5371/ 159576 | consumed samples: 139680 | elapsed time per iteration (ms): 15870.2 | learning rate: 3.866E-05 | global batch size: 48 | lm loss: 6.411311E+00 | loss scale: 4096.0 | grad norm: 69582.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42421
+ time (ms)
42422
+ iteration 5372/ 159576 | consumed samples: 139728 | elapsed time per iteration (ms): 15517.9 | learning rate: 3.867E-05 | global batch size: 48 | lm loss: 6.515477E+00 | loss scale: 4096.0 | grad norm: 75626.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42423
+ time (ms)
42424
+ iteration 5373/ 159576 | consumed samples: 139776 | elapsed time per iteration (ms): 15491.8 | learning rate: 3.868E-05 | global batch size: 48 | lm loss: 6.453342E+00 | loss scale: 4096.0 | grad norm: 69940.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42425
+ time (ms)
42426
+ iteration 5374/ 159576 | consumed samples: 139824 | elapsed time per iteration (ms): 15511.6 | learning rate: 3.870E-05 | global batch size: 48 | lm loss: 6.378087E+00 | loss scale: 4096.0 | grad norm: 70420.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42427
+ time (ms)
42428
+ iteration 5375/ 159576 | consumed samples: 139872 | elapsed time per iteration (ms): 15836.7 | learning rate: 3.871E-05 | global batch size: 48 | lm loss: 6.371119E+00 | loss scale: 4096.0 | grad norm: 56046.647 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42429
+ time (ms)
42430
+ iteration 5376/ 159576 | consumed samples: 139920 | elapsed time per iteration (ms): 15468.7 | learning rate: 3.872E-05 | global batch size: 48 | lm loss: 6.480386E+00 | loss scale: 4096.0 | grad norm: 67254.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42431
+ time (ms)
42432
+ iteration 5377/ 159576 | consumed samples: 139968 | elapsed time per iteration (ms): 15505.8 | learning rate: 3.874E-05 | global batch size: 48 | lm loss: 6.445705E+00 | loss scale: 4096.0 | grad norm: 58120.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42433
+ time (ms)
42434
+ iteration 5378/ 159576 | consumed samples: 140016 | elapsed time per iteration (ms): 15512.2 | learning rate: 3.875E-05 | global batch size: 48 | lm loss: 6.383876E+00 | loss scale: 4096.0 | grad norm: 63811.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42435
+ time (ms)
42436
+ iteration 5379/ 159576 | consumed samples: 140064 | elapsed time per iteration (ms): 15885.3 | learning rate: 3.876E-05 | global batch size: 48 | lm loss: 6.430426E+00 | loss scale: 4096.0 | grad norm: 71627.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42437
+ time (ms)
42438
+ iteration 5380/ 159576 | consumed samples: 140112 | elapsed time per iteration (ms): 15514.4 | learning rate: 3.878E-05 | global batch size: 48 | lm loss: 6.352599E+00 | loss scale: 4096.0 | grad norm: 55768.573 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42439
+ time (ms)
42440
+ iteration 5381/ 159576 | consumed samples: 140160 | elapsed time per iteration (ms): 15536.5 | learning rate: 3.879E-05 | global batch size: 48 | lm loss: 6.462265E+00 | loss scale: 4096.0 | grad norm: 76307.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42441
+ time (ms)
42442
+ iteration 5382/ 159576 | consumed samples: 140208 | elapsed time per iteration (ms): 15499.8 | learning rate: 3.880E-05 | global batch size: 48 | lm loss: 6.439154E+00 | loss scale: 4096.0 | grad norm: 97619.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42443
+ time (ms)
42444
+ iteration 5383/ 159576 | consumed samples: 140256 | elapsed time per iteration (ms): 15693.9 | learning rate: 3.882E-05 | global batch size: 48 | lm loss: 6.327425E+00 | loss scale: 4096.0 | grad norm: 69803.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42445
+ time (ms)
42446
+ iteration 5384/ 159576 | consumed samples: 140304 | elapsed time per iteration (ms): 15550.5 | learning rate: 3.883E-05 | global batch size: 48 | lm loss: 6.391693E+00 | loss scale: 4096.0 | grad norm: 66211.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42447
+ time (ms)
42448
+ iteration 5385/ 159576 | consumed samples: 140352 | elapsed time per iteration (ms): 15520.0 | learning rate: 3.884E-05 | global batch size: 48 | lm loss: 6.323473E+00 | loss scale: 4096.0 | grad norm: 68034.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42449
+ time (ms)
42450
+ iteration 5386/ 159576 | consumed samples: 140400 | elapsed time per iteration (ms): 15545.0 | learning rate: 3.886E-05 | global batch size: 48 | lm loss: 6.299393E+00 | loss scale: 4096.0 | grad norm: 85492.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42451
+ time (ms)
42452
+ iteration 5387/ 159576 | consumed samples: 140448 | elapsed time per iteration (ms): 15684.9 | learning rate: 3.887E-05 | global batch size: 48 | lm loss: 6.374225E+00 | loss scale: 4096.0 | grad norm: 72949.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42453
+ time (ms)
42454
+ iteration 5388/ 159576 | consumed samples: 140496 | elapsed time per iteration (ms): 15553.2 | learning rate: 3.888E-05 | global batch size: 48 | lm loss: 6.446224E+00 | loss scale: 4096.0 | grad norm: 83315.401 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42455
+ time (ms)
42456
+ iteration 5389/ 159576 | consumed samples: 140544 | elapsed time per iteration (ms): 15520.1 | learning rate: 3.890E-05 | global batch size: 48 | lm loss: 6.336344E+00 | loss scale: 4096.0 | grad norm: 60566.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42457
+ time (ms)
42458
+ iteration 5390/ 159576 | consumed samples: 140592 | elapsed time per iteration (ms): 15438.2 | learning rate: 3.891E-05 | global batch size: 48 | lm loss: 6.437949E+00 | loss scale: 4096.0 | grad norm: 93800.672 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42459
+ time (ms)
42460
+ iteration 5391/ 159576 | consumed samples: 140640 | elapsed time per iteration (ms): 15842.4 | learning rate: 3.892E-05 | global batch size: 48 | lm loss: 6.445059E+00 | loss scale: 4096.0 | grad norm: 67207.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42461
+ time (ms)
42462
+ iteration 5392/ 159576 | consumed samples: 140688 | elapsed time per iteration (ms): 15543.4 | learning rate: 3.894E-05 | global batch size: 48 | lm loss: 6.340952E+00 | loss scale: 4096.0 | grad norm: 92289.634 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42463
+ time (ms)
42464
+ iteration 5393/ 159576 | consumed samples: 140736 | elapsed time per iteration (ms): 15518.9 | learning rate: 3.895E-05 | global batch size: 48 | lm loss: 6.416577E+00 | loss scale: 4096.0 | grad norm: 84099.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42465
+ time (ms)
42466
+ iteration 5394/ 159576 | consumed samples: 140784 | elapsed time per iteration (ms): 15997.3 | learning rate: 3.896E-05 | global batch size: 48 | lm loss: 6.439622E+00 | loss scale: 4096.0 | grad norm: 54809.573 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42467
+ time (ms)
42468
+ iteration 5395/ 159576 | consumed samples: 140832 | elapsed time per iteration (ms): 15450.3 | learning rate: 3.898E-05 | global batch size: 48 | lm loss: 6.441430E+00 | loss scale: 4096.0 | grad norm: 63144.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42469
+ time (ms)
42470
+ iteration 5396/ 159576 | consumed samples: 140880 | elapsed time per iteration (ms): 15568.2 | learning rate: 3.899E-05 | global batch size: 48 | lm loss: 6.424047E+00 | loss scale: 4096.0 | grad norm: 106261.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42471
+ time (ms)
42472
+ iteration 5397/ 159576 | consumed samples: 140928 | elapsed time per iteration (ms): 15464.4 | learning rate: 3.900E-05 | global batch size: 48 | lm loss: 6.325677E+00 | loss scale: 4096.0 | grad norm: 64383.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42473
+ time (ms)
42474
+ iteration 5398/ 159576 | consumed samples: 140976 | elapsed time per iteration (ms): 15883.9 | learning rate: 3.902E-05 | global batch size: 48 | lm loss: 6.582463E+00 | loss scale: 4096.0 | grad norm: 66662.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42475
+ time (ms)
42476
+ iteration 5399/ 159576 | consumed samples: 141024 | elapsed time per iteration (ms): 15497.5 | learning rate: 3.903E-05 | global batch size: 48 | lm loss: 6.498641E+00 | loss scale: 4096.0 | grad norm: 59391.511 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42477
+ time (ms)
42478
+ iteration 5400/ 159576 | consumed samples: 141072 | elapsed time per iteration (ms): 15569.9 | learning rate: 3.904E-05 | global batch size: 48 | lm loss: 6.283938E+00 | loss scale: 4096.0 | grad norm: 64487.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42479
+ time (ms)
42480
+ iteration 5401/ 159576 | consumed samples: 141120 | elapsed time per iteration (ms): 15526.8 | learning rate: 3.906E-05 | global batch size: 48 | lm loss: 6.336715E+00 | loss scale: 4096.0 | grad norm: 57781.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42481
+ time (ms)
42482
+ iteration 5402/ 159576 | consumed samples: 141168 | elapsed time per iteration (ms): 15981.6 | learning rate: 3.907E-05 | global batch size: 48 | lm loss: 6.293415E+00 | loss scale: 4096.0 | grad norm: 92738.567 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42483
+ time (ms)
42484
+ iteration 5403/ 159576 | consumed samples: 141216 | elapsed time per iteration (ms): 15632.0 | learning rate: 3.908E-05 | global batch size: 48 | lm loss: 6.294649E+00 | loss scale: 4096.0 | grad norm: 62910.047 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42485
+ time (ms)
42486
+ iteration 5404/ 159576 | consumed samples: 141264 | elapsed time per iteration (ms): 15497.6 | learning rate: 3.910E-05 | global batch size: 48 | lm loss: 6.331801E+00 | loss scale: 4096.0 | grad norm: 64648.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42487
+ time (ms)
42488
+ iteration 5405/ 159576 | consumed samples: 141312 | elapsed time per iteration (ms): 15498.1 | learning rate: 3.911E-05 | global batch size: 48 | lm loss: 6.406822E+00 | loss scale: 4096.0 | grad norm: 71416.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42489
+ time (ms)
42490
+ iteration 5406/ 159576 | consumed samples: 141360 | elapsed time per iteration (ms): 15867.4 | learning rate: 3.912E-05 | global batch size: 48 | lm loss: 6.404875E+00 | loss scale: 4096.0 | grad norm: 56955.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42491
+ time (ms)
42492
+ iteration 5407/ 159576 | consumed samples: 141408 | elapsed time per iteration (ms): 15506.2 | learning rate: 3.914E-05 | global batch size: 48 | lm loss: 6.428100E+00 | loss scale: 4096.0 | grad norm: 65410.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42493
+ time (ms)
42494
+ iteration 5408/ 159576 | consumed samples: 141456 | elapsed time per iteration (ms): 15573.9 | learning rate: 3.915E-05 | global batch size: 48 | lm loss: 6.352518E+00 | loss scale: 4096.0 | grad norm: 57463.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42495
+ time (ms)
42496
+ iteration 5409/ 159576 | consumed samples: 141504 | elapsed time per iteration (ms): 15570.8 | learning rate: 3.916E-05 | global batch size: 48 | lm loss: 6.276915E+00 | loss scale: 4096.0 | grad norm: 56808.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42497
+ time (ms)
42498
+ iteration 5410/ 159576 | consumed samples: 141552 | elapsed time per iteration (ms): 15647.9 | learning rate: 3.918E-05 | global batch size: 48 | lm loss: 6.388402E+00 | loss scale: 4096.0 | grad norm: 55831.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42499
+ time (ms)
42500
+ iteration 5411/ 159576 | consumed samples: 141600 | elapsed time per iteration (ms): 15527.8 | learning rate: 3.919E-05 | global batch size: 48 | lm loss: 6.359324E+00 | loss scale: 4096.0 | grad norm: 58176.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42501
+ time (ms)
42502
+ iteration 5412/ 159576 | consumed samples: 141648 | elapsed time per iteration (ms): 15485.9 | learning rate: 3.920E-05 | global batch size: 48 | lm loss: 6.410316E+00 | loss scale: 4096.0 | grad norm: 58797.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42503
+ time (ms)
42504
+ iteration 5413/ 159576 | consumed samples: 141696 | elapsed time per iteration (ms): 15570.6 | learning rate: 3.922E-05 | global batch size: 48 | lm loss: 6.487602E+00 | loss scale: 4096.0 | grad norm: 54779.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42505
+ time (ms)
42506
+ iteration 5414/ 159576 | consumed samples: 141744 | elapsed time per iteration (ms): 15692.4 | learning rate: 3.923E-05 | global batch size: 48 | lm loss: 6.538764E+00 | loss scale: 4096.0 | grad norm: 56952.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42507
+ time (ms)
42508
+ iteration 5415/ 159576 | consumed samples: 141808 | elapsed time per iteration (ms): 16423.4 | learning rate: 3.925E-05 | global batch size: 64 | lm loss: 6.468464E+00 | loss scale: 4096.0 | grad norm: 47962.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42509
+ time (ms)
42510
+ iteration 5416/ 159576 | consumed samples: 141872 | elapsed time per iteration (ms): 16486.4 | learning rate: 3.927E-05 | global batch size: 64 | lm loss: 6.358836E+00 | loss scale: 4096.0 | grad norm: 79746.041 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42511
+ time (ms)
42512
+ iteration 5417/ 159576 | consumed samples: 141936 | elapsed time per iteration (ms): 16837.9 | learning rate: 3.928E-05 | global batch size: 64 | lm loss: 6.458796E+00 | loss scale: 4096.0 | grad norm: 72485.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42513
+ time (ms)
42514
+ iteration 5418/ 159576 | consumed samples: 142000 | elapsed time per iteration (ms): 16282.1 | learning rate: 3.930E-05 | global batch size: 64 | lm loss: 6.325031E+00 | loss scale: 4096.0 | grad norm: 50657.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42515
+ time (ms)
42516
+ iteration 5419/ 159576 | consumed samples: 142064 | elapsed time per iteration (ms): 16473.5 | learning rate: 3.932E-05 | global batch size: 64 | lm loss: 6.393603E+00 | loss scale: 4096.0 | grad norm: 53317.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42517
+ time (ms)
42518
+ iteration 5420/ 159576 | consumed samples: 142128 | elapsed time per iteration (ms): 16358.3 | learning rate: 3.934E-05 | global batch size: 64 | lm loss: 6.505975E+00 | loss scale: 4096.0 | grad norm: 76759.970 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42519
+ time (ms)
42520
+ iteration 5421/ 159576 | consumed samples: 142192 | elapsed time per iteration (ms): 16646.9 | learning rate: 3.936E-05 | global batch size: 64 | lm loss: 6.377459E+00 | loss scale: 4096.0 | grad norm: 61658.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42521
+ time (ms)
42522
+ iteration 5422/ 159576 | consumed samples: 142256 | elapsed time per iteration (ms): 16480.4 | learning rate: 3.937E-05 | global batch size: 64 | lm loss: 6.350579E+00 | loss scale: 4096.0 | grad norm: 61672.596 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42523
+ time (ms)
42524
+ iteration 5423/ 159576 | consumed samples: 142320 | elapsed time per iteration (ms): 16500.8 | learning rate: 3.939E-05 | global batch size: 64 | lm loss: 6.359305E+00 | loss scale: 4096.0 | grad norm: 71934.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42525
+ time (ms)
42526
+ iteration 5424/ 159576 | consumed samples: 142384 | elapsed time per iteration (ms): 16400.7 | learning rate: 3.941E-05 | global batch size: 64 | lm loss: 6.515474E+00 | loss scale: 4096.0 | grad norm: 62262.598 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42527
+ time (ms)
42528
+ iteration 5425/ 159576 | consumed samples: 142448 | elapsed time per iteration (ms): 16686.7 | learning rate: 3.943E-05 | global batch size: 64 | lm loss: 6.377324E+00 | loss scale: 4096.0 | grad norm: 66128.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42529
+ time (ms)
42530
+ iteration 5426/ 159576 | consumed samples: 142512 | elapsed time per iteration (ms): 16346.9 | learning rate: 3.944E-05 | global batch size: 64 | lm loss: 6.394655E+00 | loss scale: 4096.0 | grad norm: 64276.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42531
+ time (ms)
42532
+ iteration 5427/ 159576 | consumed samples: 142576 | elapsed time per iteration (ms): 16454.0 | learning rate: 3.946E-05 | global batch size: 64 | lm loss: 6.417256E+00 | loss scale: 4096.0 | grad norm: 55916.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42533
+ time (ms)
42534
+ iteration 5428/ 159576 | consumed samples: 142640 | elapsed time per iteration (ms): 16713.8 | learning rate: 3.948E-05 | global batch size: 64 | lm loss: 6.314127E+00 | loss scale: 4096.0 | grad norm: 65443.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42535
+ time (ms)
42536
+ iteration 5429/ 159576 | consumed samples: 142704 | elapsed time per iteration (ms): 16492.7 | learning rate: 3.950E-05 | global batch size: 64 | lm loss: 6.349669E+00 | loss scale: 4096.0 | grad norm: 64819.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42537
+ time (ms)
42538
+ iteration 5430/ 159576 | consumed samples: 142768 | elapsed time per iteration (ms): 16430.1 | learning rate: 3.951E-05 | global batch size: 64 | lm loss: 6.406096E+00 | loss scale: 4096.0 | grad norm: 72027.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42539
+ time (ms)
42540
+ iteration 5431/ 159576 | consumed samples: 142832 | elapsed time per iteration (ms): 16452.9 | learning rate: 3.953E-05 | global batch size: 64 | lm loss: 6.422045E+00 | loss scale: 4096.0 | grad norm: 59470.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42541
+ time (ms)
42542
+ iteration 5432/ 159576 | consumed samples: 142896 | elapsed time per iteration (ms): 16574.0 | learning rate: 3.955E-05 | global batch size: 64 | lm loss: 6.384964E+00 | loss scale: 4096.0 | grad norm: 59229.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42543
+ time (ms)
42544
+ iteration 5433/ 159576 | consumed samples: 142960 | elapsed time per iteration (ms): 16448.4 | learning rate: 3.957E-05 | global batch size: 64 | lm loss: 6.388242E+00 | loss scale: 4096.0 | grad norm: 51139.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42545
+ time (ms)
42546
+ iteration 5434/ 159576 | consumed samples: 143024 | elapsed time per iteration (ms): 16378.2 | learning rate: 3.959E-05 | global batch size: 64 | lm loss: 6.422913E+00 | loss scale: 4096.0 | grad norm: 55548.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42547
+ time (ms)
42548
+ iteration 5435/ 159576 | consumed samples: 143088 | elapsed time per iteration (ms): 16838.8 | learning rate: 3.960E-05 | global batch size: 64 | lm loss: 6.399693E+00 | loss scale: 4096.0 | grad norm: 87728.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42549
+ time (ms)
42550
+ iteration 5436/ 159576 | consumed samples: 143152 | elapsed time per iteration (ms): 16458.9 | learning rate: 3.962E-05 | global batch size: 64 | lm loss: 6.291359E+00 | loss scale: 4096.0 | grad norm: 65955.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42551
+ time (ms)
42552
+ iteration 5437/ 159576 | consumed samples: 143216 | elapsed time per iteration (ms): 16425.2 | learning rate: 3.964E-05 | global batch size: 64 | lm loss: 6.367932E+00 | loss scale: 4096.0 | grad norm: 63150.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42553
+ time (ms)
42554
+ iteration 5438/ 159576 | consumed samples: 143280 | elapsed time per iteration (ms): 16418.8 | learning rate: 3.966E-05 | global batch size: 64 | lm loss: 6.365756E+00 | loss scale: 4096.0 | grad norm: 57427.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42555
+ time (ms)
42556
+ iteration 5439/ 159576 | consumed samples: 143344 | elapsed time per iteration (ms): 16802.3 | learning rate: 3.967E-05 | global batch size: 64 | lm loss: 6.415596E+00 | loss scale: 4096.0 | grad norm: 61605.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42557
+ time (ms)
42558
+ iteration 5440/ 159576 | consumed samples: 143408 | elapsed time per iteration (ms): 16516.9 | learning rate: 3.969E-05 | global batch size: 64 | lm loss: 6.414165E+00 | loss scale: 4096.0 | grad norm: 64434.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42559
+ time (ms)
42560
+ iteration 5441/ 159576 | consumed samples: 143472 | elapsed time per iteration (ms): 16398.0 | learning rate: 3.971E-05 | global batch size: 64 | lm loss: 6.425170E+00 | loss scale: 4096.0 | grad norm: 63830.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42561
+ time (ms)
42562
+ iteration 5442/ 159576 | consumed samples: 143536 | elapsed time per iteration (ms): 16330.0 | learning rate: 3.973E-05 | global batch size: 64 | lm loss: 6.420317E+00 | loss scale: 4096.0 | grad norm: 80818.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42563
+ time (ms)
42564
+ iteration 5443/ 159576 | consumed samples: 143600 | elapsed time per iteration (ms): 16646.2 | learning rate: 3.975E-05 | global batch size: 64 | lm loss: 6.404300E+00 | loss scale: 4096.0 | grad norm: 66058.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42565
+ time (ms)
42566
+ iteration 5444/ 159576 | consumed samples: 143664 | elapsed time per iteration (ms): 16389.9 | learning rate: 3.976E-05 | global batch size: 64 | lm loss: 6.307170E+00 | loss scale: 4096.0 | grad norm: 64553.082 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42567
+ time (ms)
42568
+ iteration 5445/ 159576 | consumed samples: 143728 | elapsed time per iteration (ms): 16425.8 | learning rate: 3.978E-05 | global batch size: 64 | lm loss: 6.474117E+00 | loss scale: 4096.0 | grad norm: 54414.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42569
+ time (ms)
42570
+ iteration 5446/ 159576 | consumed samples: 143792 | elapsed time per iteration (ms): 16855.6 | learning rate: 3.980E-05 | global batch size: 64 | lm loss: 6.329272E+00 | loss scale: 4096.0 | grad norm: 67896.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42571
+ time (ms)
42572
+ iteration 5447/ 159576 | consumed samples: 143856 | elapsed time per iteration (ms): 16363.1 | learning rate: 3.982E-05 | global batch size: 64 | lm loss: 6.485427E+00 | loss scale: 4096.0 | grad norm: 55200.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42573
+ time (ms)
42574
+ iteration 5448/ 159576 | consumed samples: 143920 | elapsed time per iteration (ms): 16446.4 | learning rate: 3.983E-05 | global batch size: 64 | lm loss: 6.474103E+00 | loss scale: 4096.0 | grad norm: 58759.422 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42575
+ time (ms)
42576
+ iteration 5449/ 159576 | consumed samples: 143984 | elapsed time per iteration (ms): 16365.5 | learning rate: 3.985E-05 | global batch size: 64 | lm loss: 6.386650E+00 | loss scale: 4096.0 | grad norm: 69075.558 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42577
+ time (ms)
42578
+ iteration 5450/ 159576 | consumed samples: 144048 | elapsed time per iteration (ms): 16855.4 | learning rate: 3.987E-05 | global batch size: 64 | lm loss: 6.407839E+00 | loss scale: 4096.0 | grad norm: 76751.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42579
+ time (ms)
42580
+ iteration 5451/ 159576 | consumed samples: 144112 | elapsed time per iteration (ms): 16481.2 | learning rate: 3.989E-05 | global batch size: 64 | lm loss: 6.437217E+00 | loss scale: 4096.0 | grad norm: 60762.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42581
+ time (ms)
42582
+ iteration 5452/ 159576 | consumed samples: 144176 | elapsed time per iteration (ms): 16387.3 | learning rate: 3.991E-05 | global batch size: 64 | lm loss: 6.391966E+00 | loss scale: 4096.0 | grad norm: 57835.999 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42583
+ time (ms)
42584
+ iteration 5453/ 159576 | consumed samples: 144240 | elapsed time per iteration (ms): 16456.9 | learning rate: 3.992E-05 | global batch size: 64 | lm loss: 6.407461E+00 | loss scale: 4096.0 | grad norm: 56276.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42585
+ time (ms)
42586
+ iteration 5454/ 159576 | consumed samples: 144304 | elapsed time per iteration (ms): 16533.3 | learning rate: 3.994E-05 | global batch size: 64 | lm loss: 6.319425E+00 | loss scale: 4096.0 | grad norm: 66856.562 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42587
+ time (ms)
42588
+ iteration 5455/ 159576 | consumed samples: 144368 | elapsed time per iteration (ms): 16417.1 | learning rate: 3.996E-05 | global batch size: 64 | lm loss: 6.377168E+00 | loss scale: 4096.0 | grad norm: 53863.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42589
+ time (ms)
42590
+ iteration 5456/ 159576 | consumed samples: 144432 | elapsed time per iteration (ms): 16422.1 | learning rate: 3.998E-05 | global batch size: 64 | lm loss: 6.368913E+00 | loss scale: 4096.0 | grad norm: 63261.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42591
+ time (ms)
42592
+ iteration 5457/ 159576 | consumed samples: 144496 | elapsed time per iteration (ms): 16738.2 | learning rate: 3.999E-05 | global batch size: 64 | lm loss: 6.264383E+00 | loss scale: 4096.0 | grad norm: 64656.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42593
+ time (ms)
42594
+ iteration 5458/ 159576 | consumed samples: 144560 | elapsed time per iteration (ms): 16315.9 | learning rate: 4.001E-05 | global batch size: 64 | lm loss: 6.410008E+00 | loss scale: 4096.0 | grad norm: 82472.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42595
+ time (ms)
42596
+ iteration 5459/ 159576 | consumed samples: 144624 | elapsed time per iteration (ms): 16385.7 | learning rate: 4.003E-05 | global batch size: 64 | lm loss: 6.419100E+00 | loss scale: 4096.0 | grad norm: 81581.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42597
+ time (ms)
42598
+ iteration 5460/ 159576 | consumed samples: 144688 | elapsed time per iteration (ms): 16422.6 | learning rate: 4.005E-05 | global batch size: 64 | lm loss: 6.374327E+00 | loss scale: 4096.0 | grad norm: 77883.993 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42599
+ time (ms)
42600
+ iteration 5461/ 159576 | consumed samples: 144752 | elapsed time per iteration (ms): 16514.0 | learning rate: 4.007E-05 | global batch size: 64 | lm loss: 6.323710E+00 | loss scale: 4096.0 | grad norm: 59535.385 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42601
+ time (ms)
42602
+ iteration 5462/ 159576 | consumed samples: 144816 | elapsed time per iteration (ms): 16520.4 | learning rate: 4.008E-05 | global batch size: 64 | lm loss: 6.325150E+00 | loss scale: 4096.0 | grad norm: 54807.099 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42603
+ time (ms)
42604
+ iteration 5463/ 159576 | consumed samples: 144880 | elapsed time per iteration (ms): 16362.9 | learning rate: 4.010E-05 | global batch size: 64 | lm loss: 6.461391E+00 | loss scale: 4096.0 | grad norm: 74839.084 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42605
+ time (ms)
42606
+ iteration 5464/ 159576 | consumed samples: 144944 | elapsed time per iteration (ms): 16408.3 | learning rate: 4.012E-05 | global batch size: 64 | lm loss: 6.392217E+00 | loss scale: 4096.0 | grad norm: 61727.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42607
+ time (ms)
42608
+ iteration 5465/ 159576 | consumed samples: 145008 | elapsed time per iteration (ms): 16556.8 | learning rate: 4.014E-05 | global batch size: 64 | lm loss: 6.349445E+00 | loss scale: 4096.0 | grad norm: 90938.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42609
+ time (ms)
42610
+ iteration 5466/ 159576 | consumed samples: 145072 | elapsed time per iteration (ms): 16389.1 | learning rate: 4.015E-05 | global batch size: 64 | lm loss: 6.314983E+00 | loss scale: 4096.0 | grad norm: 62408.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42611
+ time (ms)
42612
+ iteration 5467/ 159576 | consumed samples: 145136 | elapsed time per iteration (ms): 16364.1 | learning rate: 4.017E-05 | global batch size: 64 | lm loss: 6.412921E+00 | loss scale: 4096.0 | grad norm: 82535.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42613
+ time (ms)
42614
+ iteration 5468/ 159576 | consumed samples: 145200 | elapsed time per iteration (ms): 16712.9 | learning rate: 4.019E-05 | global batch size: 64 | lm loss: 6.508467E+00 | loss scale: 4096.0 | grad norm: 53388.956 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42615
+ time (ms)
42616
+ iteration 5469/ 159576 | consumed samples: 145264 | elapsed time per iteration (ms): 16357.7 | learning rate: 4.021E-05 | global batch size: 64 | lm loss: 6.367021E+00 | loss scale: 4096.0 | grad norm: 88053.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42617
+ time (ms)
42618
+ iteration 5470/ 159576 | consumed samples: 145328 | elapsed time per iteration (ms): 16424.7 | learning rate: 4.022E-05 | global batch size: 64 | lm loss: 6.396588E+00 | loss scale: 4096.0 | grad norm: 83281.076 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42619
+ time (ms)
42620
+ iteration 5471/ 159576 | consumed samples: 145392 | elapsed time per iteration (ms): 16363.6 | learning rate: 4.024E-05 | global batch size: 64 | lm loss: 6.387273E+00 | loss scale: 4096.0 | grad norm: 56875.433 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42621
+ time (ms)
42622
+ iteration 5472/ 159576 | consumed samples: 145456 | elapsed time per iteration (ms): 16523.2 | learning rate: 4.026E-05 | global batch size: 64 | lm loss: 6.456463E+00 | loss scale: 4096.0 | grad norm: 60270.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42623
+ time (ms)
42624
+ iteration 5473/ 159576 | consumed samples: 145520 | elapsed time per iteration (ms): 16398.7 | learning rate: 4.028E-05 | global batch size: 64 | lm loss: 6.460003E+00 | loss scale: 4096.0 | grad norm: 61151.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42625
+ time (ms)
42626
+ iteration 5474/ 159576 | consumed samples: 145584 | elapsed time per iteration (ms): 16345.5 | learning rate: 4.030E-05 | global batch size: 64 | lm loss: 6.443559E+00 | loss scale: 4096.0 | grad norm: 83130.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42627
+ time (ms)
42628
+ iteration 5475/ 159576 | consumed samples: 145648 | elapsed time per iteration (ms): 16591.9 | learning rate: 4.031E-05 | global batch size: 64 | lm loss: 6.454519E+00 | loss scale: 4096.0 | grad norm: 69198.394 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42629
+ time (ms)
42630
+ iteration 5476/ 159576 | consumed samples: 145712 | elapsed time per iteration (ms): 16643.0 | learning rate: 4.033E-05 | global batch size: 64 | lm loss: 6.424469E+00 | loss scale: 4096.0 | grad norm: 57626.523 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42631
+ time (ms)
42632
+ iteration 5477/ 159576 | consumed samples: 145776 | elapsed time per iteration (ms): 16362.1 | learning rate: 4.035E-05 | global batch size: 64 | lm loss: 6.285227E+00 | loss scale: 4096.0 | grad norm: 87864.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42633
+ time (ms)
42634
+ iteration 5478/ 159576 | consumed samples: 145840 | elapsed time per iteration (ms): 16435.9 | learning rate: 4.037E-05 | global batch size: 64 | lm loss: 6.372074E+00 | loss scale: 4096.0 | grad norm: 67542.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42635
+ time (ms)
42636
+ iteration 5479/ 159576 | consumed samples: 145904 | elapsed time per iteration (ms): 16597.3 | learning rate: 4.038E-05 | global batch size: 64 | lm loss: 6.438199E+00 | loss scale: 4096.0 | grad norm: 74119.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42637
+ time (ms)
42638
+ iteration 5480/ 159576 | consumed samples: 145968 | elapsed time per iteration (ms): 16483.8 | learning rate: 4.040E-05 | global batch size: 64 | lm loss: 6.487626E+00 | loss scale: 4096.0 | grad norm: 68136.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
42639
+ time (ms)
42640
+ [2021-09-25 00:07:47] PULSE: tr8-104B is waiting for the previous job to finish before scheduling a new one using the dependency mechanism (1165978_[1-10%1] on 'gpu_p13' partition)
42641
+ [2021-09-25 00:07:47] PULSE: tr8-104B is running for 18:15:36 since 2021-09-24T05:52:11 (1162855_1 on 'gpu_p13' partition (r6i4n[5,7],r6i5n[2,7-8],r6i6n[0,2,6],r7i2n[4-5],r7i6n[2-4],r7i7n[7-8],r8i0n[2-3,5-8],r8i1n[0,2-4],r8i2n8,r8i3n[0-2],r8i5n[3-4],r8i7n[3-8],r9i0n[0-2],r9i1n[0-3],r9i2n[3-5,8],r9i3n[0-1,7-8],r9i4n[0-2],r9i5n[3-8],r9i6n[0,7-8])