bigscience-bot committed on
Commit a670b7a · 1 Parent(s): 7f31139
Files changed (1)
  1. logs/main_log.txt +497 -0
logs/main_log.txt CHANGED
@@ -39307,3 +39307,500 @@ time (ms)
39307
  [2021-09-24 17:07:03] PULSE: tr8-104B is running for 11:14:52 since 2021-09-24T05:52:11 (1162855_1 on 'gpu_p13' partition (r6i4n[5,7],r6i5n[2,7-8],r6i6n[0,2,6],r7i2n[4-5],r7i6n[2-4],r7i7n[7-8],r8i0n[2-3,5-8],r8i1n[0,2-4],r8i2n8,r8i3n[0-2],r8i5n[3-4],r8i7n[3-8],r9i0n[0-2],r9i1n[0-3],r9i2n[3-5,8],r9i3n[0-1,7-8],r9i4n[0-2],r9i5n[3-8],r9i6n[0,7-8])
39308
  iteration 3827/ 159576 | consumed samples: 75216 | elapsed time per iteration (ms): 14531.9 | learning rate: 2.083E-05 | global batch size: 32 | lm loss: 6.427704E+00 | loss scale: 16384.0 | grad norm: 68943.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39309
  time (ms)
39310
+ iteration 3828/ 159576 | consumed samples: 75248 | elapsed time per iteration (ms): 14988.1 | learning rate: 2.084E-05 | global batch size: 32 | lm loss: 6.347779E+00 | loss scale: 16384.0 | grad norm: 64095.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39311
+ time (ms)
39312
+ iteration 3829/ 159576 | consumed samples: 75280 | elapsed time per iteration (ms): 14665.9 | learning rate: 2.084E-05 | global batch size: 32 | lm loss: 6.411919E+00 | loss scale: 16384.0 | grad norm: 82008.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39313
+ time (ms)
39314
+ iteration 3830/ 159576 | consumed samples: 75312 | elapsed time per iteration (ms): 14539.9 | learning rate: 2.085E-05 | global batch size: 32 | lm loss: 6.458866E+00 | loss scale: 16384.0 | grad norm: 67971.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39315
+ time (ms)
39316
+ iteration 3831/ 159576 | consumed samples: 75344 | elapsed time per iteration (ms): 14600.2 | learning rate: 2.086E-05 | global batch size: 32 | lm loss: 6.450158E+00 | loss scale: 16384.0 | grad norm: 59376.432 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39317
+ time (ms)
39318
+ iteration 3832/ 159576 | consumed samples: 75376 | elapsed time per iteration (ms): 14931.8 | learning rate: 2.087E-05 | global batch size: 32 | lm loss: 6.537256E+00 | loss scale: 16384.0 | grad norm: 77538.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39319
+ time (ms)
39320
+ iteration 3833/ 159576 | consumed samples: 75408 | elapsed time per iteration (ms): 14592.6 | learning rate: 2.088E-05 | global batch size: 32 | lm loss: 6.392985E+00 | loss scale: 16384.0 | grad norm: 84275.600 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39321
+ time (ms)
39322
+ iteration 3834/ 159576 | consumed samples: 75440 | elapsed time per iteration (ms): 14616.6 | learning rate: 2.089E-05 | global batch size: 32 | lm loss: 6.512251E+00 | loss scale: 16384.0 | grad norm: 80167.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39323
+ time (ms)
39324
+ iteration 3835/ 159576 | consumed samples: 75472 | elapsed time per iteration (ms): 14584.0 | learning rate: 2.090E-05 | global batch size: 32 | lm loss: 6.467295E+00 | loss scale: 16384.0 | grad norm: 85124.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39325
+ time (ms)
39326
+ iteration 3836/ 159576 | consumed samples: 75504 | elapsed time per iteration (ms): 14844.3 | learning rate: 2.091E-05 | global batch size: 32 | lm loss: 6.514040E+00 | loss scale: 16384.0 | grad norm: 71539.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39327
+ time (ms)
39328
+ iteration 3837/ 159576 | consumed samples: 75536 | elapsed time per iteration (ms): 14618.8 | learning rate: 2.092E-05 | global batch size: 32 | lm loss: 6.519591E+00 | loss scale: 16384.0 | grad norm: 89173.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39329
+ time (ms)
39330
+ iteration 3838/ 159576 | consumed samples: 75568 | elapsed time per iteration (ms): 14566.0 | learning rate: 2.092E-05 | global batch size: 32 | lm loss: 6.447284E+00 | loss scale: 16384.0 | grad norm: 86030.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39331
+ time (ms)
39332
+ iteration 3839/ 159576 | consumed samples: 75600 | elapsed time per iteration (ms): 14636.3 | learning rate: 2.093E-05 | global batch size: 32 | lm loss: 6.369718E+00 | loss scale: 16384.0 | grad norm: 66275.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39333
+ time (ms)
39334
+ iteration 3840/ 159576 | consumed samples: 75632 | elapsed time per iteration (ms): 14897.9 | learning rate: 2.094E-05 | global batch size: 32 | lm loss: 6.467171E+00 | loss scale: 16384.0 | grad norm: 82043.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39335
+ time (ms)
39336
+ iteration 3841/ 159576 | consumed samples: 75664 | elapsed time per iteration (ms): 14554.8 | learning rate: 2.095E-05 | global batch size: 32 | lm loss: 6.458669E+00 | loss scale: 16384.0 | grad norm: 73761.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39337
+ time (ms)
39338
+ iteration 3842/ 159576 | consumed samples: 75696 | elapsed time per iteration (ms): 14564.2 | learning rate: 2.096E-05 | global batch size: 32 | lm loss: 6.516797E+00 | loss scale: 16384.0 | grad norm: 83647.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39339
+ time (ms)
39340
+ iteration 3843/ 159576 | consumed samples: 75728 | elapsed time per iteration (ms): 14464.9 | learning rate: 2.097E-05 | global batch size: 32 | lm loss: 6.381551E+00 | loss scale: 16384.0 | grad norm: 58297.000 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39341
+ time (ms)
39342
+ iteration 3844/ 159576 | consumed samples: 75760 | elapsed time per iteration (ms): 14942.4 | learning rate: 2.098E-05 | global batch size: 32 | lm loss: 6.471825E+00 | loss scale: 16384.0 | grad norm: 82881.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39343
+ time (ms)
39344
+ iteration 3845/ 159576 | consumed samples: 75792 | elapsed time per iteration (ms): 14531.3 | learning rate: 2.099E-05 | global batch size: 32 | lm loss: 6.528457E+00 | loss scale: 16384.0 | grad norm: 67296.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39345
+ time (ms)
39346
+ iteration 3846/ 159576 | consumed samples: 75824 | elapsed time per iteration (ms): 14601.9 | learning rate: 2.100E-05 | global batch size: 32 | lm loss: 6.408827E+00 | loss scale: 16384.0 | grad norm: 67512.624 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39347
+ time (ms)
39348
+ iteration 3847/ 159576 | consumed samples: 75856 | elapsed time per iteration (ms): 14580.2 | learning rate: 2.100E-05 | global batch size: 32 | lm loss: 6.440091E+00 | loss scale: 16384.0 | grad norm: 78400.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39349
+ time (ms)
39350
+ iteration 3848/ 159576 | consumed samples: 75888 | elapsed time per iteration (ms): 14911.9 | learning rate: 2.101E-05 | global batch size: 32 | lm loss: 6.374573E+00 | loss scale: 16384.0 | grad norm: 85886.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39351
+ time (ms)
39352
+ iteration 3849/ 159576 | consumed samples: 75920 | elapsed time per iteration (ms): 14768.3 | learning rate: 2.102E-05 | global batch size: 32 | lm loss: 6.529835E+00 | loss scale: 16384.0 | grad norm: 71394.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39353
+ time (ms)
39354
+ iteration 3850/ 159576 | consumed samples: 75952 | elapsed time per iteration (ms): 14553.3 | learning rate: 2.103E-05 | global batch size: 32 | lm loss: 6.455585E+00 | loss scale: 16384.0 | grad norm: 67772.089 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39355
+ time (ms)
39356
+ iteration 3851/ 159576 | consumed samples: 75984 | elapsed time per iteration (ms): 14574.9 | learning rate: 2.104E-05 | global batch size: 32 | lm loss: 6.428284E+00 | loss scale: 16384.0 | grad norm: 110864.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39357
+ time (ms)
39358
+ iteration 3852/ 159576 | consumed samples: 76016 | elapsed time per iteration (ms): 14592.6 | learning rate: 2.105E-05 | global batch size: 32 | lm loss: 6.457644E+00 | loss scale: 16384.0 | grad norm: 73499.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39359
+ time (ms)
39360
+ iteration 3853/ 159576 | consumed samples: 76048 | elapsed time per iteration (ms): 14780.7 | learning rate: 2.106E-05 | global batch size: 32 | lm loss: 6.459057E+00 | loss scale: 16384.0 | grad norm: 71503.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39361
+ time (ms)
39362
+ iteration 3854/ 159576 | consumed samples: 76080 | elapsed time per iteration (ms): 14631.9 | learning rate: 2.107E-05 | global batch size: 32 | lm loss: 6.522111E+00 | loss scale: 16384.0 | grad norm: 73205.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39363
+ time (ms)
39364
+ iteration 3855/ 159576 | consumed samples: 76112 | elapsed time per iteration (ms): 14685.7 | learning rate: 2.108E-05 | global batch size: 32 | lm loss: 6.444643E+00 | loss scale: 16384.0 | grad norm: 70169.559 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39365
+ time (ms)
39366
+ iteration 3856/ 159576 | consumed samples: 76144 | elapsed time per iteration (ms): 14534.2 | learning rate: 2.108E-05 | global batch size: 32 | lm loss: 6.392300E+00 | loss scale: 16384.0 | grad norm: 81224.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39367
+ time (ms)
39368
+ iteration 3857/ 159576 | consumed samples: 76176 | elapsed time per iteration (ms): 14734.9 | learning rate: 2.109E-05 | global batch size: 32 | lm loss: 6.474737E+00 | loss scale: 16384.0 | grad norm: 76429.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39369
+ time (ms)
39370
+ iteration 3858/ 159576 | consumed samples: 76208 | elapsed time per iteration (ms): 14589.1 | learning rate: 2.110E-05 | global batch size: 32 | lm loss: 6.481500E+00 | loss scale: 16384.0 | grad norm: 76288.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39371
+ time (ms)
39372
+ iteration 3859/ 159576 | consumed samples: 76240 | elapsed time per iteration (ms): 14536.6 | learning rate: 2.111E-05 | global batch size: 32 | lm loss: 6.504058E+00 | loss scale: 16384.0 | grad norm: 75104.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39373
+ time (ms)
39374
+ iteration 3860/ 159576 | consumed samples: 76272 | elapsed time per iteration (ms): 14557.4 | learning rate: 2.112E-05 | global batch size: 32 | lm loss: 6.616935E+00 | loss scale: 16384.0 | grad norm: 73471.312 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39375
+ time (ms)
39376
+ iteration 3861/ 159576 | consumed samples: 76304 | elapsed time per iteration (ms): 14996.3 | learning rate: 2.113E-05 | global batch size: 32 | lm loss: 6.437632E+00 | loss scale: 16384.0 | grad norm: 100626.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39377
+ time (ms)
39378
+ iteration 3862/ 159576 | consumed samples: 76336 | elapsed time per iteration (ms): 14610.8 | learning rate: 2.114E-05 | global batch size: 32 | lm loss: 6.358921E+00 | loss scale: 16384.0 | grad norm: 84367.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39379
+ time (ms)
39380
+ iteration 3863/ 159576 | consumed samples: 76368 | elapsed time per iteration (ms): 14574.0 | learning rate: 2.115E-05 | global batch size: 32 | lm loss: 6.489450E+00 | loss scale: 16384.0 | grad norm: 111308.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39381
+ time (ms)
39382
+ iteration 3864/ 159576 | consumed samples: 76400 | elapsed time per iteration (ms): 14585.8 | learning rate: 2.116E-05 | global batch size: 32 | lm loss: 6.579299E+00 | loss scale: 16384.0 | grad norm: 71685.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39383
+ time (ms)
39384
+ iteration 3865/ 159576 | consumed samples: 76432 | elapsed time per iteration (ms): 14801.5 | learning rate: 2.116E-05 | global batch size: 32 | lm loss: 6.356242E+00 | loss scale: 16384.0 | grad norm: 68636.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39385
+ time (ms)
39386
+ iteration 3866/ 159576 | consumed samples: 76464 | elapsed time per iteration (ms): 14581.8 | learning rate: 2.117E-05 | global batch size: 32 | lm loss: 6.583051E+00 | loss scale: 16384.0 | grad norm: 83498.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39387
+ time (ms)
39388
+ iteration 3867/ 159576 | consumed samples: 76496 | elapsed time per iteration (ms): 14548.1 | learning rate: 2.118E-05 | global batch size: 32 | lm loss: 6.414474E+00 | loss scale: 16384.0 | grad norm: 70120.527 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39389
+ time (ms)
39390
+ iteration 3868/ 159576 | consumed samples: 76528 | elapsed time per iteration (ms): 14581.2 | learning rate: 2.119E-05 | global batch size: 32 | lm loss: 6.383676E+00 | loss scale: 16384.0 | grad norm: 65625.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39391
+ time (ms)
39392
+ iteration 3869/ 159576 | consumed samples: 76560 | elapsed time per iteration (ms): 14975.0 | learning rate: 2.120E-05 | global batch size: 32 | lm loss: 6.553302E+00 | loss scale: 16384.0 | grad norm: 78443.319 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39393
+ time (ms)
39394
+ iteration 3870/ 159576 | consumed samples: 76592 | elapsed time per iteration (ms): 14654.1 | learning rate: 2.121E-05 | global batch size: 32 | lm loss: 6.525763E+00 | loss scale: 16384.0 | grad norm: 74575.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39395
+ time (ms)
39396
+ iteration 3871/ 159576 | consumed samples: 76624 | elapsed time per iteration (ms): 14658.5 | learning rate: 2.122E-05 | global batch size: 32 | lm loss: 6.416959E+00 | loss scale: 16384.0 | grad norm: 61001.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39397
+ time (ms)
39398
+ iteration 3872/ 159576 | consumed samples: 76656 | elapsed time per iteration (ms): 14544.3 | learning rate: 2.123E-05 | global batch size: 32 | lm loss: 6.516649E+00 | loss scale: 16384.0 | grad norm: 76582.538 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39399
+ time (ms)
39400
+ iteration 3873/ 159576 | consumed samples: 76688 | elapsed time per iteration (ms): 14961.2 | learning rate: 2.124E-05 | global batch size: 32 | lm loss: 6.532383E+00 | loss scale: 16384.0 | grad norm: 98540.585 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39401
+ time (ms)
39402
+ iteration 3874/ 159576 | consumed samples: 76720 | elapsed time per iteration (ms): 14595.7 | learning rate: 2.124E-05 | global batch size: 32 | lm loss: 6.589262E+00 | loss scale: 16384.0 | grad norm: 90020.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39403
+ time (ms)
39404
+ iteration 3875/ 159576 | consumed samples: 76752 | elapsed time per iteration (ms): 14549.8 | learning rate: 2.125E-05 | global batch size: 32 | lm loss: 6.475612E+00 | loss scale: 16384.0 | grad norm: 71253.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39405
+ time (ms)
39406
+ iteration 3876/ 159576 | consumed samples: 76784 | elapsed time per iteration (ms): 14539.7 | learning rate: 2.126E-05 | global batch size: 32 | lm loss: 6.477540E+00 | loss scale: 16384.0 | grad norm: 113904.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39407
+ time (ms)
39408
+ iteration 3877/ 159576 | consumed samples: 76816 | elapsed time per iteration (ms): 14922.4 | learning rate: 2.127E-05 | global batch size: 32 | lm loss: 6.475825E+00 | loss scale: 16384.0 | grad norm: 59736.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39409
+ time (ms)
39410
+ iteration 3878/ 159576 | consumed samples: 76848 | elapsed time per iteration (ms): 14676.0 | learning rate: 2.128E-05 | global batch size: 32 | lm loss: 6.477038E+00 | loss scale: 16384.0 | grad norm: 73926.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39411
+ time (ms)
39412
+ iteration 3879/ 159576 | consumed samples: 76880 | elapsed time per iteration (ms): 14505.4 | learning rate: 2.129E-05 | global batch size: 32 | lm loss: 6.577363E+00 | loss scale: 16384.0 | grad norm: 65273.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39413
+ time (ms)
39414
+ iteration 3880/ 159576 | consumed samples: 76912 | elapsed time per iteration (ms): 14525.2 | learning rate: 2.130E-05 | global batch size: 32 | lm loss: 6.431276E+00 | loss scale: 16384.0 | grad norm: 62353.041 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39415
+ time (ms)
39416
+ iteration 3881/ 159576 | consumed samples: 76944 | elapsed time per iteration (ms): 14918.9 | learning rate: 2.131E-05 | global batch size: 32 | lm loss: 6.471975E+00 | loss scale: 16384.0 | grad norm: 80402.399 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39417
+ time (ms)
39418
+ iteration 3882/ 159576 | consumed samples: 76976 | elapsed time per iteration (ms): 14543.5 | learning rate: 2.132E-05 | global batch size: 32 | lm loss: 6.481179E+00 | loss scale: 16384.0 | grad norm: 59241.446 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39419
+ time (ms)
39420
+ iteration 3883/ 159576 | consumed samples: 77008 | elapsed time per iteration (ms): 14519.1 | learning rate: 2.132E-05 | global batch size: 32 | lm loss: 6.356431E+00 | loss scale: 16384.0 | grad norm: 66124.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39421
+ time (ms)
39422
+ iteration 3884/ 159576 | consumed samples: 77040 | elapsed time per iteration (ms): 14635.6 | learning rate: 2.133E-05 | global batch size: 32 | lm loss: 7.171796E+00 | loss scale: 16384.0 | grad norm: 628102.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39423
+ time (ms)
39424
+ iteration 3885/ 159576 | consumed samples: 77072 | elapsed time per iteration (ms): 14877.6 | learning rate: 2.134E-05 | global batch size: 32 | lm loss: 7.122965E+00 | loss scale: 16384.0 | grad norm: 105361.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39425
+ time (ms)
39426
+ iteration 3886/ 159576 | consumed samples: 77104 | elapsed time per iteration (ms): 14581.7 | learning rate: 2.135E-05 | global batch size: 32 | lm loss: 6.781033E+00 | loss scale: 16384.0 | grad norm: 90805.956 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39427
+ time (ms)
39428
+ iteration 3887/ 159576 | consumed samples: 77136 | elapsed time per iteration (ms): 14580.5 | learning rate: 2.136E-05 | global batch size: 32 | lm loss: 6.824611E+00 | loss scale: 16384.0 | grad norm: 128888.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39429
+ time (ms)
39430
+ iteration 3888/ 159576 | consumed samples: 77168 | elapsed time per iteration (ms): 14468.4 | learning rate: 2.137E-05 | global batch size: 32 | lm loss: 6.773994E+00 | loss scale: 16384.0 | grad norm: 67441.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39431
+ time (ms)
39432
+ iteration 3889/ 159576 | consumed samples: 77200 | elapsed time per iteration (ms): 14934.3 | learning rate: 2.138E-05 | global batch size: 32 | lm loss: 6.845183E+00 | loss scale: 16384.0 | grad norm: 171660.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39433
+ time (ms)
39434
+ iteration 3890/ 159576 | consumed samples: 77232 | elapsed time per iteration (ms): 14531.8 | learning rate: 2.139E-05 | global batch size: 32 | lm loss: 6.803124E+00 | loss scale: 16384.0 | grad norm: 100767.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39435
+ time (ms)
39436
+ iteration 3891/ 159576 | consumed samples: 77264 | elapsed time per iteration (ms): 14568.7 | learning rate: 2.139E-05 | global batch size: 32 | lm loss: 6.825951E+00 | loss scale: 16384.0 | grad norm: 84326.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39437
+ time (ms)
39438
+ iteration 3892/ 159576 | consumed samples: 77296 | elapsed time per iteration (ms): 14543.8 | learning rate: 2.140E-05 | global batch size: 32 | lm loss: 6.734772E+00 | loss scale: 16384.0 | grad norm: 87236.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39439
+ time (ms)
39440
+ iteration 3893/ 159576 | consumed samples: 77328 | elapsed time per iteration (ms): 14607.7 | learning rate: 2.141E-05 | global batch size: 32 | lm loss: 6.789660E+00 | loss scale: 16384.0 | grad norm: 88054.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39441
+ time (ms)
39442
+ iteration 3894/ 159576 | consumed samples: 77360 | elapsed time per iteration (ms): 14920.9 | learning rate: 2.142E-05 | global batch size: 32 | lm loss: 6.710454E+00 | loss scale: 16384.0 | grad norm: 182978.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39443
+ time (ms)
39444
+ iteration 3895/ 159576 | consumed samples: 77392 | elapsed time per iteration (ms): 14510.2 | learning rate: 2.143E-05 | global batch size: 32 | lm loss: 6.691602E+00 | loss scale: 16384.0 | grad norm: 119037.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39445
+ time (ms)
39446
+ iteration 3896/ 159576 | consumed samples: 77424 | elapsed time per iteration (ms): 14496.2 | learning rate: 2.144E-05 | global batch size: 32 | lm loss: 6.739342E+00 | loss scale: 16384.0 | grad norm: 97461.502 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39447
+ time (ms)
39448
+ iteration 3897/ 159576 | consumed samples: 77456 | elapsed time per iteration (ms): 14526.7 | learning rate: 2.145E-05 | global batch size: 32 | lm loss: 6.818674E+00 | loss scale: 16384.0 | grad norm: 86334.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39449
+ time (ms)
39450
+ iteration 3898/ 159576 | consumed samples: 77488 | elapsed time per iteration (ms): 14792.9 | learning rate: 2.146E-05 | global batch size: 32 | lm loss: 6.717194E+00 | loss scale: 16384.0 | grad norm: 113951.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39451
+ time (ms)
39452
+ iteration 3899/ 159576 | consumed samples: 77520 | elapsed time per iteration (ms): 14491.5 | learning rate: 2.147E-05 | global batch size: 32 | lm loss: 6.714782E+00 | loss scale: 16384.0 | grad norm: 99766.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39453
+ time (ms)
39454
+ iteration 3900/ 159576 | consumed samples: 77552 | elapsed time per iteration (ms): 14584.1 | learning rate: 2.147E-05 | global batch size: 32 | lm loss: 6.659179E+00 | loss scale: 16384.0 | grad norm: 89663.421 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39455
+ time (ms)
39456
+ iteration 3901/ 159576 | consumed samples: 77584 | elapsed time per iteration (ms): 14629.2 | learning rate: 2.148E-05 | global batch size: 32 | lm loss: 6.615579E+00 | loss scale: 16384.0 | grad norm: 68957.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39457
+ time (ms)
39458
+ iteration 3902/ 159576 | consumed samples: 77616 | elapsed time per iteration (ms): 14617.9 | learning rate: 2.149E-05 | global batch size: 32 | lm loss: 6.606854E+00 | loss scale: 16384.0 | grad norm: 99968.600 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39459
+ time (ms)
39460
+ iteration 3903/ 159576 | consumed samples: 77648 | elapsed time per iteration (ms): 14554.1 | learning rate: 2.150E-05 | global batch size: 32 | lm loss: 6.537298E+00 | loss scale: 16384.0 | grad norm: 67921.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39461
+ time (ms)
39462
+ iteration 3904/ 159576 | consumed samples: 77680 | elapsed time per iteration (ms): 14545.4 | learning rate: 2.151E-05 | global batch size: 32 | lm loss: 6.606940E+00 | loss scale: 16384.0 | grad norm: 145573.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39463
+ time (ms)
39464
+ iteration 3905/ 159576 | consumed samples: 77712 | elapsed time per iteration (ms): 14521.9 | learning rate: 2.152E-05 | global batch size: 32 | lm loss: 6.625298E+00 | loss scale: 16384.0 | grad norm: 96778.059 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39465
+ time (ms)
39466
+ iteration 3906/ 159576 | consumed samples: 77744 | elapsed time per iteration (ms): 14699.2 | learning rate: 2.153E-05 | global batch size: 32 | lm loss: 6.624491E+00 | loss scale: 16384.0 | grad norm: 92738.461 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39467
+ time (ms)
39468
+ iteration 3907/ 159576 | consumed samples: 77776 | elapsed time per iteration (ms): 14558.6 | learning rate: 2.154E-05 | global batch size: 32 | lm loss: 6.825802E+00 | loss scale: 16384.0 | grad norm: 119492.559 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39469
+ time (ms)
39470
+ iteration 3908/ 159576 | consumed samples: 77808 | elapsed time per iteration (ms): 14547.7 | learning rate: 2.155E-05 | global batch size: 32 | lm loss: 6.591653E+00 | loss scale: 16384.0 | grad norm: 78761.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39471
+ time (ms)
39472
+ iteration 3909/ 159576 | consumed samples: 77840 | elapsed time per iteration (ms): 14554.0 | learning rate: 2.155E-05 | global batch size: 32 | lm loss: 6.567001E+00 | loss scale: 16384.0 | grad norm: 147075.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39473
+ time (ms)
39474
+ iteration 3910/ 159576 | consumed samples: 77872 | elapsed time per iteration (ms): 15013.4 | learning rate: 2.156E-05 | global batch size: 32 | lm loss: 6.787440E+00 | loss scale: 16384.0 | grad norm: 142314.988 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39475
+ time (ms)
39476
+ iteration 3911/ 159576 | consumed samples: 77904 | elapsed time per iteration (ms): 14566.2 | learning rate: 2.157E-05 | global batch size: 32 | lm loss: 6.525432E+00 | loss scale: 16384.0 | grad norm: 87369.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39477
+ time (ms)
39478
+ iteration 3912/ 159576 | consumed samples: 77936 | elapsed time per iteration (ms): 14516.0 | learning rate: 2.158E-05 | global batch size: 32 | lm loss: 6.615817E+00 | loss scale: 16384.0 | grad norm: 83904.990 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39479
+ time (ms)
39480
+ iteration 3913/ 159576 | consumed samples: 77968 | elapsed time per iteration (ms): 14525.8 | learning rate: 2.159E-05 | global batch size: 32 | lm loss: 6.564670E+00 | loss scale: 16384.0 | grad norm: 97516.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39481
+ time (ms)
39482
+ iteration 3914/ 159576 | consumed samples: 78000 | elapsed time per iteration (ms): 15027.0 | learning rate: 2.160E-05 | global batch size: 32 | lm loss: 6.400544E+00 | loss scale: 16384.0 | grad norm: 92743.388 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39483
+ time (ms)
39484
+ iteration 3915/ 159576 | consumed samples: 78032 | elapsed time per iteration (ms): 14573.6 | learning rate: 2.161E-05 | global batch size: 32 | lm loss: 6.603245E+00 | loss scale: 16384.0 | grad norm: 106541.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39485
+ time (ms)
39486
+ iteration 3916/ 159576 | consumed samples: 78064 | elapsed time per iteration (ms): 14538.9 | learning rate: 2.162E-05 | global batch size: 32 | lm loss: 6.560642E+00 | loss scale: 16384.0 | grad norm: 71313.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39487
+ time (ms)
39488
+ iteration 3917/ 159576 | consumed samples: 78096 | elapsed time per iteration (ms): 14550.2 | learning rate: 2.163E-05 | global batch size: 32 | lm loss: 6.578140E+00 | loss scale: 16384.0 | grad norm: 83812.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39489
+ time (ms)
39490
+ iteration 3918/ 159576 | consumed samples: 78128 | elapsed time per iteration (ms): 14857.6 | learning rate: 2.163E-05 | global batch size: 32 | lm loss: 6.583351E+00 | loss scale: 16384.0 | grad norm: 69616.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39491
+ time (ms)
39492
+ iteration 3919/ 159576 | consumed samples: 78160 | elapsed time per iteration (ms): 14509.2 | learning rate: 2.164E-05 | global batch size: 32 | lm loss: 6.595952E+00 | loss scale: 16384.0 | grad norm: 83133.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39493
+ time (ms)
39494
+ iteration 3920/ 159576 | consumed samples: 78192 | elapsed time per iteration (ms): 14502.7 | learning rate: 2.165E-05 | global batch size: 32 | lm loss: 6.645111E+00 | loss scale: 16384.0 | grad norm: 69570.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39495
+ time (ms)
39496
+ iteration 3921/ 159576 | consumed samples: 78224 | elapsed time per iteration (ms): 14498.8 | learning rate: 2.166E-05 | global batch size: 32 | lm loss: 6.553501E+00 | loss scale: 16384.0 | grad norm: 142896.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39497
+ time (ms)
39498
+ iteration 3922/ 159576 | consumed samples: 78256 | elapsed time per iteration (ms): 14842.1 | learning rate: 2.167E-05 | global batch size: 32 | lm loss: 6.687614E+00 | loss scale: 16384.0 | grad norm: 107346.964 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39499
+ time (ms)
39500
+ iteration 3923/ 159576 | consumed samples: 78288 | elapsed time per iteration (ms): 14567.6 | learning rate: 2.168E-05 | global batch size: 32 | lm loss: 6.764112E+00 | loss scale: 16384.0 | grad norm: 75484.388 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39501
+ time (ms)
39502
+ iteration 3924/ 159576 | consumed samples: 78320 | elapsed time per iteration (ms): 14603.6 | learning rate: 2.169E-05 | global batch size: 32 | lm loss: 6.384696E+00 | loss scale: 16384.0 | grad norm: 91570.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39503
+ time (ms)
39504
+ iteration 3925/ 159576 | consumed samples: 78352 | elapsed time per iteration (ms): 14494.1 | learning rate: 2.170E-05 | global batch size: 32 | lm loss: 6.148740E+00 | loss scale: 16384.0 | grad norm: 66094.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39505
+ time (ms)
39506
+ iteration 3926/ 159576 | consumed samples: 78384 | elapsed time per iteration (ms): 14880.0 | learning rate: 2.171E-05 | global batch size: 32 | lm loss: 6.492467E+00 | loss scale: 16384.0 | grad norm: 95980.364 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39507
+ time (ms)
39508
+ iteration 3927/ 159576 | consumed samples: 78416 | elapsed time per iteration (ms): 14529.0 | learning rate: 2.171E-05 | global batch size: 32 | lm loss: 6.634668E+00 | loss scale: 16384.0 | grad norm: 102240.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39509
+ time (ms)
39510
+ iteration 3928/ 159576 | consumed samples: 78448 | elapsed time per iteration (ms): 14524.9 | learning rate: 2.172E-05 | global batch size: 32 | lm loss: 6.542571E+00 | loss scale: 16384.0 | grad norm: 78190.337 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39511
+ time (ms)
39512
+ iteration 3929/ 159576 | consumed samples: 78480 | elapsed time per iteration (ms): 14519.9 | learning rate: 2.173E-05 | global batch size: 32 | lm loss: 6.546354E+00 | loss scale: 16384.0 | grad norm: 69181.655 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39513
+ time (ms)
39514
+ iteration 3930/ 159576 | consumed samples: 78512 | elapsed time per iteration (ms): 14848.7 | learning rate: 2.174E-05 | global batch size: 32 | lm loss: 6.556016E+00 | loss scale: 16384.0 | grad norm: 166890.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39515
+ time (ms)
39516
+ iteration 3931/ 159576 | consumed samples: 78544 | elapsed time per iteration (ms): 14630.3 | learning rate: 2.175E-05 | global batch size: 32 | lm loss: 6.575625E+00 | loss scale: 16384.0 | grad norm: 67026.457 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39517
+ time (ms)
39518
+ iteration 3932/ 159576 | consumed samples: 78576 | elapsed time per iteration (ms): 14503.2 | learning rate: 2.176E-05 | global batch size: 32 | lm loss: 6.528583E+00 | loss scale: 16384.0 | grad norm: 65300.446 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39519
+ time (ms)
39520
+ iteration 3933/ 159576 | consumed samples: 78608 | elapsed time per iteration (ms): 14533.6 | learning rate: 2.177E-05 | global batch size: 32 | lm loss: 6.571996E+00 | loss scale: 16384.0 | grad norm: 61530.557 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39521
+ time (ms)
39522
+ iteration 3934/ 159576 | consumed samples: 78640 | elapsed time per iteration (ms): 14528.2 | learning rate: 2.178E-05 | global batch size: 32 | lm loss: 6.524823E+00 | loss scale: 16384.0 | grad norm: 58107.513 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39523
+ time (ms)
39524
+ iteration 3935/ 159576 | consumed samples: 78672 | elapsed time per iteration (ms): 14801.4 | learning rate: 2.179E-05 | global batch size: 32 | lm loss: 6.627916E+00 | loss scale: 16384.0 | grad norm: 64798.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39525
+ time (ms)
39526
+ iteration 3936/ 159576 | consumed samples: 78704 | elapsed time per iteration (ms): 14509.3 | learning rate: 2.179E-05 | global batch size: 32 | lm loss: 6.511620E+00 | loss scale: 16384.0 | grad norm: 59258.569 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39527
+ time (ms)
39528
+ iteration 3937/ 159576 | consumed samples: 78736 | elapsed time per iteration (ms): 14529.7 | learning rate: 2.180E-05 | global batch size: 32 | lm loss: 6.414696E+00 | loss scale: 16384.0 | grad norm: 75598.973 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39529
+ time (ms)
39530
+ iteration 3938/ 159576 | consumed samples: 78768 | elapsed time per iteration (ms): 14568.6 | learning rate: 2.181E-05 | global batch size: 32 | lm loss: 6.692476E+00 | loss scale: 16384.0 | grad norm: 68594.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39531
+ time (ms)
39532
+ iteration 3939/ 159576 | consumed samples: 78800 | elapsed time per iteration (ms): 14680.0 | learning rate: 2.182E-05 | global batch size: 32 | lm loss: 6.509182E+00 | loss scale: 16384.0 | grad norm: 77431.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39533
+ time (ms)
39534
+ iteration 3940/ 159576 | consumed samples: 78832 | elapsed time per iteration (ms): 14561.3 | learning rate: 2.183E-05 | global batch size: 32 | lm loss: 6.521114E+00 | loss scale: 16384.0 | grad norm: 67107.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39535
+ time (ms)
39536
+ iteration 3941/ 159576 | consumed samples: 78864 | elapsed time per iteration (ms): 14540.3 | learning rate: 2.184E-05 | global batch size: 32 | lm loss: 6.557777E+00 | loss scale: 16384.0 | grad norm: 82252.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39537
+ time (ms)
39538
+ iteration 3942/ 159576 | consumed samples: 78896 | elapsed time per iteration (ms): 14516.4 | learning rate: 2.185E-05 | global batch size: 32 | lm loss: 6.519272E+00 | loss scale: 16384.0 | grad norm: 62956.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39539
+ time (ms)
39540
+ iteration 3943/ 159576 | consumed samples: 78928 | elapsed time per iteration (ms): 14804.0 | learning rate: 2.186E-05 | global batch size: 32 | lm loss: 6.436077E+00 | loss scale: 16384.0 | grad norm: 63372.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39541
+ time (ms)
39542
+ iteration 3944/ 159576 | consumed samples: 78960 | elapsed time per iteration (ms): 14504.5 | learning rate: 2.187E-05 | global batch size: 32 | lm loss: 6.536609E+00 | loss scale: 16384.0 | grad norm: 70623.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39543
+ time (ms)
39544
+ iteration 3945/ 159576 | consumed samples: 78992 | elapsed time per iteration (ms): 14519.8 | learning rate: 2.187E-05 | global batch size: 32 | lm loss: 6.631818E+00 | loss scale: 16384.0 | grad norm: 62267.463 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39545
+ time (ms)
39546
+ iteration 3946/ 159576 | consumed samples: 79024 | elapsed time per iteration (ms): 14592.1 | learning rate: 2.188E-05 | global batch size: 32 | lm loss: 6.263665E+00 | loss scale: 16384.0 | grad norm: 67107.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39547
+ time (ms)
39548
+ iteration 3947/ 159576 | consumed samples: 79056 | elapsed time per iteration (ms): 14791.6 | learning rate: 2.189E-05 | global batch size: 32 | lm loss: 6.622372E+00 | loss scale: 16384.0 | grad norm: 84764.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39549
+ time (ms)
39550
+ iteration 3948/ 159576 | consumed samples: 79088 | elapsed time per iteration (ms): 14637.3 | learning rate: 2.190E-05 | global batch size: 32 | lm loss: 6.395759E+00 | loss scale: 16384.0 | grad norm: 60113.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39551
+ time (ms)
39552
+ iteration 3949/ 159576 | consumed samples: 79120 | elapsed time per iteration (ms): 14546.6 | learning rate: 2.191E-05 | global batch size: 32 | lm loss: 6.588756E+00 | loss scale: 16384.0 | grad norm: 68679.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39553
+ time (ms)
39554
+ iteration 3950/ 159576 | consumed samples: 79152 | elapsed time per iteration (ms): 14514.6 | learning rate: 2.192E-05 | global batch size: 32 | lm loss: 6.484011E+00 | loss scale: 16384.0 | grad norm: 68729.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39555
+ time (ms)
39556
+ iteration 3951/ 159576 | consumed samples: 79184 | elapsed time per iteration (ms): 14907.8 | learning rate: 2.193E-05 | global batch size: 32 | lm loss: 6.496289E+00 | loss scale: 16384.0 | grad norm: 58918.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39557
+ time (ms)
39558
+ iteration 3952/ 159576 | consumed samples: 79216 | elapsed time per iteration (ms): 14467.7 | learning rate: 2.194E-05 | global batch size: 32 | lm loss: 6.442475E+00 | loss scale: 16384.0 | grad norm: 73240.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39559
+ time (ms)
39560
+ iteration 3953/ 159576 | consumed samples: 79248 | elapsed time per iteration (ms): 14613.3 | learning rate: 2.195E-05 | global batch size: 32 | lm loss: 6.412640E+00 | loss scale: 16384.0 | grad norm: 63495.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39561
+ time (ms)
39562
+ iteration 3954/ 159576 | consumed samples: 79280 | elapsed time per iteration (ms): 14497.1 | learning rate: 2.195E-05 | global batch size: 32 | lm loss: 6.419092E+00 | loss scale: 16384.0 | grad norm: 64832.581 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39563
+ time (ms)
39564
+ iteration 3955/ 159576 | consumed samples: 79312 | elapsed time per iteration (ms): 14864.8 | learning rate: 2.196E-05 | global batch size: 32 | lm loss: 6.411493E+00 | loss scale: 16384.0 | grad norm: 70227.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39565
+ time (ms)
39566
+ iteration 3956/ 159576 | consumed samples: 79344 | elapsed time per iteration (ms): 14501.1 | learning rate: 2.197E-05 | global batch size: 32 | lm loss: 6.377773E+00 | loss scale: 16384.0 | grad norm: 65521.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39567
+ time (ms)
39568
+ iteration 3957/ 159576 | consumed samples: 79376 | elapsed time per iteration (ms): 14522.7 | learning rate: 2.198E-05 | global batch size: 32 | lm loss: 6.458980E+00 | loss scale: 16384.0 | grad norm: 62294.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39569
+ time (ms)
39570
+ iteration 3958/ 159576 | consumed samples: 79408 | elapsed time per iteration (ms): 14509.2 | learning rate: 2.199E-05 | global batch size: 32 | lm loss: 6.540348E+00 | loss scale: 16384.0 | grad norm: 64994.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39571
+ time (ms)
39572
+ iteration 3959/ 159576 | consumed samples: 79440 | elapsed time per iteration (ms): 14868.7 | learning rate: 2.200E-05 | global batch size: 32 | lm loss: 6.503858E+00 | loss scale: 16384.0 | grad norm: 54271.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39573
+ time (ms)
39574
+ iteration 3960/ 159576 | consumed samples: 79472 | elapsed time per iteration (ms): 14512.5 | learning rate: 2.201E-05 | global batch size: 32 | lm loss: 6.372645E+00 | loss scale: 16384.0 | grad norm: 73237.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39575
+ time (ms)
39576
+ iteration 3961/ 159576 | consumed samples: 79504 | elapsed time per iteration (ms): 14552.3 | learning rate: 2.202E-05 | global batch size: 32 | lm loss: 6.396554E+00 | loss scale: 16384.0 | grad norm: 64579.000 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39577
+ time (ms)
39578
+ iteration 3962/ 159576 | consumed samples: 79536 | elapsed time per iteration (ms): 14559.3 | learning rate: 2.203E-05 | global batch size: 32 | lm loss: 6.556979E+00 | loss scale: 16384.0 | grad norm: 83489.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39579
+ time (ms)
39580
+ iteration 3963/ 159576 | consumed samples: 79568 | elapsed time per iteration (ms): 14899.9 | learning rate: 2.203E-05 | global batch size: 32 | lm loss: 6.458327E+00 | loss scale: 16384.0 | grad norm: 58716.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39581
+ time (ms)
39582
+ iteration 3964/ 159576 | consumed samples: 79600 | elapsed time per iteration (ms): 14539.5 | learning rate: 2.204E-05 | global batch size: 32 | lm loss: 6.802517E+00 | loss scale: 16384.0 | grad norm: 60731.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39583
+ time (ms)
39584
+ iteration 3965/ 159576 | consumed samples: 79632 | elapsed time per iteration (ms): 14520.1 | learning rate: 2.205E-05 | global batch size: 32 | lm loss: 6.616902E+00 | loss scale: 16384.0 | grad norm: 64155.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39585
+ time (ms)
39586
+ iteration 3966/ 159576 | consumed samples: 79664 | elapsed time per iteration (ms): 14585.2 | learning rate: 2.206E-05 | global batch size: 32 | lm loss: 6.457995E+00 | loss scale: 16384.0 | grad norm: 74880.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39587
+ time (ms)
39588
+ iteration 3967/ 159576 | consumed samples: 79696 | elapsed time per iteration (ms): 14850.0 | learning rate: 2.207E-05 | global batch size: 32 | lm loss: 6.591904E+00 | loss scale: 16384.0 | grad norm: 75336.614 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39589
+ time (ms)
39590
+ iteration 3968/ 159576 | consumed samples: 79728 | elapsed time per iteration (ms): 14661.7 | learning rate: 2.208E-05 | global batch size: 32 | lm loss: 6.475752E+00 | loss scale: 16384.0 | grad norm: 76852.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39591
+ time (ms)
39592
+ iteration 3969/ 159576 | consumed samples: 79760 | elapsed time per iteration (ms): 14523.7 | learning rate: 2.209E-05 | global batch size: 32 | lm loss: 6.452621E+00 | loss scale: 16384.0 | grad norm: 65844.475 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39593
+ time (ms)
39594
+ iteration 3970/ 159576 | consumed samples: 79792 | elapsed time per iteration (ms): 14549.1 | learning rate: 2.210E-05 | global batch size: 32 | lm loss: 6.401618E+00 | loss scale: 16384.0 | grad norm: 84954.581 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39595
+ time (ms)
39596
+ iteration 3971/ 159576 | consumed samples: 79824 | elapsed time per iteration (ms): 14508.8 | learning rate: 2.211E-05 | global batch size: 32 | lm loss: 6.516178E+00 | loss scale: 16384.0 | grad norm: 71111.037 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39597
+ time (ms)
39598
+ iteration 3972/ 159576 | consumed samples: 79856 | elapsed time per iteration (ms): 14847.5 | learning rate: 2.211E-05 | global batch size: 32 | lm loss: 6.601567E+00 | loss scale: 16384.0 | grad norm: 74563.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39599
+ time (ms)
39600
+ iteration 3973/ 159576 | consumed samples: 79888 | elapsed time per iteration (ms): 14594.0 | learning rate: 2.212E-05 | global batch size: 32 | lm loss: 6.441951E+00 | loss scale: 16384.0 | grad norm: 72653.525 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39601
+ time (ms)
39602
+ iteration 3974/ 159576 | consumed samples: 79920 | elapsed time per iteration (ms): 14478.4 | learning rate: 2.213E-05 | global batch size: 32 | lm loss: 6.510294E+00 | loss scale: 16384.0 | grad norm: 65083.374 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39603
+ time (ms)
39604
+ iteration 3975/ 159576 | consumed samples: 79952 | elapsed time per iteration (ms): 14520.1 | learning rate: 2.214E-05 | global batch size: 32 | lm loss: 6.345959E+00 | loss scale: 16384.0 | grad norm: 133600.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39605
+ time (ms)
39606
+ iteration 3976/ 159576 | consumed samples: 79984 | elapsed time per iteration (ms): 14770.3 | learning rate: 2.215E-05 | global batch size: 32 | lm loss: 6.477483E+00 | loss scale: 16384.0 | grad norm: 89443.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39607
+ time (ms)
39608
+ iteration 3977/ 159576 | consumed samples: 80016 | elapsed time per iteration (ms): 14483.7 | learning rate: 2.216E-05 | global batch size: 32 | lm loss: 6.466526E+00 | loss scale: 16384.0 | grad norm: 79203.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39609
+ time (ms)
39610
+ iteration 3978/ 159576 | consumed samples: 80048 | elapsed time per iteration (ms): 14548.9 | learning rate: 2.217E-05 | global batch size: 32 | lm loss: 6.490917E+00 | loss scale: 16384.0 | grad norm: 85035.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39611
+ time (ms)
39612
+ iteration 3979/ 159576 | consumed samples: 80080 | elapsed time per iteration (ms): 14519.8 | learning rate: 2.218E-05 | global batch size: 32 | lm loss: 6.412145E+00 | loss scale: 16384.0 | grad norm: 93580.388 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39613
+ time (ms)
39614
+ iteration 3980/ 159576 | consumed samples: 80112 | elapsed time per iteration (ms): 14659.7 | learning rate: 2.218E-05 | global batch size: 32 | lm loss: 6.473646E+00 | loss scale: 16384.0 | grad norm: 79422.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39615
+ time (ms)
39616
+ iteration 3981/ 159576 | consumed samples: 80144 | elapsed time per iteration (ms): 14525.1 | learning rate: 2.219E-05 | global batch size: 32 | lm loss: 6.522334E+00 | loss scale: 16384.0 | grad norm: 83533.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39617
+ time (ms)
39618
+ iteration 3982/ 159576 | consumed samples: 80176 | elapsed time per iteration (ms): 14543.1 | learning rate: 2.220E-05 | global batch size: 32 | lm loss: 6.387228E+00 | loss scale: 16384.0 | grad norm: 89795.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39619
+ time (ms)
39620
+ iteration 3983/ 159576 | consumed samples: 80208 | elapsed time per iteration (ms): 14609.8 | learning rate: 2.221E-05 | global batch size: 32 | lm loss: 6.475267E+00 | loss scale: 16384.0 | grad norm: 119598.589 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39621
+ time (ms)
39622
+ iteration 3984/ 159576 | consumed samples: 80240 | elapsed time per iteration (ms): 14596.2 | learning rate: 2.222E-05 | global batch size: 32 | lm loss: 6.533351E+00 | loss scale: 16384.0 | grad norm: 72306.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39623
+ time (ms)
39624
+ iteration 3985/ 159576 | consumed samples: 80272 | elapsed time per iteration (ms): 14621.5 | learning rate: 2.223E-05 | global batch size: 32 | lm loss: 6.540237E+00 | loss scale: 16384.0 | grad norm: 88358.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39625
+ time (ms)
39626
+ iteration 3986/ 159576 | consumed samples: 80304 | elapsed time per iteration (ms): 14563.8 | learning rate: 2.224E-05 | global batch size: 32 | lm loss: 6.419699E+00 | loss scale: 16384.0 | grad norm: 75411.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39627
+ time (ms)
39628
+ iteration 3987/ 159576 | consumed samples: 80336 | elapsed time per iteration (ms): 14555.9 | learning rate: 2.225E-05 | global batch size: 32 | lm loss: 6.591748E+00 | loss scale: 16384.0 | grad norm: 112139.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39629
+ time (ms)
39630
+ iteration 3988/ 159576 | consumed samples: 80368 | elapsed time per iteration (ms): 15004.4 | learning rate: 2.226E-05 | global batch size: 32 | lm loss: 6.551664E+00 | loss scale: 16384.0 | grad norm: 88397.931 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39631
+ time (ms)
39632
+ iteration 3989/ 159576 | consumed samples: 80400 | elapsed time per iteration (ms): 14610.9 | learning rate: 2.226E-05 | global batch size: 32 | lm loss: 6.531049E+00 | loss scale: 16384.0 | grad norm: 63924.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39633
+ time (ms)
39634
+ iteration 3990/ 159576 | consumed samples: 80432 | elapsed time per iteration (ms): 14532.5 | learning rate: 2.227E-05 | global batch size: 32 | lm loss: 6.546918E+00 | loss scale: 16384.0 | grad norm: 97299.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39635
+ time (ms)
39636
+ iteration 3991/ 159576 | consumed samples: 80464 | elapsed time per iteration (ms): 14437.4 | learning rate: 2.228E-05 | global batch size: 32 | lm loss: 6.471569E+00 | loss scale: 16384.0 | grad norm: 76326.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39637
+ time (ms)
39638
+ iteration 3992/ 159576 | consumed samples: 80496 | elapsed time per iteration (ms): 14906.8 | learning rate: 2.229E-05 | global batch size: 32 | lm loss: 6.525407E+00 | loss scale: 16384.0 | grad norm: 77183.511 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39639
+ time (ms)
39640
+ iteration 3993/ 159576 | consumed samples: 80528 | elapsed time per iteration (ms): 14534.2 | learning rate: 2.230E-05 | global batch size: 32 | lm loss: 6.539597E+00 | loss scale: 16384.0 | grad norm: 60376.571 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39641
+ time (ms)
39642
+ iteration 3994/ 159576 | consumed samples: 80560 | elapsed time per iteration (ms): 14579.3 | learning rate: 2.231E-05 | global batch size: 32 | lm loss: 6.552666E+00 | loss scale: 16384.0 | grad norm: 84746.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39643
+ time (ms)
39644
+ iteration 3995/ 159576 | consumed samples: 80592 | elapsed time per iteration (ms): 14529.3 | learning rate: 2.232E-05 | global batch size: 32 | lm loss: 6.413946E+00 | loss scale: 16384.0 | grad norm: 67969.641 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39645
+ time (ms)
39646
+ iteration 3996/ 159576 | consumed samples: 80624 | elapsed time per iteration (ms): 14922.8 | learning rate: 2.233E-05 | global batch size: 32 | lm loss: 6.427969E+00 | loss scale: 16384.0 | grad norm: 69649.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39647
+ time (ms)
39648
+ iteration 3997/ 159576 | consumed samples: 80656 | elapsed time per iteration (ms): 14490.2 | learning rate: 2.234E-05 | global batch size: 32 | lm loss: 6.527985E+00 | loss scale: 16384.0 | grad norm: 71644.429 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39649
+ time (ms)
39650
+ iteration 3998/ 159576 | consumed samples: 80688 | elapsed time per iteration (ms): 14523.9 | learning rate: 2.234E-05 | global batch size: 32 | lm loss: 6.565372E+00 | loss scale: 16384.0 | grad norm: 75823.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39651
+ time (ms)
39652
+ iteration 3999/ 159576 | consumed samples: 80720 | elapsed time per iteration (ms): 14567.0 | learning rate: 2.235E-05 | global batch size: 32 | lm loss: 6.377391E+00 | loss scale: 16384.0 | grad norm: 85967.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39653
+ time (ms)
39654
+ [2021-09-24 17:49:17,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=6, lr=[2.2362426035502958e-05, 2.2362426035502958e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
39655
+ steps: 4000 loss: 6.3572 iter time (s): 0.007 samples/sec: 4556.985
39656
+ iteration 4000/ 159576 | consumed samples: 80752 | elapsed time per iteration (ms): 14858.7 | learning rate: 2.236E-05 | global batch size: 32 | lm loss: 6.357164E+00 | loss scale: 16384.0 | grad norm: 56878.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39657
+ time (ms)
39658
+ ------------------------------------------------------------------------------------------------
39659
+ validation loss at iteration 4000 | lm loss value: 6.453290E+00 | lm loss PPL: 6.347876E+02 |
39660
+ ------------------------------------------------------------------------------------------------
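Note on the validation line above: the logged `lm loss PPL` appears to be the exponential of the logged `lm loss value`. The sketch below checks this using only the two numbers copied from that line. The variable names are illustrative and do not come from the training code, and the perplexity definition (exp of the mean token loss) is an assumption about how the logger computes it.

```python
import math

# Values copied from the "validation loss at iteration 4000" log line.
val_loss = 6.453290        # "lm loss value"
logged_ppl = 6.347876e2    # "lm loss PPL"

# Assumption: perplexity is exp(loss), with the loss in nats.
ppl = math.exp(val_loss)

print(f"recomputed PPL: {ppl:.4f}  logged PPL: {logged_ppl:.4f}")

# The two agree up to the rounding of the logged values.
assert math.isclose(ppl, logged_ppl, rel_tol=1e-4)
```

The same check can be applied to any other validation line in the log by substituting its two values.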
39661
+ iteration 4001/ 159576 | consumed samples: 80784 | elapsed time per iteration (ms): 20796.3 | learning rate: 2.237E-05 | global batch size: 32 | lm loss: 6.357805E+00 | loss scale: 16384.0 | grad norm: 75271.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39662
+ time (ms)
39663
+ iteration 4002/ 159576 | consumed samples: 80816 | elapsed time per iteration (ms): 14528.3 | learning rate: 2.238E-05 | global batch size: 32 | lm loss: 6.590372E+00 | loss scale: 16384.0 | grad norm: 82823.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39664
+ time (ms)
39665
+ iteration 4003/ 159576 | consumed samples: 80848 | elapsed time per iteration (ms): 14569.0 | learning rate: 2.239E-05 | global batch size: 32 | lm loss: 6.547601E+00 | loss scale: 16384.0 | grad norm: 63495.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39666
+ time (ms)
39667
+ iteration 4004/ 159576 | consumed samples: 80880 | elapsed time per iteration (ms): 14981.7 | learning rate: 2.240E-05 | global batch size: 32 | lm loss: 6.488581E+00 | loss scale: 16384.0 | grad norm: 84538.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39668
+ time (ms)
39669
+ iteration 4005/ 159576 | consumed samples: 80912 | elapsed time per iteration (ms): 14517.6 | learning rate: 2.241E-05 | global batch size: 32 | lm loss: 6.473035E+00 | loss scale: 16384.0 | grad norm: 69154.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39670
+ time (ms)
39671
+ iteration 4006/ 159576 | consumed samples: 80944 | elapsed time per iteration (ms): 14515.3 | learning rate: 2.242E-05 | global batch size: 32 | lm loss: 6.574604E+00 | loss scale: 16384.0 | grad norm: 71258.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39672
+ time (ms)
39673
+ iteration 4007/ 159576 | consumed samples: 80976 | elapsed time per iteration (ms): 14530.3 | learning rate: 2.242E-05 | global batch size: 32 | lm loss: 6.480978E+00 | loss scale: 16384.0 | grad norm: 63598.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39674
+ time (ms)
39675
+ iteration 4008/ 159576 | consumed samples: 81008 | elapsed time per iteration (ms): 15052.4 | learning rate: 2.243E-05 | global batch size: 32 | lm loss: 6.393389E+00 | loss scale: 16384.0 | grad norm: 76474.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39676
+ time (ms)
39677
+ iteration 4009/ 159576 | consumed samples: 81040 | elapsed time per iteration (ms): 14618.9 | learning rate: 2.244E-05 | global batch size: 32 | lm loss: 6.322450E+00 | loss scale: 16384.0 | grad norm: 62736.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39678
+ time (ms)
39679
+ iteration 4010/ 159576 | consumed samples: 81072 | elapsed time per iteration (ms): 14521.7 | learning rate: 2.245E-05 | global batch size: 32 | lm loss: 6.502364E+00 | loss scale: 16384.0 | grad norm: 78751.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39680
+ time (ms)
39681
+ iteration 4011/ 159576 | consumed samples: 81104 | elapsed time per iteration (ms): 14513.4 | learning rate: 2.246E-05 | global batch size: 32 | lm loss: 6.504915E+00 | loss scale: 16384.0 | grad norm: 73290.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39682
+ time (ms)
39683
+ iteration 4012/ 159576 | consumed samples: 81136 | elapsed time per iteration (ms): 14859.5 | learning rate: 2.247E-05 | global batch size: 32 | lm loss: 6.422670E+00 | loss scale: 16384.0 | grad norm: 70911.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39684
+ time (ms)
39685
+ iteration 4013/ 159576 | consumed samples: 81168 | elapsed time per iteration (ms): 14562.7 | learning rate: 2.248E-05 | global batch size: 32 | lm loss: 6.460926E+00 | loss scale: 16384.0 | grad norm: 88361.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39686
+ time (ms)
39687
+ iteration 4014/ 159576 | consumed samples: 81200 | elapsed time per iteration (ms): 14537.6 | learning rate: 2.249E-05 | global batch size: 32 | lm loss: 6.359708E+00 | loss scale: 16384.0 | grad norm: 70950.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39688
+ time (ms)
39689
+ iteration 4015/ 159576 | consumed samples: 81232 | elapsed time per iteration (ms): 14575.5 | learning rate: 2.250E-05 | global batch size: 32 | lm loss: 6.479752E+00 | loss scale: 16384.0 | grad norm: 60916.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39690
+ time (ms)
39691
+ iteration 4016/ 159576 | consumed samples: 81264 | elapsed time per iteration (ms): 14890.4 | learning rate: 2.250E-05 | global batch size: 32 | lm loss: 6.438080E+00 | loss scale: 16384.0 | grad norm: 78503.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
39692
+ time (ms)
39693
+ iteration 4017/ 159576 | consumed samples: 81296 | elapsed time per iteration (ms): 14519.4 | learning rate: 2.251E-05 | global batch size: 32 | lm loss: 6.446492E+00 | loss scale: 16384.0 | grad norm: 66299.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4018/ 159576 | consumed samples: 81328 | elapsed time per iteration (ms): 14512.9 | learning rate: 2.252E-05 | global batch size: 32 | lm loss: 6.418320E+00 | loss scale: 16384.0 | grad norm: 65936.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4019/ 159576 | consumed samples: 81360 | elapsed time per iteration (ms): 14568.1 | learning rate: 2.253E-05 | global batch size: 32 | lm loss: 6.337445E+00 | loss scale: 16384.0 | grad norm: 71727.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4020/ 159576 | consumed samples: 81392 | elapsed time per iteration (ms): 14867.3 | learning rate: 2.254E-05 | global batch size: 32 | lm loss: 6.564549E+00 | loss scale: 16384.0 | grad norm: 96122.107 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4021/ 159576 | consumed samples: 81424 | elapsed time per iteration (ms): 14435.4 | learning rate: 2.255E-05 | global batch size: 32 | lm loss: 6.485852E+00 | loss scale: 16384.0 | grad norm: 82597.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4022/ 159576 | consumed samples: 81456 | elapsed time per iteration (ms): 14558.0 | learning rate: 2.256E-05 | global batch size: 32 | lm loss: 6.539099E+00 | loss scale: 16384.0 | grad norm: 121006.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4023/ 159576 | consumed samples: 81488 | elapsed time per iteration (ms): 14530.8 | learning rate: 2.257E-05 | global batch size: 32 | lm loss: 6.588836E+00 | loss scale: 16384.0 | grad norm: 83990.530 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4024/ 159576 | consumed samples: 81520 | elapsed time per iteration (ms): 14903.1 | learning rate: 2.258E-05 | global batch size: 32 | lm loss: 6.478038E+00 | loss scale: 16384.0 | grad norm: 86310.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4025/ 159576 | consumed samples: 81552 | elapsed time per iteration (ms): 14640.8 | learning rate: 2.258E-05 | global batch size: 32 | lm loss: 6.423618E+00 | loss scale: 16384.0 | grad norm: 72646.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4026/ 159576 | consumed samples: 81584 | elapsed time per iteration (ms): 14523.1 | learning rate: 2.259E-05 | global batch size: 32 | lm loss: 6.389876E+00 | loss scale: 16384.0 | grad norm: 75260.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4027/ 159576 | consumed samples: 81616 | elapsed time per iteration (ms): 14495.3 | learning rate: 2.260E-05 | global batch size: 32 | lm loss: 6.686980E+00 | loss scale: 16384.0 | grad norm: 68901.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4028/ 159576 | consumed samples: 81648 | elapsed time per iteration (ms): 14518.7 | learning rate: 2.261E-05 | global batch size: 32 | lm loss: 6.454273E+00 | loss scale: 16384.0 | grad norm: 78058.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4029/ 159576 | consumed samples: 81680 | elapsed time per iteration (ms): 14751.7 | learning rate: 2.262E-05 | global batch size: 32 | lm loss: 6.645922E+00 | loss scale: 16384.0 | grad norm: 90877.563 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4030/ 159576 | consumed samples: 81712 | elapsed time per iteration (ms): 14605.8 | learning rate: 2.263E-05 | global batch size: 32 | lm loss: 6.554152E+00 | loss scale: 16384.0 | grad norm: 71333.048 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4031/ 159576 | consumed samples: 81744 | elapsed time per iteration (ms): 14567.0 | learning rate: 2.264E-05 | global batch size: 32 | lm loss: 6.512757E+00 | loss scale: 16384.0 | grad norm: 75409.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4032/ 159576 | consumed samples: 81776 | elapsed time per iteration (ms): 14627.7 | learning rate: 2.265E-05 | global batch size: 32 | lm loss: 6.529600E+00 | loss scale: 16384.0 | grad norm: 83852.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4033/ 159576 | consumed samples: 81808 | elapsed time per iteration (ms): 14706.7 | learning rate: 2.266E-05 | global batch size: 32 | lm loss: 6.312231E+00 | loss scale: 16384.0 | grad norm: 64610.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4034/ 159576 | consumed samples: 81840 | elapsed time per iteration (ms): 14453.1 | learning rate: 2.266E-05 | global batch size: 32 | lm loss: 6.378237E+00 | loss scale: 16384.0 | grad norm: 70363.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4035/ 159576 | consumed samples: 81872 | elapsed time per iteration (ms): 14558.4 | learning rate: 2.267E-05 | global batch size: 32 | lm loss: 6.617406E+00 | loss scale: 16384.0 | grad norm: 76776.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4036/ 159576 | consumed samples: 81904 | elapsed time per iteration (ms): 14451.4 | learning rate: 2.268E-05 | global batch size: 32 | lm loss: 6.510260E+00 | loss scale: 16384.0 | grad norm: 65763.594 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4037/ 159576 | consumed samples: 81936 | elapsed time per iteration (ms): 14734.4 | learning rate: 2.269E-05 | global batch size: 32 | lm loss: 6.484540E+00 | loss scale: 16384.0 | grad norm: 113964.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4038/ 159576 | consumed samples: 81968 | elapsed time per iteration (ms): 14560.9 | learning rate: 2.270E-05 | global batch size: 32 | lm loss: 6.422564E+00 | loss scale: 16384.0 | grad norm: 71196.418 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4039/ 159576 | consumed samples: 82000 | elapsed time per iteration (ms): 14521.4 | learning rate: 2.271E-05 | global batch size: 32 | lm loss: 6.468810E+00 | loss scale: 16384.0 | grad norm: 81464.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4040/ 159576 | consumed samples: 82032 | elapsed time per iteration (ms): 14534.9 | learning rate: 2.272E-05 | global batch size: 32 | lm loss: 6.528829E+00 | loss scale: 16384.0 | grad norm: 64883.399 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4041/ 159576 | consumed samples: 82064 | elapsed time per iteration (ms): 14840.7 | learning rate: 2.273E-05 | global batch size: 32 | lm loss: 6.466451E+00 | loss scale: 16384.0 | grad norm: 113319.594 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4042/ 159576 | consumed samples: 82096 | elapsed time per iteration (ms): 14627.3 | learning rate: 2.274E-05 | global batch size: 32 | lm loss: 6.455089E+00 | loss scale: 16384.0 | grad norm: 63704.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4043/ 159576 | consumed samples: 82128 | elapsed time per iteration (ms): 14401.0 | learning rate: 2.274E-05 | global batch size: 32 | lm loss: 6.394213E+00 | loss scale: 16384.0 | grad norm: 104510.525 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4044/ 159576 | consumed samples: 82160 | elapsed time per iteration (ms): 14522.2 | learning rate: 2.275E-05 | global batch size: 32 | lm loss: 6.436733E+00 | loss scale: 16384.0 | grad norm: 69916.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4045/ 159576 | consumed samples: 82192 | elapsed time per iteration (ms): 14878.3 | learning rate: 2.276E-05 | global batch size: 32 | lm loss: 6.467334E+00 | loss scale: 16384.0 | grad norm: 86814.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4046/ 159576 | consumed samples: 82224 | elapsed time per iteration (ms): 14619.5 | learning rate: 2.277E-05 | global batch size: 32 | lm loss: 6.542828E+00 | loss scale: 16384.0 | grad norm: 91169.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4047/ 159576 | consumed samples: 82256 | elapsed time per iteration (ms): 14546.0 | learning rate: 2.278E-05 | global batch size: 32 | lm loss: 6.482902E+00 | loss scale: 16384.0 | grad norm: 71855.514 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4048/ 159576 | consumed samples: 82288 | elapsed time per iteration (ms): 14535.3 | learning rate: 2.279E-05 | global batch size: 32 | lm loss: 6.380974E+00 | loss scale: 16384.0 | grad norm: 110448.433 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4049/ 159576 | consumed samples: 82320 | elapsed time per iteration (ms): 14946.7 | learning rate: 2.280E-05 | global batch size: 32 | lm loss: 6.604033E+00 | loss scale: 16384.0 | grad norm: 86973.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4050/ 159576 | consumed samples: 82352 | elapsed time per iteration (ms): 14452.3 | learning rate: 2.281E-05 | global batch size: 32 | lm loss: 6.485418E+00 | loss scale: 16384.0 | grad norm: 93547.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4051/ 159576 | consumed samples: 82384 | elapsed time per iteration (ms): 14486.7 | learning rate: 2.282E-05 | global batch size: 32 | lm loss: 6.447795E+00 | loss scale: 16384.0 | grad norm: 71623.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4052/ 159576 | consumed samples: 82416 | elapsed time per iteration (ms): 14546.0 | learning rate: 2.282E-05 | global batch size: 32 | lm loss: 6.490433E+00 | loss scale: 16384.0 | grad norm: 122748.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4053/ 159576 | consumed samples: 82448 | elapsed time per iteration (ms): 14923.8 | learning rate: 2.283E-05 | global batch size: 32 | lm loss: 6.393107E+00 | loss scale: 16384.0 | grad norm: 94716.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4054/ 159576 | consumed samples: 82480 | elapsed time per iteration (ms): 14522.3 | learning rate: 2.284E-05 | global batch size: 32 | lm loss: 6.560749E+00 | loss scale: 16384.0 | grad norm: 87911.375 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4055/ 159576 | consumed samples: 82512 | elapsed time per iteration (ms): 14576.1 | learning rate: 2.285E-05 | global batch size: 32 | lm loss: 6.508199E+00 | loss scale: 16384.0 | grad norm: 75712.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4056/ 159576 | consumed samples: 82544 | elapsed time per iteration (ms): 14509.2 | learning rate: 2.286E-05 | global batch size: 32 | lm loss: 6.480619E+00 | loss scale: 16384.0 | grad norm: 92968.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4057/ 159576 | consumed samples: 82576 | elapsed time per iteration (ms): 14814.4 | learning rate: 2.287E-05 | global batch size: 32 | lm loss: 6.324226E+00 | loss scale: 16384.0 | grad norm: 78472.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4058/ 159576 | consumed samples: 82608 | elapsed time per iteration (ms): 14459.3 | learning rate: 2.288E-05 | global batch size: 32 | lm loss: 6.626959E+00 | loss scale: 16384.0 | grad norm: 80531.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4059/ 159576 | consumed samples: 82640 | elapsed time per iteration (ms): 14496.4 | learning rate: 2.289E-05 | global batch size: 32 | lm loss: 6.406682E+00 | loss scale: 16384.0 | grad norm: 75308.856 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4060/ 159576 | consumed samples: 82672 | elapsed time per iteration (ms): 14562.2 | learning rate: 2.289E-05 | global batch size: 32 | lm loss: 6.440542E+00 | loss scale: 16384.0 | grad norm: 78114.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4061/ 159576 | consumed samples: 82704 | elapsed time per iteration (ms): 14796.0 | learning rate: 2.290E-05 | global batch size: 32 | lm loss: 6.468933E+00 | loss scale: 16384.0 | grad norm: 77154.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4062/ 159576 | consumed samples: 82736 | elapsed time per iteration (ms): 14696.5 | learning rate: 2.291E-05 | global batch size: 32 | lm loss: 6.318196E+00 | loss scale: 16384.0 | grad norm: 97551.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4063/ 159576 | consumed samples: 82768 | elapsed time per iteration (ms): 14468.1 | learning rate: 2.292E-05 | global batch size: 32 | lm loss: 6.472930E+00 | loss scale: 16384.0 | grad norm: 110041.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4064/ 159576 | consumed samples: 82800 | elapsed time per iteration (ms): 14496.2 | learning rate: 2.293E-05 | global batch size: 32 | lm loss: 6.523721E+00 | loss scale: 16384.0 | grad norm: 88018.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4065/ 159576 | consumed samples: 82832 | elapsed time per iteration (ms): 14563.8 | learning rate: 2.294E-05 | global batch size: 32 | lm loss: 6.453180E+00 | loss scale: 16384.0 | grad norm: 83087.922 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4066/ 159576 | consumed samples: 82864 | elapsed time per iteration (ms): 14884.4 | learning rate: 2.295E-05 | global batch size: 32 | lm loss: 6.447326E+00 | loss scale: 16384.0 | grad norm: 72433.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4067/ 159576 | consumed samples: 82896 | elapsed time per iteration (ms): 14491.5 | learning rate: 2.296E-05 | global batch size: 32 | lm loss: 6.366633E+00 | loss scale: 16384.0 | grad norm: 100504.434 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4068/ 159576 | consumed samples: 82928 | elapsed time per iteration (ms): 14561.6 | learning rate: 2.297E-05 | global batch size: 32 | lm loss: 6.315294E+00 | loss scale: 16384.0 | grad norm: 79868.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4069/ 159576 | consumed samples: 82960 | elapsed time per iteration (ms): 14538.6 | learning rate: 2.297E-05 | global batch size: 32 | lm loss: 6.452709E+00 | loss scale: 16384.0 | grad norm: 94073.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4070/ 159576 | consumed samples: 82992 | elapsed time per iteration (ms): 14651.1 | learning rate: 2.298E-05 | global batch size: 32 | lm loss: 6.421084E+00 | loss scale: 16384.0 | grad norm: 96558.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4071/ 159576 | consumed samples: 83024 | elapsed time per iteration (ms): 14508.0 | learning rate: 2.299E-05 | global batch size: 32 | lm loss: 6.474918E+00 | loss scale: 16384.0 | grad norm: 104437.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 4072/ 159576 | consumed samples: 83056 | elapsed time per iteration (ms): 14540.3 | learning rate: 2.300E-05 | global batch size: 32 | lm loss: 6.442264E+00 | loss scale: 16384.0 | grad norm: 69985.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ [2021-09-24 18:07:07] PULSE: tr8-104B is waiting for the previous job to finish before scheduling a new one using the dependency mechanism (1165978_[1-10%1] on 'gpu_p13' partition)
+ [2021-09-24 18:07:07] PULSE: tr8-104B is running for 12:14:56 since 2021-09-24T05:52:11 (1162855_1 on 'gpu_p13' partition (r6i4n[5,7],r6i5n[2,7-8],r6i6n[0,2,6],r7i2n[4-5],r7i6n[2-4],r7i7n[7-8],r8i0n[2-3,5-8],r8i1n[0,2-4],r8i2n8,r8i3n[0-2],r8i5n[3-4],r8i7n[3-8],r9i0n[0-2],r9i1n[0-3],r9i2n[3-5,8],r9i3n[0-1,7-8],r9i4n[0-2],r9i5n[3-8],r9i6n[0,7-8])