bigscience-bot committed on
Commit
7abe266
1 Parent(s): f68d57e
Files changed (1)
  1. logs/main_log.txt +270 -0
logs/main_log.txt CHANGED
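The iteration records in this log use a pipe-delimited `key: value` layout, so they are easy to load for plotting or sanity checks. The following is a minimal sketch, not part of the commit: the function name `parse_iteration_line` is hypothetical, the field names are taken verbatim from the log lines, and it assumes every value after the `iteration N/ M` prefix is numeric, which holds for the records shown here.

```python
import re


def parse_iteration_line(line: str) -> dict:
    """Parse one pipe-delimited iteration record into a dict (hypothetical helper)."""
    # Drop the leading "+" that the diff adds to new lines, plus surrounding whitespace.
    line = line.strip().lstrip("+").strip()
    m = re.match(r"iteration\s+(\d+)\s*/\s*(\d+)\s*\|(.*)", line)
    if m is None:
        raise ValueError("not an iteration record")
    record = {
        "iteration": int(m.group(1)),
        "total_iterations": int(m.group(2)),
    }
    # Remaining fields look like " lm loss: 7.808900E+00 ".
    for field in m.group(3).split("|"):
        if ":" not in field:
            continue  # skips the empty piece after the trailing "|"
        key, value = field.split(":", 1)
        record[key.strip()] = float(value)
    return record


# Sample copied from the context line for iteration 733.
sample = (
    "iteration 733/ 159576 | consumed samples: 11728 | "
    "elapsed time per iteration (ms): 13725.2 | learning rate: 3.253E-06 | "
    "global batch size: 16 | lm loss: 7.808900E+00 | loss scale: 8192.0 | "
    "grad norm: 50692.428 | num zeros: 0.0 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)
rec = parse_iteration_line(sample)
# In this stretch of the log, consumed samples equals iteration * global batch size.
consistent = rec["iteration"] * rec["global batch size"] == rec["consumed samples"]
```

Lines such as `time (ms)` and the `PULSE` status line are not iteration records; the sketch raises `ValueError` for them so a caller can skip them explicitly.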
@@ -22434,3 +22434,273 @@ time (ms)
22434
  [2021-09-24 05:03:31] PULSE: tr8-104B is running for 1:02:26 since 2021-09-24T04:01:05 (1162747 on 'gpu_p13' partition (r6i4n[5,7],r6i5n[2,7-8],r6i6n[0,2,6],r7i2n[4-5],r7i6n[2-4],r7i7n[7-8],r8i0n[2-3,5-8],r8i1n[0,2-4],r8i2n8,r8i3n[0-2],r8i5n[3-4],r8i7n[3-8],r9i0n[0-2],r9i1n[0-3],r9i2n[3-5,8],r9i3n[0-1,7-8],r9i4n[0-2],r9i5n[3-8],r9i6n[0,7-8])
22435
  iteration 733/ 159576 | consumed samples: 11728 | elapsed time per iteration (ms): 13725.2 | learning rate: 3.253E-06 | global batch size: 16 | lm loss: 7.808900E+00 | loss scale: 8192.0 | grad norm: 50692.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22436
  time (ms)
22437
+ iteration 734/ 159576 | consumed samples: 11744 | elapsed time per iteration (ms): 13115.2 | learning rate: 3.257E-06 | global batch size: 16 | lm loss: 7.737029E+00 | loss scale: 8192.0 | grad norm: 69149.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22438
+ time (ms)
22439
+ iteration 735/ 159576 | consumed samples: 11760 | elapsed time per iteration (ms): 13493.9 | learning rate: 3.262E-06 | global batch size: 16 | lm loss: 7.630354E+00 | loss scale: 8192.0 | grad norm: 85240.602 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22440
+ time (ms)
22441
+ iteration 736/ 159576 | consumed samples: 11776 | elapsed time per iteration (ms): 13636.0 | learning rate: 3.266E-06 | global batch size: 16 | lm loss: 7.626644E+00 | loss scale: 8192.0 | grad norm: 57646.552 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22442
+ time (ms)
22443
+ iteration 737/ 159576 | consumed samples: 11792 | elapsed time per iteration (ms): 13810.1 | learning rate: 3.271E-06 | global batch size: 16 | lm loss: 7.526936E+00 | loss scale: 8192.0 | grad norm: 95065.076 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22444
+ time (ms)
22445
+ iteration 738/ 159576 | consumed samples: 11808 | elapsed time per iteration (ms): 13385.6 | learning rate: 3.275E-06 | global batch size: 16 | lm loss: 7.820796E+00 | loss scale: 8192.0 | grad norm: 113407.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22446
+ time (ms)
22447
+ iteration 739/ 159576 | consumed samples: 11824 | elapsed time per iteration (ms): 13689.8 | learning rate: 3.280E-06 | global batch size: 16 | lm loss: 7.774467E+00 | loss scale: 8192.0 | grad norm: 98657.078 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22448
+ time (ms)
22449
+ iteration 740/ 159576 | consumed samples: 11840 | elapsed time per iteration (ms): 13965.2 | learning rate: 3.284E-06 | global batch size: 16 | lm loss: 7.762564E+00 | loss scale: 8192.0 | grad norm: 71745.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22450
+ time (ms)
22451
+ iteration 741/ 159576 | consumed samples: 11856 | elapsed time per iteration (ms): 13569.2 | learning rate: 3.288E-06 | global batch size: 16 | lm loss: 7.608281E+00 | loss scale: 8192.0 | grad norm: 40905.544 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22452
+ time (ms)
22453
+ iteration 742/ 159576 | consumed samples: 11872 | elapsed time per iteration (ms): 13635.8 | learning rate: 3.293E-06 | global batch size: 16 | lm loss: 7.570668E+00 | loss scale: 8192.0 | grad norm: 80257.423 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22454
+ time (ms)
22455
+ iteration 743/ 159576 | consumed samples: 11888 | elapsed time per iteration (ms): 13669.8 | learning rate: 3.297E-06 | global batch size: 16 | lm loss: 7.586653E+00 | loss scale: 8192.0 | grad norm: 56412.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22456
+ time (ms)
22457
+ iteration 744/ 159576 | consumed samples: 11904 | elapsed time per iteration (ms): 13473.9 | learning rate: 3.302E-06 | global batch size: 16 | lm loss: 7.701398E+00 | loss scale: 8192.0 | grad norm: 100221.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22458
+ time (ms)
22459
+ iteration 745/ 159576 | consumed samples: 11920 | elapsed time per iteration (ms): 13453.8 | learning rate: 3.306E-06 | global batch size: 16 | lm loss: 7.772648E+00 | loss scale: 8192.0 | grad norm: 88519.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22460
+ time (ms)
22461
+ iteration 746/ 159576 | consumed samples: 11936 | elapsed time per iteration (ms): 13732.5 | learning rate: 3.311E-06 | global batch size: 16 | lm loss: 7.940891E+00 | loss scale: 8192.0 | grad norm: 66980.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22462
+ time (ms)
22463
+ iteration 747/ 159576 | consumed samples: 11952 | elapsed time per iteration (ms): 13956.5 | learning rate: 3.315E-06 | global batch size: 16 | lm loss: 7.879022E+00 | loss scale: 8192.0 | grad norm: 73008.302 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22464
+ time (ms)
22465
+ iteration 748/ 159576 | consumed samples: 11968 | elapsed time per iteration (ms): 13250.5 | learning rate: 3.320E-06 | global batch size: 16 | lm loss: 7.693480E+00 | loss scale: 8192.0 | grad norm: 45346.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22466
+ time (ms)
22467
+ iteration 749/ 159576 | consumed samples: 11984 | elapsed time per iteration (ms): 13529.3 | learning rate: 3.324E-06 | global batch size: 16 | lm loss: 7.658270E+00 | loss scale: 8192.0 | grad norm: 156261.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22468
+ time (ms)
22469
+ iteration 750/ 159576 | consumed samples: 12000 | elapsed time per iteration (ms): 14110.0 | learning rate: 3.328E-06 | global batch size: 16 | lm loss: 7.741945E+00 | loss scale: 8192.0 | grad norm: 121818.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22470
+ time (ms)
22471
+ iteration 751/ 159576 | consumed samples: 12016 | elapsed time per iteration (ms): 13463.3 | learning rate: 3.333E-06 | global batch size: 16 | lm loss: 7.631550E+00 | loss scale: 8192.0 | grad norm: 69835.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22472
+ time (ms)
22473
+ iteration 752/ 159576 | consumed samples: 12032 | elapsed time per iteration (ms): 13424.2 | learning rate: 3.337E-06 | global batch size: 16 | lm loss: 7.669878E+00 | loss scale: 8192.0 | grad norm: 47821.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22474
+ time (ms)
22475
+ iteration 753/ 159576 | consumed samples: 12048 | elapsed time per iteration (ms): 13566.2 | learning rate: 3.342E-06 | global batch size: 16 | lm loss: 7.567214E+00 | loss scale: 8192.0 | grad norm: 68234.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22476
+ time (ms)
22477
+ iteration 754/ 159576 | consumed samples: 12064 | elapsed time per iteration (ms): 14065.3 | learning rate: 3.346E-06 | global batch size: 16 | lm loss: 7.753268E+00 | loss scale: 8192.0 | grad norm: 134900.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22478
+ time (ms)
22479
+ iteration 755/ 159576 | consumed samples: 12080 | elapsed time per iteration (ms): 13518.6 | learning rate: 3.351E-06 | global batch size: 16 | lm loss: 7.552173E+00 | loss scale: 8192.0 | grad norm: 48964.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22480
+ time (ms)
22481
+ iteration 756/ 159576 | consumed samples: 12096 | elapsed time per iteration (ms): 13728.7 | learning rate: 3.355E-06 | global batch size: 16 | lm loss: 7.735795E+00 | loss scale: 8192.0 | grad norm: 73204.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22482
+ time (ms)
22483
+ iteration 757/ 159576 | consumed samples: 12112 | elapsed time per iteration (ms): 14082.3 | learning rate: 3.359E-06 | global batch size: 16 | lm loss: 7.910018E+00 | loss scale: 8192.0 | grad norm: 83429.905 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22484
+ time (ms)
22485
+ iteration 758/ 159576 | consumed samples: 12128 | elapsed time per iteration (ms): 13428.5 | learning rate: 3.364E-06 | global batch size: 16 | lm loss: 7.669195E+00 | loss scale: 8192.0 | grad norm: 61137.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22486
+ time (ms)
22487
+ iteration 759/ 159576 | consumed samples: 12144 | elapsed time per iteration (ms): 13632.1 | learning rate: 3.368E-06 | global batch size: 16 | lm loss: 7.795278E+00 | loss scale: 8192.0 | grad norm: 59141.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22488
+ time (ms)
22489
+ iteration 760/ 159576 | consumed samples: 12160 | elapsed time per iteration (ms): 13624.6 | learning rate: 3.373E-06 | global batch size: 16 | lm loss: 7.692988E+00 | loss scale: 8192.0 | grad norm: 104447.460 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22490
+ time (ms)
22491
+ iteration 761/ 159576 | consumed samples: 12176 | elapsed time per iteration (ms): 13611.0 | learning rate: 3.377E-06 | global batch size: 16 | lm loss: 7.784515E+00 | loss scale: 8192.0 | grad norm: 51368.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22492
+ time (ms)
22493
+ iteration 762/ 159576 | consumed samples: 12192 | elapsed time per iteration (ms): 13558.6 | learning rate: 3.382E-06 | global batch size: 16 | lm loss: 7.582584E+00 | loss scale: 8192.0 | grad norm: 61983.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22494
+ time (ms)
22495
+ iteration 763/ 159576 | consumed samples: 12208 | elapsed time per iteration (ms): 13793.4 | learning rate: 3.386E-06 | global batch size: 16 | lm loss: 7.743572E+00 | loss scale: 8192.0 | grad norm: 56837.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22496
+ time (ms)
22497
+ iteration 764/ 159576 | consumed samples: 12224 | elapsed time per iteration (ms): 13743.7 | learning rate: 3.391E-06 | global batch size: 16 | lm loss: 7.701952E+00 | loss scale: 8192.0 | grad norm: 92476.492 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22498
+ time (ms)
22499
+ iteration 765/ 159576 | consumed samples: 12240 | elapsed time per iteration (ms): 13529.8 | learning rate: 3.395E-06 | global batch size: 16 | lm loss: 7.691103E+00 | loss scale: 8192.0 | grad norm: 103276.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22500
+ time (ms)
22501
+ iteration 766/ 159576 | consumed samples: 12256 | elapsed time per iteration (ms): 13189.2 | learning rate: 3.399E-06 | global batch size: 16 | lm loss: 7.589336E+00 | loss scale: 8192.0 | grad norm: 54735.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22502
+ time (ms)
22503
+ iteration 767/ 159576 | consumed samples: 12272 | elapsed time per iteration (ms): 13483.6 | learning rate: 3.404E-06 | global batch size: 16 | lm loss: 7.717595E+00 | loss scale: 8192.0 | grad norm: 54456.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22504
+ time (ms)
22505
+ iteration 768/ 159576 | consumed samples: 12288 | elapsed time per iteration (ms): 13780.9 | learning rate: 3.408E-06 | global batch size: 16 | lm loss: 7.852913E+00 | loss scale: 8192.0 | grad norm: 88912.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22506
+ time (ms)
22507
+ iteration 769/ 159576 | consumed samples: 12304 | elapsed time per iteration (ms): 13724.3 | learning rate: 3.413E-06 | global batch size: 16 | lm loss: 7.716819E+00 | loss scale: 8192.0 | grad norm: 102833.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22508
+ time (ms)
22509
+ iteration 770/ 159576 | consumed samples: 12320 | elapsed time per iteration (ms): 13377.3 | learning rate: 3.417E-06 | global batch size: 16 | lm loss: 7.597641E+00 | loss scale: 8192.0 | grad norm: 50835.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22510
+ time (ms)
22511
+ iteration 771/ 159576 | consumed samples: 12336 | elapsed time per iteration (ms): 13692.5 | learning rate: 3.422E-06 | global batch size: 16 | lm loss: 7.478999E+00 | loss scale: 8192.0 | grad norm: 53587.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22512
+ time (ms)
22513
+ iteration 772/ 159576 | consumed samples: 12352 | elapsed time per iteration (ms): 14180.5 | learning rate: 3.426E-06 | global batch size: 16 | lm loss: 7.546258E+00 | loss scale: 8192.0 | grad norm: 63294.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22514
+ time (ms)
22515
+ iteration 773/ 159576 | consumed samples: 12368 | elapsed time per iteration (ms): 13096.5 | learning rate: 3.430E-06 | global batch size: 16 | lm loss: 7.711743E+00 | loss scale: 8192.0 | grad norm: 99934.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22516
+ time (ms)
22517
+ iteration 774/ 159576 | consumed samples: 12384 | elapsed time per iteration (ms): 13520.5 | learning rate: 3.435E-06 | global batch size: 16 | lm loss: 7.645664E+00 | loss scale: 8192.0 | grad norm: 56458.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22518
+ time (ms)
22519
+ iteration 775/ 159576 | consumed samples: 12400 | elapsed time per iteration (ms): 13630.5 | learning rate: 3.439E-06 | global batch size: 16 | lm loss: 7.603559E+00 | loss scale: 8192.0 | grad norm: 46450.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22520
+ time (ms)
22521
+ iteration 776/ 159576 | consumed samples: 12416 | elapsed time per iteration (ms): 14027.6 | learning rate: 3.444E-06 | global batch size: 16 | lm loss: 7.737686E+00 | loss scale: 8192.0 | grad norm: 141770.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22522
+ time (ms)
22523
+ iteration 777/ 159576 | consumed samples: 12432 | elapsed time per iteration (ms): 13425.6 | learning rate: 3.448E-06 | global batch size: 16 | lm loss: 7.584914E+00 | loss scale: 8192.0 | grad norm: 124071.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22524
+ time (ms)
22525
+ iteration 778/ 159576 | consumed samples: 12448 | elapsed time per iteration (ms): 13642.7 | learning rate: 3.453E-06 | global batch size: 16 | lm loss: 7.606685E+00 | loss scale: 8192.0 | grad norm: 53139.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22526
+ time (ms)
22527
+ iteration 779/ 159576 | consumed samples: 12464 | elapsed time per iteration (ms): 13834.1 | learning rate: 3.457E-06 | global batch size: 16 | lm loss: 7.786515E+00 | loss scale: 8192.0 | grad norm: 58657.499 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22528
+ time (ms)
22529
+ iteration 780/ 159576 | consumed samples: 12480 | elapsed time per iteration (ms): 13091.5 | learning rate: 3.462E-06 | global batch size: 16 | lm loss: 7.618142E+00 | loss scale: 8192.0 | grad norm: 37881.566 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22530
+ time (ms)
22531
+ iteration 781/ 159576 | consumed samples: 12496 | elapsed time per iteration (ms): 14146.0 | learning rate: 3.466E-06 | global batch size: 16 | lm loss: 7.906812E+00 | loss scale: 8192.0 | grad norm: 114163.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22532
+ time (ms)
22533
+ iteration 782/ 159576 | consumed samples: 12512 | elapsed time per iteration (ms): 14025.7 | learning rate: 3.470E-06 | global batch size: 16 | lm loss: 7.566094E+00 | loss scale: 8192.0 | grad norm: 46220.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22534
+ time (ms)
22535
+ iteration 783/ 159576 | consumed samples: 12528 | elapsed time per iteration (ms): 13895.4 | learning rate: 3.475E-06 | global batch size: 16 | lm loss: 7.630446E+00 | loss scale: 8192.0 | grad norm: 64319.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22536
+ time (ms)
22537
+ iteration 784/ 159576 | consumed samples: 12544 | elapsed time per iteration (ms): 13890.1 | learning rate: 3.479E-06 | global batch size: 16 | lm loss: 7.692337E+00 | loss scale: 8192.0 | grad norm: 48575.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22538
+ time (ms)
22539
+ iteration 785/ 159576 | consumed samples: 12560 | elapsed time per iteration (ms): 14156.1 | learning rate: 3.484E-06 | global batch size: 16 | lm loss: 7.736514E+00 | loss scale: 8192.0 | grad norm: 90651.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22540
+ time (ms)
22541
+ iteration 786/ 159576 | consumed samples: 12576 | elapsed time per iteration (ms): 14206.7 | learning rate: 3.488E-06 | global batch size: 16 | lm loss: 7.744794E+00 | loss scale: 8192.0 | grad norm: 84355.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22542
+ time (ms)
22543
+ iteration 787/ 159576 | consumed samples: 12592 | elapsed time per iteration (ms): 13622.2 | learning rate: 3.493E-06 | global batch size: 16 | lm loss: 7.672806E+00 | loss scale: 8192.0 | grad norm: 51705.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22544
+ time (ms)
22545
+ iteration 788/ 159576 | consumed samples: 12608 | elapsed time per iteration (ms): 13771.2 | learning rate: 3.497E-06 | global batch size: 16 | lm loss: 7.713612E+00 | loss scale: 8192.0 | grad norm: 50748.595 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22546
+ time (ms)
22547
+ iteration 789/ 159576 | consumed samples: 12624 | elapsed time per iteration (ms): 14226.1 | learning rate: 3.501E-06 | global batch size: 16 | lm loss: 7.630927E+00 | loss scale: 8192.0 | grad norm: 68226.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22548
+ time (ms)
22549
+ iteration 790/ 159576 | consumed samples: 12640 | elapsed time per iteration (ms): 14175.2 | learning rate: 3.506E-06 | global batch size: 16 | lm loss: 7.523444E+00 | loss scale: 8192.0 | grad norm: 67731.569 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22550
+ time (ms)
22551
+ iteration 791/ 159576 | consumed samples: 12656 | elapsed time per iteration (ms): 13844.2 | learning rate: 3.510E-06 | global batch size: 16 | lm loss: 7.357096E+00 | loss scale: 8192.0 | grad norm: 45569.401 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22552
+ time (ms)
22553
+ iteration 792/ 159576 | consumed samples: 12672 | elapsed time per iteration (ms): 13884.3 | learning rate: 3.515E-06 | global batch size: 16 | lm loss: 7.701885E+00 | loss scale: 8192.0 | grad norm: 53017.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22554
+ time (ms)
22555
+ iteration 793/ 159576 | consumed samples: 12688 | elapsed time per iteration (ms): 14159.9 | learning rate: 3.519E-06 | global batch size: 16 | lm loss: 7.529918E+00 | loss scale: 8192.0 | grad norm: 55466.888 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22556
+ time (ms)
22557
+ iteration 794/ 159576 | consumed samples: 12704 | elapsed time per iteration (ms): 13975.0 | learning rate: 3.524E-06 | global batch size: 16 | lm loss: 7.684763E+00 | loss scale: 8192.0 | grad norm: 44801.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22558
+ time (ms)
22559
+ iteration 795/ 159576 | consumed samples: 12720 | elapsed time per iteration (ms): 13769.3 | learning rate: 3.528E-06 | global batch size: 16 | lm loss: 7.843237E+00 | loss scale: 8192.0 | grad norm: 59761.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22560
+ time (ms)
22561
+ iteration 796/ 159576 | consumed samples: 12736 | elapsed time per iteration (ms): 13954.1 | learning rate: 3.533E-06 | global batch size: 16 | lm loss: 7.737316E+00 | loss scale: 8192.0 | grad norm: 66240.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22562
+ time (ms)
22563
+ iteration 797/ 159576 | consumed samples: 12752 | elapsed time per iteration (ms): 13982.4 | learning rate: 3.537E-06 | global batch size: 16 | lm loss: 7.712746E+00 | loss scale: 8192.0 | grad norm: 53315.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22564
+ time (ms)
22565
+ iteration 798/ 159576 | consumed samples: 12768 | elapsed time per iteration (ms): 14164.1 | learning rate: 3.541E-06 | global batch size: 16 | lm loss: 7.649867E+00 | loss scale: 8192.0 | grad norm: 46451.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22566
+ time (ms)
22567
+ iteration 799/ 159576 | consumed samples: 12784 | elapsed time per iteration (ms): 14010.0 | learning rate: 3.546E-06 | global batch size: 16 | lm loss: 7.833376E+00 | loss scale: 8192.0 | grad norm: 65829.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22568
+ time (ms)
22569
+ iteration 800/ 159576 | consumed samples: 12800 | elapsed time per iteration (ms): 14307.9 | learning rate: 3.550E-06 | global batch size: 16 | lm loss: 7.790625E+00 | loss scale: 8192.0 | grad norm: 71968.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22570
+ time (ms)
22571
+ iteration 801/ 159576 | consumed samples: 12816 | elapsed time per iteration (ms): 13972.6 | learning rate: 3.555E-06 | global batch size: 16 | lm loss: 7.611866E+00 | loss scale: 8192.0 | grad norm: 48597.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22572
+ time (ms)
22573
+ iteration 802/ 159576 | consumed samples: 12832 | elapsed time per iteration (ms): 13959.0 | learning rate: 3.559E-06 | global batch size: 16 | lm loss: 7.617666E+00 | loss scale: 8192.0 | grad norm: 147672.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22574
+ time (ms)
22575
+ iteration 803/ 159576 | consumed samples: 12848 | elapsed time per iteration (ms): 13806.4 | learning rate: 3.564E-06 | global batch size: 16 | lm loss: 7.813154E+00 | loss scale: 8192.0 | grad norm: 121980.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22576
+ time (ms)
22577
+ iteration 804/ 159576 | consumed samples: 12864 | elapsed time per iteration (ms): 13949.2 | learning rate: 3.568E-06 | global batch size: 16 | lm loss: 7.654176E+00 | loss scale: 8192.0 | grad norm: 52351.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22578
+ time (ms)
22579
+ iteration 805/ 159576 | consumed samples: 12880 | elapsed time per iteration (ms): 13801.9 | learning rate: 3.572E-06 | global batch size: 16 | lm loss: 7.564305E+00 | loss scale: 8192.0 | grad norm: 62792.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22580
+ time (ms)
22581
+ iteration 806/ 159576 | consumed samples: 12896 | elapsed time per iteration (ms): 13954.3 | learning rate: 3.577E-06 | global batch size: 16 | lm loss: 7.707185E+00 | loss scale: 8192.0 | grad norm: 64767.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22582
+ time (ms)
22583
+ iteration 807/ 159576 | consumed samples: 12912 | elapsed time per iteration (ms): 14250.4 | learning rate: 3.581E-06 | global batch size: 16 | lm loss: 7.578569E+00 | loss scale: 8192.0 | grad norm: 73926.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22584
+ time (ms)
22585
+ iteration 808/ 159576 | consumed samples: 12928 | elapsed time per iteration (ms): 14201.0 | learning rate: 3.586E-06 | global batch size: 16 | lm loss: 7.631069E+00 | loss scale: 8192.0 | grad norm: 110069.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22586
+ time (ms)
22587
+ iteration 809/ 159576 | consumed samples: 12944 | elapsed time per iteration (ms): 13598.4 | learning rate: 3.590E-06 | global batch size: 16 | lm loss: 7.628491E+00 | loss scale: 8192.0 | grad norm: 49670.988 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22588
+ time (ms)
22589
+ iteration 810/ 159576 | consumed samples: 12960 | elapsed time per iteration (ms): 13941.6 | learning rate: 3.595E-06 | global batch size: 16 | lm loss: 7.759563E+00 | loss scale: 8192.0 | grad norm: 45971.027 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22590
+ time (ms)
22591
+ iteration 811/ 159576 | consumed samples: 12976 | elapsed time per iteration (ms): 14298.0 | learning rate: 3.599E-06 | global batch size: 16 | lm loss: 7.502759E+00 | loss scale: 8192.0 | grad norm: 77602.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22592
+ time (ms)
22593
+ iteration 812/ 159576 | consumed samples: 12992 | elapsed time per iteration (ms): 13416.1 | learning rate: 3.604E-06 | global batch size: 16 | lm loss: 7.624804E+00 | loss scale: 8192.0 | grad norm: 95989.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22594
+ time (ms)
22595
+ iteration 813/ 159576 | consumed samples: 13008 | elapsed time per iteration (ms): 13579.1 | learning rate: 3.608E-06 | global batch size: 16 | lm loss: 7.542982E+00 | loss scale: 8192.0 | grad norm: 52064.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22596
+ time (ms)
22597
+ iteration 814/ 159576 | consumed samples: 13024 | elapsed time per iteration (ms): 14100.2 | learning rate: 3.612E-06 | global batch size: 16 | lm loss: 7.676429E+00 | loss scale: 8192.0 | grad norm: 38221.569 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22598
+ time (ms)
22599
+ iteration 815/ 159576 | consumed samples: 13040 | elapsed time per iteration (ms): 14346.2 | learning rate: 3.617E-06 | global batch size: 16 | lm loss: 7.695131E+00 | loss scale: 8192.0 | grad norm: 57869.513 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22600
+ time (ms)
22601
+ iteration 816/ 159576 | consumed samples: 13056 | elapsed time per iteration (ms): 13771.7 | learning rate: 3.621E-06 | global batch size: 16 | lm loss: 7.578337E+00 | loss scale: 8192.0 | grad norm: 49771.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22602
+ time (ms)
22603
+ iteration 817/ 159576 | consumed samples: 13072 | elapsed time per iteration (ms): 13776.0 | learning rate: 3.626E-06 | global batch size: 16 | lm loss: 7.583301E+00 | loss scale: 8192.0 | grad norm: 46160.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22604
+ time (ms)
22605
+ iteration 818/ 159576 | consumed samples: 13088 | elapsed time per iteration (ms): 14040.8 | learning rate: 3.630E-06 | global batch size: 16 | lm loss: 7.773385E+00 | loss scale: 8192.0 | grad norm: 42207.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22606
+ time (ms)
22607
+ iteration 819/ 159576 | consumed samples: 13104 | elapsed time per iteration (ms): 13835.3 | learning rate: 3.635E-06 | global batch size: 16 | lm loss: 7.905573E+00 | loss scale: 8192.0 | grad norm: 111883.611 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22608
+ time (ms)
22609
+ iteration 820/ 159576 | consumed samples: 13120 | elapsed time per iteration (ms): 13924.4 | learning rate: 3.639E-06 | global batch size: 16 | lm loss: 7.730550E+00 | loss scale: 8192.0 | grad norm: 75433.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22610
+ time (ms)
22611
+ iteration 821/ 159576 | consumed samples: 13136 | elapsed time per iteration (ms): 13915.0 | learning rate: 3.643E-06 | global batch size: 16 | lm loss: 7.688564E+00 | loss scale: 8192.0 | grad norm: 41927.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22612
+ time (ms)
22613
+ iteration 822/ 159576 | consumed samples: 13152 | elapsed time per iteration (ms): 13890.4 | learning rate: 3.648E-06 | global batch size: 16 | lm loss: 7.552343E+00 | loss scale: 8192.0 | grad norm: 96543.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22614
+ time (ms)
22615
+ iteration 823/ 159576 | consumed samples: 13168 | elapsed time per iteration (ms): 13560.6 | learning rate: 3.652E-06 | global batch size: 16 | lm loss: 7.617982E+00 | loss scale: 8192.0 | grad norm: 56370.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22616
+ time (ms)
22617
+ iteration 824/ 159576 | consumed samples: 13184 | elapsed time per iteration (ms): 14024.1 | learning rate: 3.657E-06 | global batch size: 16 | lm loss: 7.600199E+00 | loss scale: 8192.0 | grad norm: 61928.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22618
+ time (ms)
22619
+ iteration 825/ 159576 | consumed samples: 13200 | elapsed time per iteration (ms): 14003.2 | learning rate: 3.661E-06 | global batch size: 16 | lm loss: 7.541789E+00 | loss scale: 8192.0 | grad norm: 56863.341 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22620
+ time (ms)
22621
+ iteration 826/ 159576 | consumed samples: 13216 | elapsed time per iteration (ms): 13848.3 | learning rate: 3.666E-06 | global batch size: 16 | lm loss: 7.782004E+00 | loss scale: 8192.0 | grad norm: 59985.533 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22622
+ time (ms)
22623
+ iteration 827/ 159576 | consumed samples: 13232 | elapsed time per iteration (ms): 13902.1 | learning rate: 3.670E-06 | global batch size: 16 | lm loss: 7.733065E+00 | loss scale: 8192.0 | grad norm: 39148.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22624
+ time (ms)
22625
+ iteration 828/ 159576 | consumed samples: 13248 | elapsed time per iteration (ms): 14356.1 | learning rate: 3.675E-06 | global batch size: 16 | lm loss: 7.625387E+00 | loss scale: 8192.0 | grad norm: 56612.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22626
+ time (ms)
22627
+ iteration 829/ 159576 | consumed samples: 13264 | elapsed time per iteration (ms): 14368.0 | learning rate: 3.679E-06 | global batch size: 16 | lm loss: 7.759684E+00 | loss scale: 8192.0 | grad norm: 67635.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22628
+ time (ms)
22629
+ iteration 830/ 159576 | consumed samples: 13280 | elapsed time per iteration (ms): 13627.9 | learning rate: 3.683E-06 | global batch size: 16 | lm loss: 7.694915E+00 | loss scale: 8192.0 | grad norm: 60776.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22630
+ time (ms)
22631
+ iteration 831/ 159576 | consumed samples: 13296 | elapsed time per iteration (ms): 13498.1 | learning rate: 3.688E-06 | global batch size: 16 | lm loss: 7.492978E+00 | loss scale: 8192.0 | grad norm: 42000.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22632
+ time (ms)
22633
+ iteration 832/ 159576 | consumed samples: 13312 | elapsed time per iteration (ms): 13938.9 | learning rate: 3.692E-06 | global batch size: 16 | lm loss: 7.616700E+00 | loss scale: 8192.0 | grad norm: 105579.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22634
+ time (ms)
22635
+ iteration 833/ 159576 | consumed samples: 13328 | elapsed time per iteration (ms): 13687.8 | learning rate: 3.697E-06 | global batch size: 16 | lm loss: 7.715961E+00 | loss scale: 8192.0 | grad norm: 78119.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22636
+ time (ms)
22637
+ iteration 834/ 159576 | consumed samples: 13344 | elapsed time per iteration (ms): 13717.8 | learning rate: 3.701E-06 | global batch size: 16 | lm loss: 7.778497E+00 | loss scale: 8192.0 | grad norm: 58326.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22638
+ time (ms)
22639
+ iteration 835/ 159576 | consumed samples: 13360 | elapsed time per iteration (ms): 13913.9 | learning rate: 3.706E-06 | global batch size: 16 | lm loss: 7.718093E+00 | loss scale: 8192.0 | grad norm: 48122.513 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22640
+ time (ms)
22641
+ iteration 836/ 159576 | consumed samples: 13376 | elapsed time per iteration (ms): 14318.5 | learning rate: 3.710E-06 | global batch size: 16 | lm loss: 7.521303E+00 | loss scale: 8192.0 | grad norm: 60082.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22642
+ time (ms)
22643
+ iteration 837/ 159576 | consumed samples: 13392 | elapsed time per iteration (ms): 13780.0 | learning rate: 3.714E-06 | global batch size: 16 | lm loss: 7.538383E+00 | loss scale: 8192.0 | grad norm: 61043.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22644
+ time (ms)
22645
+ iteration 838/ 159576 | consumed samples: 13408 | elapsed time per iteration (ms): 13961.2 | learning rate: 3.719E-06 | global batch size: 16 | lm loss: 7.548276E+00 | loss scale: 8192.0 | grad norm: 58423.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22646
+ time (ms)
22647
+ iteration 839/ 159576 | consumed samples: 13424 | elapsed time per iteration (ms): 14239.6 | learning rate: 3.723E-06 | global batch size: 16 | lm loss: 7.618182E+00 | loss scale: 8192.0 | grad norm: 48500.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22648
+ time (ms)
22649
+ iteration 840/ 159576 | consumed samples: 13440 | elapsed time per iteration (ms): 13752.3 | learning rate: 3.728E-06 | global batch size: 16 | lm loss: 7.595082E+00 | loss scale: 8192.0 | grad norm: 50825.625 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22650
+ time (ms)
22651
+ iteration 841/ 159576 | consumed samples: 13456 | elapsed time per iteration (ms): 14199.3 | learning rate: 3.732E-06 | global batch size: 16 | lm loss: 7.492725E+00 | loss scale: 8192.0 | grad norm: 56977.964 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22652
+ time (ms)
22653
+ iteration 842/ 159576 | consumed samples: 13472 | elapsed time per iteration (ms): 13925.4 | learning rate: 3.737E-06 | global batch size: 16 | lm loss: 7.783816E+00 | loss scale: 8192.0 | grad norm: 40797.888 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22654
+ time (ms)
22655
+ iteration 843/ 159576 | consumed samples: 13488 | elapsed time per iteration (ms): 14119.4 | learning rate: 3.741E-06 | global batch size: 16 | lm loss: 7.606951E+00 | loss scale: 8192.0 | grad norm: 50890.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22656
+ time (ms)
22657
+ iteration 844/ 159576 | consumed samples: 13504 | elapsed time per iteration (ms): 13941.8 | learning rate: 3.746E-06 | global batch size: 16 | lm loss: 7.638199E+00 | loss scale: 8192.0 | grad norm: 52652.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22658
+ time (ms)
22659
+ iteration 845/ 159576 | consumed samples: 13520 | elapsed time per iteration (ms): 14424.1 | learning rate: 3.750E-06 | global batch size: 16 | lm loss: 7.555171E+00 | loss scale: 8192.0 | grad norm: 48298.607 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22660
+ time (ms)
22661
+ iteration 846/ 159576 | consumed samples: 13536 | elapsed time per iteration (ms): 14202.9 | learning rate: 3.754E-06 | global batch size: 16 | lm loss: 7.651504E+00 | loss scale: 8192.0 | grad norm: 76618.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22662
+ time (ms)
22663
+ iteration 847/ 159576 | consumed samples: 13552 | elapsed time per iteration (ms): 13785.9 | learning rate: 3.759E-06 | global batch size: 16 | lm loss: 7.914087E+00 | loss scale: 8192.0 | grad norm: 40970.022 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22664
+ time (ms)
22665
+ iteration 848/ 159576 | consumed samples: 13568 | elapsed time per iteration (ms): 13892.7 | learning rate: 3.763E-06 | global batch size: 16 | lm loss: 7.714731E+00 | loss scale: 8192.0 | grad norm: 47666.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22666
+ time (ms)
22667
+ iteration 849/ 159576 | consumed samples: 13584 | elapsed time per iteration (ms): 13608.6 | learning rate: 3.768E-06 | global batch size: 16 | lm loss: 7.566309E+00 | loss scale: 8192.0 | grad norm: 56337.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22668
+ time (ms)
22669
+ iteration 850/ 159576 | consumed samples: 13600 | elapsed time per iteration (ms): 13752.1 | learning rate: 3.772E-06 | global batch size: 16 | lm loss: 7.621016E+00 | loss scale: 8192.0 | grad norm: 55695.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22670
+ time (ms)
22671
+ iteration 851/ 159576 | consumed samples: 13616 | elapsed time per iteration (ms): 13514.6 | learning rate: 3.777E-06 | global batch size: 16 | lm loss: 7.510153E+00 | loss scale: 8192.0 | grad norm: 70852.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22672
+ time (ms)
22673
+ iteration 852/ 159576 | consumed samples: 13632 | elapsed time per iteration (ms): 13536.1 | learning rate: 3.781E-06 | global batch size: 16 | lm loss: 7.417966E+00 | loss scale: 8192.0 | grad norm: 43169.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22674
+ time (ms)
22675
+ iteration 853/ 159576 | consumed samples: 13648 | elapsed time per iteration (ms): 14116.4 | learning rate: 3.786E-06 | global batch size: 16 | lm loss: 7.490001E+00 | loss scale: 8192.0 | grad norm: 61980.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22676
+ time (ms)
22677
+ iteration 854/ 159576 | consumed samples: 13664 | elapsed time per iteration (ms): 14372.8 | learning rate: 3.790E-06 | global batch size: 16 | lm loss: 7.555287E+00 | loss scale: 8192.0 | grad norm: 43650.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22678
+ time (ms)
22679
+ iteration 855/ 159576 | consumed samples: 13680 | elapsed time per iteration (ms): 13154.5 | learning rate: 3.794E-06 | global batch size: 16 | lm loss: 7.628311E+00 | loss scale: 8192.0 | grad norm: 32290.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22680
+ time (ms)
22681
+ iteration 856/ 159576 | consumed samples: 13696 | elapsed time per iteration (ms): 13509.6 | learning rate: 3.799E-06 | global batch size: 16 | lm loss: 7.757495E+00 | loss scale: 8192.0 | grad norm: 94063.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22682
+ time (ms)
22683
+ iteration 857/ 159576 | consumed samples: 13712 | elapsed time per iteration (ms): 14015.7 | learning rate: 3.803E-06 | global batch size: 16 | lm loss: 7.733263E+00 | loss scale: 8192.0 | grad norm: 53189.090 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22684
+ time (ms)
22685
+ iteration 858/ 159576 | consumed samples: 13728 | elapsed time per iteration (ms): 14357.8 | learning rate: 3.808E-06 | global batch size: 16 | lm loss: 7.570580E+00 | loss scale: 8192.0 | grad norm: 57239.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22686
+ time (ms)
22687
+ iteration 859/ 159576 | consumed samples: 13744 | elapsed time per iteration (ms): 13954.6 | learning rate: 3.812E-06 | global batch size: 16 | lm loss: 7.593122E+00 | loss scale: 8192.0 | grad norm: 45414.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22688
+ time (ms)
22689
+ iteration 860/ 159576 | consumed samples: 13760 | elapsed time per iteration (ms): 14212.3 | learning rate: 3.817E-06 | global batch size: 16 | lm loss: 7.571471E+00 | loss scale: 8192.0 | grad norm: 75659.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22690
+ time (ms)
22691
+ iteration 861/ 159576 | consumed samples: 13776 | elapsed time per iteration (ms): 14044.0 | learning rate: 3.821E-06 | global batch size: 16 | lm loss: 7.599829E+00 | loss scale: 8192.0 | grad norm: 47651.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22692
+ time (ms)
22693
+ iteration 862/ 159576 | consumed samples: 13792 | elapsed time per iteration (ms): 13529.5 | learning rate: 3.825E-06 | global batch size: 16 | lm loss: 7.427186E+00 | loss scale: 8192.0 | grad norm: 76377.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22694
+ time (ms)
22695
+ iteration 863/ 159576 | consumed samples: 13808 | elapsed time per iteration (ms): 14057.3 | learning rate: 3.830E-06 | global batch size: 16 | lm loss: 7.736305E+00 | loss scale: 8192.0 | grad norm: 76320.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22696
+ time (ms)
22697
+ iteration 864/ 159576 | consumed samples: 13824 | elapsed time per iteration (ms): 14064.2 | learning rate: 3.834E-06 | global batch size: 16 | lm loss: 7.637553E+00 | loss scale: 8192.0 | grad norm: 56695.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22698
+ time (ms)
22699
+ iteration 865/ 159576 | consumed samples: 13840 | elapsed time per iteration (ms): 14009.0 | learning rate: 3.839E-06 | global batch size: 16 | lm loss: 7.709378E+00 | loss scale: 8192.0 | grad norm: 77647.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22700
+ time (ms)
22701
+ iteration 866/ 159576 | consumed samples: 13856 | elapsed time per iteration (ms): 13951.3 | learning rate: 3.843E-06 | global batch size: 16 | lm loss: 7.856131E+00 | loss scale: 8192.0 | grad norm: 85925.999 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22702
+ time (ms)
22703
+ iteration 867/ 159576 | consumed samples: 13872 | elapsed time per iteration (ms): 14427.4 | learning rate: 3.848E-06 | global batch size: 16 | lm loss: 7.511599E+00 | loss scale: 8192.0 | grad norm: 50353.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22704
+ time (ms)
22705
+ iteration 868/ 159576 | consumed samples: 13888 | elapsed time per iteration (ms): 14117.9 | learning rate: 3.852E-06 | global batch size: 16 | lm loss: 7.803133E+00 | loss scale: 8192.0 | grad norm: 73334.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
22706
+ time (ms)