  0%|          | 0/1365 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  0%|          | 1/1365 [00:11<4:10:37, 11.02s/it]
[2024-02-01 17:59:06,373] [WARNING] [stage3.py:1949:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  1%|          | 10/1365 [00:43<1:23:26, 3.70s/it]
[... training proceeds steadily at ~3.60 s/it ...]
 20%|██        | 273/1365 [16:31<1:05:09, 3.58s/it]
[INFO|trainer.py:3166] 2024-02-01 18:15:26,429 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 18:15:26,429 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 18:15:26,429 >>   Batch size = 32
 95%|█████████▌| 58/61 [00:28<00:01, 1.99it/s]
[INFO|trainer.py:2889] 2024-02-01 18:15:58,319 >> Saving model checkpoint to ./tmp-checkpoint-273
[INFO|configuration_utils.py:483] 2024-02-01 18:15:58,323 >> Configuration saved in ./tmp-checkpoint-273/config.json
[INFO|configuration_utils.py:594] 2024-02-01 18:15:58,326 >> Configuration saved in ./tmp-checkpoint-273/generation_config.json
[INFO|modeling_utils.py:2382] 2024-02-01 18:16:01,522 >> Model weights saved in ./tmp-checkpoint-273/pytorch_model.bin
[INFO|tokenization_utils_base.py:2432] 2024-02-01 18:16:01,541 >> tokenizer config file saved in ./tmp-checkpoint-273/tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 18:16:01,543 >> Special tokens file saved in ./tmp-checkpoint-273/special_tokens_map.json
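The LlamaTokenizerFast notice above is only advisory: with a fast (Rust-backed) tokenizer, a single `__call__` tokenizes and pads a whole batch in one pass, so there is no need to encode texts one by one and then call `pad` separately. A minimal sketch of the two patterns, assuming a generic Llama checkpoint name (the run's actual model path is not shown in this log):

    from transformers import AutoTokenizer

    # Placeholder checkpoint; the real model path is not in this log.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

    texts = ["Hello world", "A somewhat longer example sentence"]

    # Slower pattern the warning refers to: encode each text, then pad in a second pass.
    encoded = [tokenizer.encode_plus(t) for t in texts]
    padded = tokenizer.pad(encoded, padding=True, return_tensors="pt")

    # Faster pattern: let __call__ batch-tokenize and pad in one call.
    batch = tokenizer(texts, padding=True, return_tensors="pt")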
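The stage3.py warning likewise carries its own fix: reduce memory pressure (e.g. a smaller micro-batch or more gradient accumulation), or flush the allocator cache on all ranks at the same point in the loop. A sketch of the suggested call, assuming hypothetical `model_engine` and `data_loader` names in a hand-rolled DeepSpeed loop (this run uses the HF Trainer, which owns the loop instead):

    from deepspeed.accelerator import get_accelerator

    # Hypothetical DeepSpeed training loop; model_engine and data_loader are assumed names.
    for step, batch in enumerate(data_loader):
        loss = model_engine(batch)
        model_engine.backward(loss)
        model_engine.step()

        # Flush the CUDA caching allocator on every rank at the same time,
        # as the warning above recommends when cache flushes become frequent.
        if step % 100 == 0:
            get_accelerator().empty_cache()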
/fsx/sanchit/miniconda3/envs/venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2024-02-01 18:16:01,626] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step273 is about to be saved!
[2024-02-01 18:16:01,783] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./tmp-checkpoint-273/global_step273/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-02-01 18:16:01,784] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./tmp-checkpoint-273/global_step273/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-02-01 18:16:01,787] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./tmp-checkpoint-273/global_step273/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-02-01 18:16:01,792] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./tmp-checkpoint-273/global_step273/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-02-01 18:16:05,650] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./tmp-checkpoint-273/global_step273/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-02-01 18:16:05,658] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved ./tmp-checkpoint-273/global_step273/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-02-01 18:16:05,936] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step273 is ready now!
[INFO|tokenization_utils_base.py:2432] 2024-02-01 18:16:08,421 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 18:16:08,424 >> Special tokens file saved in ./special_tokens_map.json
 21%|██        | 280/1365 [17:38<1:31:41, 5.07s/it]
[... training continues at ~3.6 s/it ...]
 40%|████      | 546/1365 [33:37<48:43, 3.57s/it]
[INFO|trainer.py:3166] 2024-02-01 18:32:32,716 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 18:32:32,716 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 18:32:32,716 >>   Batch size = 32
 95%|█████████▌| 58/61 [00:28<00:01, 1.99it/s]
[INFO|trainer.py:2889] 2024-02-01 18:33:04,367 >> Saving model checkpoint to ./tmp-checkpoint-546
[... config, generation config, model weights, and tokenizer files saved as for checkpoint-273 ...]
[2024-02-01 18:33:11,471] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved ./tmp-checkpoint-546/global_step546/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-02-01 18:33:11,659] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step546 is ready now!
[INFO|tokenization_utils_base.py:2432] 2024-02-01 18:33:14,019 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 18:33:14,021 >> Special tokens file saved in ./special_tokens_map.json
[INFO|trainer.py:2979] 2024-02-01 18:33:14,053 >> Deleting older checkpoint [checkpoint-273] due to args.save_total_limit
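This deletion is the Trainer's checkpoint rotation at work: with `args.save_total_limit` set, only the most recent checkpoints are kept, so checkpoint-273 is removed as soon as checkpoint-546 lands. A sketch of arguments consistent with this log (a limit of 1, evaluation and saves once per 273-step epoch, eval batch size 32); the remaining values are assumptions, not read from the run:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./",
        num_train_epochs=5,             # 1365 total steps = 5 epochs x 273 steps
        per_device_eval_batch_size=32,  # matches "Batch size = 32" in the eval logs
        evaluation_strategy="epoch",    # evaluation fires at steps 273, 546, 819, ...
        save_strategy="epoch",
        save_total_limit=1,             # older checkpoints deleted, as logged above
        bf16=True,                      # consistent with the bf16 ZeRO optimizer states
    )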
 40%|████      | 550/1365 [34:33<1:46:53, 7.87s/it]
[... training continues at ~3.6 s/it ...]
 60%|██████    | 819/1365 [50:41<32:27, 3.57s/it]
[INFO|trainer.py:3166] 2024-02-01 18:49:37,123 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 18:49:37,124 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 18:49:37,124 >>   Batch size = 32
 95%|█████████▌| 58/61 [00:28<00:01, 2.00it/s]
[INFO|trainer.py:2889] 2024-02-01 18:50:08,864 >> Saving model checkpoint to ./tmp-checkpoint-819
[... config, generation config, model weights, and tokenizer files saved as for checkpoint-273 ...]
[2024-02-01 18:50:15,843] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved ./tmp-checkpoint-819/global_step819/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-02-01 18:50:16,356] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step819 is ready now!
[INFO|tokenization_utils_base.py:2432] 2024-02-01 18:50:18,699 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 18:50:18,701 >> Special tokens file saved in ./special_tokens_map.json
[INFO|trainer.py:2979] 2024-02-01 18:50:18,733 >> Deleting older checkpoint [checkpoint-546] due to args.save_total_limit
 60%|██████    | 820/1365 [51:27<2:26:38, 16.14s/it]
[... training continues at ~3.6 s/it ...]
 80%|████████  | 1092/1365 [1:07:47<16:14, 3.57s/it]
[INFO|trainer.py:3166] 2024-02-01 19:06:42,792 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 19:06:42,792 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 19:06:42,792 >>   Batch size = 32
[INFO|configuration_utils.py:483] 2024-02-01 19:07:14,404 >> Configuration saved in ./tmp-checkpoint-1092/config.json
[INFO|configuration_utils.py:594] 2024-02-01 19:07:14,406 >> Configuration saved in ./tmp-checkpoint-1092/generation_config.json
{'eval_loss': 2.1637322902679443, 'eval_runtime': 30.5067, 'eval_samples_per_second': 505.824, 'eval_steps_per_second': 2.0, 'epoch': 4.0}
[INFO|modeling_utils.py:2382] 2024-02-01 19:07:17,532 >> Model weights saved in ./tmp-checkpoint-1092/pytorch_model.bin
[... tokenizer files saved as for checkpoint-273 ...]
[2024-02-01 19:07:21,476] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved ./tmp-checkpoint-1092/global_step1092/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-02-01 19:07:21,809] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1092 is ready now!
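Note that each `global_stepN` directory holds ZeRO-partitioned states (`zero_pp_rank_*` model and optimizer shards, one set per rank), not a single consolidated state dict. If plain fp32 weights are needed from such a checkpoint, DeepSpeed provides a conversion helper; a sketch, assuming the directory layout shown above and enough host RAM for the consolidation:

    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # Gather the rank-partitioned ZeRO shards under ./tmp-checkpoint-1092/global_step1092
    # into one fp32 state dict on CPU (the tag is read from the checkpoint's `latest` file).
    state_dict = get_fp32_state_dict_from_zero_checkpoint("./tmp-checkpoint-1092")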
[INFO|tokenization_utils_base.py:2432] 2024-02-01 19:07:24,145 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 19:07:24,147 >> Special tokens file saved in ./special_tokens_map.json
[INFO|trainer.py:2979] 2024-02-01 19:07:24,178 >> Deleting older checkpoint [checkpoint-819] due to args.save_total_limit
 81%|████████  | 1100/1365 [1:08:57<20:25, 4.62s/it]
[... training continues at ~3.6 s/it ...]
100%|██████████| 1365/1365 [1:24:52<00:00, 3.58s/it]
[INFO|trainer.py:3166] 2024-02-01 19:23:48,314 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 19:23:48,314 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 19:23:48,314 >>   Batch size = 32
[INFO|configuration_utils.py:483] 2024-02-01 19:24:20,057 >> Configuration saved in ./tmp-checkpoint-1365/config.json
[INFO|configuration_utils.py:594] 2024-02-01 19:24:20,059 >> Configuration saved in ./tmp-checkpoint-1365/generation_config.json
{'eval_loss': 2.118281126022339, 'eval_runtime': 30.6172, 'eval_samples_per_second': 503.999, 'eval_steps_per_second': 1.992, 'epoch': 5.0}
[INFO|modeling_utils.py:2382] 2024-02-01 19:24:23,364 >> Model weights saved in ./tmp-checkpoint-1365/pytorch_model.bin
[... tokenizer files saved as for checkpoint-273 ...]
[2024-02-01 19:24:27,298] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved ./tmp-checkpoint-1365/global_step1365/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-02-01 19:24:27,349] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1365 is ready now!
[INFO|tokenization_utils_base.py:2432] 2024-02-01 19:24:29,644 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 19:24:29,647 >> Special tokens file saved in ./special_tokens_map.json
[INFO|trainer.py:2979] 2024-02-01 19:24:29,677 >> Deleting older checkpoint [checkpoint-1092] due to args.save_total_limit
[INFO|trainer.py:1947] 2024-02-01 19:24:29,709 >> Training completed. Do not forget to share your model on huggingface.co/models =)
100%|██████████| 1365/1365 [1:25:34<00:00, 3.76s/it]
[INFO|trainer.py:3614] 2024-02-01 19:24:29,900 >> Waiting for the current checkpoint push to be finished, this might take a couple of minutes.
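The final `eval_loss` of 2.1183 logged above is a mean per-token cross-entropy, so it converts directly to perplexity; a quick check:

    import math

    eval_loss = 2.118281126022339     # from the epoch-5 eval dict above
    perplexity = math.exp(eval_loss)  # perplexity = exp(cross-entropy)
    print(f"{perplexity:.2f}")        # ~8.32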
{'train_runtime': 5141.5129, 'train_samples_per_second': 135.588, 'train_steps_per_second': 0.265, 'train_loss': 3.477488596011431, 'epoch': 5.0}
[INFO|trainer.py:3166] 2024-02-01 19:25:31,190 >> ***** Running Evaluation *****
[INFO|trainer.py:3168] 2024-02-01 19:25:31,190 >>   Num examples = 15431
[INFO|trainer.py:3171] 2024-02-01 19:25:31,190 >>   Batch size = 32
  3%|▎         | 2/61 [00:00<00:14, 4.04it/s]
***** train metrics *****
  epoch                    =        5.0
  train_loss               =     3.4775
  train_runtime            = 1:25:41.51
  train_samples            =     207865
  train_samples_per_second =    135.588
  train_steps_per_second   =      0.265
100%|██████████| 61/61 [00:29<00:00, 2.04it/s]
***** eval metrics *****
  epoch                   =        5.0
  eval_loss               =     2.1183
  eval_runtime            = 0:00:30.30
  eval_samples            =      23110
  eval_samples_per_second =     509.26
  eval_steps_per_second   =      2.013
2024-02-01 19:26:01 - INFO - __main__ - *** Save model ***
[INFO|trainer.py:2889] 2024-02-01 19:26:02,688 >> Saving model checkpoint to ./
[INFO|configuration_utils.py:483] 2024-02-01 19:26:02,691 >> Configuration saved in ./config.json
[INFO|configuration_utils.py:594] 2024-02-01 19:26:02,693 >> Configuration saved in ./generation_config.json
[INFO|modeling_utils.py:2382] 2024-02-01 19:26:06,302 >> Model weights saved in ./pytorch_model.bin
[INFO|tokenization_utils_base.py:2432] 2024-02-01 19:26:06,305 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 19:26:06,307 >> Special tokens file saved in ./special_tokens_map.json
[INFO|trainer.py:2889] 2024-02-01 19:26:07,389 >> Saving model checkpoint to ./
[... same config, weights, and tokenizer files saved again ...]
[INFO|modelcard.py:452] 2024-02-01 19:26:11,224 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': 'generator', 'type': 'generator', 'config': 'default', 'split': 'train', 'args': 'default'}}
events.out.tfevents.1706815561.ip-26-0-165-24.239318.1: 100%|██████████| 359/359 [00:00<00:00, 4.01kB/s]
run-i93q0p12.wandb: 100%|██████████| 1.57M/1.57M [00:00<00:00, 11.3MB/s]
Upload 2 LFS files: 100%|██████████| 2/2 [00:00<00:00, 6.81it/s]
[INFO|modelcard.py:452] 2024-02-01 19:26:15,132 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': 'HuggingFaceH4/ultrachat_200k', 'type': 'HuggingFaceH4/ultrachat_200k', 'config': 'default', 'split': 'train', 'args': 'default'}}
[INFO|configuration_utils.py:483] 2024-02-01 19:26:15,137 >> Configuration saved in ./config.json
[INFO|trainer.py:2889] 2024-02-01 19:26:16,239 >> Saving model checkpoint to ./
[INFO|configuration_utils.py:483] 2024-02-01 19:26:16,242 >> Configuration saved in ./config.json
[INFO|configuration_utils.py:594] 2024-02-01 19:26:16,244 >> Configuration saved in ./generation_config.json
2024-02-01 19:26:14 - INFO - __main__ - Model saved to ./
2024-02-01 19:26:15 - INFO - __main__ - Pushing to hub...
[INFO|modeling_utils.py:2382] 2024-02-01 19:26:19,933 >> Model weights saved in ./pytorch_model.bin
[INFO|tokenization_utils_base.py:2432] 2024-02-01 19:26:19,936 >> tokenizer config file saved in ./tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-02-01 19:26:19,938 >> Special tokens file saved in ./special_tokens_map.json
[INFO|modelcard.py:452] 2024-02-01 19:26:20,002 >> Dropping the following result as it does not have all the necessary fields:
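The log cuts off mid-push: the tfevents and wandb uploads above are part of pushing the final model to the Hugging Face Hub. A minimal sketch of that closing step, assuming a script with `trainer` in scope and Hub credentials already configured (the target repo is not shown in this log):

    # Hypothetical final step of the training script; `trainer` is an assumed name.
    trainer.save_model("./")  # writes config.json, pytorch_model.bin, tokenizer files
    trainer.push_to_hub()     # uploads the output directory and model card to the Hub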