/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
[2023-04-21 23:18:50,389] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 23:18:51,040] [INFO] [runner.py:540:main] cmd = /home/AdamG012/.conda/envs/py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir ./output
[2023-04-21 23:18:54,600] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-21 23:18:54,600] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-21 23:18:54,600] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-21 23:18:54,600] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-21 23:18:54,600] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
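The --world_info argument in the launch command is just base64-encoded JSON. A minimal standard-library sketch that decodes the value above, matching the WORLD INFO DICT the launcher prints:

    import base64
    import json

    # --world_info value copied from the launch command above
    world_info_b64 = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
    world_info = json.loads(base64.b64decode(world_info_b64))
    print(world_info)  # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}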
[2023-04-21 23:19:11,809] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Found cached dataset parquet (/reward/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/reward/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  0%| | 0/2 [00:00
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-21 23:24:55,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:24:55,342] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   amp_enabled .................. False
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   amp_params ................... False
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   bfloat16_enabled ............. False
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   checkpoint_parallel_write_pipeline  False
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   checkpoint_tag_validation_enabled  True
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   checkpoint_tag_validation_fail  False
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   comms_config .................
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   communication_data_type ...... None
[2023-04-21 23:24:55,343] [INFO] [config.py:957:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   curriculum_enabled_legacy .... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   curriculum_params_legacy ..... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   data_efficiency_enabled ...... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   dataloader_drop_last ......... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   disable_allgather ............ False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   dump_state ................... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_enabled ........... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_gas_boundary_resolution  1
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_layer_num ......... 0
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_max_iter .......... 100
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_stability ......... 1e-06
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_tol ............... 0.01
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   eigenvalue_verbose ........... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   elasticity_enabled ........... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   fp16_auto_cast ............... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   fp16_enabled ................. True
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   fp16_master_weights_and_gradients  False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   global_rank .................. 0
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   grad_accum_dtype ............. None
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   gradient_accumulation_steps .. 1
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   gradient_clipping ............ 1.0
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   gradient_predivide_factor .... 1.0
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   initial_dynamic_scale ........ 65536
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   load_universal_checkpoint .... False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   loss_scale ................... 0
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   memory_breakdown ............. False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2023-04-21 23:24:55,344] [INFO] [config.py:957:print]   optimizer_legacy_fusion ...... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   optimizer_name ............... None
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   optimizer_params ............. None
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   pld_enabled .................. False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   pld_params ................... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   prescale_gradients ........... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   scheduler_name ............... None
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   scheduler_params ............. None
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   sparse_attention ............. None
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   sparse_gradients_enabled ..... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   steps_per_print .............. 10
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   train_batch_size ............. 32
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   train_micro_batch_size_per_gpu  4
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   use_node_local_storage ....... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   wall_clock_breakdown ......... False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   world_size ................... 8
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   zero_allow_untested_optimizer  False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   zero_enabled ................. False
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   zero_force_ds_cpu_optimizer .. True
[2023-04-21 23:24:55,345] [INFO] [config.py:957:print]   zero_optimization_stage ...... 0
[2023-04-21 23:24:55,345] [INFO] [config.py:943:print_user_config]   json = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 0,
        "offload_param": {
            "device": "none"
        },
        "offload_optimizer": {
            "device": "none"
        },
        "stage3_param_persistence_threshold": 1.000000e+04,
        "stage3_max_live_parameters": 3.000000e+07,
        "stage3_prefetch_bucket_size": 3.000000e+07,
        "memory_efficient_linear": false
    },
    "fp16": {
        "enabled": true,
        "loss_scale_window": 100
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false,
    "hybrid_engine": {
        "enabled": false,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": false,
        "pin_parameters": true,
        "tp_gather_partition_size": 8
    }
}
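The json = {...} dump above is the user-supplied DeepSpeed config; everything else in the engine configuration is a derived default. Note the batch-size identity the engine enforces: train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size = 4 * 1 * 8 = 32. A minimal sketch of feeding such a dict to deepspeed.initialize, with a stand-in linear model and AdamW rather than the script's actual OPT-350m reward model and fused optimizer; it is meant to be started under the deepspeed launcher as above:

    import torch
    import deepspeed

    ds_config = {
        "train_batch_size": 32,                # 4 (micro) * 1 (grad accum) * 8 (ranks)
        "train_micro_batch_size_per_gpu": 4,
        "steps_per_print": 10,
        "zero_optimization": {"stage": 0},
        "fp16": {"enabled": True, "loss_scale_window": 100},
        "gradient_clipping": 1.0,
    }

    model = torch.nn.Linear(512, 1)            # placeholder for the reward model
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config
    )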
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Emitting ninja build file /home/AdamG012/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.3323311805725098 seconds
Loading extension module utils...
Time to load utils op: 1.4053616523742676 seconds
Loading extension module utils...
Time to load utils op: 1.4050300121307373 seconds
Loading extension module utils...
Time to load utils op: 1.4046807289123535 seconds
Loading extension module utils...
Time to load utils op: 1.4048070907592773 seconds
Loading extension module utils...
Time to load utils op: 1.4046287536621094 seconds
Loading extension module utils...
Time to load utils op: 1.4051079750061035 seconds
Loading extension module utils...
Time to load utils op: 1.4054672718048096 seconds
***** Running training *****
***** Evaluating reward, Epoch 0/1 *****
chosen_last_scores (higher is better) : 2.8115272521972656, acc (higher is better) : 0.4898989498615265
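Before any training, the script evaluates the untrained reward head: chosen_last_scores is presumably the mean score assigned to the preferred responses, and acc the fraction of pairs where the chosen response outscores the rejected one, so 0.49 is chance level for a pairwise comparison. A sketch of such a metric (tensor names are illustrative, not the script's):

    import torch

    def pairwise_reward_eval(chosen_scores, rejected_scores):
        # mean score of the chosen responses, and how often chosen > rejected
        acc = (chosen_scores > rejected_scores).float().mean().item()
        return chosen_scores.mean().item(), acc

    # an untrained scorer gets roughly 50% of pairs right
    c, r = torch.randn(1000), torch.randn(1000)
    print(pairwise_reward_eval(c, r))  # (~0.0, ~0.5)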
Beginning of Epoch 1/1, Total Micro Batches 3680
[2023-04-21 23:25:02,440] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 0
[2023-04-21 23:25:02,440] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0
[2023-04-21 23:25:02,440] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[2023-04-21 23:25:02,572] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1
[2023-04-21 23:25:02,572] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-21 23:25:02,573] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
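The _update_scale chatter that dominates the rest of the log is fp16 dynamic loss scaling, configured by dynamic_loss_scale_args above ({'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}): on every gradient overflow the optimizer step is skipped and the scale is halved; after 100 consecutive overflow-free steps it is doubled again. A simplified sketch of that rule (it ignores delayed_shift and is not DeepSpeed's actual implementation):

    class DynamicLossScaler:
        """Halve the loss scale on overflow, double it after a clean window."""

        def __init__(self, init_scale=65536, scale_window=100, min_scale=1):
            self.scale = float(init_scale)
            self.scale_window = scale_window
            self.min_scale = min_scale
            self.clean_steps = 0

        def update(self, overflow):
            if overflow:  # the caller skips the optimizer step
                self.scale = max(self.scale / 2, self.min_scale)
                self.clean_steps = 0
            else:
                self.clean_steps += 1
                if self.clean_steps % self.scale_window == 0:
                    self.scale *= 2

    scaler = DynamicLossScaler()
    scaler.update(overflow=True)
    print(scaler.scale)  # 32768.0, mirroring iteration 0 above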
[2023-04-21 23:25:02,709] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2
[2023-04-21 23:25:02,709] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-21 23:25:02,710] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-21 23:25:02,841] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3
[2023-04-21 23:25:02,841] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-04-21 23:25:02,841] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-21 23:25:02,972] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4
[2023-04-21 23:25:02,972] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[2023-04-21 23:25:02,973] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-04-21 23:25:03,104] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5
[2023-04-21 23:25:03,104] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
[2023-04-21 23:25:03,104] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
[2023-04-21 23:25:03,234] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6
[2023-04-21 23:25:03,234] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0
[2023-04-21 23:25:03,235] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
[2023-04-21 23:25:03,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=7, lr=[4.999991801084829e-05, 4.999991801084829e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:03,749] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=218.34032515722657, CurrSamplesPerSec=191.61123190319628, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:05,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=7, lr=[4.999846044088921e-05, 4.999846044088921e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:05,446] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=201.03654652250466, CurrSamplesPerSec=190.88200584516707, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:07,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=7, lr=[4.9995181012051625e-05, 4.9995181012051625e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:07,126] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=197.33323335652648, CurrSamplesPerSec=192.0304003204853, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:08,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=7, lr=[4.9990079963336504e-05, 4.9990079963336504e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:08,806] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=195.58291871979364, CurrSamplesPerSec=192.29837556557197, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
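The lr values in these step logs are consistent with a zero-warmup cosine decay (--lr_scheduler_type cosine, --num_warmup_steps 0) over the epoch's 3680 optimizer updates, advanced only on non-skipped steps: at step=10 with skipped=7 the scheduler has taken 3 steps, and 5e-5 * 0.5 * (1 + cos(pi * 3/3680)) reproduces the logged 4.999991801084829e-05. A sketch of that schedule shape (assuming the usual Hugging Face cosine formula):

    import math

    def cosine_lr(scheduler_step, total_steps=3680, base_lr=5e-5):
        # no warmup: pure half-cosine from base_lr down toward 0
        progress = scheduler_step / total_steps
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

    print(cosine_lr(3))   # ~4.9999918e-05 -> step=10, skipped=7 above
    print(cosine_lr(42))  # ~4.9983932e-05 -> step=50, skipped=8 below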
[2023-04-21 23:25:09,946] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 46
[2023-04-21 23:25:09,946] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0
[2023-04-21 23:25:09,947] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 512.0, reducing to 256.0
[2023-04-21 23:25:10,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=8, lr=[4.998393183901334e-05, 4.998393183901334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:10,452] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=195.44771409984958, CurrSamplesPerSec=189.51256802207206, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:12,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=8, lr=[4.99753708464281e-05, 4.99753708464281e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:12,130] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=194.69590084155664, CurrSamplesPerSec=190.1863324817172, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:13,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=8, lr=[4.9964989677707283e-05, 4.9964989677707283e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:13,819] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=193.97865900228828, CurrSamplesPerSec=190.62770724474245, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:15,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=8, lr=[4.995278908941845e-05, 4.995278908941845e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:15,495] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=193.63784034024235, CurrSamplesPerSec=192.27358263520406, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:17,161] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=8, lr=[4.9938769970726374e-05, 4.9938769970726374e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:17,170] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=193.38975661858098, CurrSamplesPerSec=191.84814266356395, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:18,482] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 97
[2023-04-21 23:25:18,482] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-21 23:25:18,483] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
[2023-04-21 23:25:18,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=9, lr=[4.9924598762104146e-05, 4.9924598762104146e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:18,818] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=193.52008364890807, CurrSamplesPerSec=191.58688930823715, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:20,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=9, lr=[4.990712735989973e-05, 4.990712735989973e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:20,499] [INFO] [timer.py:199:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=193.25587504000134, CurrSamplesPerSec=191.52182375470178, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:22,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=9, lr=[4.9887840755066084e-05, 4.9887840755066084e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:22,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=193.07973674081282, CurrSamplesPerSec=190.95777573534642, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:23,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=9, lr=[4.986674035318866e-05, 4.986674035318866e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:23,951] [INFO] [timer.py:199:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=192.06261315695934, CurrSamplesPerSec=190.5727616337681, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:25,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=9, lr=[4.984382769204035e-05, 4.984382769204035e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:25,623] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=192.04641452203194, CurrSamplesPerSec=191.69141473001073, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:27,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=9, lr=[4.981910444146938e-05, 4.981910444146938e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:27,300] [INFO] [timer.py:199:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=191.9947643161298, CurrSamplesPerSec=192.15741655427945, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:28,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=9, lr=[4.9792572403277656e-05, 4.9792572403277656e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:28,976] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=191.95524172864023, CurrSamplesPerSec=189.4991895814626, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:30,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=9, lr=[4.976423351108943e-05, 4.976423351108943e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:30,653] [INFO] [timer.py:199:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=191.91357063607708, CurrSamplesPerSec=190.64151629471388, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:32,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=9, lr=[4.9734089830210384e-05, 4.9734089830210384e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:32,347] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=191.77048093528185, CurrSamplesPerSec=184.11867638626345, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:34,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=9, lr=[4.97021435574771e-05, 4.97021435574771e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:34,057] [INFO] [timer.py:199:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=191.5426906965214, CurrSamplesPerSec=175.24870800527503, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:35,584] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:25:35,584] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-21 23:25:35,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=9, lr=[4.966839702109699e-05, 4.966839702109699e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:35,772] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=191.30726804505613, CurrSamplesPerSec=191.3514413618089, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:36,908] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 206
[2023-04-21 23:25:36,908] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-21 23:25:36,908] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
[2023-04-21 23:25:37,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=10, lr=[4.963648794292992e-05, 4.963648794292992e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:37,417] [INFO] [timer.py:199:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=191.48272078801423, CurrSamplesPerSec=187.70581349653096, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:39,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=10, lr=[4.95993277895848e-05, 4.95993277895848e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:39,092] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=191.48465450487376, CurrSamplesPerSec=192.45609798765116, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:39,560] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 222
[2023-04-21 23:25:39,560] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-21 23:25:39,560] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-21 23:25:40,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=11, lr=[4.956435075286774e-05, 4.956435075286774e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:40,741] [INFO] [timer.py:199:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=191.6141605811198, CurrSamplesPerSec=191.81085546059964, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:42,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=11, lr=[4.9523786758964875e-05, 4.9523786758964875e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:42,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=191.5883658686034, CurrSamplesPerSec=191.7171531334901, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:44,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=11, lr=[4.948143549985232e-05, 4.948143549985232e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:44,100] [INFO] [timer.py:199:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=191.56432124678173, CurrSamplesPerSec=192.04276464991165, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:45,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=11, lr=[4.943730006204088e-05, 4.943730006204088e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:45,778] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=191.5484755353007, CurrSamplesPerSec=191.5319360864434, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:47,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=11, lr=[4.939138366207017e-05, 4.939138366207017e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:47,451] [INFO] [timer.py:199:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=191.55477434537968, CurrSamplesPerSec=192.50109074324004, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:49,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=11, lr=[4.9343689646274324e-05, 4.9343689646274324e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:49,127] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=191.5505777820709, CurrSamplesPerSec=191.79550611892608, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:50,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=11, lr=[4.929422149053803e-05, 4.929422149053803e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:50,801] [INFO] [timer.py:199:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=191.55092616507162, CurrSamplesPerSec=192.66052875602523, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:52,469] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=11, lr=[4.924298280004326e-05, 4.924298280004326e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:52,478] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=191.5432612211734, CurrSamplesPerSec=192.4331956465948, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:54,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=11, lr=[4.9189977309006495e-05, 4.9189977309006495e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:54,151] [INFO] [timer.py:199:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=191.54822933800378, CurrSamplesPerSec=192.30278386703918, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:55,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=11, lr=[4.913520888040661e-05, 4.913520888040661e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:55,878] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=191.36017326148024, CurrSamplesPerSec=192.09306266387296, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:56,527] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:25:56,527] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-21 23:25:57,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=11, lr=[4.907868150570334e-05, 4.907868150570334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:57,553] [INFO] [timer.py:199:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=191.36601346149828, CurrSamplesPerSec=192.4229879256904, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:25:59,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=11, lr=[4.902039930454633e-05, 4.902039930454633e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:25:59,223] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=191.3845865493614, CurrSamplesPerSec=192.0944372963202, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:00,863] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 349
[2023-04-21 23:26:00,863] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-21 23:26:00,863] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
Attempted loss scale: 128.0, reducing to 64.0 [2023-04-21 23:26:00,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=12, lr=[4.8966448454840854e-05, 4.8966448454840854e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:00,864] [INFO] [timer.py:199:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=191.50030130040656, CurrSamplesPerSec=245.72100618798285, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:02,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=12, lr=[4.890484389084437e-05, 4.890484389084437e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:02,542] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=191.49158245308513, CurrSamplesPerSec=191.74481593071812, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:04,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=12, lr=[4.884149716947845e-05, 4.884149716947845e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:04,218] [INFO] [timer.py:199:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=191.48903583411033, CurrSamplesPerSec=192.80887435750836, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:05,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=12, lr=[4.877641290737884e-05, 4.877641290737884e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:05,898] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=191.47424086673405, CurrSamplesPerSec=187.78486223000746, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:07,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=12, lr=[4.8709595847811294e-05, 4.8709595847811294e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:07,589] [INFO] [timer.py:199:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=191.42781597274592, CurrSamplesPerSec=191.74563772004262, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:09,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=12, lr=[4.864105086032581e-05, 4.864105086032581e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:26:09,271] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=191.40871001418634, CurrSamplesPerSec=189.85489476610053, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 64.0 to 32.0 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 64.0 to 32.0 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406 [2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss 
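The fused_optimizer records above trace DeepSpeed's dynamic loss scaling for fp16 training: a gradient overflow halves the scale and the optimizer step is skipped, while a run of 100 clean iterations doubles it. A minimal Python sketch of that policy as it appears in this log (the class and names here are hypothetical, not DeepSpeed's actual implementation):

    class DynamicLossScaler:
        """Sketch of the halve-on-overflow / double-after-a-clean-window policy."""

        def __init__(self, init_scale=64.0, factor=2.0, window=100, min_scale=1.0):
            self.scale = init_scale
            self.factor = factor        # 2x up or down, as in the log
            self.window = window        # "No Grad overflow for 100 iterations"
            self.min_scale = min_scale
            self.clean_steps = 0

        def update(self, found_overflow: bool) -> bool:
            """Return True if the optimizer step should be skipped."""
            if found_overflow:
                # "Reducing dynamic loss scale" / "Overflow detected. Skipping step."
                self.scale = max(self.scale / self.factor, self.min_scale)
                self.clean_steps = 0
                return True
            self.clean_steps += 1
            if self.clean_steps % self.window == 0:
                # "Increasing dynamic loss scale"
                self.scale *= self.factor
            return False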
[2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 406
[2023-04-21 23:26:10,404] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 64.0 to 32.0
[2023-04-21 23:26:10,404] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 64.0, reducing to 32.0
[2023-04-21 23:26:10,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=13, lr=[4.857788711938659e-05, 4.857788711938659e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:10,906] [INFO] [timer.py:199:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=191.5224989486708, CurrSamplesPerSec=192.36975979888405, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:11,202] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 411
[2023-04-21 23:26:11,202] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2023-04-21 23:26:11,202] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32.0, reducing to 16.0
[2023-04-21 23:26:11,503] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 413
[2023-04-21 23:26:11,503] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16.0 to 8.0
[2023-04-21 23:26:11,503] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16.0, reducing to 8.0
[2023-04-21 23:26:11,633] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 414
[2023-04-21 23:26:11,633] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8.0 to 4.0
[2023-04-21 23:26:11,634] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8.0, reducing to 4.0
[2023-04-21 23:26:12,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=16, lr=[4.852779730601908e-05, 4.852779730601908e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:12,469] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=191.82776107064345, CurrSamplesPerSec=192.6483612698759, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:14,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=16, lr=[4.8454783398062106e-05, 4.8454783398062106e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:14,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=191.8074258530824, CurrSamplesPerSec=188.90628698984798, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:15,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=16, lr=[4.8380060132623776e-05, 4.8380060132623776e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:15,830] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=191.7863213436846, CurrSamplesPerSec=190.3816463425878, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:17,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=16, lr=[4.830363295544922e-05, 4.830363295544922e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:17,505] [INFO] [timer.py:199:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=191.77911755986443, CurrSamplesPerSec=192.13486028401283, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
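The skipped counter tracks exactly these events: it stood at 12 after the overflow at iteration 349, reached 13 at iteration 406, and the three back-to-back overflows at iterations 411, 413 and 414 bring it to 16 by step 420. A skipped step consumes a micro step but applies no parameter update, so by step 420 only 420 - 16 = 404 optimizer updates have actually been applied. The rapid 32 -> 16 -> 8 -> 4 cascade is the scaler searching downward for a scale at which the fp16 gradients stop overflowing.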
[2023-04-21 23:26:19,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=16, lr=[4.8225507436462695e-05, 4.8225507436462695e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:19,189] [INFO] [timer.py:199:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=191.74960980108247, CurrSamplesPerSec=190.91404455305414, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:20,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=16, lr=[4.814568926936166e-05, 4.814568926936166e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:20,862] [INFO] [timer.py:199:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=191.74861647272414, CurrSamplesPerSec=192.1904352781811, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:22,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=16, lr=[4.806418427120179e-05, 4.806418427120179e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:22,539] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=191.73910199622932, CurrSamplesPerSec=189.82750607806227, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:24,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=16, lr=[4.798099838197308e-05, 4.798099838197308e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:24,209] [INFO] [timer.py:199:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=191.74655269868464, CurrSamplesPerSec=191.57184922346573, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=16, lr=[4.789613766416689e-05, 4.789613766416689e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:25,894] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=191.71866206748285, CurrSamplesPerSec=191.70702123496147, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:27,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=16, lr=[4.780960830233417e-05, 4.780960830233417e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:27,684] [INFO] [timer.py:199:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=191.4539498599303, CurrSamplesPerSec=191.7119501329094, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:28,666] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:26:28,666] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4.0 to 8.0
[2023-04-21 23:26:29,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=16, lr=[4.772141660263471e-05, 4.772141660263471e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:29,354] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=191.46542565215702, CurrSamplesPerSec=191.2775395437593, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:31,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=16, lr=[4.7631568992377586e-05, 4.7631568992377586e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:31,035] [INFO] [timer.py:199:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=191.45322827194565, CurrSamplesPerSec=191.7530341409613, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:32,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=16, lr=[4.7540072019552664e-05, 4.7540072019552664e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:32,707] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=191.46148591276594, CurrSamplesPerSec=191.83031976548912, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:34,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=16, lr=[4.74469323523535e-05, 4.74469323523535e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:34,383] [INFO] [timer.py:199:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=191.4598125775483, CurrSamplesPerSec=192.31986794442662, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:36,062] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=16, lr=[4.735215677869128e-05, 4.735215677869128e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:36,071] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=191.43306758921875, CurrSamplesPerSec=188.79893318779074, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:37,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=16, lr=[4.7255752205700194e-05, 4.7255752205700194e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:37,766] [INFO] [timer.py:199:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=191.39502096382836, CurrSamplesPerSec=191.12500800995656, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:39,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=16, lr=[4.7157725659233985e-05, 4.7157725659233985e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:39,501] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=191.2774235258391, CurrSamplesPerSec=192.51793393290015, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:41,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=16, lr=[4.705808428335397e-05, 4.705808428335397e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
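The throughput numbers are easy to sanity-check against the timestamps. Assuming each micro step processes 32 samples globally (e.g. 4 per device across the 8 ranks whose duplicated fused_optimizer messages show up throughout this log; the exact batch geometry is an assumption here), ~192 samples/sec implies ~0.167 s per micro step, i.e. ~1.67 s per 10 steps, which matches the spacing of the timer lines. A quick check in Python, using the timestamps of the step=460 and step=470 records above:

    # Hypothetical sanity check: 4 samples/GPU x 8 GPUs per micro step (assumed).
    samples_per_micro_step = 4 * 8
    seconds_per_10_steps = 20.854 - 19.180   # step=470 vs step=460 timestamps
    print(samples_per_micro_step * 10 / seconds_per_10_steps)  # ~191 samples/sec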
[2023-04-21 23:26:41,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=191.27386421628603, CurrSamplesPerSec=192.21906068565306, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:42,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=16, lr=[4.695683533980835e-05, 4.695683533980835e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:42,857] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=191.27508986584593, CurrSamplesPerSec=191.6596619695784, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:44,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=16, lr=[4.685398620750301e-05, 4.685398620750301e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:44,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=191.2860943915095, CurrSamplesPerSec=191.5838811023611, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:45,512] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:26:45,513] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8.0 to 16.0
[2023-04-21 23:26:46,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=16, lr=[4.674954438196374e-05, 4.674954438196374e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:46,201] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=191.2920031756483, CurrSamplesPerSec=192.0933375887885, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:47,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=16, lr=[4.6643517474789954e-05, 4.6643517474789954e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:47,884] [INFO] [timer.py:199:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=191.28132773594317, CurrSamplesPerSec=192.0413907569037, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:49,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=16, lr=[4.65359132131e-05, 4.65359132131e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:49,558] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=191.28623136480218, CurrSamplesPerSec=192.01858700426192, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:51,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=16, lr=[4.6426739438967995e-05, 4.6426739438967995e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:51,231] [INFO] [timer.py:199:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=191.29385247132885, CurrSamplesPerSec=192.24136138177786, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:52,894] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=16, lr=[4.6316004108852305e-05, 4.6316004108852305e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:52,903] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=191.30120314698254, CurrSamplesPerSec=191.8582894847813, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:54,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=16, lr=[4.6203715293015694e-05, 4.6203715293015694e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:54,580] [INFO] [timer.py:199:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=191.3010447829297, CurrSamplesPerSec=191.9373068546317, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:56,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=16, lr=[4.608988117493714e-05, 4.608988117493714e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:56,252] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=191.30888829435722, CurrSamplesPerSec=191.29062500445383, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:57,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=16, lr=[4.5974510050715514e-05, 4.5974510050715514e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:57,930] [INFO] [timer.py:199:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=191.30686742925548, CurrSamplesPerSec=192.28515187374467, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:26:59,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=16, lr=[4.5857610328464876e-05, 4.5857610328464876e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:26:59,677] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=191.19139182990045, CurrSamplesPerSec=192.39044080546765, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:01,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=16, lr=[4.573919052770174e-05, 4.573919052770174e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:01,364] [INFO] [timer.py:199:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=191.17687558327103, CurrSamplesPerSec=192.36727837687542, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:02,346] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:27:02,346] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16.0 to 32.0
[2023-04-21 23:27:03,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=16, lr=[4.5619259278724214e-05, 4.5619259278724214e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:03,041] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=191.17734842453225, CurrSamplesPerSec=192.39761125907748, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:04,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=16, lr=[4.5497825321982985e-05, 4.5497825321982985e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:04,718] [INFO] [timer.py:199:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=191.17865962739438, CurrSamplesPerSec=192.01474112192184, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:06,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=16, lr=[4.537489750744434e-05, 4.537489750744434e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:06,387] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=191.1918006576372, CurrSamplesPerSec=192.23530389074685, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:08,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=16, lr=[4.525048479394518e-05, 4.525048479394518e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:08,057] [INFO] [timer.py:199:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=191.20407386846696, CurrSamplesPerSec=192.55328660252897, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:08,855] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 754
[2023-04-21 23:27:08,855] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32.0 to 16.0
[2023-04-21 23:27:08,855] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32.0, reducing to 16.0
[2023-04-21 23:27:09,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=17, lr=[4.5137251254879964e-05, 4.5137251254879964e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:09,698] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=191.25799992257342, CurrSamplesPerSec=186.3379407240397, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:11,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=17, lr=[4.501004230200098e-05, 4.501004230200098e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:11,367] [INFO] [timer.py:199:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=191.26967933104623, CurrSamplesPerSec=192.18988487313135, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:13,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=17, lr=[4.48813750403868e-05, 4.48813750403868e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:13,037] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=191.28089238230413, CurrSamplesPerSec=191.96146403486316, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:14,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=17, lr=[4.475125884715861e-05, 4.475125884715861e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:14,720] [INFO] [timer.py:199:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=191.2730978686161, CurrSamplesPerSec=188.37708054091607, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:16,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=17, lr=[4.461970320503406e-05, 4.461970320503406e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:16,399] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=191.2703085314015, CurrSamplesPerSec=191.7864621214611, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:18,060] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=17, lr=[4.448671770163615e-05, 4.448671770163615e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:18,069] [INFO] [timer.py:199:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=191.2801552106342, CurrSamplesPerSec=191.01484652518445, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:19,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=17, lr=[4.4352312028794545e-05, 4.4352312028794545e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:19,742] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=191.2856664226515, CurrSamplesPerSec=192.62264868518682, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:21,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=17, lr=[4.421649598183919e-05, 4.421649598183919e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:21,418] [INFO] [timer.py:199:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=191.28659487106592, CurrSamplesPerSec=192.23447789881953, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:23,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=17, lr=[4.4079279458886475e-05, 4.4079279458886475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:23,093] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=191.28870033006922, CurrSamplesPerSec=191.82620725927706, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:24,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=17, lr=[4.394067246011786e-05, 4.394067246011786e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:24,763] [INFO] [timer.py:199:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=191.29747288849148, CurrSamplesPerSec=192.24053533779448, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:25,745] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:27:25,746] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16.0 to 32.0
[2023-04-21 23:27:26,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=17, lr=[4.3800685087051075e-05, 4.3800685087051075e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:26,434] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=191.30602298464603, CurrSamplesPerSec=192.67159144088404, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:28,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=17, lr=[4.365932754180393e-05, 4.365932754180393e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:28,115] [INFO] [timer.py:199:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=191.30018489178502, CurrSamplesPerSec=192.049634409833, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:29,779] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=17, lr=[4.35166101263508e-05, 4.35166101263508e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:29,788] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=191.30505791191914, CurrSamplesPerSec=191.56720094771848, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:31,451] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=17, lr=[4.337254324177182e-05, 4.337254324177182e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:31,460] [INFO] [timer.py:199:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=191.31063825586517, CurrSamplesPerSec=192.19483863207427, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:33,195] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=17, lr=[4.32271373874949e-05, 4.32271373874949e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:33,204] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=191.22468921497165, CurrSamplesPerSec=192.86899717057454, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:34,868] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=17, lr=[4.308040316053047e-05, 4.308040316053047e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:34,877] [INFO] [timer.py:199:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=191.23014675832243, CurrSamplesPerSec=191.83196481734777, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:36,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=17, lr=[4.293235125469925e-05, 4.293235125469925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:36,557] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=191.22762015770252, CurrSamplesPerSec=190.1814817176419, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:38,227] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=17, lr=[4.2782992459852884e-05, 4.2782992459852884e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:38,236] [INFO] [timer.py:199:stop] epoch=0/micro_step=930/global_step=930, RunningAvgSamplesPerSec=191.22570911467523, CurrSamplesPerSec=192.1868577017087, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:39,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=17, lr=[4.2632337661087555e-05, 4.2632337661087555e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:39,904] [INFO] [timer.py:199:stop] epoch=0/micro_step=940/global_step=940, RunningAvgSamplesPerSec=191.23580381980034, CurrSamplesPerSec=192.92305657849724, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:41,584] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=17, lr=[4.24803978379507e-05, 4.24803978379507e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:41,593] [INFO] [timer.py:199:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=191.22244519630428, CurrSamplesPerSec=191.08337651835697, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:42,596] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:27:42,596] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 32.0 to 64.0
[2023-04-21 23:27:43,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=17, lr=[4.23271840636409e-05, 4.23271840636409e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:43,334] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=960, RunningAvgSamplesPerSec=191.16227740563596, CurrSamplesPerSec=174.03931576053694, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:44,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=17, lr=[4.217270750420076e-05, 4.217270750420076e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:45,005] [INFO] [timer.py:199:stop] epoch=0/micro_step=970/global_step=970, RunningAvgSamplesPerSec=191.17001084532748, CurrSamplesPerSec=191.90327750953313, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:46,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=17, lr=[4.201697941770324e-05, 4.201697941770324e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:46,685] [INFO] [timer.py:199:stop] epoch=0/micro_step=980/global_step=980, RunningAvgSamplesPerSec=191.16785200694096, CurrSamplesPerSec=184.56453336230237, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:48,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=17, lr=[4.1860011153431134e-05, 4.1860011153431134e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:48,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=990/global_step=990, RunningAvgSamplesPerSec=191.17313303877373, CurrSamplesPerSec=188.5194302464896, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:50,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=17, lr=[4.170181415104997e-05, 4.170181415104997e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:50,034] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=191.17668421107956, CurrSamplesPerSec=192.2639426665005, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:51,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=17, lr=[4.154239993977427e-05, 4.154239993977427e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:51,703] [INFO] [timer.py:199:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=191.18606193712478, CurrSamplesPerSec=192.17419965436122, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:53,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=17, lr=[4.1381780137527335e-05, 4.1381780137527335e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:53,388] [INFO] [timer.py:199:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=191.17789236279916, CurrSamplesPerSec=186.58765959339829, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:55,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=17, lr=[4.1219966450094554e-05, 4.1219966450094554e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:55,066] [INFO] [timer.py:199:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=191.17780452571407, CurrSamplesPerSec=192.6519560490035, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:56,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=17, lr=[4.105697067027028e-05, 4.105697067027028e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:56,739] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=191.1837138637957, CurrSamplesPerSec=192.09388744098044, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:27:58,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=17, lr=[4.0892804676998395e-05, 4.0892804676998395e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:27:58,410] [INFO] [timer.py:199:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=191.19038031378687, CurrSamplesPerSec=192.12193407738542, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
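The lr values drift smoothly downward from ~4.92e-05 toward ~4.09e-05 across this excerpt, consistent with a cosine decay schedule. A sketch of the usual no-warmup cosine shape, assuming the scheduler only advances on non-skipped steps and with the total step count chosen purely for illustration (it is not visible in this excerpt):

    import math

    def cosine_lr(step: int, total_steps: int, lr_max: float = 5e-05) -> float:
        # Standard cosine decay with no warmup (a sketch, not the exact scheduler).
        progress = step / total_steps
        return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))

    # step=1000 with skipped=17 means ~983 applied updates; with a hypothetical
    # total_steps=3700 this gives ~4.18e-05, close to the lr logged at step=1000.
    print(cosine_lr(983, 3700))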
[2023-04-21 23:27:59,401] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:27:59,401] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-21 23:28:00,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=17, lr=[4.072748043450657e-05, 4.072748043450657e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:00,090] [INFO] [timer.py:199:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=191.18773355331513, CurrSamplesPerSec=191.60029064580183, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:01,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=17, lr=[4.056100999143435e-05, 4.056100999143435e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:01,762] [INFO] [timer.py:199:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=191.19435178582222, CurrSamplesPerSec=190.59711784787189, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:03,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=17, lr=[4.039340547995506e-05, 4.039340547995506e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:03,440] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=191.19340561767697, CurrSamplesPerSec=192.6475317245131, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:05,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=17, lr=[4.022467911489161e-05, 4.022467911489161e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:05,191] [INFO] [timer.py:199:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=191.11643733863215, CurrSamplesPerSec=192.40092087948005, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:06,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=17, lr=[4.005484319282629e-05, 4.005484319282629e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:06,863] [INFO] [timer.py:199:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=191.12368737123901, CurrSamplesPerSec=191.9633858898373, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:08,526] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=17, lr=[3.9883910091204645e-05, 3.9883910091204645e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:08,535] [INFO] [timer.py:199:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=191.12962600787222, CurrSamplesPerSec=191.84320676022196, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:10,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=17, lr=[3.9711892267433373e-05, 3.9711892267433373e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:10,217] [INFO] [timer.py:199:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=191.12588243313644, CurrSamplesPerSec=189.86322034521683, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:11,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=17, lr=[3.95388022579725e-05, 3.95388022579725e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:11,892] [INFO] [timer.py:199:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=191.12919305237688, CurrSamplesPerSec=192.03589538144567, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:13,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=17, lr=[3.936465267742166e-05, 3.936465267742166e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:13,568] [INFO] [timer.py:199:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=191.13068600224324, CurrSamplesPerSec=190.15345992499707, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:15,240] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=17, lr=[3.918945621760084e-05, 3.918945621760084e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:15,249] [INFO] [timer.py:199:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=191.12826616849077, CurrSamplesPerSec=192.31545885967125, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:16,237] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:28:16,237] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-21 23:28:16,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=17, lr=[3.9013225646625355e-05, 3.9013225646625355e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:16,927] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=191.1292040278395, CurrSamplesPerSec=191.93428763460082, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
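Throughout this run MemAllocated holds steady at 4.32GB while MaxMemAllocated stays pinned at 12.79GB: the peak was set early and the peak counter is monotone unless reset. PyTorch exposes allocator counters of the same kind, so a minimal probe one could drop into a training loop looks like this (a sketch; the helper name is made up):

    import torch

    def log_mem(tag: str) -> None:
        # Current vs. peak bytes held by the caching allocator on this device.
        gb = 1024 ** 3
        alloc = torch.cuda.memory_allocated() / gb
        peak = torch.cuda.max_memory_allocated() / gb
        print(f"{tag}: MemAllocated={alloc:.2f}GB, MaxMemAllocated={peak:.2f}GB")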
[2023-04-21 23:28:18,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=17, lr=[3.8835973807975355e-05, 3.8835973807975355e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:18,600] [INFO] [timer.py:199:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=191.13410578171238, CurrSamplesPerSec=192.71004415090275, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:20,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=17, lr=[3.86577136195598e-05, 3.86577136195598e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:20,285] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=191.12721373713495, CurrSamplesPerSec=192.11780905840354, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:21,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=17, lr=[3.847845807277502e-05, 3.847845807277502e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:21,959] [INFO] [timer.py:199:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=191.1311968843421, CurrSamplesPerSec=191.896143854487, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:23,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=17, lr=[3.8298220231557856e-05, 3.8298220231557856e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:23,635] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=191.1333983287947, CurrSamplesPerSec=191.09643542590936, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:25,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=17, lr=[3.811701323143372e-05, 3.811701323143372e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:25,313] [INFO] [timer.py:199:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=191.13312518216722, CurrSamplesPerSec=193.0012884226026, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:26,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=17, lr=[3.793485027855914e-05, 3.793485027855914e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:26,997] [INFO] [timer.py:199:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=191.12857342416862, CurrSamplesPerSec=190.7780898556987, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:28,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=17, lr=[3.77517446487594e-05, 3.77517446487594e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:28,664] [INFO] [timer.py:199:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=191.13810696584642, CurrSamplesPerSec=192.21851011663344, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:30,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=17, lr=[3.7567709686560984e-05, 3.7567709686560984e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:30,333] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=191.14711162821223, CurrSamplesPerSec=192.94246991242545, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:31,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=17, lr=[3.7382758804219026e-05, 3.7382758804219026e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:32,004] [INFO] [timer.py:199:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=191.15374162136575, CurrSamplesPerSec=191.1337175439571, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:32,985] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:28:32,985] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 256.0 to 512.0
[2023-04-21 23:28:33,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=17, lr=[3.71969054807399e-05, 3.71969054807399e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:33,673] [INFO] [timer.py:199:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=191.1614515851217, CurrSamplesPerSec=192.7742584126519, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:35,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=17, lr=[3.701016326089881e-05, 3.701016326089881e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:35,344] [INFO] [timer.py:199:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=191.16861837681031, CurrSamplesPerSec=192.3052636254354, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:37,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=17, lr=[3.682254575425273e-05, 3.682254575425273e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:37,101] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=191.0978352174798, CurrSamplesPerSec=150.24933169148102, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:38,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=17, lr=[3.66340666341485e-05, 3.66340666341485e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:38,800] [INFO] [timer.py:199:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=191.07993444667753, CurrSamplesPerSec=180.19043425260384, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,098] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,097] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1297 [2023-04-21 23:28:40,098] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,098] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,098] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,098] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0 [2023-04-21 23:28:40,098] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 512.0, reducing to 256.0 [2023-04-21 23:28:40,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=18, lr=[3.6463710098186516e-05, 3.6463710098186516e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:28:40,432] [INFO] [timer.py:199:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=191.12047532358022, CurrSamplesPerSec=192.2471438884369, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1301 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] 
[INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0 [2023-04-21 23:28:40,729] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1303 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,028] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0 [2023-04-21 23:28:41,029] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
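The _update_scale messages above (collapsed here to one copy per event; each of the eight ranks prints the same line) trace DeepSpeed's dynamic fp16 loss scaling: after 100 consecutive overflow-free iterations the scale doubles, and on any gradient overflow the scale is halved and the optimizer step is skipped, which is why skipped climbs from 17 to 18 and then to 20 between steps 1290 and 1310. A minimal sketch of that policy, with illustrative names rather than DeepSpeed's actual fused_optimizer internals:

import torch

class DynamicLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=100):
        self.cur_scale = init_scale          # the "dynamic loss scale" in the log
        self.scale_factor = scale_factor
        self.scale_window = scale_window     # 100 in this run, per the messages
        self.last_overflow_iter = -1

    def has_overflow(self, params):
        # fp16 grads overflow to inf/nan; any non-finite value trips the skip
        return any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in params)

    def update_scale(self, overflow, iteration):
        if overflow:
            # "Grad overflow on iteration N ... Reducing dynamic loss scale"
            self.cur_scale = max(self.cur_scale / self.scale_factor, 1.0)
            self.last_overflow_iter = iteration
        elif (iteration - self.last_overflow_iter) % self.scale_window == 0:
            # "No Grad overflow for 100 iterations ... Increasing dynamic loss scale"
            self.cur_scale *= self.scale_factor

def fp16_step(model, loss, optimizer, scaler, iteration):
    (loss * scaler.cur_scale).backward()      # scale up so fp16 grads stay representable
    overflow = scaler.has_overflow(model.parameters())
    if not overflow:
        for p in model.parameters():          # unscale at the scale used for backward
            if p.grad is not None:
                p.grad.div_(scaler.cur_scale)
        optimizer.step()
    # on overflow: "Overflow detected. Skipping step." - no update this iteration
    optimizer.zero_grad()
    scaler.update_scale(overflow, iteration)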
[2023-04-21 23:28:42,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=20, lr=[3.631171363786768e-05, 3.631171363786768e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:42,041] [INFO] [timer.py:199:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=191.18147552733356, CurrSamplesPerSec=191.87639188507237, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:43,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=20, lr=[3.612097694785211e-05, 3.612097694785211e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:43,712] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=191.18774838658507, CurrSamplesPerSec=192.62817769396258, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:45,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=20, lr=[3.592942977390141e-05, 3.592942977390141e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:45,382] [INFO] [timer.py:199:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=191.194796344318, CurrSamplesPerSec=191.893674635922, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:47,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=20, lr=[3.573708607575205e-05, 3.573708607575205e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:47,063] [INFO] [timer.py:199:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=191.1920688616683, CurrSamplesPerSec=192.91224227267173, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:48,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=20, lr=[3.554395987119024e-05, 3.554395987119024e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:48,797] [INFO] [timer.py:199:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=191.14457427694353, CurrSamplesPerSec=192.91667876428897, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:50,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=20, lr=[3.535006523503034e-05, 3.535006523503034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:50,466] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=191.15193385074988, CurrSamplesPerSec=192.1464128340286, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:52,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=20, lr=[3.515541629808916e-05, 3.515541629808916e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:52,134] [INFO] [timer.py:199:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=191.16052917291728, CurrSamplesPerSec=191.78838047389192, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:53,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=20, lr=[3.496002724615604e-05, 3.496002724615604e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:53,806] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=191.16621860187706, CurrSamplesPerSec=192.63647180229094, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:55,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=20, lr=[3.4763912318959066e-05, 3.4763912318959066e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:55,476] [INFO] [timer.py:199:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=191.17233408980414, CurrSamplesPerSec=192.1755754483027, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:57,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=20, lr=[3.456708580912725e-05, 3.456708580912725e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:57,146] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=191.1795627578409, CurrSamplesPerSec=192.22291475710358, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:28:57,962] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:28:57,962] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-21 23:28:58,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=20, lr=[3.436956206114894e-05, 3.436956206114894e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:28:58,820] [INFO] [timer.py:199:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=191.1828439630521, CurrSamplesPerSec=192.74851221249511, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:00,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=20, lr=[3.4171355470326414e-05, 3.4171355470326414e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:00,491] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=191.18765713105552, CurrSamplesPerSec=190.3902882570809, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:00,621] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1420
[2023-04-21 23:29:00,621] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-21 23:29:00,621] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-21 23:29:02,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=21, lr=[3.399239764579093e-05, 3.399239764579093e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:02,130] [INFO] [timer.py:199:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=191.21922407302586, CurrSamplesPerSec=192.40919542866317, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:03,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=21, lr=[3.3792933490006415e-05, 3.3792933490006415e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:03,799] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=191.22603748722474, CurrSamplesPerSec=192.55908788974793, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:05,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=21, lr=[3.359282851540088e-05, 3.359282851540088e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:05,473] [INFO] [timer.py:199:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=191.2286409910628, CurrSamplesPerSec=192.486458889061, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:07,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=21, lr=[3.339209730539339e-05, 3.339209730539339e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:07,147] [INFO] [timer.py:199:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=191.23145870256843, CurrSamplesPerSec=192.5300849487754, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:08,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=21, lr=[3.319075448904234e-05, 3.319075448904234e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:08,818] [INFO] [timer.py:199:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=191.2362142897292, CurrSamplesPerSec=192.23612988977243, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:10,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=21, lr=[3.2988814739979255e-05, 3.2988814739979255e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:10,550] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=191.19420399725243, CurrSamplesPerSec=189.8449587403428, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:12,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=21, lr=[3.278629277533945e-05, 3.278629277533945e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:12,223] [INFO] [timer.py:199:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=191.19783082478983, CurrSamplesPerSec=192.08948871158182, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:13,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=21, lr=[3.258320335468942e-05, 3.258320335468942e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:13,894] [INFO] [timer.py:199:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=191.20340421943584, CurrSamplesPerSec=192.18410581043872, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:15,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=21, lr=[3.237956127895121e-05, 3.237956127895121e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:15,563] [INFO] [timer.py:199:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=191.21008628669674, CurrSamplesPerSec=192.57207298981024, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:17,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=21, lr=[3.217538138932373e-05, 3.217538138932373e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:17,232] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=191.21580464983955, CurrSamplesPerSec=191.98233194824886, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:17,545] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:29:17,546] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-21 23:29:18,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=21, lr=[3.197067856620113e-05, 3.197067856620113e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:18,901] [INFO] [timer.py:199:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=191.22266287786042, CurrSamplesPerSec=192.42547078515565, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:20,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=21, lr=[3.176546772808837e-05, 3.176546772808837e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:20,571] [INFO] [timer.py:199:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=191.22850484720533, CurrSamplesPerSec=192.7631839352677, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:22,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=21, lr=[3.155976383051393e-05, 3.155976383051393e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:22,239] [INFO] [timer.py:199:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=191.23519419003136, CurrSamplesPerSec=192.23420256975447, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:23,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=21, lr=[3.135358186493991e-05, 3.135358186493991e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:23,915] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=191.2369890333465, CurrSamplesPerSec=192.10296045663512, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:25,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=21, lr=[3.114693685766945e-05, 3.114693685766945e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:25,589] [INFO] [timer.py:199:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=191.2392248320905, CurrSamplesPerSec=191.30725702768885, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:27,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=21, lr=[3.093984386875162e-05, 3.093984386875162e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:27,269] [INFO] [timer.py:199:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=191.23730792416018, CurrSamplesPerSec=192.8476589777263, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:28,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=21, lr=[3.07323179908839e-05, 3.07323179908839e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:28,940] [INFO] [timer.py:199:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=191.2424052608385, CurrSamplesPerSec=192.7673367151729, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:30,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=21, lr=[3.0524374348312204e-05, 3.0524374348312204e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:30,608] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=191.24920783674597, CurrSamplesPerSec=192.81856904995524, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:32,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=21, lr=[3.0316028095728634e-05, 3.0316028095728634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:32,277] [INFO] [timer.py:199:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=191.2546659887457, CurrSamplesPerSec=192.81552204156336, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
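The lr values in these records follow the cosine decay selected with --lr_scheduler_type cosine. They are consistent with a 5e-5 peak, zero warmup steps, and a horizon of roughly 3,680 scheduler steps, with overflow-skipped steps not advancing the scheduler; the horizon and the skip behavior are inferred by fitting the logged values, not stated anywhere in the log. A quick check of that reading:

import math

PEAK_LR = 5e-5        # from the launch command
TOTAL_STEPS = 3680    # inferred fit, not printed in the log

def cosine_lr(scheduler_step):
    progress = scheduler_step / TOTAL_STEPS
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(1130 - 17))   # ~3.954e-05 vs. logged 3.95388e-05 at step=1130, skipped=17
print(cosine_lr(2110 - 24))   # ~1.978e-05 vs. logged 1.97883e-05 at step=2110, skipped=24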
[2023-04-21 23:29:33,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=21, lr=[3.0107294417167077e-05, 3.0107294417167077e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:33,944] [INFO] [timer.py:199:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=191.26177547805264, CurrSamplesPerSec=192.3959564915741, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:34,257] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:29:34,258] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-21 23:29:35,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=21, lr=[2.9898188524896548e-05, 2.9898188524896548e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:35,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=191.26470817367706, CurrSamplesPerSec=190.59738850744466, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:37,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=21, lr=[2.9688725658312588e-05, 2.9688725658312588e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:37,290] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=191.26786257389696, CurrSamplesPerSec=191.51553824882032, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:38,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=21, lr=[2.9478921082826623e-05, 2.9478921082826623e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:38,960] [INFO] [timer.py:199:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=191.27301883924886, CurrSamplesPerSec=192.16842153491075, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:40,621] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=21, lr=[2.926879008875338e-05, 2.926879008875338e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:40,630] [INFO] [timer.py:199:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=191.2778821790062, CurrSamplesPerSec=192.4569258851921, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:42,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=21, lr=[2.9058347990196645e-05, 2.9058347990196645e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:42,305] [INFO] [timer.py:199:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=191.27887213132598, CurrSamplesPerSec=192.84405689154028, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:44,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=21, lr=[2.8847610123933106e-05, 2.8847610123933106e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:44,059] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=191.22696443216302, CurrSamplesPerSec=192.74020845360056, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:45,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=21, lr=[2.8636591848294693e-05, 2.8636591848294693e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:45,730] [INFO] [timer.py:199:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=191.23125802202713, CurrSamplesPerSec=192.3625914210737, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:47,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=21, lr=[2.8425308542049206e-05, 2.8425308542049206e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:47,406] [INFO] [timer.py:199:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=191.23314847054965, CurrSamplesPerSec=192.6110387998272, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:49,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=21, lr=[2.8213775603279595e-05, 2.8213775603279595e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:49,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=191.23053434547006, CurrSamplesPerSec=192.65029690350528, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:50,749] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=21, lr=[2.800200844826175e-05, 2.800200844826175e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:50,758] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=191.23505587759308, CurrSamplesPerSec=191.17972793960544, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:51,073] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:29:51,073] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 256.0 to 512.0
[2023-04-21 23:29:52,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=21, lr=[2.7790022510340935e-05, 2.7790022510340935e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:52,444] [INFO] [timer.py:199:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=191.2296013229954, CurrSamplesPerSec=177.964639834947, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:54,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=21, lr=[2.7577833238807095e-05, 2.7577833238807095e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:54,124] [INFO] [timer.py:199:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=191.22771968916874, CurrSamplesPerSec=192.29837556557197, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:54,587] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1742
[2023-04-21 23:29:54,587] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0
[2023-04-21 23:29:54,587] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 512.0, reducing to 256.0
[2023-04-21 23:29:55,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=22, lr=[2.7386701824985255e-05, 2.7386701824985255e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:55,757] [INFO] [timer.py:199:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=191.25712734227298, CurrSamplesPerSec=192.5182100753189, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:57,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=22, lr=[2.7174168834545473e-05, 2.7174168834545473e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:57,429] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=191.260491463813, CurrSamplesPerSec=192.00540176329804, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:29:58,898] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1768
[2023-04-21 23:29:58,899] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-21 23:29:58,899] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
[2023-04-21 23:29:59,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=23, lr=[2.6982753225937236e-05, 2.6982753225937236e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:29:59,067] [INFO] [timer.py:199:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=191.28579330293087, CurrSamplesPerSec=191.50406143300881, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:00,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=23, lr=[2.6769932431567312e-05, 2.6769932431567312e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:00,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=191.28923455858094, CurrSamplesPerSec=192.89948849582706, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:02,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=23, lr=[2.6556982646569245e-05, 2.6556982646569245e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:02,416] [INFO] [timer.py:199:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=191.28915008666348, CurrSamplesPerSec=192.50578445068516, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:04,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=23, lr=[2.6343919390477027e-05, 2.6343919390477027e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:04,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=191.29250024256234, CurrSamplesPerSec=191.41748314978508, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:05,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=23, lr=[2.613075819109429e-05, 2.613075819109429e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:05,757] [INFO] [timer.py:199:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=191.29726778516718, CurrSamplesPerSec=192.61159161977582, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:07,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=23, lr=[2.5917514583362652e-05, 2.5917514583362652e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:07,430] [INFO] [timer.py:199:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=191.29940963023546, CurrSamplesPerSec=190.72143144384904, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:09,090] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=23, lr=[2.5704204108229575e-05, 2.5704204108229575e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:09,099] [INFO] [timer.py:199:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=191.30460216544765, CurrSamplesPerSec=192.1827298943564, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:10,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=23, lr=[2.5490842311515707e-05, 2.5490842311515707e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:10,767] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=191.30954634908886, CurrSamplesPerSec=191.7289293882198, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:12,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=23, lr=[2.5277444742781996e-05, 2.5277444742781996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:12,437] [INFO] [timer.py:199:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=191.31383772058328, CurrSamplesPerSec=192.34549975279273, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
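The CurrSamplesPerSec figures hovering around 192 follow directly from the run's batch geometry: eight ranks at a per-device micro-batch of 4 give 32 samples per micro-step, and each 10-step logging interval takes about 1.67 s. A back-of-the-envelope reconstruction under those assumptions:

world_size = 8              # GPUs in the launch configuration
micro_batch_per_gpu = 4     # --per_device_train_batch_size
samples_per_step = world_size * micro_batch_per_gpu   # 32

# timer.py logs every 10 micro-steps; e.g. 23:30:10,767 -> 23:30:12,437
# between steps 1840 and 1850 is ~1.67 s.
elapsed = 1.67
print(samples_per_step / (elapsed / 10))   # ~191.6, matching the logged ~191-192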
[2023-04-21 23:30:14,101] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=23, lr=[2.5064026954196378e-05, 2.5064026954196378e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:14,110] [INFO] [timer.py:199:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=191.31603029432654, CurrSamplesPerSec=191.64242134346154, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:15,853] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:30:15,853] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-21 23:30:15,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=23, lr=[2.4850604499400404e-05, 2.4850604499400404e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:15,873] [INFO] [timer.py:199:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=191.2631066631131, CurrSamplesPerSec=192.5687574786294, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:17,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=23, lr=[2.46371929323757e-05, 2.46371929323757e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:17,544] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=191.26711458257225, CurrSamplesPerSec=192.85320091154927, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:19,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=23, lr=[2.442380780631037e-05, 2.442380780631037e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:19,216] [INFO] [timer.py:199:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=191.26999943849273, CurrSamplesPerSec=192.20612314746793, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:19,514] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1891
[2023-04-21 23:30:19,514] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-21 23:30:19,514] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
[2023-04-21 23:30:20,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=24, lr=[2.4231796653047277e-05, 2.4231796653047277e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:20,851] [INFO] [timer.py:199:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=191.29526934862113, CurrSamplesPerSec=191.11711568035543, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:22,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=24, lr=[2.4018504606023293e-05, 2.4018504606023293e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:22,527] [INFO] [timer.py:199:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=191.29634402611916, CurrSamplesPerSec=193.3056634534925, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:24,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=24, lr=[2.3805284089248203e-05, 2.3805284089248203e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:24,203] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=191.2963542946346, CurrSamplesPerSec=192.08234120595176, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:25,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=24, lr=[2.3592150641986648e-05, 2.3592150641986648e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:25,882] [INFO] [timer.py:199:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=191.2951779892919, CurrSamplesPerSec=192.01336762984943, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:27,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=24, lr=[2.3379119797157675e-05, 2.3379119797157675e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:27,551] [INFO] [timer.py:199:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=191.30003761635064, CurrSamplesPerSec=192.37224128491124, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:29,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=24, lr=[2.316620708020285e-05, 2.316620708020285e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:29,221] [INFO] [timer.py:199:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=191.3042160240787, CurrSamplesPerSec=192.10625994755705, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:30,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=24, lr=[2.295342800795468e-05, 2.295342800795468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:30,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=191.30484263010754, CurrSamplesPerSec=192.31049888096376, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:32,559] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=24, lr=[2.2740798087505783e-05, 2.2740798087505783e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:32,568] [INFO] [timer.py:199:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=191.30774502605962, CurrSamplesPerSec=189.0403847914422, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:34,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=24, lr=[2.2528332815078816e-05, 2.2528332815078816e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:34,234] [INFO] [timer.py:199:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=191.3142599721335, CurrSamplesPerSec=193.03293207346363, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:35,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=24, lr=[2.2316047674897034e-05, 2.2316047674897034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:35,905] [INFO] [timer.py:199:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=191.3174190058208, CurrSamplesPerSec=192.16842153491075, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:36,388] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:30:36,388] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
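MemAllocated holds flat at 4.32GB with a 12.79GB peak for the whole run, suggesting steady-state weights plus optimizer state, with the peak set by transient buffers early on. Figures like these can be read off with PyTorch's standard allocator counters; a sketch (the formatting is illustrative, not timer.py's actual code):

import torch

def memory_report():
    gb = 1024 ** 3   # the counters return bytes
    allocated = torch.cuda.memory_allocated() / gb
    peak = torch.cuda.max_memory_allocated() / gb
    return f"MemAllocated={allocated:.2f}GB, MaxMemAllocated={peak:.2f}GB"

# Call once per logging interval, e.g. after optimizer.step();
# torch.cuda.reset_peak_memory_stats() would restart the peak counter.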
iterations
[2023-04-21 23:30:36,388] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:30:36,388] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-21 23:30:37,567] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=24, lr=[2.2103958138055897e-05, 2.2103958138055897e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:37,576] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=191.320491103219, CurrSamplesPerSec=190.58115029357265, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:39,240] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=24, lr=[2.1892079661395495e-05, 2.1892079661395495e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:39,249] [INFO] [timer.py:199:stop] epoch=0/micro_step=2010/global_step=2010, RunningAvgSamplesPerSec=191.32253293466667, CurrSamplesPerSec=191.00234238695776, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:40,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=24, lr=[2.168042768637409e-05, 2.168042768637409e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:40,916] [INFO] [timer.py:199:stop] epoch=0/micro_step=2020/global_step=2020, RunningAvgSamplesPerSec=191.32795780109223, CurrSamplesPerSec=191.99853516742579, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:42,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=24, lr=[2.1469017637942804e-05, 2.1469017637942804e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:42,583] [INFO] [timer.py:199:stop] epoch=0/micro_step=2030/global_step=2030, RunningAvgSamplesPerSec=191.3333309503967, CurrSamplesPerSec=192.32482840643672, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:44,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=24, lr=[2.1257864923421404e-05, 2.1257864923421404e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:44,254] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=2040, RunningAvgSamplesPerSec=191.3365150428397, CurrSamplesPerSec=191.98562729669305, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:45,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=24, lr=[2.1046984931375433e-05, 2.1046984931375433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:45,924] [INFO] [timer.py:199:stop] epoch=0/micro_step=2050/global_step=2050, RunningAvgSamplesPerSec=191.3403070575335, CurrSamplesPerSec=191.98425422109315, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:47,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=24, lr=[2.08363930304948e-05, 2.08363930304948e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:47,589] [INFO] [timer.py:199:stop] epoch=0/micro_step=2060/global_step=2060, RunningAvgSamplesPerSec=191.34663145788122, CurrSamplesPerSec=192.26917567242347, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:49,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=24, lr=[2.0626104568473596e-05, 2.0626104568473596e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:49,322] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=191.31932031044258, CurrSamplesPerSec=192.87426316530676, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:50,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=24, lr=[2.0416134870891696e-05, 2.0416134870891696e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:50,992] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=2080, RunningAvgSamplesPerSec=191.32291548505154, CurrSamplesPerSec=192.2697265317232, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:52,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=24, lr=[2.0206499240097755e-05, 2.0206499240097755e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:52,663] [INFO] [timer.py:199:stop] epoch=0/micro_step=2090/global_step=2090, RunningAvgSamplesPerSec=191.32607940286832, CurrSamplesPerSec=191.53685598530697, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:53,142] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:30:53,143] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 256.0 to 512.0
[2023-04-21 23:30:54,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=24, lr=[1.999721295409402e-05, 1.999721295409402e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:54,331] [INFO] [timer.py:199:stop] epoch=0/micro_step=2100/global_step=2100, RunningAvgSamplesPerSec=191.33065992633382, CurrSamplesPerSec=192.27303175380843, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:55,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=24, lr=[1.9788291265422945e-05, 1.9788291265422945e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:56,003] [INFO] [timer.py:199:stop] epoch=0/micro_step=2110/global_step=2110, RunningAvgSamplesPerSec=191.33312086794018, CurrSamplesPerSec=192.2812953062055, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
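Note: the fused_optimizer messages above show DeepSpeed's dynamic loss scaling at work: after 100 consecutive overflow-free steps the scale doubles (128 -> 256 -> 512 here), and an overflow halves it again. A minimal sketch of that policy, assuming a scale window of 100 and a growth factor of 2 (the class and attribute names below are illustrative, not DeepSpeed's actual API):

class DynamicLossScaler:
    """Double the scale after `scale_window` clean steps; halve it on overflow."""

    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=100):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.clean_steps = 0  # steps since the last overflow

    def update(self, overflow):
        if overflow:
            # matches "Reducing dynamic loss scale from X to X/2"
            self.cur_scale = max(self.cur_scale / self.scale_factor, 1.0)
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps % self.scale_window == 0:
                # matches "No Grad overflow for 100 iterations" followed by
                # "Increasing dynamic loss scale from X to 2X"
                self.cur_scale *= self.scale_factor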
[2023-04-21 23:30:57,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=24, lr=[1.957974940005548e-05, 1.957974940005548e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:57,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=2120, RunningAvgSamplesPerSec=191.3084963256113, CurrSamplesPerSec=191.87364887757306, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:30:59,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=24, lr=[1.937160255628156e-05, 1.937160255628156e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:30:59,397] [INFO] [timer.py:199:stop] epoch=0/micro_step=2130/global_step=2130, RunningAvgSamplesPerSec=191.31070441374516, CurrSamplesPerSec=192.16346913764073, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:01,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=24, lr=[1.9163865903602374e-05, 1.9163865903602374e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:01,067] [INFO] [timer.py:199:stop] epoch=0/micro_step=2140/global_step=2140, RunningAvgSamplesPerSec=191.31429530850417, CurrSamplesPerSec=192.25760835622836, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:02,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=24, lr=[1.8956554581624824e-05, 1.8956554581624824e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:02,752] [INFO] [timer.py:199:stop] epoch=0/micro_step=2150/global_step=2150, RunningAvgSamplesPerSec=191.3100154752456, CurrSamplesPerSec=190.34222756080715, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:04,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=24, lr=[1.8749683698958277e-05, 1.8749683698958277e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:04,425] [INFO] [timer.py:199:stop] epoch=0/micro_step=2160/global_step=2160, RunningAvgSamplesPerSec=191.31178579334232, CurrSamplesPerSec=192.3099477880113, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:06,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=24, lr=[1.8543268332113316e-05, 1.8543268332113316e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:06,095] [INFO] [timer.py:199:stop] epoch=0/micro_step=2170/global_step=2170, RunningAvgSamplesPerSec=191.31554129384529, CurrSamplesPerSec=192.48231820209122, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:07,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=24, lr=[1.8337323524403127e-05, 1.8337323524403127e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:07,765] [INFO] [timer.py:199:stop] epoch=0/micro_step=2180/global_step=2180, RunningAvgSamplesPerSec=191.31889432830263, CurrSamplesPerSec=191.26200288422623, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:09,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=24, lr=[1.8131864284847043e-05, 1.8131864284847043e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:09,434] [INFO] [timer.py:199:stop] epoch=0/micro_step=2190/global_step=2190, RunningAvgSamplesPerSec=191.32299636013346, CurrSamplesPerSec=192.94940419286067, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:09,916] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:31:09,916] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0
[2023-04-21 23:31:11,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=24, lr=[1.7926905587076747e-05, 1.7926905587076747e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:11,107] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=2200, RunningAvgSamplesPerSec=191.32462935204532, CurrSamplesPerSec=189.03479358211646, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:12,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=24, lr=[1.7722462368245068e-05, 1.7722462368245068e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:12,776] [INFO] [timer.py:199:stop] epoch=0/micro_step=2210/global_step=2210, RunningAvgSamplesPerSec=191.32867899452444, CurrSamplesPerSec=192.33171836152727, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:14,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=24, lr=[1.7518549527937268e-05, 1.7518549527937268e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:14,449] [INFO] [timer.py:199:stop] epoch=0/micro_step=2220/global_step=2220, RunningAvgSamplesPerSec=191.33086100549934, CurrSamplesPerSec=192.09306266387296, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:16,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=24, lr=[1.7315181927085277e-05, 1.7315181927085277e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:16,116] [INFO] [timer.py:199:stop] epoch=0/micro_step=2230/global_step=2230, RunningAvgSamplesPerSec=191.33541315998843, CurrSamplesPerSec=192.69012428450628, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:17,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=24, lr=[1.7112374386884583e-05, 1.7112374386884583e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:17,786] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=2240, RunningAvgSamplesPerSec=191.33894235812485, CurrSamplesPerSec=192.162918887034, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:19,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=24, lr=[1.691014168771409e-05, 1.691014168771409e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:19,485] [INFO] [timer.py:199:stop] epoch=0/micro_step=2250/global_step=2250, RunningAvgSamplesPerSec=191.32747592315073, CurrSamplesPerSec=162.72326392845972, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:21,232] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=24, lr=[1.6708498568058996e-05, 1.6708498568058996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:21,241] [INFO] [timer.py:199:stop] epoch=0/micro_step=2260/global_step=2260, RunningAvgSamplesPerSec=191.28715111722204, CurrSamplesPerSec=192.39705966674694, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:22,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=24, lr=[1.6507459723436585e-05, 1.6507459723436585e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:22,912] [INFO] [timer.py:199:stop] epoch=0/micro_step=2270/global_step=2270, RunningAvgSamplesPerSec=191.2900989743992, CurrSamplesPerSec=191.90547258702165, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:24,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=24, lr=[1.630703980532528e-05, 1.630703980532528e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:24,580] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=2280, RunningAvgSamplesPerSec=191.2946617967073, CurrSamplesPerSec=192.69233739962874, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:26,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=24, lr=[1.6107253420096892e-05, 1.6107253420096892e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:26,250] [INFO] [timer.py:199:stop] epoch=0/micro_step=2290/global_step=2290, RunningAvgSamplesPerSec=191.2982772492552, CurrSamplesPerSec=190.7295621615246, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:26,732] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:31:26,732] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
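Note: the timer lines can be sanity-checked against the launch flags at the top of the log: 8 ranks x --per_device_train_batch_size 4 = 32 samples per step, so ~191 samples/s is ~6 steps/s, i.e. roughly 1.67 s per 10 logged steps, which matches the spacing of the timestamps above. Plain arithmetic, no assumptions beyond the flags shown:

gpus = 8                      # CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
per_device_batch = 4          # --per_device_train_batch_size 4
samples_per_step = gpus * per_device_batch            # 32
samples_per_sec = 191.3                               # RunningAvgSamplesPerSec
steps_per_sec = samples_per_sec / samples_per_step    # ~5.98
print(10 / steps_per_sec)                             # ~1.67 s per 10 steps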
[2023-04-21 23:31:27,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=24, lr=[1.5908115127952027e-05, 1.5908115127952027e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:27,924] [INFO] [timer.py:199:stop] epoch=0/micro_step=2300/global_step=2300, RunningAvgSamplesPerSec=191.2991907676395, CurrSamplesPerSec=192.01254354403758, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:29,584] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=24, lr=[1.5709639441859087e-05, 1.5709639441859087e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:29,593] [INFO] [timer.py:199:stop] epoch=0/micro_step=2310/global_step=2310, RunningAvgSamplesPerSec=191.3031897188706, CurrSamplesPerSec=191.53822266873686, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:31,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=24, lr=[1.5511840826496463e-05, 1.5511840826496463e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:31,266] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=2320, RunningAvgSamplesPerSec=191.30510996284409, CurrSamplesPerSec=190.3514058189915, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:32,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=24, lr=[1.5314733697198407e-05, 1.5314733697198407e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:32,940] [INFO] [timer.py:199:stop] epoch=0/micro_step=2330/global_step=2330, RunningAvgSamplesPerSec=191.30632596180382, CurrSamplesPerSec=192.0576039471067, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:34,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=24, lr=[1.5118332418904525e-05, 1.5118332418904525e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:34,612] [INFO] [timer.py:199:stop] epoch=0/micro_step=2340/global_step=2340, RunningAvgSamplesPerSec=191.30843207735168, CurrSamplesPerSec=192.04661165491694, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:36,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=24, lr=[1.4922651305112744e-05, 1.4922651305112744e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:36,292] [INFO] [timer.py:199:stop] epoch=0/micro_step=2350/global_step=2350, RunningAvgSamplesPerSec=191.3070128007127, CurrSamplesPerSec=191.368902525964, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:37,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=24, lr=[1.4727704616836296e-05, 1.4727704616836296e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:37,962] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=2360, RunningAvgSamplesPerSec=191.31056717288138, CurrSamplesPerSec=192.99629300502414, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:39,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=24, lr=[1.4533506561564306e-05, 1.4533506561564306e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:39,639] [INFO] [timer.py:199:stop] epoch=0/micro_step=2370/global_step=2370, RunningAvgSamplesPerSec=191.31001610827792, CurrSamplesPerSec=190.84753551968512, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:41,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=24, lr=[1.4340071292226371e-05, 1.4340071292226371e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:41,309] [INFO] [timer.py:199:stop] epoch=0/micro_step=2380/global_step=2380, RunningAvgSamplesPerSec=191.31345467442836, CurrSamplesPerSec=191.9537769997669, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:42,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=24, lr=[1.4147412906161172e-05, 1.4147412906161172e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:42,980] [INFO] [timer.py:199:stop] epoch=0/micro_step=2390/global_step=2390, RunningAvgSamplesPerSec=191.3163209782296, CurrSamplesPerSec=192.19841650568713, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:43,465] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:31:43,466] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2023-04-21 23:31:44,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=24, lr=[1.3955545444089015e-05, 1.3955545444089015e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:44,654] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=2400, RunningAvgSamplesPerSec=191.31786608926586, CurrSamplesPerSec=192.10158570220173, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:46,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=24, lr=[1.3764482889088581e-05, 1.3764482889088581e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:46,326] [INFO] [timer.py:199:stop] epoch=0/micro_step=2410/global_step=2410, RunningAvgSamplesPerSec=191.31992977871786, CurrSamplesPerSec=192.4003692681724, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:47,456] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2416
[2023-04-21 23:31:47,457] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[2023-04-21 23:31:47,457] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-04-21 23:31:47,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=25, lr=[1.3593226294894417e-05, 1.3593226294894417e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:47,960] [INFO] [timer.py:199:stop] epoch=0/micro_step=2420/global_step=2420, RunningAvgSamplesPerSec=191.34043544308162, CurrSamplesPerSec=192.65499788997437, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:49,624] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=25, lr=[1.3403731375800895e-05, 1.3403731375800895e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:49,633] [INFO] [timer.py:199:stop] epoch=0/micro_step=2430/global_step=2430, RunningAvgSamplesPerSec=191.3419473372644, CurrSamplesPerSec=191.83086811297375, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:51,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=25, lr=[1.3215081579350058e-05, 1.3215081579350058e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:51,301] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=2440, RunningAvgSamplesPerSec=191.34614574138618, CurrSamplesPerSec=192.22126299327172, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:53,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=25, lr=[1.302729065412083e-05, 1.302729065412083e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:53,054] [INFO] [timer.py:199:stop] epoch=0/micro_step=2450/global_step=2450, RunningAvgSamplesPerSec=191.31042152494763, CurrSamplesPerSec=192.25953614560518, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:54,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=25, lr=[1.28403722860986e-05, 1.28403722860986e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:54,727] [INFO] [timer.py:199:stop] epoch=0/micro_step=2460/global_step=2460, RunningAvgSamplesPerSec=191.31188082635265, CurrSamplesPerSec=191.9392282259264, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
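Note: the overflow at iteration 2416 shows the other half of the policy: the scale is halved and the parameter update is skipped, which is why `skipped` in the Rank 0 lines goes from 24 (through step 2410) to 25 (from step 2420 on). A sketch of that control flow, where `grads_have_inf_or_nan` and `unscale_` are hypothetical helpers standing in for the real overflow check and gradient unscaling:

def fp16_step(model, optimizer, scaler, loss):
    # Backward pass on the scaled loss to keep fp16 gradients representable.
    (loss * scaler.cur_scale).backward()
    overflow = grads_have_inf_or_nan(model)   # hypothetical helper
    scaler.update(overflow)                   # halve on overflow, grow when clean
    if overflow:
        optimizer.zero_grad()                 # drop the bad gradients
        return True                           # this step is counted in `skipped`
    unscale_(model, scaler.cur_scale)         # hypothetical helper
    optimizer.step()
    optimizer.zero_grad()
    return False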
[2023-04-21 23:31:56,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=25, lr=[1.2654340097677808e-05, 1.2654340097677808e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:56,399] [INFO] [timer.py:199:stop] epoch=0/micro_step=2470/global_step=2470, RunningAvgSamplesPerSec=191.31410553605406, CurrSamplesPerSec=192.17034753612026, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:58,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=25, lr=[1.2469207646669126e-05, 1.2469207646669126e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:58,075] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=2480, RunningAvgSamplesPerSec=191.31419696997983, CurrSamplesPerSec=182.65338816747456, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:31:59,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=25, lr=[1.2284988425311444e-05, 1.2284988425311444e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:31:59,746] [INFO] [timer.py:199:stop] epoch=0/micro_step=2490/global_step=2490, RunningAvgSamplesPerSec=191.3170485757835, CurrSamplesPerSec=192.56406069988222, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:01,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=25, lr=[1.2101695859288497e-05, 1.2101695859288497e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:01,433] [INFO] [timer.py:199:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=191.31216910207573, CurrSamplesPerSec=190.26532695230102, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:03,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=25, lr=[1.1919343306750463e-05, 1.1919343306750463e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:03,173] [INFO] [timer.py:199:stop] epoch=0/micro_step=2510/global_step=2510, RunningAvgSamplesPerSec=191.28314442878508, CurrSamplesPerSec=192.3036104460654, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:04,494] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:32:04,494] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2023-04-21 23:32:04,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=25, lr=[1.1737944057340422e-05, 1.1737944057340422e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:04,855] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=2520, RunningAvgSamplesPerSec=191.28088165291845, CurrSamplesPerSec=191.9537769997669, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:06,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=25, lr=[1.1557511331225821e-05, 1.1557511331225821e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:06,530] [INFO] [timer.py:199:stop] epoch=0/micro_step=2530/global_step=2530, RunningAvgSamplesPerSec=191.28178535006566, CurrSamplesPerSec=191.20233287319775, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:08,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=25, lr=[1.137805827813503e-05, 1.137805827813503e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:08,205] [INFO] [timer.py:199:stop] epoch=0/micro_step=2540/global_step=2540, RunningAvgSamplesPerSec=191.2828405808783, CurrSamplesPerSec=191.63941139347142, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:09,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=25, lr=[1.1199597976398956e-05, 1.1199597976398956e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:09,878] [INFO] [timer.py:199:stop] epoch=0/micro_step=2550/global_step=2550, RunningAvgSamplesPerSec=191.28468719487304, CurrSamplesPerSec=191.913155753703, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:11,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=25, lr=[1.1022143431997947e-05, 1.1022143431997947e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:11,551] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=2560, RunningAvgSamplesPerSec=191.28643521136559, CurrSamplesPerSec=191.9617385832177, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:13,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=25, lr=[1.0845707577613918e-05, 1.0845707577613918e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:13,228] [INFO] [timer.py:199:stop] epoch=0/micro_step=2570/global_step=2570, RunningAvgSamplesPerSec=191.28635959133507, CurrSamplesPerSec=190.78595420889013, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:14,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=25, lr=[1.0670303271687832e-05, 1.0670303271687832e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:14,902] [INFO] [timer.py:199:stop] epoch=0/micro_step=2580/global_step=2580, RunningAvgSamplesPerSec=191.28783928743673, CurrSamplesPerSec=191.73276382986322, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:16,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=25, lr=[1.0495943297482586e-05, 1.0495943297482586e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:16,577] [INFO] [timer.py:199:stop] epoch=0/micro_step=2590/global_step=2590, RunningAvgSamplesPerSec=191.28852097535196, CurrSamplesPerSec=191.41338832532554, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:18,244] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=25, lr=[1.0322640362151418e-05, 1.0322640362151418e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:18,253] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=2600, RunningAvgSamplesPerSec=191.28885574348791, CurrSamplesPerSec=189.7011372084543, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:19,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=25, lr=[1.015040709581177e-05, 1.015040709581177e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:19,930] [INFO] [timer.py:199:stop] epoch=0/micro_step=2610/global_step=2610, RunningAvgSamplesPerSec=191.28866443370143, CurrSamplesPerSec=191.72947715620745, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:21,249] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:32:21,249] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-04-21 23:32:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=25, lr=[9.979256050624853e-06, 9.979256050624853e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:21,605] [INFO] [timer.py:199:stop] epoch=0/micro_step=2620/global_step=2620, RunningAvgSamplesPerSec=191.28981002653805, CurrSamplesPerSec=191.19416179841224, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:23,269] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=25, lr=[9.809199699880844e-06, 9.809199699880844e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:23,278] [INFO] [timer.py:199:stop] epoch=0/micro_step=2630/global_step=2630, RunningAvgSamplesPerSec=191.29120045976242, CurrSamplesPerSec=191.56556043366314, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:25,007] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=25, lr=[9.640250437089863e-06, 9.640250437089863e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
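Note: the lr values trace the cosine decay requested with --lr_scheduler_type cosine and --num_warmup_steps 0, i.e. lr(t) = 0.5 * lr0 * (1 + cos(pi * t / T)) with lr0 = 5e-5. T is not printed anywhere in the log; the estimate below is inferred from the step=3000 entry later in the log and reproduces the other logged values to within about 1 percent:

import math

lr0 = 5e-5                              # --learning_rate 5e-5
t, lr_t = 3000, 4.417013524544378e-06   # from the step=3000 entry below
T = math.pi * t / math.acos(2 * lr_t / lr0 - 1)

def cosine_lr(step):
    return 0.5 * lr0 * (1 + math.cos(math.pi * step / T))

print(round(T))         # ~3714 total steps (inferred, not a logged value)
print(cosine_lr(2500))  # ~1.206e-05 vs 1.2102e-05 logged at step=2500 above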
[2023-04-21 23:32:25,016] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=2640, RunningAvgSamplesPerSec=191.26799299114538, CurrSamplesPerSec=190.52217473203373, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:26,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=25, lr=[9.47242057507875e-06, 9.47242057507875e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:26,693] [INFO] [timer.py:199:stop] epoch=0/micro_step=2650/global_step=2650, RunningAvgSamplesPerSec=191.2678426288326, CurrSamplesPerSec=191.38964179785535, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:28,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=25, lr=[9.305722345093696e-06, 9.305722345093696e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:28,369] [INFO] [timer.py:199:stop] epoch=0/micro_step=2660/global_step=2660, RunningAvgSamplesPerSec=191.2682961649593, CurrSamplesPerSec=189.19707304833474, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:30,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=25, lr=[9.140167895908867e-06, 9.140167895908867e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:30,042] [INFO] [timer.py:199:stop] epoch=0/micro_step=2670/global_step=2670, RunningAvgSamplesPerSec=191.2702296149729, CurrSamplesPerSec=190.76290535316323, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:31,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=25, lr=[8.975769292941003e-06, 8.975769292941003e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:31,711] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=2680, RunningAvgSamplesPerSec=191.27374395093753, CurrSamplesPerSec=192.76456817535524, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:33,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=25, lr=[8.812538517370097e-06, 8.812538517370097e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:33,380] [INFO] [timer.py:199:stop] epoch=0/micro_step=2690/global_step=2690, RunningAvgSamplesPerSec=191.2767495485274, CurrSamplesPerSec=188.14022024356876, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:35,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=25, lr=[8.650487465266271e-06, 8.650487465266271e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:35,050] [INFO] [timer.py:199:stop] epoch=0/micro_step=2700/global_step=2700, RunningAvgSamplesPerSec=191.2800679019321, CurrSamplesPerSec=193.46560096864167, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:36,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=25, lr=[8.489627946722731e-06, 8.489627946722731e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:36,717] [INFO] [timer.py:199:stop] epoch=0/micro_step=2710/global_step=2710, RunningAvgSamplesPerSec=191.2841260430213, CurrSamplesPerSec=191.930445128299, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:38,031] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:32:38,031] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-21 23:32:38,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=25, lr=[8.3299716849951e-06, 8.3299716849951e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:38,389] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=2720, RunningAvgSamplesPerSec=191.28614282395276, CurrSamplesPerSec=189.61993148023876, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:40,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=25, lr=[8.171530315647041e-06, 8.171530315647041e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:40,056] [INFO] [timer.py:199:stop] epoch=0/micro_step=2730/global_step=2730, RunningAvgSamplesPerSec=191.2903041022495, CurrSamplesPerSec=192.33888443851487, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:41,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=25, lr=[8.014315385702261e-06, 8.014315385702261e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:41,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=2740/global_step=2740, RunningAvgSamplesPerSec=191.2935141750068, CurrSamplesPerSec=191.46363028149423, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:43,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=25, lr=[7.858338352803005e-06, 7.858338352803005e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:43,391] [INFO] [timer.py:199:stop] epoch=0/micro_step=2750/global_step=2750, RunningAvgSamplesPerSec=191.29792157302413, CurrSamplesPerSec=192.599983067216, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:45,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=25, lr=[7.703610584374984e-06, 7.703610584374984e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:45,060] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=2760, RunningAvgSamplesPerSec=191.30118094921556, CurrSamplesPerSec=190.8179606784943, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:46,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=25, lr=[7.550143356798969e-06, 7.550143356798969e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:46,729] [INFO] [timer.py:199:stop] epoch=0/micro_step=2770/global_step=2770, RunningAvgSamplesPerSec=191.30441481618993, CurrSamplesPerSec=192.84571937613597, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:48,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=25, lr=[7.397947854588977e-06, 7.397947854588977e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:48,403] [INFO] [timer.py:199:stop] epoch=0/micro_step=2780/global_step=2780, RunningAvgSamplesPerSec=191.3057250073048, CurrSamplesPerSec=192.38271937620044, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:50,074] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=25, lr=[7.247035169577138e-06, 7.247035169577138e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:50,083] [INFO] [timer.py:199:stop] epoch=0/micro_step=2790/global_step=2790, RunningAvgSamplesPerSec=191.30444283333168, CurrSamplesPerSec=192.26476891165584, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:51,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=25, lr=[7.097416300105375e-06, 7.097416300105375e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:51,753] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=2800, RunningAvgSamplesPerSec=191.30710753143268, CurrSamplesPerSec=193.03209921287248, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:53,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=25, lr=[6.949102150223808e-06, 6.949102150223808e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:53,429] [INFO] [timer.py:199:stop] epoch=0/micro_step=2810/global_step=2810, RunningAvgSamplesPerSec=191.3075671917819, CurrSamplesPerSec=192.55494407749384, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:54,745] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:32:54,745] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-21 23:32:55,090] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=25, lr=[6.802103528896109e-06, 6.802103528896109e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:55,099] [INFO] [timer.py:199:stop] epoch=0/micro_step=2820/global_step=2820, RunningAvgSamplesPerSec=191.30999954095785, CurrSamplesPerSec=192.47376134685157, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:56,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=25, lr=[6.656431149211748e-06, 6.656431149211748e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:56,830] [INFO] [timer.py:199:stop] epoch=0/micro_step=2830/global_step=2830, RunningAvgSamplesPerSec=191.2905012209343, CurrSamplesPerSec=192.79585730394376, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:32:58,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=25, lr=[6.512095627605238e-06, 6.512095627605238e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:32:58,500] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=2840, RunningAvgSamplesPerSec=191.2933736551661, CurrSamplesPerSec=190.51919986827238, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:00,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=25, lr=[6.36910748308244e-06, 6.36910748308244e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:00,168] [INFO] [timer.py:199:stop] epoch=0/micro_step=2850/global_step=2850, RunningAvgSamplesPerSec=191.29678240660783, CurrSamplesPerSec=193.19325265965486, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:00,965] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2854
[2023-04-21 23:33:00,965] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-21 23:33:00,965] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-21 23:33:01,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=26, lr=[6.241578775259638e-06, 6.241578775259638e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:01,799] [INFO] [timer.py:199:stop] epoch=0/micro_step=2860/global_step=2860, RunningAvgSamplesPerSec=191.31522060232467, CurrSamplesPerSec=193.12792531454684, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:03,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=26, lr=[6.101179274758461e-06, 6.101179274758461e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:03,484] [INFO] [timer.py:199:stop] epoch=0/micro_step=2870/global_step=2870, RunningAvgSamplesPerSec=191.3121419053363, CurrSamplesPerSec=191.98397960833003, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:05,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=26, lr=[5.962157098449431e-06, 5.962157098449431e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:05,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=2880, RunningAvgSamplesPerSec=191.3054233255522, CurrSamplesPerSec=191.29525986779282, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:06,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=26, lr=[5.824522378107935e-06, 5.824522378107935e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:06,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=2890/global_step=2890, RunningAvgSamplesPerSec=191.29151625605178, CurrSamplesPerSec=192.19759083072236, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:08,559] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=26, lr=[5.688285144393169e-06, 5.688285144393169e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:08,567] [INFO] [timer.py:199:stop] epoch=0/micro_step=2900/global_step=2900, RunningAvgSamplesPerSec=191.2938263721855, CurrSamplesPerSec=192.32813552327264, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:10,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=26, lr=[5.553455326117138e-06, 5.553455326117138e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:10,244] [INFO] [timer.py:199:stop] epoch=0/micro_step=2910/global_step=2910, RunningAvgSamplesPerSec=191.29394324600074, CurrSamplesPerSec=192.50606055825122, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:11,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=26, lr=[5.420042749521021e-06, 5.420042749521021e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:11,916] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=2920, RunningAvgSamplesPerSec=191.29568098381088, CurrSamplesPerSec=190.8366813021817, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:13,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=26, lr=[5.2880571375590655e-06, 5.2880571375590655e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
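Note: across this stretch each overflow costs exactly one optimizer step: `skipped` rises from 24 to 25 around iteration 2416, to 26 around iteration 2854 above, and to 27 around iteration 2966 below, while the loss scale resumes doubling in between. A small consistency check, with the step/skipped pairs copied from the Rank 0 lines:

observed = [
    (2410, 24), (2420, 25),  # overflow at iteration 2416
    (2850, 25), (2860, 26),  # overflow at iteration 2854
    (2960, 26), (2970, 27),  # overflow at iteration 2966
]
overflows = [2416, 2854, 2966]
for it, ((s0, k0), (s1, k1)) in zip(overflows, zip(observed[::2], observed[1::2])):
    assert s0 < it < s1 and k1 == k0 + 1  # one skipped step per overflow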
0.95)] [2023-04-21 23:33:13,584] [INFO] [timer.py:199:stop] epoch=0/micro_step=2930/global_step=2930, RunningAvgSamplesPerSec=191.29916283975942, CurrSamplesPerSec=191.79002480652056, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:15,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=26, lr=[5.157508109189993e-06, 5.157508109189993e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:15,257] [INFO] [timer.py:199:stop] epoch=0/micro_step=2940/global_step=2940, RunningAvgSamplesPerSec=191.30091692141772, CurrSamplesPerSec=191.47646805390306, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:16,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=26, lr=[5.02840517867596e-06, 5.02840517867596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:16,931] [INFO] [timer.py:199:stop] epoch=0/micro_step=2950/global_step=2950, RunningAvgSamplesPerSec=191.30179045288867, CurrSamplesPerSec=192.72277288855003, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:17,911] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,911] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:17,912] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-21 23:33:18,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=26, lr=[4.90075775488921e-06, 4.90075775488921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:18,601] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=2960, RunningAvgSamplesPerSec=191.30474200880326, CurrSamplesPerSec=192.51185900010327, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] 
[fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2966 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:19,733] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0 [2023-04-21 23:33:19,733] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-21 23:33:20,228] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=27, lr=[4.787127222697066e-06, 4.787127222697066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:20,236] [INFO] [timer.py:199:stop] epoch=0/micro_step=2970/global_step=2970, RunningAvgSamplesPerSec=191.3204918264416, CurrSamplesPerSec=189.5540531952304, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:21,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=27, lr=[4.662270802678737e-06, 4.662270802678737e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:21,909] [INFO] [timer.py:199:stop] epoch=0/micro_step=2980/global_step=2980, RunningAvgSamplesPerSec=191.32209670424865, CurrSamplesPerSec=189.5072164191538, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:23,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=27, lr=[4.538896572837459e-06, 4.538896572837459e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:23,584] [INFO] [timer.py:199:stop] epoch=0/micro_step=2990/global_step=2990, RunningAvgSamplesPerSec=191.3225925729687, CurrSamplesPerSec=192.14778823014032, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:25,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=27, lr=[4.417013524544378e-06, 4.417013524544378e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:25,255] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=191.32483111945803, CurrSamplesPerSec=192.76789043267945, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-04-21 23:33:26,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=27, lr=[4.2966305404950695e-06, 4.2966305404950695e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:33:26,928] [INFO] [timer.py:199:stop] 
[2023-04-21 23:33:26,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=27, lr=[4.2966305404950695e-06, 4.2966305404950695e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:26,928] [INFO] [timer.py:199:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=191.32611259272937, CurrSamplesPerSec=189.15974276545458, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:28,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=27, lr=[4.177756394062146e-06, 4.177756394062146e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:28,680] [INFO] [timer.py:199:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=191.2976403521007, CurrSamplesPerSec=191.72399961717403, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:30,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=27, lr=[4.060399748655883e-06, 4.060399748655883e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:30,351] [INFO] [timer.py:199:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=191.29992043871684, CurrSamplesPerSec=192.40836794171742, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:32,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=27, lr=[3.944569157092839e-06, 3.944569157092839e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:32,019] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=191.30309071789864, CurrSamplesPerSec=192.9324854241654, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:33,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=27, lr=[3.830273060972528e-06, 3.830273060972528e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:33,688] [INFO] [timer.py:199:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=191.30588147163795, CurrSamplesPerSec=192.5919684546751, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:35,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=27, lr=[3.7175197900622294e-06, 3.7175197900622294e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:35,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=191.30830093878475, CurrSamplesPerSec=192.5535628463749, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:36,673] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:33:36,673] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-21 23:33:36,991] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3069
[2023-04-21 23:33:36,991] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-21 23:33:36,991] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-21 23:33:36,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=28, lr=[3.6173677557520186e-06, 3.6173677557520186e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:36,992] [INFO] [timer.py:199:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=191.32449633658953, CurrSamplesPerSec=245.25718086515744, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:38,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=28, lr=[3.507568398065414e-06, 3.507568398065414e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:38,662] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=191.3270012440131, CurrSamplesPerSec=190.82989804318555, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:39,125] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3082
[2023-04-21 23:33:39,125] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-21 23:33:39,125] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-21 23:33:40,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=29, lr=[3.4100879741825186e-06, 3.4100879741825186e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:40,292] [INFO] [timer.py:199:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=191.34420309427045, CurrSamplesPerSec=193.05653277715848, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:41,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=29, lr=[3.303271416662826e-06, 3.303271416662826e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:41,962] [INFO] [timer.py:199:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=191.34678193818561, CurrSamplesPerSec=192.5883759138489, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:43,621] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=29, lr=[3.1980360916233327e-06, 3.1980360916233327e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:43,630] [INFO] [timer.py:199:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=191.34977241581092, CurrSamplesPerSec=192.48480259289536, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:45,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=29, lr=[3.0943896684927976e-06, 3.0943896684927976e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:45,302] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=191.35129253490132, CurrSamplesPerSec=192.40505806510805, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:46,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=29, lr=[2.9923397009026438e-06, 2.9923397009026438e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:46,972] [INFO] [timer.py:199:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=191.35353365164403, CurrSamplesPerSec=192.4765215396088, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:48,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=29, lr=[2.891893626136438e-06, 2.891893626136438e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:48,640] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=191.35644988738858, CurrSamplesPerSec=192.70257371838127, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:48,936] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3141
[2023-04-21 23:33:48,936] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-04-21 23:33:48,936] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-21 23:33:50,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=30, lr=[2.8028695399406195e-06, 2.8028695399406195e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:50,269] [INFO] [timer.py:199:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=191.3736550810254, CurrSamplesPerSec=192.82577145262528, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:51,930] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=30, lr=[2.7054909321851562e-06, 2.7054909321851562e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:51,939] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=191.37594648565957, CurrSamplesPerSec=192.39154391738865, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:53,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=30, lr=[2.609737122460082e-06, 2.609737122460082e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:53,615] [INFO] [timer.py:199:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=191.37591658410975, CurrSamplesPerSec=192.90142917917055, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:55,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=30, lr=[2.515615089192297e-06, 2.515615089192297e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:55,287] [INFO] [timer.py:199:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=191.37731831868382, CurrSamplesPerSec=192.60855114932798, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:56,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=30, lr=[2.423131691886682e-06, 2.423131691886682e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:56,957] [INFO] [timer.py:199:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=191.37953746656612, CurrSamplesPerSec=192.07959230695474, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:33:58,616] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=30, lr=[2.332293670626265e-06, 2.332293670626265e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:33:58,625] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=191.38252750471636, CurrSamplesPerSec=192.1472380693328, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
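Each timer.py:199:stop line pairs CurrSamplesPerSec, the throughput of the step just completed, with RunningAvgSamplesPerSec, the average over the whole run; the ~245 samples/s outlier at step 3070 above is the skipped overflow step, which finishes early because the optimizer update never runs. A rough sketch of how such a metric pair can be maintained; the class below is hypothetical, not DeepSpeed's actual timer implementation:

    import time

    # Rough sketch of current vs. running-average throughput tracking
    # (hypothetical names; not DeepSpeed's utils/timer.py).
    class ThroughputTimer:
        def __init__(self, batch_size):
            self.batch_size = batch_size    # samples processed per step
            self.total_samples = 0
            self.total_elapsed = 0.0
            self._start = None

        def start(self):
            self._start = time.time()

        def stop(self):
            elapsed = time.time() - self._start
            self.total_samples += self.batch_size
            self.total_elapsed += elapsed
            curr = self.batch_size / elapsed                        # CurrSamplesPerSec
            running_avg = self.total_samples / self.total_elapsed   # RunningAvgSamplesPerSec
            return curr, running_avg

If the global micro-batch is 32 samples, ~191 samples/s works out to roughly 0.17 s per step, consistent with the ~1.7 s spacing between successive 10-step log lines above.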
[2023-04-21 23:34:00,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=30, lr=[2.2431076455809467e-06, 2.2431076455809467e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:00,296] [INFO] [timer.py:199:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=191.38442423242466, CurrSamplesPerSec=189.19573957023565, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:02,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=30, lr=[2.1555801165250605e-06, 2.1555801165250605e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:02,065] [INFO] [timer.py:199:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=191.35127452484858, CurrSamplesPerSec=192.25898534469545, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:03,705] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3229
[2023-04-21 23:34:03,705] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[2023-04-21 23:34:03,706] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-04-21 23:34:03,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=31, lr=[2.07822862911575e-06, 2.07822862911575e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:03,706] [INFO] [timer.py:199:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=191.36383724992095, CurrSamplesPerSec=246.1044320450924, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:05,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=31, lr=[1.9938697160500257e-06, 1.9938697160500257e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:05,374] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=191.36688773029388, CurrSamplesPerSec=192.39871445322692, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:07,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=31, lr=[1.9111874631457167e-06, 1.9111874631457167e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:07,052] [INFO] [timer.py:199:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=191.36640619873734, CurrSamplesPerSec=192.79751895759446, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:08,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=31, lr=[1.8301878961897722e-06, 1.8301878961897722e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:08,736] [INFO] [timer.py:199:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=191.36374093763962, CurrSamplesPerSec=192.09113821149847, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:10,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=31, lr=[1.7508769183369217e-06, 1.7508769183369217e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:10,424] [INFO] [timer.py:199:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=191.35945674518558, CurrSamplesPerSec=184.33741101950395, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:12,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=31, lr=[1.6732603096795003e-06, 1.6732603096795003e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:12,141] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=191.34556073840272, CurrSamplesPerSec=192.57759909549267, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:13,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=31, lr=[1.5973437268261448e-06, 1.5973437268261448e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:13,809] [INFO] [timer.py:199:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=191.34868048751267, CurrSamplesPerSec=191.73440720894718, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:15,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=31, lr=[1.5231327024895936e-06, 1.5231327024895936e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:15,476] [INFO] [timer.py:199:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=191.35184618727578, CurrSamplesPerSec=192.920006266916, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
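The lr values in these entries trace the tail of a cosine decay from the run's initial 5e-5 toward zero. The standard cosine schedule below approximately reproduces them when the total step count is set to roughly 3700, a figure inferred from the log itself (it ends near step 3680 with the lr almost at zero) rather than read from any config; the function is a generic sketch, not the run's actual scheduler:

    import math

    # Generic cosine learning-rate schedule (sketch; total_steps is inferred
    # from the log, not taken from the training configuration).
    def cosine_lr(step, max_lr=5e-5, warmup_steps=0, total_steps=3700):
        if step < warmup_steps:
            return max_lr * step / max(1, warmup_steps)   # linear warmup, unused here
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

For example, cosine_lr(3230) gives roughly 2.0e-06, in the same ballpark as the 2.07822862911575e-06 logged at step 3230.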
[2023-04-21 23:34:17,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=31, lr=[1.4506326450834578e-06, 1.4506326450834578e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:17,148] [INFO] [timer.py:199:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=191.3532422433782, CurrSamplesPerSec=192.0963618147989, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:18,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=31, lr=[1.379848838328049e-06, 1.379848838328049e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:18,817] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=191.35589762246502, CurrSamplesPerSec=192.33971132791453, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:20,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=31, lr=[1.3107864408653248e-06, 1.3107864408653248e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:20,486] [INFO] [timer.py:199:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=191.3581265669534, CurrSamplesPerSec=192.38768308100717, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:20,632] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:34:20,632] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2023-04-21 23:34:22,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=31, lr=[1.2434504858829161e-06, 1.2434504858829161e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:22,156] [INFO] [timer.py:199:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=191.36028662820075, CurrSamplesPerSec=192.51241125087853, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:23,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=31, lr=[1.1778458807473246e-06, 1.1778458807473246e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:23,831] [INFO] [timer.py:199:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=191.36092509255292, CurrSamplesPerSec=192.7487890167807, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:25,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=31, lr=[1.1139774066462883e-06, 1.1139774066462883e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:25,499] [INFO] [timer.py:199:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=191.36351698928894, CurrSamplesPerSec=192.49115188299461, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:27,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=31, lr=[1.051849718240308e-06, 1.051849718240308e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:27,169] [INFO] [timer.py:199:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=191.3658788834047, CurrSamplesPerSec=192.2559559961038, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:28,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=31, lr=[9.914673433234545e-07, 9.914673433234545e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:28,838] [INFO] [timer.py:199:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=191.36805843824956, CurrSamplesPerSec=192.19016007526216, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:30,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=31, lr=[9.328346824933554e-07, 9.328346824933554e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:30,506] [INFO] [timer.py:199:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=191.3710706571501, CurrSamplesPerSec=192.2967225046277, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:32,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=31, lr=[8.759560088305002e-07, 8.759560088305002e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:32,179] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=191.37197762089588, CurrSamplesPerSec=192.62541314989932, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:33,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=31, lr=[8.208354675868335e-07, 8.208354675868335e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:33,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=191.3749880875963, CurrSamplesPerSec=192.27055282659114, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:35,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=31, lr=[7.67477075883638e-07, 7.67477075883638e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:35,585] [INFO] [timer.py:199:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=191.3540186051657, CurrSamplesPerSec=192.35046153713864, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:37,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=31, lr=[7.158847224187776e-07, 7.158847224187776e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:37,258] [INFO] [timer.py:199:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=191.3552627883888, CurrSamplesPerSec=192.24328884534623, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:37,404] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:34:37,404] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-04-21 23:34:38,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=31, lr=[6.660621671832845e-07, 6.660621671832845e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:38,926] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=191.35811597330334, CurrSamplesPerSec=193.1518272167073, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:40,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=31, lr=[6.180130411873486e-07, 6.180130411873486e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:40,596] [INFO] [timer.py:199:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=191.35999192692023, CurrSamplesPerSec=192.73771746544605, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:42,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=31, lr=[5.717408461956952e-07, 5.717408461956952e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:42,265] [INFO] [timer.py:199:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=191.36242427938305, CurrSamplesPerSec=192.2168584284982, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:43,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=31, lr=[5.272489544723619e-07, 5.272489544723619e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:43,934] [INFO] [timer.py:199:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=191.36500198222734, CurrSamplesPerSec=192.61546144827122, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:45,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=31, lr=[4.84540608534953e-07, 4.84540608534953e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:45,600] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=191.3683182182581, CurrSamplesPerSec=192.3449484592224, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:47,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=31, lr=[4.4361892091831225e-07, 4.4361892091831225e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:47,272] [INFO] [timer.py:199:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=191.36964310332382, CurrSamplesPerSec=191.79989139443825, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:48,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=31, lr=[4.044868739476959e-07, 4.044868739476959e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:48,940] [INFO] [timer.py:199:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=191.37226281165158, CurrSamplesPerSec=192.0589780721085, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:50,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=31, lr=[3.671473195214159e-07, 3.671473195214159e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:50,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=191.3752046241602, CurrSamplesPerSec=191.9134301638913, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:52,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=31, lr=[3.3160297890300894e-07, 3.3160297890300894e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:52,275] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=191.37788461195314, CurrSamplesPerSec=192.49363650184006, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:53,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=31, lr=[2.97856442522898e-07, 2.97856442522898e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:53,945] [INFO] [timer.py:199:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=191.38002010344002, CurrSamplesPerSec=192.48839127061413, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:54,091] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:34:54,091] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-21 23:34:55,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=31, lr=[2.6591016978961826e-07, 2.6591016978961826e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:55,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=191.38130321997198, CurrSamplesPerSec=192.48866732828895, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:57,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=31, lr=[2.3576648891056875e-07, 2.3576648891056875e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:57,289] [INFO] [timer.py:199:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=191.38244967272672, CurrSamplesPerSec=192.16319401194346, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:34:58,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=31, lr=[2.074275967223427e-07, 2.074275967223427e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:34:58,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=191.3824576141193, CurrSamplesPerSec=190.94337754421207, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:00,628] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=31, lr=[1.8089555853061934e-07, 1.8089555853061934e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:00,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=191.38377400510933, CurrSamplesPerSec=190.41108607941166, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:00,933] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3571
[2023-04-21 23:35:00,933] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-21 23:35:00,933] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-21 23:35:02,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=32, lr=[1.5856318518868985e-07, 1.5856318518868985e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:02,274] [INFO] [timer.py:199:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=191.3963193648925, CurrSamplesPerSec=193.1184773568997, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:03,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=32, lr=[1.3546938777672101e-07, 1.3546938777672101e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:03,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=191.39680029645072, CurrSamplesPerSec=192.4825942423469, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:05,701] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=32, lr=[1.1418768859227935e-07, 1.1418768859227935e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:05,710] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=191.36963168562423, CurrSamplesPerSec=154.05703948825843, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:07,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=32, lr=[9.471963862098532e-08, 9.471963862098532e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:07,380] [INFO] [timer.py:199:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=191.37167528709523, CurrSamplesPerSec=192.2196112578267, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:09,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=32, lr=[7.706665667180091e-08, 7.706665667180091e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:09,050] [INFO] [timer.py:199:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=191.3734162882679, CurrSamplesPerSec=193.1190330935252, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:10,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=32, lr=[6.12300292736262e-08, 6.12300292736262e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:10,717] [INFO] [timer.py:199:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=191.3762769053204, CurrSamplesPerSec=192.91861979353823, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:12,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=32, lr=[4.721091058154936e-08, 4.721091058154936e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:12,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=191.37682241585716, CurrSamplesPerSec=192.27193000048706, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:14,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=32, lr=[3.5010322292722275e-08, 3.5010322292722275e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:14,076] [INFO] [timer.py:199:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=191.37410023694883, CurrSamplesPerSec=191.68347555358787, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:15,771] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=32, lr=[2.462915357190343e-08, 2.462915357190343e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:15,780] [INFO] [timer.py:199:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=191.3666058822583, CurrSamplesPerSec=178.21559379647334, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:17,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=32, lr=[1.6068160986662527e-08, 1.6068160986662527e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:17,451] [INFO] [timer.py:199:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=191.36837489255697, CurrSamplesPerSec=192.30168177272427, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-04-21 23:35:17,931] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-21 23:35:17,931] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-21 23:35:19,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=32, lr=[9.327968452232938e-09, 9.327968452232938e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 23:35:19,122] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=191.3700414340361, CurrSamplesPerSec=192.5115828759038, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
Epoch 1/1 with loss inf
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : -1.7075250148773193, acc (higher is better) : 0.6717171669006348
saving model ...
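The printed training loss is inf, most likely because the iterations that overflowed under fp16 contributed infinite values to the epoch's running mean. The evaluation summary then scores each validation prompt's chosen and rejected responses with the trained reward model and reports the mean score of the chosen responses (about -1.71) together with the pairwise accuracy, i.e. the fraction of pairs in which the chosen response outscores the rejected one (about 0.672). A minimal sketch of that metric, assuming per-pair scalar scores are already computed; the function name and signature are illustrative:

    import torch

    # Minimal sketch of pairwise reward-model evaluation (illustrative names).
    # chosen_scores[i] / rejected_scores[i] are the model's scalar rewards for
    # the chosen and rejected responses to the same prompt.
    def evaluate_reward(chosen_scores, rejected_scores):
        acc = (chosen_scores > rejected_scores).float().mean().item()
        return chosen_scores.mean().item(), acc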
[2023-04-21 23:35:34,700] [INFO] [launch.py:460:main] Process 2885170 exits successfully.
[2023-04-21 23:35:35,701] [INFO] [launch.py:460:main] Process 2885169 exits successfully.
[2023-04-21 23:35:35,701] [INFO] [launch.py:460:main] Process 2885171 exits successfully.
[2023-04-21 23:35:35,701] [INFO] [launch.py:460:main] Process 2885167 exits successfully.
[2023-04-21 23:35:35,701] [INFO] [launch.py:460:main] Process 2885173 exits successfully.
[2023-04-21 23:35:36,702] [INFO] [launch.py:460:main] Process 2885168 exits successfully.
[2023-04-21 23:35:36,703] [INFO] [launch.py:460:main] Process 2885172 exits successfully.
[2023-04-21 23:35:37,704] [INFO] [launch.py:460:main] Process 2885166 exits successfully.