Making ProtGPT2-medium and ProtGPT2-small available?
@nferruz Hi Noelia,
Would you consider also making ProtGPT2-medium and ProtGPT2-small available, please?
This would be of great help to people who want to debug or who don't have large-capacity GPU machines.
Currently, with the parameters below, the program crashes with a CUDA out-of-memory error when running on an AWS p3.16xlarge.
The AWS EC2 p3.16xlarge instance type is powered by 8 NVIDIA Tesla V100 GPUs, each with 16 GB of GPU memory, so the instance provides 128 GB of GPU memory in total.
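This is easy to confirm from the instance itself with a quick nvidia-smi query (exact figures vary slightly by driver version):
# Print total memory per visible GPU; a p3.16xlarge should list
# eight V100s at roughly 16 GB each.
nvidia-smi --query-gpu=index,name,memory.total --format=csv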
Do you have any suggestions for parameters I can use to avoid that? For reference, here is the command I ran:
TRAINING_FILE="data/ha_filtered_108k.train.gpt2_format.txt" # 80K lines
VALIDATION_FILE="data/ha_filtered_108k.validation.gpt2_format.txt" # 20K lines
MODEL_OUTPUT_DIR="gpt2_model/ha_filtered_108k"
python run_clm.py --model_name_or_path nferruz/ProtGPT2 \
--train_file ${TRAINING_FILE} \
--validation_file ${VALIDATION_FILE} \
--tokenizer_name nferruz/ProtGPT2 \
--do_train \
--do_eval \
--output_dir ${MODEL_OUTPUT_DIR} \
--overwrite_output_dir \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps=16 \
--fp16 \
--learning_rate 1e-06
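For context, the effective (global) batch size this command targets, assuming all 8 GPUs are visible, is:
# per_device_train_batch_size x num_gpus x gradient_accumulation_steps
echo $(( 1 * 8 * 16 ))   # 128 sequences per optimizer step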
Sincerely,
Littleworth
Hi Littleworth,
I'm afraid I never trained a small or medium version. I did train another model, but it is even bigger.
Sorry I don't have better news for now!
@nferruz Hi Noelia,
Thanks. I finally managed to get it running with the help of DeepSpeed.
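For anyone reproducing this, DeepSpeed was installed into the same Python environment that runs run_clm.py (a minimal sketch; exact package versions are not recorded here):
# DeepSpeed compiles its CUDA ops against the local toolkit, so install
# it inside the environment used for training.
pip install deepspeed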
Here is the full code:
#!/bin/bash
export LC_ALL=C
TRAINING_FILE="data/ha_filtered_108k.train.gpt2_format.txt"
VALIDATION_FILE="data/ha_filtered_108k.validation.gpt2_format.txt"
MODEL_OUTPUT_DIR="gpt2_model/ha_filtered_108k"
DS_CONFIG_FILE="ds_config.json"
/home/ubuntu/storage1/conda_envs/py38/bin/deepspeed --num_gpus=8 run_clm.py --model_name_or_path nferruz/ProtGPT2 \
--train_file ${TRAINING_FILE} \
--validation_file ${VALIDATION_FILE} \
--tokenizer_name nferruz/ProtGPT2 \
--do_train \
--do_eval \
--output_dir ${MODEL_OUTPUT_DIR} \
--overwrite_output_dir \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps=16 \
--fp16 \
--learning_rate 1e-06 \
--deepspeed ${DS_CONFIG_FILE}
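If the launcher fails to start, DeepSpeed ships a ds_report utility that prints the detected torch/CUDA versions and op compatibility, which helps diagnose environment problems:
# Summarize the DeepSpeed installation and its CUDA op status.
ds_report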
The content of the ds_config.json file is:
{
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-6,
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 1e-6,
"warmup_num_steps": "auto"
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 16,
"gradient_clipping": 1.0
}
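A quick sanity check before launching is to confirm the file parses as valid JSON (json.tool is part of the Python standard library):
# Pretty-prints the config on success; exits non-zero on a syntax error.
python -m json.tool ds_config.json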
Everything completes in less than 10 minutes on the p3.16xlarge.
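While it runs, it's easy to verify that all eight GPUs are actually in use:
# Refresh nvidia-smi every second; each V100 should show a run_clm.py
# worker process and non-trivial memory usage.
watch -n 1 nvidia-smi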
Hope this information will help others.
Regards,
littleworth
Hi, thank you for sharing all these tricks. May I ask: are you still using the 8x V100 GPUs with 16 GB each in this case with DeepSpeed? Thanks!