Does gradient checkpointing work with this model?
Thank you for publishing such a wonderful model.
I am experiencing an issue where setting gradient_checkpointing=True in TrainingArguments does not seem to reduce VRAM usage during training.
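For context, here is a minimal sketch of how I am setting the flag (the output path and batch size below are placeholders, not my exact configuration):

```python
from transformers import TrainingArguments

# Illustrative setup only; output_dir and batch size are placeholders.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,  # expected to lower VRAM at the cost of extra compute
)
```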
My understanding may be incomplete, but when I compare the source code of modeling_gpt_neox.py with modeling_gpt_neox_japanese.py, it appears that the latter lacks the conditional on self.gradient_checkpointing seen here:
https://github.com/huggingface/transformers/blob/118e9810687dd713b6be07af79e80eeb1d916908/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L546
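For reference, this is a minimal, self-contained sketch of the pattern that conditional implements; the class names are made up for illustration and this is not the library code:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyBlock(nn.Module):
    """Stand-in for a decoder layer (hypothetical, for illustration only)."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, hidden_states):
        return torch.relu(self.ff(hidden_states))

class ToyModel(nn.Module):
    def __init__(self, dim=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
        self.gradient_checkpointing = True

    def forward(self, hidden_states):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Recompute this layer's activations during backward instead of
                # storing them, which is what reduces VRAM usage.
                hidden_states = checkpoint(layer, hidden_states)
            else:
                hidden_states = layer(hidden_states)
        return hidden_states

model = ToyModel().train()
out = model(torch.randn(2, 64, requires_grad=True))
out.sum().backward()
```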
Is this an intentional modification or perhaps an oversight? I would appreciate any insights you might have regarding this.
transformers v4.29.1
Thanks for looking at all the details and asking the question.
The difference regarding gradient_checkpointing is not intentional. At the time we submitted our pull request, GPT NeoX had the same implementation; gradient checkpointing was later fixed for GPT NeoX in the following commit, which is why the two models now differ.
https://github.com/huggingface/transformers/commit/225c36fbe5ae2bdb1880da52e093c7e53596a7d1
Thank you for your response! I now understand the situation.
It might be helpful if you could add support for gradient_checkpointing, or at least emit a warning when the flag is set to True.
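Just to illustrate the kind of guard I mean (the class and method names here are hypothetical stand-ins, not the actual library internals):

```python
import warnings

class GPTNeoXJapaneseModelSketch:
    """Hypothetical stand-in used only to illustrate the suggested warning."""

    def __init__(self):
        self.gradient_checkpointing = False

    def gradient_checkpointing_enable(self):
        # Until checkpointing is actually wired into forward(), warn instead of
        # silently accepting the flag.
        warnings.warn(
            "Gradient checkpointing is not implemented for this model; "
            "enabling it will not reduce memory usage."
        )

model = GPTNeoXJapaneseModelSketch()
model.gradient_checkpointing_enable()  # emits the warning instead of doing nothing
```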
We cannot promise a completion date, but we have started preparing a PR. Thank you for reminding us of this update opportunity!