Following the blog for fine-tuning gemma-2b doesn't yield the same results

#60 · opened by chongdashu

Following the blog here: https://huggingface.co./blog/gemma-peft

I've replicated the entire blog but don't get the same results: the model still outputs the same as it did before fine-tuning.

Here is the notebook

It seems that if I rely on the latest dependencies, i.e.

!pip install -q -U accelerate bitsandbytes git+https://github.com/huggingface/transformers.git
!pip install datasets -q
!pip install peft -q

then the fine-tuning fails.
But if I use the following pinned versions...

!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1

I get the same results as the blog.

I am surprised that a change in libraries would cause such a big drop-off.

Hi @chongdashu
Thanks for the report!
To isolate which lib is responsible, can you try the same experiment with:

  • peft == 0.8.2 vs peft == 0.11.0 (while keeping all other libs at the 'stable' versions)
  • trl == 0.7.2 vs trl == 0.8.6 (while keeping all other libs at the 'stable' versions)

I will also try to reproduce on my end and report here.
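
Concretely, the two runs could be pinned like this (a sketch that takes the 'stable' set above as the baseline; rerun the notebook after each install):

# Experiment 1: vary peft only (everything else at the 'stable' pins)
!pip3 install -q -U bitsandbytes==0.42.0 trl==0.7.10 accelerate==0.27.1 datasets==2.17.0 transformers==4.38.1
!pip3 install -q -U peft==0.11.0   # then rerun with peft==0.8.2

# Experiment 2: vary trl only (everything else at the 'stable' pins)
!pip3 install -q -U bitsandbytes==0.42.0 peft==0.8.2 accelerate==0.27.1 datasets==2.17.0 transformers==4.38.1
!pip3 install -q -U trl==0.8.6     # then rerun with trl==0.7.2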

@ybelkada sure thing, let me give it a whirl

With peft==0.11.0

I get the following error when trying to train:

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:258, in GradScaler._unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    256     continue
    257 if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 258     raise ValueError("Attempting to unscale FP16 gradients.")
    259 if param.grad.is_sparse:
    260     # is_coalesced() == False means the sparse grad has values with duplicate indices.
    261     # coalesce() deduplicates indices and adds all values that have the same index.
    262     # For scaled fp16 values, there's a good chance coalescing will cause overflow,
    263     # so we should check the coalesced _values().
    264     if param.grad.dtype is torch.float16:

ValueError: Attempting to unscale FP16 gradients.
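
For context, this error typically means the trainable (LoRA) parameters are themselves stored in fp16 while fp16=True mixed precision is enabled: GradScaler refuses to unscale fp16 gradients. A common workaround, sketched below assuming model is the PEFT-wrapped model from the notebook (this isn't necessarily the fix that later landed upstream), is to upcast the trainable parameters to float32:

import torch

# Upcast only the trainable (LoRA) parameters so GradScaler never
# sees fp16 gradients; the frozen quantized base weights are untouched.
for param in model.parameters():
    if param.requires_grad and param.dtype == torch.float16:
        param.data = param.data.to(torch.float32)

Alternatively, on Ampere or newer GPUs, passing bf16=True instead of fp16=True in TrainingArguments sidesteps the GradScaler entirely.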

With trl==0.8.6, I replicate the issue: the training loss basically never decreases and the fine-tuning doesn't complete successfully.

Hi @ybelkada, any idea what might be going on here with TRL?


@chongdashu we are about to merge a change to transformers that'll fix the fine-tuning issues. I will post a notebook version of the blog soon, after I confirm it works well.

@merve great to hear, thanks!


@chongdashu we have made a few changes around fine-tuning (also a smol change in the API); you can see them here: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing

Thanks @merve, will check it out!

@merve does this need an update of the transformers version?

edit:
Oh wait, I see it: git+https://github.com/huggingface/transformers.git

It's not immediately obvious what the API change is, though.

I've tried using the latest transformers with trl, but I still see the same training-loss issue on gemma-2b.

!pip install --force-reinstall trl accelerate datasets peft bitsandbytes git+https://github.com/huggingface/transformers.git

import transformers
from trl import SFTTrainer

# `model`, `data`, `lora_config`, and `formatting_func` come from
# earlier cells (sketched further below).
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # effective batch size of 4
        warmup_steps=2,
        max_steps=10,                   # short test run
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=".outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
    packing=False
)
trainer.train()

[screenshot of the training log: the loss does not decrease]
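
For reference, model, data, lora_config, and formatting_func in the snippet above come from earlier cells, roughly along these lines (reconstructed from the blog, so details may differ from my actual notebook):

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2b"

# 4-bit NF4 quantization of the base model, as in the blog
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}
)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

data = load_dataset("Abirate/english_quotes")

def formatting_func(example):
    # Train on "Quote: ... Author: ..." pairs, as in the blog
    return [f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"]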

Hi @chongdashu,

Fine-tuning models like Gemma-2B requires proper memory management, appropriate learning rates, and a solid reward structure if reinforcement-based fine-tuning is used. I have reproduced the issue. Could you please refer to this gist for reference? Kindly try it and let me know if you have any concerns.
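
For instance, something along these lines; this is an illustrative sketch only (the exact values are assumptions on my part, not the contents of the gist):

from transformers import TrainingArguments

# More conservative settings than the snippet above: a lower learning
# rate, a longer warmup, bf16 instead of fp16, and gradient
# checkpointing to keep memory in check. All values here are
# illustrative assumptions.
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=100,
    learning_rate=2e-5,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)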

Thank you.
