Merge cekal/mpt-7b-peft-compatible

#42
by muelletm - opened

Merges https://huggingface.co./cekal/mpt-7b-peft-compatible by @cekal.

This adds support for PEFT as well as QLoRA.

I tested that qlora starts training:

https://github.com/artidoro/qlora/issues/10

git clone https://huggingface.co./mosaicml/mpt-7b
pushd mpt-7b 
git fetch origin refs/pr/42:pr/42
git checkout pr/42
popd

python qlora.py \
    --model_name_or_path ./mpt-7b \
    --trust_remote_code True \
    --output_dir /output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit
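
For orientation, the schedule implied by these flags can be worked out in a few lines of Python. This is only a sketch: it assumes a single GPU (the command does not say how many devices were used), and the variable names simply mirror the flags above.

```python
# Values copied from the qlora.py command above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 10000
save_steps = 1000
eval_steps = 1000
save_total_limit = 40
num_devices = 1  # assumption: the command does not state the device count

# Sequences consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Sequences consumed over the whole run.
total_sequences = effective_batch_size * max_steps
# Checkpoints written and evaluations run at the configured intervals.
num_checkpoints = max_steps // save_steps
num_evals = max_steps // eval_steps

print(effective_batch_size, total_sequences, num_checkpoints, num_evals)
```

Under that single-device assumption, the run would write fewer checkpoints than --save_total_limit allows, so no checkpoint should be pruned.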
muelletm changed pull request status to open

Any differences from #25?

Looks pretty similar TBH.

One difference is this line that is needed to work properly with device_map="auto":

(Around L290)

        outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache, inputs_embeds=inputs_embeds)
        
        last_hidden_state = outputs.last_hidden_state
        if self.model_parallel:
            last_hidden_state = last_hidden_state.to(self.transformer.wte.weight.device)
        logits = F.linear(last_hidden_state, self.transformer.wte.weight)

But that line could also be added there, I suppose.

There might be subtle differences in other places, too, but as I said the code looks pretty similar.

I'm not sure why the additional inputs_embeds parameter is needed. Maybe it's used in cases where the embeddings are already available? Does anyone know?
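
For context, Hugging Face model forward methods conventionally accept either input_ids or precomputed inputs_embeds, and PEFT's prompt-based methods (such as prompt tuning) build the embeddings themselves so they can prepend learned virtual tokens. The toy sketch below illustrates that convention with plain Python lists. The names resolve_embeddings, embedding_table, and virtual_tokens are hypothetical and are not code from this PR.

```python
def resolve_embeddings(embedding_table, input_ids=None, inputs_embeds=None):
    """Return token embeddings from exactly one of the two inputs."""
    if (input_ids is None) == (inputs_embeds is None):
        raise ValueError("Pass exactly one of input_ids or inputs_embeds")
    if inputs_embeds is not None:
        # The caller already has embeddings, so skip the table lookup.
        return inputs_embeds
    return [embedding_table[token_id] for token_id in input_ids]


# Hypothetical 2-dimensional embedding table for a 3-token vocabulary.
embedding_table = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.4]}

# Normal path: look up embeddings from token ids.
from_ids = resolve_embeddings(embedding_table, input_ids=[1, 2])

# Prompt-tuning-style path: prepend a learned vector, then pass embeddings directly.
virtual_tokens = [[9.0, 9.0]]
with_prompt = resolve_embeddings(
    embedding_table, inputs_embeds=virtual_tokens + from_ids
)
```

Without an inputs_embeds argument on the model, the second call path has nowhere to hand over the modified embeddings, which is one plausible reason the parameter was added.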

I made a similar version of this for 30B too, on top of the latest foundry changes, and it trains with QLoRA: https://huggingface.co./eluzhnica/mpt-30b-peft-compatible. It does train well from what I've tried.

Can you do the same thing for the 30B version?

I tried this and it gives the error:
TypeError: forward() takes 2 positional arguments but 3 were given

I think this is the same error you get when you set "--gradient_checkpointing False".
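
For readers unfamiliar with this message: Python raises it whenever a method receives one more positional argument than its signature allows (self counts as the first). The snippet below is a generic reproduction using hypothetical classes. It does not claim to show where MPT's code triggers the error.

```python
class Block:
    def forward(self, hidden_states):
        # Accepts self plus exactly one positional argument.
        return hidden_states


def call_with_extra_arg(block, hidden_states, attention_mask):
    # A caller that forwards one positional argument too many,
    # as a wrapper around the module might.
    return block.forward(hidden_states, attention_mask)


try:
    call_with_extra_arg(Block(), [1.0], [1])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

When debugging, the traceback frame just above the failing forward call shows which caller is passing the extra argument.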

So I know MPT-7B doesn't support gradient checkpointing with the Hugging Face Trainer, but if you set it to False, you get the "TypeError: forward() takes 2 positional arguments but 3 were given" error? I have been dealing with that error for weeks now, and this might be the breakthrough I needed to convince me to just abandon MPT altogether.

Cannot merge
This branch has merge conflicts in the following files:
  • modeling_mpt.py