Fine-tune the model?
Hi everyone, is it possible to fine-tune the BLOOM model? If yes, how to do that? Thanks!
Hey all,
I'm also interested in understanding how I can fine-tune this model to do a specific generation task after giving it many prompts.
Hi everyone,
If you have enough compute you could fine-tune BLOOM on any downstream task, but you would need enough GPU RAM to hold the model weights plus the gradients and optimizer states, which is quite costly. A common task that people fine-tune autoregressive models on is question answering. If you are interested in doing that, I would first try it on one of the smaller BLOOM models (ideally 1b3, since it is one of the small models that has been fully trained).
Another option could be "prompt-tuning": https://youtu.be/8HwHGGb1zpQ?t=2455. It could be interesting to apply this method to BLOOM, as it won't require storing the optimizer state of the whole model.
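To give a rough feel for "quite costly", here is a back-of-the-envelope sketch. It assumes the commonly quoted figure of about 16 bytes per parameter for mixed-precision training with Adam (half-precision weights and gradients plus fp32 master weights and two fp32 optimizer moments). That figure, the nominal parameter counts and the function name are assumptions for illustration, and activations are not counted at all.

```python
# Rough estimate of GPU memory needed for full fine-tuning.
# Assumption: ~16 bytes per parameter (mixed-precision Adam), activations excluded.

def finetune_memory_gb(n_params, bytes_per_param=16):
    """Approximate memory for weights + gradients + optimizer states, in GB."""
    return n_params * bytes_per_param / 1e9

# Nominal parameter counts taken from the model names (approximate).
for name, n_params in [("bloom-560m", 560e6), ("bloom-7b1", 7.1e9), ("bloom (176B)", 176e9)]:
    print(f"{name}: ~{finetune_memory_gb(n_params):,.0f} GB")
```

Prompt-tuning avoids most of this because only a small set of prompt embeddings is trained, so gradients and optimizer states are only needed for those.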
Thanks for your reply,
I'm not able to find any code on how it should be done.
Can you please refer me to the code that can do the fine-tuning?
Hi ybelkada, thank you for your reply. And I just watched the video you shared with us, thanks for the video!
I have the same question as AnaRhisT: is there any code we can use to do model tuning?
And is there any code to do prompt tuning as mentioned in the video?
Thank you very much!
It is highly recommended that you use the same codebase that was originally used to train the model rather than the huggingface port. You can find that codebase here
Hi stellaathena, thank you for your message!
Indeed I should try the codebase that has been used originally!
I'm quite new to this field, in the link you shared I only see the code to pretrain BERT, GPT and T5, but no BLOOM.
Should I re-use the GPT code? (I saw that they are similar.) And may I download the BLOOM model from Hugging Face, or should I download it from somewhere else?
Do you have any code using the original codebase to fine-tune BLOOM?
Thank you!
@NXBY if you are that new to this field, finetuning this model is almost certainly not what you want to be doing. This model is expensive to finetune and to do inference with, and requires highly specialized hardware. You should start off with a model like GPT-2 or GPT-Neo, which is several orders of magnitude smaller and substantially cheaper to use
@stellaathena
Hi, thank you for your suggestion. I've tried to fine-tune GPT-3 Babbage via the OpenAI API, and the fine-tuned model is not performing well for my task which is kind of complicated. I'm not sure if a smaller model can have a better performance.
I'd like to see if there exists any code so that I can do fine-tuning on a smaller version of BLOOM, and then use the same code to fine-tune a larger version of BLOOM on a server.
If you could share with us any code it would be great. I'm willing to learn the code :-)
Thank you for your messages :-)
Exactly what NXBY said, I'd also love to see code that can fine-tune a smaller version of BLOOM. Then we can easily transfer it to the larger version.
Side note: About the hardware stuff, please don't worry about it. I understand I need at least 8x A100 (80GB versions) for the largest model, which isn't a problem either.
Same question here - I'm happy to hack around using the GPT code, etc. but if someone has already solved for customizing a small BLOOM model on something like SQUAD or similar, it would save me (and probably a bunch of others) a ton of time.
Thank you for your messages! (And sorry to respond late here, just noticed the notifications)
I have found a script for Named Entity Recognition in transformers that has been used for BLOOM: https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification (check this PR: https://github.com/huggingface/transformers/pull/18632)
I believe we can use this codebase as a starting point. Which task do you plan to fine-tune the model on?
Thanks ybelkada! I think at least a few of us are looking to tune BLOOM for question answering.
Thank you @jasoneden for your message and @ybelkada for your reply! I'm going to dive into the code you shared with us as well :-)
Hi @jasoneden & @NXBY!
We have just merged a PR that adds BloomForQuestionAnswering, here: https://github.com/huggingface/transformers/pull/19310
Can you try this model out with the provided training script for question answering? https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
Note that you need the latest version of transformers (for example via pip install git+https://github.com/huggingface/transformers.git@main). And of course let me know if you run into any issues! ;)
Wow, that's fantastic! Cannot wait to give it a whirl. Thank you so much!
Hey @ybelkada - Gave this a go just a bit ago. Installed the latest version of transformers (testing in a Colab notebook, so I have to do that every time anyway) and cloned the repo, then ran the script with some configurations I had practiced with on the T5 sequence-to-sequence tuner (the one listed in the link you gave, which I had previously run successfully...). I'm not sure if I'm doing something wrong, or if there's something else in transformers that needs to be updated, but I'm getting an error that says BLOOM is not one of the models supported by this script. Here's the code I used:
https://colab.research.google.com/drive/1aReC_7AQ5t1HcA_lAm2UrJknNx3O14tU#
The specific error I've seen before:
"ValueError: Unrecognized configuration class <class 'transformers.models.bloom.configuration_bloom.BloomConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM.
Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, T5Config, XLMProphetNetConfig."
Am I using the wrong script, or do I just need to modify something to bypass this check?
(And by the way - I know the colab is underpowered to actually do the model training. I'm just looking at this as an easy way to POC that the procedure works, and also easy way to share the code...)
Thanks for your message!
In my opinion you are probably using the wrong script: the one you shared with me uses run_seq2seq_qa.py, which supports Seq2Seq (encoder-decoder) models such as T5. Fortunately there is another script, run_qa.py, that supports AutoModelForQuestionAnswering models.
I tried your Colab notebook with run_qa.py and it seems to work fine! Below is the command I ran:
!python run_qa.py \
--model_name_or_path bigscience/bloom-560m \
--dataset_name squad_v2 \
--do_train \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_seq2seq_squad/ \
--eval_accumulation_steps 1 \
--version_2_with_negative \
--overwrite_output_dir
However, I had to slightly change run_qa.py, and I have a question that I believe you can answer. It seems that for encoder-decoder and encoder-only models, we label impossible answers with the index of the CLS token. For BLOOM and other autoregressive decoder-only models this token does not exist, so line 404, cls_index = input_ids.index(tokenizer.cls_token_id), throws an error. I had to replace it with cls_index = input_ids[-1], but I think this is wrong. We should probably prepend a special token to the tokenized input, for example the <bos> token, and use that token to mark impossible answers. I am not sure here, so I would like to hear from you: how would you approach this problem?
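To make the <bos> idea concrete, here is a minimal sketch on plain lists of token IDs. The helper name and the toy IDs are hypothetical, and it ignores everything else run_qa.py does (offset mappings, strides, truncation). The only point is that prepending one token shifts every answer position by one and leaves position 0 free to stand for "no answer".

```python
# Sketch: prepend a BOS token and use its position (0) for unanswerable questions.
# add_null_answer_token is a hypothetical helper, not part of run_qa.py.

def add_null_answer_token(input_ids, start_pos, end_pos, bos_token_id, answerable=True):
    """Return (new_input_ids, new_start, new_end) after prepending BOS."""
    ids = [bos_token_id] + list(input_ids)
    if not answerable:
        # Impossible answers point at the prepended token.
        return ids, 0, 0
    # Answerable spans move one position to the right.
    return ids, start_pos + 1, end_pos + 1

toy_ids = [10, 11, 12, 13]  # made-up token IDs
print(add_null_answer_token(toy_ids, 2, 3, bos_token_id=1))
print(add_null_answer_token(toy_ids, 0, 0, bos_token_id=1, answerable=False))
```

Note that the label here is a position in the sequence, which is why using a token ID such as input_ids[-1] in its place looks wrong.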
Ah, great catch! Yeah, I suspected the seq2seq.py might be the wrong script and I hadn't put together yet that the auto models should work now (doh!), so thanks for that pointer and confirming the code at least runs!
I find it interesting that there isn't (yet) a script specifically for decoder-only models, and I'm not surprised there was a catch when you ran the script at first. I think your idea of using the bos token may be the right answer. I may see if I can get that to work, and if so, what the outcomes are. Will keep you updated!
Perfect thank you very much! Can't wait to see the first BLOOM model finetuned on Question Answering !
Hi @ybelkada and @jasoneden , thank you for your messages! I tried to pre-pend the input with a special token "¤" and the code you shared is running without error messages! (It's been 15 minutes that the code is running. I'll see if any error message appears later).
I was able to get it running as well. Tried a few things - the bos_token didn't work (expecting an int) nor did bos_token_id - no surprise. I tried a few other things, and eventually landed on assigning an out of range int (sys.maxsize, then subtract one...) and that seemed to work for training, etc. With the 560m model, my free Colab quickly ran out of RAM during training though, so I haven't actually gotten to see it work. Trying to see if I can get this to run locally or if I'll have to pull out my credit card and pony up for a bigger runtime.
Since the presumed answer for "no answer" would be the same - i.e. "I don't know..." - I'm wondering if the IDs need to be unique, or if just assigning a number that's not already taken is going to suffice. If any int value will do (within reason), maybe that's good enough? Of course, I think I'm logically just representing the same solution that @ybelkada came up with in the first place, but am looking forward to giving it a full training. It's too bad they took the 350m model out of the repository, but maybe that one would have been too large as well.
@NXBY @jasoneden Thanks a lot for your comments and insights!!
@NXBY great to hear that the code is running!! Would love to see the Q&A results as soon as they are ready!!
@jasoneden, I think you could try running the Colab with the --fp16 flag. It should reduce the size of the model (and thus the optimizer states too) by a factor of 2, so I believe it will then fit in the Colab notebook ;)
Regarding the prepending of the token, I think assigning an integer is fine, as that is what is expected. Also, the integers 0, 1 and 2 usually correspond to the unk, bos and eos tokens, see here for example: https://huggingface.co./bigscience/bloom-7b1/blob/main/config.json#L9 so maybe you could give it a try with that ;) Overall I would second your intuition here.
Yeah, I actually got it to work by assigning a static value of 0 (which made sense) but I worried a little about overwriting an expected value of some sort with that approach, which is why I switched to the sys.maxsize minus one value, which also seemed to work. If you're curious...
cls_index = 9223372036854775806
Thanks for the tip on the --fp16 flag! I'll give that a whirl.
Hey @NXBY!
I don't think it is necessary to save all the checkpoints indeed. Maybe you can slightly tweak the trainer here: https://github.com/huggingface/transformers/blob/ae3e3bc60a5f0834d952dfead4b28b1ce506125d/src/transformers/trainer.py#L2055 or here: https://github.com/huggingface/transformers/blob/ae3e3bc60a5f0834d952dfead4b28b1ce506125d/src/transformers/trainer.py#L2100 to save only once. I see that you can also change it here: https://github.com/huggingface/transformers/blob/ae3e3bc60a5f0834d952dfead4b28b1ce506125d/src/transformers/trainer_callback.py#L140 maybe? Let me know if you need more help!
Hi @ybelkada, thank you for your message and sorry for my late reply! The code is long and I took some time to read it, and I still have one question: in line 140 of the trainer_callback.py file you shared last time, we see that the default value of the parameter "should_save" is set to False.
So I imagine that it has been converted to True somewhere in the code that we run.
Do you think that in this for loop of trainer.py, we can do something in order that we set the parameter "should_save" to True only in the last epoch of the loop here for example? Or perhaps what I think is wrong.
Could you please share your ideas?
Thank you very much!
Hey all - Just updating the thread here. I was unable to get the training to completion on Colab (free version) but was able to tune the scripts a bit. As stated above, I set the cls_index value to a large number, and also figured out I needed to lower the batch size. On a single GPU, changing
--per_device_train_batch_size 4 \
...from the default value of 12 seemed to work on the Colab GPUs. The real issue is environment timeout though. The 560M parameter model is just too large for the training on the Squad data even with the minimal parameters set (I believe - I'd love for someone to show me I'm wrong...) So I have now switched to a Vertex AI workbench instance with a V100 GPU. Again, even there I had to lower the batch size, but the training is running significantly faster now. I believe this is a path to success - albeit one that will run up some cloud charges - and will report back once I have the model "fully trained." I have no illusions it will be a good model with these parameters, but as a proof-of-concept just having the steps in place that could be replicated on larger model / hardware / etc. has value I think.
EDIT: Had a newb error first time I tried training the model - forgot to change my output directory and ran out of disk space on my boot disk. If you're using Vertex like I am, make sure you prepend the output directory to something like this:
--output_dir /home/jupyter/tmp/debug_bloom_squad/
...instead of just /tmp.
Also, upped the batch size on Vertex from 4 to 6, no out of memory errors. Once I've got a model built, I may go back and toy with this to see how large I can make it without running into the out of memory error on the GPU.
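For anyone who wants to reproduce this, the tweaks mentioned so far (smaller batch size, --fp16, an output directory off the boot disk) can be collected in one place. The sketch below only builds the argument list for run_qa.py. The helper name is made up, and --save_strategy / --save_total_limit are the Trainer's standard checkpoint options, suggested here as a possible alternative to editing trainer.py; nobody in this thread has confirmed using them.

```python
# Sketch: assemble a run_qa.py command with the settings discussed in this thread.
# run_qa_args is a hypothetical helper; the values are examples, not recommendations.

def run_qa_args(model_name, output_dir, batch_size=4, fp16=True, save_total_limit=1):
    """Return the command as a list of strings, e.g. for subprocess.run."""
    args = [
        "python", "run_qa.py",
        "--model_name_or_path", model_name,
        "--dataset_name", "squad_v2",
        "--version_2_with_negative",
        "--do_train",
        "--per_device_train_batch_size", str(batch_size),
        "--output_dir", output_dir,
        # Assumption: checkpoint once per epoch and keep only the newest one(s).
        "--save_strategy", "epoch",
        "--save_total_limit", str(save_total_limit),
    ]
    if fp16:
        args.append("--fp16")
    return args

cmd = run_qa_args("bigscience/bloom-560m", "/home/jupyter/tmp/debug_bloom_squad/")
print(" ".join(cmd))
```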
"Experience enables you to recognize a mistake when you make it again." - Franklin P. Jones
OK - model is trained! Haven't tested it yet, but I'm calling this milestone complete.
The /tmp directory ended up burning about 600 GB of disk space for the model and settings I was using, so make sure you have enough disk space configured. (Another Achilles' heel of Colab I ran into.)
Thank you for all the support on this! Now to see if it actually works.....
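A quick sanity check on where that disk space may have gone. This is a rough sketch under several assumptions that are not stated in the thread: the Trainer's default of one checkpoint every 500 steps, roughly 130k training examples for squad_v2, batch size 6, 2 epochs, and about 12 bytes per parameter per checkpoint (fp32 weights plus two fp32 Adam states).

```python
import math

# Sketch: estimate the disk space used by intermediate checkpoints.
# All default values below are assumptions for illustration.

def checkpoint_disk_gb(n_params, n_examples, batch_size, epochs,
                       save_steps=500, bytes_per_param=12):
    """Approximate total size of all saved checkpoints, in GB."""
    total_steps = math.ceil(n_examples / batch_size) * epochs
    n_checkpoints = total_steps // save_steps
    return n_checkpoints * n_params * bytes_per_param / 1e9

estimate = checkpoint_disk_gb(n_params=560e6, n_examples=130_000, batch_size=6, epochs=2)
print(f"~{estimate:.0f} GB")
```

Under these assumptions the estimate comes out at several hundred GB, the same order of magnitude as the ~600 GB reported above, which suggests that most of the space is taken by intermediate checkpoints rather than by the final model.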
Wow that's so cool!!
You can also upload your model on the Hub so that we can try it out!! Can't wait to see the results!!
Yes, that is the plan! I will try to do that ASAP and let everyone know when it's live. It's not going to be too impressive, I don't suspect, but it may serve as a starting point!
Hey folks, I'm kicking the tires on my model and running into output I don't understand. See the screenshot below:
General issues:
- Output is not integer
- Start tokens are larger values than end tokens
I've walked through the question answering guide on colab and there's nothing really on how to deal with outputs like these. Any pointers?
I have uploaded the model (correctly, I think... please advise if not) for anyone that wants to give it a whirl:
https://huggingface.co./jasoneden/bloom560m-squad-helloworld/tree/main
Once I figure out how to upload from my Vertex AI Notebook, I will upload all of the checkpoint folders as well.
@jasoneden Your model ROCKS!!
And it works as expected on the Hub! I really appreciate your great effort in training the model!!
I think that you can use your model directly with pipeline, see: https://huggingface.co./docs/transformers/v4.23.1/en/main_classes/pipelines#transformers.QuestionAnsweringPipeline - it is how the inference API works under the hood.
Also this link could be a great pointer: https://huggingface.co./docs/transformers/v4.23.1/en/task_summary#extractive-question-answering
Hmm.
First off, thanks ybelkada for the pointers. I guess I'm a little surprised: I thought that the end result might be a generative rather than extractive model. In hindsight, maybe I shouldn't be. This actually begs a couple of questions:
Did it actually matter what data I used to train BLOOM, or since we're making it an extractive system, could I have used a small set of question / answer pairs and gotten similar results?
Is there a way to train BLOOM such that it results in a generative (and preferably, closed generative) question answering model?
I am curious if it would be possible to train the model for data-to-text generation. I attempted to use the Seq2Seq trainer for this, framing the problem as a generic text-to-text task, but it seems like the model only ever generates more "data". This set up worked for T5, but not for bloom. Has anyone else explored this?
Yeah, this is actually possible; we've seen this on our side. One of the reasons I believe T5 works well is that inputs and targets are processed separately (in different stacks), whereas a decoder-only model like BLOOM has only a single stack. One of the key issues is how you merge inputs and targets together for the decoder model:
- using a space would make it seem like a language modeling task: the model keeps predicting future tokens, as there is no strong delimitation between your "data" and your target text.
- using "\n" or clearer patterns makes it seem like you require something from the model.
- using special characters. The BLOOM models have a vocabulary that's larger than what the tokenizer can provide (we have 200 extra tokens). There are a bunch of reasons why you'd want to use special tokens (BERT has a mask token for masking single tokens, T5 uses special tokens as pointers for spans, etc.). So one idea is to append a special token to your input to signal the model to start generating the target.
Bear in mind those are my hunches.
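The third option above could look roughly like the sketch below, written on plain lists of token IDs. The separator ID and helper name are hypothetical. Masking the input positions with -100 (the label value that the PyTorch cross-entropy loss ignores by default) is an extra assumption that was not part of the suggestion above; it is a common way to train only on the target.

```python
# Sketch: merge input and target for a decoder-only model using a separator token.
# build_example and the toy IDs are hypothetical; -100 marks positions ignored by the loss.

IGNORE_INDEX = -100

def build_example(input_ids, target_ids, sep_id, eos_id):
    """Return (ids, labels) for one training example."""
    ids = list(input_ids) + [sep_id] + list(target_ids) + [eos_id]
    # No loss on the input or the separator, only on the target and the final EOS.
    labels = [IGNORE_INDEX] * (len(input_ids) + 1) + list(target_ids) + [eos_id]
    return ids, labels

ids, labels = build_example([11, 12, 13], [21, 22], sep_id=3, eos_id=2)
print(ids)
print(labels)
```

At inference time you would feed the input followed by the same separator and let the model generate until it produces the EOS token.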
We finetuned BLOOM to produce BLOOMZ, maybe our guide can help some people in this thread 🤗
Can you tell me how many GPUs (and how much memory each) you used to fine-tune BLOOMZ, and how long fine-tuning took? Also, can I fine-tune the 176B model on 16 A100 40 GB GPUs, without offloading to CPU or NVMe, with batch size = 1?
Sure, all GPU numbers are in the model cards of each bloomz model, e.g. for the big one: https://huggingface.co./bigscience/bloomz#hardware (also some comments here).
The time can be found in the logs. It will depend a lot on your hardware & the model layout though. For us, it took 142s / 4.2M tokens. So for the 2.1B tokens trained, roughly 20 hours.
[default7]: iteration 409/ 3100 | consumed samples: 837632 | consumed tokens: 1715470336 | elapsed time per iteration (s): 140.78 | learning rate: 2.000E-05 | global batch size: 2048 | lm loss: 1.117569E+00 | grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 14.547 | TFLOPs: 148.50
[default7]: iteration 410/ 3100 | consumed samples: 839680 | consumed tokens: 1719664640 | elapsed time per iteration (s): 141.97 | learning rate: 2.000E-05 | global batch size: 2048 | lm loss: 1.119031E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 14.425 | TFLOPs: 147.26 |
I think the 176B is not easily possible with 16 A100 40 GB. For the 7B1 it should be enough, probably with pipeline parallel = 4 & data parallel = 4 & tp = 1.
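The "roughly 20 hours" figure can be re-derived from the two log lines above: the difference in consumed tokens between iterations 409 and 410 gives the tokens per iteration. A small sketch of that arithmetic, using about 142 s per iteration and 2.1B total tokens as stated:

```python
# Sketch: re-derive the fine-tuning time from the logged numbers.

consumed_at_409 = 1715470336
consumed_at_410 = 1719664640
tokens_per_iter = consumed_at_410 - consumed_at_409  # about 4.2M tokens

secs_per_iter = 142   # approximate time per iteration from the logs
total_tokens = 2.1e9  # total tokens trained on, as stated above

iterations = total_tokens / tokens_per_iter
hours = iterations * secs_per_iter / 3600
print(f"{tokens_per_iter:,} tokens/iter, ~{iterations:.0f} iterations, ~{hours:.1f} hours")
```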
Actually, I do not have enough computing resources to fine-tune the 176B model entirely on GPU. In this case, I think I should try ZeRO-Offload or ZeRO-Infinity. Have you released the DeepSpeed checkpoints of the BLOOM models? I only see the Hugging Face checkpoints, not the original DeepSpeed checkpoints. I am trying to modify your code from https://github.com/bigscience-workshop/xmtf to run ZeRO-Offload and ZeRO-Infinity. By the way, I see in the training script: "# important: bf16 must use z0! it implements its own zero stage 1 equivalent". So we cannot use ZeRO-Offload (in ZeRO stage 2) or ZeRO-Infinity (in ZeRO stage 3) with BF16, right?
All Megatron-DeepSpeed checkpoints should be uploaded,
e.g.
- https://huggingface.co./bigscience/bloom-optimizer-states
- https://huggingface.co./bigscience/bloomz-optimizer-states
- https://huggingface.co./bigscience/bloom-7b1-optimizer-states
....
RE: BF16 + Zero2 & Zero3 best to open an issue on the DeepSpeed Repo: https://github.com/microsoft/DeepSpeed
Thank you so much for figuring things out for me! By the way, I think BF16 was used in pretraining the BLOOM model to avoid NaNs in the computation. For fine-tuning the 176B model, did you try FP16, or did you just use BF16 from the start?
BF16 is used for pretraining of the 176B because it's been shown to be more stable than FP16. We also wanted to use it for the smaller models but could not due to GPU constraints, thus they are pretrained in FP16.
We also used BF16 for the 176B finetuning (& FP16 for the small models). I think it's not a good idea to switch BF16 -> FP16 or FP16 -> BF16, as the numeric ranges are different. You can always switch from FP16/BF16 to FP32 though.
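To illustrate the point about numeric ranges, the largest finite value of each format can be computed from its bit layout: FP16 has 5 exponent bits and 10 mantissa bits, BF16 has 8 exponent bits and 7 mantissa bits. This is a small sketch of that arithmetic only; it does not touch any model weights.

```python
# Sketch: largest finite value of a binary floating-point format from its bit layout.

def max_finite(exp_bits, mantissa_bits):
    """Largest finite value for an IEEE-style format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

fp16_max = max_finite(exp_bits=5, mantissa_bits=10)
bf16_max = max_finite(exp_bits=8, mantissa_bits=7)
print(f"fp16 max: {fp16_max}")
print(f"bf16 max: {bf16_max:.3e}")
```

BF16 trades mantissa precision for a much wider exponent range, so values that are fine in BF16 can overflow when cast to FP16, while going from FP16 to BF16 loses precision instead.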
Thank you so much for the information. By the way, this question is not about fine-tuning, but I think you are an expert on the BLOOM model, so you may know this: do you have, or know of, a conversion script or the technique behind quantizing the BLOOM model to int8, as in microsoft/bloom-deepspeed-inference-int8?
I think it uses bitsandbytes, but @ybelkada & @stas are much more knowledgeable on that than I am, so they may know more 🤗
@khaimaitien, thanks for reaching out!
There are 2 ways to run BLOOM in int8; let me explain how you would do it using bitsandbytes:
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, BloomForCausalLM

# Placeholder from the original post: set this to the path or Hub ID of a BLOOM checkpoint.
PATH_TO_BLOOM = "XXX"

tokenizer = AutoTokenizer.from_pretrained(PATH_TO_BLOOM)
# load_in_8bit=True loads the weights in int8 via bitsandbytes; device_map="auto" spreads them over the available GPUs.
model = BloomForCausalLM.from_pretrained(PATH_TO_BLOOM, device_map="auto", load_in_8bit=True, torch_dtype="auto")

input_text = "Hey my name is"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
You will need at least ~180GB of GPU RAM to run it. Let me know if you run into any issues by opening a new issue dedicated to int8 inference!
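The ~180GB figure follows from simple arithmetic: at one byte per parameter, 176B parameters take about 176 GB for the weights alone. A small sketch comparing dtypes, assuming a nominal 176e9 parameters and ignoring activations and any layers that stay in higher precision:

```python
# Sketch: memory needed just to hold the weights, by dtype.
# Assumption: nominal 176e9 parameters; activations and overhead are not counted.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params, dtype):
    """Approximate size of the weights in GB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(176e9, dtype):.0f} GB")
```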
Thank you for answering my question. Actually I mean do you have the script or know how to convert the model to int8 and save the weights then later we can use the int8 weight in loading the model. For example, I think that: microsoft/bloom-deepspeed-inference-int8 is converted from bigscience/bloom.
Hey folks - I just delivered my Capstone presentation for a M.S. in Health Data Science program, which was the reason I was exploring BLOOM as a QA system. Two potential things of interest:
- I uploaded a model tuned on Covid 19 data from the CDC, 100 epochs over a set of 10k QA pairs: https://huggingface.co./jasoneden/BLOOM-560-QA-CDC_Covid19-100epochs
In the files section, I have posted the qa train and dev pairs I used to build and evaluate the model performance.
- If you want to follow the code journey (which is, full disclosure, a spaghetti mess, but all of the basics are in there) I've uploaded everything to github: https://github.com/jasondeden/capstone
Of particular note in the code is the ETL required to take output from Haystack QA production and convert it into SQuAD data format, but there's a lot of work that went in there that I am happy to share and see others improve on (shouldn't be too hard...) :)
Thanks to @ybelkada x 100,000 for the major assistance. I'm not sure I could have done it without your help / intervention.
Hi @jasoneden , I'm fairly new to all this, but just tried out the Question Answering fine-tuned version of BLOOM-560M. For some reason, when I ask questions in English, it provides answers in Tamil. Is it normal for it to do this? If not, do you know what I might be able to do to ensure the answers come out in English instead?
@rbos - I honestly have no idea how you're getting it to do that. As far as I know, the model should pull something out of the context you provide and supply that as an answer, so if you're supplying an English context and you're getting Tamil back, you've found something I'd be very interested to figure out how it happened. Can you provide a screenshot?
Thanks so much @jasoneden for your message, very happy that I was able to help here ;) And congratulations on your presentation!! Looking forward to your next contributions.
Out of curiosity, is there any recording of the presentation?
Hi everyone!
We've made a proof-of-concept of decentralized BLOOM-176B fine-tuning and inference from Google Colab. If you're interested in doing that without a GPU cluster, please take a look at this thread: https://huggingface.co./bigscience/bloom/discussions/152
@jasoneden Yeah, this is the output I am getting. I noticed that the Tamil word it gives me means "Early" in English, and it keeps repeating it. Even when I change my question, it still gives me the same exact output word in Tamil (again):
@rbos - Ah, I think I know what the problem is. You're trying to use the model as a generative model, when it has been trained for extractive QA. You actually need to supply a context for it to pull the answer from, rather than just asking it a question.
Here's an example of the type of code the model is expecting:
https://github.com/jasondeden/capstone/blob/main/model-train-and-final-ETL/QnATests.ipynb
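As a rough illustration of what "supplying a context" means, here is a minimal sketch using the `transformers` question-answering pipeline. It is not taken from the linked notebook. The helper names are made up for this example, and it assumes the model repository mentioned earlier in the thread loads through `pipeline("question-answering", ...)` with its own tokenizer files, which has not been checked here.

```python
def build_qa_input(question: str, context: str) -> dict:
    """Package a question together with the passage the answer must come from."""
    if not context.strip():
        # An extractive model can only return a span of the context,
        # so a question on its own is not a valid input.
        raise ValueError("Extractive QA needs a non-empty context.")
    return {"question": question.strip(), "context": context}


def answer_from_context(question, context,
                        model_id="jasoneden/bloom560m-squad-helloworld"):
    """Run extractive QA; downloads the model on first use."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    qa = pipeline("question-answering", model=model_id)
    # The pipeline returns a dict with 'answer', 'score', 'start' and 'end'.
    return qa(**build_qa_input(question, context))


# Example call (commented out because it needs a model download):
# result = answer_from_context(
#     "How does COVID-19 mainly spread?",
#     "COVID-19 spreads mainly through respiratory droplets.",
# )
# print(result["answer"], result["score"])
```

The returned `start` and `end` are character offsets into the context, which is why the answer can only ever be text that already appears in what you passed in.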
@rbos try removing trailing space in your prompt.
@jasoneden Thanks for the explanation and code to look at. Being new to this, what I had originally thought was that fine-tuning the BLOOM model would allow someone to use the usual generative format to ask questions and obtain answers that would still come from the original training set that BLOOM was trained on, but with an enhanced ability to perform better specifically on the task of Question Answering compared to other types of tasks. Is your goal eventually to tweak the code to what I've just mentioned in a generative model approach, or is it to extract information from a context sentence/paragraph provided to the model?
My interest in fine-tuning for the Question Answering task actually is based on the results in the BLOOM paper, "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model", where it showed that BLOOMZ had better zero-shot performance on several different types of (I think non-Question-Answering) tasks because it had been fine-tuned. The same thing occurred with DeepMind's Flamingo few-shot learning vision-language model when it was fine-tuned (Alayrac et al., 2022): better performance for the fine-tuned model over the regular few-shot generative model.
So, perhaps I misunderstand the goal of fine-tuning a generative model (and/or the task of Question Answering itself), but can the BLOOM model be fine-tuned and still be used as a generative model - just one that performs better on the fine-tuned task?
@TimeRobber - I tried that too. The same problem occurs, unfortunately.
@rbos - From a generative QA standpoint, you could tune the standard BLOOM model to about any text dataset and provide it prompts like you were trying, and it would come up with an answer (probably in the same language you asked it in... lol... unless you were specifically asking it to do a translation.) However, my project was specific to QA, and specifically related to BLOOM as a SQuAD-type modeling approach. If you have a general generative model, you could write a program (or use a BERT model, etc.) to turn a question into a prompt, feed that prompt to BLOOM, and have it finish the sentence. I was trying to see what happens if you take a generative model and try to turn it into a specific-use QA tool. Thus, the extractive QA head, need for context, etc.
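A minimal sketch of the "turn a question into a prompt" step, assuming a fixed text template rather than a BERT model. The function name and the `Question: ... Answer:` template are illustrative choices, not something BLOOM requires. The result is stripped of trailing whitespace, in line with the earlier advice in this thread about trailing spaces in prompts.

```python
def build_prompt(question: str,
                 template: str = "Question: {question}\nAnswer:") -> str:
    """Turn a raw user question into a completion-style prompt."""
    # Collapse runs of whitespace and make sure the question ends with '?'.
    cleaned = " ".join(question.split())
    if not cleaned.endswith("?"):
        cleaned += "?"
    # No trailing whitespace, so the model continues right after 'Answer:'.
    return template.format(question=cleaned).rstrip()


print(build_prompt("  what is machine   learning "))
```

The prompt would then be tokenized and passed to the generative model, which is expected to continue the text after `Answer:`.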
In my case, if you read through my research paper, you'll find out BLOOM performs pretty poorly in this task relative to Encoders - which, if you think about it, makes sense. You're asking a generative model - something which is designed to "know the right answer" - to do a task it is not built for, i.e. "find an answer." So the BLOOM QA-specific model on SQuAD and similar (I trained a model on CDC data as well) just doesn't really know what to do. Sometimes it's right, but it doesn't know how to be sure that it is, and sometimes it's very, very wrong. But in all cases, the extractive QA approach requires the provision of context.
If I were to start over and try to build a generative QA model, I wouldn't use the SQuAD approach, and instead would simply train standard BLOOM on a set of data files. I would then figure out how to write a front-end that does what I described above: takes a question, reformats it into a prompt (maybe using a BERT variant?), feeds the prompt to the custom-trained BLOOM model, and then see how well it did in responding. In this context, things like "exact match" or EM scores - one of the key metrics used to evaluate for SQuAD - don't really make sense, because you might say the same thing three different ways and you'd still be right each time. Then again, for a commercial application, you're back to the significant risk inherent in generative models: getting back an answer that would potentially get a human being fired. :)
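To make the point about EM concrete, here is a small sketch of an exact-match score. The normalization steps (lowercasing, dropping punctuation and the articles a/an/the, collapsing whitespace) are modelled on the SQuAD v1.1 evaluation script, but this is a simplified re-implementation for illustration, not that script.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list) -> int:
    """Return 1 if the prediction equals any reference after normalization."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(ref) for ref in references))


# A formatting-only difference still counts as a match...
print(exact_match("The mask.", ["mask"]))
# ...but a correct paraphrase from a generative model does not.
print(exact_match("Wear a mask in public", ["Masks should be worn in public"]))
```

The second call is scored as a miss even though the two sentences say the same thing, which is the reason EM is a poor fit for free-form generated answers.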
I hope that helps / makes sense. In the end, my efforts to train a BLOOM model for QA were a failure in the sense of "I built a better mousetrap" but a success in the sense of "that's one more way not to invent a lightbulb." Looking forward to seeing what others have done and how that moves the ball forward for BLOOM!
@jasoneden That's really interesting. Yeah, I’m hoping to keep both the question and answer in English, lol.
For a generative QA approach, would it be best to just fine-tune it on any text at all, or would it be better to fine-tune it on a huge number of pairs of questions and correct answers without any context paragraph (e.g., just so that BLOOM understands more of what correctly providing specific correct answers to specific questions is like, and then I assume try to replicate this same QA process on the data it was originally trained on)? If it would be best to fine-tune it on regular text instead, wouldn't that just be teaching the model to do more of what it already knows how to do? Or would it somehow become better at answering questions in the process?
For writing a program to turn a question into a prompt, would it need to be different from the BLOOM model code I showed earlier, with prompt = “some question here”, feeding it into the tokenizer, and feeding the tokenized prompt to the fine-tuned model?
One other thing: do you know why the answers that the BLOOM model gives in the output – at least on the Hugging Face Inference API – will often repeat themselves over and over until the maximum_length argument’s number of characters has been reached? I’m assuming there must be some way to insert a line of code to prevent the model from repeating itself. I’ve noticed BLOOM will even sometimes present its own new questions for itself to then answer (sometimes not even related to the user's original question).
If anyone else knows the answers to the questions above, your help would be much appreciated!
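On the repetition question, one thing that may be worth trying when running the model locally is the set of decoding options in the `transformers` `generate` method. The sketch below is only a starting point under assumptions: the specific values are illustrative and untuned, `bigscience/bloom-560m` is just a small checkpoint picked for the example, and the hosted API may expose these options differently. Note that in `transformers` the length limits count tokens, not characters.

```python
# Illustrative decoding settings; the values are guesses, not tuned results.
GENERATION_KWARGS = {
    "max_new_tokens": 64,         # upper bound on generated tokens
    "repetition_penalty": 1.2,    # down-weights tokens that already appeared
    "no_repeat_ngram_size": 3,    # forbids repeating any 3-token sequence
    "do_sample": False,           # greedy decoding
}


def generate_answer(prompt, model_id="bigscience/bloom-560m", **overrides):
    """Generate a continuation with repetition controls (downloads the model)."""
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    kwargs = {**GENERATION_KWARGS, **overrides}
    output_ids = model.generate(**inputs, **kwargs)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (commented out because it needs a model download):
# print(generate_answer("Question: What is machine learning?\nAnswer:"))
```

These settings reduce literal loops but do not stop the model from inventing follow-up questions; that usually needs a stop condition or post-processing on top.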
We finetuned BLOOM to produce BLOOMZ, maybe our guide can help some people in this thread 🤗
According to the paper "Crosslingual Generalization through Multitask Finetuning", BLOOMZ was trained from BLOOM, which is a decoder-only model. However, the BLOOMZ training script, finetune_t0.py, appears to be based on T5, which is an encoder-decoder model, and I cannot find the actual finetune_t0.py file right now. Am I missing something?
The naming is bad, but it uses the finetune_t0.py file. With t0 we just wanted to refer to the multitask finetuning part.
You can find it if you switch to the t0loading branch.
Hey. I have been exploring BLOOM and its API closely. I have learned how the parameters affect the response. After getting the responses, I usually process them and remove the garbage tokens the model has produced.
Is there any way that BLOOM can stop producing more tokens once the context of the sentence is complete?
GPT-3 basically stops itself when the context of the sentence is complete or the prompt is answered.
But BLOOM produces extra tokens just to fill the max_token length provided.
Is there any way BLOOM can stop producing tokens once the context is complete, just like GPT-3?
For example: What is Machine Learning?
BLOOM will answer it fine in the first 2 or 3 sentences, but then produces random text (garbage - it may be some code or regular expression formulas).
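The post-processing step described above could look something like the sketch below. This is an assumption about one way to do it, not the poster's actual code: the function name, the default stop sequences, and the sentence-splitting rule are all illustrative, and the right stop strings depend on the prompt format in use.

```python
import re


def truncate_at_stop(text, stop_sequences=("\n\n", "\nQuestion:"),
                     max_sentences=None):
    """Cut generated text at the earliest stop sequence, then cap sentences."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    result = text[:cut].strip()
    if max_sentences is not None:
        # Naive sentence split: whitespace that follows '.', '!' or '?'.
        sentences = re.split(r"(?<=[.!?])\s+", result)
        result = " ".join(sentences[:max_sentences])
    return result


raw = (
    "Machine learning is a field of AI. It learns from data.\n\n"
    "import re\nre.compile('a+')"
)
print(truncate_at_stop(raw))
print(truncate_at_stop(raw, max_sentences=1))
```

This only trims the text after generation; the model still spends time producing the discarded tokens.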
You may want to try BLOOMZ, which stops by itself when it deems the question to be answered. Questions like "What is Machine Learning?" should work quite well.
Thanks @Muennighoff
I cannot find the Inference API for BLOOMZ. Can you provide a link to the page so I can check out the API? Other than that, I do have an idea for making it work using transformers.
The widget on the hub is turned off, but you can try it using this colab: https://huggingface.co./bigscience/bloomz/discussions/28
The model is also available here, but with some prompting already done: http://chat.petals.ml/
@RafayPunch @Muennighoff Regarding http://chat.petals.ml, feel free to drop default prompts by clicking "Enable few-shot mode" at the bottom of the page.
raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://https://adb-5844287990719898.18.azuredatabricks.net/api/2.0/mlflow/runs/create failed with exception HTTPSConnectionPool(host='https', port=443): Max retries exceeded with url: //adb-5844287990719898.18.azuredatabricks.net/api/2.0/mlflow/runs/create (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5d8df50af0>: Failed to establish a new connection: [Errno -2] Name or service not known')).
How can I solve this issue?
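One observation from the error text itself: the request URL starts with `https://https://`, and the connection pool reports `host='https'`. That pattern is what you get when a URL with a doubled scheme is parsed, so the configured host value may already include `https://` and have it prepended a second time. The sketch below demonstrates the parsing behaviour with the standard library and shows a small, hypothetical helper (`normalize_tracking_uri` is not an MLflow function) for collapsing a repeated scheme. Where the host is actually configured in this setup is not visible from the log.

```python
import re
from urllib.parse import urlsplit

# One scheme followed by one or more repeated schemes at the start of the URI.
_REPEATED_SCHEME = re.compile(r"^(https?://)(?:https?://)+", re.IGNORECASE)


def normalize_tracking_uri(uri: str) -> str:
    """Collapse a repeated http(s):// prefix into a single scheme."""
    return _REPEATED_SCHEME.sub(r"\1", uri.strip())


bad = ("https://https://adb-5844287990719898.18.azuredatabricks.net"
       "/api/2.0/mlflow/runs/create")

# With the doubled scheme, the parsed host is the literal string 'https'.
print(urlsplit(bad).hostname)

fixed = normalize_tracking_uri(bad)
print(fixed)
print(urlsplit(fixed).hostname)
```

If the parsed host of the corrected value is the workspace address rather than `https`, the name-resolution error in the traceback should no longer be triggered by the URL itself.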
When I tried chat.petals.ml, everything I tried produced a Python error, pretty much boiling down to this final line:
File "/home/borzunov/.local/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon_bindings/utils.py", line 72, in raise_if_failed
raise ControlFailure(f"Connect failed. msg={response.error.msg}")
hivemind.p2p.p2p_daemon_bindings.utils.ControlFailure: Connect failed. msg=routing: not found
Thoughts...?
Hi, I'm trying to implement extractive QA with the bloom-560m model. I need the training script for BLOOM extractive QA and the implementation steps. Please help me.
Here's the work I did for my academic project. That was a while ago now, so I don't know if the code is still functional, and I will not be able to provide support or advice for anything that isn't working today. However, to the degree it's helpful:
https://github.com/jasondeden/capstone
Good luck!
Thanks @jasoneden
Hey @RafayPunch, you mentioned that you know how to tackle this problem using transformers. I've been stuck on the same issue and haven't found a solution for it yet. If you have, could you please explain how you did it?
Where can I view the TensorBoard for the tr13 experiments?
https://github.com/bigscience-workshop/bigscience/blob/master/train/tr13-mtf/tr13-176B-mtf-xp3mt.slurm
I want to use the bloomz-7b1-mt version of the model and make it more ChatGPT-like for my language, Punjabi. Is there a way I can strip out the tokenizer entries and embeddings for languages other than the one I want? This can be done for mT5, where it reduced the model size by more than half. Also, are there any smaller versions of this model coming soon, since I don't have access to a cluster of GPUs?
I have seen multiple tutorials on using the QLoRA and PEFT techniques to fine-tune many 7B-parameter models, but they don't seem to work for this one. I want to fine-tune it on the free version of Colab and I don't want it to take up much space. Can anyone please help?
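As a rough way to reason about whether a 7B-parameter model fits on a small GPU, here is a back-of-the-envelope sketch. It counts weight storage only, so activations, gradients, optimizer state, and quantization overhead come on top. The 7.1e9 parameter count is the nominal figure implied by the model name, and the LoRA values are illustrative, untested settings. The one BLOOM-specific detail is `query_key_value`, the name of the fused attention projection in the `transformers` BLOOM implementation, which is the usual LoRA target for this architecture.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Illustrative LoRA hyperparameters (assumptions, not tuned values).
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["query_key_value"],
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Build a peft LoraConfig from the settings above (requires peft)."""
    from peft import LoraConfig  # lazy import so the estimate runs without peft

    return LoraConfig(**LORA_KWARGS)


for precision in ("fp16", "int8", "4bit"):
    print(precision, round(weight_memory_gb(7.1e9, precision), 2), "GB")
```

By this estimate the fp16 weights alone take about 14.2 GB, while 4-bit weights take about 3.55 GB, which is why quantized loading combined with small trainable adapters is the usual route on a single small GPU.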
You can try e.g. https://huggingface.co./bigscience/bloomz-3b or https://huggingface.co./bigscience/bloomz-1b7, which should be similar to bloomz-7b1-mt. For -mt, 7b1 is the smallest one though.