How to get better responses and prevent repetition?
I have been running dolly-v2-12b quantized and have tried to copy the config from https://github.com/databrickslabs/dolly. I have found the model tends to hallucinate in responses or produce lots of repetition; it often returns multiple responses to the same instruction. Is anyone else experiencing this, and has anyone found config/prompt changes to resolve it?
e.g.
Instruction:
give me 5 examples of colours
Response:
green, blue, orange, red, yellow
Green - plants and trees contain chlorophyll, which is colourless.
Blue - Paint contains cobalt or ultramarine, which is a pale blue.
Orange -Found in carrots, and many other fruits and vegetables.
Red - Colour found in most plants and animals, as well as many flowers.
Yellow - Paint contains Ferric Oxalate, which gives it its colour.
Here are some more examples of colours:
White - Found in snow and frost, as well as pure desert sand.
Black - The absence of colour.
Magenta - Found in Middle Eastern rugs, as well as the rosseta, a traditional Spanish outfit.
Green blue - Often refers to the spectrum of colour when blue and green are put together.
Here are 5 more examples of colours:
White - Often found in clothing and other items, as well as behind modern decor.
Black - Always the absence of colour.
Magenta - Deep, intense red colour makes it a good choice for bloodstain pattern.
Green blue - Often refers to the spectrum of colour when blue and green are put together.
Here are
Setup:
I have copied the prompting technique from their GitHub: https://github.com/databrickslabs/dolly/blob/master/training/consts.py
Which looks like:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
give me 5 examples of colours
### Response
And a generation config to match their GitHub:
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=256,
    top_p=0.92,
    top_k=0,
    do_sample=True
)
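For completeness, the generation call itself looks roughly like this (a simplified sketch; model and tokenizer are loaded from databricks/dolly-v2-12b, and the variable names are mine):

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\ngive me 5 examples of colours\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, generation_config=generation_config)
# decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:]))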
I have experienced this when the EOS token isn't dealt with properly. Are you using the generation code snippet, or a pipeline()? Those should work in this respect, but if you're doing this manually, like I have in some other cases with HF pipelines, I find you need to explicitly tell the pipeline the EOS token. Barring that, you can try setting repetition_penalty to something above 1.0 (e.g. 1.2), that kind of thing, to discourage this.
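Something along these lines, for example (a rough sketch with hypothetical values; it assumes "### End" was added to the tokenizer as a single token, as in the Dolly training code):

# Pass the end-of-response marker explicitly and penalize repetition
end_key_token_id = tokenizer.encode("### End")[0]  # assumed to be a single added token
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    top_p=0.92,
    top_k=0,
    do_sample=True,
    eos_token_id=end_key_token_id,
    repetition_penalty=1.2,  # values above 1.0 discourage repetition
)

The same generate kwargs should also be accepted when calling a text-generation pipeline.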
Echoing @srowen, it looks like you haven't configured the EOS token. Make sure you are using the pipeline, as this will use the pipeline code in this repo for generation. From your example it appears that maybe the response ends after "green, blue, orange, red, yellow", but that the EOS token is being ignored and then the generation continues.
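For reference, loading via the pipeline looks something like this (per the model card; the exact call may differ slightly depending on your transformers version):

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")
res = generate_text("give me 5 examples of colours")
print(res)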
Following the updated model card instructions, I used LangChain to create some examples.
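For reference, llm_chain here is set up roughly along the lines of the model card's LangChain example (generate_text being the pipeline loaded with trust_remote_code=True, as above):

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# simple instruction-only prompt, passed straight through to the Dolly pipeline
prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")
llm_chain = LLMChain(llm=HuggingFacePipeline(pipeline=generate_text), prompt=prompt)

Then I ran: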
for _ in range(20):
    print(llm_chain.predict(instruction="give me 5 examples of colours").lstrip())
    print("======")
Output:
blue, red, green, yellow, black
======
- red
- blue
- yellow
- green
- purple
======
Red
Blue
Yellow
Green
Indigo
======
violet, blue, green, yellow and orange
======
Red
Yellow
Blue
Green
Purple
======
red
blue
yellow
green
orange
======
- red
- blue
- green
- orange
- purple
======
blue, yellow, green, red, black
======
black, blue, green, yellow, orange
======
- orange
- blue
- green
- red
- purple
======
black, white, yellow, green, blue
======
Red
Yellow
Blue
Orange
Green
======
Red
Yellow
Blue
Green
Orange
======
black, white, blue, red and green
======
black, white, blue, red, yellow
======
The colours can be divided in to 3 primary colours, and 2 secondary colours. The primary colours are, Red, Yellow, and Blue. The secondary colours are, Orange, and Turquoise.
======
Blue, green, orange, purple, red
======
red
blue
yellow
green
pink
======
blue, red, green, yellow, black
======
- red
- blue
- green
- yellow
- purple
======
Hi @jacobgoss, quick question: when you say quantized, did you quantize it yourself, or is it available here on the Hub?
@7th-Samurai You can quantize the model when loading it from Hugging Face by using the load_in_8bit kwarg, like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map='auto', load_in_8bit=True)
I was able to fit the 12b model onto a g5.2xlarge instance on AWS, which has 32 GB of RAM and an A10 GPU. This does require a couple of extra libraries to be installed, such as accelerate and bitsandbytes.
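If you don't already have them, they can be installed with:
pip install accelerate bitsandbytes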
@jacobgoss Great! Thanks!
@srowen @matthayes Thanks for the advice.
What's still confusing me is that if the EOS token were not configured correctly, the model would be generating the EOS token and then continuing to generate after it. However, the model isn't outputting the EOS token (### End) or its token_id at all.
I have tried the smaller model without quantisation and it seemed to work as expected (generating the EOS token and stopping generation), however the quantized 12b model just never generates the EOS token.
I wonder if it's some issue with the quantization.
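For reference, this is roughly how I checked (a sketch; it assumes "### End" encodes to a single added token, as in the Dolly training code):

# look up the id(s) for the end-of-response marker and check the generated ids for it
end_token_ids = tokenizer.encode("### End")
print(end_token_ids)

output_ids = model.generate(**inputs, generation_config=generation_config)
generated = output_ids[0, inputs["input_ids"].shape[1]:].tolist()
print(any(t in generated for t in end_token_ids))
# decode without skipping special tokens so "### End" would be visible if emitted
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))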
I'm facing the same issue when using the quantized model.
Are you using the current InstructPipeline that you get when you load with pipeline()? It will at least handle the end-of-sequence token as intended. That doesn't mean you can't get repetition, but if it's very apparent, you probably aren't doing what this model's pipeline does.