Incomplete Output even with max_new_tokens
So the output of my model ends abruptly, and ideally I want it to complete the paragraph/sentence/code it was in the middle of.
I have set max_new_tokens = 300 and also ask in the prompt to limit the response to 300 words.
The response is always long and ends abruptly. Is there any way to ask for a complete output within the desired number of output tokens?
I think you need to set the max token output higher than the max word count... For example, set it to 350 or 380.
@sniffski this is my current configuration
from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    top_k=10,
    temperature=0.01,
    pad_token_id=tokenizer.eos_token_id,
    early_stopping=True,       # note: only affects beam search, not sampling
    max_new_tokens=300,
    return_full_text=False,    # note: this is a pipeline argument, not a GenerationConfig field
)
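For reference, a config like this is typically consumed by model.generate(); a minimal sketch, assuming the Mistral checkpoint mentioned later in this thread and an already-defined prompt string:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name taken from the API URL later in this thread; adjust to your model.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)

# Decode only the newly generated tokens after the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```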
So what change are you proposing here?
Well, the first thing you should know is that one word is not one token... I think the rule of thumb is that one token is about 0.75 words, so if your prompt asks for an answer of no more than 300 words, you need to set max_new_tokens=400,
because 300 / 0.75 = 400.
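A quick way to see the word/token ratio for yourself is to tokenize some sample text; a rough sketch, assuming the Mistral tokenizer used elsewhere in this thread:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

text = "Trees are one of the most important elements of nature."
n_words = len(text.split())
n_tokens = len(tokenizer.encode(text, add_special_tokens=False))

# For English prose the token count is usually noticeably higher than the word count.
print(f"{n_words} words -> {n_tokens} tokens")
```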
I tried with max_new_tokens=400, but the response still ends abruptly. The generation stops as soon as it reaches the specified max_new_tokens, without checking whether the sentence is complete or not.
Can you copy the output into a temp file like wc-test.txt,
then run wc wc-test.txt in a shell
to see how many words there are (the second number in the output of the wc command)? If there are more than 300, then the model isn't obeying your prompt's request for a maximum word count, and the issue is not the max token limit... In that case I guess you would need to find a better prompt. Try starting with something like "Your task is to respond with 300 words or less..."
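If you'd rather do that check in Python, a rough equivalent of the wc count (assuming the same wc-test.txt file) is:

```python
# Count whitespace-separated words in the saved output, like `wc -w`.
with open("wc-test.txt") as f:
    n_words = len(f.read().split())
print(n_words, "words")
```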
I have a similar case. I am using the Inference API to run this model; my output is incomplete and usually about the same length (65-80 words), and most of the time it doesn't even end correctly. See below for an example input and output, followed by part of the code.
Input : "Write a detailed essay about trees"
Output : "Trees are one of the most important elements of nature. They provide us with oxygen, clean the air, and provide shade from the sun. They also provide us with a variety of other benefits, such as providing food and shelter for animals, and helping to regulate the climate. Trees are also a source of beauty and inspiration, and can be used to create a sense of calm and peace in our lives. In this essay, I will explore the many benefits of trees, as well as their"
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1"
HEADERS = {"Authorization": "Bearer xxxxxx"}

def query(payload):
    try:
        response = requests.post(API_URL, headers=HEADERS, json=payload)
        response.raise_for_status()
        return response.json()
    ..............
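For what it's worth, the Inference API accepts generation settings in a "parameters" dict alongside "inputs". A hedged sketch of how the query helper above might be called, with parameter names taken from the text-generation task docs (double-check them for your deployment):

```python
payload = {
    "inputs": "Write a detailed essay about trees",
    "parameters": {
        "max_new_tokens": 400,      # room for roughly 300 words
        "return_full_text": False,  # don't echo the prompt back
    },
}
result = query(payload)
if result:
    # The API typically returns a list with one dict per generated sequence.
    print(result[0]["generated_text"])
```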
@SumanVakare how did you solve this issue?
Not solved yet; I am looking for a solution too.
Has anyone found a solution for this abrupt ending issue?
Running into this issue as well.
Hey guys, anyone found a solution?
There's no "solution" to what you're asking. The model's output is limited by the number of tokens you allow it to generate. If the expected response is longer, either increase max_new_tokens so it has room to write whatever it wanted to, or use the continue button/function in your Web UI (or whatever other UI you use) to ask the model to continue with whatever it was writing (a scripted equivalent is sketched after the notes below).
Notes:
- tokens are more like 2-4 characters than whole words
- If you're using this model in particular, it's not instruction tuned, so it may continue to output BS forever (well, until it hits the max token limit). What you want is likely the instruction-tuned version of Mistral.
- It's useless to specify the number of words you expect in your prompt: LLMs can't do real math, and are even less capable of counting their own words as they produce output. At best, you can give indications like "one short paragraph" or "two long paragraphs", which generally gives better results (with instruction-tuned models, at least; with this base model I'm not so sure).
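If you don't have a UI with a continue button, a rough equivalent in plain transformers code is to feed the truncated text back in and generate again. This is only a sketch, assuming model, tokenizer, and prompt are already set up:

```python
# Keep generating until the model emits EOS or a round limit is reached.
text = prompt
for _ in range(3):  # at most 3 continuation rounds
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=300,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated part and append it to the running text.
    new_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    text += tokenizer.decode(new_ids, skip_special_tokens=True)
    if new_ids[-1].item() == tokenizer.eos_token_id:
        break  # the model finished on its own
print(text)
```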