HuggingChat: Input validation error: `inputs` tokens + `max_new_tokens` must be..
I use the meta-llama/Meta-Llama-3-70B-Instruct model. After a certain number of turns, the AI refuses to continue and gives an error: "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6391 `inputs` tokens and 2047 `max_new_tokens`". Is this a bug or some new limitation? I still don't get it to be honest and I hope I get an answer here. I'm new to this site.
Same issue all of a sudden today.
Can you see if this still happens? Should be fixed now.
I keep getting this error as well. Using CohereForAI
Same error, "Meta-Llama-3-70B-Instruct" model.
I have also been running into this error. Is there a workaround or solution at all?
"Input validation error: inputs
tokens + max_new_tokens
must be <= 8192. Given: 6474 inputs
tokens and 2047 max_new_tokens
"
Using the meta-llama/Meta-Llama-3-70B-Instruct model.
Keep getting the same error on llama3-70b. If the prompt crosses the context length, shouldn't it automatically truncate or something like that?
It happens more often than not, even when using like 7 words.
Using the meta-llama/Meta-Llama-3-70B-Instruct model.
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6477 `inputs` tokens and 2047 `max_new_tokens`
Happening to me right now: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6398 `inputs` tokens and 2047 `max_new_tokens`
Just to check, are you having long conversations and/or using the websearch? Sorry for the inconvenience, trying to find a fix.
Happens even without web search, just long conversation.
No web search, and not really long. The old conversation should be somewhere around 8,000 tokens, like the error says:
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 8015 `inputs` tokens and 2047 `max_new_tokens`
In a new chat the conversation was shorter before I got the same error. Again, as stated in the error, it should be somewhere around 6,150 tokens:
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6150 `inputs` tokens and 2047 `max_new_tokens`
I was having a long conversation without web search
Likewise, a long conversation without a web search.
There is no inconvenience at all, we appreciate your time and effort trying to fix this. On my end, no web search here, just default. "Assistant will not use internet to do information retrieval and will respond faster. Recommended for most Assistants."
This is definitely a weird bug. It doesn't matter how many words you use in the prompt, it just throws the error and blocks you; you can reduce the prompt to one word and it will still throw the error.
Seems to happen with long conversations. Like I'm hitting a hard limit. I could do a token count if that helps.
I'd really appreciate it if you could count the tokens, indeed. You can grab the raw prompt by clicking the bottom right icon on the message that gave you the error. It will open a JSON with a field called `prompt` which contains the raw prompt.
Otherwise if someone feels comfortable sharing a conversation, I can have a look directly.
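If it helps, counting the tokens of that raw prompt locally could look roughly like this (a sketch assuming the transformers library is installed and you have access to the Llama 3 tokenizer, which is a gated repo):

# Sketch: count the tokens of the raw prompt copied from the debug JSON.
# Assumes `transformers` is installed and the (gated) Llama 3 tokenizer is accessible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

raw_prompt = open("prompt.txt").read()  # paste the `prompt` field from the JSON here
n_tokens = len(tokenizer(raw_prompt)["input_ids"])
print(f"{n_tokens} input tokens; with 2047 max_new_tokens the total must stay <= 8192")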
Here you go: https://hf.co/chat/r/v_U0GXB
Here you go, please: https://hf.co/chat/r/7MLJ8EX
Hi there! It happens here too. Here's my conversation https://hf.co/chat/r/1yeBRAV
Thanks in advance!
Anyone know if this is fixed?
nope, still same error.
I asked internally, trying to get to the bottom of this, sorry for the inconvenience!
I'm also getting this problem. It's very annoying. I know the service is free, but I wouldn't mind paying for it if it got rid of this error.
When will they fix this error? It's really annoying, especially when I was trying to get Llama 3 to fix the code.
Any news on fixing this bug?
Hi. I also have the same problem when sending a link to a Facebook page, but I've already done that in other chats and there was no problem.
The issue for me is that I need to switch to a new conversation because I can't use the chat anymore, and that's a problem because I was using it to deliver a business service.
I would appreciate it very much if you keep trying; I can share the conversation as well.
Running into the same issue: I'm iterating over a defined set of strings, trying out the best prompting strategy, and it gives me this error with random strings at random times. Can't make sense of it. Using the meta-llama/Meta-Llama-3-8B-Instruct model.
Any updates?
I'm also getting the same problem. Can I help in any way?
I am getting the same error as well, usually in long conversations that involve code reviews, documentation, etc.
Yeah, still getting this issue, it's so annoying.
Bruh it is never going to be fixed I guess 😭
Same issue: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 4076 `inputs` tokens and 100 `max_new_tokens`
I've got the same issue in a long conversation. If I branch a prompt, the AI answers me, but I can't add any request after the branch. I tried with several models and got the same result: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6269 `inputs` tokens and 2047 `max_new_tokens`. If I go to another conversation, it works.
Any updates?
It seems the issue lies in how the context is being handled. I think the best approach here would be to clear the context after a few messages, so there are always enough tokens left to keep the conversation going. Maybe retrieve only the last 3-4 messages; that would give less context, but it would probably avoid the error, which seems to occur when the context is full and you have to start a new chat all over again until it happens again.
I think the best approach here would be to clear the context after a few messages
Can you give an example on how to do this?
The error we're encountering is probably due to the limitation on the total number of tokens that can be processed by the LLaMA model. To resolve this issue, developers can implement a mechanism to truncate the conversation context after a certain number of messages.
Something like this could work for let's say the last 5 messages, but this has to be done in the backend:
conversation_history = []

def process_message(user_input):
    global conversation_history
    # Add the user's input to the conversation history
    conversation_history.append(user_input)
    # Truncate the conversation history to keep only the last 5 messages
    if len(conversation_history) > 5:
        conversation_history = conversation_history[-5:]
    # Prepare the input for the LLaMA model
    input_text = "\n".join(conversation_history)
    # Call the LLaMA model with the truncated input
    response = llama_model(input_text)
    # Append the response to the conversation history
    conversation_history.append(response)
    return response
Or this for the frontend
let conversationHistory = [];

function processMessage(userInput) {
    conversationHistory.push(userInput);
    // Truncate the conversation history to keep only the last 5 messages
    if (conversationHistory.length > 5) {
        conversationHistory = conversationHistory.slice(-5);
    }
    // Prepare the input for the LLaMA model
    let inputText = conversationHistory.join("\n");
    // Call the LLaMA model with the truncated input
    $.ajax({
        type: "POST",
        url: "/llama-endpoint", // Replace with your LLaMA model endpoint
        data: { input: inputText },
        success: function(response) {
            // Append the response to the conversation history
            conversationHistory.push(response);
            // Update the conversation display
            $("#conversation-display").append(`<p>${response}</p>`);
        }
    });
}
That could potentially fix this bug.
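A token-aware version would match the error even more closely, since the limit is on tokens rather than on the number of messages. A rough sketch, assuming a Hugging Face tokenizer is available on the backend (the names and limits here are illustrative, not HuggingChat's actual code):

# Sketch: drop the oldest messages until the prompt fits the token budget.
# Illustrative only -- not HuggingChat's actual backend code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

MAX_TOTAL_TOKENS = 8192   # total window for the deployment
MAX_NEW_TOKENS = 2047     # tokens reserved for the reply

def truncate_history(messages):
    # Walk backwards from the newest message, keeping as many as fit the budget.
    budget = MAX_TOTAL_TOKENS - MAX_NEW_TOKENS
    kept, used = [], 0
    for message in reversed(messages):
        n = len(tokenizer(message)["input_ids"])
        if used + n > budget:
            break
        kept.append(message)
        used += n
    # Restore chronological order before building the prompt.
    return list(reversed(kept))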
Thanks for the help, but I'm using the Hugging Face chat website. I've no clue how to input this code.
I know, I mean the developers have to check whether that can fix the issue on their end. FYI @nsarrazin
Any updates, friends?
None, it still gets stuck with an error every time the chat log reaches some limit. Like someone stated earlier, it seems it will not be fixed.
Yeah, unfortunately there isn't any way to fix it from the frontend once it hits that error. Editing a message a few turns above the one that caused the error, asking for a summary of the chat context, and then starting a new chat works, but that isn't a real solution.
I don't know how the chat is deployed or in what language, but if I could help fix it, I would. I use the chat on a daily basis.
We should have a fix for it in TGI, will make sure it's deployed tomorrow!
Amazing! Thank you
Have you run into the error again? So far so good for me
Still same error for me.
Well done devs! Great stuff @nsarrazin thank you!
Seems fixed to me. I haven't seen that error anymore.
Yep the issue should be fixed on all models, if you still see it feel free to ping me!
@nsarrazin
Getting this error on the codellama-7b-instruct and llama2-70b-chat models:
ValidationError: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 6144. Given: 4183 `inputs` tokens and 2023 `max_new_tokens`
@nsarrazin
I am getting the same error with Qwen/Qwen2-72B-Instruct using Inference Endpoints:
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 970 `inputs` tokens and 5000 `max_new_tokens`
The model works if I set `max_new_tokens` to 500 (970 + 500 <= 1512), though. Is this a limitation of the model or Hugging Face Inference Endpoints?
Edit: I just noticed that for a Text Generation task, Max Number of Tokens (per Query) can be set under the Advanced Configuration settings of a dedicated Inference Endpoint. The default value is 1512 and increasing it to let's say 3000 fixed my issue.
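For anyone hitting this from code rather than the UI, the request side just needs `max_new_tokens` to stay within whatever budget the endpoint is configured with. A minimal sketch using huggingface_hub (the endpoint URL and token are placeholders):

# Sketch: call a dedicated Inference Endpoint while keeping max_new_tokens
# inside the endpoint's configured token budget. URL and token are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(
    "https://<your-endpoint>.endpoints.huggingface.cloud",
    token="hf_xxx",
)

output = client.text_generation(
    "Summarize the following report: ...",
    max_new_tokens=500,  # 970 input tokens + 500 <= 1512 (the default budget)
)
print(output)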
@nsarrazin
Though you have mentioned that the issue is solved for all models, I am facing the issue with the meta-llama/Meta-Llama-3-8B-Instruct model:
{
    "error": "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 4092 `inputs` tokens and 16 `max_new_tokens`",
    "error_type": "validation"
}
I think this is still an error: 'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096'. Using the dockerized TGI with params --model-id Qwen/Qwen2-72B-Instruct-GPTQ-Int8 --quantize gptq. This limits it to 4096 when the context should be allowed to be much bigger than that.
If you are using the Dockerized TGI, try setting the --max-total-tokens parameter. The default is 4096 and that may be the origin of the issue.
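For reference, a launch command along those lines might look like this (flag names as of recent TGI releases; on older versions --max-input-tokens was called --max-input-length, so check your TGI version):

docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2-72B-Instruct-GPTQ-Int8 \
    --quantize gptq \
    --max-input-tokens 8191 \
    --max-total-tokens 8192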
Hi, I saw the above thread and was wondering if it's an issue or a limitation.
I am using meta-llama/Meta-Llama-3.1-70B-Instruct which has a context window of 128k. But I get this when I send large input.
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 12682 `inputs` tokens and 4000 `max_new_tokens`
Using Hugging Chat, https://huggingface.co/chat/
Model: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 16384. Given: 14337 `inputs` tokens and 2048 `max_new_tokens`
Hello,
I have exactly the same error when calling Meta-Llama-3.1-70B-Instruct using Haystack v2.0's HuggingFaceTGIGenerator in the context of a RAG application:
It is very puzzling because Meta-Llama-3.1-70B-Instruct should have a context window size of 128k tokens. This, and the multilingual capabilities, are major upgrades with respect to the previous iteration of the model.
Still, here's the result:
I am calling the model using serverless API. Perhaps creating a dedicated, paid API endpoint would solve the issue? Did anyone try this?
Hi,
I had the same problem when using the Serverless Inference API and meta-llama/Meta-Llama-3.1-8B-Instruct. The problem is that the API only supports a context length of 8k for this model, while the model supports 128k. I got around the problem by running a private endpoint and changing the 'Container Configuration', specifically the token settings to whatever length I required.
Hi AlbinLidback,
Yes, I ended up doing the same thing and it solved the problem. Hugging Face could save users a lot of frustration by explicitly mentioning this on the model cards.
Hi @AlbinLidback, @JulienGuy,
I'm totally new to Hugging Face.
I also got the same problem with meta-llama/Meta-Llama-3.1-8B-Instruct and 70B-Instruct.
Could you share how to run a private endpoint and change the 'Container Configuration' for the 128k token length?
Hi @pineapple96,
This part is relatively straightforward. Go to the model card (e.g. https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), click on "Deploy" in the top right corner and select "Inference Endpoint". On the next page you can choose what hardware you want to run the model on, which will impact how much you pay per hour. Set "Automatic Scale to Zero" to some value other than "never" to switch off the endpoint after X amount of time without requests, so that you won't be paying for the endpoint while it's not in use. Then go to "Advanced Configuration" and set the maximum number of tokens to whatever makes sense for your use case. With this procedure you will be able to make full use of the larger context windows of the Llama 3 models.
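Once the endpoint is running, a request that uses the larger budget could look roughly like this (a sketch; the URL and token are placeholders for the values shown on your endpoint's page, and max_new_tokens has to stay within the total you configured):

# Sketch: query a dedicated endpoint configured with a larger token budget.
# The URL and token are placeholders for your own endpoint's values.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

payload = {
    "inputs": open("long_context.txt").read(),  # a prompt well beyond 8k tokens
    "parameters": {"max_new_tokens": 2048},
}
response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.json())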
Thanks a lot for the detailed how-to guide, JulienGuy. Appreciate it!