HuggingChat: Input validation error: `inputs` tokens + `max_new_tokens` must be..

#430
by Kostyak - opened

I use the meta-llama/Meta-Llama-3-70B-Instruct model. After a certain number of messages, the AI refuses to respond and gives an error: "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6391 `inputs` tokens and 2047 `max_new_tokens`". Is this a bug or some new limitation? To be honest I still don't understand it, and I hope I get an answer here. I'm new to this site.

Same issue all of a sudden today.

Hugging Chat org

Can you see if this still happens? Should be fixed now.

> Can you see if this still happens? Should be fixed now.

Still the same error, except the numbers have changed a little.
[screenshot attached: Screenshot_20.png]

I keep getting this error as well, using the CohereForAI model.

Same error, "Meta-Llama-3-70B-Instruct" model.

I have also been running into this error. Is there a workaround or solution at all?

"Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6474 inputs tokens and 2047 max_new_tokens"

Using the meta-llama/Meta-Llama-3-70B-Instruct model.

Hi, I saw the above thread and was wondering if it's a bug or a limitation.

I am using meta-llama/Meta-Llama-3.1-70B-Instruct, which has a context window of 128k. But I get this when I send a large input:

Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 12682 inputs tokens and 4000 max_new_tokens

Using Hugging Chat, https://huggingface.co./chat/

Model: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Input validation error: inputs tokens + max_new_tokens must be <= 16384. Given: 14337 inputs tokens and 2048 max_new_tokens
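If you're hitting this through the API rather than the HuggingChat UI, one workaround is to count the prompt tokens yourself and cap `max_new_tokens` so the total stays under the deployment's limit. A rough sketch, assuming you call the serverless API via `huggingface_hub.InferenceClient` and that the 8192 cap from the errors above applies (you'll need access to the gated model and an HF token set in your environment):

```python
# Sketch: keep prompt tokens + max_new_tokens within the deployment's limit.
from transformers import AutoTokenizer
from huggingface_hub import InferenceClient

MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct"
TOTAL_TOKEN_LIMIT = 8192  # limit enforced by the API, not the model's 128k window

tokenizer = AutoTokenizer.from_pretrained(MODEL)
client = InferenceClient(MODEL)

def generate(prompt: str, desired_new_tokens: int = 2048) -> str:
    prompt_tokens = len(tokenizer(prompt)["input_ids"])
    budget = TOTAL_TOKEN_LIMIT - prompt_tokens
    if budget <= 0:
        # The prompt alone already exceeds the limit; it has to be trimmed.
        raise ValueError(f"Prompt uses {prompt_tokens} tokens; shorten it first.")
    return client.text_generation(prompt, max_new_tokens=min(desired_new_tokens, budget))
```

This doesn't give you the 128k window, it just avoids the 422 by shrinking the generation budget to whatever room is left.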

Hello,

I have exactly the same error when calling Meta-Llama-3.1-70B-Instruct using Haystack v2.0's HuggingFaceTGIGenerator in the context of a RAG application:

[screenshot attached: cmd.png]
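For context, the call looks roughly like this (a simplified sketch rather than my exact pipeline; the model name and `generation_kwargs` are the relevant parts, the rest follows the Haystack 2.x docs and assumes `HF_API_TOKEN` is set):

```python
# Simplified sketch of the Haystack 2.x setup that triggers the 422 error.
from haystack.components.generators import HuggingFaceTGIGenerator

generator = HuggingFaceTGIGenerator(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={"max_new_tokens": 2048},  # prompt + this must stay <= 8192 on serverless
)
generator.warm_up()

# The long RAG prompt (question plus retrieved documents) is what pushes the
# input token count past the serverless limit.
result = generator.run(prompt="<question plus retrieved context>")
print(result["replies"][0])
```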

It is very puzzling because Meta-Llama-3.1-70B-Instruct should have a context window size of 128k tokens. This, and the multilingual capabilities, are major upgrades with respect to the previous iteration of the model.

Still, here's the result:

[screenshot attached: error 422.png, showing the 422 validation error]

I am calling the model using the serverless API. Perhaps creating a dedicated, paid Inference Endpoint would solve the issue? Has anyone tried this?

Hi,

I had the same problem when using the Serverless Inference API and meta-llama/Meta-Llama-3.1-8B-Instruct. The problem is that the API only supports a context length of 8k for this model, while the model itself supports 128k. I got around it by running a private endpoint and changing the 'Container Configuration', specifically the token settings, to whatever length I required.
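If you'd rather set this up from code than from the web UI, the same container settings can be passed when creating the endpoint. A rough sketch using `huggingface_hub.create_inference_endpoint`; the endpoint name, hardware values, image tag and token limits below are placeholders you'd need to adapt to your account and to what your GPU memory can actually hold:

```python
from huggingface_hub import create_inference_endpoint

# Sketch: create a dedicated endpoint whose TGI container allows longer inputs
# than the 8k enforced on the serverless API. All names/values are placeholders.
endpoint = create_inference_endpoint(
    "llama-3-1-8b-long-context",                  # endpoint name (placeholder)
    repository="meta-llama/Meta-Llama-3.1-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",                            # placeholder, check pricing in the UI
    instance_type="nvidia-a100",                   # placeholder
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "MAX_INPUT_LENGTH": "16384",           # per-request prompt limit
            "MAX_TOTAL_TOKENS": "20480",           # prompt + max_new_tokens limit
            "MAX_BATCH_PREFILL_TOKENS": "16384",
        },
    },
)
endpoint.wait()   # block until the endpoint is up
print(endpoint.url)
```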

Hi AlbinLidback,

Yes, I ended up doing the same thing and it solved the problem. HuggingFace could save users a lot of frustration by explicitly mentioning this on the model cards.

Hi @AlbinLidback , @JulienGuy

I'm totally new to Hugging Face.
I also got the same problem with meta-llama/Meta-Llama-3.1-8B-Instruct and 70B-Instruct.

Could you share how to run a private endpoint and change the 'Container Configuration' to the 128k token length?

Hi @pineapple96 ,

This part is relatively straightforward. Go to the model card (e.g. https://huggingface.co./meta-llama/Meta-Llama-3.1-8B-Instruct), click on "Deploy" in the top right corner and select "Inference Endpoint". On the next page you can choose which hardware to run the model on, which determines how much you pay per hour. Set "Automatic Scale to Zero" to some value other than "never" so the endpoint switches off after a period without requests and you aren't billed while it's idle. Then go to "Advanced Configuration" and set the maximum number of tokens to whatever makes sense for your use case. With this procedure you will be able to make full use of the larger context window of the Llama 3.1 models.
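Once the endpoint shows as "Running", point your client at its URL instead of the serverless API; the limits are then whatever you configured. A minimal sketch, where the URL, token and numbers are placeholders:

```python
from huggingface_hub import InferenceClient

# Point the client at the dedicated endpoint rather than the serverless API.
client = InferenceClient(
    model="https://<your-endpoint>.endpoints.huggingface.cloud",  # placeholder URL
    token="hf_...",  # a token with access to the endpoint
)

response = client.text_generation(
    "<a prompt much longer than 8k tokens>",
    max_new_tokens=4000,  # fine as long as it fits the limits you configured
)
print(response)
```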

Thanks a lot for the detailed how-to guide, JulienGuy. Appreciate it!
