I'm trying to run this in Inference Endpoints but keep getting this error...

#8
by GollyJer - opened

What is the correct setup in the Inference Endpoints UI? Thanks!

message: "Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0"
target: "text_generation_router_v3::client"
filename: "backends/v3/src/client/mod.rs"
line_number: 45
span: {"name":"warmup"}
spans: [{"max_batch_size":"None","max_input_length":"None","max_prefill_tokens":4096,"max_total_tokens":"None","name":"warmup"},{"name":"warmup"}]

I found another discussion of this error: https://github.com/huggingface/text-generation-inference/issues/2879

Setting the CUDA_GRAPHS environment variable to 0 did the trick. That disables TGI's CUDA graph capture, which happens during warmup (where this error is thrown); the 16 and 32 in the message line up with two of TGI's default CUDA graph batch sizes (1, 2, 4, 8, 16, 32).
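
If you'd rather create the endpoint programmatically than through the UI, here's a minimal sketch using huggingface_hub's create_inference_endpoint, passing the variable through the custom container's env mapping. The endpoint name, model repo, instance settings, and image tag below are placeholders, not values from this thread.

```python
# Minimal sketch: deploy a TGI endpoint with CUDA graphs disabled.
# Assumes huggingface_hub is installed and you are logged in (or pass token=...).
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint",                   # placeholder endpoint name
    repository="my-org/my-model",    # placeholder model repo
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "CUDA_GRAPHS": "0",      # the fix: disable CUDA graph capture
        },
    },
)
endpoint.wait()  # block until the endpoint reports running
print(endpoint.url)
```

In the UI, the equivalent is adding CUDA_GRAPHS=0 as an environment variable in the endpoint's container configuration.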


GollyJer changed discussion status to closed
