allow dynamic batch size

#3
by l-i - opened
No description provided.
AWS Inferentia and Trainium org

Hi @l-i ,

Thanks for opening the issue. You can always enable dynamic_batch_size when compiling the checkpoint.

In our previous experience, we reached better latency with a static batch size. But maybe we can add a checkpoint with dynamic batch size, @philschmid, WDYT?

AWS Inferentia and Trainium org

Yes, the latest information we have is that it's optimized for BS=1, but @l-i you should be able to compile it yourself by following https://huggingface.co./docs/optimum-neuron/guides/export_model#exporting-stable-diffusion-xl-to-neuron and setting the batch size you want.
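
For reference, something along these lines should work (a sketch only; the model id, shapes, and output path are illustrative, and the dynamic_batch_size flag is optional):

```python
# Minimal sketch based on the export guide above; model id, shapes and
# output directory are illustrative, adjust them to your use case.
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
# Static input shapes are baked into the compiled graph.
input_shapes = {"batch_size": 4, "height": 1024, "width": 1024}

pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,              # compile the model with the Neuron compiler while loading
    dynamic_batch_size=True,  # optional: also accept batch sizes that are multiples of the compiled one
    **input_shapes,
)
pipe.save_pretrained("sdxl_neuron_bs4/")
```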

Thank you for sharing the details!

I have two follow-up questions:

  • If I enable dynamic batch size during compilation but still run inference at the original static batch size, would it still affect latency?
  • If I compile with a larger static batch size, how much larger does my machine need to be? (I tried with 6 on a 24xl, and it seemed to fail.)
AWS Inferentia and Trainium org

Just to point out: if you are compiling for a large batch size and even a 24xlarge runs OOM, you could try a CPU-only instance just for the compilation.
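
The Neuron compiler itself runs on CPU, so the export step from the guide can be done on a memory-rich CPU-only instance and the saved artifacts copied to the Inferentia host afterwards, where they load without recompiling. Roughly like this (the local directory name is hypothetical and assumes you copied over the folder produced by save_pretrained on the compilation machine):

```python
# Sketch of loading pre-compiled artifacts on the Inferentia instance.
from optimum.neuron import NeuronStableDiffusionXLPipeline

pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_neuron_bs4/")

# The prompt batch here matches the batch size used at compilation time.
images = pipe(prompt=["a photo of an astronaut riding a horse on mars"] * 4).images
```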
