Reference VRAM usage
Running the model and the demo with the reference code requires ~10GB of VRAM, which is suspiciously high for a 2B model (is something wrong with the reference code?). Input tensors are not freed and the CUDA cache is not cleared between inferences, so the VRAM stays allocated after the first inference in the demo.
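For anyone who wants to try this in the demo, here is a minimal sketch of the cleanup I mean. `release_cuda_memory` is a helper name I made up, not something from the reference code, and the usage in the trailing comments assumes `model` and `processor` objects like the ones in the demo. Note that `torch.cuda.empty_cache()` only returns cached blocks that no live tensor is using, so the references to the inputs and outputs have to be dropped first.

```python
import gc

try:
    import torch
except ImportError:  # keeps the helper importable on machines without PyTorch
    torch = None


def release_cuda_memory():
    """Hypothetical helper: free unreferenced tensors, then empty the CUDA cache.

    Returns the number of bytes still held by live tensors (0 without CUDA).
    """
    gc.collect()  # collect tensors that are only kept alive by reference cycles
    if torch is None or not torch.cuda.is_available():
        return 0
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return torch.cuda.memory_allocated()


# Assumed usage between two inferences (model/processor are not defined here):
#   inputs = processor(text=[prompt], images=[image], return_tensors="pt").to("cuda")
#   output_ids = model.generate(**inputs, max_new_tokens=128)
#   text = processor.batch_decode(output_ids, skip_special_tokens=True)
#   del inputs, output_ids
#   release_cuda_memory()
```

This does not lower the peak usage during an inference. It should only stop the memory from staying reserved afterwards.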
Also, similarly to the base model, min/max pixels affect the VRAM usage drastically: https://huggingface.co./Qwen/Qwen2-VL-2B-Instruct/discussions/10
With ShowUI, anything below `1024*28*28` seems to degrade the predictions too much.
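To make the pixel numbers concrete, here is the arithmetic as a small sketch. It relies on the Qwen2-VL convention that one visual token covers a 28x28 pixel area, so a pixel limit is written as a token count times 28*28. The helper names are mine, and the processor call in the trailing comment follows the Qwen2-VL model card; the `min_pixels` value there is only a placeholder, not a recommendation.

```python
PIXELS_PER_TOKEN = 28 * 28  # one visual token covers a 28x28 pixel area


def pixel_budget(num_tokens):
    """Pixel limit that corresponds to a given number of visual tokens."""
    return num_tokens * PIXELS_PER_TOKEN


def visual_tokens(num_pixels):
    """Number of whole visual tokens that fit in a given pixel limit."""
    return num_pixels // PIXELS_PER_TOKEN


max_pixels = pixel_budget(1024)
print(max_pixels)                 # 802816
print(visual_tokens(max_pixels))  # 1024

# Assumed usage, following the Qwen2-VL model card:
#   from transformers import AutoProcessor
#   processor = AutoProcessor.from_pretrained(
#       "Qwen/Qwen2-VL-2B-Instruct",
#       min_pixels=pixel_budget(256),
#       max_pixels=max_pixels,
#   )
```

So the threshold above works out to roughly 0.8 megapixels, or 1024 visual tokens per image.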
Thank you for the information! We will look into the optimization of model inference:)
I had the same question. I think it has to do with the fact that we initially load the weights of the entire model and then load the fine-tuned state dict on top of that. This adds up.