Check it out, and let me know what you think!
There are links to existing papers in the blog post if you want to dive into the field.
so good!
hi @visheratin, do you have any guides on how to train a similar model (Phi-2 + SigLIP vision encoder)?
I mainly used the LLaVA training codebase with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open-source all the code.
I found your blog post really interesting.
I have a question regarding training: in your method, you mentioned that images are divided into max_crop patches and then fed into an image encoder. Does this mean that, compared to the original LLaVA, the forward pass of the model requires max_crop times more time or memory? Or is there a more efficient way to implement this?
You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not such a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
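To make the cost discussion concrete, here is a minimal NumPy sketch of the multi-crop idea: the image is tiled into fixed-size crops and each crop gets its own encoder forward pass, so compute and activation memory scale with the number of crops. The tiling scheme, crop size, and the toy `encode` function are assumptions for illustration, not MC-LLaVA's actual implementation.

```python
import numpy as np

def split_into_crops(image, crop_size):
    """Tile an H x W x C image into non-overlapping crop_size x crop_size crops."""
    h, w, _ = image.shape
    crops = []
    for y in range(0, h - crop_size + 1, crop_size):
        for x in range(0, w - crop_size + 1, crop_size):
            crops.append(image[y:y + crop_size, x:x + crop_size])
    return crops

def encode(crop):
    """Toy stand-in for a SigLIP forward pass: flatten the crop to a vector."""
    return crop.reshape(-1).astype(np.float32)

image = np.zeros((448, 448, 3), dtype=np.uint8)
crops = split_into_crops(image, 224)          # 4 crops for a 448x448 image
embeddings = [encode(c) for c in crops]       # one encoder pass per crop
```

With a real encoder, each entry in `embeddings` would come from a separate forward pass, which is why activation memory during training grows roughly linearly with the crop count.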