Llava or LlavaNextForConditionalGeneration?
In the demo you are using LlavaModel, but shouldn't the llava-interleave model be a LlavaNextModel?
@mdoeir yes, for the HF implementation it's a LlavaModel. That's because we don't support anything except multimodal_patch_merge_type == "spatial_unpad" for LLaVA-NeXT, and the arch of the interleave models is the same as in LLaVA if we follow the flags set by the original impl. So it's okay to use LlavaModel here.
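For example, loading the interleave checkpoint through the plain LLaVA classes looks roughly like this (a minimal sketch; the model id and dtype below are assumptions, not taken from the demo):

```python
# Sketch: load a llava-interleave checkpoint with the plain LLaVA classes.
# The model id is illustrative and may differ from the one used in the demo.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained(model_id)
```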
@RaushanTurganbay Many thanks! Another question though:
I tried to run the pure transformers demo with the 0.5b model, but I got the following error:
```
Traceback (most recent call last):
  File "/data/root/code/opensource/llava-next-interleave/llava_next_demo_hf.py", line 32, in <module>
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  File "/data/root/miniconda3/envs/pl/lib/python3.10/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the chat_template
attribute, or provide a chat template as an argument. See https://huggingface.co./docs/transformers/main/en/chat_templating for more information.
```
```
$ pip list | grep "transformers"
transformers                   4.42.4
transformers-stream-generator  0.0.5
```
Did I miss something from the demo?
Yes, you have to update your transformers version via !pip install --upgrade git+https://github.com/huggingface/transformers.git to use chat templates. Chat template support was added just a few days ago and hasn't yet made it to the PyPI release :)
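After upgrading, applying the chat template should work along these lines (a minimal sketch; the checkpoint id and the conversation content are illustrative):

```python
# Sketch: build a prompt with the processor's chat template after upgrading transformers.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```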
Hello! I have a related question about how to process a single image. I notice this repo uses LlavaProcessor, which has no "spatial unpad" method, so a single image is processed into [1, channel, height, width] instead of [patches, other dimensions].
If I want to continue fine-tuning from this "llava-next-interleave" checkpoint with transformers on my own datasets, is this un-patched single image OK? I thought this "llava-next-interleave" checkpoint was trained with patched single images in both pretraining and fine-tuning.
Yes, each image will be of shape [1, channel, height, width], consistent with how inference is done in the original repo (https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/playground/demo/interleave_demo.py).
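You can check that shape directly from the processor output; a minimal sketch (the checkpoint id and test image URL are assumptions):

```python
# Sketch: inspect the pixel_values shape LlavaProcessor produces for a single image.
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text="<image>\nDescribe the image.", images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: [1, 3, H, W] -- one tensor per image, no anyres patches
```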
Thank you for your reply! But actually I'm asking about continuing to fine-tune on my own dataset from this checkpoint, not about inference. What should the dimensions of a single image be when I fine-tune? Does this checkpoint not support fine-tuning?
@lalalandw I see, but llava-OV afaik is not supposed to be trained with multi-patch, because in that case we would also infer with multi-patch. Let me know if there is any resource that states explicitly how the model was trained, since the paper doesn't mention anyres and doesn't give much detail except for the backbones used.
Thank you for your reply! Their released paper, "LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models" (https://arxiv.org/pdf/2407.07895), explicitly says that they used anyres to train on single images. Please see Section 5.2 (Multi-patch (single-image) Results). The paper says: 'We use the anyres training for single-image data, which divides an image into multiple patches, forming another multi-image setting'. Maybe inference on a single image with and without multi-patch are both OK? But I think LlavaNextForConditionalGeneration may be the better model type because it supports the anyres method.
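For contrast, a quick sketch of what anyres patching produces with the LLaVA-NeXT processor (the llava-v1.6 checkpoint below is only an illustration of an anyres model, not the interleave checkpoint discussed here):

```python
# Sketch: LLaVA-NeXT (anyres) splits a single image into multiple patches.
import requests
from PIL import Image
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text="<image>\nDescribe the image.", images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: [1, num_patches, 3, H, W]
```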
Okay, let me check that. But I don't think llava-next would fit here because it cannot do inference in a non-patched way. In general, if you want to tune with different parameters for patching/padding/newline, you can use the original impl in the LLaVA-VL repo, as it supports setting them in different combinations, while transformers takes only particular model architectures for integration. After tuning the model, you will be able to convert it to HF format with one of our conversion scripts.