Position of <image> token in prompt for fine-tuning

#2
opened by hxgy610

In the landing-page example (the MD file), `<image>` is placed at the very first position of the prompt.
After applying the chat template, however, the image token sits after the header tokens: `<|begin_of_text|><|start_header_id|>user<|end_header_id|><image>...`. For a single-image prompt, I wonder whether the image token should be at the very beginning?
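
For reference, here is a minimal way to inspect what the template actually produces. The checkpoint id below is an assumption; substitute whichever mllama checkpoint you are using:

```python
from transformers import AutoProcessor

# Assumed checkpoint id; replace with the mllama checkpoint you fine-tune.
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the prompt as a string so the token placement is visible.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# The image token lands after the user header, e.g.:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>Describe this image.<|eot_id|>...
```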

I am asking because, according to the get_cross_attention_token_mask function at line 90 of processing_mllama.py, when only one image is present the sequence is unmasked from the position of the image token through the end of the sequence. If we don't place the image token at the beginning, fine-tuning fails with an error during loss calculation.
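
For context, here is a paraphrased sketch of that masking logic (not the verbatim transformers source; see processing_mllama.py for the actual implementation):

```python
from typing import List

def get_cross_attention_token_mask(
    input_ids: List[int], image_token_id: int
) -> List[List[int]]:
    """Sketch of the cross-attention mask spans built in processing_mllama.py."""
    image_positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if not image_positions:
        return []
    if len(image_positions) == 1:
        # Single image: unmask from the image token to the end of the
        # sequence; -1 stands for "until the end" here.
        return [[image_positions[0], -1]]
    # Multiple images: each image attends up to the next image token,
    # and the last image attends to all remaining text.
    masks = [[start, end] for start, end in zip(image_positions[:-1], image_positions[1:])]
    masks.append([image_positions[-1], len(input_ids)])
    return masks
```

With a single image, tokens that precede the image token fall outside the returned span, which is consistent with the loss-calculation error when the image token is not placed first.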

Could someone follow up on this issue, or share any insights?

I think I found the issue: https://huggingface.co./nltpt/transformers/discussions/6

Could you please help confirm? Many thanks!

I am also wondering about this. How do we properly fine-tune this new Llama model?
