Position of <image> token in prompt for fine-tuning

#2
opened by hxgy610

In the landing-page example (the MD file), `<image>` is placed at the very first position of the prompt.
After applying the chat template, however, the image token sits after the header tokens: `<|begin_of_text|><|start_header_id|>user<|end_header_id|><image>...`. For a single-image prompt, I wonder whether the image token should be at the very beginning?
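
For reference, here is a minimal way to inspect what the template actually produces. The checkpoint id below is an assumption; substitute whichever mllama checkpoint you are using:

```python
from transformers import AutoProcessor

# Assumed checkpoint id; replace with the mllama checkpoint you fine-tune.
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the prompt as a string so the token placement is visible.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# The image token lands after the user header, e.g.:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>Describe this image.<|eot_id|>...
```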

I am asking because, according to the get_cross_attention_token_mask function at line 90 of processing_mllama.py, when only one image is present the sequence is unmasked from the position of the image token through the end of the sequence. If we don't place the image token at the beginning, fine-tuning fails with an error during loss calculation.
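
For context, here is a paraphrased sketch of that masking logic (not the verbatim transformers source; see processing_mllama.py for the actual implementation):

```python
from typing import List

def get_cross_attention_token_mask(
    input_ids: List[int], image_token_id: int
) -> List[List[int]]:
    """Sketch of the cross-attention mask spans built in processing_mllama.py."""
    image_positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if not image_positions:
        return []
    if len(image_positions) == 1:
        # Single image: unmask from the image token to the end of the
        # sequence; -1 stands for "until the end" here.
        return [[image_positions[0], -1]]
    # Multiple images: each image attends up to the next image token,
    # and the last image attends to all remaining text.
    masks = [[start, end] for start, end in zip(image_positions[:-1], image_positions[1:])]
    masks.append([image_positions[-1], len(input_ids)])
    return masks
```

With a single image, tokens that precede the image token fall outside the returned span, which is consistent with the loss-calculation error when the image token is not placed first.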

Could someone follow up on this issue, or share any insights?

I think I found the issue: https://huggingface.co./nltpt/transformers/discussions/6

Could you please help confirm? Many thanks!

I am also wondering about this. How do we properly fine-tune this new Llama model?
