How to convert new LLaVA model into HF format?
Hi, I trained LLaVA-1.5-13b on a customized dataset, how do I convert my saved model + config into HF format? My model is here: https://huggingface.co./zzxslp/som-llava-v1.5-13b, it should be the same setting as https://huggingface.co./liuhaotian/llava-v1.5-13b.
I tried renaming the keys in the safetensors files to follow the format in that model checkpoint, but couldn't load the model correctly. For example, the vocab size in that repo is 32064 while the original LLaVA-1.5 used 32000.
Hey!
You should be able to convert LLaVA weights to HF format using this script. First clone transformers, then run it:
git clone https://github.com/huggingface/transformers
python transformers/src/transformers/models/llava/convert_llava_weights_to_hf.py --text_model_id lmsys/vicuna-13b-v1.5 --vision_model_id openai/clip-vit-large-patch14-336 --output_hub_path zzxslp/som-llava-v1.5-13b-hf --old_state_dict_id zzxslp/som-llava-v1.5-13b
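Once the script finishes, the converted checkpoint should load with the standard HF LLaVA classes. Here is a minimal smoke-test sketch, assuming the weights were pushed to zzxslp/som-llava-v1.5-13b-hf as in the --output_hub_path above:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed repo id: the --output_hub_path used in the conversion command above.
model_id = "zzxslp/som-llava-v1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The processor inserts/handles the <image> placeholder token for us.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```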
Hi there, thanks for the reply! I've converted my model into HF format by slightly modifying the script you provided.
Still, one question: why do we need to expand the vocab from 32000 to 32064? Also, the original image_token_index in LLaVA is set to -200, while in the HF model it is <image>: 32000.
Yes, we expand the vocab by adding an "image" token and a "pad" token, and the final vocab size is padded to 32000 + 64 for hardware efficiency reasons: having tensor dimensions that are multiples of 64 in "float16" precision on A100 GPUs can speed up tensor multiplications. More on that here
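If it helps to see it concretely, here is a rough sketch of that vocab expansion, assuming lmsys/vicuna-13b-v1.5 as the text backbone (the actual conversion script also re-initializes the new embedding rows, which this sketch skips):

```python
from transformers import AutoTokenizer, LlamaForCausalLM

# Assumed text backbone, as in the conversion command above.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.5")
model = LlamaForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

tokenizer.add_tokens(["<image>"], special_tokens=True)   # gets id 32000
tokenizer.add_special_tokens({"pad_token": "<pad>"})     # gets id 32001

# pad_to_multiple_of=64 rounds the 32002 used tokens up to 32064 embedding rows.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
print(model.get_input_embeddings().weight.shape[0])  # 32064
```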
As for the "image_token_index", I guess it is done for ease of tokenization in transformers, since a negative value like "-200" cannot be a valid token id.
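For example, in the converted checkpoint the placeholder is just an ordinary added token, so it has a regular non-negative id and the config points at it (repo id assumed to be the converted one above):

```python
from transformers import AutoConfig, AutoProcessor

model_id = "zzxslp/som-llava-v1.5-13b-hf"  # assumed converted repo

processor = AutoProcessor.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print(processor.tokenizer.convert_tokens_to_ids("<image>"))  # 32000
print(config.image_token_index)                              # 32000, the same id
```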
Hi Raushan, thanks for the reply, learned something new!
A follow-up question on the "image token index": the original image_token_index in LLaVA is -200 (which means 32000 - 200 = 31800?), but here you have 32000. Does this mean the llava-hf model uses a different image token index from the original model? How would that work?
> which means 32000 - 200 = 31800?
No, tokenization doesn't work like that. The image token is simply a placeholder that we need to assign some value to (theoretically it could be any value, as long as we can recover where to put the image embeddings) until the image embeddings are obtained. Later, all image tokens are replaced with the actual embeddings from the images. Hope this clarifies your doubts!
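A toy illustration of that placeholder mechanism (not the actual transformers code, just the idea): the text embeddings at <image> positions are overwritten with the projected image features before the language model sees them.

```python
import torch

image_token_index = 32000                                  # placeholder id from the config
input_ids = torch.tensor([[1, 32000, 32000, 306, 1074]])   # a prompt with two image slots
inputs_embeds = torch.randn(1, 5, 4096)                    # stand-in for the LM's text embeddings
image_features = torch.randn(1, 2, 4096)                   # stand-in for vision tower + projector output

# Scatter the image embeddings into the placeholder positions.
mask = input_ids == image_token_index
inputs_embeds[mask] = image_features.reshape(-1, 4096)
```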
It seems that the conversion leads to different outputs due to the change of vocabulary size (i.e., the embedding layer and LM head). In this case, do we need to further train the HF model?
@JackBAI the outputs shouldn't differ much with the change of vocab size, and the final logits are expected to be approximately equal per torch.allclose(). The reply from https://huggingface.co./llava-hf/llava-1.5-7b-hf/discussions/26 should be helpful ;)
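For completeness, here is a minimal sketch of the kind of logits check being referred to, assuming you have dumped reference logits from the original LLaVA code on the same image and prompt (the file name original_logits.pt is hypothetical). Only the first 32000 vocab entries are comparable, since the extra 64 rows are padding, and comparing the last position sidesteps any sequence-alignment differences:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "zzxslp/som-llava-v1.5-13b-hf"  # assumed converted repo
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    hf_logits = model(**inputs).logits

# Hypothetical reference dump from the original LLaVA repo on the same inputs.
original_logits = torch.load("original_logits.pt")

# Compare the last-position logits over the shared 32000-token vocab slice.
print(torch.allclose(hf_logits[:, -1, :32000], original_logits[:, -1, :32000], atol=1e-3))
```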