Why len(processor.tokenizer) != model_vocab_size?

#5
by PerRing - opened

In 'llava-hf/llava-1.5-7b-hf',
len(processor.tokenizer) is 32002, but the model vocab size (model.language_model.model.embed_tokens) is 32064.

Why are the sizes different? Shouldn't they be the same?
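
For reference, here is roughly how both numbers can be inspected (the exact attribute paths may vary slightly across transformers versions):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

print(len(processor.tokenizer))                                # 32002
print(model.language_model.model.embed_tokens.num_embeddings)  # 32064
print(model.config.text_config.vocab_size)                     # 32064
```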

Llava Hugging Face org

Why should they be the same?

Llava Hugging Face org

The tokenizer has the exact number of tokens we need, but the LM head needs to be padded to a multiple of the number of SMs on your machine for performance reasons.
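
As a minimal sketch of that padding idea (the multiple of 64 here is only an illustrative assumption, not necessarily what was actually used):

```python
def pad_vocab_size(n_tokens: int, multiple: int = 64) -> int:
    # Round the vocabulary size up to the next multiple for GPU efficiency.
    return ((n_tokens + multiple - 1) // multiple) * multiple

print(pad_vocab_size(32002))  # 32064
```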

Llava Hugging Face org

The assumption that they should be equal is wrong

I thought they should be the same because if the model generates token id 32060, the tokenizer can't decode it (there is no token id 32060 in the tokenizer).

Llava Hugging Face org

Yeah, but the model would not generate those tokens because it was never trained to do so. It's common practice to reserve extra vocabulary slots for users who add tokens when they fine-tune, etc.
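
A minimal sketch of what that reserved space is useful for, assuming the processor and model from above and a hypothetical new token:

```python
# Hypothetical fine-tuning scenario: register a new special token.
num_added = processor.tokenizer.add_tokens(["<my_new_token>"], special_tokens=True)

# The embedding matrix only needs resizing if the tokenizer outgrows the
# padded vocab; here 32003 still fits inside the 32064 reserved rows.
if len(processor.tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(processor.tokenizer))
```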

PerRing changed discussion status to closed
