Discrepancy between vocabulary size in model and tokenizer leading to bugs

#3
by jaanli - opened

Hi! We had a quick question about a discrepancy between the model's input embeddings and the tokenizer:

from transformers import AutoModel

model = AutoModel.from_pretrained('UFNLP/gatortron-base')
model.embeddings.word_embeddings.weight.shape

There are 50176 rows in this embedding matrix, but the tokenizer has 50101 vocabulary items (https://huggingface.co./UFNLP/gatortron-base/raw/main/vocab.txt).
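For reference, here is the tokenizer side of the comparison (a quick sketch; we are assuming the tokenizer loads through AutoTokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortron-base')
# 50101 vocabulary entries, versus the 50176 embedding rows above
len(tokenizer)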

Is there a reason for this discrepancy? It forces us to hard-code the vocabulary size as a workaround, and we want to make sure we are initializing correctly from GatorTron.

In any case, thank you so much for open-sourcing this! It is extremely helpful :)

University of Florida NLP Group org
edited Mar 15

NVIDIA pads the vocabulary to a multiple of 8 to make effective use of Tensor Cores during training, particularly when using mixed precision. Please see the documentation: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape.
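Concretely, the padding just rounds the tokenizer vocabulary up to a configured multiple before the embedding matrix is allocated; the extra rows are never produced by the tokenizer. A minimal sketch of that rounding (an illustration only, not the actual training code; the divisor is a training-time configuration choice):

def pad_vocab_size(orig_vocab_size: int, divisor: int) -> int:
    # Round the vocabulary up to the next multiple of `divisor` so the
    # embedding/GEMM dimensions stay tensor-core friendly under mixed precision.
    return ((orig_vocab_size + divisor - 1) // divisor) * divisor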

Understood -- super helpful! The discrepancy is 75 tokens, though; is that expected, and is there code showing where the vocabulary size gets rounded to a multiple of 8?

Asking because other standard codebases choose the nearest multiple of 8 for the vocabulary size: https://github.com/pytorch-labs/gpt-fast/blob/main/model.py#L61

It sounds like we are safe using the first 50101 vocabulary items then. Appreciate the help!
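In case it helps anyone else who hits this, here is the workaround we settled on (a sketch, assuming the extra 75 rows are pure padding that the tokenizer never emits; resize_token_embeddings simply truncates the embedding matrix to the tokenizer's size):

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('UFNLP/gatortron-base')
tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortron-base')

# Option 1: shrink the embedding matrix to match the tokenizer (50101 rows).
model.resize_token_embeddings(len(tokenizer))

# Option 2: leave the checkpoint untouched; tokenizer ids never reach the
# padded rows, so they are simply never indexed during training or inference.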
