Error during quantization
Just an FYI (I'm aware you made a GGML available yourself).
Exception: Vocab size mismatch (model has 32032, but I:\HF\Storage\NousResearch_Nous-Hermes-Llama2-13b\tokenizer.model combined with I:\HF\Storage\NousResearch_Nous-Hermes-Llama2-13b\added_tokens.json has 32001).
Same finding here.
Also, when I attempted quantization from the provided GGML fp16, I was notified that certain tensors aren't k-quant compatible because their dimensions are not a multiple of 256 - presumably also related to the vocab changes.
Yup, it doesn't seem to work with the 4-bit or 8-bit quantization offered through bitsandbytes either.
BnB on newer transformers can be fixed with "pretraining_tp": 1 in the config file
Same problem here.
config.json says "vocab_size": 32032
while largest id in tokenizer.json is 32000
Does anyone know how to solve this?
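A quick way to confirm the mismatch locally (a minimal sketch; the local model directory path is a placeholder, substitute your own):

```python
import json
from pathlib import Path
from transformers import AutoTokenizer

model_dir = Path("Nous-Hermes-Llama2-13b")  # hypothetical path to your downloaded model

config = json.loads((model_dir / "config.json").read_text())
tokenizer = AutoTokenizer.from_pretrained(model_dir)

print("config.json vocab_size:", config["vocab_size"])  # 32032 for this model
print("tokenizer size:", len(tokenizer))                 # 32001 here (32000 base + <pad>), per the error above
```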
Same problem here.
config.json says "vocab_size": 32032
while largest id in tokenizer.json is 32000
Does anyone know how to solve this?
You can add 32 dummy tokens to added_tokens.json to make it match the tensor size. Not sure the reason it's set up like this.
BnB on newer transformers can be fixed with "pretraining_tp": 1 in the config file
This is the real fix. It's an issue on Hugging Face's end, and it broke a lot of the Llama 2 finetunes dropped that day.
The fix has been pushed to the model, so you can just download the new config.json.
If you're still having issues, you can do the dummy-token thing, but it's not recommended.
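If you'd rather patch an already-downloaded copy than re-download, a minimal sketch (the model directory is an assumption, substitute your own path):

```python
import json
from pathlib import Path

config_path = Path("Nous-Hermes-Llama2-13b") / "config.json"  # hypothetical local model directory

config = json.loads(config_path.read_text())
config["pretraining_tp"] = 1  # the fix described above for BnB on newer transformers
config_path.write_text(json.dumps(config, indent=2))
```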
I upgraded transformers and bitsandbytes to the latest versions, but I'm still getting a vocab size mismatch when trying to run convert.py in llama.cpp. What am I missing?
The only solution I could find was to add a bunch of dummy tokens to added_tokens.json, which works, but it seems like a dumb fix that could lead to issues. Better than nothing, I guess.
The only solution I could find was to add a bunch of dummy tokens to added_tokens.json, which works, but it seems like a dumb fix that could lead to issues. Better than nothing, I guess.
Please tell me, how do I add a bunch of dummy tokens?
The only solution I could find was to add a bunch of dummy tokens to added_tokens.json, which works, but it seems like a dumb fix that could lead to issues. Better than nothing, I guess.
Please tell me, how do I add a bunch of dummy tokens?
This is my added_tokens.json file, with dummy tokens added to bring the total to 32032 tokens:
{"<pad>": 32000, "<pad1>": 32001, "<pad2>": 32002, "<pad3>": 32003, "<pad4>": 32004, "<pad5>": 32005, "<pad6>": 32006, "<pad7>": 32007, "<pad8>": 32008, "<pad9>": 32009, "<pad10>": 32010, "<pad11>": 32011, "<pad12>": 32012, "<pad13>": 32013, "<pad14>": 32014, "<pad15>": 32015, "<pad16>": 32016, "<pad17>": 32017, "<pad18>": 32018, "<pad19>": 32019, "<pad20>": 32020, "<pad21>": 32021, "<pad22>": 32022, "<pad23>": 32023, "<pad24>": 32024, "<pad25>": 32025, "<pad26>": 32026, "<pad27>": 32027, "<pad28>": 32028, "<pad29>": 32029, "<pad30>": 32030,"<pad31>": 32031}```
Same problem here.
config.json says "vocab_size": 32032
while largest id in tokenizer.json is 32000
Does anyone know how to solve this?
You can add 32 dummy tokens to added_tokens.json to make it match the tensor size. Not sure the reason it's set up like this.
Seems it was the trainer we used, axolotl. It has been fixed in the trainer, but I still don't know how to fix it here.
Python script to generate a valid tokenizer.model:
from pathlib import Path
from transformers import AutoTokenizer

tokenizer_model_name = 'NousResearch/Llama-2-7b-hf'
model_path = 'output'

# 32 dummy tokens (<pad>, <pad1> .. <pad31>) take the base 32000-token vocab up to 32032,
# matching the 32-entry added_tokens.json above
new_tokens = ["<pad>"] + [f"<pad{i}>" for i in range(1, 32)]

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.add_tokens(new_tokens)

# writes the tokenizer config and the sentencepiece tokenizer.model into model_path
tokenizer.save_pretrained(Path(model_path))
tokenizer.save_vocabulary(model_path)
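As a sanity check after running it, the saved tokenizer should report 32032 tokens (a sketch, assuming the same 'output' directory used in the script):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("output")  # the model_path from the script above
print(len(tokenizer))  # should print 32032, matching vocab_size in config.json
```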