Tokenizer output differs from `EleutherAI/pythia-1.3b`
The tokenizer return dict for `EleutherAI/pythia-1.3b-deduped` contains a `token_type_ids` attribute, unlike any other model in the Pythia suite. Is this intended behavior? It leads to irregular errors in places like `generate` calls, with tracebacks such as:
```
      5 inputs = tokenizer(text, return_tensors="pt")
----> 6 model.generate(**inputs)

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in _validate_model_kwargs(self, model_kwargs)
    991
    992         if unused_model_args:
--> 993             raise ValueError(
    994                 f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
    995                 " generate arguments will also show up in this list)"

ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
```
repro: https://colab.research.google.com/drive/1Ow_UjVEmsOKP8LBn4GGZtYv1RVNuRzx0?usp=sharing
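In the meantime, a workaround on the caller side is to keep `token_type_ids` out of the encoding before calling `generate`. A minimal sketch (model name taken from the repro above; either option should work):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1.3b-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option 1: ask the tokenizer not to emit token_type_ids at all.
inputs = tokenizer("Hello world", return_tensors="pt", return_token_type_ids=False)

# Option 2: drop the stray key from the encoding afterwards.
# inputs = tokenizer("Hello world", return_tensors="pt")
# inputs.pop("token_type_ids", None)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```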
This is surprising to me. @hails, are you aware of this?
That's very strange, I don't know what's up with this! Looking at it, I see that this model's `special_tokens_map.json` file hasn't been filled out, so I must not have pushed the GPT-NeoX-20b tokenizer (just the NeoX tokenizer from the JSON file we have internally, which for some reason doesn't keep info like special tokens when you save it as a `PreTrainedTokenizer`). Will push the additional files when I add the rest of the checkpoints tomorrow!
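For reference, a sketch of what re-pushing could look like (not the exact commands used; this assumes the `EleutherAI/gpt-neox-20b` repo carries the complete file set and that you have write access to the target repo):

```python
from transformers import AutoTokenizer

# The NeoX-20b repo has the full set of tokenizer files, including
# special_tokens_map.json, not just a bare tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# push_to_hub re-uploads every tokenizer file, so the special-token
# metadata comes along to the affected Pythia repo.
tokenizer.push_to_hub("EleutherAI/pythia-1.3b-deduped")
```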
There have been a couple of weird behaviors with saving/uploading the tokenizer from JSON files/from HF. `merges.txt` also wasn't added to any of these repos, though it exists for NeoX-20b's tokenizer.
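A quick way to spot missing tokenizer files across repos (a sketch using `huggingface_hub`; the `expected` set below is the standard GPT-2-style artifact list, and note that a fast tokenizer only strictly needs `tokenizer.json`):

```python
from huggingface_hub import list_repo_files

# Standard tokenizer artifacts; only tokenizer.json is strictly
# required for the fast tokenizer path.
expected = {
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "merges.txt",
    "vocab.json",
}

for repo in ["EleutherAI/gpt-neox-20b", "EleutherAI/pythia-1.3b-deduped"]:
    present = set(list_repo_files(repo))
    print(repo, "missing:", sorted(expected - present))
```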
If the `pythia-1.3b` model doesn't give this error, then the above will fix it!
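Once the files are up, a quick check (sketch) is to confirm the deduped tokenizer's output keys match the non-deduped one, with no `token_type_ids`:

```python
from transformers import AutoTokenizer

for name in ["EleutherAI/pythia-1.3b", "EleutherAI/pythia-1.3b-deduped"]:
    tok = AutoTokenizer.from_pretrained(name)
    # After the fix, neither encoding should contain 'token_type_ids'.
    print(name, sorted(tok("test").keys()))
```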
Resolved!