Tokenizer output differs from `EleutherAI/pythia-1.3b`
The tokenizer return dict for `EleutherAI/pythia-1.3b-deduped` contains a `token_type_ids` attribute, unlike any other model in the Pythia suite. Is this intended behavior? It leads to irregular errors in places like `generate` calls, with tracebacks such as:
```
      5 inputs = tokenizer(text, return_tensors="pt")
----> 6 model.generate(**inputs)

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in _validate_model_kwargs(self, model_kwargs)
    991
    992         if unused_model_args:
--> 993             raise ValueError(
    994                 f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
    995                 " generate arguments will also show up in this list)"

ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
```
repro: https://colab.research.google.com/drive/1Ow_UjVEmsOKP8LBn4GGZtYv1RVNuRzx0?usp=sharing
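In the meantime, a workaround on the caller side is to keep `token_type_ids` out of the encoding before calling `generate`. A minimal sketch (model name taken from the repro above; either option should work):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1.3b-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option 1: ask the tokenizer not to emit token_type_ids at all.
inputs = tokenizer("Hello world", return_tensors="pt", return_token_type_ids=False)

# Option 2: drop the stray key from the encoding afterwards.
# inputs = tokenizer("Hello world", return_tensors="pt")
# inputs.pop("token_type_ids", None)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```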
This is surprising to me. @hails, are you aware of this?
That's very strange, I don't know what's up with this! Looking at it, I see that this model's `special_tokens_map.json` file hasn't been filled out, so I must not have pushed the GPT-NeoX-20b tokenizer (just the NeoX tokenizer from the JSON file we have internally, which for some reason doesn't keep info like special tokens when you save it as a `PreTrainedTokenizer`). Will push the additional files when I add the rest of the checkpoints tomorrow!
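For reference, a sketch of what re-pushing could look like (not the exact commands used; this assumes the `EleutherAI/gpt-neox-20b` repo carries the complete file set and that you have write access to the target repo):

```python
from transformers import AutoTokenizer

# The NeoX-20b repo has the full set of tokenizer files, including
# special_tokens_map.json, not just a bare tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# push_to_hub re-uploads every tokenizer file, so the special-token
# metadata comes along to the affected Pythia repo.
tokenizer.push_to_hub("EleutherAI/pythia-1.3b-deduped")
```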
There have been a couple of weird behaviors with saving/uploading the tokenizer from JSON files/from HF. `merges.txt` also wasn't added to any of these repos, though it exists for NeoX-20b's tokenizer.
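A quick way to spot missing tokenizer files across repos (a sketch using `huggingface_hub`; the `expected` set below is the standard GPT-2-style artifact list, and note that a fast tokenizer only strictly needs `tokenizer.json`):

```python
from huggingface_hub import list_repo_files

# Standard tokenizer artifacts; only tokenizer.json is strictly
# required for the fast tokenizer path.
expected = {
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "merges.txt",
    "vocab.json",
}

for repo in ["EleutherAI/gpt-neox-20b", "EleutherAI/pythia-1.3b-deduped"]:
    present = set(list_repo_files(repo))
    print(repo, "missing:", sorted(expected - present))
```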
If the `pythia-1.3b` model doesn't give this error, then the above will fix it!
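Once the files are up, a quick check (sketch) is to confirm the deduped tokenizer's output keys match the non-deduped one, with no `token_type_ids`:

```python
from transformers import AutoTokenizer

for name in ["EleutherAI/pythia-1.3b", "EleutherAI/pythia-1.3b-deduped"]:
    tok = AutoTokenizer.from_pretrained(name)
    # After the fix, neither encoding should contain 'token_type_ids'.
    print(name, sorted(tok("test").keys()))
```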
Resolved!