Clarification on the way the tokenizer should be used

#6
by vince62s - opened

For more detail you can read this post: https://github.com/huggingface/transformers/issues/31513#issuecomment-2393320476

In essence, if you use the snippet from the model card, you get this:

In [10]: import torch
    ...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
    ...: 
    ...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left')
    ...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
    ...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
    ...: print(input_ids)
    ...: print(prompt)
    ...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
    ...: print(outputs)
tensor([[     1,      3,  15236,    271,  31702,  31817,    557,   5302,   6001,
           1061,   6771,   2023,   5256, 119735,    271,  31601, 119782,  97849,
           4437,    271,  60457, 119782,      4, 119715,    271,      3,  58406,
            271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|> user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|> \n<|im_start|> assistant\n']

You cannot really see it here, but spaces are added before "user" and "assistant", and between <|im_end|> and "\n".
It also uses specific tokens for "user" (15236) and "assistant" (58406).
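
A quick way to see exactly which pieces the tokenizer produces is to decode the IDs one by one (a minimal sketch, using the same model ID as above):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct")
    ids = tok("<|im_start|>user\n").input_ids
    print(ids)                             # expected to match the start of the tensor above, e.g. [1, 3, 15236, 271]
    print(tok.convert_ids_to_tokens(ids))  # raw pieces, which make any inserted space visible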

Now if you add the following flag to the tokenizer:

In [11]: import torch
    ...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
    ...: 
    ...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left', add_prefix_space=False)
    ...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
    ...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
    ...: print(input_ids)
    ...: print(prompt)
    ...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
    ...: print(outputs)
tensor([[     1,      3,  13676,    271,  31702,  31817,    557,   5302,   6001,
           1061,   6771,   2023,   5256, 119735,    271,  31601, 119782,  97849,
           4437,    271,  60457, 119782,      4,    271,      3,    788,  35441,
            271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n']

You can see that the spaces are no longer there and the tokens are not the same.
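
To compare the two settings token by token, you can print the raw pieces side by side (a small sketch along the same lines as the snippets above):

    from transformers import AutoTokenizer

    prompt = "<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"

    tok_default = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct")
    tok_no_prefix = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", add_prefix_space=False)

    # Raw pieces under each setting; the differences around "user", "assistant"
    # and <|im_end|> should stand out immediately.
    print(tok_default.convert_ids_to_tokens(tok_default(prompt).input_ids))
    print(tok_no_prefix.convert_ids_to_tokens(tok_no_prefix(prompt).input_ids))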

So the question is: which tokenizer did you use during training? If HF, can you please specify the token IDs and flags so that we can use the same ones at inference?

Thanks
Vincent

BTW: the issue is similar with Tower.

UTTER - Unified Transcription and Translation for Extended Reality org

Hi Vincent,
Thanks for pointing out this issue. We are aware of this and we may find an alternative solution in future models (same for Tower). For now, please use the tokenizer as in the first snippet. For example, these are the tokens the model sees during training for "<|im_start|>user\n": [3, 15236, 271]. This is a strange issue indeed because [tokenizer.decode([num]) for num in [3, 15236, 271]] yields ['<|im_start|>', 'user', '\n'].
We may experiment with training further iterations with the add_prefix_space option set to False.

P.S.: Please note that, for best results, EuroLLM requires adding a system placeholder even if the system message is "".
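
For reference, a minimal sketch of passing that placeholder, assuming the repo ships a chat template (the messages format below is the standard HF one, not something specific to this thread):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct")

    # Empty system message as the placeholder, then let the chat template build the prompt.
    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Translate the following text from English into German.\nEnglish: Hello world\nGerman:"},
    ]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    print(tok.batch_decode(input_ids, skip_special_tokens=False))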

You also realize that the first snippet triggers the sequence [4, 119715, 271], with a useless character between <|im_end|> and "\n".

UTTER - Unified Transcription and Translation for Extended Reality org

Yes, indeed. That was also seen during training. This character is usually triggered before numbers and \n.
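
To check what that token actually is, you can decode it on its own (same inspection pattern as above):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct")

    # The piece that shows up between <|im_end|> and "\n" in the first snippet.
    print(tok.convert_ids_to_tokens([4, 119715, 271]))
    print(repr(tok.decode([119715])))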
