Update config.json pad_token_id to match the tokenizer_config pad token
tokenizer_config.json defines a pad_token, but pad_token_id is missing from config.json, which breaks running the model via TEI (Text Embeddings Inference), since TEI expects a value in pad_token_id.
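For anyone patching a local copy while waiting for the fix, the change amounts to writing the missing key into config.json. A minimal sketch (the function name is mine, and the id passed in must be the vocabulary id of the pad_token declared in tokenizer_config.json, which can be looked up in tokenizer.json):

```python
import json

def add_pad_token_id(config_path, pad_token_id):
    """Insert pad_token_id into config.json when it is missing.

    pad_token_id must be the vocabulary id of the pad_token declared
    in tokenizer_config.json (look it up in tokenizer.json's vocab).
    """
    with open(config_path) as f:
        config = json.load(f)
    if "pad_token_id" not in config:
        config["pad_token_id"] = pad_token_id
        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)
    return config
```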
Thank you for this contribution!
I also fixed this in the other models of the series. It is a bit strange to put this in config.json, since it depends on the tokenizer. But ok, I guess most models are made to work with a single tokenizer, and this model is a bit of an exception, so it's normal that it requires some bending of the rules.
Thanks @FremyCompany. Basically, I was trying to run it via a version of Text Embeddings Inference. So far I could embed short texts, but if the input exceeds 128 tokens it throws an error. Any idea?
I also found that in tokenizer_config.json, model_max_length says "model_max_length": 1000000000000000019884624838656 instead of 8192? Could you please check?
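That huge number is the sentinel value transformers serializes when model_max_length was never set (int(1e30)). If you want to patch a local tokenizer_config.json yourself while waiting for a fix, a small sketch (the function name is mine):

```python
import json

# Sentinel that transformers writes out when model_max_length was never set
VERY_LARGE_INTEGER = int(1e30)  # serializes as 1000000000000000019884624838656

def clamp_model_max_length(path, max_length=8192):
    """Replace the unset-sentinel model_max_length with the real limit."""
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("model_max_length", VERY_LARGE_INTEGER) >= VERY_LARGE_INTEGER:
        cfg["model_max_length"] = max_length
        with open(path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg
```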
@tapos999 I have never used TEI, could you share some small script I could run to debug the issue?
It's not officially supported yet, but I followed the instructions to install this version as mentioned here: https://github.com/huggingface/text-embeddings-inference/issues/457#issuecomment-2572214745
Then I created a Docker Compose file to run TEI:
```yaml
services:
  tei-embedder:
    container_name: tei_modernbert
    restart: always
    environment:
      HF_TOKEN:
      TRANSFORMERS_OFFLINE: 0
    ports:
      - 8001:80
    image: text-embedding-inference:custom
    command: >
      --max-batch-tokens=8192
      --max-batch-requests=8
      --model-id=Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE
```
Then, if you try the /embed endpoint in the Swagger API, this model doesn't accept more than 128 tokens and throws an error. By contrast, the nomic-ai base model still works for large texts, so I am not sure whether it's a config issue or the model itself was trained that way.
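To reproduce without the Swagger UI, a short script against TEI's /embed route can be used. This is a sketch assuming the compose file above (TEI mapped to port 8001); the {"inputs": ...} request shape is TEI's /embed payload:

```python
import json
import urllib.request

TEI_URL = "http://localhost:8001/embed"  # port mapping from the compose file above

def build_payload(text):
    # TEI's /embed route takes {"inputs": <string or list of strings>}
    return json.dumps({"inputs": text}).encode("utf-8")

def embed(text):
    req = urllib.request.Request(
        TEI_URL,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # list of embedding vectors

# An input comfortably past 128 tokens reproduces the failure:
long_text = "This sentence is repeated to push the input past the limit. " * 40
# embed(long_text)  # fails with an input-validation error on the broken config
```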
Hmmm, maybe the RoPE config didn't get saved properly? I'll take a look tonight.
It's really strange how many things got saved wrong by save_pretrained... I will have to file a bug when I get some time.
Thank you.
After diffing the various config files, I did not see any issue with the RoPE configuration. I fixed the issue with the model_max_length, but I doubt this is the cause. Could you give it another try?
I also tried to confirm that the model works with longer input using SentenceTransformers, and it seems to. So, the issue appears specific to TEI.
Unfortunately, I was not able to build the Docker image on the GPU environment of my university, so I could not try to reproduce your issue. I could try at home, but this isn't very convenient.
So, if there is still an issue after the model_max_length fix, may I suggest filing an issue on the TEI repository? Their ModernBERT support is not yet stable, after all, so it's quite possible there is a bug in their code which they haven't uncovered yet but might want to fix.
Thank you for trying. I will have a look and check.
Hi, I checked it. That fixes it, and there was also an issue on the TEI side, which has since been fixed. Now I can use the model on TEI without any issue. Thanks again!
Wonderful news! 🥳
Btw, I'm currently training an improved version of the model, so you can also expect another update in a couple of days, probably Friday, when I will have some time to work on this. Based on feedback from @tomaarsen, this new version will stay closer to the English model with respect to its embeddings, to prevent the lower quality of the multilingual datasets from deteriorating too much the embedding quality obtained with NomicAI's careful English finetuning data. I already have a better version now, but I still need to figure out which data can be added to improve things a bit more before I make the effort to publish a new release. Plus, I should probably add eval results this time, inspired by what other models in the area have done.
I'm saying this just in case you're ok with waiting a bit to embed your full dataset; if so, it may make sense to hold off for another week.