model max_seq_length?

by abpani1994 - opened Aug 19, 2024

Discussion

abpani1994

Aug 19, 2024

What is the model's max length: is it 384 tokens or 384 characters?

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Aug 19, 2024

Hi, it is the maximum token length.

abpani1994

Aug 19, 2024

So the maximum token length is counted using the DeBERTa tokenization, not by just splitting the text on spaces?
Please correct me if I am wrong.

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Aug 19, 2024

That is correct, the maximum token length specifies how many tokens, created by the deberta tokenization, can be processed at the same time.

If more tokens are provided, the input text is truncated to the maximum token length.

To process the text sentence by sentence, I would suggest trying out gliner-spacy. It is a wrapper that enables using the GLiNER model with spaCy. With spaCy, you can split the text into sentences using the built-in methods, and then use gliner-spacy to extract entities.
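To illustrate the sentence-by-sentence idea, here is a minimal sketch in plain Python. It is not the gliner-spacy implementation: the regex sentence splitter is a naive stand-in for spaCy's sentence segmentation, `predict_by_sentence` and `fake_predict` are hypothetical names, and the entity dictionaries are assumed to carry `start` and `end` character offsets relative to the sentence that was passed in.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

Entity = Dict[str, Any]

# Naive sentence boundary (an assumption, a stand-in for spaCy):
# whitespace that follows ".", "!" or "?".
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans, one per sentence."""
    spans = []
    start = 0
    for match in _SENT_END.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def predict_by_sentence(
    text: str, predict: Callable[[str], List[Entity]]
) -> List[Entity]:
    """Run `predict` on each sentence and shift the entity offsets
    so they point into the full text again."""
    entities = []
    for start, end in sentence_spans(text):
        for ent in predict(text[start:end]):
            shifted = dict(ent)
            shifted["start"] = ent["start"] + start
            shifted["end"] = ent["end"] + start
            entities.append(shifted)
    return entities


def fake_predict(sentence: str) -> List[Entity]:
    """Hypothetical stub predictor so the sketch runs without a model."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "label": "name"}
        for m in re.finditer(r"John Doe", sentence)
    ]


text = "Hello there. My name is John Doe. Nice to meet you!"
entities = predict_by_sentence(text, fake_predict)
```

With a loaded model, the stub could be swapped for something like `lambda s: model.predict_entities(s, labels=labels, threshold=0.5)`. Because each sentence is processed on its own, entities that span a sentence boundary would be missed.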

abpani1994

Aug 20, 2024

Thank you. This fine-tune is really good.

abpani1994 changed discussion status to closed Aug 20, 2024

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Aug 21, 2024

Glad to hear that 😊

abpani1994

Sep 4, 2024

So I verified with the microsoft/mdeberta-v3-base tokenizer. The max length under that tokenization and under this fine-tuned model is really different.
ids = tokenizer(text, add_special_tokens=False, max_length=256, stride=10, return_overflowing_tokens=True, truncation=True, padding=False)
len(ids.input_ids[0])  # 256
text = tokenizer.decode(ids.input_ids[0])
model.predict_entities(text, labels=labels, threshold=0.5)
gliner/data_processing/processor.py:206: UserWarning: Sentence of length 422 has been truncated to 384

I still get this warning.

abpani1994 changed discussion status to open Sep 4, 2024

abpani1994

Sep 4, 2024

Please help

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Sep 5, 2024

The GLiNER models use a different approach to tokenization. They have a method called token_splitter, which returns a generator yielding the tokens of the original text that the GLiNER models use. If the text has more tokens than the maximum length, it gets truncated (as you already found).

If you want to ensure the input text is not too long, you can use the token_splitter to first determine the tokens and count them.

from gliner import GLiNER

# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."

# create the token generator
token_generator =  model.token_splitter(text)

# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15

Using this length, you can decide how you want to split your text so it does not get truncated.

Hope this helps.

abpani1994

Sep 5, 2024

/python3.10/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
   1707 if name in modules:
   1708     return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'GLiNER' object has no attribute 'token_splitter'

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Sep 5, 2024

The GLiNER module seems to have been updated. After looking into the official GLiNER code, I found that the token_splitter can be accessed in the following way:

token_generator =  model.data_processor.token_splitter(text)

This should do the trick if you are using the latest version of GLiNER.

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Sep 5, 2024

Also, since this does not seem to be a problem with the model but rather with how to solve the original author's issue, I am closing this thread.

eriknovak changed discussion status to closed Sep 5, 2024

abpani1994

Sep 5, 2024

edited Sep 5, 2024

Oh, thank you, but:
----> 1 model.data_processor.token_splitter(text)

AttributeError: 'SpanProcessor' object has no attribute 'token_splitter'
Can you tell me which version you are using? I am using the latest version.

eriknovak

Department for Artificial Intelligence, Jožef Stefan Institute org Sep 5, 2024

edited Sep 5, 2024

The GLiNER module I tested with is version 0.2.10.

I double-checked the example above. The method is called words_splitter, not token_splitter. I mistakenly copied the wrong parts. Apologies.

The complete (fixed) code for counting the number of tokens is the following:

from gliner import GLiNER

# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."

# create the token generator
# NOTE: the use of `words_splitter`
token_generator = model.data_processor.words_splitter(text)

# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15

Hope it resolves the problem.
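Once the words can be counted, one way to avoid truncation is to cut the text into chunks of at most 384 words before calling the model. The following is only a sketch under stated assumptions: `split_words` is a regex stand-in written for this example (not GLiNER's own code) so that it runs without the library, `chunk_by_words` is a hypothetical helper, and the splitter is assumed to yield `(word, start, end)` tuples with character offsets. With GLiNER installed, the output of `model.data_processor.words_splitter(text)` could be passed in instead, provided it yields tuples of that shape.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Word = Tuple[str, int, int]  # (word, start offset, end offset)

# Stand-in word splitter (an assumption for this sketch): words, optionally
# joined by "-" or "_", or any single non-whitespace character.
_WORD_RE = re.compile(r"\w+(?:[-_]\w+)*|\S")


def split_words(text: str) -> Iterator[Word]:
    """Yield (word, start, end) for each word-like unit in the text."""
    for match in _WORD_RE.finditer(text):
        yield match.group(), match.start(), match.end()


def chunk_by_words(
    text: str, words: Iterable[Word], max_words: int = 384
) -> List[Tuple[str, int]]:
    """Return (chunk_text, chunk_start) pairs holding at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    word_list = list(words)
    chunks = []
    for i in range(0, len(word_list), max_words):
        group = word_list[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        # Slice the original text so spacing inside the chunk is preserved.
        chunks.append((text[start:end], start))
    return chunks


text = "This is an example text having the name John Doe and the date 15-08-1985."
words = list(split_words(text))
# A small limit is used here only to make the chunking visible.
chunks = chunk_by_words(text, words, max_words=6)
```

Each chunk keeps its start offset, so entity positions returned for a chunk can be shifted back into the original text. Fixed-size chunks can cut through an entity at a chunk boundary; splitting at sentence boundaries first would reduce that risk.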
