Spaces in tokens

#1
by johnowhitaker - opened

I dug through the GLiNER codebase a while back and, while I'm still not certain, I think the default WordSplitter is used and that it does not keep the space at the start of each word. Since ModernBERT uses an OLMo-style tokenizer, most word tokens in the vocab include a leading space! When I was trying out GLiNER as an eval during training, I ended up rolling my own to work around this. It might be worth a look in case it gives even better performance.
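To make the concern concrete, here is a minimal sketch of a splitter that keeps the leading space attached to each word. It is not the GLiNER WordSplitter and not the workaround mentioned above; the names `split_keep_leading_space` and `split_plain` and the regex are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical helper, not part of GLiNER: split text into words and
# punctuation, keeping one preceding space attached to each piece.
_WORD_RE = re.compile(r" ?\w+(?:-\w+)*| ?[^\w\s]")


def split_keep_leading_space(text):
    """Return (token, start, end) triples.

    `token` keeps its leading space if the text had one; `start` and `end`
    are character offsets of the word itself, excluding that space.
    """
    pieces = []
    for match in _WORD_RE.finditer(text):
        token = match.group()
        start = match.start() + (1 if token.startswith(" ") else 0)
        pieces.append((token, start, match.end()))
    return pieces


def split_plain(text):
    """Same split, but with the leading spaces stripped from every token."""
    return [token.strip() for token, _, _ in split_keep_leading_space(text)]


text = "Hello world, again"
with_spaces = [token for token, _, _ in split_keep_leading_space(text)]
without_spaces = split_plain(text)
print(with_spaces)     # ['Hello', ' world', ',', ' again']
print(without_spaces)  # ['Hello', 'world', ',', 'again']
```

With a byte-level BPE tokenizer, " world" and "world" are generally encoded as different token sequences, so passing words without their leading space can push the input away from what the encoder saw during pretraining. Whether this matters for a given GLiNER checkpoint is an open question here, not a measured result.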

(It seems to be working well so perhaps this isn't an issue, but it feels like the kind of thing that might result in mysterious underperformance)

Knowledgator Engineering org

@johnowhitaker, thank you for pointing this out. It could explain why we get poor results with the uni-encoder token-level GLiNER and why the ModernBERT version generally requires more data. This bi-encoder GLiNER is span-level, so that may mitigate the issue, but it is worth investigating more deeply.