Adding new stopwords list.

#4
by Minuano - opened

Can I submit a PR for a stopwords list for a new language? Would I have to submit a PR to fastembed as well to add the new language to some factory class? If you could be so kind as to point me in the right direction for this, I'd be grateful.

Qdrant org

Hey @Minuano

Unfortunately, a stopwords file is not the only thing that needs to be added to support a new language.
The stemmer has to be updated as well; we're using a port of the Snowball stemmer for it: https://github.com/qdrant/py-rust-stemmers

What is feasible instead is to disable the stemmer in fastembed and do the preprocessing on the user's side.
Disabling the stemmer is possible as of fastembed 0.5.0, by setting the disable_stemmer argument to True.
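
A minimal sketch of that workaround, under a few assumptions: the model name "Qdrant/bm25" and passing disable_stemmer through SparseTextEmbedding are assumed here, and STOPWORDS, naive_stem and preprocess are hypothetical placeholders, not part of fastembed. A real setup would use a proper stopwords list and stemmer for the target language.

```python
import re

# Hypothetical stopword list for the target language; replace with a real one.
STOPWORDS = {"o", "a", "os", "as", "de", "e", "que"}


def naive_stem(token: str) -> str:
    """Placeholder stemmer: strips a trailing 's' from longer tokens.

    This is only a stand-in for a real language-specific stemmer.
    """
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def preprocess(text: str) -> str:
    """Lowercase, tokenize, drop stopwords, and stem on the user's side."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return " ".join(kept)


def embed_preprocessed(texts):
    """Embed user-preprocessed texts with fastembed's stemmer disabled.

    Assumes fastembed >= 0.5.0 is installed; imported lazily so the
    preprocessing helpers above work without it.
    """
    from fastembed import SparseTextEmbedding

    # Assumption: disable_stemmer is accepted as a keyword argument here.
    model = SparseTextEmbedding(model_name="Qdrant/bm25", disable_stemmer=True)
    return list(model.embed([preprocess(t) for t in texts]))
```

The same preprocess function would need to be applied to queries as well as documents, so that both sides produce matching tokens.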

I appreciate your response, @jmzzomg.
I was going to ask if I could submit a PR to your port of the stemmer, but then I noticed it's written in Rust with Python bindings, with which I unfortunately have little to no experience. I understand: since latency is a very real concern when it comes to text preprocessing, it's simply not feasible to implement such libraries in pure Python. As such, I'm going to implement your solution while I wait for, hopefully, someone else to take up the mantle.

Minuano changed discussion status to closed
