The current unigrams.txt file appears to be empty. For my master's thesis I built a KenLM model from scratch on the same ParlaSpeech-HR v1.0 dataset (JSONL file), and this is the resulting unigrams.txt, which I found to work rather well.
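For anyone who wants to try something similar, here is a minimal sketch of how a unigram list could be collected from a JSONL file. It is not necessarily the procedure used for the file in this pull request. The field name `text`, the lowercasing, the whitespace tokenization, and the one-token-per-line output are all assumptions for illustration, and the helper names `collect_unigrams` and `write_unigrams` are hypothetical.

```python
import json
from collections import Counter


def collect_unigrams(jsonl_path, text_field="text"):
    """Count lowercased, whitespace-separated tokens in a JSONL file.

    `text_field` is an assumed field name; adjust it to the dataset's schema.
    The field may hold either a string or a list of words.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record.get(text_field, "")
            if isinstance(text, list):
                text = " ".join(str(word) for word in text)
            counts.update(text.lower().split())
    return counts


def write_unigrams(counts, out_path):
    """Write each unique token on its own line, sorted by code point."""
    with open(out_path, "w", encoding="utf-8") as f:
        for token in sorted(counts):
            f.write(token + "\n")
```

The counts are kept in a `Counter` so that rare tokens can be filtered out before writing, if that turns out to help.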

CLASSLA - CLARIN Knowledge Centre for South Slavic Languages org

You have probably seen it already: the much larger ParlaSpeech-HR v2.0 is now available as well (https://huggingface.co/datasets/classla/ParlaSpeech-HR), if you have good use cases for it. @5roop will look into your request and merge it after inspection, thanks!

I see that we otherwise share similar interests; I would not mind exchanging insights and plans going forward.

5roop changed pull request status to merged
edited Aug 2

Thanks for your contribution, @porupski. I tested your unigrams on the two files we have in the repo, and the new version works fine. It would be good to check performance on a non-ParlaSpeech-HR dataset as well, but let's leave that for a later date.
