Guidelines for training an Urdu tokenizer for a much larger corpus
#1 · by hadidev · opened
Hi, I'm trying to train a tokenizer as well as a BERT model on Urdu data. Can you share the step-by-step process you used for this model?
Installation
Urduhack officially supports Python 3.6–3.7, and runs great on PyPy.
Installing with the TensorFlow CPU version:
$ pip install urduhack[tf]
Installing with the TensorFlow GPU version:
$ pip install urduhack[tf-gpu]
Usage
import urduhack

# Downloading models
urduhack.download()

nlp = urduhack.Pipeline()
text = ""  # Urdu input text goes here
doc = nlp(text)
for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")
    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")