---
language:
- ur
tags:
- ner
---

# NER in Urdu

## muril_base_cased_urdu_ner

The base model is [google/muril-base-cased](https://huggingface.co/google/muril-base-cased), a BERT model pre-trained on 17 Indian languages and their transliterated counterparts.

The Urdu NER dataset was translated from the Hindi NER dataset [HiNER](https://github.com/cfiltnlp/HiNER).

## Usage

### Example

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("MichaelHuang/muril_base_cased_urdu_ner")
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Define the labels dictionary
labels_dict = {
    0: "B-FESTIVAL",
    1: "B-GAME",
    2: "B-LANGUAGE",
    3: "B-LITERATURE",
    4: "B-LOCATION",
    5: "B-MISC",
    6: "B-NUMEX",
    7: "B-ORGANIZATION",
    8: "B-PERSON",
    9: "B-RELIGION",
    10: "B-TIMEX",
    11: "I-FESTIVAL",
    12: "I-GAME",
    13: "I-LANGUAGE",
    14: "I-LITERATURE",
    15: "I-LOCATION",
    16: "I-MISC",
    17: "I-NUMEX",
    18: "I-ORGANIZATION",
    19: "I-PERSON",
    20: "I-RELIGION",
    21: "I-TIMEX",
    22: "O"
}

def ner_predict(sentence, model, tokenizer, labels_dict):
    # Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted labels
    predicted_labels = torch.argmax(outputs.logits, dim=2)

    # Convert tokens and labels to lists
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = predicted_labels.squeeze().tolist()

    # Map numeric labels to string labels
    predicted_labels = [labels_dict[label] for label in labels]

    # Combine tokens and labels
    result = list(zip(tokens, predicted_labels))

    return result

test_sentence = "امیتابھ اور ریکھا کی فلم 'گنگا کی سوگندھ' 10 فروری سنہ 1978 کو ریلیز ہوئی تھی۔ اس کے بعد راکھی، رندھیر کپور اور نیتو سنگھ کے ساتھ 'قسمے وعدے' 21 اپریل 1978 کو ریلیز ہوئی۔"
predictions = ner_predict(test_sentence, model, tokenizer, labels_dict)

for token, label in predictions:
    print(f"{token}: {label}")
```
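
The example above prints one label per sub-word token, including special tokens. A minimal sketch of one way to turn those `(token, label)` pairs into word-level entity spans follows. It is not part of the original model card: `group_entities` is a hypothetical helper, and it assumes the tokenizer marks word continuations with a `##` prefix (the usual BERT WordPiece convention) and uses `[CLS]`, `[SEP]`, and `[PAD]` as special tokens. The sample pairs are hand-written placeholders, not real model output.

```python
def group_entities(token_label_pairs, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge sub-word tokens into words, then group BIO tags into entity spans.

    Hypothetical helper; assumes '##' marks a WordPiece continuation.
    """
    # Step 1: rebuild words. Each word keeps the label of its first sub-token.
    words = []
    for token, label in token_label_pairs:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1][0] += token[2:]
        else:
            words.append([token, label])

    # Step 2: collect consecutive B-/I- words of the same type into one span.
    entities = []
    current = None  # [entity_type, [word, word, ...]]
    for word, label in words:
        prefix, _, entity_type = label.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != entity_type)
        )
        if starts_new:
            if current is not None:
                entities.append(current)
            current = [entity_type, [word]]
        elif prefix == "I":
            current[1].append(word)
        else:  # "O": close any open span
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)

    return [(" ".join(span_words), entity_type) for entity_type, span_words in entities]


# Hand-written sample pairs for illustration only (not model output).
sample_pairs = [
    ("[CLS]", "O"),
    ("Ami", "B-PERSON"), ("##tabh", "I-PERSON"),
    ("and", "O"),
    ("Ran", "B-PERSON"), ("##dhir", "B-PERSON"), ("Kapoor", "I-PERSON"),
    ("10", "B-TIMEX"), ("February", "I-TIMEX"),
    ("[SEP]", "O"),
]

for text, entity_type in group_entities(sample_pairs):
    print(f"{entity_type}: {text}")
```

With the output of `ner_predict`, the call would be `group_entities(predictions)`. Taking the label of the first sub-token for each word is one common choice; voting across sub-tokens is an alternative if the first-token label proves noisy.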