---
language:
- ur
tags:
- ner
---

# NER in Urdu
## muril_base_cased_urdu_ner

The base model is [google/muril-base-cased](https://huggingface.co/google/muril-base-cased), a BERT model pre-trained on 17 Indian languages and their transliterated counterparts.
The Urdu NER dataset is translated from the Hindi NER dataset [HiNER](https://github.com/cfiltnlp/HiNER).

## Usage
### Example
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("MichaelHuang/muril_base_cased_urdu_ner")
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Map label ids to BIO tag names
labels_dict = {
    0: "B-FESTIVAL",
    1: "B-GAME",
    2: "B-LANGUAGE",
    3: "B-LITERATURE",
    4: "B-LOCATION",
    5: "B-MISC",
    6: "B-NUMEX",
    7: "B-ORGANIZATION",
    8: "B-PERSON",
    9: "B-RELIGION",
    10: "B-TIMEX",
    11: "I-FESTIVAL",
    12: "I-GAME",
    13: "I-LANGUAGE",
    14: "I-LITERATURE",
    15: "I-LOCATION",
    16: "I-MISC",
    17: "I-NUMEX",
    18: "I-ORGANIZATION",
    19: "I-PERSON",
    20: "I-RELIGION",
    21: "I-TIMEX",
    22: "O"
}

def ner_predict(sentence, model, tokenizer, labels_dict):
    # Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted label id for each token
    predicted_labels = torch.argmax(outputs.logits, dim=2)

    # Convert token ids back to tokens, and label ids to a list
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = predicted_labels.squeeze().tolist()

    # Map numeric labels to string labels
    predicted_labels = [labels_dict[label] for label in labels]

    # Combine tokens and labels
    result = list(zip(tokens, predicted_labels))

    return result

test_sentence = "امیتابھ اور ریکھا کی فلم 'گنگا کی سوگندھ' 10 فروری سنہ 1978 کو ریلیز ہوئی تھی۔ اس کے بعد راکھی، رندھیر کپور اور نیتو سنگھ کے ساتھ 'قسمے وعدے' 21 اپریل 1978 کو ریلیز ہوئی۔"
predictions = ner_predict(test_sentence, model, tokenizer, labels_dict)

for token, label in predictions:
    print(f"{token}: {label}")
```
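The example above prints one BIO tag per token. If you want whole entity spans instead, consecutive `B-`/`I-` tags of the same type can be merged. The helper below is a minimal sketch of that post-processing step (it is not part of this model card's API, and the sample `pairs` are illustrative; real output from `ner_predict` will also contain special tokens like `[CLS]` and `##`-prefixed WordPiece subwords that you may want to filter or merge first):

```python
def group_entities(token_label_pairs):
    """Collect consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in token_label_pairs:
        if label.startswith("B-"):
            # A B- tag starts a new span; flush any span in progress
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the current span
            current_tokens.append(token)
        else:
            # "O" (or a mismatched I- tag) ends the current span
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

# Illustrative (token, label) pairs in the shape ner_predict returns
pairs = [
    ("امیتابھ", "B-PERSON"),
    ("اور", "O"),
    ("ریکھا", "B-PERSON"),
    ("10", "B-TIMEX"),
    ("فروری", "I-TIMEX"),
]
print(group_entities(pairs))
# [('PERSON', 'امیتابھ'), ('PERSON', 'ریکھا'), ('TIMEX', '10 فروری')]
```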