  ---
  # AraT5v2-base-1024
 
<span style="color:red"><b>We recommend using AraT5v2-base-1024 instead of the previous version (AraT5-base).</b></span>

  ## What's new?
- **More data.** `AraT5v2-base-1024` is trained on multiple varieties of Arabic data.
- **Larger sequence length.** We increased the sequence length from 512 to 1024 in this version.
- **Faster convergence.** `AraT5v2-base-1024` converges more than 10x faster than the previous version (AraT5-base).
- **Extra IDs.** `AraT5v2-base-1024` supports 100 sentinel tokens (a.k.a. unique mask tokens); see the sketch after this list.
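
As a quick check of the sentinel tokens mentioned above, the minimal sketch below lists them via the tokenizer. This snippet is not part of the original card; it assumes the standard T5-style tokenizer setup in 🤗 Transformers, where the extra IDs `<extra_id_0>` … `<extra_id_99>` are registered as additional special tokens.

```python
from transformers import T5Tokenizer

# Load the AraT5v2 tokenizer (a standard T5-style tokenizer from Transformers).
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")

# The sentinel (mask) tokens are exposed as additional special tokens,
# so we can collect and count them directly.
sentinels = [t for t in tokenizer.additional_special_tokens if t.startswith("<extra_id_")]

print("Number of sentinel tokens:", len(sentinels))  # expected: 100
print("Id of <extra_id_0>:", tokenizer.convert_tokens_to_ids("<extra_id_0>"))
```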

## An example of predicting a masked token

```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5v2-base-1024")

# Arabic prompt: "The capital of Germany is <extra_id_0>"; the sentinel token
# marks the span the model should fill in.
prompt = "عاصمة ألمانيا هي <extra_id_0> "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Tokenized input:", tokenizer.tokenize(prompt))
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Output:
```bash
Tokenized input: ['▁عاصمة', '▁ألمانيا', '▁هي', '<extra_id_0>']
Decoded output: برلين
```
The decoded output برلين is Arabic for "Berlin".