UBC-NLP/AraT5v2-base-1024

Arabic Machine Translation

Arabic Text Summarization

Arabic News Title and Question Generation

Arabic Paraphrasing and Transliteration

Arabic Code-Switched Translation

elmadany committed on Aug 16, 2023

Commit 7b9eb95 · 1 Parent(s): 5e127ab

Update README.md

Files changed (1)

README.md +21 -2

@@ -14,17 +14,36 @@ tags:
 ---
 # AraT5v2-base-1024
 ## What's new?
-- **More data.** AraT5v2-base-1024 trained on multiple varieties of Arabic data.
 - **Large sequence length.** We increase the sequence length from 512 to 1024 in this version.
 - **Converge faster.** AraT5v2-base-1024 converges more than 10x faster than the previous version (AraT5-base).
 - **Extra IDs.** AraT5v2-base-1024 supports 100 sentinel tokens (a.k.a. unique mask tokens).
-<span style="color:red"><b>We recommend using AraT5v2-base-1024 instead of the previous version (AraT5-base).</b></span>

 ---
 # AraT5v2-base-1024
+<span style="color:red"><b>We recommend using AraT5v2-base-1024 instead of the previous version (AraT5-base).</b></span>
 ## What's new?
+- **More data.** `AraT5v2-base-1024` was trained on multiple varieties of Arabic data.
 - **Large sequence length.** We increase the sequence length from 512 to 1024 in this version.
 - **Converge faster.** AraT5v2-base-1024 converges more than 10x faster than the previous version (AraT5-base).
 - **Extra IDs.** AraT5v2-base-1024 supports 100 sentinel tokens (a.k.a. unique mask tokens).
+## An example of masked-token prediction
+```python
+from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
+# Load the tokenizer and model from the Hugging Face Hub.
+tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")
+model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5v2-base-1024")
+# Prompt: "The capital of Germany is <extra_id_0>"; the sentinel marks the span to predict.
+prompt = "عاصمة ألمانيا هي <extra_id_0> "
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+outputs = model.generate(input_ids)
+print("Tokenized input:", tokenizer.tokenize(prompt))
+print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+Output:
+```bash
+Tokenized input: ['▁عاصمة', '▁ألمانيا', '▁هي', '<extra_id_0>']
+Decoded output: برلين
+```
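
The "Extra IDs" bullet in the README says the model supports 100 sentinel tokens, and the example uses `<extra_id_0>`. As a rough illustration of how several sentinels are usually combined in the standard T5 span-corruption convention, here is a minimal pure-Python sketch. The helper `mask_spans` is hypothetical and not part of this commit or of `transformers`; it only builds the input and target strings and does not call the model.

```python
def mask_spans(tokens, spans, max_sentinels=100):
    """Replace each (start, end) token span with a T5-style sentinel.

    Hypothetical helper for illustration only. Returns (source, target)
    strings in the usual span-corruption layout, where the target ends
    with one extra closing sentinel.
    """
    # One sentinel is reserved for the closing marker in the target.
    if len(spans) >= max_sentinels:
        raise ValueError("not enough sentinel tokens for this many spans")

    source, target, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[cursor:start])  # keep text before the span
        source.append(sentinel)              # hide the span in the input
        target.append(sentinel)              # announce the span in the target
        target.extend(tokens[start:end])     # followed by the hidden tokens
        cursor = end
    source.extend(tokens[cursor:])           # keep text after the last span
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(source), " ".join(target)


# Mask the last word of the README's prompt sentence.
words = ["عاصمة", "ألمانيا", "هي", "برلين"]
source, target = mask_spans(words, [(3, 4)])
print(source)
print(target)
```

Each masked span gets the next sentinel index, so an input with two hidden spans would contain `<extra_id_0>` and `<extra_id_1>`, and the target would list the hidden tokens after the matching sentinel.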