Byte Level (BPE) Tokenizer for Arabic
Byte Level Tokenizer for Arabic, a robust tokenizer designed to handle Arabic text with precision and efficiency.
This tokenizer utilizes a Byte-Pair Encoding (BPE)
approach to create a vocabulary of 50,000
tokens, catering specifically to the intricacies of the Arabic language.
Goal
This tokenizer was created as part of the development of an Arabic BART transformer model for summarization from scratch using PyTorch
.
In adherence to the configurations outlined in the official BART paper, which specifies the use of BPE tokenization, I sought a BPE tokenizer specifically tailored for Arabic.
While there are Arabic-only tokenizers and multilingual BPE tokenizers, a dedicated Arabic BPE tokenizer was not available. This gap inspired the creation of a BPE
tokenizer focused solely on Arabic, ensuring alignment with BART's recommended configurations and enhancing the effectiveness of Arabic NLP tasks.
Checkpoint Information
- Name:
IsmaelMousa/arabic-bpe-tokenizer
- Vocabulary Size:
50,000
Overview
The Byte Level Tokenizer is optimized to manage Arabic text, which often includes a range of diacritics, different forms of the same word, and various prefixes and suffixes. This tokenizer addresses these challenges by breaking down text into byte-level tokens, ensuring that it can effectively process and understand the nuances of the Arabic language.
Features
- Byte-Pair Encoding (BPE): Efficiently manages a large vocabulary size while maintaining accuracy.
- Comprehensive Coverage: Handles Arabic script, including diacritics and various word forms.
- Flexible Integration: Easily integrates with the
tokenizers
library for seamless tokenization.
Installation
To use this tokenizer, you need to install the tokenizers
library. If you haven’t installed it yet, you can do so using pip:
pip install tokenizers
Example Usage
Here is an example of how to use the Byte Level Tokenizer with the tokenizers
library.
This example demonstrates tokenization of the Arabic sentence "لاشيء يعجبني, أريد أن أبكي":
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")
text = "لاشيء يعجبني, أريد أن أبكي"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)
print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)
output:
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>']
Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2]
Decoded Text: لا شيء يعجبني, أريد أن أبكي
Tokenizer Details
- Byte-Level Tokenization: This method ensures that every byte of input text is considered, making it suitable for languages with complex scripts.
- Adaptability: Can be fine-tuned or used as-is, depending on your specific needs and application scenarios.
License
This project is licensed under the MIT
License.