IsmaelMousa/arabic-bpe-tokenizer

Byte Level (BPE) Tokenizer for Arabic

This is a byte-level tokenizer for Arabic text. It uses Byte-Pair Encoding (BPE) with a vocabulary of 50,000 tokens and is built specifically for the Arabic language.

Goal

I created this tokenizer while developing an Arabic BART transformer model for summarization from scratch in PyTorch. The official BART paper specifies BPE tokenization, so I looked for a BPE tokenizer tailored to Arabic. Arabic-only tokenizers and multilingual BPE tokenizers exist, but I could not find a dedicated Arabic BPE tokenizer. This tokenizer fills that gap: it targets Arabic only, keeps the model aligned with BART's recommended configuration, and is intended to be useful for other Arabic NLP tasks as well.

Checkpoint Information

Name: IsmaelMousa/arabic-bpe-tokenizer
Vocabulary Size: 50,000

Overview

Arabic text often includes diacritics, several surface forms of the same word, and a variety of prefixes and suffixes. The tokenizer handles this by working on the bytes of the text rather than on a fixed set of characters, so every character, including diacritics, can be represented as a sequence of byte-level tokens.
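As a small illustration of what "byte-level" means for Arabic, the sketch below uses only the Python standard library to show that each Arabic letter and each diacritic is a separate code point, and that each one takes two bytes in UTF-8. The example word and variable names are chosen for illustration only; stripping diacritics is shown just to make them visible and is not something this tokenizer is stated to do.

```python
import unicodedata

# "ktb" written without diacritics, then with a fatha on every consonant.
plain = "\u0643\u062a\u0628"
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"

for label, text in (("plain", plain), ("vocalized", vocalized)):
    data = text.encode("utf-8")
    print(f"{label}: {len(text)} code points, {len(data)} bytes")
    for ch in text:
        # combining() is non-zero for combining marks such as Arabic diacritics
        kind = "diacritic" if unicodedata.combining(ch) else "letter"
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)} ({kind}) -> {ch.encode('utf-8').hex(' ')}")

# Removing the combining marks recovers the undiacritized spelling.
stripped = "".join(ch for ch in vocalized if not unicodedata.combining(ch))
print(stripped == plain)
```

A byte-level tokenizer sees both spellings as byte sequences, so the diacritized form needs no special handling; it is simply a longer sequence.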

Features

Byte-Pair Encoding (BPE): Learns frequent subword units, so common words become single tokens while rare words are split into smaller pieces, all within a fixed vocabulary of 50,000 tokens.
Comprehensive Coverage: Handles Arabic script, including diacritics and various word forms.
Flexible Integration: Loads directly with the tokenizers library through Tokenizer.from_pretrained.
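To make the BPE idea concrete, here is a toy sketch of the merge-learning loop in plain Python. It is not the training code used for this tokenizer: the tiny corpus, the function names, and the number of merges are all made up for illustration, and a real trainer (such as the one in the tokenizers library) is far more efficient.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() keeps the first pair that was seen.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Start from single UTF-8 bytes and greedily learn `num_merges` merges."""
    words = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical toy corpus: five words built around the same root.
corpus = ["كتاب", "كتب", "كاتب", "مكتبة", "كتابة"]
merges, words = learn_merges(corpus, 6)

for a, b in merges:
    # Early merges may be partial characters, hence errors="replace".
    print((a + b).hex(" "), (a + b).decode("utf-8", errors="replace"))
for symbols in words:
    print(len(symbols), "symbols for", b"".join(symbols).decode("utf-8"))
```

Each merge adds one entry to the vocabulary, so running the loop until the vocabulary reaches the target size is, in outline, how a fixed-size BPE vocabulary is obtained.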

Installation

To use this tokenizer, you need to install the tokenizers library. If you haven’t installed it yet, you can do so using pip:

pip install tokenizers

Example Usage

Here is an example of how to use the Byte Level Tokenizer with the tokenizers library.

This example demonstrates tokenization of the Arabic sentence "لا شيء يعجبني, أريد أن أبكي":

from tokenizers import Tokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لا شيء يعجبني, أريد أن أبكي"

# Encode the text, then decode the IDs back into a string
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)

Output:

Encoded Tokens: ['<s>', 'ÙĦØ§', 'ĠØ´ÙĬØ¡', 'ĠÙĬØ¹', 'Ø¬Ø¨', 'ÙĨÙĬ', ',', 'ĠØ£Ø±ÙĬØ¯', 'ĠØ£ÙĨ', 'ĠØ£Ø¨', 'ÙĥÙĬ', '</s>'] 

Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2] 

Decoded Text: لا شيء يعجبني, أريد أن أبكي
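The encoded tokens look garbled because byte-level BPE prints each UTF-8 byte as a stand-in printable character; they are not corrupted. The sketch below rebuilds the byte-to-character table used by GPT-2 style byte-level tokenizers and uses it to translate the first two tokens from the output back into Arabic. It assumes this tokenizer uses that standard mapping, which the tokens above are consistent with; the helper names are my own and are not part of the tokenizers API.

```python
def bytes_to_unicode():
    """Byte -> printable character table of GPT-2 style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    # Every other byte is shifted to an unused code point starting at U+0100.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_byte_level(text):
    """Show text the way a byte-level vocabulary displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_byte_level(token_text):
    """Turn byte-level token text back into readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in token_text)
    return raw.decode("utf-8", errors="replace")

# First two non-special tokens from the output above.
tokens = ["ÙĦØ§", "ĠØ´ÙĬØ¡"]
print(from_byte_level("".join(tokens)))
print(to_byte_level("لا"))
```

In this mapping the space byte is shown as "Ġ", which is why tokens that start a new word begin with that character.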

Tokenizer Details

Byte-Level Tokenization: The text is tokenized over its bytes, so every byte of the input is accounted for. This makes the approach suitable for languages with complex scripts.
Adaptability: Can be used as-is or adapted further, depending on your specific needs and application scenarios.

License

This project is licensed under the MIT License.