PersianBPETokenizer Model Card

Model Details

Model Description

The PersianBPETokenizer is a custom tokenizer for Persian (Farsi) text. It uses the Byte-Pair Encoding (BPE) algorithm to learn a subword vocabulary from Persian data, and it is packaged for use with Transformer language models such as BERT and RoBERTa in Persian NLP tasks.
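For readers new to BPE, the following is a minimal, library-free sketch of how merges are learned: count adjacent symbol pairs, fuse the most frequent pair, and repeat. The function name `learn_bpe_merges` and the three sample words are illustrative assumptions. The released tokenizer was trained with the Hugging Face tokenizers library, not with this code.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical sample words sharing the stem "کتاب".
merges, vocab = learn_bpe_merges(["کتاب", "کتابها", "کتابخانه"], num_merges=3)
print(merges)
```

After three merges the shared stem becomes a single symbol, which is the behavior that lets a BPE vocabulary reuse frequent word parts.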

Model Type

Tokenization Algorithm: Byte-Pair Encoding (BPE)
Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
Pre-tokenization: Whitespace
Post-processing: TemplateProcessing for special tokens

Model Version

Version: 1.0
Date: September 6, 2024

License

License: MIT

Developers

Developed by: Mohammad Shojaei
Contact: [email protected]

Citation

If you use this tokenizer in your research, please cite it as:

Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.

Model Use

Intended Use

Primary Use: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
Secondary Use: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.

Out-of-Scope Use

Non-Persian Text: This tokenizer is not designed for languages other than Persian.
Non-NLP Tasks: It is not intended for use in non-NLP tasks such as image processing or audio analysis.

Data

Training Data

Dataset: mshojaei77/PersianTelegramChannels
Description: Persian text extracted from various Telegram channels. The data covers a diverse range of language patterns and vocabulary, which makes it suitable for training a general-purpose Persian tokenizer.
Size: 60,730 samples

Data Preprocessing

Normalization: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
Pre-tokenization: Used whitespace pre-tokenization.
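To make the effect of these normalization steps concrete, here is a rough standard-library approximation. It is a sketch, not the tokenizer's own normalizer: the function name `normalize_persian` and its `zwnj_replacement` parameter are assumptions, and the library's definition of an accent may differ in edge cases.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian orthography


def normalize_persian(text: str, zwnj_replacement: str = "") -> str:
    """Approximate the NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ) chain."""
    # NFD: decompose characters so combining marks become separate code points.
    text = unicodedata.normalize("NFD", text)
    # StripAccents (approximation): drop combining marks, i.e. categories Mn/Mc/Me.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))
    # Lowercase: only affects cased scripts such as Latin; Persian letters have no case.
    text = text.lower()
    # Strip: trim leading and trailing whitespace.
    text = text.strip()
    # Replace: remove ZWNJ by default (the card says ZWNJ characters were removed).
    return text.replace(ZWNJ, zwnj_replacement)


print(normalize_persian("  می\u200cروم Hello  "))
```

One consequence worth knowing: under NFD, "آ" (U+0622) decomposes into "ا" plus a combining madda, so stripping marks turns "آب" into "اب". Short vowel marks such as kasra are removed the same way.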

Performance

Evaluation Metrics

Tokenization Accuracy: The tokenizer has been spot-checked on various Persian sentences for tokenizing and encoding text. No quantitative accuracy figures are reported.
Compatibility: Loads through Hugging Face Transformers (for example via AutoTokenizer), so it can be used with models from that library.

Known Limitations

Vocabulary Size: The vocabulary is learned from the training data. For very specialized domains, retraining or extending the tokenizer on domain-specific data may be required.
Out-of-Vocabulary Words: Rare or domain-specific words may be split into many small subword pieces, and text containing characters that never appeared in the training data may be mapped to the unknown token ([UNK]).
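One way to gauge how much this limitation matters for a given domain is to measure the share of unknown tokens on a sample of your own text. The helper below is a minimal sketch; the name `unk_rate` is an assumption and not part of the tokenizer or of any library.

```python
def unk_rate(tokens, unk_token="[UNK]"):
    """Return the fraction of tokens equal to the unknown token."""
    if not tokens:
        return 0.0
    return sum(tok == unk_token for tok in tokens) / len(tokens)


# Stand-in token list; in practice pass persian_tokenizer.tokenize(your_text).
print(unk_rate(["سلام", "[UNK]", "دنیا", "[UNK]"]))
```

A consistently high rate on domain text suggests that retraining on domain-specific data is worth considering.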

Training Procedure

Training Steps

Environment Setup: Installed necessary libraries (datasets, tokenizers, transformers).
Data Preparation: Loaded the mshojaei77/PersianTelegramChannels dataset and created a batch iterator for efficient training.
Tokenizer Model: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
Training: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
Post-processing: Set up post-processing to handle special tokens.
Saving: Saved the tokenizer to disk for future use.
Compatibility: Converted the tokenizer to a PreTrainedTokenizerFast object for compatibility with Hugging Face Transformers.
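The steps above can be sketched end to end with the tokenizers and transformers libraries. This is an illustrative reconstruction under stated assumptions, not the author's training script: the three-sentence corpus stands in for the real dataset, `vocab_size=500` is a hypothetical value (the card does not state one), the ZWNJ replacement string and the BERT-style template are assumed, and the output file name is made up.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers
from transformers import PreTrainedTokenizerFast

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Tiny stand-in corpus; the real tokenizer was trained on
# mshojaei77/PersianTelegramChannels.
corpus = [
    "سلام، چطور هستید؟",
    "امیدوارم روز خوبی داشته باشید",
    "این یک جمله آزمایشی است",
]

# BPE model with an explicit unknown token.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalization steps listed in the card; replacing ZWNJ with "" is an assumption.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
    normalizers.Strip(),
    normalizers.Replace("\u200c", ""),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train; vocab_size is hypothetical.
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=SPECIAL_TOKENS)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap sequences in [CLS] ... [SEP] (assumed BERT-style template).
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Save to disk (temporary directory, hypothetical file name).
tokenizer.save(os.path.join(tempfile.gettempdir(), "persian_bpe_sketch.json"))

# Wrap for use with Hugging Face Transformers.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
print(fast_tokenizer.tokenize("سلام"))
```

A vocabulary trained on three sentences is only useful for checking that the pipeline runs; real training would feed the full dataset in batches.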

Hyperparameters

Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
Batch Size: 1000 samples per batch
Normalization Steps: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
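The batch size refers to how many text samples are handed to the trainer at a time. A minimal sketch of such a batch iterator follows; the name `batch_iterator` and the stand-in list of strings are assumptions, and with a Hugging Face dataset the input would be the relevant text column instead.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each holding up to `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Stand-in data: 2,500 short strings instead of the real dataset.
samples = [f"متن {i}" for i in range(2500)]
print([len(batch) for batch in batch_iterator(samples)])
```

With the 60,730 training samples and a batch size of 1000, an iterator like this would yield 61 batches: 60 full ones and a final batch of 730.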

How to Use

Installation

To use the PersianBPETokenizer, first install the required libraries:

pip install -q --upgrade datasets tokenizers transformers

Loading the Tokenizer

You can load the tokenizer using the Hugging Face Transformers library:

from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

Tokenization Example

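test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"

# Split the sentence into subword tokens (no special tokens are added here)
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

# Encode to input IDs; the post-processor adds the special tokens
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])

# Decode the IDs back to text (normalization is not reversed)
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))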

Acknowledgments

Dataset: mshojaei77/PersianTelegramChannels
Libraries: Hugging Face datasets, tokenizers, and transformers
