metadata
language: kn
tags:
  - kannada
  - tokenizer
  - bpe
  - nlp
  - huggingface
license: mit
datasets:
  - Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation

Kannada Tokenizer


This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.

Model Description

  • Model Type: Byte-Pair Encoding (BPE) Tokenizer
  • Language: Kannada (kn)
  • Vocabulary Size: 32,000
  • Special Tokens:
    • [UNK] (Unknown token)
    • [PAD] (Padding token)
    • [CLS] (Classifier token)
    • [SEP] (Separator token)
    • [MASK] (Masking token)
  • License: MIT License
  • Dataset Used: Cognitive-Lab/Kannada-Instruct-dataset
  • Algorithm: Byte-Pair Encoding (BPE)

Intended Use

This tokenizer is intended for NLP applications involving the Kannada language, such as:

  • Language Modeling
  • Text Generation
  • Text Classification
  • Machine Translation
  • Named Entity Recognition
  • Question Answering
  • Summarization

How to Use

You can load the tokenizer directly from the Hugging Face Hub:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)  # list of token IDs
tokens = tokenizer.convert_ids_to_tokens(encoding)  # IDs -> token strings
decoded_text = tokenizer.decode(encoding)  # IDs -> text

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)

Output:

Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?

Training Data

The tokenizer was trained on the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.

  • Dataset Size: The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
  • Data Preprocessing: Text was normalized to Unicode NFKC form to standardize character representations.
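As a rough illustration of what NFKC normalization does to input text, the sketch below uses Python's standard `unicodedata` module. The helper name `normalize_nfkc` and the sample strings are assumptions made for this example; they are not taken from the original training script.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return the Unicode NFKC form of `text` (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Full-width digits are compatibility variants of the ASCII digits,
# so NFKC folds them into their plain forms.
raw = "ಕನ್ನಡ １２３"
print(normalize_nfkc(raw))

# Some Kannada vowel signs can be encoded either as one code point or as a
# two-part sequence; normalizing both lets us compare the spellings directly.
split_form = "\u0c95\u0cbf\u0cd5"   # KA + VOWEL SIGN I + LENGTH MARK
composed_form = "\u0c95\u0cc0"      # KA + VOWEL SIGN II
print(normalize_nfkc(split_form) == normalize_nfkc(composed_form))
```

Applying the same normalization to text at inference time keeps the input consistent with what the tokenizer saw during training.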

Training Procedure

  • Normalization: NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
  • Pre-tokenization: The text was pre-tokenized using whitespace splitting.
  • Tokenizer Algorithm: Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
  • Vocabulary Size: Set to 32,000 to balance coverage and efficiency.
  • Special Tokens: Included [UNK], [PAD], [CLS], [SEP], [MASK] to support various downstream tasks.
  • Training Library: The tokenizer was built using the Hugging Face Tokenizers library.
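The steps above can be sketched with the Hugging Face Tokenizers library roughly as follows. This is a minimal sketch, not the author's actual training script: the function name `train_bpe_tokenizer`, the two-sentence sample corpus, and the small `vocab_size` in the demo call are assumptions for illustration. The choice of the `Whitespace` pre-tokenizer (which also separates punctuation) rather than `WhitespaceSplit` is likewise an assumption.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]


def train_bpe_tokenizer(texts, vocab_size=32000):
    """Train a BPE tokenizer on an iterable of strings (hypothetical helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = NFKC()            # Unicode NFKC normalization
    tokenizer.pre_tokenizer = Whitespace()   # assumed pre-tokenizer choice
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Tiny stand-in corpus; the real tokenizer was trained on the
# translated_output column of the dataset.
sample_corpus = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ."]
demo_tokenizer = train_bpe_tokenizer(sample_corpus, vocab_size=200)
print(demo_tokenizer.encode("ನೀವು ಹೇಗಿದ್ದೀರಿ?").tokens)
```

With a corpus this small the learned vocabulary stays far below the requested size; the 32,000 setting only becomes the effective limit on a large corpus.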

Evaluation

The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.
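One simple way to quantify tokenization efficiency is the average number of tokens per whitespace-separated word. The sketch below is an assumption about how such a metric could be computed, not a measurement reported for this tokenizer. The function `tokens_per_word` and the character-level stand-in `char_tokenize` are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List


def tokens_per_word(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


sentences = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?"]
print(tokens_per_word(sentences, char_tokenize))
```

To measure the real tokenizer, pass its `tokenize` method as the second argument, for example `tokens_per_word(sentences, tokenizer.tokenize)`. Lower values mean fewer tokens are needed per word.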

Limitations

  • Vocabulary Coverage: While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
  • Biases: The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
  • Out-of-Vocabulary Words: Out-of-vocabulary words may be broken into subword tokens or mapped to the [UNK] token, which could affect performance in downstream tasks.
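The difference between subword splitting and the [UNK] fallback can be seen on a toy tokenizer. The sketch below trains a tiny BPE model on a single Kannada sentence, assuming the same components as described under Training Procedure. The corpus, the vocabulary size, and the test words are all assumptions chosen for the demonstration and do not reflect the behavior of the published vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy tokenizer trained on one sentence (illustration only).
toy = Tokenizer(BPE(unk_token="[UNK]"))
toy.normalizer = NFKC()
toy.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
toy.train_from_iterator(["ನೀವು ಹೇಗಿದ್ದೀರಿ"], trainer=trainer)

# An unseen word built from characters seen in training is split into
# known subword pieces.
seen_chars = toy.encode("ನೀರಿ").tokens
print(seen_chars)

# A word whose characters never appeared in training falls back to [UNK].
unseen_chars = toy.encode("hello").tokens
print(unseen_chars)
```

On real data, checking how often `[UNK]` appears in encoded domain text is a quick way to spot coverage gaps before training a downstream model.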

Ethical Considerations

  • Data Privacy: The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
  • Bias Mitigation: No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.

Recommendations

  • Fine-tuning: For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
  • Evaluation: Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.

Acknowledgments

License

This tokenizer is released under the MIT License.

Citation

If you use this tokenizer in your research or applications, please consider citing it:

@misc{kannada_tokenizer_2023,
  title={Kannada Tokenizer},
  author={charanhu},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}