---
library_name: transformers
tags:
- LLM
- Multilingual
- Transformer
- Non-English
- Tokenizer
- Indian
- Assamese
---

# Assamese Tokenizer (50K Vocabulary)

## Model Details

This repository contains a custom tokenizer for the Assamese language with a 50,000-token vocabulary. It was trained on the Assamese subset of the CC-100 multilingual dataset and can be used for Natural Language Processing (NLP) tasks involving Assamese text.

## Repository Details

- **Repository Name:** tamang0000/assamese-tokenizer-50k
- **Tokenizer Vocabulary Size:** 50,000 tokens
- **Training Dataset:** CC-100 Multilingual Dataset (Assamese Language Subset)
- **Model Type:** Tokenizer
- **Framework:** Hugging Face Transformers
- **License:** MIT License

## Tokenizer Usage

You can load this tokenizer with the Hugging Face `transformers` library by passing the repository name, `tamang0000/assamese-tokenizer-50k`, to a tokenizer loader such as `AutoTokenizer.from_pretrained`.
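
A minimal sketch of loading and calling the tokenizer follows. It assumes the repository can be loaded through `AutoTokenizer`, which this card does not state explicitly, and that `transformers` is installed with network access available on first use. The helper names `load_tokenizer` and `tokenize` are hypothetical and exist only for illustration.

```python
# Repository name as given in this card.
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def load_tokenizer(repo_id: str = REPO_ID):
    """Download (or read from the local cache) and return the tokenizer."""
    # Imported lazily so the helpers can be defined without transformers installed.
    # Assumption: the repository is compatible with AutoTokenizer.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def tokenize(text: str, tokenizer=None) -> dict:
    """Return the tokens, their ids, and the decoded round trip for `text`."""
    tok = tokenizer if tokenizer is not None else load_tokenizer()
    ids = tok.encode(text)
    return {
        "tokens": tok.convert_ids_to_tokens(ids),
        "ids": ids,
        "decoded": tok.decode(ids, skip_special_tokens=True),
    }
```

Calling `tokenize("অসমীয়া ভাষা")` downloads the tokenizer on first use and returns a dictionary with the token strings, their ids, and the decoded text. Comparing `decoded` with the input is a quick way to see how much of the original text survives the tokenizer's normalization.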

## Training Details

- **Dataset:** The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- **Vocabulary Size:** 50,000 tokens.
- **Normalization:** Includes normalization steps such as lowercasing and stripping accents.
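
The card does not say how the normalization steps are implemented. As a rough illustration of what "lowercasing and stripping accents" commonly means, the sketch below uses only the Python standard library and assumes the widely used definition in which accent stripping removes Unicode combining marks (category `Mn`) after NFD decomposition. The function name `normalize` is hypothetical, and the tokenizer's actual pipeline may differ.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase text, then drop combining marks (Unicode category Mn)."""
    lowered = text.lower()
    # NFD splits a precomposed character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Assumption: "stripping accents" means removing category-Mn characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

For example, `normalize("Café")` returns `"cafe"`. Assamese script has no letter case, so lowercasing only affects embedded Latin or other cased text. Under this definition, Assamese dependent signs that Unicode classifies as `Mn` (for example the vowel sign U+09C1) would also be removed, so it is worth checking the tokenizer's real output on your own text.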