

	---
	library_name: transformers
	tags:
	- LLM
	- Multilingual
	- Transformer
	- Non-English
	- Tokenizer
	- Indian
	- Assamese
	---

	# Assamese Tokenizer (50K Vocabulary)

	## Model Details

	This repository contains a custom tokenizer for the Assamese language with a 50,000-token vocabulary, trained on the Assamese subset of the CC-100 multilingual dataset. It can be used for Natural Language Processing (NLP) tasks that involve Assamese text.

	## Repository Details

	- Repository Name: tamang0000/assamese-tokenizer-50k
	- Tokenizer Vocabulary Size: 50,000 tokens
	- Training Dataset: CC-100 Multilingual Dataset (Assamese Language Subset)
	- Model Type: Tokenizer
	- Framework: Hugging Face Transformers
	- License: MIT License

	## Tokenizer Usage

	You can load and use this tokenizer with the Hugging Face `transformers` library.
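
A minimal sketch of loading the tokenizer and running one string through it. It assumes the repository files can be loaded with `AutoTokenizer.from_pretrained`, which the `library_name: transformers` metadata suggests but this README does not confirm. The helper names `load_tokenizer` and `encode_decode` and the sample text are illustrative, not part of the repository.

```python
from typing import Any

REPO_ID = "tamang0000/assamese-tokenizer-50k"  # repository name from this README


def load_tokenizer(repo_id: str = REPO_ID) -> Any:
    """Fetch the tokenizer files for repo_id (or reuse the local cache)."""
    # Imported here so encode_decode can be used without transformers installed.
    from transformers import AutoTokenizer

    # Assumption: the repo is loadable through AutoTokenizer.
    return AutoTokenizer.from_pretrained(repo_id)


def encode_decode(tokenizer: Any, text: str) -> dict:
    """Return the token ids, token strings, and decoded text for one input."""
    ids = tokenizer.encode(text)
    return {
        "input_ids": ids,
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        "decoded": tokenizer.decode(ids, skip_special_tokens=True),
    }


# Example usage (needs network access the first time the files are downloaded):
# tokenizer = load_tokenizer()
# print(encode_decode(tokenizer, "অসমীয়া ভাষা"))  # illustrative sample text
```

Comparing `decoded` with the original input is a quick way to see how much the tokenizer's normalization changes the text.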

	## Training Details

	- Dataset: The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
	- Vocabulary Size: 50,000 tokens.
	- Normalization: Includes normalization steps such as lowercasing and stripping accents.
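
The README does not say which normalizer implementation was used. As a rough illustration of what "lowercasing and stripping accents" commonly means, the sketch below follows the BERT-style approach: lowercase, decompose with NFD, then drop nonspacing marks. The function name `normalize` is hypothetical, and the published tokenizer may behave differently.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase text and strip accents (an assumed, BERT-style definition)."""
    lowered = text.lower()
    # NFD splits precomposed characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop nonspacing marks (Unicode category "Mn"), which is where accents end up.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("Café"))  # cafe
```

Under this definition the `Mn` category also covers some signs of the Assamese script, such as the virama (U+09CD), so it is worth checking the tokenizer's actual output on Assamese text before relying on it.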