Language Detection Model

This project trains a BERT-based language detection model on the Hugging Face hac541309/open-lid-dataset, which contains 121 million sentences across 200 languages. The trained model provides fast, accurate language identification for text classification tasks.
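For reference, the dataset can be streamed directly from the Hugging Face Hub. The sketch below assumes the datasets library is installed; the exact record schema is an assumption (typically a sentence paired with a language label):

```python
from itertools import islice
from datasets import load_dataset

# Stream the corpus instead of downloading all 121M sentences up front.
dataset = load_dataset("hac541309/open-lid-dataset", split="train", streaming=True)

# Peek at a few records; field names depend on the dataset schema.
for example in islice(dataset, 3):
    print(example)
```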

📌 Model Details

  • Architecture: BertForSequenceClassification
  • Hidden Size: 384
  • Layers: 4
  • Attention Heads: 6
  • Max Sequence Length: 512
  • Dropout: 0.1
  • Vocabulary Size: 50,257
  • Parameters: ~24.5M (float32)
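A minimal configuration sketch matching these numbers is shown below. The intermediate_size and num_labels values are assumptions not stated explicitly in the card (num_labels=200 follows from the 200 languages in the dataset):

```python
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    intermediate_size=1536,  # assumption: 4 * hidden_size, a common choice
    num_labels=200,          # one class per language in the dataset
)

model = BertForSequenceClassification(config)
# Compare with the released model's reported ~24.5M parameters.
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```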

🚀 Training Process

  • Dataset: Preprocessed and split into train (90%) and test (10%) sets.
  • Tokenizer: A custom PreTrainedTokenizerFast handles text tokenization.
  • Evaluation Metrics: Computed during evaluation via a compute_metrics function.
  • Hyperparameters:
    • Learning Rate: 2e-5
    • Batch Size: 256 (train) / 512 (test)
    • Epochs: 1
    • Scheduler: cosine
  • Trainer: Uses the Hugging Face Trainer API with Weights & Biases (wandb) logging; see the setup sketch after this list.
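Putting the pieces together, the training setup might look like the following sketch. The compute_metrics body, the tokenized split names, and the output directory are illustrative assumptions; model refers to the instance built in the configuration sketch above:

```python
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # The card does not list the exact metrics; accuracy over the 200
    # language labels is an illustrative choice.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="language_detection",   # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",                 # log metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # assumed names for the 90/10 splits
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```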

📊 Evaluation Results

The model was evaluated on the held-out test set; the full results are available in the Weights & Biases report linked below.

https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
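The published checkpoint can be used through the standard text-classification pipeline. This is a usage sketch assuming the model at alexneakameni/language_detection ships with its tokenizer; the FLORES-style label format shown in the comment is an assumption:

```python
from transformers import pipeline

# Load the published checkpoint; label names come from the model's
# id2label mapping.
detector = pipeline("text-classification", model="alexneakameni/language_detection")

print(detector("Bonjour tout le monde !"))
# e.g. [{'label': 'fra_Latn', 'score': ...}]
```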
