Language Detection Model

This project trains a BERT-based language detection model on the Hugging Face hac541309/open-lid-dataset, which contains 121 million sentences across 200 languages. The trained model provides fast, accurate language identification for text classification tasks.
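For reference, the dataset can be streamed directly from the Hugging Face Hub. The sketch below assumes the datasets library is installed; the exact record schema is an assumption (typically a sentence paired with a language label):

```python
from itertools import islice
from datasets import load_dataset

# Stream the corpus instead of downloading all 121M sentences up front.
dataset = load_dataset("hac541309/open-lid-dataset", split="train", streaming=True)

# Peek at a few records; field names depend on the dataset schema.
for example in islice(dataset, 3):
    print(example)
```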

📌 Model Details

  • Architecture: BertForSequenceClassification
  • Hidden Size: 384
  • Layers: 4
  • Attention Heads: 6
  • Max Sequence Length: 512
  • Dropout: 0.1
  • Vocabulary Size: 50,257
  • Parameters: ~24.5M (float32)
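A minimal configuration sketch matching these numbers is shown below. The intermediate_size and num_labels values are assumptions not stated explicitly in the card (num_labels=200 follows from the 200 languages in the dataset):

```python
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    intermediate_size=1536,  # assumption: 4 * hidden_size, a common choice
    num_labels=200,          # one class per language in the dataset
)

model = BertForSequenceClassification(config)
# Compare with the released model's reported ~24.5M parameters.
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```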

🚀 Training Process

  • Dataset: Preprocessed and split into train (90%) and test (10%) sets.
  • Tokenizer: A custom PreTrainedTokenizerFast handles text tokenization.
  • Evaluation Metrics: Computed during evaluation via a compute_metrics function.
  • Hyperparameters:
    • Learning Rate: 2e-5
    • Batch Size: 256 (train) / 512 (test)
    • Epochs: 1
    • Scheduler: cosine
  • Trainer: Uses the Hugging Face Trainer API with Weights & Biases (wandb) logging; see the setup sketch after this list.
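Putting the pieces together, the training setup might look like the following sketch. The compute_metrics body, the tokenized split names, and the output directory are illustrative assumptions; model refers to the instance built in the configuration sketch above:

```python
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # The card does not list the exact metrics; accuracy over the 200
    # language labels is an illustrative choice.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="language_detection",   # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",                 # log metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # assumed names for the 90/10 splits
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```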

📊 Evaluation Results

The model was evaluated on the held-out test set; the full results are available in the Weights & Biases report linked below.

https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
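The published checkpoint can be used through the standard text-classification pipeline. This is a usage sketch assuming the model at alexneakameni/language_detection ships with its tokenizer; the FLORES-style label format shown in the comment is an assumption:

```python
from transformers import pipeline

# Load the published checkpoint; label names come from the model's
# id2label mapping.
detector = pipeline("text-classification", model="alexneakameni/language_detection")

print(detector("Bonjour tout le monde !"))
# e.g. [{'label': 'fra_Latn', 'score': ...}]
```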
