# Language Detection Model
This project trains a BERT-based language detection model on the Hugging Face `hac541309/open-lid-dataset`, which contains 121 million sentences across 200 languages. The trained model is designed for fast and accurate language identification in text classification tasks.
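As a quick orientation, the dataset can be pulled straight from the Hub. This is a minimal sketch assuming the `datasets` library; the exact column names are an assumption about the schema, not something this card confirms.

```python
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (~121M sentences, 200 languages).
dataset = load_dataset("hac541309/open-lid-dataset")

# Inspect splits and features; the actual text/label column names
# depend on the dataset schema.
print(dataset)
```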
## Model Details
- Architecture: `BertForSequenceClassification`
- Hidden Size: 384
- Layers: 4
- Attention Heads: 6
- Max Sequence Length: 512
- Dropout: 0.1
- Vocabulary Size: 50,257
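The configuration above maps directly onto a `BertConfig`. Below is a minimal sketch of instantiating the model from scratch; settings not listed on this card (e.g. intermediate size) are left at their library defaults, and `num_labels=200` is an assumption of one class per language.

```python
from transformers import BertConfig, BertForSequenceClassification

# Compact BERT matching the hyperparameters listed above.
config = BertConfig(
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,            # 384 / 6 = 64 dims per head
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    vocab_size=50257,
    num_labels=200,                   # assumption: one label per language
)
model = BertForSequenceClassification(config)
```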
## Training Process
- Dataset: Preprocessed and split into train (90%) and test (10%) sets.
- Tokenizer: Custom `PreTrainedTokenizerFast` for text tokenization.
- Evaluation Metrics: Tracked using a `compute_metrics` function.
- Hyperparameters:
  - Learning Rate: 2e-5
  - Batch Size: 256 (train) / 512 (test)
  - Epochs: 1
  - Scheduler: cosine
- Trainer: Uses the Hugging Face `Trainer` API with `wandb` logging (see the sketch after this list).
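A minimal sketch of the training setup with the hyperparameters listed above. Here `model` comes from the configuration sketch earlier, `tokenized["train"]`/`tokenized["test"]` are placeholders for the pre-tokenized 90/10 splits, and the accuracy-only `compute_metrics` is an assumption, since the card does not say which metrics were tracked.

```python
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Accuracy as a stand-in metric; the original metric set is unspecified.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

training_args = TrainingArguments(
    output_dir="lang-detect",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",                   # log metrics to Weights & Biases
)

trainer = Trainer(
    model=model,                         # from the config sketch above
    args=training_args,
    train_dataset=tokenized["train"],    # placeholder pre-tokenized splits
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```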
## Evaluation Results
The model was evaluated on the held-out test set; full results are shared in the linked Weights & Biases report:
https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
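Once published, the model can be used through the standard `text-classification` pipeline. The model id below is a hypothetical placeholder; substitute the actual Hub path of this repository.

```python
from transformers import pipeline

# "your-username/lang-detect" is a hypothetical id; replace it with
# the actual Hub path of this model.
classifier = pipeline("text-classification", model="your-username/lang-detect")

print(classifier("Bonjour tout le monde"))
# -> [{'label': ..., 'score': ...}] with the predicted language code
```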