RoBERTa for Multilabel Language Classification

Training

RoBERTa was fine-tuned on small parts of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

A heuristic algorithm was implemented to create the multilingual training data: https://github.com/n1kstep/lang-classifier (a rough sketch of the idea follows the table below).

| Data source    | Languages      |
|----------------|----------------|
| open_subtitles | ka, he, en, de |
| oscar          | be, kk, az, hu |
| tatoeba        | ru, uk         |
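
The repository linked above documents the actual heuristic; the sketch below only illustrates one plausible way to build multilabel samples, by concatenating sentences drawn from several monolingual pools and recording a multi-hot label vector. The `pools` structure, language list, and sampling parameters are illustrative assumptions, not the repository's code.

```python
# Hedged sketch of multilabel training-sample creation: concatenate sentences
# from randomly chosen languages and record a multi-hot label vector.
# The `pools` dict and sampling parameters are illustrative assumptions.
import random

LANGS = ["ka", "he", "en", "de", "be", "kk", "az", "hu", "ru", "uk"]

def make_sample(pools: dict[str, list[str]], max_langs: int = 3) -> dict:
    """Build one training sample from per-language sentence pools."""
    chosen = random.sample(LANGS, k=random.randint(1, max_langs))
    text = " ".join(random.choice(pools[lang]) for lang in chosen)
    labels = [1.0 if lang in chosen else 0.0 for lang in LANGS]  # multi-hot
    return {"text": text, "labels": labels}

# Toy pools for demonstration; real pools would hold ~9k sentences per language.
pools = {lang: [f"sample sentence in {lang}"] for lang in LANGS}
print(make_sample(pools))
```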

Validation

The metrics below were obtained by validating on a held-out part of the dataset (~1k samples per language).

| Training Loss | Validation Loss | F1-Score | ROC AUC  | Accuracy | Support |
|---------------|-----------------|----------|----------|----------|---------|
| 0.161500      | 0.110949        | 0.947844 | 0.953939 | 0.762063 | 26858   |
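
For reference, below is a minimal multilabel inference sketch using the Hugging Face Transformers API, assuming the published checkpoint exposes a standard sequence-classification head with one logit per language; the example text and the 0.5 decision threshold are assumptions, not values from this card.

```python
# Minimal multilabel inference sketch; threshold and example text are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nikitast/multilang-classifier-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Guten Tag! How are you today?"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Multilabel setup: a sigmoid per label instead of a softmax over labels,
# so several languages can be predicted for a single text.
probs = torch.sigmoid(logits)[0]
threshold = 0.5  # assumed decision threshold; the card does not state one
predicted = [model.config.id2label[i] for i, p in enumerate(probs) if p > threshold]
print(predicted)
```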