This model is a camembert-base model fine-tuned on a French-translated version of the toxic-chat dataset plus additional synthetic data. It classifies user prompts into three categories: "Toxic", "Non-Toxic", and "Sensible" (French for "sensitive").
- Toxic: Prompts that contain harmful or abusive language, including jailbreaking prompts which attempt to bypass restrictions.
- Non-Toxic: Prompts that are safe and free of harmful content.
- Sensible: Prompts that, while not toxic, are sensitive in nature, such as prompts discussing suicidal thoughts or aggression, or asking for help with a sensitive issue.
The evaluation results are as follows (still under evaluation, more data is needed):
| | Precision | Recall | F1-Score |
|---|---|---|---|
| Non-Toxic | 0.97 | 0.95 | 0.96 |
| Sensible | 0.95 | 0.99 | 0.98 |
| Toxic | 0.87 | 0.90 | 0.88 |
| Accuracy | | | 0.94 |
| Macro Avg | 0.93 | 0.95 | 0.94 |
| Weighted Avg | 0.94 | 0.94 | 0.94 |
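The macro averages in the table can be reproduced from the per-class figures, as a sanity check on the report. A minimal sketch (the weighted averages would additionally require the per-class support counts, which are not listed here):

```python
# Per-class metrics as reported in the table above.
per_class = {
    "Non-Toxic": {"precision": 0.97, "recall": 0.95, "f1": 0.96},
    "Sensible":  {"precision": 0.95, "recall": 0.99, "f1": 0.98},
    "Toxic":     {"precision": 0.87, "recall": 0.90, "f1": 0.88},
}

def macro_avg(metric: str) -> float:
    """Unweighted mean of a metric across the three classes."""
    values = [scores[metric] for scores in per_class.values()]
    return round(sum(values) / len(values), 2)

macro_precision = macro_avg("precision")  # 0.93
macro_recall = macro_avg("recall")        # 0.95
macro_f1 = macro_avg("f1")                # 0.94
```

These match the Macro Avg row of the table once rounded to two decimals.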
Note: This model is still under development, and its performance and characteristics are subject to change as training is not yet complete.