---
license: mit
datasets:
- Silly-Machine/TuPyE-Dataset
language:
- pt
pipeline_tag: text-classification
base_model: neuralmind/bert-large-portuguese-cased
widget:
- text: 'Bom dia, flor do dia!!'
model-index:
- name: TuPy-Bert-Large-Multilabel
  results:
  - task:
      type: text-classification
    dataset:
      name: TuPyE-Dataset
      type: Silly-Machine/TuPyE-Dataset
    metrics:
    - type: accuracy
      value: 0.907
      name: Accuracy
      verified: true
    - type: f1
      value: 0.903
      name: F1-score
      verified: true
    - type: precision
      value: 0.901
      name: Precision
      verified: true
    - type: recall
      value: 0.907
      name: Recall
      verified: true
---

## Introduction

Tupy-BERT-Large-Multilabel is a fine-tuned BERT model designed specifically for multilabel classification of hate speech in Portuguese. Derived from [BERTimbau large](https://huggingface.co/neuralmind/bert-large-portuguese-cased), TuPy-Large targets categorical hate speech across ten categories: ageism, aporophobia, body shame, capacitism, LGBTphobia, political, racism, religious intolerance, misogyny, and xenophobia. For more details or specific inquiries, please refer to the [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/).

The performance of language models can vary notably when the domain of the test data differs from that of the training data. To build a Portuguese language model specialized for hate speech classification, the original BERTimbau model was fine-tuned on the [TuPy Hate Speech DataSet](https://huggingface.co/datasets/Silly-Machine/TuPyE-Dataset), which was sourced from diverse social networks.

## Available models

| Model                                             | Arch.      | #Layers | #Params |
| ------------------------------------------------- | ---------- | ------- | ------- |
| `Silly-Machine/TuPy-Bert-Base-Binary-Classifier`  | BERT-Base  | 12      | 109M    |
| `Silly-Machine/TuPy-Bert-Large-Binary-Classifier` | BERT-Large | 24      | 334M    |
| `Silly-Machine/TuPy-Bert-Base-Multilabel`         | BERT-Base  | 12      | 109M    |
| `Silly-Machine/TuPy-Bert-Large-Multilabel`        | BERT-Large | 24      | 334M    |

## Example usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import torch
import numpy as np
from scipy.special import softmax


def classify_hate_speech(model_name, text):
    # Load the fine-tuned model, its tokenizer, and its config (for label names)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Tokenize input text and prepare model input;
    # truncation keeps long texts within the model's maximum sequence length
    model_input = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

    # Get model output scores without tracking gradients
    with torch.no_grad():
        output = model(**model_input)
        scores = softmax(output.logits.numpy(), axis=1)
        # Label indices sorted from highest to lowest score
        ranking = np.argsort(scores[0])[::-1]

    # Print the results
    for i, rank in enumerate(ranking):
        label = config.id2label[rank]
        score = scores[0, rank]
        print(f"{i + 1}) Label: {label} Score: {score:.4f}")


# Example usage
model_name = "Silly-Machine/TuPy-Bert-Large-Multilabel"
text = "Bom dia, flor do dia!!"
classify_hate_speech(model_name, text)
```
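
Because this checkpoint is described as a multilabel classifier, a single text may fall into several hate speech categories at once. The sketch below shows one common way to read such outputs: apply a sigmoid to each logit independently and keep every label whose probability clears a threshold. It assumes the model was trained with independent per-label outputs, which this card does not confirm. The helper name `select_labels`, the 0.5 threshold, and the toy logits and label map are all illustrative and not part of the released model.

```python
import math


def select_labels(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold.

    Hypothetical helper: assumes independent per-label (sigmoid) outputs.
    `logits` is a plain list of floats, one per label.
    """
    # Element-wise sigmoid: each label is scored independently of the others.
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    selected = [
        (id2label[i], p)
        for i, p in enumerate(probs)
        if p >= threshold
    ]
    # Most probable labels first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy values for illustration only; they are not real model outputs.
toy_id2label = {0: "ageism", 1: "racism", 2: "misogyny"}
toy_logits = [-3.0, 2.0, 0.5]

for label, prob in select_labels(toy_logits, toy_id2label):
    print(f"{label}: {prob:.4f}")
```

To try this on real predictions, one option is to call `select_labels(output.logits[0].tolist(), config.id2label)` inside `classify_hate_speech`. If the model was instead trained with a single softmax over all labels, the ranked softmax scores printed by `classify_hate_speech` remain the appropriate reading.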