---
license: mit
datasets:
- Silly-Machine/TuPyE-Dataset
language:
- pt
pipeline_tag: text-classification
base_model: neuralmind/bert-large-portuguese-cased
widget:
- text: 'Bom dia, flor do dia!!'
model-index:
- name: TuPy-Bert-Large-Binary-Classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: TuPyE-Dataset
      type: Silly-Machine/TuPyE-Dataset
    metrics:
    - type: accuracy
      value: 0.907
      name: Accuracy
      verified: true
    - type: f1
      value: 0.903
      name: F1-score
      verified: true
    - type: precision
      value: 0.901
      name: Precision
      verified: true
    - type: recall
      value: 0.907
      name: Recall
      verified: true
---

## Introduction

TuPy-Bert-Large-Binary-Classifier is a fine-tuned BERT model for binary classification of hate speech in Portuguese: each input text is labeled as hate or not hate. It is derived from [BERTimbau Large](https://huggingface.co./neuralmind/bert-large-portuguese-cased). For more details or specific inquiries, please refer to the [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/).

Language model performance can vary considerably when the training and test data come from different domains. To obtain a Portuguese language model specialized for hate speech classification, the original BERTimbau model was fine-tuned on the [TuPy Hate Speech DataSet](https://huggingface.co./datasets/Silly-Machine/TuPyE-Dataset), which is sourced from diverse social networks.

## Available models

| Model                                             | Arch.      | #Layers | #Params |
| ------------------------------------------------- | ---------- | ------- | ------- |
| `Silly-Machine/TuPy-Bert-Base-Binary-Classifier`  | BERT-Base  | 12      | 109M    |
| `Silly-Machine/TuPy-Bert-Large-Binary-Classifier` | BERT-Large | 24      | 334M    |
| `Silly-Machine/TuPy-Bert-Base-Multilabel`         | BERT-Base  | 12      | 109M    |
| `Silly-Machine/TuPy-Bert-Large-Multilabel`        | BERT-Large | 24      | 334M    |

## Example usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import torch
import numpy as np
from scipy.special import softmax

def classify_hate_speech(model_name, text):
    # Load the fine-tuned model, its tokenizer, and its config (for the label names)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Tokenize input text and prepare model input
    model_input = tokenizer(text, padding=True, return_tensors="pt")

    # Get model output scores
    with torch.no_grad():
        output = model(**model_input)
        scores = softmax(output.logits.numpy(), axis=1)
        ranking = np.argsort(scores[0])[::-1]

    # Print the labels from most to least likely
    for i, rank in enumerate(ranking):
        label = config.id2label[int(rank)]
        score = scores[0, rank]
        print(f"{i + 1}) Label: {label} Score: {score:.4f}")

# Example usage
model_name = "Silly-Machine/TuPy-Bert-Large-Binary-Classifier"
text = "Bom dia, flor do dia!!"
classify_hate_speech(model_name, text)
```
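
The ranking step in the example above can also be looked at in isolation, without downloading the model. The following is a minimal sketch using only the standard library: it applies a softmax to a pair of logits and sorts the labels by score. The `ID2LABEL` mapping and the logit values are assumptions made up for illustration; the actual label names come from the model's `config.id2label`.

```python
import math

# Hypothetical label mapping, for illustration only.
# The real mapping is stored in the model's config.id2label.
ID2LABEL = {0: "not hate", 1: "hate"}

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, id2label=ID2LABEL):
    # Return (label, score) pairs sorted from most to least likely
    scores = softmax(logits)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in order]

# Made-up logits, not real model output
example_logits = [2.0, -1.0]
for position, (label, score) in enumerate(rank_labels(example_logits), start=1):
    print(f"{position}) Label: {label} Score: {score:.4f}")
```

Because the scores sum to 1, the top-ranked label in the binary case always has a score of at least 0.5. An application that needs a stricter decision rule could compare the score of the hate label against its own threshold instead of taking the top-ranked label.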