---
license: mit
language:
- en
- ja
- zh
- ko
metrics:
- accuracy
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
- sex
- filename
- dectection
- content
- mbert
- Multilingual
---

# Model Card for uget/sexual_content_dection

Detects sexual content in text or file names.

## Model Details

### Model Description

- **Developed by:** liu wei
- **License:** MIT
- **Finetuned from model:** bert-base-multilingual-cased
- **Task:** Binary text classification
- **Language:** Multilingual
- **Max Length:** 128
- **Last Updated:** 2024-08-22

### Model Training Information

- **Training Dataset Size:** 100,000 manually annotated samples (labels contain some noise)
- **Data Distribution:** 50:50
- **Batch Size:** 8
- **Epochs:** 5
- **Accuracy:** 92%
- **F1:** 92%

Buy me a cup of coffee, thanks.

## Uses

- Supports multiple languages, such as English, Chinese, and Japanese.
- Use it to screen content submitted by users in forums, magnet-link search engines, cloud storage services, and similar platforms.
- Detects sexual content from semantics as well as obfuscated variants, including adult video codes and altered file names.
- Detection accuracy is reported to be substantially higher than that of GPT-4o mini.
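
The screening use case above can be sketched as a small filter around the classifier. This is a minimal illustration, not part of the model's API: `screen_submissions` and `stub_classifier` are hypothetical names, and the stub only stands in for the real model so the sketch runs offline. In practice you would pass the `predict` function from the "How to Get Started" section, which returns a dict with a `predictions` key (0 or 1).

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A classifier takes a text and returns {"predictions": 0 or 1, "label": ...},
# the same shape as the predict() function shown in this model card.
Classifier = Callable[[str], Dict[str, object]]


def screen_submissions(
    texts: Iterable[str], classify: Classifier
) -> Tuple[List[str], List[str]]:
    """Split texts into (flagged, allowed) using the classifier's 0/1 prediction."""
    flagged: List[str] = []
    allowed: List[str] = []
    for text in texts:
        result = classify(text)
        if result["predictions"] == 1:
            flagged.append(text)
        else:
            allowed.append(text)
    return flagged, allowed


def stub_classifier(text: str) -> Dict[str, object]:
    """Hypothetical keyword stand-in for the real model (assumption, demo only)."""
    is_sexual = "uncensored" in text.lower()
    return {"predictions": int(is_sexual), "label": "Sexual" if is_sexual else "None"}


if __name__ == "__main__":
    uploads = ["holiday_photos_2023.zip", "MILK-217-UNCENSORED-LEAK.ts"]
    flagged, allowed = screen_submissions(uploads, stub_classifier)
    print("flagged:", flagged)
    print("allowed:", allowed)
```

Passing the classifier in as an argument keeps the filtering logic independent of how the model is loaded, so the same function works with the real `predict` or with a stub in tests.
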
### Examples

- Example **English**

```python
predict("Tiffany Doll - Wine Makes Me Anal (31.03.2018)_1080p.mp4")
```

```json
{ "predictions": 1, "label": "Sexual" }
```

- Example **Chinese**

```python
predict("橙子 · 保安和女业主的一夜春宵。路见不平拔刀相助,救下苏姐,以身相许!")
```

```json
{ "predictions": 1, "label": "Sexual" }
```

- Example **Japanese**

```python
predict("MILK-217-UNCENSORED-LEAKピタコス Gカップ痴女 完全着衣で濃密5PLAY 椿りか 580 2.TS")
```

```json
{ "predictions": 1, "label": "Sexual" }
```

- Example **Adult video codes**

```python
predict("DVAJ-548_CH_SD")
```

```json
{ "predictions": 1, "label": "Sexual" }
```

## How to Get Started with the Model

### Step 1: Create a Python file, such as `use_model.py`

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the tokenizer and the fine-tuned classifier
tokenizer = BertTokenizer.from_pretrained("uget/sexual_content_dection")
model = BertForSequenceClassification.from_pretrained("uget/sexual_content_dection")
model.eval()  # inference mode (disables dropout)


def predict(text):
    # Truncate to the model's maximum length of 128 tokens
    encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    encoding = {k: v.to(model.device) for k, v in encoding.items()}

    # No gradients are needed for inference
    with torch.no_grad():
        outputs = model(**encoding)

    # The class with the largest logit is the predicted label
    predictions = torch.argmax(outputs.logits, dim=-1)
    label_map = {0: "None", 1: "Sexual"}
    predicted_label = label_map[predictions.item()]

    print(f"Predictions:{predictions.item()}, Label:{predicted_label}")
    return {"predictions": predictions.item(), "label": predicted_label}


predict("Tiffany Doll - Wine Makes Me Anal (31.03.2018)_1080p.mp4")
```

### Step 2: Run

```shell
python3 use_model.py
```

Response JSON

```json
{ "predictions": 1, "label": "Sexual" }
```

### Explanation

The result is one of two values:

- `predictions` = 0 (label `None`): no sexual content was detected.
- `predictions` = 1 (label `Sexual`): sexual content was detected.

## Model Card Contact

Email: jack813@gmail.com