metadata
license: mit
language:
  - en
  - ja
  - zh
  - ko
metrics:
  - accuracy
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
  - sex
  - filename
  - dectection
  - content
  - mbert
  - Multilingual

Model Card for uget/sexual_content_dection

Detect sexual content in text or file names.

Model Details

Model Description

  • Developed by: liu wei
  • License: MIT
  • Finetuned from model: bert-base-multilingual-cased
  • Task: Binary text classification
  • Language: Multilingual
  • Max Length: 128 tokens
  • Last Updated: 2024-08-22

Model Training Information

  • Training Dataset Size: 100,000 manually annotated samples (the labels contain some noise)
  • Class Distribution: 50:50 (balanced between the two classes)
  • Batch Size: 8
  • Epochs: 5
  • Accuracy: 92%
  • F1: 92%
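
The accuracy and F1 figures above are reported without an evaluation script. As a minimal sketch of how such numbers are usually computed for a binary classifier, assuming label 1 is the positive ("Sexual") class, the following uses made-up toy labels rather than the author's data; `binary_metrics` is a hypothetical helper name:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy labels for illustration only (not taken from the training data)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
```

On a balanced 50:50 dataset, accuracy and F1 tend to land close together, which is consistent with both being reported as 92%.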

Buy me a cup of coffee, thanks

Uses

  • Supports multiple languages, such as English, Chinese, Japanese, and Korean.
  • Use it to screen content submitted by users in forums, magnet-link search engines, cloud drives, etc.
  • Detects sexual content by meaning as well as by obfuscated variants, such as adult video ID codes and altered file names.
  • Detection accuracy on this task is considerably higher than that of GPT-4o mini.
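
As a rough sketch of the screening use case above, the snippet below splits a batch of user submissions into allowed and blocked items. It assumes a classifier callback that returns a dict with a "predictions" key, like the `predict` function in this card. `screen_submissions` and `stub_classify` are hypothetical names, and the stub is a trivial keyword check standing in for the real model so the example runs without downloading anything.

```python
from typing import Callable, Dict, Iterable, List


def screen_submissions(
    texts: Iterable[str],
    classify: Callable[[str], Dict],
) -> Dict[str, List[str]]:
    """Split submissions into allowed and blocked using a classifier callback."""
    result: Dict[str, List[str]] = {"allowed": [], "blocked": []}
    for text in texts:
        verdict = classify(text)
        # predictions == 1 means sexual content was detected
        bucket = "blocked" if verdict["predictions"] == 1 else "allowed"
        result[bucket].append(text)
    return result


def stub_classify(text: str) -> Dict:
    """Placeholder standing in for the model's predict(); NOT the real model."""
    flagged = "xxx" in text.lower()
    return {"predictions": int(flagged), "label": "Sexual" if flagged else "None"}


report = screen_submissions(["holiday_photos.zip", "XXX_clip_1080p.mp4"], stub_classify)
print(report)
```

In a real deployment, `stub_classify` would be replaced by the model-backed `predict` function, and blocked items would typically go to a review queue rather than being deleted outright.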

Examples

  • Example English
predict("Tiffany Doll - Wine Makes Me Anal (31.03.2018)_1080p.mp4")
{
    "predictions": 1,
    "label": "Sexual"
}
  • Example Chinese
predict("橙子 · 保安和女业主的一夜春宵。路见不平拔刀相助,救下苏姐,以身相许!")
{
    "predictions": 1,
    "label": "Sexual"
}
  • Example Japanese
predict("MILK-217-UNCENSORED-LEAKピタコス Gカップ痴女 完全着衣で濃密5PLAY 椿りか 580 2.TS")
{
    "predictions": 1,
    "label": "Sexual"
}
  • Example Adult Video ID Code
predict("DVAJ-548_CH_SD")
{
    "predictions": 1,
    "label": "Sexual"
}

How to Get Started with the Model

Step 1:

Create a Python file, for example 'use_model.py', with the following content:

import json

import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("uget/sexual_content_dection")
model = BertForSequenceClassification.from_pretrained("uget/sexual_content_dection")
model.eval()  # inference mode: disables dropout

label_map = {0: "None", 1: "Sexual"}

def predict(text):
    # Truncate the input to the model's maximum length of 128 tokens
    encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    encoding = {k: v.to(model.device) for k, v in encoding.items()}

    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**encoding)

    # The class with the highest logit is the prediction
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    return {"predictions": prediction, "label": label_map[prediction]}

# Print the result as JSON
print(json.dumps(predict("Tiffany Doll - Wine Makes Me Anal (31.03.2018)_1080p.mp4"), indent=4))

Step 2:

Run the script:

python3 use_model.py

Response JSON

{
    "predictions": 1,
    "label": "Sexual"
}

Explanation

The "predictions" field takes one of two values:

  • predictions = 0: no sexual content was detected;
  • predictions = 1: sexual content was detected.
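
The result above is a hard 0/1 label. Where a confidence score is also useful, the two output logits can be converted to probabilities with a softmax. The pure-Python sketch below assumes the model emits two logits in the order [None, Sexual], as the label map in the script suggests. The logit values, the `decide` helper, and its `threshold` parameter are hypothetical and not part of the model's API.

```python
import math

LABELS = {0: "None", 1: "Sexual"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Map two logits to a label plus the probability of the 'Sexual' class."""
    probs = softmax(logits)
    # With threshold=0.5 this matches taking the argmax of the logits
    prediction = 1 if probs[1] >= threshold else 0
    return {
        "predictions": prediction,
        "label": LABELS[prediction],
        "score": round(probs[1], 4),
    }


# Hypothetical logits, for illustration only
print(decide([-1.2, 2.3]))
```

Raising `threshold` above 0.5 trades recall for precision, which may suit pipelines where false positives are costly.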


Model Card Contact

Email: [email protected]