Model Description

A BERT-based sequence classifier, fine-tuned from bert-base-chinese, that labels a text as either Cantonese (label 0) or Traditional Chinese (label 1).

Intended Use

  • Primary Application: Language classification for Cantonese and Traditional Chinese texts.
  • Users: NLP researchers and developers working with Chinese-language data.

Training Data

Trained on the "raptorkwok/cantonese-traditional-chinese-parallel-corpus" dataset from the Hugging Face Hub.
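A parallel corpus pairs each Cantonese sentence with a Traditional Chinese counterpart, so it has to be flattened into labelled single sentences before it can train a classifier. The sketch below shows one way this could be done; it is not necessarily the preprocessing used for this model. The helper names (`load_corpus`, `to_classification_examples`) and the sample sentences are illustrative, and the corpus column names are deliberately not assumed: inspect the loaded dataset to find them.

```python
def load_corpus():
    """Download the parallel corpus from the Hugging Face Hub.

    Requires the `datasets` package and network access. Inspect the
    returned object to see its splits and column names.
    """
    from datasets import load_dataset

    return load_dataset("raptorkwok/cantonese-traditional-chinese-parallel-corpus")


def to_classification_examples(pairs):
    """Turn (cantonese, traditional_chinese) pairs into labelled examples.

    Labels follow this model card: 0 = Cantonese, 1 = Traditional Chinese.
    """
    examples = []
    for cantonese, traditional_chinese in pairs:
        examples.append({"text": cantonese, "label": 0})
        examples.append({"text": traditional_chinese, "label": 1})
    return examples


if __name__ == "__main__":
    # Illustrative pair, not taken from the corpus.
    sample = [("我哋聽日去邊度食飯?", "我們明天去哪裡吃飯?")]
    for example in to_classification_examples(sample):
        print(example)
```

Each pair yields one example per class, so a dataset built this way is balanced by construction.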

Training Procedure

  • Base Model: bert-base-chinese
  • Epochs: 3
  • Learning Rate: 2e-5
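For readers who want to reproduce a similar setup, the hyperparameters above could be wired into the `transformers` `Trainer` roughly as follows. This is a sketch under assumptions: the card does not say whether `Trainer` was actually used, and settings it does not list (batch size, warmup, weight decay) are left at library defaults. `build_trainer` and the `output_dir` value are hypothetical names.

```python
# Values stated in this model card.
HYPERPARAMS = {
    "base_model": "bert-base-chinese",
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
}


def build_trainer(train_dataset, output_dir="chinese-langid-out"):
    """Build a Trainer for binary classification on a tokenized dataset.

    `train_dataset` is assumed to be already tokenized and to carry a
    `labels` column (0 = Cantonese, 1 = Traditional Chinese).
    """
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    # Two output classes on top of the pretrained encoder.
    model = AutoModelForSequenceClassification.from_pretrained(
        HYPERPARAMS["base_model"], num_labels=2
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=HYPERPARAMS["num_train_epochs"],
        learning_rate=HYPERPARAMS["learning_rate"],
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```

Calling `build_trainer(tokenized_train).train()` would then start fine-tuning, where `tokenized_train` is whatever tokenized dataset you have prepared.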

How to Use

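import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ming030890/chinese-langid")
model = AutoModelForSequenceClassification.from_pretrained("ming030890/chinese-langid")

text = "係唔係廣東話?"
# Truncate so long inputs stay within the model's maximum sequence length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# 0 for Cantonese, 1 for Traditional Chinese
prediction = outputs.logits.argmax(-1).item()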
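
The integer prediction can be turned into a readable label with a confidence score by applying a softmax to the logits. The helper below is a minimal, dependency-free sketch; `ID2LABEL` and `describe_prediction` are illustrative names, and the mapping simply restates the label convention given above.

```python
import math

# Label convention from this model card.
ID2LABEL = {0: "Cantonese", 1: "Traditional Chinese"}


def describe_prediction(logits):
    """Return (label, probability) for one example's list of logits."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

With the snippet above, `describe_prediction(outputs.logits[0].tolist())` returns the label name together with the softmax probability of the winning class.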