bart-base-cantonese
This is the Cantonese version of BART base. It was obtained by second-stage pre-training of the fnlp/bart-base-chinese model on the LIHKG dataset.
This project is supported by Cloud TPUs from Google's TPU Research Cloud (TRC).
Note: To avoid any copyright issues, please do not use this model for any purpose.
GitHub Links
- Dataset: ayaka14732/lihkg-scraper
- Tokeniser: ayaka14732/bert-tokenizer-cantonese
- Base model: ayaka14732/bart-base-jax
- Pre-training: ayaka14732/bart-base-cantonese
Usage
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))
# output: 聽日就要返香港,我激動到瞓唔着
Note: Please use the BertTokenizer for the model vocabulary. DO NOT use the original BartTokenizer.
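The `.replace(' ', '')` call in the usage example removes every space from the decoded text, which would also glue together any Latin words in the output. The sketch below shows one possible alternative that only drops spaces adjacent to CJK characters. `join_cjk_tokens` is a hypothetical helper, not part of the model or of `transformers`, and the character ranges it treats as CJK are an assumption.

```python
import re

# Hypothetical helper, not part of the model or of transformers.
# Assumed CJK ranges: CJK punctuation, CJK Unified Ideographs, full-width forms.
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Match runs of spaces that directly follow or directly precede a CJK character.
_SPACE_NEXT_TO_CJK = re.compile(rf"(?<=[{_CJK}]) +| +(?=[{_CJK}])")


def join_cjk_tokens(text: str) -> str:
    """Remove spaces next to CJK characters, keep spaces between Latin words."""
    return _SPACE_NEXT_TO_CJK.sub("", text)


# Space-separated text of the kind the usage example post-processes.
print(join_cjk_tokens("聽 日 就 要 返 香 港"))
# Mixed Cantonese and English: the space inside "apple pie" is kept.
print(join_cjk_tokens("我 鍾 意 apple pie 呀"))
```

For purely Cantonese output, the simpler `.replace(' ', '')` from the usage example gives the same result.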
Training Details
- Optimiser: SGD (learning rate 0.03) with Adaptive Gradient Clipping (0.1)
- Dataset: 172,937,863 sentences, each padded or truncated to 64 tokens
- Batch size: 640
- Number of epochs: 7 epochs + 61440 steps
- Time: 44.0 hours on Google Cloud TPU v4-16
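To make the dataset and schedule figures above concrete, here is a minimal sketch of padding or truncating to 64 tokens and of the implied step count. It rests on assumptions not stated in this card: that the trailing incomplete batch of each epoch is dropped, that "7 epochs + 61440 steps" means seven full passes followed by 61440 further steps, and that the padding id is 0. `pad_or_truncate` and all constant names are hypothetical.

```python
NUM_SENTENCES = 172_937_863  # from the training details
BATCH_SIZE = 640             # from the training details
SEQ_LEN = 64                 # from the training details
FULL_EPOCHS = 7              # from the training details
EXTRA_STEPS = 61_440         # from the training details
PAD_ID = 0                   # assumption: the real padding id may differ


def pad_or_truncate(token_ids, length=SEQ_LEN, pad_id=PAD_ID):
    """Return exactly `length` ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))


# Assumption: the incomplete final batch of each epoch is dropped.
steps_per_epoch = NUM_SENTENCES // BATCH_SIZE
total_steps = FULL_EPOCHS * steps_per_epoch + EXTRA_STEPS
tokens_per_batch = BATCH_SIZE * SEQ_LEN

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("token positions per batch:", tokens_per_batch)
print("padded example:", pad_or_truncate([101, 2769, 102], length=8))
```

If the final partial batch were kept instead, each epoch would have one more step, so the total would differ by 7.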
WandB link: 1j7zs802