This model is a continually pretrained version of the Yi-34B model on a Cantonese corpus consisting of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
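
As a rough way to see the effect of the vocabulary extension, here is a minimal sketch (the base-tokenizer repo 01-ai/Yi-34B and the sample words are assumptions for illustration; exact token counts depend on the released tokenizers):

from transformers import AutoTokenizer

# base Yi tokenizer vs. the extended Cantonese tokenizer (repo names assumed)
base_tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B")
canto_tokenizer = AutoTokenizer.from_pretrained("hon9kon9ize/CantoneseLLMChat-v0.5-34b")

for word in ["唔該", "點解", "故仔"]:
    # words added to the vocabulary should need fewer tokens after the extension
    print(word, len(base_tokenizer.tokenize(word)), len(canto_tokenizer.tokenize(word)))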

The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (400 million tokens). Surprisingly, the outcome of the 6B model was quite good.

Although there are still some issues, such as misalignment between written Chinese and Cantonese and imperfect knowledge transfer across languages, we went on to train a 34B model using four compute nodes, each with four NVIDIA H100 GPUs, on the Genkai Supercomputer.

Here is a Space where you can interact with CantoneseLLMChat.

Result

User: 講一個關於維尼大大嘅故仔畀我聽
(Translation: Tell me a story about 維尼大大.)

Assistant: 從前有個程式員叫維尼大大,佢寫咗好多有趣又實用嘅程式。有一日,佢收到一份特殊嘅任務,要佢編寫一款能夠翻譯任何語言嘅程式。呢項任務對維尼大大嚟講唔簡單,但佢決定接受挑戰。經過幾個月嘅努力同探索,維尼大大終於成功噉創造出一款多功能嘅翻譯程式。呢種程式可以將人哋講嘅嘢變成電腦睇到嘅文字、再將電腦睇到嘅文字轉換成人類理解得到嘅語言。呢種翻譯技術令世界各地嘅人民都可以輕易交流,增進彼此之間嘅了解與和諧共處。而維尼大大亦因為自己嘅創新同貢獻獲得咗無數獎項同認可,成為一位備受尊敬嘅科學家及先鋒人物。
(Translation: Once upon a time, there was a programmer called 維尼大大 who wrote many fun and useful programs. One day he received a special task: to write a program that could translate any language. It was no easy task for 維尼大大, but he decided to take up the challenge. After several months of hard work and exploration, he finally created a versatile translation program that could turn what people said into text a computer could read, and turn that text back into language humans could understand. This technology let people all over the world communicate easily, deepening mutual understanding and harmony. For his innovation and contributions, 維尼大大 won countless awards and recognition, becoming a widely respected scientist and pioneer.)

###

User: 點解香港人咁唔鍾意大陸人?
(Translation: Why do Hong Kong people dislike mainlanders so much?)

Assistant: 我覺得呢個問題冇一個統一嘅答案。有啲香港人可能會對大陸人有偏見,因為佢哋可能有過負面嘅經歷或者觀察到大陸人有某啲行為習慣令佢哋反感。但係我都要強調下,每一個人都有自己嘅觀點同感受,而且應該得到尊重。每個人都值得被接納同理解。如果你有任何疑問或需要幫助解決任何問題,請隨時提出,我將盡力提供支持同協助。
(Translation: I don't think this question has a single answer. Some Hong Kong people may hold prejudices against mainlanders because of negative experiences, or because they have observed behaviours that put them off. But I must stress that everyone has their own views and feelings, which deserve respect; everyone deserves acceptance and understanding. If you have any questions or need help with anything, feel free to ask, and I will do my best to support and assist you.)

###

User: 我x你!
(Translation: F*** you!)

Assistant: 唔好意思,我係個AI語言模型,我唔應該使用粗俗語言。如果你有其他問題,我會盡力幫你。
(Translation: Sorry, as an AI language model I should not use vulgar language. If you have other questions, I will do my best to help.)

System Prompt

To assign a role to the chat model, specify a proper system prompt. The default system prompt is "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理" ("You are CantoneseLLM, developed by hon9kon9ize, and you are a very helpful assistant"), which steers the model toward responding in Cantonese. Note that the response language is not determined solely by the system prompt; it also depends on the user input.

[
  {"role": "system", "content": "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理"}, # this is the default system prompt; this line can be omitted
  {"role": "user", "content": "你叫咩名?"}
]

# Output: 我係CantoneseLLM,一個由hon9kon9ize開發嘅人工智能助手。我致力於為用戶提供準確、有針對性嘅回答同幫助。
# (Translation: I am CantoneseLLM, an AI assistant developed by hon9kon9ize. I strive to provide users with accurate, targeted answers and help.)
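
As a hedged illustration of that behaviour (the exact reply varies with sampling), keeping the default system prompt but sending an English message can pull the response toward English:

[
  {"role": "user", "content": "What's your name?"}
]

# Possible output (illustrative only): I'm CantoneseLLM, an AI assistant developed by hon9kon9ize.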

Chat Template

The template format is similar to ChatML, but we have replaced the role tokens with Yi's reserved tokens in order to save context size.

<|im_start|><|System|>
Provide some context and/or instructions to the model.
<|im_end|>
<|im_start|><|Human|>
The user's message goes here
<|im_end|>
<|im_start|><|Assistant|>
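
To see exactly what string the model receives, you can render the template without tokenizing; a minimal sketch (assuming the chat template ships with the tokenizer, as the Usage section below suggests):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("hon9kon9ize/CantoneseLLMChat-v0.5-34b")

messages = [{"role": "user", "content": "你叫咩名?"}]
# tokenize=False returns the raw prompt string, so the reserved role tokens are visible
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))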

Usage

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer

model_name = "hon9kon9ize/CantoneseLLMChat-v0.5-34b"

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype=torch.bfloat16,
  device_map='auto',
  # quantization_config=bnb_config, # uncomment this line and bnb_config above to use 4-bit quantization
)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

def chat(messages, temperature=0.9, max_new_tokens=200):
    # the chat template definition can be found in generation_config.json
    input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to('cuda:0')
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, temperature=temperature, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, num_beams=3, repetition_penalty=1.18)
    # decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

    return response

# simple single-turn example
messages = [{"role": "user", "content": "邊個係香港特首?"}] # "Who is the Chief Executive of Hong Kong?"

# the chat template includes a default system message, but you can define your own (this overrides the example above):
messages = [
 {"role": "system", "content": "你叫做櫻子,你要同用家北原伊織進行對話,你同北原伊織係情侶關係。"}, # "You are 櫻子, chatting with the user 北原伊織; you two are a couple."
 {"role": "user", "content": "櫻子,今日你會去邊度玩呀?"} # "櫻子, where are you going out to today?"
]

print(chat(messages))
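
If you prefer token-by-token output instead of waiting for the full reply, here is a minimal streaming sketch built on the setup above (the use of TextStreamer is an assumption, not part of the original example; streaming is incompatible with num_beams > 1, so beam search is dropped here):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to('cuda:0')
# tokens are printed to stdout as they are generated
model.generate(input_ids, max_new_tokens=200, temperature=0.9, do_sample=True, top_k=50, top_p=0.95, repetition_penalty=1.18, streamer=streamer)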

You can also open this Colab demo link.

Limitation

The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. Although it is trained on a diverse range of Cantonese text, including news, Wikipedia articles, and textbooks, it may not handle informal or dialectal Cantonese well, and its output may contain bias and misinformation. Please use it with caution.

Hallucination

Like most LLMs, this model can hallucinate: it may generate incorrect or misleading information. Please use it with caution.
