Fine-Tuning 01-ai/Yi-Coder-1.5B for UniBasic Code Generation with ChatML Format

by Sumeet93

I am working on fine-tuning the 01-ai/Yi-Coder-1.5B model to support code generation in UniBasic, a niche programming language that has limited public data available. My goal is to enhance the model's ability to write accurate UniBasic programs. I am using ChatML format, as specified in the documentation.
Here’s a sample of how I have structured my training data:
```json
[
  {
    "role": "system",
    "content": "You are a helpful assistant that writes UniBasic programs. The task has a difficulty of beginner and covers the concepts of PRINT, PROGRAM, END."
  },
  {
    "role": "user",
    "content": "Write a UniBasic program named HelloWorld that follows the prompt: Write a UniBasic program named HelloWorld that clears the screen and prints the message 'Hello, World! Sumit Here'."
  },
  {
    "role": "assistant",
    "content": "PROGRAM HelloWorld\n CLEAR\n PRINT \"Hello, World! Sumit Here\"\nEND"
  }
]
```
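For reference, this is roughly how I turn each message list into a single training string. This is a minimal sketch: it assumes the standard ChatML markers and falls back to manual formatting if the tokenizer does not ship a chat template.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-Coder-1.5B")

messages = [
    {"role": "system", "content": "You are a helpful assistant that writes UniBasic programs."},
    {"role": "user", "content": "Write a UniBasic program named HelloWorld that clears the screen and prints 'Hello, World!'."},
    {"role": "assistant", "content": "PROGRAM HelloWorld\n CLEAR\n PRINT \"Hello, World!\"\nEND"},
]

if tokenizer.chat_template is not None:
    # Use the tokenizer's own chat template if one is defined.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
else:
    # Manual ChatML formatting as a fallback.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
print(text)
```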
Fine-Tuning Process:
I've extended the tokenizer with special tokens relevant to UniBasic (e.g., PROGRAM, CLEAR, PRINT, END).
Training Data: 117 examples

Training Arguments:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_yicoder_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_steps=100,
    save_steps=500,
    warmup_steps=100,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    evaluation_strategy="steps",
    save_total_limit=3,
)
```

Current Output: Even for a simple addition program, the model generates [UNK] (unknown) tokens.
Please help me answer the questions below:

  1. How can I fine-tune a model effectively on a small dataset for a niche language like UniBasic?
  2. What are the best practices for handling tokenization and adding special tokens for a niche programming language?
  3. How can I improve the model's performance when dealing with limited training data?
  4. What hyperparameter adjustments should I make when fine-tuning with a small code dataset?
  5. What evaluation strategies can I use to measure the model’s accuracy and generation quality with limited test data?

Hi 👋 @Sumeet93, thanks for your questions! Let's tackle them one by one.
Regarding question 1: You can address this by expanding the examples in your corpus, incorporating code from similar domains (like BASIC), or using data augmentation techniques. For instance, you could generate more programs with the same structure but different content, similar to automatically creating multiple variations of a "HelloWorld" program.
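As a rough illustration, template-based augmentation could look like the sketch below. The program skeleton and the slot values are made-up examples, not taken from your dataset:

```python
# Sketch: render the same UniBasic program skeleton with different names/messages
# to multiply the number of training examples. Values below are illustrative only.
import itertools
import json

TEMPLATE = 'PROGRAM {name}\n CLEAR\n PRINT "{message}"\nEND'

names = ["HelloWorld", "Greeter", "Welcome"]
messages = ["Hello, World!", "Welcome to UniBasic", "Goodbye"]

augmented = []
for name, message in itertools.product(names, messages):
    augmented.append([
        {"role": "system",
         "content": "You are a helpful assistant that writes UniBasic programs."},
        {"role": "user",
         "content": f"Write a UniBasic program named {name} that clears the screen "
                    f"and prints the message '{message}'."},
        {"role": "assistant",
         "content": TEMPLATE.format(name=name, message=message)},
    ])

with open("unibasic_augmented.json", "w") as f:
    json.dump(augmented, f, indent=2)
```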

Regarding question 2: I believe this is related to the data volume. With a limited dataset, the model might not have learned the language's representation adequately. Alternatively, you can expand the tokenizer's vocabulary to address the specific issue: use the tokenizer's add_tokens or add_special_tokens method to add the UniBasic keywords. After adding tokens, remember to resize the model's embedding matrix to the new vocabulary size, and make sure the training data is tokenized with the updated tokenizer.
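A minimal sketch of that flow is below. The key step is `resize_token_embeddings`; if you add tokens without it, the new token IDs have no trainable embeddings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-Coder-1.5B")
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-Coder-1.5B")

# Add UniBasic keywords as new tokens (add_special_tokens with
# "additional_special_tokens" is the alternative for special tokens).
new_tokens = ["PROGRAM", "CLEAR", "PRINT", "END"]
num_added = tokenizer.add_tokens(new_tokens)

# Crucial: grow the embedding matrix so the new token IDs can be trained.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Quick sanity check that nothing falls back to the unknown token.
ids = tokenizer('PROGRAM HelloWorld\n PRINT "Hi"\nEND')["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```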

Regarding question 3: Similar to question 1, data augmentation is an effective way to improve the model's generalization ability. You can modify the existing 117 training data points to generate more programs with similar structures but different content.

Regarding question 4: With a small dataset, consider reducing the batch size and increasing the number of training epochs. This allows the model to learn repeatedly from each data point. Your current batch_size=16 might be too large for a small dataset. I suggest trying to adjust per_device_train_batch_size and per_device_eval_batch_size to 8 or even smaller. I also recommend lowering the learning rate; the current learning_rate=5e-5 could be further reduced to 2e-5 or even 1e-5.
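For example, one possible adjustment of your current arguments could look like this. The values are starting points to experiment with, not tuned results:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_yicoder_model",
    overwrite_output_dir=True,
    num_train_epochs=10,              # more passes over the small dataset
    per_device_train_batch_size=4,    # smaller batches -> more update steps
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,               # lower LR to reduce overfitting
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=20,                    # with 117 examples, evaluate frequently
    save_steps=20,
    save_total_limit=3,
    logging_dir="./logs",
)
```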

Regarding question 5: Indeed, when fine-tuning on a small dataset, traditional accuracy evaluation might not be sufficient to assess the quality of code generation. You can evaluate the correctness of the generated code by writing specialized unit tests and having the model-generated code pass those tests.
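A rough sketch of such an evaluation loop is below. The eval cases and the `check_program` function are placeholders; a real harness would execute the generated code in a UniBasic environment and verify its actual output:

```python
# Sketch of a test-style evaluation loop for the fine-tuned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./fine_tuned_yicoder_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

eval_cases = [
    {
        "prompt": "Write a UniBasic program named AddTwo that prints the sum of 2 and 3.",
        "expected_substrings": ["PROGRAM AddTwo", "PRINT", "END"],
    },
]

def check_program(code, expected_substrings):
    # Placeholder check: substring matching instead of actually running the code.
    return all(s in code for s in expected_substrings)

passed = 0
for case in eval_cases:
    # Prompt in the same ChatML layout used for fine-tuning.
    prompt = f"<|im_start|>user\n{case['prompt']}<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    generated = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    passed += int(check_program(generated, case["expected_substrings"]))

print(f"Passed {passed}/{len(eval_cases)} functional checks")
```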

Thanks @haijian06 for the prompt response. Can you please confirm that the training data format mentioned above is correct? One thing I forgot to mention: yes, we have already added the specific tokens to the tokenizer, but I still get [UNK] tokens in the output even for a simple program.
