
KancilGPT

(Note: KancilGPT is still under development.)

Once upon a time, in a digital data forest, there was a language model called KancilGPT.

Model Description

KancilGPT is a fine-tuned version of indobenchmark/indogpt. Its task is to generate Indonesian fable stories. The model is named after a famous, wise (but also a master of trolling), cute fable character: kancil, the mouse-deer. KancilGPT was trained on an unpublished dataset gathered from dongengceritarakyat.com.

Dataset and Prompt

The dataset consists of 388 Indonesian fable stories gathered from dongengceritarakyat.com on January 8, 2024. Duplicated stories without any paraphrasing were removed based on the cosine similarity of TF-IDF word-trigram vectors. The remaining stories were then cleaned manually to remove non-fable stories, incomplete stories (e.g., synopses), misused punctuation, and typos. This cleaning is still ongoing: if a mistake is found, the dataset will be corrected as soon as possible.

The cleaned stories were split with an 80:10:10 ratio, giving

  • 310 stories for training,
  • 39 stories for evaluation, and
  • 39 stories for testing (currently unused).

The split is based on the cosine similarity of TF-IDF word-trigram vectors, the same measure used for duplicate handling. Stories are selected one by one, prioritizing the story whose maximum cosine similarity to the other stories is smallest. The first 39 selected stories are used for testing, and the rest are assigned randomly to training and evaluation. This method is used to make sure that no paraphrased duplicate story exists in the test data.
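For illustration only, this similarity check can be approximated with scikit-learn as below; the exact vectorizer settings and thresholds used for KancilGPT are not published, so every detail here is an assumption.

```python
# Hypothetical sketch of the similarity check: cosine similarity of TF-IDF
# vectors over word trigrams. Exact settings and thresholds are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stories = [
    "kancil berjalan di hutan untuk mencari mentimun segar di ladang pak tani.",
    "kancil berjalan-jalan di hutan mencari mentimun segar di ladang pak tani.",
    "buaya menunggu di tepi sungai sambil menahan rasa lapar sepanjang hari.",
]

# TF-IDF over word trigrams, as described in the model card.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(stories)

# Pairwise cosine similarity; values close to 1.0 flag near-duplicate stories.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```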

To teach KancilGPT to generate a story, the prompts were built in the following formats:

  1. <s> awal cerita | judul: <title> | <first-one-or-two-story-chunks> | tamat </s>
  2. <s> awal cerita | judul: <title> | <first-two-story-chunks> | bersambung </s>
  3. <s> pertengahan cerita | judul: <title> | <last-two-story-chunks> | tamat </s>
  4. <s> pertengahan cerita | judul: <title> | <two-story-chunks> | bersambung </s>

Indonesian is used for all prompts. Generally, a prompt has four parts:

  1. story part type—it can be the beginning of a story (awal cerita) or it can be the middle of a story (pertengahan cerita);
  2. story title (judul);
  3. story chunks; and
  4. story end status—it can be "to be continued" (bersambung) or "the end" (tamat).

A story chunk consists of consecutive sentences that together contain at least 300 characters. Within a prompt, the two chunks must be adjacent and in their original order. For a story consisting of n chunks, the prompts were built from the first and second chunks with the second prompt format, from the k-th and (k + 1)-th chunks with the fourth prompt format for 2 ≤ k ≤ (n - 2), and from the (n - 1)-th and n-th chunks with the third prompt format. The first prompt format is a special case for a small story with only one or two chunks in total.
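As a rough illustration, the chunking and prompt construction above might look like the sketch below. The naive sentence splitter and the exact boundary handling are assumptions, not the actual preprocessing code.

```python
# Hypothetical sketch of chunking and prompt construction. The sentence
# splitter and boundary handling are simplified assumptions.
import re

def split_into_chunks(story: str, min_chars: int = 300) -> list[str]:
    """Greedily group sentences into chunks of at least `min_chars` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", story.strip())
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:  # any leftover sentences form a final, shorter chunk
        chunks.append(current)
    return chunks

def build_prompts(title: str, chunks: list[str]) -> list[str]:
    """Build training prompts following the four formats described above."""
    n = len(chunks)
    if n <= 2:  # first format: a small story with one or two chunks
        return [f"<s> awal cerita | judul: {title} | {' '.join(chunks)} | tamat </s>"]
    prompts = [  # second format: first and second chunks
        f"<s> awal cerita | judul: {title} | {chunks[0]} {chunks[1]} | bersambung </s>"
    ]
    for k in range(1, n - 2):  # fourth format: middle chunk pairs
        prompts.append(
            f"<s> pertengahan cerita | judul: {title} | {chunks[k]} {chunks[k + 1]} | bersambung </s>"
        )
    prompts.append(  # third format: last two chunks
        f"<s> pertengahan cerita | judul: {title} | {chunks[n - 2]} {chunks[n - 1]} | tamat </s>"
    )
    return prompts
```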

How to Use

After learning how to generate an Indonesian fable story, KancilGPT can generate a random fable story with the procedure below. All steps use the generation arguments do_sample=True, max_new_tokens=512, and pad_token_id=<eos_token_id>. The Hugging Face pipeline cannot be used yet, since KancilGPT uses the IndoNLGTokenizer class from indobenchmark-toolkit.
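A minimal loading and generation sketch is shown below. It assumes pip install transformers indobenchmark-toolkit and that this repository's tokenizer files load with IndoNLGTokenizer.from_pretrained; it is not an official snippet from the model authors.

```python
# Minimal loading/generation sketch (assumptions noted above).
import torch
from transformers import GPT2LMHeadModel
from indobenchmark import IndoNLGTokenizer

model_id = "abdiharyadi/kancilgpt-v20240113"
tokenizer = IndoNLGTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

def generate(prompt: str) -> str:
    """Run one generation step with the arguments described above."""
    # The prompt already contains <s>, so no extra special tokens are added.
    input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            do_sample=True,
            max_new_tokens=512,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output_ids[0])
```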

Step 1: Begin the story

Use this prompt to generate the beginning of a story, including the generation of a title (judul):

<s> awal cerita | judul:
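With the hypothetical generate() helper sketched earlier, this step reduces to:

```python
# Step 1 (sketch): generate the beginning of a story, including a random title.
step1_output = generate("<s> awal cerita | judul:")
print(step1_output)
```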

Below is an example output:

<s> awal cerita | judul: elang dan singa | suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana. dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. | bersambung</s>

Note that the real output continues past the </s> token with additional random tokens; that is normal. From the generated output, check the story's end status just before the </s> token. If it is tamat, the story ends; go to step 3. If it is bersambung, the story should be continued. Mark the first k sentences that together contain at least 300 characters, with k as small as possible, and take the remaining sentences as the next chunk for the prompt in step 2. Below is the next chunk from the example output:

sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.
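A sketch of this bookkeeping is shown below, reusing the same naive sentence splitter assumed earlier; the parsing details are assumptions based on the output format illustrated above.

```python
# Hypothetical parsing of the step 1 output: extract the title, the end
# status, the full story text, and the next chunk for step 2.
import re

def parse_step1_output(output: str):
    body = output.split("</s>")[0]                   # drop anything after </s>
    parts = [part.strip() for part in body.split("|")]
    title = parts[1].removeprefix("judul:").strip()
    story_text, end_status = parts[2], parts[3]      # chunks and tamat/bersambung
    # Mark the first k sentences that reach at least 300 characters (k minimal);
    # the remaining sentences become the next chunk for step 2.
    sentences = re.split(r"(?<=[.!?])\s+", story_text)
    marked, k = "", 0
    for k, sentence in enumerate(sentences, start=1):
        marked = f"{marked} {sentence}".strip()
        if len(marked) >= 300:
            break
    next_chunk = " ".join(sentences[k:])
    return title, end_status, story_text, next_chunk
```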

Step 2: Continue the story

With the existing title and next chunk, use this prompt format to continue the story:

<s> pertengahan cerita | judul: <title> | <next-chunk>

Below is the example prompt built from the next chunk obtained in step 1:

<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.

Below is an example output:

<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan. | bersambung</s>

From the generated output, check the story's end status just before the </s> token. If it is tamat, the story ends; go to step 3. If it is bersambung, the story should be continued. Unlike step 1, the next chunk is taken from the newly generated tokens only. Below is the next chunk from the example output:

" kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan.

Repeat step 2 with the current next chunk until the end status is tamat.

Step 3: Finish the story

Take all the story chunks from the generated outputs and put them together. The story is finished!
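Putting the three steps together, the whole procedure might look like the loop below, built on the hypothetical generate() and parse_step1_output() helpers sketched earlier.

```python
# Full procedure (sketch). Step 1: begin the story.
step1_output = generate("<s> awal cerita | judul:")
title, end_status, story_text, next_chunk = parse_step1_output(step1_output)
story_chunks = [story_text]

# Step 2: continue the story until the end status is tamat.
while end_status == "bersambung":
    prompt = f"<s> pertengahan cerita | judul: {title} | {next_chunk}"
    body = generate(prompt).split("</s>")[0]
    end_status = body.rsplit("|", 1)[1].strip()      # tamat or bersambung
    # Keep only the newly generated text; this assumes the decoded output
    # starts with the prompt verbatim, which is an approximation.
    new_text = body[len(prompt):].rsplit("|", 1)[0].strip()
    story_chunks.append(new_text)
    next_chunk = new_text

# Step 3: put the chunks together.
print(title)
print(" ".join(story_chunks))
```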

Below is an example of a generated story:

elang dan singa

suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana.
dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk
berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat
marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa
pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati
kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita
sudah meminta makan. " kata singa, penuh penyesalan. " tidak usah bersedih. memang benar kata serigala. kita tidak boleh makan rumput liar seperti kemarin. serigala
tidak boleh makan rumput liar. serigala tidak boleh makan rumput liar. " jawab singa. tikus berpikir bahwa dengan memakan rumput liar ia juga akan menderita
serigala. ia berpikir untuk mencari rumput liar di padang rumput. tikus pun berpikir, bahwa serigala tidak akan dapat memangsa serigala dan tidak akan dapat memangsa
serigala. namun, ternyata dengan memakan rumput liar, ia dapat mengalahkan serigala.

Below is the English translation, produced with the help of an OpusMT model:

eagle and lion

one day, a mountain erupted and struck many inhabitants. because of this incident, many animals were displaced, including those of the mountain people who lived
there. of them, the wolves were always on that mountain. among them was a lion. he has a very kind personality too. it's easy to be friends with the wolves. he's
very obedient and respectful to the king of the forest. one day, a lion saw a lion eating wild grass. the lion was very angry to hear of this. it made the wolves
angry. "oh no... this isn't about me. i'm just a small little animal capable of eating only regular grass without ever feeding that little goat. what if i give
something to the wolves?" the lion exclaimed. "we don't eat wild grass and we'll starve to death together." replied the lion. " but i still eat the wild grasses, "
the lion replied. " wolves won't eat us, even though we've asked for food. " says the lion, full of regret. " don't be sad. it's true that wolves say we can't eat
wild grasses like yesterday. the wolves can't eat wild grasses. the wolves can't eat wild grasses. " answers the lion. the rat thought that by eating the wild grasses he
would also suffer the wolves. he thought of looking for the wild grasses in the meadow. even the rat thought, that wolves would not be able to prey on wolves and
would not be able to prey on wolves. however, it turns out that by eating wild grasses, it can defeat wolves.

Limitations

The reader probably got confused after reading the generated story above. This shows the limitations of KancilGPT. The generated story sometimes

  1. shows no correlation between title and content (Where is the eagle?),
  2. introduces a new character out of nowhere (Where did the rat come from?),
  3. introduces a new character with the same name, which leads to confusing anaphora resolution ("One day, a lion saw a lion eating wild grass."), and
  4. produces illogical sentences ("By eating wild grasses, it can defeat wolves.").

Furthermore, all stories involved with KancilGPT are lowercased because the pretrained model was trained on lowercase text. In the end, all of these limitations open opportunities to make KancilGPT better over time. This is just the beginning. By exploring the digital forest more deeply, KancilGPT will generate high-quality Indonesian fable stories in the future.

The end.


Behind The Story: Training Procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10

An early stopping callback is also used with early_stopping_patience=3.
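As a rough illustration only (not the authors' actual training script), these settings map onto the Hugging Face Trainer roughly as follows:

```python
# Rough illustration of the reported hyperparameters; not the actual script.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="kancilgpt",              # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# AdamW with betas=(0.9, 0.999) and epsilon=1e-8 is the Trainer's default
# optimizer, so it needs no extra configuration here.
```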

Training results

Training Loss | Epoch | Step | Validation Loss
1.1795        | 1.0   | 280  | 1.9856
1.0138        | 2.0   | 560  | 1.9729
0.9222        | 3.0   | 840  | 1.9975
0.8509        | 4.0   | 1120 | 2.0245
0.7956        | 5.0   | 1400 | 2.0466

Choosing the checkpoint with the best validation loss, KancilGPT achieves a loss of 1.9729 on the evaluation set.
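Assuming this is the usual per-token cross-entropy loss, it corresponds to an evaluation perplexity of roughly exp(1.9729) ≈ 7.2.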

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0