---
inference: false
license: mit
base_model: indobenchmark/indogpt
tags:
- generated_from_trainer
model-index:
- name: kancilgpt
  results: []
---

# KancilGPT

Once upon a time, in a digital data forest, there was a language model called KancilGPT.

## Model Description
KancilGPT is a fine-tuned version of [indobenchmark/indogpt](https://huggingface.co/indobenchmark/indogpt).
Its task is generating Indonesian fable stories.
The model is named after a famous fable character that is wise (but also a master of trolling) and cute: the [_kancil_](https://en.wikipedia.org/wiki/Chevrotain).
KancilGPT was trained on an unpublished dataset gathered from [dongengceritarakyat.com](https://dongengceritarakyat.com/).

## Dataset and Prompt
The dataset consists of 388 Indonesian fable stories.
These stories were gathered from [dongengceritarakyat.com](https://dongengceritarakyat.com/) on January 8, 2024.
Duplicated stories without any paraphrasing were removed, based on the cosine similarity of TF-IDF word-trigram vectors.
Furthermore, the remaining stories were cleaned manually to remove non-fable stories, incomplete stories (e.g. synopses), misused punctuation, and typos.
This cleaning is still ongoing: whenever a mistake is found, the dataset is fixed as soon as possible.

The cleaned stories were split with an 80:10:10 ratio, giving
- 310 stories for training,
- 39 stories for evaluation, and
- 39 stories for testing (unused for now).

The split is based on the cosine similarity of TF-IDF word trigrams, the same measure used for duplicate handling.
Stories are picked one by one, prioritizing the story whose maximum cosine similarity to the remaining stories is smallest.
The first 39 picked stories are used for testing, and the rest are assigned randomly to training and evaluation.
This method makes sure that no paraphrased duplicate of a test story leaks into the other splits; a sketch of the selection is shown below.

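Below is a minimal sketch of this similarity-based selection, assuming scikit-learn's `TfidfVectorizer`. The original preprocessing pipeline is unpublished, so the function name and details here are illustrative assumptions only.

```python
# Illustrative sketch only; the original preprocessing pipeline is unpublished.
# Assumes scikit-learn is installed and `stories` is a list of story strings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_test_stories(stories, k=39):
    # TF-IDF over word trigrams, as described above.
    tfidf = TfidfVectorizer(analyzer="word", ngram_range=(3, 3)).fit_transform(stories)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity

    remaining = list(range(len(stories)))
    picked = []
    for _ in range(k):
        # Prefer the story whose maximum similarity to the rest is smallest.
        best = min(remaining, key=lambda i: sim[i, remaining].max())
        remaining.remove(best)
        picked.append(best)
    return picked  # indices of the test stories
```
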
To teach KancilGPT to generate a story, prompts were built in the following formats:
1. `<s> awal cerita | judul: <title> | <first-one-or-two-story-chunks> | tamat </s>`
2. `<s> awal cerita | judul: <title> | <first-two-story-chunks> | bersambung </s>`
3. `<s> pertengahan cerita | judul: <title> | <last-two-story-chunks> | tamat </s>`
4. `<s> pertengahan cerita | judul: <title> | <two-story-chunks> | bersambung </s>`

All prompts are in Indonesian. In general, a prompt has four parts:
1. the story part type, which is either the beginning of a story (`awal cerita`) or the middle of a story (`pertengahan cerita`);
2. the story title (`judul`);
3. the story chunks; and
4. the story end status, which is either "to be continued" (`bersambung`) or "the end" (`tamat`).

A story chunk consists of consecutive sentences that together contain at least 300 characters.
In a prompt, the two chunks must be adjacent and in their original order. For a story that consists
of _n_ chunks, prompts were built from the first and second chunks with the second format, from the
_k_-th and _(k + 1)_-th chunks with the fourth format for _2 ≤ k ≤ (n - 2)_, and from the _(n - 1)_-th and _n_-th chunks
with the third format (which ends with `tamat`). The first format is a special case for a short story
with only one or two chunks in total. A sketch of this construction follows.

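Below is a minimal sketch of this prompt construction; the helper name `build_prompts` is hypothetical, and `chunks` is assumed to be a story's chunks in their original order.

```python
# Hypothetical helper illustrating the four prompt formats described above.
def build_prompts(title, chunks):
    n = len(chunks)
    if n <= 2:
        # Format 1: the whole story fits in one or two chunks.
        body = " ".join(chunks)
        return [f"<s> awal cerita | judul: {title} | {body} | tamat </s>"]
    prompts = [
        # Format 2: the first two chunks open the story.
        f"<s> awal cerita | judul: {title} | {chunks[0]} {chunks[1]} | bersambung </s>"
    ]
    # Format 4: middle chunk pairs, story still continuing.
    for k in range(1, n - 2):
        prompts.append(
            f"<s> pertengahan cerita | judul: {title} | {chunks[k]} {chunks[k + 1]} | bersambung </s>"
        )
    # Format 3: the last two chunks close the story.
    prompts.append(
        f"<s> pertengahan cerita | judul: {title} | {chunks[n - 2]} {chunks[n - 1]} | tamat </s>"
    )
    return prompts
```
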
## How to Use
Having learned how to generate Indonesian fable stories, KancilGPT can generate a random fable story with the procedure below.
All steps use the generation arguments `do_sample=True`, `max_new_tokens=512`, and `pad_token_id=<eos_token_id>`.
The Hugging Face pipeline cannot be used yet, since KancilGPT uses the `IndoNLGTokenizer` class from [`indobenchmark-toolkit`](https://github.com/indobenchmark/indobenchmark-toolkit).

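Instead, the model can be loaded directly. Below is a minimal loading-and-generation sketch; it assumes this repository's id is `abdiharyadi/kancilgpt` and a GPT-2-style causal LM head, like the indobenchmark/indogpt base model.

```python
# Minimal sketch; assumes the checkpoint id abdiharyadi/kancilgpt and a
# GPT-2-style causal LM, like the indobenchmark/indogpt base model.
# Requires: pip install transformers indobenchmark-toolkit torch
from indobenchmark import IndoNLGTokenizer
from transformers import GPT2LMHeadModel

tokenizer = IndoNLGTokenizer.from_pretrained("abdiharyadi/kancilgpt")
model = GPT2LMHeadModel.from_pretrained("abdiharyadi/kancilgpt")

prompt = "<s> awal cerita | judul:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs["input_ids"],
    do_sample=True,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0]))
```
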
### Step 1: Begin the story
Use this prompt to generate the beginning of a story, including the generation of a title (`judul`):
```
<s> awal cerita | judul:
```

Below is an example output:
```
<s> awal cerita | judul: elang dan singa | suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana. dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. | bersambung</s>
```
Note that the real output may be longer, with additional random tokens following the `</s>`; that is normal, and everything after `</s>` can be discarded.
From the generated output, check the story's end status right before the `</s>` token. If it is `tamat`, the story ends; go to step 3.
If it is `bersambung`, the story should be continued. Mark the first _k_ sentences that together contain at least 300 characters, with _k_ as small as possible.
Take the remaining sentences as the next chunk for the next prompt in step 2. Below is the next chunk from the example output:
```
sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.
```

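This sentence-level handoff can be sketched as follows; the helper name `next_chunk` is hypothetical, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical helper: drop the first k sentences that together reach
# 300 characters (k as small as possible) and return the rest.
def next_chunk(story_text):
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", story_text.strip())
    consumed = 0
    for k, sentence in enumerate(sentences, start=1):
        consumed += len(sentence) + 1  # +1 for the separating space
        if consumed >= 300:
            return " ".join(sentences[k:])
    return ""  # the whole text fits inside the first chunk
```
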
### Step 2: Continue the story
With the existing title and the next chunk, use this prompt format to continue the story:
```
<s> pertengahan cerita | judul: <title> | <next-chunk>
```

Below is an example prompt built from the next chunk of step 1:
```
<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.
```

Below is an example output:
```
<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan. | bersambung</s>
```

From the generated output, check the story's end status right before the `</s>` token again. If it is `tamat`, the story ends; go to step 3.
If it is `bersambung`, the story should be continued. Unlike step 1, the next chunk is taken from the newly generated tokens only. Below is the next chunk from the example output:
```
" kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan.
```

Repeat step 2 with the current next chunk until the end status is `tamat`; the sketch below puts the whole loop together.

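Below is a hypothetical driver that combines steps 1 and 2; it reuses `tokenizer`, `model`, and the hypothetical `next_chunk` helper from the earlier sketches, and assumes the decoded text round-trips the prompt string exactly, which may need adjustments in practice.

```python
# Hypothetical driver loop; illustration only.
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        do_sample=True,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the text up to the first </s>.
    return tokenizer.decode(output_ids[0]).split("</s>")[0]

# Step 1: begin the story and parse "<part type> | judul: <title> | <story> | <status>".
output = generate_text("<s> awal cerita | judul:")
_, title, story, status = [part.strip() for part in output.split("|")]
title = title.removeprefix("judul:").strip()

# Step 2: repeat until the end status is "tamat".
chunk = next_chunk(story)  # handoff from step 1
while status == "bersambung":
    prompt = f"<s> pertengahan cerita | judul: {title} | {chunk}"
    output = generate_text(prompt)
    new_text = output[len(prompt):]  # unlike step 1: new tokens only
    continuation, _, status = new_text.rpartition("|")
    status = status.strip()
    chunk = continuation.strip()
    story += " " + chunk

# Step 3: `story` now holds all chunks put together.
```
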
### Step 3: Finish the story
Take all the story chunks from the generated outputs and put them together. The story is finished!

Below is an example of a generated story:
```
elang dan singa

suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana
dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk
berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat
marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa
pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati
kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita
sudah meminta makan. " kata singa, penuh penyesalan. " tidak usah bersedih. memang benar kata serigala. kita tidak boleh makan rumput liar seperti kemarin. serigala
tidak boleh makan rumput liar. serigala tidak boleh makan rumput liar. " jawab singa. tikus berpikir bahwa dengan memakan rumput liar ia juga akan menderita
serigala. ia berpikir untuk mencari rumput liar di padang rumput. tikus pun berpikir, bahwa serigala tidak akan dapat memangsa serigala dan tidak akan dapat memangsa
serigala. namun, ternyata dengan memakan rumput liar, ia dapat mengalahkan serigala.
```

Below is the English translation, produced with the help of the [OpusMT model](https://huggingface.co/Helsinki-NLP/opus-mt-id-en):
```
eagle and lion

one day, a mountain erupted and struck many inhabitants. because of this incident, many animals were displaced, including those of the mountain people who lived
there. of them, the wolves were always on that mountain. among them was a lion. he has a very kind personality too. it's easy to be friends with the wolves. he's
very obedient and respectful to the king of the forest. one day, a lion saw a lion eating wild grass. the lion was very angry to hear of this. it made the wolves
angry. "oh no... this isn't about me. i'm just a small little animal capable of eating only regular grass without ever feeding that little goat. what if i give
something to the wolves?" the lion exclaimed. "we don't eat wild grass and we'll starve to death together." replied the lion. " but i still eat the wild grasses, "
the lion replied. " wolves won't eat us, even though we've asked for food. " says the lion, full of regret. " don't be sad. it's true that wolves say we can't eat
wild grasses like yesterday. the wolves can't eat wild grasses. the wolves can't eat wild grasses. " answers the lion. the rat thought that by eating the wild grasses he
would also suffer the wolves. he thought of looking for the wild grasses in the meadow. even the rat thought, that wolves would not be able to prey on wolves and
would not be able to prey on wolves. however, it turns out that by eating wild grasses, it can defeat wolves.
```

## Limitations

The reader probably got confused after reading the generated story above. This shows the limitations of KancilGPT.
The generated story sometimes
1. shows no correlation between the title and the content (where is the eagle?),
2. introduces a new character out of nowhere (where did the rat come from?),
3. introduces new characters with the same name, leading to confusing anaphora resolution ("One day, _a lion_ saw _a lion_ eating wild grass."), and
4. produces illogical sentences ("By eating wild grasses, it can defeat wolves.").

Furthermore, all stories involved with KancilGPT are lowercased, because the pretrained model was trained on lowercase texts.
In the end, all of these limitations open up opportunities to make KancilGPT better over time. This is just the beginning.
By exploring the digital forest more deeply, KancilGPT will generate higher-quality Indonesian fable stories in the future.

The end.

---

## Behind The Story: Training Procedure

### Training hyperparameters

The following hyperparameters were used during training:
- lr_scheduler_type: linear
- num_epochs: 10

An early stopping callback was also used, with `early_stopping_patience=3`.

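For reference, such a setup with the `transformers` `Trainer` might look like the sketch below; the actual training script is unpublished, so the surrounding arguments are assumptions.

```python
# Sketch only; the actual training script is unpublished.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="kancilgpt",
    num_train_epochs=10,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",      # early stopping needs periodic evaluation
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best-validation-loss checkpoint
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,                      # the causal LM being fine-tuned
    args=training_args,
    train_dataset=train_dataset,      # assumed prepared elsewhere
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```
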
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.8509        | 4.0   | 1120 | 2.0245          |
| 0.7956        | 5.0   | 1400 | 2.0466          |

Choosing the checkpoint with the best validation loss, KancilGPT achieves `loss=1.9729` on the evaluation set.

### Framework versions

- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0