---
inference: false
license: mit
base_model: indobenchmark/indogpt
tags:
- generated_from_trainer
model-index:
- name: kancilgpt
  results: []
---

# KancilGPT

Once upon a time, in a digital data forest, there was a language model called KancilGPT.

## Model Description
KancilGPT is a fine-tuned version of [indobenchmark/indogpt](https://huggingface.co/indobenchmark/indogpt).
Its task is generating Indonesian fable stories.
The model is named after a famous fable character that is wise (but also a master of trolling) and cute: the [_kancil_](https://en.wikipedia.org/wiki/Chevrotain).
KancilGPT was trained on an unpublished dataset gathered from [dongengceritarakyat.com](https://dongengceritarakyat.com/).

## Dataset and Prompt
The dataset consists of 388 Indonesian fable stories.
These stories were gathered from [dongengceritarakyat.com](https://dongengceritarakyat.com/) on January 8, 2024.
Duplicated stories without any paraphrasing were removed, based on the cosine similarity of TF-IDF word-trigram vectors.
Furthermore, the remaining stories were cleaned manually to remove non-fable stories, incomplete stories (e.g. synopses), misused punctuation, and typos.
This cleaning is still ongoing: whenever a mistake is found, the dataset is fixed as soon as possible.

The cleaned stories were split with an 80:10:10 ratio, giving
- 310 stories for training,
- 39 stories for evaluation, and
- 39 stories for testing (unused for now).

The split is based on the cosine similarity of TF-IDF word trigrams, the same measure used for duplicate handling.
Stories are picked one by one, prioritizing the story whose maximum cosine similarity to the remaining stories is smallest.
The first 39 picked stories are used for testing, and the rest are assigned randomly to training and evaluation.
This method makes sure that no paraphrased duplicate of a test story leaks into the other splits; a sketch of the selection is shown below.

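Below is a minimal sketch of this similarity-based selection, assuming scikit-learn's `TfidfVectorizer`. The original preprocessing pipeline is unpublished, so the function name and details here are illustrative assumptions only.

```python
# Illustrative sketch only; the original preprocessing pipeline is unpublished.
# Assumes scikit-learn is installed and `stories` is a list of story strings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_test_stories(stories, k=39):
    # TF-IDF over word trigrams, as described above.
    tfidf = TfidfVectorizer(analyzer="word", ngram_range=(3, 3)).fit_transform(stories)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity

    remaining = list(range(len(stories)))
    picked = []
    for _ in range(k):
        # Prefer the story whose maximum similarity to the rest is smallest.
        best = min(remaining, key=lambda i: sim[i, remaining].max())
        remaining.remove(best)
        picked.append(best)
    return picked  # indices of the test stories
```
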
To teach KancilGPT to generate a story, prompts were built in the following formats:
1. `<s> awal cerita | judul: <title> | <first-one-or-two-story-chunks> | tamat </s>`
2. `<s> awal cerita | judul: <title> | <first-two-story-chunks> | bersambung </s>`
3. `<s> pertengahan cerita | judul: <title> | <last-two-story-chunks> | tamat </s>`
4. `<s> pertengahan cerita | judul: <title> | <two-story-chunks> | bersambung </s>`

All prompts are in Indonesian. In general, a prompt has four parts:
1. the story part type, which is either the beginning of a story (`awal cerita`) or the middle of a story (`pertengahan cerita`);
2. the story title (`judul`);
3. the story chunks; and
4. the story end status, which is either "to be continued" (`bersambung`) or "the end" (`tamat`).

A story chunk consists of consecutive sentences that together contain at least 300 characters.
In a prompt, the two chunks must be adjacent and in their original order. For a story that consists
of _n_ chunks, prompts were built from the first and second chunks with the second format, from the
_k_-th and _(k + 1)_-th chunks with the fourth format for _2 ≤ k ≤ (n - 2)_, and from the _(n - 1)_-th and _n_-th chunks
with the third format (which ends with `tamat`). The first format is a special case for a short story
with only one or two chunks in total. A sketch of this construction follows.

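Below is a minimal sketch of this prompt construction; the helper name `build_prompts` is hypothetical, and `chunks` is assumed to be a story's chunks in their original order.

```python
# Hypothetical helper illustrating the four prompt formats described above.
def build_prompts(title, chunks):
    n = len(chunks)
    if n <= 2:
        # Format 1: the whole story fits in one or two chunks.
        body = " ".join(chunks)
        return [f"<s> awal cerita | judul: {title} | {body} | tamat </s>"]
    prompts = [
        # Format 2: the first two chunks open the story.
        f"<s> awal cerita | judul: {title} | {chunks[0]} {chunks[1]} | bersambung </s>"
    ]
    # Format 4: middle chunk pairs, story still continuing.
    for k in range(1, n - 2):
        prompts.append(
            f"<s> pertengahan cerita | judul: {title} | {chunks[k]} {chunks[k + 1]} | bersambung </s>"
        )
    # Format 3: the last two chunks close the story.
    prompts.append(
        f"<s> pertengahan cerita | judul: {title} | {chunks[n - 2]} {chunks[n - 1]} | tamat </s>"
    )
    return prompts
```
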
## How to Use
Having learned how to generate Indonesian fable stories, KancilGPT can generate a random fable story with the procedure below.
All steps use the generation arguments `do_sample=True`, `max_new_tokens=512`, and `pad_token_id=<eos_token_id>`.
The Hugging Face pipeline cannot be used yet, since KancilGPT uses the `IndoNLGTokenizer` class from [`indobenchmark-toolkit`](https://github.com/indobenchmark/indobenchmark-toolkit).

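Instead, the model can be loaded directly. Below is a minimal loading-and-generation sketch; it assumes this repository's id is `abdiharyadi/kancilgpt` and a GPT-2-style causal LM head, like the indobenchmark/indogpt base model.

```python
# Minimal sketch; assumes the checkpoint id abdiharyadi/kancilgpt and a
# GPT-2-style causal LM, like the indobenchmark/indogpt base model.
# Requires: pip install transformers indobenchmark-toolkit torch
from indobenchmark import IndoNLGTokenizer
from transformers import GPT2LMHeadModel

tokenizer = IndoNLGTokenizer.from_pretrained("abdiharyadi/kancilgpt")
model = GPT2LMHeadModel.from_pretrained("abdiharyadi/kancilgpt")

prompt = "<s> awal cerita | judul:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs["input_ids"],
    do_sample=True,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0]))
```
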
### Step 1: Begin the story
Use this prompt to generate the beginning of a story, including the generation of a title (`judul`):
```
<s> awal cerita | judul:
```

Below is an example output:
```
<s> awal cerita | judul: elang dan singa | suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana. dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. | bersambung</s>
```
Note that the real output may be longer, with additional random tokens following the `</s>`; that is normal, and everything after `</s>` can be discarded.
From the generated output, check the story's end status right before the `</s>` token. If it is `tamat`, the story ends; go to step 3.
If it is `bersambung`, the story should be continued. Mark the first _k_ sentences that together contain at least 300 characters, with _k_ as small as possible.
Take the remaining sentences as the next chunk for the next prompt in step 2. Below is the next chunk from the example output:
```
sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.
```

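This sentence-level handoff can be sketched as follows; the helper name `next_chunk` is hypothetical, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical helper: drop the first k sentences that together reach
# 300 characters (k as small as possible) and return the rest.
def next_chunk(story_text):
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", story_text.strip())
    consumed = 0
    for k, sentence in enumerate(sentences, start=1):
        consumed += len(sentence) + 1  # +1 for the separating space
        if consumed >= 300:
            return " ".join(sentences[k:])
    return ""  # the whole text fits inside the first chunk
```
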
### Step 2: Continue the story
With the existing title and the next chunk, use this prompt format to continue the story:
```
<s> pertengahan cerita | judul: <title> | <next-chunk>
```

Below is an example prompt built from the next chunk of step 1:
```
<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa.
```

Below is an example output:
```
<s> pertengahan cerita | judul: elang dan singa | sangat mudah untuk berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan. | bersambung</s>
```

From the generated output, check the story's end status right before the `</s>` token again. If it is `tamat`, the story ends; go to step 3.
If it is `bersambung`, the story should be continued. Unlike step 1, the next chunk is taken from the newly generated tokens only. Below is the next chunk from the example output:
```
" kita tidak makan rumput liar dan kita akan mati kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita sudah meminta makan. " kata singa, penuh penyesalan.
```

Repeat step 2 with the current next chunk until the end status is `tamat`; the sketch below puts the whole loop together.

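Below is a hypothetical driver that combines steps 1 and 2; it reuses `tokenizer`, `model`, and the hypothetical `next_chunk` helper from the earlier sketches, and assumes the decoded text round-trips the prompt string exactly, which may need adjustments in practice.

```python
# Hypothetical driver loop; illustration only.
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        do_sample=True,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the text up to the first </s>.
    return tokenizer.decode(output_ids[0]).split("</s>")[0]

# Step 1: begin the story and parse "<part type> | judul: <title> | <story> | <status>".
output = generate_text("<s> awal cerita | judul:")
_, title, story, status = [part.strip() for part in output.split("|")]
title = title.removeprefix("judul:").strip()

# Step 2: repeat until the end status is "tamat".
chunk = next_chunk(story)  # handoff from step 1
while status == "bersambung":
    prompt = f"<s> pertengahan cerita | judul: {title} | {chunk}"
    output = generate_text(prompt)
    new_text = output[len(prompt):]  # unlike step 1: new tokens only
    continuation, _, status = new_text.rpartition("|")
    status = status.strip()
    chunk = continuation.strip()
    story += " " + chunk

# Step 3: `story` now holds all chunks put together.
```
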
### Step 3: Finish the story
Take all the story chunks from the generated outputs and put them together. The story is finished!

Below is an example of a generated story:
```
elang dan singa

suatu hari, sebuah gunung meletus dan menimpa banyak penduduk. karena kejadian ini, banyak hewan yang mengungsi, termasuk mereka warga gunung yang tinggal di sana
dari mereka para serigala selalu ada di gunung tersebut. di antara mereka ada seekor singa. dia pun mempunyai sifat yang sangat baik hati. sangat mudah untuk
berteman dengan para serigala. dia sangat patuh dan menghormati sang raja hutan. suatu hari, singa melihat seekor singa sedang memakan rumput liar. singa sangat
marah mendengar hal ini. hal itu membuat serigala marah. " oh tidak... ini bukan tentang aku. aku hanya hewan kecil yang kecil hanya mampu makan rumput biasa tanpa
pernah memberi makan kambing kecil itu. bagaimana jika aku memberikan sesuatu untuk para serigala? " seru singa. " kita tidak makan rumput liar dan kita akan mati
kelaparan nanti bersama-sama. " jawab singa. " tetapi aku masih tetap memakan rumput liar ini. " jawab singa. " serigala tidak akan memangsa kita, meskipun kita
sudah meminta makan. " kata singa, penuh penyesalan. " tidak usah bersedih. memang benar kata serigala. kita tidak boleh makan rumput liar seperti kemarin. serigala
tidak boleh makan rumput liar. serigala tidak boleh makan rumput liar. " jawab singa. tikus berpikir bahwa dengan memakan rumput liar ia juga akan menderita
serigala. ia berpikir untuk mencari rumput liar di padang rumput. tikus pun berpikir, bahwa serigala tidak akan dapat memangsa serigala dan tidak akan dapat memangsa
serigala. namun, ternyata dengan memakan rumput liar, ia dapat mengalahkan serigala.
```

Below is the English translation, produced with the help of the [OpusMT model](https://huggingface.co/Helsinki-NLP/opus-mt-id-en):
```
eagle and lion

one day, a mountain erupted and struck many inhabitants. because of this incident, many animals were displaced, including those of the mountain people who lived
there. of them, the wolves were always on that mountain. among them was a lion. he has a very kind personality too. it's easy to be friends with the wolves. he's
very obedient and respectful to the king of the forest. one day, a lion saw a lion eating wild grass. the lion was very angry to hear of this. it made the wolves
angry. "oh no... this isn't about me. i'm just a small little animal capable of eating only regular grass without ever feeding that little goat. what if i give
something to the wolves?" the lion exclaimed. "we don't eat wild grass and we'll starve to death together." replied the lion. " but i still eat the wild grasses, "
the lion replied. " wolves won't eat us, even though we've asked for food. " says the lion, full of regret. " don't be sad. it's true that wolves say we can't eat
wild grasses like yesterday. the wolves can't eat wild grasses. the wolves can't eat wild grasses. " answers the lion. the rat thought that by eating the wild grasses he
would also suffer the wolves. he thought of looking for the wild grasses in the meadow. even the rat thought, that wolves would not be able to prey on wolves and
would not be able to prey on wolves. however, it turns out that by eating wild grasses, it can defeat wolves.
```

## Limitations

The reader probably got confused after reading the generated story above. This shows the limitations of KancilGPT.
The generated story sometimes
1. shows no correlation between the title and the content (where is the eagle?),
2. introduces a new character out of nowhere (where did the rat come from?),
3. introduces new characters with the same name, leading to confusing anaphora resolution ("One day, _a lion_ saw _a lion_ eating wild grass."), and
4. produces illogical sentences ("By eating wild grasses, it can defeat wolves.").

Furthermore, all stories involved with KancilGPT are lowercased, because the pretrained model was trained on lowercase texts.
In the end, all of these limitations open up opportunities to make KancilGPT better over time. This is just the beginning.
By exploring the digital forest more deeply, KancilGPT will generate higher-quality Indonesian fable stories in the future.

The end.

---

## Behind The Story: Training Procedure

### Training hyperparameters

The following hyperparameters were used during training:
- lr_scheduler_type: linear
- num_epochs: 10

An early stopping callback was also used, with `early_stopping_patience=3`.

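For reference, such a setup with the `transformers` `Trainer` might look like the sketch below; the actual training script is unpublished, so the surrounding arguments are assumptions.

```python
# Sketch only; the actual training script is unpublished.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="kancilgpt",
    num_train_epochs=10,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",      # early stopping needs periodic evaluation
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best-validation-loss checkpoint
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,                      # the causal LM being fine-tuned
    args=training_args,
    train_dataset=train_dataset,      # assumed prepared elsewhere
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```
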
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.8509        | 4.0   | 1120 | 2.0245          |
| 0.7956        | 5.0   | 1400 | 2.0466          |

Choosing the checkpoint with the best validation loss, KancilGPT achieves `loss=1.9729` on the evaluation set.

### Framework versions

- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0