Text Generation
Transformers
Safetensors
Thai
English
gemma2
conversational
text-generation-inference
Inference Endpoints
mrpeerat committed on
Commit
121fa35
·
verified ·
1 Parent(s): 50d3660

Update README.md

Files changed (1)
  1. README.md +131 -3
README.md CHANGED
@@ -1,3 +1,131 @@
1
- ---
2
- license: gemma
3
- ---
1
+ ---
2
+ license: gemma
3
+ datasets:
4
+ - airesearch/WangchanThaiInstruct
5
+ - airesearch/WangchanX-FLAN-v6.1
6
+ - airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k
7
+ language:
8
+ - th
9
+ - en
10
+ base_model:
11
+ - aisingapore/gemma2-9b-cpt-sea-lionv3-base
12
+ pipeline_tag: text-generation
13
+ library_name: transformers
14
+ ---
15
+
16
+ # Gemma2 9B WangchanLIONv2 Instruct
17
+
18
+ WangchanLION is a joint effort between VISTEC and AI Singapore to develop a Thai-specific collection of Large Language Models (LLMs), pre-trained for Southeast Asian (SEA) languages, and instruct-tuned specifically for the Thai language.
19
+
20
+ Gemma2 9B WangchanLIONv2 Instruct is a multilingual model fine-tuned on around **3,760,000 Thai instruction-completion pairs** drawn from human-annotated instructions, FLAN-style automatically constructed data, and synthetic samples.
21
+
22
+ - **Developed by:** Products Pillar, AI Singapore, and VISTEC
23
+ - **Funded by:** Singapore NRF, PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited
24
+ - **Model type:** Decoder
25
+ - **Languages:** English, Thai
26
+ - **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)
27
+
28
+ ## Model Details
29
+ ### Model Description
30
+ We performed Thai instruction tuning on the continued pre-trained [Gemma2 9B CPT SEA-LIONv3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base), a decoder model using the Gemma2 architecture, to create Gemma2 9B WangchanLIONv2 Instruct.
31
+
32
+ For tokenization, the model uses the default Gemma-2-9B tokenizer. The model has a context length of 8192 tokens.
33
+
34
+ ### Benchmark Performance
35
+ We evaluated Gemma2 9B WangchanLIONv2 Instruct in Thai using the [Thai LLM Benchmark](https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf). The benchmark consists of Thai multiple-choice exams, multi-turn chat, reading comprehension and language generation. The evaluation results are available on [the leaderboard](https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard).
36
+
37
+ ### Usage
38
+ **NOTE:** This model has not been trained to use a system prompt or tool calling.
39
+
40
+ Gemma2 9B WangchanLIONv2 Instruct can be run using the 🤗 Transformers library:
41
+ ```python
42
+ # Please use transformers==4.45.2
43
+
44
+ import transformers
45
+ import torch
46
+
47
+ model_id = "aisingapore/WangchanLIONv2"
48
+
49
+ pipeline = transformers.pipeline(
50
+     "text-generation",
51
+     model=model_id,
52
+     model_kwargs={"torch_dtype": torch.bfloat16},
53
+     device_map="auto",
54
+ )
55
+ messages = [
56
+     {"role": "user", "content": "แต่งกลอนให้หน่อย"},
57
+ ]
58
+
59
+ outputs = pipeline(
60
+     messages,
61
+     max_new_tokens=256,
62
+ )
63
+ print(outputs[0]["generated_text"][-1])
64
+ ```
65
+
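Because the model has not been trained with a system prompt, one workaround is to fold any system-style instructions into the first user turn before calling the pipeline. The sketch below is only an illustration: `fold_system_into_user` is a hypothetical helper, not part of Transformers or of this repository, and it assumes that prepending the instructions to the user message is acceptable for your use case. Whether this checkpoint's chat template rejects a `system` role, as the upstream Gemma 2 template is generally understood to do, is also an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_into_user(messages: List[Message]) -> List[Message]:
    """Merge all system messages into the first user turn (hypothetical helper)."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy the remaining messages so the caller's list is left untouched.
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    prefix = "\n\n".join(system_parts)
    for m in rest:
        if m["role"] == "user":
            m["content"] = prefix + "\n\n" + m["content"]
            return rest
    # No user turn to attach to: send the instructions as a user message.
    return [{"role": "user", "content": prefix}] + rest


messages = fold_system_into_user([
    {"role": "system", "content": "ตอบเป็นภาษาไทยเท่านั้น"},
    {"role": "user", "content": "แต่งกลอนให้หน่อย"},
])
```

The resulting `messages` list contains a single user turn and can be passed to the pipeline in place of the original list.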
66
+ ### Caveats
67
+ Users should be aware of the model's limitations. Like many LLMs, the model can hallucinate and occasionally generates irrelevant content, introducing fictional elements that are not grounded in the provided context. Users should also exercise caution in interpreting and validating the model's responses due to potential inconsistencies in its reasoning.
68
+
69
+ ## Limitations
70
+ ### Safety
71
+ Current SEA-LION models, including this commercially permissive release, have not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.
72
+
73
+ ## Technical Specifications
74
+ ### Fine-Tuning Details
75
+ Gemma2 9B WangchanLIONv2 Instruct was built using parameter-efficient fine-tuning. Fine-tuning took approximately 3 days on 8x H100-80GB GPUs.
76
+
77
+ The following hyperparameters were used during training:
78
+ - quantization_bit: 4
79
+ - quantization_method: bitsandbytes
80
+ - finetuning_type: lora
81
+ - lora_target: all
82
+ - cutoff_len: 2048
83
+ - per_device_train_batch_size: 4
84
+ - gradient_accumulation_steps: 8
85
+ - learning_rate: 1.0e-4
86
+ - num_train_epochs: 3
87
+ - lr_scheduler_type: cosine
88
+ - warmup_ratio: 0.1
89
+ - bf16: true
90
+ - val_size: 0.01
91
+ - per_device_eval_batch_size: 1
92
+ - eval_strategy: steps
93
+ - eval_steps: 4000
94
+
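As a rough sanity check on the settings above, the effective batch size and the approximate number of optimizer steps can be worked out by hand. This is a back-of-the-envelope sketch, not the authors' accounting: it assumes all 8 GPUs acted as data-parallel workers, that the roughly 3,760,000 pairs were all used with 1% held out for validation, and that no sample packing or filtering changed the example count.

```python
import math

# Values taken from the hyperparameter list and the model description.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3
val_size = 0.01
num_pairs = 3_760_000  # "around 3,760,000" pairs, so this is approximate

# Assumption: all 8 H100 GPUs were used as data-parallel workers.
num_gpus = 8

# Sequences consumed per optimizer step across all devices.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Training examples left after holding out the validation split.
train_pairs = num_pairs - round(num_pairs * val_size)

steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch)   # 256
print(steps_per_epoch)   # 14541
print(total_steps)       # 43623
```

Under these assumptions, `eval_steps: 4000` would trigger an evaluation a little more than three times per epoch.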
95
+ We used the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework for fine-tuning.
96
+
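The `lr_scheduler_type: cosine` and `warmup_ratio: 0.1` settings listed among the hyperparameters describe a learning rate that ramps up linearly over the first 10% of steps and then decays along a half cosine. The function below is a minimal sketch of that shape, modeled on the behavior of `get_cosine_schedule_with_warmup` in Transformers. That the training run delegated scheduling to that function, and that the rate decays all the way to zero, are assumptions; `lr_at_step` is a hypothetical name and the 1000-step total is illustrative only.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 1.0e-4,
               warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine from peak_lr down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative run of 1000 steps (not the actual training length).
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```

The rate starts at zero, reaches the peak of 1.0e-4 when warmup ends, is about half the peak midway through the decay phase, and ends near zero.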
97
+ ## Data
98
+ Gemma2 9B WangchanLIONv2 Instruct was trained on:
99
+ 1. [Human-Annotated Thai Instruction Dataset](https://huggingface.co/datasets/airesearch/WangchanThaiInstruct)
100
+ 2. [FLAN-like dataset in Thai](https://huggingface.co/datasets/airesearch/WangchanX-FLAN-v6.1)
101
+ 3. [WangchanX Seed-Free Synthetic Instruct Thai 120k](https://huggingface.co/datasets/airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k)
102
+
103
+ ## Call for Contributions
104
+ We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs, sharing pre-training, instruction, and preference data, improving documentation usability, proposing and implementing new model evaluation tasks and metrics, or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.
105
+
106
+ ## The Team
107
+
108
+ ### AISG
109
+ Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin
110
+
111
+ ### WangchanX
112
+ Can Udomcharoenchaikit, Chalermpun Mai-On, Chayapat Uthayopas, Ekapol Chuangsuwanich, Lalita Lowphansirikul, Nonthakit Chaiwong, Panuthep Tasawong, Patomporn Payoungkhamdee, Pume Tuchinda, Romrawin Chumpu, Sarana Nutanong, Wannaphong Phatthiyaphaibun
113
+
114
+ ## Acknowledgements
115
+ [AI Singapore](https://aisingapore.org/) is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore.
116
+ This release is part of WangchanX, a Large Language Model (LLM) research and development project supported by PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited. The project is a collaborative effort originated by PyThaiNLP and VISTEC-depa Thailand AI Research Institute, focusing on the development of Adaptation Toolsets, Instruction Tuning & Alignment Datasets, and Benchmarks.
117
+ ## Contact
118
+ - Chalermpun Mai-On [email protected]
119
+ - Patomporn Payoungkhamdee [email protected]
120
+ - Peerat Limkonchotiwat [email protected]
121
+
122
+ [Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
123
+
124
+ [Link to WangchanX FLAN-like Dataset Creation GitHub repository](https://github.com/vistec-AI/WangchanX/tree/datasets)
125
+
126
+ ## Disclaimer
127
+ This is the repository for the commercial instruction-tuned model.
128
+ The model has _not_ been aligned for safety.
129
+ Developers and users should perform their own safety fine-tuning and related security measures.
130
+ In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and code.
131
+