---
license: cc-by-sa-4.0
---
# SEA-LION-7B-Instruct-C

SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
The models range in size from 3 billion to 7 billion parameters.
This is the card for the SEA-LION 7B Instruct (Commercial) model.

For more details on the base model, please refer to the [base model's model card](https://huggingface.co/aisingapore/sealion7b).

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
specifically trained to understand the SEA regional context.

SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256K.

For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.

The pre-training data for the base SEA-LION model encompasses 980B tokens.
The model was then further instruction-tuned on <b>commercially-permissive Indonesian data only</b>.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** CC BY-SA 4.0

### Benchmark Performance

Coming soon.

## Technical Specifications

### Model Architecture and Objective

SEA-LION is a decoder model using the MPT architecture.

| Parameter       | SEA-LION 7B |
|-----------------|:-----------:|
| Layers          | 32          |
| d_model         | 4096        |
| head_dim        | 32          |
| Vocabulary      | 256000      |
| Sequence Length | 2048        |
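
The values in the table above can be cross-checked against the configuration that ships with the checkpoint. The snippet below is a minimal sketch, assuming MPT-style attribute names (`d_model`, `n_layers`, `max_seq_len`); the remote code for this checkpoint may expose different names.

```python
# Minimal sketch: inspect the published configuration values.
# Assumes MPT-style attribute names (d_model, n_layers, max_seq_len),
# which may differ in this checkpoint's remote code.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "aisingapore/sealion7b-instruct-c", trust_remote_code=True
)

print(config.vocab_size)                     # expected: 256000
print(getattr(config, "d_model", None))      # expected: 4096
print(getattr(config, "n_layers", None))     # expected: 32
print(getattr(config, "max_seq_len", None))  # expected: 2048
```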

### Tokenizer Details

We sample 20M lines from the training data to train the tokenizer.<br>
The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).
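
As a quick illustration, the tokenizer can be loaded and exercised on its own; the sample sentence and its token split below are illustrative only.

```python
# Minimal sketch: load the SEABPETokenizer and inspect it on a short
# Indonesian sentence. The example sentence is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "aisingapore/sealion7b-instruct-c", trust_remote_code=True
)

print(len(tokenizer))  # vocabulary size, expected to be around 256000
print(tokenizer.tokenize("Selamat pagi, apa kabar?"))  # "Good morning, how are you?"
```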

### Example Usage

```python
# Please use transformers==4.34.1

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)

# The model expects prompts wrapped in this instruction template.
prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
# Indonesian prompt: "What is the sentiment of the following sentence?
# Sentence: This book is very boring. Answer:"
prompt = """Apa sentimen dari kalimat berikut ini?
Kalimat: Buku ini sangat membosankan.
Jawaban: """
full_prompt = prompt_template.format(human_prompt=prompt)

tokens = tokenizer(full_prompt, return_tensors="pt")
output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
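
Greedy decoding (as above) gives deterministic output. If more varied responses are desired, the standard `generate` sampling arguments can be used. A minimal sketch, continuing from the example above and using illustrative (untuned) sampling values:

```python
# Optional: sampled decoding for more varied responses. This continues from
# the example above (reusing `model`, `tokenizer`, and `full_prompt`); the
# sampling values are illustrative, not tuned recommendations.
tokens = tokenizer(full_prompt, return_tensors="pt")
output = model.generate(
    tokens["input_ids"],
    max_new_tokens=20,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```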

## The Team

Lam Wen Zhi Clarence<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Tat-Wee David<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

## Disclaimer

This is the repository for the commercial instruction-tuned model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and code.