mav23 committed on
Commit c33a294 · verified · 1 Parent(s): d58919f

Upload folder using huggingface_hub

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +283 -0
  3. sabia-7b.Q4_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ sabia-7b.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,283 @@
+ ---
+ language:
+ - pt
+ model-index:
+ - name: sabia-7b
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: ENEM Challenge (No Images)
+       type: eduagarcia/enem_challenge
+       split: train
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc
+       value: 55.07
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: BLUEX (No Images)
+       type: eduagarcia-temp/BLUEX_without_images
+       split: train
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc
+       value: 47.71
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: OAB Exams
+       type: eduagarcia/oab_exams
+       split: train
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc
+       value: 41.41
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Assin2 RTE
+       type: assin2
+       split: test
+       args:
+         num_few_shot: 15
+     metrics:
+     - type: f1_macro
+       value: 46.68
+       name: f1-macro
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Assin2 STS
+       type: eduagarcia/portuguese_benchmark
+       split: test
+       args:
+         num_few_shot: 15
+     metrics:
+     - type: pearson
+       value: 1.89
+       name: pearson
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: FaQuAD NLI
+       type: ruanchaves/faquad-nli
+       split: test
+       args:
+         num_few_shot: 15
+     metrics:
+     - type: f1_macro
+       value: 58.34
+       name: f1-macro
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: HateBR Binary
+       type: ruanchaves/hatebr
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: f1_macro
+       value: 61.93
+       name: f1-macro
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: PT Hate Speech Binary
+       type: hate_speech_portuguese
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: f1_macro
+       value: 64.13
+       name: f1-macro
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: tweetSentBR
+       type: eduagarcia-temp/tweetsentbr
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: f1_macro
+       value: 46.64
+       name: f1-macro
+     source:
+       url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
+       name: Open Portuguese LLM Leaderboard
+ ---
+
+ Sabiá-7B is a Portuguese language model developed by [Maritaca AI](https://www.maritaca.ai/).
+
+ **Input:** The model accepts only text input.
+
+ **Output:** The model generates only text.
+
+ **Model Architecture:** Sabiá-7B is an auto-regressive language model that uses the same architecture as LLaMA-1-7B.
+
+ **Tokenizer:** It uses the same tokenizer as LLaMA-1-7B.
+
+ **Maximum sequence length:** 2048 tokens.
+
+ **Pretraining data:** Starting from the weights of LLaMA-1-7B, the model was further pretrained on the Portuguese subset of ClueWeb22 (7 billion tokens) for an additional 10 billion tokens, approximately 1.4 epochs of the training dataset.
+
+ **Data Freshness:** The pretraining data has a cutoff of mid-2022.
+
+ **License:** The licensing is the same as LLaMA-1's, restricting the model's use to research purposes only.
+
+ **Paper:** For more details, please refer to our paper: [Sabiá: Portuguese Large Language Models](https://arxiv.org/pdf/2304.07880.pdf)
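The epoch count quoted under "Pretraining data" follows directly from the two token counts. A quick check, assuming the 7-billion and 10-billion figures are the rounded values the card states (the exact counts are not given here):

```python
# Token counts as stated in the model card (rounded figures, an assumption).
dataset_tokens = 7e9    # Portuguese subset of ClueWeb22
trained_tokens = 10e9   # additional tokens seen during continued pretraining

# Number of passes over the dataset implied by those two figures.
epochs = trained_tokens / dataset_tokens
print(f"{epochs:.2f} epochs")
```

The ratio is 10/7, which rounds to the "approximately 1.4 epochs" reported in the card.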
+
+
+ ## Few-shot Example
+
+ Because Sabiá-7B was trained solely on a language modeling objective, without fine-tuning for instruction following, it is recommended for few-shot tasks rather than zero-shot tasks, as in the example below.
+
+ ```python
+ import torch
+ from transformers import LlamaTokenizer, LlamaForCausalLM
+
+ tokenizer = LlamaTokenizer.from_pretrained("maritaca-ai/sabia-7b")
+ model = LlamaForCausalLM.from_pretrained(
+     "maritaca-ai/sabia-7b",
+     device_map="auto",  # Automatically loads the model on the GPU, if there is one. Requires pip install accelerate
+     low_cpu_mem_usage=True,
+     torch_dtype=torch.bfloat16  # If your GPU does not support bfloat16, change to torch.float16
+ )
+
+ prompt = """Classifique a resenha de filme como "positiva" ou "negativa".
+
+ Resenha: Gostei muito do filme, é o melhor do ano!
+ Classe: positiva
+
+ Resenha: O filme deixa muito a desejar.
+ Classe: negativa
+
+ Resenha: Apesar de longo, valeu o ingresso.
+ Classe:"""
+
+ input_ids = tokenizer(prompt, return_tensors="pt")
+
+ output = model.generate(
+     input_ids["input_ids"].to(model.device),  # Follow the model's device instead of hard-coding "cuda"
+     max_length=1024,
+     eos_token_id=tokenizer.encode("\n"))  # Stop generation when a "\n" token is detected
+
+ # The output contains the input tokens, so we have to skip them.
+ output = output[0][len(input_ids["input_ids"][0]):]
+
+ print(tokenizer.decode(output, skip_special_tokens=True))
+ ```
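The few-shot prompt in the README example follows a fixed pattern: an instruction, labeled examples, then an unlabeled query ending in an open `Classe:` field. A minimal sketch of building that string from data; `build_few_shot_prompt` is a hypothetical helper, not part of the model card or of `transformers`:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format labeled examples and a final unlabeled query as one prompt."""
    lines = [instruction, ""]
    for review, label in examples:
        lines += [f"Resenha: {review}", f"Classe: {label}", ""]
    # End with an open "Classe:" so the model completes the label.
    lines += [f"Resenha: {query}", "Classe:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    'Classifique a resenha de filme como "positiva" ou "negativa".',
    [
        ("Gostei muito do filme, é o melhor do ano!", "positiva"),
        ("O filme deixa muito a desejar.", "negativa"),
    ],
    "Apesar de longo, valeu o ingresso.",
)
print(prompt)
```

With these inputs the helper reproduces the prompt string used in the README example, so more labeled examples can be added without hand-editing the template.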
+
+ If your GPU does not have enough memory, try using int8 precision.
+ However, expect some degradation in output quality compared to fp16 or bf16.
+ ```python
+ model = LlamaForCausalLM.from_pretrained(
+     "maritaca-ai/sabia-7b",
+     device_map="auto",
+     low_cpu_mem_usage=True,
+     load_in_8bit=True,  # Requires pip install bitsandbytes
+ )
+ ```
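This commit also adds a 4-bit GGUF file, which the README snippets above do not cover. A hedged sketch of running the same few-shot classification against it, assuming the third-party `llama-cpp-python` package is installed and the file has been downloaded locally; `complete_label` and the default path are illustrative names, not something the README defines:

```python
def complete_label(prompt, model_path="sabia-7b.Q4_0.gguf"):
    """Return the model's completion of a few-shot prompt, up to the first newline."""
    # Imported lazily so this module loads even without llama-cpp-python installed.
    from llama_cpp import Llama

    # n_ctx matches the 2048-token maximum sequence length stated in the card.
    llm = Llama(model_path=model_path, n_ctx=2048)
    # Greedy decoding; stop at the newline that ends the "Classe:" field.
    result = llm(prompt, max_tokens=8, stop=["\n"], temperature=0.0)
    return result["choices"][0]["text"].strip()
```

Calling `complete_label(prompt)` with the few-shot prompt from the example above should return a single label. Quantization to Q4_0 may change outputs relative to the bf16 weights.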
+
+ ## Results in Portuguese
+
+ Below we show the results on the Poeta benchmark, which consists of 14 Portuguese datasets.
+
+ For more information on the Normalized Preferred Metric (NPM), please refer to our paper.
+
+ |Model | NPM |
+ |--|--|
+ |LLaMA-1-7B| 33.0|
+ |LLaMA-2-7B| 43.7|
+ |Sabiá-7B| 48.5|
+
+ ## Results in English
+
+ Below we show the average results on 6 English datasets: PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OpenBookQA.
+
+ |Model | NPM |
+ |--|--|
+ |LLaMA-1-7B| 50.1|
+ |Sabiá-7B| 49.0|
+
+
+ ## Citation
+
+ Please use the following bibtex to cite our paper:
+ ```
+ @InProceedings{10.1007/978-3-031-45392-2_15,
+   author="Pires, Ramon
+   and Abonizio, Hugo
+   and Almeida, Thales Sales
+   and Nogueira, Rodrigo",
+   editor="Naldi, Murilo C.
+   and Bianchi, Reinaldo A. C.",
+   title="Sabi{\'a}: Portuguese Large Language Models",
+   booktitle="Intelligent Systems",
+   year="2023",
+   publisher="Springer Nature Switzerland",
+   address="Cham",
+   pages="226--240",
+   isbn="978-3-031-45392-2"
+ }
+ ```
+
+ # [Open Portuguese LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/maritaca-ai/sabia-7b)
+
+ | Metric | Value |
+ |--------------------------|---------|
+ |Average |**47.09**|
+ |ENEM Challenge (No Images)| 55.07|
+ |BLUEX (No Images) | 47.71|
+ |OAB Exams | 41.41|
+ |Assin2 RTE | 46.68|
+ |Assin2 STS | 1.89|
+ |FaQuAD NLI | 58.34|
+ |HateBR Binary | 61.93|
+ |PT Hate Speech Binary | 64.13|
+ |tweetSentBR | 46.64|
+
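The "Average" row in the leaderboard table appears to be the unweighted mean of the nine task scores. A short check, assuming that is how the leaderboard aggregates (the README does not state the formula):

```python
# Per-task scores copied from the leaderboard table above.
scores = {
    "ENEM Challenge (No Images)": 55.07,
    "BLUEX (No Images)": 47.71,
    "OAB Exams": 41.41,
    "Assin2 RTE": 46.68,
    "Assin2 STS": 1.89,
    "FaQuAD NLI": 58.34,
    "HateBR Binary": 61.93,
    "PT Hate Speech Binary": 64.13,
    "tweetSentBR": 46.64,
}

# Unweighted mean across tasks (assumed aggregation).
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```

The result matches the reported 47.09. The Assin2 STS score of 1.89 sits far below the other eight and pulls the mean down noticeably.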
sabia-7b.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fc0992650989260e69e8ed53c6d8c4d206e78763ce309595571582a1eb2caaf
+ size 3825807456
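The three lines added for `sabia-7b.Q4_0.gguf` are a Git LFS pointer, not the model weights: they record the SHA-256 digest and byte size of the real file. A minimal sketch of reading the pointer and checking a downloaded file against it; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and only the standard library is used:

```python
import hashlib

# Pointer contents exactly as added in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:7fc0992650989260e69e8ed53c6d8c4d206e78763ce309595571582a1eb2caaf
size 3825807456
"""


def parse_lfs_pointer(text):
    """Return the key/value fields of a Git LFS pointer file."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line)


def matches_pointer(path, pointer):
    """Check a local file's SHA-256 digest against the pointer's oid."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-gigabyte files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return pointer["oid"] == f"sha256:{digest.hexdigest()}"


pointer = parse_lfs_pointer(POINTER)
size_gib = int(pointer["size"]) / 2**30
print(f"{size_gib:.2f} GiB")
```

After downloading the file, `matches_pointer("sabia-7b.Q4_0.gguf", pointer)` returning `True` indicates the download is intact.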