MarinaPlius committed on
Commit 5247d02
1 Parent(s): e597de5

Upload 2 files

Browse files
Files changed (2) hide show
  1. README.md +102 -3
  2. special_tokens_map.json +34 -0
README.md CHANGED
@@ -1,3 +1,102 @@
- ---
- license: apache-2.0
- ---
+ ---
+ base_model:
+ - BSC-LT/salamandra-7b-instruct
+ datasets:
+ - alinia/EADOP-RAG-out-of-domain
+ language:
+ - ca
+ - es
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - legal
+ ---
+
+ # Salamandra 7B aligned EADOP Model Card
+ Salamandra 7B aligned EADOP is a full-finetuning version of
+ [BSC Language Technologies Unit](https://huggingface.co/BSC-LT)'s
+ [Salamandra Instruct 7B](https://huggingface.co/BSC-LT/salamandra-7b-instruct)
+ model, developed at the Barcelona Supercomputing Center, focused on improving
+ the handling of out-of-domain questions in a RAG instruction-following setting.
+
+ The model has been finetuned on a dataset consisting of 2,000+ human-annotated in-
+ and out-of-domain user messages and assistant responses in the context of a chatbot that
+ provides helpful information about current Catalan legislation.
+ The dataset [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain)
+ was collected in collaboration with the
+ [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/)
+ and consists of user messages and assistant responses in Catalan and Spanish.
+
+ > [!WARNING]
+ > **DISCLAIMER:** This model is a proof-of-concept designed to demonstrate the effects of
+ > finetuning an Instruction model with a small dataset of out-of-domain questions on the model's
+ > ability to politely and informatively refuse to answer questions that are out-of-domain.
+ > As a proof-of-concept, the model is still prone to generating harmful or inappropriate content.
+ ---
+
+ ## Model Details
+ Please refer to the [Salamandra Instruct 7B model details](https://huggingface.co/BSC-LT/salamandra-7b-instruct#model-details)
+ for details about the model architecture and pretraining.
+
+ ## Intended Use
+ This model was developed as a proof-of-concept to demonstrate the effects of finetuning
+ an Instruction model with a small dataset of in- and out-of-domain questions on the model's
+ ability to politely and informatively refuse to answer questions that are out-of-domain in
+ the context of a domain-specific RAG-based chatbot.
+
+ ## How to use
+
+ This model uses ChatML, the same instruction-following conversation format as the base model.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "BSC-LT/salamandra-7b-instruct"
+
+ text = "At what temperature does water boil?"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16
+ )
+
+ message = [ { "role": "user", "content": text } ]
+
+ # Render the conversation with the tokenizer's ChatML chat template.
+ prompt = tokenizer.apply_chat_template(
+     message,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # The template already includes the special tokens, so do not add them again.
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
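+
+ For illustration, here is roughly what the rendered prompt for the example above looks like
+ (a sketch assuming the base model's standard ChatML chat template; exact whitespace may differ):
+
+ ```python
+ # Continuing the snippet above: inspect the string produced by apply_chat_template.
+ print(prompt)
+ # Expected shape (ChatML, with add_generation_prompt=True):
+ # <|im_start|>user
+ # At what temperature does water boil?<|im_end|>
+ # <|im_start|>assistant
+ ```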
+
+ ---
+
+ ## Finetuning Data
+ Please refer to [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain) for the Dataset Card.
+
+
+ ### Author
+ This model has been finetuned by [Alinia AI](https://alinia.ai/).
+
+
+ ### Contact
+ For further information, please email [[email protected]](mailto:[email protected]).
+
+
+ ### Acknowledgements
+ This project is part of a partnership with the Language Technologies Unit at the [Barcelona Supercomputing Center](https://www.bsc.es/).
+ The data collection process was supported by the [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/).
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
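
As a quick sanity check, the special tokens declared above are exposed by the tokenizer once loaded.
The sketch below assumes the repository id used in the README snippet:

```python
from transformers import AutoTokenizer

# Load the tokenizer; special_tokens_map.json determines the values below.
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandra-7b-instruct")

print(tokenizer.bos_token)                  # <s>
print(tokenizer.eos_token)                  # </s>
print(tokenizer.pad_token)                  # <unk>
print(tokenizer.additional_special_tokens)  # ['<|im_start|>', '<|im_end|>']
```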