MarinaPlius committed on
Commit 5247d02
1 Parent(s): e597de5

Upload 2 files

Browse files
Files changed (2) hide show
  1. README.md +102 -3
  2. special_tokens_map.json +34 -0
README.md CHANGED
@@ -1,3 +1,102 @@
- ---
- license: apache-2.0
- ---
+ ---
+ base_model:
+ - BSC-LT/salamandra-7b-instruct
+ datasets:
+ - alinia/EADOP-RAG-out-of-domain
+ language:
+ - ca
+ - es
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - legal
+ ---
+
+ # Salamandra 7B aligned EADOP Model Card
+ Salamandra 7B aligned EADOP is a full-finetuning version of
+ [BSC Language Technologies Unit](https://huggingface.co/BSC-LT)'s
+ [Salamandra Instruct 7B](https://huggingface.co/BSC-LT/salamandra-7b-instruct)
+ model, developed at the Barcelona Supercomputing Center, focused on improving
+ the handling of out-of-domain questions in a RAG instruction-following setting.
+
+ The model has been finetuned on a dataset consisting of 2,000+ human-annotated in-
+ and out-of-domain user messages and assistant responses in the context of a chatbot that
+ provides helpful information about current Catalan legislation.
+ The dataset [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain)
+ was collected in collaboration with the
+ [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/)
+ and consists of user messages and assistant responses in Catalan and Spanish.
+
+ > [!WARNING]
+ > **DISCLAIMER:** This model is a proof-of-concept designed to demonstrate the effects of
+ > finetuning an Instruction model with a small dataset of out-of-domain questions on the model's
+ > ability to politely and informatively refuse to answer questions that are out-of-domain.
+ > As a proof-of-concept, the model is still prone to generating harmful or inappropriate content.
+ ---
+
+ ## Model Details
+ Please refer to the [Salamandra Instruct 7B model details](https://huggingface.co/BSC-LT/salamandra-7b-instruct#model-details)
+ for details about the model architecture and pretraining.
+
+ ## Intended Use
+ This model was developed as a proof-of-concept to demonstrate the effects of finetuning
+ an Instruction model with a small dataset of in- and out-of-domain questions on the model's
+ ability to politely and informatively refuse to answer questions that are out-of-domain in
+ the context of a domain-specific RAG-based chatbot.
+
+ ## How to use
+
+ This model uses ChatML, the same instruction-following conversation format as the base model.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "BSC-LT/salamandra-7b-instruct"
+
+ text = "At what temperature does water boil?"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16
+ )
+
+ message = [ { "role": "user", "content": text } ]
+
+ # Render the conversation with the tokenizer's ChatML chat template.
+ prompt = tokenizer.apply_chat_template(
+     message,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # The template already includes the special tokens, so do not add them again.
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
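+
+ For illustration, here is roughly what the rendered prompt for the example above looks like
+ (a sketch assuming the base model's standard ChatML chat template; exact whitespace may differ):
+
+ ```python
+ # Continuing the snippet above: inspect the string produced by apply_chat_template.
+ print(prompt)
+ # Expected shape (ChatML, with add_generation_prompt=True):
+ # <|im_start|>user
+ # At what temperature does water boil?<|im_end|>
+ # <|im_start|>assistant
+ ```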
+
+ ---
+
+ ## Finetuning Data
+ Please refer to [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain) for the Dataset Card.
+
+
+ ### Author
+ This model has been finetuned by [Alinia AI](https://alinia.ai/).
+
+
+ ### Contact
+ For further information, please email [[email protected]](mailto:[email protected]).
+
+
+ ### Acknowledgements
+ This project is part of a partnership with the Language Technologies Unit at the [Barcelona Supercomputing Center](https://www.bsc.es/).
+ The data collection process was supported by the [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/).
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
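
As a quick sanity check, the special tokens declared above are exposed by the tokenizer once loaded.
The sketch below assumes the repository id used in the README snippet:

```python
from transformers import AutoTokenizer

# Load the tokenizer; special_tokens_map.json determines the values below.
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandra-7b-instruct")

print(tokenizer.bos_token)                  # <s>
print(tokenizer.eos_token)                  # </s>
print(tokenizer.pad_token)                  # <unk>
print(tokenizer.additional_special_tokens)  # ['<|im_start|>', '<|im_end|>']
```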