|
--- |
|
license: apache-2.0 |
|
library_name: peft |
|
tags: |
|
- alignment-handbook |
|
- trl |
|
- sft |
|
- generated_from_trainer |
|
base_model: mistralai/Mistral-7B-v0.1 |
|
model-index: |
|
- name: Cimphony-Mistral-Law-7B |
|
results: |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: cais/mmlu |
|
name: MMLU |
|
metrics: |
|
- name: International Law |
|
type: accuracy |
|
value: 0.802 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: cais/mmlu |
|
name: MMLU |
|
metrics: |
|
- name: Jurisprudence |
|
type: accuracy |
|
value: 0.704 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: cais/mmlu |
|
name: MMLU |
|
metrics: |
|
- name: Professional Law |
|
type: accuracy |
|
value: 0.416 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: coastalcph/lex_glue |
|
name: LexGLUE |
|
metrics: |
|
- name: ECtHR A |
|
type: balanced accuracy |
|
value: 0.631 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: coastalcph/lex_glue |
|
name: LexGLUE |
|
metrics: |
|
- name: LEDGAR |
|
type: balanced accuracy |
|
value: 0.741 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: coastalcph/lex_glue |
|
name: LexGLUE |
|
metrics: |
|
- name: CaseHOLD |
|
type: accuracy |
|
value: 0.776 |
|
verified: false |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: coastalcph/lex_glue |
|
name: LexGLUE |
|
metrics: |
|
- name: Unfair-ToS |
|
type: balanced accuracy |
|
value: 0.809 |
|
verified: false |
|
|
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Cimphony-Mistral-Law-7B |
|
|
|
We introduce Cimphony-Mistral-Law-7B, a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co./mistralai/Mistral-7B-v0.1). |
|
|
|
Cimphony’s LLMs deliver state-of-the-art performance on legal benchmarks, surpassing models trained on much larger corpora with significantly more resources, even GPT-4, OpenAI’s flagship model.
|
|
|
Check out and register on our platform: [https://cimphony.ai](https://app.cimphony.ai/signup?callbackUrl=https://app.cimphony.ai/)
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/657d36d3647c0211e7746ed9/Yjx96bC58SPgNwmDxx_yx.png) |
|
|
|
## Model description |
|
|
|
The model was trained on 600M tokens. We use novel methods to expose the model to this corpus during training, blending a variety of legal reading-comprehension tasks with general language data.
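
As a rough illustration of this blending (not our exact pipeline), the sketch below interleaves a legal reading-comprehension file with a general-purpose file using Hugging Face `datasets`; the file names and the 70/30 mixing ratio are assumptions for the example.

```python
# A minimal sketch, assuming two local JSONL files that share the same columns (e.g. "text").
from datasets import load_dataset, interleave_datasets

legal = load_dataset("json", data_files="legal_reading_comprehension.jsonl", split="train")
general = load_dataset("json", data_files="general_language_data.jsonl", split="train")

# Sample ~70% legal / ~30% general examples; stop once the smaller source is exhausted.
mixed = interleave_datasets(
    [legal, general],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="first_exhausted",
)
```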
|
|
|
|
|
## Legal Evaluation Results |
|
|
|
We evaluate on the legal splits of the MMLU benchmark, as well as LexGLUE. While both are multiple-choice benchmarks, prompts were adapted so that the models output a single answer. In some cases, additional post-processing was required.
|
|
|
Benchmarks whose labels are A-E multiple-choice options use an accuracy metric. Benchmarks with a closed list of options (e.g. Unfair-ToS) use a balanced-accuracy metric, as the classes may not be balanced.
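
The snippet below is a hedged sketch of this kind of post-processing and scoring; the regex, label set, and example data are illustrative assumptions, not the exact evaluation harness.

```python
import re

from sklearn.metrics import accuracy_score, balanced_accuracy_score

def extract_choice(generation: str) -> str:
    """Pull the first standalone A-E letter out of a free-form model answer."""
    match = re.search(r"\b([A-E])\b", generation)
    return match.group(1) if match else "A"  # fall back to an arbitrary guess

generations = ["The correct answer is B.", "Answer: D"]  # example model outputs
gold_labels = ["B", "C"]                                 # example references

preds = [extract_choice(g) for g in generations]
acc = accuracy_score(gold_labels, preds)               # A-E multiple-choice benchmarks
bal_acc = balanced_accuracy_score(gold_labels, preds)  # closed, possibly imbalanced label sets
```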
|
|
|
| Model / Benchmark | International Law (MMLU) | Jurisprudence (MMLU) | Professional Law (MMLU) | ECtHR A (LexGLUE) | LEDGAR (LexGLUE) | CaseHOLD (LexGLUE) | Unfair-ToS (LexGLUE) |
|
|:-----------------------------------|:--------------------------|:----------------------|:-------------------------|:-------------------|:------------------|:--------------------|:-----------------------| |
|
| Mistral-7B-Instruct-v0.2 | 73.6% | 69.4% | 41.2% | 67.5% | 50.6% | 56.3% | 36.6% | |
|
| AdaptLLM | 57.0% | 52.8% | 36.1% | 51.9% | 46.3% | 50.0% | 51.3% | |
|
| Saul-7B | 69.4% | 63.0% | **43.2%** | **71.2%** | 55.9% | 65.8% | 80.3% | |
|
| **Cimphony-7B** | **80.2%** | **70.4%** | 41.6% | 63.1% | **74.1%** | **77.6%** | **80.9%** |
|
|
|
## Training and evaluation data |
|
|
|
Following the framework presented in [AdaptLLM](https://huggingface.co./AdaptLLM/law-chat), we convert the raw legal text into reading-comprehension tasks. This takes inspiration from human learning via reading comprehension: practice after reading improves the ability to answer questions based on the learned knowledge.
|
|
|
We developed a high-quality prompt database, considering the capabilities we’d like the model to possess. LLMs were prompted with the raw text and a collection of these prompts, and returned answers, additional questions, and transformations relevant to the input data. With further post-processing of these outputs, we created our legal reading-comprehension dataset.
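
The sketch below illustrates this generation step in simplified form; the prompt wording and the generator model are assumptions for the example, not our exact setup.

```python
# A simplified sketch of turning raw legal text into reading-comprehension examples.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

PROMPT_TEMPLATE = (
    "Read the following legal text and write a question that tests understanding of it, "
    "followed by a concise answer.\n\nText:\n{passage}\n\nQuestion and answer:"
)

def make_reading_comprehension_example(passage: str) -> str:
    prompt = PROMPT_TEMPLATE.format(passage=passage)
    out = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```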
|
|
|
|
|
| Domain | Dataset | Tokens | License | |
|
|:-------------------|:--------------------|:------:|:------------| |
|
| Legal | The Pile (FreeLaw) | 180M | MIT | |
|
| Legal | LexGlue (train split only) | 108M | CC-BY-4.0 | |
|
| Legal | USClassActions | 12M | GPL-3.0 | |
|
| Math (CoT) | AQUA-RAT | 3M | Apache-2.0 | |
|
| Commonsense (CoT) | ECQA | 2.4M | Apache-2.0 | |
|
| Reasoning (CoT) | EntailmentBank | 1.8M | Apache-2.0 | |
|
| Chat | UltraChat | 90M | MIT | |
|
| Code | Code-Feedback | 36M | Apache-2.0 | |
|
| Instruction | OpenOrca | 180M | MIT | |
|
|
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended for use cases involving legal-domain text generation.
|
|
|
As with any language model, users must not rely solely on model generations. This model has not gone through human-feedback alignment (RLHF) and may generate responses containing hallucinations and biases.
|
|
|
Example use: |
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
from peft import PeftModel |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("cimphonyadmin/Cimphony-Mistral-Law-7B") |
|
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1") |
|
model = PeftModel.from_pretrained(model, "cimphonyadmin/Cimphony-Mistral-Law-7B") |
|
|
|
# Put your input here: |
|
user_input = '''What can you tell me about ex post facto laws?''' |
|
|
|
# Apply the chat template (it expects a list of message dicts rather than a raw string)

messages = [{"role": "user", "content": user_input}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
|
|
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device) |
|
outputs = model.generate(input_ids=inputs, max_length=4096)[0] |
|
|
|
answer_start = int(inputs.shape[-1]) |
|
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True) |
|
|
|
print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}') |
|
``` |
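
Because the checkpoint is a PEFT (LoRA) adapter, you can optionally merge it into the base weights for faster inference. This is a standard PEFT pattern rather than an official recipe, and the output directory name is illustrative:

```python
# Merge the LoRA adapter into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("Cimphony-Mistral-Law-7B-merged")
tokenizer.save_pretrained("Cimphony-Mistral-Law-7B-merged")
```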
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 0.0005 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 24 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 4 |
|
- gradient_accumulation_steps: 4 |
|
- total_train_batch_size: 128 |
|
- total_eval_batch_size: 96 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.05 |
|
- num_epochs: 1 |
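
As a rough sketch, the hyperparameters above map onto `transformers.TrainingArguments` and a PEFT `LoraConfig` for TRL's `SFTTrainer` as shown below; the LoRA rank/alpha/target modules and the dataset are assumptions, not values reported in this card.

```python
# A minimal training-setup sketch, not the exact script used for this model.
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

train_dataset = load_dataset("json", data_files="legal_blend.jsonl", split="train")  # placeholder data

training_args = TrainingArguments(
    output_dir="cimphony-mistral-law-7b",
    learning_rate=5e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=24,
    gradient_accumulation_steps=4,   # 8 per device x 4 GPUs x 4 steps = 128 effective batch
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    seed=42,
)

peft_config = LoraConfig(            # adapter settings below are illustrative assumptions
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",       # assumes the dataset has a "text" column
    max_seq_length=4096,
)
trainer.train()
```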
|
|
|
|
|
### Framework versions |
|
|
|
- PEFT 0.8.2 |
|
- Transformers 4.37.2 |
|
- Pytorch 2.1.2+cu121 |
|
- Datasets 2.14.6 |
|
- Tokenizers 0.15.2 |