---
license: llama3
---

# Llama 3 70B Instruct no refusal

This model applies the orthogonal feature ablation described in [Refusal in LLMs is mediated by a single direction](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).

Calibration data:
- 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
- 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
- The direction is extracted between layers 40 and 41
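
The code used for this model has not been posted yet, so the following is only a minimal NumPy sketch of the general technique from the linked post, not the actual implementation. It assumes the direction is the normalized difference of mean activations between the two prompt sets, and that ablation is done by projecting that direction out of a weight matrix that writes into the residual stream. The function names, the toy hidden size, and the random arrays standing in for real activations are all hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations, shape [d_model]."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def orthogonalize(weight, direction):
    """Project `direction` out of the output space of `weight`.

    `weight` has shape [d_model, d_in] and writes into the residual stream,
    so the result is W - r r^T W for a unit vector r.
    """
    return weight - np.outer(direction, direction @ weight)


# Hypothetical stand-ins for activations captured at one layer boundary:
# 256 prompts per set (as in the calibration data), toy hidden size of 64.
rng = np.random.default_rng(0)
d_model = 64
harmless = rng.normal(size=(256, d_model))
harmful = rng.normal(size=(256, d_model)) + 2.0 * np.eye(d_model)[0]

r_hat = refusal_direction(harmful, harmless)

# A random matrix standing in for one weight that writes to the residual stream.
w_out = rng.normal(size=(d_model, 4 * d_model))
w_ablated = orthogonalize(w_out, r_hat)

# Largest remaining component along r_hat; should be at floating-point noise level.
print(np.abs(r_hat @ w_ablated).max())
```

In a real model the same projection would be applied to every matrix that writes into the residual stream, using activations collected from the actual calibration prompts.
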

I haven't tested this model, but like the 8B model, it may still refuse some instructions.

**Use this model responsibly. I decline any liability resulting from its use.**

I will post the code later.